Unbabel/OpenKiwi

Showing 236 of 264 total issues

Similar blocks of code found in 4 locations. Consider refactoring.
Open

def parser_for_pipeline(pipeline):
    if pipeline == 'train':
        return ModelParser(
            'estimator',
            'train',
Severity: Major
Found in kiwi/cli/models/predictor_estimator.py and 3 other locations - About 3 hrs to fix
kiwi/cli/models/linear.py on lines 302..320
kiwi/cli/models/nuqe.py on lines 120..138
kiwi/cli/models/quetch.py on lines 360..378

Duplicated Code

Duplicated code can lead to software that is hard to understand and difficult to change. The Don't Repeat Yourself (DRY) principle states:

Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

When you violate DRY, bugs and maintenance problems are sure to follow. Duplicated code has a tendency to both continue to replicate and also to diverge (leaving bugs as two similar implementations differ in subtle ways).

Tuning

This issue has a mass of 69.

We set useful threshold defaults for the languages we support but you may want to adjust these settings based on your project guidelines.

The threshold configuration represents the minimum mass a code block must have to be analyzed for duplication. The lower the threshold, the more fine-grained the comparison.

If the engine is too easily reporting duplication, try raising the threshold. If you suspect that the engine isn't catching enough duplication, try lowering the threshold. The best setting tends to differ from language to language.

See codeclimate-duplication's documentation for more information about tuning the mass threshold in your .codeclimate.yml.

Similar blocks of code found in 5 locations. Consider refactoring.
Open

        if self.config.predict_source:
            metrics.append(
                F1Metric(
                    prefix=const.SOURCE_TAGS,
                    target_name=const.SOURCE_TAGS,
Severity: Major
Found in kiwi/models/quetch.py and 4 other locations - About 3 hrs to fix
kiwi/models/predictor_estimator.py on lines 549..562
kiwi/models/predictor_estimator.py on lines 565..578
kiwi/models/quetch.py on lines 368..381
kiwi/models/quetch.py on lines 400..413
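
One possible way to collapse these near-identical blocks is to drive them from a small table of (config flag, tag name) pairs. The sketch below is only an illustration under stated assumptions: `F1Metric` is a stand-in dataclass with just the two arguments visible in the excerpt, and `Config`, `TAG_OUTPUTS`, `build_tag_metrics`, and the tag-name strings are hypothetical names, not OpenKiwi's API.

```python
from dataclasses import dataclass


@dataclass
class F1Metric:
    """Stand-in for the real F1Metric; only the two arguments from the excerpt."""
    prefix: str
    target_name: str


@dataclass
class Config:
    """Hypothetical config carrying the three predict_* flags seen in the excerpts."""
    predict_target: bool = True
    predict_source: bool = False
    predict_gaps: bool = False


# Placeholder tag names; the real values live in kiwi's `const` module.
TARGET_TAGS, SOURCE_TAGS, GAP_TAGS = "tags", "source_tags", "gap_tags"

# One row per formerly duplicated block: (config flag, tag name).
TAG_OUTPUTS = (
    ("predict_target", TARGET_TAGS),
    ("predict_source", SOURCE_TAGS),
    ("predict_gaps", GAP_TAGS),
)


def build_tag_metrics(config):
    """Return one F1Metric per enabled tag output, in table order."""
    metrics = []
    for flag, tag in TAG_OUTPUTS:
        if getattr(config, flag):
            metrics.append(F1Metric(prefix=tag, target_name=tag))
    return metrics
```

With a table like this, supporting a new tag output would mean adding one row rather than copying another block.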

This issue has a mass of 67.

Similar blocks of code found in 5 locations. Consider refactoring.
Open

        if self.config.predict_target:
            metrics.append(
                F1Metric(
                    prefix=const.TARGET_TAGS,
                    target_name=const.TARGET_TAGS,
Severity: Major
Found in kiwi/models/quetch.py and 4 other locations - About 3 hrs to fix
kiwi/models/predictor_estimator.py on lines 549..562
kiwi/models/predictor_estimator.py on lines 565..578
kiwi/models/quetch.py on lines 384..397
kiwi/models/quetch.py on lines 400..413

This issue has a mass of 67.

Function __init__ has a Cognitive Complexity of 24 (exceeds 5 allowed). Consider refactoring.
Open

    def __init__(
        self, vocabs, predictor_tgt=None, predictor_src=None, **kwargs
    ):

        super().__init__(vocabs=vocabs, ConfigCls=EstimatorConfig, **kwargs)
Severity: Minor
Found in kiwi/models/predictor_estimator.py - About 3 hrs to fix

Cognitive Complexity

Cognitive Complexity is a measure of how difficult a unit of code is to intuitively understand. Unlike Cyclomatic Complexity, which determines how difficult your code will be to test, Cognitive Complexity tells you how difficult your code will be to read and comprehend.

A method's cognitive complexity is based on a few simple rules:

  • Code is not considered more complex when it uses shorthand that the language provides for collapsing multiple statements into one
  • Code is considered more complex for each "break in the linear flow of the code"
  • Code is considered more complex when "flow breaking structures are nested"
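
To make the nesting rule concrete, here is a small sketch that is not taken from OpenKiwi (the function names and the row layout are invented for illustration). Both functions return the same count, but the first stacks flow-breaking structures inside one another, while the second uses a guard clause so that each break stays at the top level of the loop. The exact score an engine reports for either version may differ.

```python
def count_bad_nested(rows):
    """Count rows whose second column is "BAD", using deeply nested checks."""
    count = 0
    for row in rows:                    # flow break at nesting level 0
        if row:                         # flow break nested one level deep
            if len(row) > 1:            # nested two levels deep
                if row[1] == "BAD":     # nested three levels deep
                    count += 1
    return count


def count_bad_flat(rows):
    """Same result, but a guard clause keeps every check one level deep."""
    count = 0
    for row in rows:
        if not row or len(row) < 2:     # guard clause: skip unusable rows early
            continue
        if row[1] == "BAD":
            count += 1
    return count
```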

Similar blocks of code found in 5 locations. Consider refactoring.
Open

        if self.config.predict_gaps:
            metrics.append(
                F1Metric(
                    prefix=const.GAP_TAGS,
                    target_name=const.GAP_TAGS,
Severity: Major
Found in kiwi/models/predictor_estimator.py and 4 other locations - About 3 hrs to fix
kiwi/models/predictor_estimator.py on lines 549..562
kiwi/models/quetch.py on lines 368..381
kiwi/models/quetch.py on lines 384..397
kiwi/models/quetch.py on lines 400..413

This issue has a mass of 67.

Similar blocks of code found in 5 locations. Consider refactoring.
Open

        if self.config.predict_source:
            metrics.append(
                F1Metric(
                    prefix=const.SOURCE_TAGS,
                    target_name=const.SOURCE_TAGS,
Severity: Major
Found in kiwi/models/predictor_estimator.py and 4 other locations - About 3 hrs to fix
kiwi/models/predictor_estimator.py on lines 565..578
kiwi/models/quetch.py on lines 368..381
kiwi/models/quetch.py on lines 384..397
kiwi/models/quetch.py on lines 400..413

This issue has a mass of 67.

Similar blocks of code found in 5 locations. Consider refactoring.
Open

        if self.config.predict_gaps:
            metrics.append(
                F1Metric(
                    prefix=const.GAP_TAGS,
                    target_name=const.GAP_TAGS,
Severity: Major
Found in kiwi/models/quetch.py and 4 other locations - About 3 hrs to fix
kiwi/models/predictor_estimator.py on lines 549..562
kiwi/models/predictor_estimator.py on lines 565..578
kiwi/models/quetch.py on lines 368..381
kiwi/models/quetch.py on lines 384..397

This issue has a mass of 67.

File quetch.py has 309 lines of code (exceeds 250 allowed). Consider refactoring.
Open

#  OpenKiwi: Open-Source Machine Translation Quality Estimation
#  Copyright (C) 2019 Unbabel <openkiwi@unbabel.com>
#
#  This program is free software: you can redistribute it and/or modify
#  it under the terms of the GNU Affero General Public License as published
Severity: Minor
Found in kiwi/models/quetch.py - About 3 hrs to fix

    File predictor.py has 299 lines of code (exceeds 250 allowed). Consider refactoring.
    Open

    #  OpenKiwi: Open-Source Machine Translation Quality Estimation
    #  Copyright (C) 2019 Unbabel <openkiwi@unbabel.com>
    #
    #  This program is free software: you can redistribute it and/or modify
    #  it under the terms of the GNU Affero General Public License as published
    Severity: Minor
    Found in kiwi/models/predictor.py - About 3 hrs to fix

      Function read_tabular_file has a Cognitive Complexity of 22 (exceeds 5 allowed). Consider refactoring.
      Open

          def read_tabular_file(file_path, sep='\t', extract_column=None):
              examples = []
              line_values = []
              with open(file_path, 'r', encoding='utf8') as f:
                  num_columns = None
      Severity: Minor
      Found in kiwi/data/corpus.py - About 3 hrs to fix

      Function predict has a Cognitive Complexity of 21 (exceeds 5 allowed). Consider refactoring.
      Open

          def predict(self, batch, class_name=const.BAD, unmask=True):
              model_out = self(batch)
              predictions = {}
              class_index = torch.tensor([const.LABELS.index(class_name)])
      Severity: Minor
      Found in kiwi/models/model.py - About 2 hrs to fix

      Similar blocks of code found in 2 locations. Consider refactoring.
      Open

          fs.add(
              name=const.SOURCE,
              field=QEField(
                  tokenize=tokenizer,
                  init_token=None,
      Severity: Major
      Found in kiwi/data/fieldsets/quetch.py and 1 other location - About 2 hrs to fix
      kiwi/data/fieldsets/quetch.py on lines 50..60

      This issue has a mass of 60.

      Similar blocks of code found in 2 locations. Consider refactoring.
      Open

          fs.add(
              name=const.TARGET,
              field=QEField(
                  tokenize=tokenizer,
                  init_token=None,
      Severity: Major
      Found in kiwi/data/fieldsets/quetch.py and 1 other location - About 2 hrs to fix
      kiwi/data/fieldsets/quetch.py on lines 32..42

      This issue has a mass of 60.

      Similar blocks of code found in 2 locations. Consider refactoring.
      Open

                      viterbi_scores[pos, current_state] = np.max(
                          viterbi_scores[pos - 1, :]
                          + transition_scores[pos - 1, current_state, :]
      Severity: Major
      Found in kiwi/models/linear/linear_word_qe_decoder.py and 1 other location - About 2 hrs to fix
      kiwi/models/linear/linear_word_qe_decoder.py on lines 199..201
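
The two flagged statements appear to compute the same candidate vector twice, once for `np.max` and once for `np.argmax`. A hedged sketch of the usual remedy is to build the candidates once and derive both values from them. The helper name `best_previous_state` and its argument names are hypothetical, and the sketch uses plain Python lists instead of NumPy so that it runs without dependencies; it covers only the terms visible in the excerpt.

```python
def best_previous_state(prev_scores, transitions_into_state):
    """Return (best score, index of the best previous state).

    prev_scores[s] is the score of the best path ending in state s at pos - 1;
    transitions_into_state[s] is the score of moving from s into the current state.
    """
    # Build the candidate scores once instead of once per reduction.
    candidates = [p + t for p, t in zip(prev_scores, transitions_into_state)]
    # On ties, max() returns the first maximal index, matching np.argmax.
    best_index = max(range(len(candidates)), key=candidates.__getitem__)
    return candidates[best_index], best_index
```

In a NumPy version, the same idea would be to assign the summed array to a local variable and call both `np.max` and `np.argmax` on it.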

      This issue has a mass of 59.

      Identical blocks of code found in 2 locations. Consider refactoring.
      Open

                          for i, field_name in enumerate(file_fields):
                              if field_name not in fields:  # TODO
                                  continue
                              examples[field_name].append(
                                  ' '.join([values[i] for values in example_values])
      Severity: Major
      Found in kiwi/data/corpus.py and 1 other location - About 2 hrs to fix
      kiwi/data/corpus.py on lines 174..179
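
Since the same loop seems to run both inside the reading loop and again for trailing lines before EOF, one option is to move it into a helper that both sites call. The sketch below is an assumption-laden illustration: `flush_example` is a hypothetical name, and the argument meanings are inferred from the excerpt rather than from the full source.

```python
def flush_example(examples, example_values, file_fields, fields):
    """Join the buffered rows column-wise and append one string per kept field.

    examples:       dict mapping field name -> list of joined strings
    example_values: list of rows, each a list with one value per file column
    file_fields:    column names, in file order
    fields:         container of the field names that should be kept
    """
    for i, field_name in enumerate(file_fields):
        if field_name not in fields:
            continue  # column present in the file but not requested
        examples[field_name].append(
            " ".join(values[i] for values in example_values)
        )
```

Both call sites would then reduce to a single `flush_example(...)` call, so a later change to the joining logic only has to be made once.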

      This issue has a mass of 59.

      Identical blocks of code found in 2 locations. Consider refactoring.
      Open

              if example_values:  # Add trailing lines before EOF.
                  for i, field_name in enumerate(file_fields):
                      if field_name not in fields:
                          continue
                      examples[field_name].append(
      Severity: Major
      Found in kiwi/data/corpus.py and 1 other location - About 2 hrs to fix
      kiwi/data/corpus.py on lines 167..171

      This issue has a mass of 59.

      Function cache has a Cognitive Complexity of 20 (exceeds 5 allowed). Consider refactoring.
      Open

          def cache(self, name, cache, url=None, max_vectors=None):
              if self.emb_format in ['polyglot', 'glove']:
                  try:
                      from polyglot.mapping import Embedding
                  except ImportError:
      Severity: Minor
      Found in kiwi/data/vectors.py - About 2 hrs to fix

      Similar blocks of code found in 2 locations. Consider refactoring.
      Open

                      viterbi_paths[pos, current_state] = np.argmax(
                          viterbi_scores[pos - 1, :]
                          + transition_scores[pos - 1, current_state, :]
      Severity: Major
      Found in kiwi/models/linear/linear_word_qe_decoder.py and 1 other location - About 2 hrs to fix
      kiwi/models/linear/linear_word_qe_decoder.py on lines 192..194

      This issue has a mass of 59.

      Similar blocks of code found in 2 locations. Consider refactoring.
      Open

              if self.config.predict_source:
                  weight = make_loss_weights(
                      self.nb_classes, const.BAD_ID, self.config.source_bad_weight
                  )
                  self.xents[const.SOURCE_TAGS] = nn.CrossEntropyLoss(
      Severity: Major
      Found in kiwi/models/predictor_estimator.py and 1 other location - About 2 hrs to fix
      kiwi/models/predictor_estimator.py on lines 238..243

      This issue has a mass of 58.

      Similar blocks of code found in 2 locations. Consider refactoring.
      Open

              if self.config.predict_gaps:
                  weight = make_loss_weights(
                      self.nb_classes, const.BAD_ID, self.config.gaps_bad_weight
                  )
                  self.xents[const.GAP_TAGS] = nn.CrossEntropyLoss(
      Severity: Major
      Found in kiwi/models/predictor_estimator.py and 1 other location - About 2 hrs to fix
      kiwi/models/predictor_estimator.py on lines 231..236

      This issue has a mass of 58.
