tensorflow/models

official/legacy/transformer/utils/tokenizer.py

Summary

Maintainability: D (about 2 days of remediation effort)
Test Coverage: n/a

File tokenizer.py has 474 lines of code (exceeds 250 allowed). Consider refactoring.
Open

# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 7 hrs to fix

Function _count_tokens has a Cognitive Complexity of 20 (exceeds 5 allowed). Consider refactoring.
Open

def _count_tokens(files,
                  file_byte_limit=1e6,
                  correct_strip=True,
                  master_char_set=None):
  """Return token counts of words in the files.
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 2 hrs to fix

Cognitive Complexity

Cognitive Complexity is a measure of how difficult a unit of code is to intuitively understand. Unlike Cyclomatic Complexity, which determines how difficult your code will be to test, Cognitive Complexity tells you how difficult your code will be to read and comprehend.

A method's cognitive complexity is based on a few simple rules:

• Code is not considered more complex when it uses shorthand that the language provides for collapsing multiple statements into one
• Code is considered more complex for each "break in the linear flow of the code"
• Code is considered more complex when "flow breaking structures are nested"
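
To make these rules concrete, here is an illustrative pair of functions. The scores are approximations of the published rules, not output from this analyzer: each flow-breaking statement adds a point, each level of nesting it sits under adds another, and a run of identical boolean operators adds only a single point.

def nested_sum(xs):
  total = 0
  for x in xs:                  # +1
    if x > 0:                   # +1, plus +1 for nesting
      if x % 2 == 0:            # +1, plus +2 for deeper nesting
        total += x
  return total                  # cognitive complexity of roughly 6

def flat_sum(xs):
  total = 0
  for x in xs:                  # +1
    if x > 0 and x % 2 == 0:    # +1, plus +1 nesting, plus +1 for "and"
      total += x
  return total                  # cognitive complexity of roughly 4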

Function _gen_new_subtoken_list has a Cognitive Complexity of 14 (exceeds 5 allowed). Consider refactoring.
Open

def _gen_new_subtoken_list(subtoken_counts,
                           min_count,
                           alphabet,
                           reserved_tokens=None):
  """Generate candidate subtokens ordered by count, and new max subtoken length.
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 1 hr to fix
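
One way to lower the score is to separate the filtering, ordering, and length-tracking concerns. The sketch below is not the repository's implementation; it only mirrors the docstring's contract (a count-ordered candidate list plus the new maximum subtoken length), and the tie-breaking and alphabet handling are assumptions.

def gen_new_subtoken_list_sketch(subtoken_counts, min_count, alphabet,
                                 reserved_tokens=None):
  """Sketch: count-ordered candidates and the new max subtoken length."""
  reserved_tokens = reserved_tokens or []
  # Keep only candidates that clear the frequency threshold, most common first.
  candidates = sorted(
      ((count, subtoken) for subtoken, count in subtoken_counts.items()
       if count >= min_count),
      reverse=True)
  subtoken_list = reserved_tokens + [s for _, s in candidates]
  # Every alphabet character must stay encodable, regardless of its count.
  seen = set(subtoken_list)
  subtoken_list += [c for c in alphabet if c not in seen]
  max_subtoken_length = max((len(s) for s in subtoken_list), default=0)
  return subtoken_list, max_subtoken_length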

Function _generate_subtokens_with_target_vocab_size has a Cognitive Complexity of 10 (exceeds 5 allowed). Consider refactoring.
Open

def _generate_subtokens_with_target_vocab_size(token_counts,
                                               alphabet,
                                               target_size,
                                               threshold,
                                               min_count=None,
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 1 hr to fix
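
The function name and the threshold parameter suggest a search over min_count until the vocabulary size lands within threshold of target_size. A hedged sketch of that strategy, using a hypothetical generate_subtokens(token_counts, alphabet, min_count) helper in place of the real generator:

def search_min_count_sketch(token_counts, alphabet, target_size, threshold,
                            lo=1, hi=1000):
  """Binary-search min_count: a higher cutoff yields a smaller vocabulary."""
  best = generate_subtokens(token_counts, alphabet, lo)  # hypothetical helper
  while lo <= hi:
    mid = (lo + hi) // 2
    subtokens = generate_subtokens(token_counts, alphabet, mid)
    if abs(len(subtokens) - target_size) < abs(len(best) - target_size):
      best = subtokens
    if abs(len(subtokens) - target_size) <= threshold:
      return subtokens
    if len(subtokens) > target_size:
      lo = mid + 1  # vocabulary too large: raise the frequency cutoff
    else:
      hi = mid - 1  # vocabulary too small: lower the cutoff
  return best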

Function init_from_files has 8 arguments (exceeds 4 allowed). Consider refactoring.
Open

  def init_from_files(vocab_file,
Severity: Major
Found in official/legacy/transformer/utils/tokenizer.py - About 1 hr to fix
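
A common remedy for wide signatures like this one (and for the 6- and 5-argument findings below) is to group the knobs into a single value object. In the sketch, only vocab_file is confirmed by the excerpt; the other field names are hypothetical stand-ins for the remaining seven arguments.

import dataclasses
from typing import Optional, Sequence

@dataclasses.dataclass
class SubtokenizerConfig:
  vocab_file: str                 # confirmed by the excerpt
  files: Sequence[str] = ()       # every field below is a guessed name
  target_vocab_size: Optional[int] = None
  threshold: Optional[int] = None
  min_count: Optional[int] = None
  file_byte_limit: float = 1e6
  reserved_tokens: Optional[Sequence[str]] = None
  correct_strip: bool = True

def init_from_config(config: SubtokenizerConfig):
  # Unpack the fields and forward them to the existing implementation.
  ...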

Avoid deeply nested control flow statements.
Open

          if file_byte_budget < 0:
            break
          if correct_strip:
Severity: Major
Found in official/legacy/transformer/utils/tokenizer.py - About 45 mins to fix

Avoid deeply nested control flow statements.
Open

          if correct_strip:
            line = native_to_unicode(line)
          line = line.strip()
Severity: Major
Found in official/legacy/transformer/utils/tokenizer.py - About 45 mins to fix

Avoid deeply nested control flow statements.
Open

          for token in _split_string_to_tokens(
              native_to_unicode(line), master_char_set):
            token_counts[token] += 1
  return token_counts
Severity: Major
Found in official/legacy/transformer/utils/tokenizer.py - About 45 mins to fix
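
All three nested-flow findings point into the same loop body of _count_tokens. A sketch of one way to flatten it: extract the per-line work into a helper so break and guard clauses replace nesting levels. Only the excerpted lines are confirmed; the file-reading and sampling logic around them is omitted here, and native_to_unicode and _split_string_to_tokens are the module's own helpers.

def _count_line_tokens_sketch(line, token_counts, correct_strip,
                              master_char_set):
  """Count tokens on one line; return the stripped line's length."""
  if correct_strip:
    line = native_to_unicode(line)
  line = line.strip()
  for token in _split_string_to_tokens(
      native_to_unicode(line), master_char_set):
    token_counts[token] += 1
  return len(line)

def _count_file_tokens_sketch(reader, token_counts, file_byte_limit,
                              correct_strip, master_char_set):
  file_byte_budget = file_byte_limit
  for line in reader:
    if file_byte_budget < 0:
      break  # the early exit replaces one level of nesting
    file_byte_budget -= _count_line_tokens_sketch(
        line, token_counts, correct_strip, master_char_set)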

Function _split_string_to_tokens has a Cognitive Complexity of 8 (exceeds 5 allowed). Consider refactoring.
Open

def _split_string_to_tokens(text, master_char_set):
  """Splits text to a list of string tokens."""
  if not text:
    return []
  ret = []
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 45 mins to fix
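
The core of this function is grouping consecutive characters by whether they belong to master_char_set. A simplified sketch with itertools.groupby shows the idea; the production code likely adds special handling (for example, of spaces between tokens) that is omitted here.

import itertools

def split_string_to_tokens_sketch(text, master_char_set):
  """Split text into runs of in-set and out-of-set characters."""
  if not text:
    return []
  return ["".join(run) for _, run in
          itertools.groupby(text, key=lambda ch: ch in master_char_set)]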

Function _generate_subtokens_with_target_vocab_size has 6 arguments (exceeds 4 allowed). Consider refactoring.
Open

def _generate_subtokens_with_target_vocab_size(token_counts,
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 45 mins to fix

Function _generate_subtokens has 5 arguments (exceeds 4 allowed). Consider refactoring.
Open

def _generate_subtokens(token_counts,
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 35 mins to fix

Function _split_token_to_subtokens has a Cognitive Complexity of 7 (exceeds 5 allowed). Consider refactoring.
Open

def _split_token_to_subtokens(token, subtoken_dict, max_subtoken_length):
  """Splits a token into subtokens defined in the subtoken dict."""
  ret = []
  start = 0
  token_len = len(token)
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 35 mins to fix
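
The ret/start/token_len bookkeeping in the excerpt is the usual setup for greedy longest-match ("maximal munch") splitting. A sketch of that algorithm follows; the failure handling is an assumption rather than a quote from the file.

def split_token_to_subtokens_sketch(token, subtoken_dict, max_subtoken_length):
  ret = []
  start = 0
  token_len = len(token)
  while start < token_len:
    # Try the longest window first and shrink until a vocabulary hit.
    for end in range(min(token_len, start + max_subtoken_length), start, -1):
      candidate = token[start:end]
      if candidate in subtoken_dict:
        ret.append(candidate)
        start = end
        break
    else:
      raise ValueError("Unable to split token %r into subtokens" % token)
  return ret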

Function _unescape_token has a Cognitive Complexity of 7 (exceeds 5 allowed). Consider refactoring.
Open

def _unescape_token(token):
  r"""Replaces escaped characters in the token with their unescaped versions.

  Applies inverse transformations as _escape_token():
    1. Replace "\u" with "_", and "\\" with "\".
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 35 mins to fix
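
Rule 1 is quoted in the excerpt; the docstring is cut off before any further rules. A sketch of the inverse transformation, where the numeric "\ddd;" escape for out-of-alphabet characters is an assumption about the matching _escape_token() format:

import re

_UNESCAPE_RE = re.compile(r"\\u|\\\\|\\([0-9]+);")

def unescape_token_sketch(token):
  def sub(match):
    if match.group(1) is None:
      # Rule 1 from the docstring: "\u" -> "_" and "\\" -> "\".
      return u"_" if match.group(0) == u"\\u" else u"\\"
    try:
      return chr(int(match.group(1)))  # assumed "\ddd;" numeric escape
    except (ValueError, OverflowError):
      return u"\u3013"  # placeholder on malformed escapes; an assumption
  return _UNESCAPE_RE.sub(sub, token)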

Function _count_and_gen_subtokens has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.
Open

def _count_and_gen_subtokens(token_counts, alphabet, subtoken_dict,
                             max_subtoken_length):
  """Count number of times subtokens appear, and generate new subtokens.

  Args:
Severity: Minor
Found in official/legacy/transformer/utils/tokenizer.py - About 25 mins to fix
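
Given the other excerpts, the counting step plausibly splits each token with the current vocabulary and then counts every substring that starts at a subtoken boundary, so frequent longer strings can become candidates in the next round. A sketch under that assumption, reusing split_token_to_subtokens_sketch from above and the module's _escape_token (referenced by _unescape_token's docstring):

import collections

def count_and_gen_subtokens_sketch(token_counts, alphabet, subtoken_dict,
                                   max_subtoken_length):
  subtoken_counts = collections.defaultdict(int)
  for token, count in token_counts.items():
    token = _escape_token(token, alphabet)
    subtokens = split_token_to_subtokens_sketch(
        token, subtoken_dict, max_subtoken_length)
    start = 0
    for subtoken in subtokens:
      # Count every substring beginning at this subtoken boundary.
      for end in range(start + 1, len(token) + 1):
        subtoken_counts[token[start:end]] += count
      start += len(subtoken)
  return subtoken_counts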
