
Conversation

@NeriCarcasci commented Aug 28, 2025

Summary

Ported the core language metrics from the Java implementation (trustyai-explainability/explainability-core) to Python, along with corresponding unit tests.

Implemented language metrics (a usage sketch follows the list):

  • BLEU
  • Word Error Rate (WER)
  • Match Error Rate (MER)
  • Word Information Lost (WIL)
  • Word Information Preserved (WIP)
  • Levenshtein
  • Fuzzy Match
  • ROUGE
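
A minimal usage sketch of the metrics listed above. The class and method names follow the unit tests added in this PR; the import paths and the default-tokenizer assumption are illustrative guesses, not part of the change itself:

```python
# Illustrative only: import paths are assumed from the file list in the
# reviewer's guide; class and method names match the PR's unit tests.
from core.metrics.language.bleu.bleu import BLEU                             # assumed path
from core.metrics.language.error_rates.word_error_rate import WordErrorRate  # assumed path
from core.metrics.language.match.fuzzymatch import FuzzyMatch                # assumed path

reference = "the cat is on the mat"
hypothesis = "the cat sat on the mat"

bleu_score = BLEU().calculate([reference], hypothesis, max_ngram=2)  # sentence-level BLEU-2
wer_result = WordErrorRate().calculate(reference, hypothesis)        # result object with .value
is_match = FuzzyMatch().calculate(reference, hypothesis)             # exact match after cleaning
print(bleu_score, wer_result.value, is_match)
```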

Summary by Sourcery

Port core language evaluation metrics from the Java implementation to Python and add comprehensive unit tests for each metric

New Features:

  • Implement BLEU metric with sentence-level and corpus-level scoring and smoothing support
  • Implement Levenshtein distance calculator with insertion, deletion, and substitution counts
  • Implement word-level error rate metrics: Word Error Rate (WER), Match Error Rate (MER), Word Information Lost (WIL), and Word Information Preserved (WIP)
  • Implement ROUGE metric supporting rouge1, rouge2, rougeL, and rougeLsum variants
  • Add FuzzyMatch utility combining exact match, WER, BLEU, and sequence similarity measures
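
For reference, the word-level error rates in the list above relate to the Levenshtein alignment counts as sketched below. The WIP expression matches the approximation quoted in the review comments further down; the others are the usual textbook definitions, not necessarily the exact code in this PR:

```python
# S = substitutions, D = deletions, I = insertions, H = hits (correct words).
# N = S + D + H is the reference length, P = S + I + H the hypothesis length.
def word_level_rates(S: int, D: int, I: int, H: int) -> dict:
    N = S + D + H
    P = S + I + H
    wer = (S + D + I) / N if N else 0.0                              # Word Error Rate
    mer = (S + D + I) / (S + D + I + H) if (S + D + I + H) else 0.0  # Match Error Rate
    wip = (H / N) * (H / P) if N and P else 0.0                      # Word Information Preserved
    wil = 1.0 - wip                                                  # Word Information Lost
    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip}
```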

Build:

  • Add nltk and rouge-score dependencies for language metrics

Tests:

  • Add unit tests for BLEU metric covering various n-grams, weights, smoothing methods, and corpus scenarios
  • Add unit tests for word error rate metrics using multiple tokenizers
  • Add unit tests for Levenshtein-based error rate metrics (WER, MER, WIL, WIP)
  • Add unit tests for ROUGE metric across all supported types
  • Add unit tests for FuzzyMatch matching and threshold functions

sourcery-ai bot commented Aug 28, 2025

Reviewer's Guide

Ports core language metrics from Java to Python, including BLEU, Levenshtein, FuzzyMatch, ROUGE, and the word-based error rates, adds comprehensive unit tests for each metric, and updates project dependencies for the new NLP libraries.

Entity relationship diagram for ErrorRateResult and LevenshteinResult data structures

```mermaid
erDiagram
    ERRORRATERESULT {
        float value
        int insertions
        int deletions
        int substitutions
        int correct
        int reference_length
    }
    LEVENSHTEINRESULT {
        int distance
        int insertions
        int deletions
        int substitutions
        int reference_length
    }
    ERRORRATERESULT ||--o| LEVENSHTEINRESULT : uses
```
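
For readers of the diagram, a minimal sketch of the two result structures as plain dataclasses; the field names come from the diagram, while the dataclass form is an assumption about the implementation:

```python
from dataclasses import dataclass

@dataclass
class LevenshteinResult:
    distance: int
    insertions: int
    deletions: int
    substitutions: int
    reference_length: int

@dataclass
class ErrorRateResult:
    value: float            # the metric value, e.g. WER
    insertions: int
    deletions: int
    substitutions: int
    correct: int
    reference_length: int
```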

Class diagram for BLEUMetric and ROUGEMetric tokenization and scoring

```mermaid
classDiagram
    class BLEUMetric {
        +tokenizer: Callable
        +smoothing_function
        +calculate(references, hypothesis, max_ngram, weights)
        +calculate_corpus(references, hypotheses, max_ngram, weights)
    }
    class ROUGEMetric {
        +rouge_type: Literal
        +scorer
        +calculate(reference, hypothesis)
        +_rouge_lsum(reference, hypothesis)
    }
    BLEUMetric o-- "tokenizer" Callable
    ROUGEMetric o-- "scorer" RougeScorer
```
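
A small interface sketch matching the diagram. The constructor keyword names (smoothing_method, tokenizer, rouge_type) appear in the code excerpts quoted later in the review, and the tests import the classes as BLEU and ROUGE; the import paths, example strings, and argument values here are only illustrative:

```python
from core.metrics.language.bleu.bleu import BLEU     # assumed path
from core.metrics.language.rogue.rogue import ROUGE  # assumed path (folder is currently spelled "rogue")

bleu = BLEU(smoothing_method=1, tokenizer=str.split)
corpus_score = bleu.calculate_corpus(
    [["it is a guide to action that ensures the military will forever heed party commands"]],
    ["it is a guide to action which ensures that the military always obeys the commands of the party"],
    max_ngram=4,
    weights=[0.25, 0.25, 0.25, 0.25],
)

rouge = ROUGE(rouge_type="rougeLsum")
lsum_f1 = rouge.calculate("w1 w2 w3 w4 w5", "w1 w2 w6 w7 w8\nw1 w3 w8 w9 w5")
print(corpus_score, lsum_f1)
```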

File-Level Changes

Ported language metric implementations to Python

  Details:
  • Implemented BLEU metric with sentence and corpus scoring and customizable smoothing and weights
  • Developed Levenshtein core with both simple and aligned distance counters
  • Introduced FuzzyMatch for exact, WER, BLEU, and similarity-based comparisons
  • Added ROUGEMetric supporting multiple rouge types including rougeLsum
  • Created word-based error rate classes: WER, MER, WIL, WIP
  • Added clean_text utility for text normalization

  Files:
  • src/core/metrics/language/bleu/bleu.py
  • src/core/metrics/language/Levenshtein/levenshtein.py
  • src/core/metrics/language/match/fuzzymatch.py
  • src/core/metrics/language/rogue/rogue.py
  • src/core/metrics/language/error_rates/word_error_rate.py
  • src/core/metrics/language/error_rates/match_error_rate.py
  • src/core/metrics/language/error_rates/word_information_lost.py
  • src/core/metrics/language/error_rates/word_information_preserved.py
  • src/core/metrics/language/error_rates/base_result.py
  • src/core/metrics/language/utils.py

Added comprehensive unit tests for all metrics

  Details:
  • Tested BLEU scoring across smoothing, weights, n-gram levels, edge cases, and corpus mode
  • Validated WER under multiple tokenizers
  • Covered FuzzyMatch methods for exact, WER, BLEU, and similarity thresholds
  • Combined tests for WER, MER, WIL, WIP scenarios
  • Tested ROUGE variants including rouge1, rouge2, rougeL, and rougeLsum

  Files:
  • tests/metrics/language/test_bleu.py
  • tests/metrics/language/test_wordErrorRate.py
  • tests/metrics/language/test_fuzzyMatch.py
  • tests/metrics/language/test_levenshteinCommon.py
  • tests/metrics/language/test_rogue.py

Updated project dependencies

  Details:
  • Added nltk for tokenization and smoothing functions
  • Added rouge-score for ROUGE metric evaluation

  Files:
  • pyproject.toml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Remove the debug print statements in FuzzyMatch.calculate_wer to avoid unwanted console output during metric evaluation.
  • Levenshtein.compute_ currently returns an integer but its docstring says it returns a LevenshteinResult—either update the implementation to return a LevenshteinResult or adjust the docs and naming to reflect the actual return type.
  • The folder name ‘rogue’ doesn’t match the ROUGE metric naming—consider renaming it to ‘rouge’ for consistency in both code and imports.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Remove the debug print statements in FuzzyMatch.calculate_wer to avoid unwanted console output during metric evaluation.
- Levenshtein.compute_ currently returns an integer but its docstring says it returns a LevenshteinResult—either update the implementation to return a LevenshteinResult or adjust the docs and naming to reflect the actual return type.
- The folder name ‘rogue’ doesn’t match the ROUGE metric naming—consider renaming it to ‘rouge’ for consistency in both code and imports.

## Individual Comments

### Comment 1
<location> `src/core/metrics/language/Levenshtein/levenshtein.py:40` </location>
<code_context>
+        clean_ref = clean_text(reference)
+        clean_hyp = clean_text(hypothesis)
+
+        return edit_distance(clean_ref, clean_hyp)
+
+
</code_context>

<issue_to_address>
compute_ returns raw edit_distance instead of LevenshteinResult.

Please update compute_ to return a LevenshteinResult object instead of a raw integer for consistency with the rest of the codebase.

Suggested implementation:

```python
        clean_ref = clean_text(reference)
        clean_hyp = clean_text(hypothesis)

        distance, insertions, deletions, substitutions = edit_distance(clean_ref, clean_hyp, return_operations=True)
        reference_length = len(clean_ref)
        return LevenshteinResult(
            distance=distance,
            insertions=insertions,
            deletions=deletions,
            substitutions=substitutions,
            reference_length=reference_length
        )

```

- The `edit_distance` function must support a `return_operations=True` argument and return a tuple: (distance, insertions, deletions, substitutions). If it does not, you will need to update `edit_distance` accordingly.
- If `edit_distance` only returns an integer, you must refactor it to also compute and return insertions, deletions, and substitutions.
- Ensure that `LevenshteinResult` is imported or defined in the scope of `compute_`.
</issue_to_address>
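
If edit_distance does need that refactor, one possible shape (illustrative only; the function name, signature, and return layout are not taken from the PR) is a dynamic-programming Levenshtein that backtracks to count the operations:

```python
def edit_distance_with_ops(ref, hyp):
    """Return (distance, insertions, deletions, substitutions) for ref -> hyp."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrack along one optimal path to count the individual operations.
    i, j, ins, dels, subs = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dels, i = dels + 1, i - 1
    return dp[m][n], ins, dels, subs
```

compute_ could then unpack these counts into a LevenshteinResult as suggested above.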

### Comment 2
<location> `src/core/metrics/language/bleu/bleu.py:16` </location>
<code_context>
+                                 None means no smoothing.
+        :param tokenizer: A callable that tokenizes a string into a list of tokens. Defaults to str.split.
+        """
+        if isinstance(smoothing_method, int):
+            self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
+        elif callable(smoothing_method):
+            self.smoothing_function = smoothing_method
</code_context>

<issue_to_address>
No validation for smoothing_method range.

Invalid values for smoothing_method will cause AttributeError. Please validate that smoothing_method is within the allowed range (1-7) and handle errors appropriately.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        if isinstance(smoothing_method, int):
            self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
        elif callable(smoothing_method):
            self.smoothing_function = smoothing_method
        else:
            self.smoothing_function = None
=======
        if isinstance(smoothing_method, int):
            if 1 <= smoothing_method <= 7:
                self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
            else:
                raise ValueError(
                    f"smoothing_method must be an integer between 1 and 7 (inclusive), got {smoothing_method}."
                )
        elif callable(smoothing_method):
            self.smoothing_function = smoothing_method
        else:
            self.smoothing_function = None
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `src/core/metrics/language/bleu/bleu.py:31` </location>
<code_context>
+        """
+        Create uniform weights for BLEU-N scoring.
+        """
+        if not isinstance(max_ngram, int):
+            max_ngram = int(max_ngram)
+        return [1.0 / max_ngram] * max_ngram
</code_context>

<issue_to_address>
Implicit conversion of max_ngram may mask errors.

Instead of silently casting, raise an error or warning when max_ngram is not an integer to prevent hidden bugs.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        if not isinstance(max_ngram, int):
            max_ngram = int(max_ngram)
        return [1.0 / max_ngram] * max_ngram
=======
        if not isinstance(max_ngram, int):
            raise TypeError(f"max_ngram must be an integer, got {type(max_ngram).__name__}")
        return [1.0 / max_ngram] * max_ngram
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `src/core/metrics/language/rogue/rogue.py:50` </location>
<code_context>
+        :param hypothesis: A hypothesis paragraph consisting of multiple sentences.
+        :return: The average ROUGE-L F1 score over aligned sentence pairs.
+        """
+        ref_sents = self.simple_sent_tokenize(reference)
+        hyp_sents = self.simple_sent_tokenize(hypothesis)
+        ref_sents_cleaned = [clean_text(s) for s in ref_sents]
+        hyp_sents_cleaned = [clean_text(s) for s in hyp_sents]
+
+        total_score = 0.0
</code_context>

<issue_to_address>
Sentence alignment in ROUGE-Lsum may not handle unequal sentence counts robustly.

The current approach may overlook extra sentences in either input. Please review if this averaging method is appropriate, or consider alternative handling for unaligned sentences.
</issue_to_address>
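
One alternative worth considering, sketched here only for discussion (this is not the PR's implementation and it differs from the official rougeLsum union-LCS definition): score each reference sentence against its best-matching hypothesis sentence and normalize by the longer side, so unmatched sentences on either end lower the score instead of being ignored:

```python
from rouge_score import rouge_scorer

def rouge_lsum_best_match(ref_sents, hyp_sents):
    """Average best-match ROUGE-L F1, penalizing unmatched sentences."""
    if not ref_sents or not hyp_sents:
        return 0.0
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    best = [
        max(scorer.score(ref, hyp)["rougeL"].fmeasure for hyp in hyp_sents)
        for ref in ref_sents
    ]
    # Dividing by the longer list means extra sentences on either side reduce the score.
    return sum(best) / max(len(ref_sents), len(hyp_sents))
```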

### Comment 5
<location> `tests/metrics/language/test_bleu.py:44` </location>
<code_context>
+    score = bleu.calculate(common_references, common_hypothesis, max_ngram=2)
+    assert pytest.approx(score, abs=0.01) == 0.4082
+
+def test_zero_matches():
+    references = [uncommon]
+    hypothesis = "John loves Mary"
+    bleu = BLEU()
+    for n in range(1, 6):
+        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
+
</code_context>

<issue_to_address>
Missing test for empty hypothesis and empty references in BLEU metric.

Please add tests with empty strings or lists for hypothesis and references to verify BLEU handles these cases correctly.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH

def test_zero_matches():
    references = [uncommon]
    hypothesis = "John loves Mary"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

=======

def test_zero_matches():
    references = [uncommon]
    hypothesis = "John loves Mary"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

def test_empty_hypothesis():
    references = [common]
    hypothesis = ""
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

def test_empty_references():
    references = []
    hypothesis = "the cat is on the mat"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

def test_empty_hypothesis_and_references():
    references = []
    hypothesis = ""
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/metrics/language/test_bleu.py:97` </location>
<code_context>
+    hypotheses = [hyp1, hyp2]
+    assert pytest.approx(bleu.calculate_corpus(references, hypotheses, max_ngram=4, weights=weights), 0.01) == 0.5818
+
+def test_bleu_corpus_multiple_weight_sets():
+    bleu = BLEU()
+    weight_sets = [
+        [0.5, 0.5],
+        [0.333, 0.333, 0.334],
+        [0.25, 0.25, 0.25, 0.25],
+        [0.2, 0.2, 0.2, 0.2, 0.2]
+    ]
+    expected_scores = [0.8242, 0.7067, 0.5920, 0.4719]
+    references = [[ref1a, ref1b, ref1c], [ref2a]]
+    hypotheses = [hyp1, hyp2]
+
+    for weights, expected in zip(weight_sets, expected_scores):
+        ngram = len(weights)
+        score = bleu.calculate_corpus(references, hypotheses, max_ngram=ngram, weights=weights)
+        assert pytest.approx(score, 0.02) == expected
\ No newline at end of file
</code_context>

<issue_to_address>
Missing test for invalid BLEU weights (e.g., negative, zero, or not summing to 1).

Add tests with invalid BLEU weights to verify that the metric properly handles or rejects these cases.
</issue_to_address>
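
A possible shape for such tests, reusing the ref1a and hyp1 fixtures from this file; it assumes the metric validates weights and raises ValueError for clearly invalid ones, which may first require adding that validation:

```python
# Illustrative: the expected ValueError is an assumed contract, not current behaviour.
@pytest.mark.parametrize("bad_weights", [
    [-0.5, 1.5],   # negative component
    [0.0, 0.0],    # all-zero weights
    [0.7, 0.7],    # does not sum to 1
])
def test_bleu_invalid_weights(bad_weights):
    bleu = BLEU()
    with pytest.raises(ValueError):
        bleu.calculate([ref1a], hyp1, max_ngram=len(bad_weights), weights=bad_weights)
```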

### Comment 7
<location> `tests/metrics/language/test_wordErrorRate.py:43` </location>
<code_context>
+
+
+
[email protected]("ref, hyp, expected", zip(references, inputs, ground_truth_commons))
+def test_commons_tokenizer(ref, hyp, expected):
+    wer = WordErrorRate(tokenizer=commons_tokenizer)
+    result = wer.calculate(ref, hyp)
+    assert pytest.approx(result.value, abs=TOLERANCE) == expected
+
[email protected]("ref, hyp, expected", zip(references, inputs, ground_truth_whitespace))
</code_context>

<issue_to_address>
Missing test for empty reference and/or hypothesis in Word Error Rate.

Add test cases with empty reference and/or hypothesis strings to ensure correct Word Error Rate outputs and prevent crashes.
</issue_to_address>
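
A possible addition, reusing the TOLERANCE constant from this file. The expected values encode one reasonable contract (an empty hypothesis against a non-empty reference gives WER 1.0, two empty inputs give 0.0, and the default tokenizer is used); adjust them to whatever WordErrorRate actually guarantees:

```python
@pytest.mark.parametrize("ref, hyp, expected", [
    ("", "", 0.0),              # assumed: nothing to get wrong
    ("the cat sat", "", 1.0),   # assumed: every reference word deleted
])
def test_wer_empty_inputs(ref, hyp, expected):
    wer = WordErrorRate()
    result = wer.calculate(ref, hyp)
    assert pytest.approx(result.value, abs=TOLERANCE) == expected
```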

### Comment 8
<location> `tests/metrics/language/test_fuzzyMatch.py:22` </location>
<code_context>
+    return [token for token in text.split() if token.strip()]
+
+
+def test_exact_match():
+    expected = [False, False, True]
+    fm = FuzzyMatch()
+    for i, (ref, hyp) in enumerate(zip(references, inputs)):
+        assert fm.calculate(ref, hyp) == expected[i]
+
+def test_wer_match_default_tokenizer():
</code_context>

<issue_to_address>
Missing test for FuzzyMatch with empty strings and high/low thresholds.

Please add tests for cases where the reference or input string is empty, and for threshold values at 0.0 and 1.0, to cover these edge scenarios.
</issue_to_address>
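
An illustrative sketch of these cases. It assumes an empty string never matches a non-empty one and that the threshold comparison is inclusive at the boundary; align the expectations with FuzzyMatch's actual semantics:

```python
def test_empty_reference_is_not_a_match():
    fm = FuzzyMatch()
    assert not fm.calculate("", "some hypothesis")   # assumed behaviour

def test_identical_pair_matches_at_threshold_extremes():
    fm = FuzzyMatch()
    ref = "the cat sat on the mat"
    # An identical pair should match regardless of how strict the threshold is,
    # assuming the comparison is inclusive at the boundary.
    assert fm.calculate_bleu([ref], ref, threshold=0.0)
    assert fm.calculate_bleu([ref], ref, threshold=1.0)
```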

### Comment 9
<location> `tests/metrics/language/test_levenshteinCommon.py:26` </location>
<code_context>
+    assert pytest.approx(wip, abs=tol) == expected_wip, f"Expected WIP: {expected_wip}, got {wip}"
+
+
+def test_equal_reference_hypothesis():
+    ref = ["X"]
+    hyp = ["X"]
+    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
+
+
</code_context>

<issue_to_address>
Missing test for Levenshtein metrics with empty reference and/or hypothesis.

Add tests covering scenarios with empty reference and/or hypothesis to verify correct handling of these edge cases.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_equal_reference_hypothesis():
    ref = ["X"]
    hyp = ["X"]
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
=======
def test_equal_reference_hypothesis():
    ref = ["X"]
    hyp = ["X"]
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)

def test_empty_reference_and_hypothesis():
    ref = []
    hyp = []
    # All metrics should be 0 except WIP, which should be 1 (perfect preservation)
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)

def test_empty_reference_nonempty_hypothesis():
    ref = []
    hyp = ["X", "Y"]
    # WER and MER are undefined for empty reference, but typically set to 0.0 or handled gracefully.
    # WIL should be 0.0, WIP should be 1.0 (no information lost/preserved since nothing to compare)
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)

def test_nonempty_reference_empty_hypothesis():
    ref = ["X", "Y"]
    hyp = []
    # WER and MER should be 1.0 (all words missed), WIL should be 1.0 (all information lost), WIP should be 0.0
    run_test_all(ref, hyp, expected_wer=1.0, expected_mer=1.0, expected_wil=1.0, expected_wip=0.0)
>>>>>>> REPLACE

</suggested_fix>

### Comment 10
<location> `tests/metrics/language/test_rogue.py:35` </location>
<code_context>
+    score = rouge.calculate("w1 w2 w3 w4 w5", "w1 w2 w6 w7 w8\nw1 w3 w8 w9 w5")
+    assert pytest.approx(score, abs=0.05) == 0.5
+
+def test_rougel_sum_non_word():
+    rouge = ROUGE(rouge_type="rougeLsum")
+    score = rouge.calculate("w1 w2 w3 w4 w5", "/")
+    assert score == 0
\ No newline at end of file
</code_context>

<issue_to_address>
Missing test for ROUGE metrics with empty reference.

Please add tests for all ROUGE types with an empty reference and a non-empty hypothesis to verify correct handling of this edge case.
</issue_to_address>
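
A possible parametrized version, mirroring the empty-hypothesis check already in this test file and assuming an empty reference likewise yields 0 rather than raising:

```python
@pytest.mark.parametrize("rouge_type", ["rouge1", "rouge2", "rougeL", "rougeLsum"])
def test_empty_reference(rouge_type):
    rouge = ROUGE(rouge_type=rouge_type)
    assert rouge.calculate("", "testing one two") == 0
```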

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +50 to +51
for i, (ref, hyp) in enumerate(zip(references, inputs)):
    assert fm.calculate_bleu([ref], hyp, threshold=0.8) == expected[i]

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
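
For example, the loop above could follow the parametrized style already used in test_wordErrorRate.py; the expected_bleu list name is illustrative and would hold the per-pair expected booleans:

```python
@pytest.mark.parametrize("ref, hyp, expected", zip(references, inputs, expected_bleu))
def test_bleu_match(ref, hyp, expected):
    fm = FuzzyMatch()
    assert fm.calculate_bleu([ref], hyp, threshold=0.8) == expected
```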

Comment on lines +10 to +13
for rouge_type in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
    rouge = ROUGE(rouge_type=rouge_type)
    score = rouge.calculate("testing one two", "")
    assert score == 0

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Create uniform weights for BLEU-N scoring.
"""
if not isinstance(max_ngram, int):
    max_ngram = int(max_ngram)

suggestion (code-quality): Remove unnecessary casts to int, str, float or bool (remove-unnecessary-cast)

Suggested change
max_ngram = int(max_ngram)
max_ngram = max_ngram

Comment on lines +20 to +24
# Approximate WIP: (H / N_ref) * (H / N_hyp)
wip = 0.0
if N_ref > 0 and N_hyp > 0:
    wip = (H / N_ref) * (H / N_hyp)


suggestion (code-quality): We've found these issues:

Suggested change
# Approximate WIP: (H / N_ref) * (H / N_hyp)
wip = 0.0
if N_ref > 0 and N_hyp > 0:
    wip = (H / N_ref) * (H / N_hyp)
wip = (H / N_ref) * (H / N_hyp) if N_ref > 0 and N_hyp > 0 else 0.0

Comment on lines +36 to +40
else:
    reference = clean_text(reference)
    hypothesis = clean_text(hypothesis)
    score = self.scorer.score(reference, hypothesis)
    return score[self.rouge_type].fmeasure

suggestion (code-quality): Remove unnecessary else after guard condition (remove-unnecessary-else)

Suggested change
else:
    reference = clean_text(reference)
    hypothesis = clean_text(hypothesis)
    score = self.scorer.score(reference, hypothesis)
    return score[self.rouge_type].fmeasure
reference = clean_text(reference)
hypothesis = clean_text(hypothesis)
score = self.scorer.score(reference, hypothesis)
return score[self.rouge_type].fmeasure
