
Conversation

@NeriCarcasci commented Aug 28, 2025

Summary

Ported the core language metrics from the Java implementation (trustyai-explainability/explainability-core) to Python, along with corresponding unit tests.

Implemented language metrics (a usage sketch follows the list):

  • BLEU
  • Word Error Rate (WER)
  • Match Error Rate (MER)
  • Word Information Lost (WIL)
  • Word Information Preserved (WIP)
  • Levenshtein
  • Fuzzy Match
  • ROUGE
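
A minimal usage sketch of the metrics listed above. The class and method names follow the unit tests added in this PR; the import paths and the default-tokenizer assumption are illustrative guesses, not part of the change itself:

```python
# Illustrative only: import paths are assumed from the file list in the
# reviewer's guide; class and method names match the PR's unit tests.
from core.metrics.language.bleu.bleu import BLEU                             # assumed path
from core.metrics.language.error_rates.word_error_rate import WordErrorRate  # assumed path
from core.metrics.language.match.fuzzymatch import FuzzyMatch                # assumed path

reference = "the cat is on the mat"
hypothesis = "the cat sat on the mat"

bleu_score = BLEU().calculate([reference], hypothesis, max_ngram=2)  # sentence-level BLEU-2
wer_result = WordErrorRate().calculate(reference, hypothesis)        # result object with .value
is_match = FuzzyMatch().calculate(reference, hypothesis)             # exact match after cleaning
print(bleu_score, wer_result.value, is_match)
```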

Summary by Sourcery

Port core language evaluation metrics from the Java implementation to Python and add comprehensive unit tests for each metric

New Features:

  • Implement BLEU metric with sentence-level and corpus-level scoring and smoothing support
  • Implement Levenshtein distance calculator with insertion, deletion, and substitution counts
  • Implement word-level error rate metrics: Word Error Rate (WER), Match Error Rate (MER), Word Information Lost (WIL), and Word Information Preserved (WIP)
  • Implement ROUGE metric supporting rouge1, rouge2, rougeL, and rougeLsum variants
  • Add FuzzyMatch utility combining exact match, WER, BLEU, and sequence similarity measures
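
For reference, the word-level error rates in the list above relate to the Levenshtein alignment counts as sketched below. The WIP expression matches the approximation quoted in the review comments further down; the others are the usual textbook definitions, not necessarily the exact code in this PR:

```python
# S = substitutions, D = deletions, I = insertions, H = hits (correct words).
# N = S + D + H is the reference length, P = S + I + H the hypothesis length.
def word_level_rates(S: int, D: int, I: int, H: int) -> dict:
    N = S + D + H
    P = S + I + H
    wer = (S + D + I) / N if N else 0.0                              # Word Error Rate
    mer = (S + D + I) / (S + D + I + H) if (S + D + I + H) else 0.0  # Match Error Rate
    wip = (H / N) * (H / P) if N and P else 0.0                      # Word Information Preserved
    wil = 1.0 - wip                                                  # Word Information Lost
    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip}
```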

Build:

  • Add nltk and rouge-score dependencies for language metrics

Tests:

  • Add unit tests for BLEU metric covering various n-grams, weights, smoothing methods, and corpus scenarios
  • Add unit tests for word error rate metrics using multiple tokenizers
  • Add unit tests for Levenshtein-based error rate metrics (WER, MER, WIL, WIP)
  • Add unit tests for ROUGE metric across all supported types
  • Add unit tests for FuzzyMatch matching and threshold functions

sourcery-ai bot commented Aug 28, 2025

Reviewer's Guide

Ports core language metrics from Java to Python, including BLEU, Levenshtein, FuzzyMatch, ROUGE, and the word-based error rates, adds comprehensive unit tests for each metric, and updates project dependencies for the new NLP libraries.

Entity relationship diagram for ErrorRateResult and LevenshteinResult data structures

```mermaid
erDiagram
    ERRORRATERESULT {
        float value
        int insertions
        int deletions
        int substitutions
        int correct
        int reference_length
    }
    LEVENSHTEINRESULT {
        int distance
        int insertions
        int deletions
        int substitutions
        int reference_length
    }
    ERRORRATERESULT ||--o| LEVENSHTEINRESULT : uses
```
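
For readers of the diagram, a minimal sketch of the two result structures as plain dataclasses; the field names come from the diagram, while the dataclass form is an assumption about the implementation:

```python
from dataclasses import dataclass

@dataclass
class LevenshteinResult:
    distance: int
    insertions: int
    deletions: int
    substitutions: int
    reference_length: int

@dataclass
class ErrorRateResult:
    value: float            # the metric value, e.g. WER
    insertions: int
    deletions: int
    substitutions: int
    correct: int
    reference_length: int
```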

Class diagram for BLEUMetric and ROUGEMetric tokenization and scoring

```mermaid
classDiagram
    class BLEUMetric {
        +tokenizer: Callable
        +smoothing_function
        +calculate(references, hypothesis, max_ngram, weights)
        +calculate_corpus(references, hypotheses, max_ngram, weights)
    }
    class ROUGEMetric {
        +rouge_type: Literal
        +scorer
        +calculate(reference, hypothesis)
        +_rouge_lsum(reference, hypothesis)
    }
    BLEUMetric o-- "tokenizer" Callable
    ROUGEMetric o-- "scorer" RougeScorer
```
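
A small interface sketch matching the diagram. The constructor keyword names (smoothing_method, tokenizer, rouge_type) appear in the code excerpts quoted later in the review, and the tests import the classes as BLEU and ROUGE; the import paths, example strings, and argument values here are only illustrative:

```python
from core.metrics.language.bleu.bleu import BLEU     # assumed path
from core.metrics.language.rogue.rogue import ROUGE  # assumed path (folder is currently spelled "rogue")

bleu = BLEU(smoothing_method=1, tokenizer=str.split)
corpus_score = bleu.calculate_corpus(
    [["it is a guide to action that ensures the military will forever heed party commands"]],
    ["it is a guide to action which ensures that the military always obeys the commands of the party"],
    max_ngram=4,
    weights=[0.25, 0.25, 0.25, 0.25],
)

rouge = ROUGE(rouge_type="rougeLsum")
lsum_f1 = rouge.calculate("w1 w2 w3 w4 w5", "w1 w2 w6 w7 w8\nw1 w3 w8 w9 w5")
print(corpus_score, lsum_f1)
```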

File-Level Changes

Ported language metric implementations to Python

  Details:
  • Implemented BLEU metric with sentence and corpus scoring and customizable smoothing and weights
  • Developed Levenshtein core with both simple and aligned distance counters
  • Introduced FuzzyMatch for exact, WER, BLEU, and similarity-based comparisons
  • Added ROUGEMetric supporting multiple rouge types including rougeLsum
  • Created word-based error rate classes: WER, MER, WIL, WIP
  • Added clean_text utility for text normalization

  Files:
  • src/core/metrics/language/bleu/bleu.py
  • src/core/metrics/language/Levenshtein/levenshtein.py
  • src/core/metrics/language/match/fuzzymatch.py
  • src/core/metrics/language/rogue/rogue.py
  • src/core/metrics/language/error_rates/word_error_rate.py
  • src/core/metrics/language/error_rates/match_error_rate.py
  • src/core/metrics/language/error_rates/word_information_lost.py
  • src/core/metrics/language/error_rates/word_information_preserved.py
  • src/core/metrics/language/error_rates/base_result.py
  • src/core/metrics/language/utils.py

Added comprehensive unit tests for all metrics

  Details:
  • Tested BLEU scoring across smoothing, weights, n-gram levels, edge cases, and corpus mode
  • Validated WER under multiple tokenizers
  • Covered FuzzyMatch methods for exact, WER, BLEU, and similarity thresholds
  • Combined tests for WER, MER, WIL, WIP scenarios
  • Tested ROUGE variants including rouge1, rouge2, rougeL, and rougeLsum

  Files:
  • tests/metrics/language/test_bleu.py
  • tests/metrics/language/test_wordErrorRate.py
  • tests/metrics/language/test_fuzzyMatch.py
  • tests/metrics/language/test_levenshteinCommon.py
  • tests/metrics/language/test_rogue.py

Updated project dependencies

  Details:
  • Added nltk for tokenization and smoothing functions
  • Added rouge-score for ROUGE metric evaluation

  Files:
  • pyproject.toml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Remove the debug print statements in FuzzyMatch.calculate_wer to avoid unwanted console output during metric evaluation.
  • Levenshtein.compute_ currently returns an integer but its docstring says it returns a LevenshteinResult—either update the implementation to return a LevenshteinResult or adjust the docs and naming to reflect the actual return type.
  • The folder name ‘rogue’ doesn’t match the ROUGE metric naming—consider renaming it to ‘rouge’ for consistency in both code and imports.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Remove the debug print statements in FuzzyMatch.calculate_wer to avoid unwanted console output during metric evaluation.
- Levenshtein.compute_ currently returns an integer but its docstring says it returns a LevenshteinResult—either update the implementation to return a LevenshteinResult or adjust the docs and naming to reflect the actual return type.
- The folder name ‘rogue’ doesn’t match the ROUGE metric naming—consider renaming it to ‘rouge’ for consistency in both code and imports.

## Individual Comments

### Comment 1
<location> `src/core/metrics/language/Levenshtein/levenshtein.py:40` </location>
<code_context>
+        clean_ref = clean_text(reference)
+        clean_hyp = clean_text(hypothesis)
+
+        return edit_distance(clean_ref, clean_hyp)
+
+
</code_context>

<issue_to_address>
compute_ returns raw edit_distance instead of LevenshteinResult.

Please update compute_ to return a LevenshteinResult object instead of a raw integer for consistency with the rest of the codebase.

Suggested implementation:

```python
        clean_ref = clean_text(reference)
        clean_hyp = clean_text(hypothesis)

        distance, insertions, deletions, substitutions = edit_distance(clean_ref, clean_hyp, return_operations=True)
        reference_length = len(clean_ref)
        return LevenshteinResult(
            distance=distance,
            insertions=insertions,
            deletions=deletions,
            substitutions=substitutions,
            reference_length=reference_length
        )

```

- The `edit_distance` function must support a `return_operations=True` argument and return a tuple: (distance, insertions, deletions, substitutions). If it does not, you will need to update `edit_distance` accordingly.
- If `edit_distance` only returns an integer, you must refactor it to also compute and return insertions, deletions, and substitutions.
- Ensure that `LevenshteinResult` is imported or defined in the scope of `compute_`.
</issue_to_address>
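
If edit_distance does need that refactor, one possible shape (illustrative only; the function name, signature, and return layout are not taken from the PR) is a dynamic-programming Levenshtein that backtracks to count the operations:

```python
def edit_distance_with_ops(ref, hyp):
    """Return (distance, insertions, deletions, substitutions) for ref -> hyp."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrack along one optimal path to count the individual operations.
    i, j, ins, dels, subs = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dels, i = dels + 1, i - 1
    return dp[m][n], ins, dels, subs
```

compute_ could then unpack these counts into a LevenshteinResult as suggested above.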

### Comment 2
<location> `src/core/metrics/language/bleu/bleu.py:16` </location>
<code_context>
+                                 None means no smoothing.
+        :param tokenizer: A callable that tokenizes a string into a list of tokens. Defaults to str.split.
+        """
+        if isinstance(smoothing_method, int):
+            self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
+        elif callable(smoothing_method):
+            self.smoothing_function = smoothing_method
</code_context>

<issue_to_address>
No validation for smoothing_method range.

Invalid values for smoothing_method will cause AttributeError. Please validate that smoothing_method is within the allowed range (1-7) and handle errors appropriately.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        if isinstance(smoothing_method, int):
            self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
        elif callable(smoothing_method):
            self.smoothing_function = smoothing_method
        else:
            self.smoothing_function = None
=======
        if isinstance(smoothing_method, int):
            if 1 <= smoothing_method <= 7:
                self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
            else:
                raise ValueError(
                    f"smoothing_method must be an integer between 1 and 7 (inclusive), got {smoothing_method}."
                )
        elif callable(smoothing_method):
            self.smoothing_function = smoothing_method
        else:
            self.smoothing_function = None
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `src/core/metrics/language/bleu/bleu.py:31` </location>
<code_context>
+        """
+        Create uniform weights for BLEU-N scoring.
+        """
+        if not isinstance(max_ngram, int):
+            max_ngram = int(max_ngram)
+        return [1.0 / max_ngram] * max_ngram
</code_context>

<issue_to_address>
Implicit conversion of max_ngram may mask errors.

Instead of silently casting, raise an error or warning when max_ngram is not an integer to prevent hidden bugs.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        if not isinstance(max_ngram, int):
            max_ngram = int(max_ngram)
        return [1.0 / max_ngram] * max_ngram
=======
        if not isinstance(max_ngram, int):
            raise TypeError(f"max_ngram must be an integer, got {type(max_ngram).__name__}")
        return [1.0 / max_ngram] * max_ngram
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `src/core/metrics/language/rogue/rogue.py:50` </location>
<code_context>
+        :param hypothesis: A hypothesis paragraph consisting of multiple sentences.
+        :return: The average ROUGE-L F1 score over aligned sentence pairs.
+        """
+        ref_sents = self.simple_sent_tokenize(reference)
+        hyp_sents = self.simple_sent_tokenize(hypothesis)
+        ref_sents_cleaned = [clean_text(s) for s in ref_sents]
+        hyp_sents_cleaned = [clean_text(s) for s in hyp_sents]
+
+        total_score = 0.0
</code_context>

<issue_to_address>
Sentence alignment in ROUGE-Lsum may not handle unequal sentence counts robustly.

The current approach may overlook extra sentences in either input. Please review if this averaging method is appropriate, or consider alternative handling for unaligned sentences.
</issue_to_address>
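
One alternative worth considering, sketched here only for discussion (this is not the PR's implementation and it differs from the official rougeLsum union-LCS definition): score each reference sentence against its best-matching hypothesis sentence and normalize by the longer side, so unmatched sentences on either end lower the score instead of being ignored:

```python
from rouge_score import rouge_scorer

def rouge_lsum_best_match(ref_sents, hyp_sents):
    """Average best-match ROUGE-L F1, penalizing unmatched sentences."""
    if not ref_sents or not hyp_sents:
        return 0.0
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    best = [
        max(scorer.score(ref, hyp)["rougeL"].fmeasure for hyp in hyp_sents)
        for ref in ref_sents
    ]
    # Dividing by the longer list means extra sentences on either side reduce the score.
    return sum(best) / max(len(ref_sents), len(hyp_sents))
```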

### Comment 5
<location> `tests/metrics/language/test_bleu.py:44` </location>
<code_context>
+    score = bleu.calculate(common_references, common_hypothesis, max_ngram=2)
+    assert pytest.approx(score, abs=0.01) == 0.4082
+
+def test_zero_matches():
+    references = [uncommon]
+    hypothesis = "John loves Mary"
+    bleu = BLEU()
+    for n in range(1, 6):
+        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
+
</code_context>

<issue_to_address>
Missing test for empty hypothesis and empty references in BLEU metric.

Please add tests with empty strings or lists for hypothesis and references to verify BLEU handles these cases correctly.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH

def test_zero_matches():
    references = [uncommon]
    hypothesis = "John loves Mary"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

=======

def test_zero_matches():
    references = [uncommon]
    hypothesis = "John loves Mary"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

def test_empty_hypothesis():
    references = [common]
    hypothesis = ""
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

def test_empty_references():
    references = []
    hypothesis = "the cat is on the mat"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

def test_empty_hypothesis_and_references():
    references = []
    hypothesis = ""
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0

>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/metrics/language/test_bleu.py:97` </location>
<code_context>
+    hypotheses = [hyp1, hyp2]
+    assert pytest.approx(bleu.calculate_corpus(references, hypotheses, max_ngram=4, weights=weights), 0.01) == 0.5818
+
+def test_bleu_corpus_multiple_weight_sets():
+    bleu = BLEU()
+    weight_sets = [
+        [0.5, 0.5],
+        [0.333, 0.333, 0.334],
+        [0.25, 0.25, 0.25, 0.25],
+        [0.2, 0.2, 0.2, 0.2, 0.2]
+    ]
+    expected_scores = [0.8242, 0.7067, 0.5920, 0.4719]
+    references = [[ref1a, ref1b, ref1c], [ref2a]]
+    hypotheses = [hyp1, hyp2]
+
+    for weights, expected in zip(weight_sets, expected_scores):
+        ngram = len(weights)
+        score = bleu.calculate_corpus(references, hypotheses, max_ngram=ngram, weights=weights)
+        assert pytest.approx(score, 0.02) == expected
\ No newline at end of file
</code_context>

<issue_to_address>
Missing test for invalid BLEU weights (e.g., negative, zero, or not summing to 1).

Add tests with invalid BLEU weights to verify that the metric properly handles or rejects these cases.
</issue_to_address>
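
A possible shape for such tests, reusing the ref1a and hyp1 fixtures from this file; it assumes the metric validates weights and raises ValueError for clearly invalid ones, which may first require adding that validation:

```python
# Illustrative: the expected ValueError is an assumed contract, not current behaviour.
@pytest.mark.parametrize("bad_weights", [
    [-0.5, 1.5],   # negative component
    [0.0, 0.0],    # all-zero weights
    [0.7, 0.7],    # does not sum to 1
])
def test_bleu_invalid_weights(bad_weights):
    bleu = BLEU()
    with pytest.raises(ValueError):
        bleu.calculate([ref1a], hyp1, max_ngram=len(bad_weights), weights=bad_weights)
```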

### Comment 7
<location> `tests/metrics/language/test_wordErrorRate.py:43` </location>
<code_context>
+
+
+
[email protected]("ref, hyp, expected", zip(references, inputs, ground_truth_commons))
+def test_commons_tokenizer(ref, hyp, expected):
+    wer = WordErrorRate(tokenizer=commons_tokenizer)
+    result = wer.calculate(ref, hyp)
+    assert pytest.approx(result.value, abs=TOLERANCE) == expected
+
[email protected]("ref, hyp, expected", zip(references, inputs, ground_truth_whitespace))
</code_context>

<issue_to_address>
Missing test for empty reference and/or hypothesis in Word Error Rate.

Add test cases with empty reference and/or hypothesis strings to ensure correct Word Error Rate outputs and prevent crashes.
</issue_to_address>
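
A possible addition, reusing the TOLERANCE constant from this file. The expected values encode one reasonable contract (an empty hypothesis against a non-empty reference gives WER 1.0, two empty inputs give 0.0, and the default tokenizer is used); adjust them to whatever WordErrorRate actually guarantees:

```python
@pytest.mark.parametrize("ref, hyp, expected", [
    ("", "", 0.0),              # assumed: nothing to get wrong
    ("the cat sat", "", 1.0),   # assumed: every reference word deleted
])
def test_wer_empty_inputs(ref, hyp, expected):
    wer = WordErrorRate()
    result = wer.calculate(ref, hyp)
    assert pytest.approx(result.value, abs=TOLERANCE) == expected
```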

### Comment 8
<location> `tests/metrics/language/test_fuzzyMatch.py:22` </location>
<code_context>
+    return [token for token in text.split() if token.strip()]
+
+
+def test_exact_match():
+    expected = [False, False, True]
+    fm = FuzzyMatch()
+    for i, (ref, hyp) in enumerate(zip(references, inputs)):
+        assert fm.calculate(ref, hyp) == expected[i]
+
+def test_wer_match_default_tokenizer():
</code_context>

<issue_to_address>
Missing test for FuzzyMatch with empty strings and high/low thresholds.

Please add tests for cases where the reference or input string is empty, and for threshold values at 0.0 and 1.0, to cover these edge scenarios.
</issue_to_address>
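
An illustrative sketch of these cases. It assumes an empty string never matches a non-empty one and that the threshold comparison is inclusive at the boundary; align the expectations with FuzzyMatch's actual semantics:

```python
def test_empty_reference_is_not_a_match():
    fm = FuzzyMatch()
    assert not fm.calculate("", "some hypothesis")   # assumed behaviour

def test_identical_pair_matches_at_threshold_extremes():
    fm = FuzzyMatch()
    ref = "the cat sat on the mat"
    # An identical pair should match regardless of how strict the threshold is,
    # assuming the comparison is inclusive at the boundary.
    assert fm.calculate_bleu([ref], ref, threshold=0.0)
    assert fm.calculate_bleu([ref], ref, threshold=1.0)
```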

### Comment 9
<location> `tests/metrics/language/test_levenshteinCommon.py:26` </location>
<code_context>
+    assert pytest.approx(wip, abs=tol) == expected_wip, f"Expected WIP: {expected_wip}, got {wip}"
+
+
+def test_equal_reference_hypothesis():
+    ref = ["X"]
+    hyp = ["X"]
+    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
+
+
</code_context>

<issue_to_address>
Missing test for Levenshtein metrics with empty reference and/or hypothesis.

Add tests covering scenarios with empty reference and/or hypothesis to verify correct handling of these edge cases.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_equal_reference_hypothesis():
    ref = ["X"]
    hyp = ["X"]
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
=======
def test_equal_reference_hypothesis():
    ref = ["X"]
    hyp = ["X"]
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)

def test_empty_reference_and_hypothesis():
    ref = []
    hyp = []
    # All metrics should be 0 except WIP, which should be 1 (perfect preservation)
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)

def test_empty_reference_nonempty_hypothesis():
    ref = []
    hyp = ["X", "Y"]
    # WER and MER are undefined for empty reference, but typically set to 0.0 or handled gracefully.
    # WIL should be 0.0, WIP should be 1.0 (no information lost/preserved since nothing to compare)
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)

def test_nonempty_reference_empty_hypothesis():
    ref = ["X", "Y"]
    hyp = []
    # WER and MER should be 1.0 (all words missed), WIL should be 1.0 (all information lost), WIP should be 0.0
    run_test_all(ref, hyp, expected_wer=1.0, expected_mer=1.0, expected_wil=1.0, expected_wip=0.0)
>>>>>>> REPLACE

</suggested_fix>

### Comment 10
<location> `tests/metrics/language/test_rogue.py:35` </location>
<code_context>
+    score = rouge.calculate("w1 w2 w3 w4 w5", "w1 w2 w6 w7 w8\nw1 w3 w8 w9 w5")
+    assert pytest.approx(score, abs=0.05) == 0.5
+
+def test_rougel_sum_non_word():
+    rouge = ROUGE(rouge_type="rougeLsum")
+    score = rouge.calculate("w1 w2 w3 w4 w5", "/")
+    assert score == 0
\ No newline at end of file
</code_context>

<issue_to_address>
Missing test for ROUGE metrics with empty reference.

Please add tests for all ROUGE types with an empty reference and a non-empty hypothesis to verify correct handling of this edge case.
</issue_to_address>
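
A possible parametrized version, mirroring the empty-hypothesis check already in this test file and assuming an empty reference likewise yields 0 rather than raising:

```python
@pytest.mark.parametrize("rouge_type", ["rouge1", "rouge2", "rougeL", "rougeLsum"])
def test_empty_reference(rouge_type):
    rouge = ROUGE(rouge_type=rouge_type)
    assert rouge.calculate("", "testing one two") == 0
```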

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +50 to +51
for i, (ref, hyp) in enumerate(zip(references, inputs)):
    assert fm.calculate_bleu([ref], hyp, threshold=0.8) == expected[i]

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
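
For example, the loop above could follow the parametrized style already used in test_wordErrorRate.py; the expected_bleu list name is illustrative and would hold the per-pair expected booleans:

```python
@pytest.mark.parametrize("ref, hyp, expected", zip(references, inputs, expected_bleu))
def test_bleu_match(ref, hyp, expected):
    fm = FuzzyMatch()
    assert fm.calculate_bleu([ref], hyp, threshold=0.8) == expected
```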

Comment on lines +10 to +13
for rouge_type in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
    rouge = ROUGE(rouge_type=rouge_type)
    score = rouge.calculate("testing one two", "")
    assert score == 0

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Create uniform weights for BLEU-N scoring.
"""
if not isinstance(max_ngram, int):
    max_ngram = int(max_ngram)

suggestion (code-quality): Remove unnecessary casts to int, str, float or bool (remove-unnecessary-cast)

Suggested change
max_ngram = int(max_ngram)
max_ngram = max_ngram

Comment on lines +20 to +24
# Approximate WIP: (H / N_ref) * (H / N_hyp)
wip = 0.0
if N_ref > 0 and N_hyp > 0:
    wip = (H / N_ref) * (H / N_hyp)


suggestion (code-quality): We've found these issues:

Suggested change
# Approximate WIP: (H / N_ref) * (H / N_hyp)
wip = 0.0
if N_ref > 0 and N_hyp > 0:
    wip = (H / N_ref) * (H / N_hyp)
wip = (H / N_ref) * (H / N_hyp) if N_ref > 0 and N_hyp > 0 else 0.0

Comment on lines +36 to +40
else:
    reference = clean_text(reference)
    hypothesis = clean_text(hypothesis)
    score = self.scorer.score(reference, hypothesis)
    return score[self.rouge_type].fmeasure

suggestion (code-quality): Remove unnecessary else after guard condition (remove-unnecessary-else)

Suggested change
else:
    reference = clean_text(reference)
    hypothesis = clean_text(hypothesis)
    score = self.scorer.score(reference, hypothesis)
    return score[self.rouge_type].fmeasure
reference = clean_text(reference)
hypothesis = clean_text(hypothesis)
score = self.scorer.score(reference, hypothesis)
return score[self.rouge_type].fmeasure
