Add Language Metrics and Tests #48
base: main
Conversation
Reviewer's Guide
Ports core language metrics from Java to Python, including BLEU, Levenshtein, FuzzyMatch, ROUGE, and word-based error rates, and adds comprehensive unit tests for each metric along with updating project dependencies for new NLP libraries.

Entity relationship diagram for ErrorRateResult and LevenshteinResult data structures:
erDiagram
    ERRORRATERESULT {
        float value
        int insertions
        int deletions
        int substitutions
        int correct
        int reference_length
    }
    LEVENSHTEINRESULT {
        int distance
        int insertions
        int deletions
        int substitutions
        int reference_length
    }
    ERRORRATERESULT ||--o| LEVENSHTEINRESULT : uses
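For orientation, the two result containers in the diagram could be expressed as Python dataclasses roughly as follows. This is a sketch inferred from the field names above, not the PR's actual definitions:

```python
from dataclasses import dataclass


@dataclass
class LevenshteinResult:
    """Raw edit-distance counts between a reference and a hypothesis."""
    distance: int
    insertions: int
    deletions: int
    substitutions: int
    reference_length: int


@dataclass
class ErrorRateResult:
    """Normalized error rate derived from the Levenshtein counts."""
    value: float
    insertions: int
    deletions: int
    substitutions: int
    correct: int
    reference_length: int
```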
Class diagram for BLEUMetric and ROUGEMetric tokenization and scoring:
classDiagram
    class BLEUMetric {
        +tokenizer: Callable
        +smoothing_function
        +calculate(references, hypothesis, max_ngram, weights)
        +calculate_corpus(references, hypotheses, max_ngram, weights)
    }
    class ROUGEMetric {
        +rouge_type: Literal
        +scorer
        +calculate(reference, hypothesis)
        +_rouge_lsum(reference, hypothesis)
    }
    BLEUMetric o-- "tokenizer" Callable
    ROUGEMetric o-- "scorer" RougeScorer
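A rough usage sketch based on the tests in this PR; the import paths are assumptions inferred from the file locations quoted in the review, and the class and argument names come from the test files:

```python
# Hypothetical import paths; adjust to the project's actual package layout.
from core.metrics.language.bleu.bleu import BLEU
from core.metrics.language.rogue.rogue import ROUGE  # folder currently spelled 'rogue'

bleu = BLEU()
bleu_score = bleu.calculate(
    ["the cat is on the mat"],   # references
    "the cat sat on the mat",    # hypothesis
    max_ngram=2,
)

rouge = ROUGE(rouge_type="rougeL")
rouge_score = rouge.calculate("the cat is on the mat", "the cat sat on the mat")
```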
Hey there - I've reviewed your changes - here's some feedback:
- Remove the debug print statements in FuzzyMatch.calculate_wer to avoid unwanted console output during metric evaluation.
- Levenshtein.compute_ currently returns an integer but its docstring says it returns a LevenshteinResult—either update the implementation to return a LevenshteinResult or adjust the docs and naming to reflect the actual return type.
- The folder name ‘rogue’ doesn’t match the ROUGE metric naming—consider renaming it to ‘rouge’ for consistency in both code and imports.
## Individual Comments
### Comment 1
<location> `src/core/metrics/language/Levenshtein/levenshtein.py:40` </location>
<code_context>
+        clean_ref = clean_text(reference)
+        clean_hyp = clean_text(hypothesis)
+
+        return edit_distance(clean_ref, clean_hyp)
+
+
</code_context>
<issue_to_address>
compute_ returns raw edit_distance instead of LevenshteinResult.
Please update compute_ to return a LevenshteinResult object instead of a raw integer for consistency with the rest of the codebase.
Suggested implementation:
```python
        clean_ref = clean_text(reference)
        clean_hyp = clean_text(hypothesis)
        distance, insertions, deletions, substitutions = edit_distance(clean_ref, clean_hyp, return_operations=True)
        reference_length = len(clean_ref)
        return LevenshteinResult(
            distance=distance,
            insertions=insertions,
            deletions=deletions,
            substitutions=substitutions,
            reference_length=reference_length
        )
```
- The `edit_distance` function must support a `return_operations=True` argument and return a tuple: (distance, insertions, deletions, substitutions). If it does not, you will need to update `edit_distance` accordingly.
- If `edit_distance` only returns an integer, you must refactor it to also compute and return insertions, deletions, and substitutions.
- Ensure that `LevenshteinResult` is imported or defined in the scope of `compute_`.
</issue_to_address>
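If `edit_distance` does need the refactor described in the last bullet, a minimal dynamic-programming version that also tallies operation counts could look roughly like this. It is a sketch, not the project's implementation, and the `return_operations` flag is the hypothetical parameter named in the suggestion:

```python
def edit_distance(reference, hypothesis, return_operations=False):
    """Levenshtein distance over two sequences; optionally returns
    (distance, insertions, deletions, substitutions) via a DP backtrace."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion (reference token dropped)
                dp[i][j - 1] + 1,        # insertion (extra hypothesis token)
                dp[i - 1][j - 1] + cost  # substitution or match
            )

    if not return_operations:
        return dp[m][n]

    # Backtrace the DP table to count each operation type.
    insertions = deletions = substitutions = 0
    i, j = m, n
    while i > 0 or j > 0:
        diag_cost = 0 if i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + diag_cost:
            if diag_cost:
                substitutions += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            deletions += 1
            i -= 1
        else:
            insertions += 1
            j -= 1
    return dp[m][n], insertions, deletions, substitutions
```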
### Comment 2
<location> `src/core/metrics/language/bleu/bleu.py:16` </location>
<code_context>
+                                 None means no smoothing.
+        :param tokenizer: A callable that tokenizes a string into a list of tokens. Defaults to str.split.
+        """
+        if isinstance(smoothing_method, int):
+            self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
+        elif callable(smoothing_method):
+            self.smoothing_function = smoothing_method
</code_context>
<issue_to_address>
No validation for smoothing_method range.
Invalid values for smoothing_method will cause AttributeError. Please validate that smoothing_method is within the allowed range (1-7) and handle errors appropriately.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
        if isinstance(smoothing_method, int):
            self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
        elif callable(smoothing_method):
            self.smoothing_function = smoothing_method
        else:
            self.smoothing_function = None
=======
        if isinstance(smoothing_method, int):
            if 1 <= smoothing_method <= 7:
                self.smoothing_function = getattr(SmoothingFunction(), f"method{smoothing_method}")
            else:
                raise ValueError(
                    f"smoothing_method must be an integer between 1 and 7 (inclusive), got {smoothing_method}."
                )
        elif callable(smoothing_method):
            self.smoothing_function = smoothing_method
        else:
            self.smoothing_function = None
>>>>>>> REPLACE
</suggested_fix>
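For illustration, a short sketch of the failure mode the validation addresses, assuming the BLEU class from this PR is in scope:

```python
bleu = BLEU(smoothing_method=4)   # valid: resolves to NLTK's SmoothingFunction().method4

# smoothing_method=9 currently fails with AttributeError from getattr();
# with the suggested check it would raise a descriptive ValueError instead.
try:
    BLEU(smoothing_method=9)
except (AttributeError, ValueError) as exc:
    print(f"rejected: {exc}")
```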
### Comment 3
<location> `src/core/metrics/language/bleu/bleu.py:31` </location>
<code_context>
+        """
+        Create uniform weights for BLEU-N scoring.
+        """
+        if not isinstance(max_ngram, int):
+            max_ngram = int(max_ngram)
+        return [1.0 / max_ngram] * max_ngram
</code_context>
<issue_to_address>
Implicit conversion of max_ngram may mask errors.
Instead of silently casting, raise an error or warning when max_ngram is not an integer to prevent hidden bugs.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
        if not isinstance(max_ngram, int):
            max_ngram = int(max_ngram)
        return [1.0 / max_ngram] * max_ngram
=======
        if not isinstance(max_ngram, int):
            raise TypeError(f"max_ngram must be an integer, got {type(max_ngram).__name__}")
        return [1.0 / max_ngram] * max_ngram
>>>>>>> REPLACE
</suggested_fix>
### Comment 4
<location> `src/core/metrics/language/rogue/rogue.py:50` </location>
<code_context>
+        :param hypothesis: A hypothesis paragraph consisting of multiple sentences.
+        :return: The average ROUGE-L F1 score over aligned sentence pairs.
+        """
+        ref_sents = self.simple_sent_tokenize(reference)
+        hyp_sents = self.simple_sent_tokenize(hypothesis)
+        ref_sents_cleaned = [clean_text(s) for s in ref_sents]
+        hyp_sents_cleaned = [clean_text(s) for s in hyp_sents]
+
+        total_score = 0.0
</code_context>
<issue_to_address>
Sentence alignment in ROUGE-Lsum may not handle unequal sentence counts robustly.
The current approach may overlook extra sentences in either input. Please review if this averaging method is appropriate, or consider alternative handling for unaligned sentences.
</issue_to_address>
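One way to make the handling explicit is to score each reference sentence against its best-matching hypothesis sentence and average over the reference side, so extra or reordered sentences on either side are not silently dropped. A rough sketch, not the PR's code, and simpler than the union-LCS used by the reference rougeLsum implementation; `sent_tokenize` and `clean_text` stand in for the module's own helpers:

```python
from rouge_score import rouge_scorer


def rouge_lsum_best_match(reference, hypothesis, sent_tokenize, clean_text):
    """Average, over reference sentences, of the best ROUGE-L F1 against
    any hypothesis sentence; handles unequal sentence counts explicitly."""
    ref_sents = [clean_text(s) for s in sent_tokenize(reference)]
    hyp_sents = [clean_text(s) for s in sent_tokenize(hypothesis)]
    if not ref_sents or not hyp_sents:
        return 0.0

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    total = 0.0
    for ref_sent in ref_sents:
        total += max(
            scorer.score(ref_sent, hyp_sent)["rougeL"].fmeasure
            for hyp_sent in hyp_sents
        )
    return total / len(ref_sents)
```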
### Comment 5
<location> `tests/metrics/language/test_bleu.py:44` </location>
<code_context>
+    score = bleu.calculate(common_references, common_hypothesis, max_ngram=2)
+    assert pytest.approx(score, abs=0.01) == 0.4082
+
+def test_zero_matches():
+    references = [uncommon]
+    hypothesis = "John loves Mary"
+    bleu = BLEU()
+    for n in range(1, 6):
+        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
+
</code_context>
<issue_to_address>
Missing test for empty hypothesis and empty references in BLEU metric.
Please add tests with empty strings or lists for hypothesis and references to verify BLEU handles these cases correctly.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_zero_matches():
    references = [uncommon]
    hypothesis = "John loves Mary"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
=======
def test_zero_matches():
    references = [uncommon]
    hypothesis = "John loves Mary"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
def test_empty_hypothesis():
    references = [common]
    hypothesis = ""
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
def test_empty_references():
    references = []
    hypothesis = "the cat is on the mat"
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
def test_empty_hypothesis_and_references():
    references = []
    hypothesis = ""
    bleu = BLEU()
    for n in range(1, 6):
        assert bleu.calculate(references, hypothesis, max_ngram=n) == 0.0
>>>>>>> REPLACE
</suggested_fix>
### Comment 6
<location> `tests/metrics/language/test_bleu.py:97` </location>
<code_context>
+    hypotheses = [hyp1, hyp2]
+    assert pytest.approx(bleu.calculate_corpus(references, hypotheses, max_ngram=4, weights=weights), 0.01) == 0.5818
+
+def test_bleu_corpus_multiple_weight_sets():
+    bleu = BLEU()
+    weight_sets = [
+        [0.5, 0.5],
+        [0.333, 0.333, 0.334],
+        [0.25, 0.25, 0.25, 0.25],
+        [0.2, 0.2, 0.2, 0.2, 0.2]
+    ]
+    expected_scores = [0.8242, 0.7067, 0.5920, 0.4719]
+    references = [[ref1a, ref1b, ref1c], [ref2a]]
+    hypotheses = [hyp1, hyp2]
+
+    for weights, expected in zip(weight_sets, expected_scores):
+        ngram = len(weights)
+        score = bleu.calculate_corpus(references, hypotheses, max_ngram=ngram, weights=weights)
+        assert pytest.approx(score, 0.02) == expected
\ No newline at end of file
</code_context>
<issue_to_address>
Missing test for invalid BLEU weights (e.g., negative, zero, or not summing to 1).
Add tests with invalid BLEU weights to verify that the metric properly handles or rejects these cases.
</issue_to_address>
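A hedged sketch of such a test, reusing the corpus fixtures above; whether the metric should raise, normalize, or fall back to uniform weights is a design decision, and the expected exception here is only an assumption:

```python
@pytest.mark.parametrize("bad_weights", [
    [-0.5, 1.5],   # negative weight
    [0.0, 0.0],    # all-zero weights
    [0.7, 0.7],    # does not sum to 1
])
def test_bleu_corpus_invalid_weights(bad_weights):
    bleu = BLEU()
    references = [[ref1a, ref1b, ref1c], [ref2a]]
    hypotheses = [hyp1, hyp2]
    with pytest.raises((ValueError, AssertionError)):
        bleu.calculate_corpus(
            references, hypotheses, max_ngram=len(bad_weights), weights=bad_weights
        )
```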
### Comment 7
<location> `tests/metrics/language/test_wordErrorRate.py:43` </location>
<code_context>
+
+
+
+@pytest.mark.parametrize("ref, hyp, expected", zip(references, inputs, ground_truth_commons))
+def test_commons_tokenizer(ref, hyp, expected):
+    wer = WordErrorRate(tokenizer=commons_tokenizer)
+    result = wer.calculate(ref, hyp)
+    assert pytest.approx(result.value, abs=TOLERANCE) == expected
+
+@pytest.mark.parametrize("ref, hyp, expected", zip(references, inputs, ground_truth_whitespace))
</code_context>
<issue_to_address>
Missing test for empty reference and/or hypothesis in Word Error Rate.
Add test cases with empty reference and/or hypothesis strings to ensure correct Word Error Rate outputs and prevent crashes.
</issue_to_address>
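For example, something along these lines; it assumes WordErrorRate can be constructed with its default tokenizer, and the expected values encode one possible convention (empty hypothesis gives WER 1.0, two empty inputs give 0.0) that would need to match whatever behaviour the implementation settles on:

```python
def test_empty_hypothesis():
    wer = WordErrorRate()
    result = wer.calculate("the quick brown fox", "")
    # Every reference word is a deletion when the hypothesis is empty.
    assert pytest.approx(result.value, abs=TOLERANCE) == 1.0


def test_empty_reference_and_hypothesis():
    wer = WordErrorRate()
    result = wer.calculate("", "")
    # Assumed convention: no errors over an empty reference gives 0.0;
    # the implementation may instead raise or return a sentinel value.
    assert pytest.approx(result.value, abs=TOLERANCE) == 0.0
```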
### Comment 8
<location> `tests/metrics/language/test_fuzzyMatch.py:22` </location>
<code_context>
+    return [token for token in text.split() if token.strip()]
+
+
+def test_exact_match():
+    expected = [False, False, True]
+    fm = FuzzyMatch()
+    for i, (ref, hyp) in enumerate(zip(references, inputs)):
+        assert fm.calculate(ref, hyp) == expected[i]
+
+def test_wer_match_default_tokenizer():
</code_context>
<issue_to_address>
Missing test for FuzzyMatch with empty strings and high/low thresholds.
Please add tests for cases where the reference or input string is empty, and for threshold values at 0.0 and 1.0, to cover these edge scenarios.
</issue_to_address>
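A possible shape for those cases; the outcomes for empty strings and for the 0.0/1.0 thresholds are assumptions about the intended behaviour rather than documented semantics:

```python
def test_fuzzy_match_empty_strings():
    fm = FuzzyMatch()
    # Assumed: two empty strings are an exact match, while an empty
    # hypothesis against a non-empty reference is not.
    assert fm.calculate("", "")
    assert not fm.calculate("some reference text", "")


def test_fuzzy_match_threshold_extremes():
    fm = FuzzyMatch()
    ref, hyp = "the quick brown fox", "the quick brown dog"
    # Assumed: threshold=0.0 accepts any score, threshold=1.0 only a perfect match.
    assert fm.calculate_bleu([ref], hyp, threshold=0.0)
    assert not fm.calculate_bleu([ref], hyp, threshold=1.0)
```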
### Comment 9
<location> `tests/metrics/language/test_levenshteinCommon.py:26` </location>
<code_context>
+    assert pytest.approx(wip, abs=tol) == expected_wip, f"Expected WIP: {expected_wip}, got {wip}"
+
+
+def test_equal_reference_hypothesis():
+    ref = ["X"]
+    hyp = ["X"]
+    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
+
+
</code_context>
<issue_to_address>
Missing test for Levenshtein metrics with empty reference and/or hypothesis.
Add tests covering scenarios with empty reference and/or hypothesis to verify correct handling of these edge cases.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_equal_reference_hypothesis():
    ref = ["X"]
    hyp = ["X"]
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
=======
def test_equal_reference_hypothesis():
    ref = ["X"]
    hyp = ["X"]
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
def test_empty_reference_and_hypothesis():
    ref = []
    hyp = []
    # All metrics should be 0 except WIP, which should be 1 (perfect preservation)
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
def test_empty_reference_nonempty_hypothesis():
    ref = []
    hyp = ["X", "Y"]
    # WER and MER are undefined for empty reference, but typically set to 0.0 or handled gracefully.
    # WIL should be 0.0, WIP should be 1.0 (no information lost/preserved since nothing to compare)
    run_test_all(ref, hyp, expected_wer=0.0, expected_mer=0.0, expected_wil=0.0, expected_wip=1.0)
def test_nonempty_reference_empty_hypothesis():
    ref = ["X", "Y"]
    hyp = []
    # WER and MER should be 1.0 (all words missed), WIL should be 1.0 (all information lost), WIP should be 0.0
    run_test_all(ref, hyp, expected_wer=1.0, expected_mer=1.0, expected_wil=1.0, expected_wip=0.0)
>>>>>>> REPLACE
</suggested_fix>
### Comment 10
<location> `tests/metrics/language/test_rogue.py:35` </location>
<code_context>
+    score = rouge.calculate("w1 w2 w3 w4 w5", "w1 w2 w6 w7 w8\nw1 w3 w8 w9 w5")
+    assert pytest.approx(score, abs=0.05) == 0.5
+
+def test_rougel_sum_non_word():
+    rouge = ROUGE(rouge_type="rougeLsum")
+    score = rouge.calculate("w1 w2 w3 w4 w5", "/")
+    assert score == 0
\ No newline at end of file
</code_context>
<issue_to_address>
Missing test for ROUGE metrics with empty reference.
Please add tests for all ROUGE types with an empty reference and a non-empty hypothesis to verify correct handling of this edge case.
</issue_to_address>
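One way to cover this, staying with the parametrized style suggested elsewhere in the review; the expected score of 0 for an empty reference is an assumption to be confirmed against the intended behaviour:

```python
@pytest.mark.parametrize("rouge_type", ["rouge1", "rouge2", "rougeL", "rougeLsum"])
def test_empty_reference(rouge_type):
    rouge = ROUGE(rouge_type=rouge_type)
    score = rouge.calculate("", "testing one two")
    assert score == 0
```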
    for i, (ref, hyp) in enumerate(zip(references, inputs)):
        assert fm.calculate_bleu([ref], hyp, threshold=0.8) == expected[i]

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)
Explanation: Avoid complex code, like loops, in test functions. Google's software engineering guidelines say: "Clear tests are trivially correct upon inspection." To reach that, avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
(Software Engineering at Google / Don't Put Logic in Tests)
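Applied to the flagged snippet, the loop could become a parametrized test along these lines; `expected_bleu_matches` is a hypothetical module-level list of the expected booleans:

```python
@pytest.mark.parametrize(
    "ref, hyp, expected",
    zip(references, inputs, expected_bleu_matches),
)
def test_bleu_match(ref, hyp, expected):
    fm = FuzzyMatch()
    assert fm.calculate_bleu([ref], hyp, threshold=0.8) == expected
```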
    for rouge_type in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
        rouge = ROUGE(rouge_type=rouge_type)
        score = rouge.calculate("testing one two", "")
        assert score == 0

issue (code-quality): Avoid loops in tests. (no-loop-in-tests); same rationale as the explanation above, and a parametrized test over the ROUGE types removes the loop.
        Create uniform weights for BLEU-N scoring.
        """
        if not isinstance(max_ngram, int):
            max_ngram = int(max_ngram)

suggestion (code-quality): Remove unnecessary casts to int, str, float or bool (remove-unnecessary-cast)
Suggested change: replace `max_ngram = int(max_ngram)` with `max_ngram = max_ngram`, i.e. drop the cast.
    # Approximate WIP: (H / N_ref) * (H / N_hyp)
    wip = 0.0
    if N_ref > 0 and N_hyp > 0:
        wip = (H / N_ref) * (H / N_hyp)

suggestion (code-quality): We've found these issues:
- Move setting of default value for variable into `else` branch (introduce-default-else)
- Replace if statement with if expression (assign-if-exp)
Suggested change:
    wip = (H / N_ref) * (H / N_hyp) if N_ref > 0 and N_hyp > 0 else 0.0
        else:
            reference = clean_text(reference)
            hypothesis = clean_text(hypothesis)
            score = self.scorer.score(reference, hypothesis)
            return score[self.rouge_type].fmeasure

suggestion (code-quality): Remove unnecessary else after guard condition (remove-unnecessary-else)
Suggested change:
        reference = clean_text(reference)
        hypothesis = clean_text(hypothesis)
        score = self.scorer.score(reference, hypothesis)
        return score[self.rouge_type].fmeasure
Summary
Rewrote the core language metrics from the Java implementation trustyai-explainability/explainability-core into Python, along with corresponding unit tests.
Implemented language metrics: BLEU, Levenshtein, FuzzyMatch, ROUGE, and word-based error rates (WER, MER, WIL, WIP).
Summary by Sourcery
Port core language evaluation metrics from the Java implementation to Python and add comprehensive unit tests for each metric
New Features:
- Python implementations of the BLEU, Levenshtein, FuzzyMatch, ROUGE, and word-based error rate metrics, ported from the Java implementation.
Build:
- Add the NLP library dependencies required by the new metrics.
Tests:
- Add comprehensive unit tests for each ported metric.