
Commit 36a8c4c

Merge pull request #294 from saitiger/METEOR
Implementation of METEOR
2 parents bfc3014 + e0ea9ac commit 36a8c4c

File tree: 2 files changed (+187, -0 lines)

Problems/110_METEOR/Learn.md

Lines changed: 87 additions & 0 deletions
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric generally used for machine translation and for evaluating the text output of generative AI models. METEOR was introduced to address the limitations of earlier metrics like BLEU.

## Key Characteristics
- Considers semantic similarity beyond exact word matching
- Accounts for word order and translation variations
- Provides more human-aligned translation assessment

# Implementation
1. **Tokenization**: lowercase the sentences and split on whitespace

2. **Frequency of matching words**: matching is exact (no stemming or synonym matching in this implementation)

3. **Calculate Precision, Recall and F-mean**
```
F_mean = (Precision * Recall) / (α * Precision + (1 - α) * Recall)
```
- α is typically set to 0.9
- Balances precision and recall

4. **Fragmentation Penalty**
```
Chunks = Count of contiguous matched word sequences
Penalty = γ * (Chunks / Matches)^β
```
- β controls the penalty weight (typically 3)
- γ limits the maximum penalty (typically 0.5)

5. **Final METEOR Score**
```
METEOR = F_mean * (1 - Penalty)
```
- Ranges from 0 (no match) to 1 (perfect match)

(The three formulas above are combined in the short code sketch after the note below.)

**Note**: The [paper](https://aclanthology.org/W05-0909/) that introduced the metric doesn't expose the parameters (α, β, and γ) as tunable, but implementations in other libraries like NLTK offer this flexibility.
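
The three formulas from steps 3–5 fit in a few lines of Python. This is a minimal sketch only; the helper name `combine_meteor` is illustrative and not part of the repository's solution, and it assumes precision, recall, the chunk count, and the match count have already been computed:

```
def combine_meteor(precision, recall, chunks, matches, alpha=0.9, beta=3, gamma=0.5):
    # No overlapping unigrams at all -> the score is defined as 0
    if matches == 0:
        return 0.0
    # Parameterized harmonic mean of precision and recall (alpha = 0.9 weights recall more heavily)
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks per match -> larger penalty, capped by gamma
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

# Worked example from the section below: P = R = 7/9, 2 chunks over 7 matches
print(round(combine_meteor(7/9, 7/9, 2, 7), 3))  # 0.769
```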

# Example

- Reference: "The quick brown fox jumps over the lazy dog"
- Candidate: "A quick brown fox jumps over a lazy dog"

### 1. Tokenization
- Reference Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
- Candidate Tokens: ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog']

### 2. Unigram Matching
- Matching tokens: ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
- Matches: 7

### 3. Unigram Precision and Recall Calculation
- Precision = Matches / Candidate Length = 7 / 9 ≈ 0.778
- Recall = Matches / Reference Length = 7 / 9 ≈ 0.778
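
To make steps 1–3 concrete, here is a small sketch of the clipped unigram matching; it mirrors the Counter-based approach used in solution.py further down, and nothing beyond the two example sentences is taken from the original files:

```
from collections import Counter

reference = "The quick brown fox jumps over the lazy dog"
candidate = "A quick brown fox jumps over a lazy dog"

ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()

# Multiset intersection clips each word to its smaller count in the two sentences
matches = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
precision = matches / len(cand_tokens)
recall = matches / len(ref_tokens)

print(matches, round(precision, 3), round(recall, 3))  # 7 0.778 0.778
```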

### 4. F-mean Calculation (α = 0.9)
```
F_mean = (Precision * Recall) / (α * Precision + (1 - α) * Recall)
       = (0.778 * 0.778) / (0.9 * 0.778 + (1 - 0.9) * 0.778)
       = 0.605 / (0.700 + 0.078)
       = 0.605 / 0.778
       ≈ 0.778
```
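
The same value can be checked with a one-line calculation (illustrative only):

```
print(round((7/9 * 7/9) / (0.9 * 7/9 + 0.1 * 7/9), 3))  # 0.778
```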

### 5. Chunk Calculation
- Contiguous matched sequences (adjacent in both candidate and reference):
  1. ['quick', 'brown', 'fox', 'jumps', 'over']
  2. ['lazy', 'dog']
- Number of Chunks: 2 (the unmatched 'a' in the candidate breaks the sequence before 'lazy')
- Total Number of Unigram Matches: 7

### 6. Penalty Calculation (β = 3, γ = 0.5)
```
Penalty = γ * (Number of Chunks / Total Number of Unigram Matches)^β
        = 0.5 * (2 / 7)^3
        = 0.5 * (0.286)^3
        ≈ 0.012
```

### 7. Final METEOR Score
```
METEOR = F_mean * (1 - Penalty)
       = 0.778 * (1 - 0.012)
       = 0.778 * 0.988
       ≈ 0.769
```
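
For completeness, a sketch of how the chunk count in step 5 can be derived programmatically: map each matched candidate word to its first unused position in the reference and count breaks in that position sequence. This mirrors the logic in solution.py below; the variable names here are illustrative.

```
# Matched reference positions, in candidate order, for the example above:
# quick->1, brown->2, fox->3, jumps->4, over->5, lazy->7, dog->8
matched_positions = [1, 2, 3, 4, 5, 7, 8]

# A new chunk starts whenever the reference positions stop being consecutive
chunks = 1 + sum(
    1 for prev, cur in zip(matched_positions, matched_positions[1:]) if cur != prev + 1
)
print(chunks)  # 2 -> Penalty = 0.5 * (2 / 7)**3 ≈ 0.012
```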

Problems/110_METEOR/solution.py

Lines changed: 100 additions & 0 deletions
from collections import Counter

def meteor_score(reference, candidate, alpha=0.9, beta=3, gamma=0.5):
    if not reference or not candidate:
        raise ValueError("Reference and candidate cannot be empty")

    # Tokenize (lowercase, whitespace split)
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    # Unigram counts for reference and candidate
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)

    # Number of matching words in candidate and reference (clipped multiset intersection)
    num_matches = sum((ref_counts & cand_counts).values())
    ref_len = len(ref_tokens)
    cand_len = len(cand_tokens)

    # Unigram precision and recall
    precision = num_matches / cand_len if cand_len > 0 else 0  # avoid division by zero
    recall = num_matches / ref_len if ref_len > 0 else 0  # avoid division by zero

    if num_matches == 0:
        return 0.0

    fmean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)

    # Chunk calculation
    matched_positions = []
    ref_positions = {}  # store positions of words in the reference
    used_positions = set()  # track already used reference indices

    # Populate reference positions for word alignment tracking
    for i, word in enumerate(ref_tokens):
        ref_positions.setdefault(word, []).append(i)

    # Determine the sequence of matched reference positions, in candidate order
    for word in cand_tokens:
        if word in ref_positions:
            for pos in ref_positions[word]:
                if pos not in used_positions:
                    matched_positions.append(pos)
                    used_positions.add(pos)
                    break  # ensure each reference position is used only once

    # Count chunks by detecting breaks in the matched position sequence
    num_chunks = 1 if matched_positions else 0
    for i in range(1, len(matched_positions)):
        if matched_positions[i] != matched_positions[i - 1] + 1:
            num_chunks += 1  # break in sequence → new chunk

    # Fragmentation penalty
    penalty = gamma * ((num_chunks / num_matches) ** beta) if num_matches > 0 else 0

    # Final score, rounded to 3 decimal places
    return round(fmean * (1 - penalty), 3)

def test_meteor_score():
    # Test Case 1: Identical translations
    # Note: even an exact match keeps a small fragmentation penalty
    # (1 chunk over 6 matches), so the score lands just below 1.0.
    ref_test1 = "The cat sits on the mat"
    cand_test1 = "The cat sits on the mat"
    expected1 = 0.998
    assert meteor_score(ref_test1, cand_test1) == expected1, "Test Case 1 Failed"

    # Test Case 2: Similar translations (the worked example from Learn.md)
    ref_test2 = "The quick brown fox jumps over the lazy dog"
    cand_test2 = "A quick brown fox jumps over a lazy dog"
    expected2 = 0.769
    assert meteor_score(ref_test2, cand_test2) == expected2, "Test Case 2 Failed"

    # Test Case 3: Largely different translations (only "the" overlaps, so the
    # score is small but nonzero)
    ref_test3 = "The cat sits on the mat"
    cand_test3 = "Dogs run in the park"
    expected3 = 0.085
    assert meteor_score(ref_test3, cand_test3) == expected3, "Test Case 3 Failed"

    # Test Case 4: Partially matching translations
    ref_test4 = "Machine learning is an exciting field"
    cand_test4 = "Machine learning algorithms are fascinating"
    expected4 = 0.318
    assert meteor_score(ref_test4, cand_test4) == expected4, "Test Case 4 Failed"

    # Test Case 5: Empty input handling
    try:
        meteor_score("", "Some text")
        assert False, "Test Case 5 Failed"
    except ValueError:
        pass

    # Test Case 6: Same words in a different order (fragmentation penalty applies)
    ref_test6 = "The cat sits on the mat"
    cand_test6 = "The cat on the mat sits"
    expected6 = 0.938
    assert meteor_score(ref_test6, cand_test6) == expected6, "Test Case 6 Failed"

if __name__ == "__main__":
    test_meteor_score()
    print("All Test Cases Passed!")
