
Commit 727821f

Update Metrics info and minor bug fixes (#24)
* update readme
* update metrics info
* fix type checks
* add citations
* update readme
1 parent df83fb1 commit 727821f

File tree: 4 files changed (+104, −7 lines)


README.md

Lines changed: 67 additions & 1 deletion
@@ -32,5 +32,71 @@

(The `## What` heading is removed; the following content is added.)

## Quickstart

This is a small example program you can run to see ragas in action!

```python
from datasets import load_dataset
from ragas.metrics import (
    Evaluation,
    rouge1,
    bert_score,
    entailment_score,
)  # import the metrics you want to use

# load the dataset
ds = load_dataset("explodinggradients/eli5-test", split="test_eli5").select(range(100))

# init the evaluator, this takes in the metrics you want to use
# and performs the evaluation
e = Evaluation(
    metrics=[rouge1, bert_score, entailment_score],
    batched=False,
    batch_size=30,
)

# run the evaluation
results = e.eval(ds["ground_truth"], ds["generated_text"])
print(results)
```

If you want a more in-depth explanation of core components, check out our quick-start notebook.

## Metrics

### Character based

- **Levenshtein distance** is the number of single-character edits (insertions, deletions, substitutions) required to change the generated text into the ground-truth text.
- **Levenshtein ratio** is obtained by dividing the Levenshtein distance by the sum of the number of characters in the generated text and the ground truth. This type of metric is suitable when working with short and precise texts.
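To make the character-level metrics concrete, here is a minimal, illustrative sketch of how the distance and ratio described above could be computed (standard dynamic-programming edit distance; the helper below is not part of ragas):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

generated, truth = "the cat sat", "the cat sits"
dist = levenshtein_distance(generated, truth)
# ratio as described above: distance divided by the total number of characters
ratio = dist / (len(generated) + len(truth))
print(dist, round(ratio, 3))
```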
### N-Gram based

N-gram based metrics, as the name indicates, use n-grams to compare the generated answer with the ground truth. They are suitable for extractive and abstractive tasks but have limitations on long free-form answers because the comparison is word based. Rough sketches of the ROUGE and BLEU ideas follow below.

- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation):
  - **ROUGE-N** measures the number of matching n-grams between the generated text and the ground truth. These matches do not consider the ordering of words.
  - **ROUGE-L** measures the longest common subsequence (LCS) between the generated text and the ground truth, i.e. the longest sequence of tokens shared by both.
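As a rough illustration of the idea (not the ragas implementation), ROUGE-N can be approximated by counting overlapping n-grams and ROUGE-L by computing a longest common subsequence:

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(generated: str, reference: str, n: int = 1) -> float:
    gen, ref = ngrams(generated.split(), n), ngrams(reference.split(), n)
    overlap = sum((gen & ref).values())          # matching n-grams, order ignored
    return overlap / max(sum(ref.values()), 1)   # recall against the reference

def lcs_length(a: list[str], b: list[str]) -> int:
    # longest common subsequence, the basis of ROUGE-L
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

gen, ref = "the cat sat on the mat", "the cat is on the mat"
print(rouge_n_recall(gen, ref, n=1), lcs_length(gen.split(), ref.split()))
```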
- **BLEU** (BiLingual Evaluation Understudy)

  It measures precision by comparing clipped n-grams in the generated text against the ground-truth text. These matches do not consider the ordering of words.
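A minimal sketch of the clipped n-gram precision that BLEU is built on (the full metric combines several n-gram orders with a brevity penalty; this shows only the core counting step):

```python
from collections import Counter

def clipped_precision(generated: str, reference: str, n: int = 1) -> float:
    gen_tokens, ref_tokens = generated.split(), reference.split()
    gen = Counter(tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    # each generated n-gram is "clipped" at the number of times it appears in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in gen.items())
    return clipped / max(sum(gen.values()), 1)

print(clipped_precision("the the the cat", "the cat sat"))  # 2/4 = 0.5
```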
### Model Based

Model-based methods use language models combined with NLP techniques to compare the generated text with the ground truth. They are well suited for free-form long or short answers.

- **BertScore**

  BertScore measures the similarity between the ground-truth answers and the generated text using SBERT vector embeddings. The common choice of similarity measure is cosine similarity, whose values range between 0 and 1. It shows good correlation with human judgement. A small illustrative sketch follows below.
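A rough sketch of the underlying idea using `sentence-transformers` (the model name below is a common default chosen for illustration, not necessarily what ragas uses):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# assumption: any SBERT-style model works for this illustration
model = SentenceTransformer("all-MiniLM-L6-v2")

gen_emb = model.encode("Paris is the capital of France.", convert_to_numpy=True)
ref_emb = model.encode("The capital of France is Paris.", convert_to_numpy=True)

# cosine similarity between the two sentence embeddings
cosine = float(np.dot(gen_emb, ref_emb) / (np.linalg.norm(gen_emb) * np.linalg.norm(ref_emb)))
print(round(cosine, 3))
```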
- **EntailmentScore**

  Uses textual entailment to measure factual consistency of the generated text given the ground truth. The score ranges from 0 to 1, with 1 indicating perfect factual entailment for all samples. Entailment score is highly correlated with human judgement.
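For intuition, here is a minimal NLI-based sketch with an off-the-shelf MNLI model (`roberta-large-mnli` is just one common choice; the entailment label index is an assumption, so check `model.config.id2label` before relying on it):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"  # assumption: any MNLI-style NLI model can stand in here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

ground_truth = "The Eiffel Tower is located in Paris."
generated = "The Eiffel Tower stands in Paris, France."

# premise = ground truth, hypothesis = generated answer
inputs = tokenizer(ground_truth, generated, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# probability of the entailment class; for roberta-large-mnli this is usually index 2,
# but verify with model.config.id2label for other checkpoints
entailment_prob = probs[model.config.label2id.get("ENTAILMENT", 2)].item()
print(round(entailment_prob, 3))
```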
- **$Q^2$**

  Best used to measure factual consistency between the ground truth and the generated text. Scores range from 0 to 1, with higher scores indicating better factual consistency between the ground truth and the generated answer. It employs the QA-QG (question answering / question generation) paradigm followed by NLI to compare the ground truth and the generated answer. The $Q^2$ score is highly correlated with human judgement.

Check out [citations](./citations.md) for related publications.

citations.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
```
@misc{
  title={ROUGE: A Package for Automatic Evaluation of Summaries},
  author={Chin-Yew Lin},
  year={2004}
}

@misc{
  title={Bleu: a Method for Automatic Evaluation of Machine Translation},
  author={Papineni et al.},
  year={2002},
  doi={10.3115/1073083.1073135}
}

@misc{
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Zhang et al.},
  year={2019},
  doi={10.48550/arXiv.1904.09675}
}

@misc{
  title={On Faithfulness and Factuality in Abstractive Summarization},
  author={Maynez et al.},
  year={2020},
  doi={10.48550/arXiv.2005.00661}
}

@misc{
  title={Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering},
  author={Honovich et al.},
  year={2021},
  doi={10.48550/arXiv.2104.08202}
}
```

ragas/metrics/base.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -77,7 +77,7 @@ def _get_score(self, row: dict[str, list[t.Any]] | dict[str, t.Any]):
         scores = metric.score(ground_truths, generated_texts)
         score = np.max(scores)
 
-        row[f"{metric.name}_score"] = score
+        row[f"{metric.name}"] = score
 
         return row
```

ragas/metrics/similarity.py

Lines changed: 4 additions & 5 deletions
```diff
@@ -9,9 +9,6 @@
 
 from ragas.metrics.base import Metric
 
-if t.TYPE_CHECKING:
-    from torch import Tensor
-
 BERT_METRIC = t.Literal["cosine", "euclidean"]
 
 
@@ -45,9 +42,11 @@ def score(
         gentext_emb = self.model.encode(
             generated_text, batch_size=self.batch_size, convert_to_numpy=True
         )
-        assert isinstance(gentext_emb, Tensor) and isinstance(gndtruth_emb, Tensor), (
+        assert isinstance(gentext_emb, np.ndarray) and isinstance(
+            gndtruth_emb, np.ndarray
+        ), (
             f"Both gndtruth_emb[{type(gentext_emb)}], gentext_emb[{type(gentext_emb)}]"
-            " should be Tensor."
+            " should be numpy ndarray."
         )
 
         if self.similarity_metric == "cosine":
```
