
Commit 727821f

Update Metrics info and minor bug fixes (#24)
* update readme
* update metrics info
* fix type checks
* add citations
* update readme
1 parent df83fb1 commit 727821f

File tree: 4 files changed (+104, −7 lines)


README.md

Lines changed: 67 additions & 1 deletion
@@ -32,5 +32,71 @@

(The `## What` heading is removed; the following content is added.)

## Quickstart

This is a small example program you can run to see ragas in action!

```python
from datasets import load_dataset
from ragas.metrics import (
    Evaluation,
    rouge1,
    bert_score,
    entailment_score,
)  # import the metrics you want to use

# load the dataset
ds = load_dataset("explodinggradients/eli5-test", split="test_eli5").select(range(100))

# init the evaluator, this takes in the metrics you want to use
# and performs the evaluation
e = Evaluation(
    metrics=[rouge1, bert_score, entailment_score],
    batched=False,
    batch_size=30,
)

# run the evaluation
results = e.eval(ds["ground_truth"], ds["generated_text"])
print(results)
```

If you want a more in-depth explanation of core components, check out our quick-start notebook.

## Metrics

### Character based

- **Levenshtein distance** is the number of single-character edits (insertions, deletions, substitutions) required to change the generated text into the ground-truth text.
- **Levenshtein ratio** is obtained by dividing the Levenshtein distance by the sum of the number of characters in the generated text and the ground truth. This type of metric is suitable when working with short and precise texts.
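To make the character-level metrics concrete, here is a minimal, illustrative sketch of how the distance and ratio described above could be computed (standard dynamic-programming edit distance; the helper below is not part of ragas):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

generated, truth = "the cat sat", "the cat sits"
dist = levenshtein_distance(generated, truth)
# ratio as described above: distance divided by the total number of characters
ratio = dist / (len(generated) + len(truth))
print(dist, round(ratio, 3))
```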
### N-Gram based

N-gram based metrics, as the name indicates, use n-grams to compare the generated answer with the ground truth. They are suitable for extractive and abstractive tasks but have limitations on long free-form answers because the comparison is word based. Rough sketches of the ROUGE and BLEU ideas follow below.

- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation):
  - **ROUGE-N** measures the number of matching n-grams between the generated text and the ground truth. These matches do not consider the ordering of words.
  - **ROUGE-L** measures the longest common subsequence (LCS) between the generated text and the ground truth, i.e. the longest sequence of tokens shared by both.
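As a rough illustration of the idea (not the ragas implementation), ROUGE-N can be approximated by counting overlapping n-grams and ROUGE-L by computing a longest common subsequence:

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(generated: str, reference: str, n: int = 1) -> float:
    gen, ref = ngrams(generated.split(), n), ngrams(reference.split(), n)
    overlap = sum((gen & ref).values())          # matching n-grams, order ignored
    return overlap / max(sum(ref.values()), 1)   # recall against the reference

def lcs_length(a: list[str], b: list[str]) -> int:
    # longest common subsequence, the basis of ROUGE-L
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

gen, ref = "the cat sat on the mat", "the cat is on the mat"
print(rouge_n_recall(gen, ref, n=1), lcs_length(gen.split(), ref.split()))
```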
- **BLEU** (BiLingual Evaluation Understudy)

  It measures precision by comparing clipped n-grams in the generated text against the ground-truth text. These matches do not consider the ordering of words.
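A minimal sketch of the clipped n-gram precision that BLEU is built on (the full metric combines several n-gram orders with a brevity penalty; this shows only the core counting step):

```python
from collections import Counter

def clipped_precision(generated: str, reference: str, n: int = 1) -> float:
    gen_tokens, ref_tokens = generated.split(), reference.split()
    gen = Counter(tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    # each generated n-gram is "clipped" at the number of times it appears in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in gen.items())
    return clipped / max(sum(gen.values()), 1)

print(clipped_precision("the the the cat", "the cat sat"))  # 2/4 = 0.5
```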
### Model Based

Model-based methods use language models combined with NLP techniques to compare the generated text with the ground truth. They are well suited for free-form long or short answers.

- **BertScore**

  BertScore measures the similarity between the ground-truth answers and the generated text using SBERT vector embeddings. The common choice of similarity measure is cosine similarity, whose values range between 0 and 1. It shows good correlation with human judgement. A small illustrative sketch follows below.
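A rough sketch of the underlying idea using `sentence-transformers` (the model name below is a common default chosen for illustration, not necessarily what ragas uses):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# assumption: any SBERT-style model works for this illustration
model = SentenceTransformer("all-MiniLM-L6-v2")

gen_emb = model.encode("Paris is the capital of France.", convert_to_numpy=True)
ref_emb = model.encode("The capital of France is Paris.", convert_to_numpy=True)

# cosine similarity between the two sentence embeddings
cosine = float(np.dot(gen_emb, ref_emb) / (np.linalg.norm(gen_emb) * np.linalg.norm(ref_emb)))
print(round(cosine, 3))
```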
- **EntailmentScore**

  Uses textual entailment to measure factual consistency of the generated text given the ground truth. The score ranges from 0 to 1, with 1 indicating perfect factual entailment for all samples. Entailment score is highly correlated with human judgement.
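For intuition, here is a minimal NLI-based sketch with an off-the-shelf MNLI model (`roberta-large-mnli` is just one common choice; the entailment label index is an assumption, so check `model.config.id2label` before relying on it):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"  # assumption: any MNLI-style NLI model can stand in here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

ground_truth = "The Eiffel Tower is located in Paris."
generated = "The Eiffel Tower stands in Paris, France."

# premise = ground truth, hypothesis = generated answer
inputs = tokenizer(ground_truth, generated, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# probability of the entailment class; for roberta-large-mnli this is usually index 2,
# but verify with model.config.id2label for other checkpoints
entailment_prob = probs[model.config.label2id.get("ENTAILMENT", 2)].item()
print(round(entailment_prob, 3))
```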
- **$Q^2$**

  Best used to measure factual consistency between the ground truth and the generated text. Scores range from 0 to 1, with higher scores indicating better factual consistency between the ground truth and the generated answer. It employs the QA-QG (question answering / question generation) paradigm followed by NLI to compare the ground truth and the generated answer. The $Q^2$ score is highly correlated with human judgement.

Check out [citations](./citations.md) for related publications.

citations.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
```
@misc{
  title={ROUGE: A Package for Automatic Evaluation of Summaries},
  author={Chin-Yew Lin},
  year={2004}
}

@misc{
  title={Bleu: a Method for Automatic Evaluation of Machine Translation},
  author={Papineni et al.},
  year={2002},
  doi={10.3115/1073083.1073135}
}

@misc{
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Zhang et al.},
  year={2019},
  doi={10.48550/arXiv.1904.09675}
}

@misc{
  title={On Faithfulness and Factuality in Abstractive Summarization},
  author={Maynez et al.},
  year={2020},
  doi={10.48550/arXiv.2005.00661}
}

@misc{
  title={Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering},
  author={Honovich et al.},
  year={2021},
  doi={10.48550/arXiv.2104.08202}
}
```

ragas/metrics/base.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -77,7 +77,7 @@ def _get_score(self, row: dict[str, list[t.Any]] | dict[str, t.Any]):
         scores = metric.score(ground_truths, generated_texts)
         score = np.max(scores)
 
-        row[f"{metric.name}_score"] = score
+        row[f"{metric.name}"] = score
 
         return row
```

ragas/metrics/similarity.py

Lines changed: 4 additions & 5 deletions
```diff
@@ -9,9 +9,6 @@
 
 from ragas.metrics.base import Metric
 
-if t.TYPE_CHECKING:
-    from torch import Tensor
-
 BERT_METRIC = t.Literal["cosine", "euclidean"]
 
 
@@ -45,9 +42,11 @@ def score(
         gentext_emb = self.model.encode(
             generated_text, batch_size=self.batch_size, convert_to_numpy=True
         )
-        assert isinstance(gentext_emb, Tensor) and isinstance(gndtruth_emb, Tensor), (
+        assert isinstance(gentext_emb, np.ndarray) and isinstance(
+            gndtruth_emb, np.ndarray
+        ), (
             f"Both gndtruth_emb[{type(gentext_emb)}], gentext_emb[{type(gentext_emb)}]"
-            " should be Tensor."
+            " should be numpy ndarray."
         )
 
         if self.similarity_metric == "cosine":
```
