If you want a more in-depth explanation of the core components, check out our quick-start notebook.
## :luggage: Metrics
### :3rd_place_medal: Character based
- **Levenshtein distance**: the number of single-character edits (insertions, deletions or substitutions) required to change your generated text into the ground-truth text.
- **Levenshtein ratio**: obtained by dividing the Levenshtein distance by the sum of the number of characters in the generated text and the ground truth. This type of metric is suitable when you work with short and precise texts (see the sketch below).
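As a quick illustration, both character-based metrics can be computed with a few lines of plain Python. This is a minimal sketch following the definitions above, not code from ragas:

```python
# Minimal sketch of the character-based metrics described above (illustrative only).
def levenshtein_distance(generated: str, ground_truth: str) -> int:
    """Number of single-character insertions, deletions or substitutions between the two texts."""
    prev = list(range(len(ground_truth) + 1))
    for i, ch_a in enumerate(generated, start=1):
        curr = [i]
        for j, ch_b in enumerate(ground_truth, start=1):
            curr.append(min(prev[j] + 1,                      # deletion
                            curr[j - 1] + 1,                  # insertion
                            prev[j - 1] + (ch_a != ch_b)))    # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(generated: str, ground_truth: str) -> float:
    """Distance normalised by the combined length of both texts, as described above."""
    total_chars = len(generated) + len(ground_truth)
    return levenshtein_distance(generated, ground_truth) / total_chars if total_chars else 0.0

print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_ratio("kitten", "sitting"))     # 3 / 13 ≈ 0.23
```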
### :2nd_place_medal: N-Gram based
N-gram based metrics, as the name indicates, use n-grams to compare the generated answer with the ground truth. They are suitable for extractive and abstractive tasks, but have limitations with long free-form answers because the comparison is word based.
- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation):
  - **ROUGE-N** measures the number of matching n-grams between the generated text and the ground truth. These matches do not consider the ordering of words.
  - **ROUGE-L** measures the longest common subsequence (LCS) between the generated text and the ground truth, i.e. the longest sequence of tokens shared by both (see the sketch after this list).
- **BLEU** (BiLingual Evaluation Understudy):
  It measures precision by comparing clipped n-grams in the generated text to the ground-truth text. These matches do not consider the ordering of words.
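To make the n-gram comparison concrete, here is a minimal sketch (not the ragas implementation) of ROUGE-N recall and of the longest common subsequence behind ROUGE-L:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(generated: str, reference: str, n: int = 1) -> float:
    """Fraction of the reference's n-grams that also occur in the generated text (order ignored)."""
    gen_counts = Counter(ngrams(generated.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, gen_counts[gram]) for gram, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists (the core of ROUGE-L)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

generated = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n_recall(generated, reference, n=2))          # bigram overlap: 0.6
print(lcs_length(generated.split(), reference.split()))   # longest shared token sequence: 5
```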
### :1st_place_medal: Model Based
Model-based methods use language models combined with NLP techniques to compare the generated text with the ground truth. They are well suited to free-form answers, whether long or short.
- **BertScore**:
  BertScore measures the similarity between the ground-truth answer and the generated text using SBERT vector embeddings. The common choice of similarity measure is cosine similarity, whose values range from 0 to 1. It shows good correlation with human judgement (see the sketch after this list).
- **EntailmentScore**:
  Uses textual entailment to measure the factual consistency of the generated text given the ground truth. Scores range from 0 to 1, with 1 indicating perfect factual entailment for all samples. The entailment score is highly correlated with human judgement.
- **$Q^2$**:
  Best used to measure factual consistency between the ground truth and the generated text. Scores range from 0 to 1; a higher score indicates better factual consistency. It employs a QA-QG paradigm followed by NLI to compare the ground truth and the generated answer. The $Q^2$ score is highly correlated with human judgement. :warning: This is a time- and resource-hungry metric.
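For a concrete feel of the model-based family, here is a minimal embedding-similarity sketch in the spirit of BertScore above. It uses the `sentence-transformers` package; the model name is only an example and is not what ragas uses internally:

```python
# Illustrative embedding-similarity score (not the ragas implementation).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

def embedding_similarity(generated: str, ground_truth: str) -> float:
    """Cosine similarity between sentence embeddings; close to 1 for near-paraphrases."""
    embeddings = model.encode([generated, ground_truth], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(embedding_similarity("Paris is the capital of France.",
                           "The capital of France is Paris."))
```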
Ragas measures your pipeline's performance along two dimensions:

1. **Factuality**: measures the factual consistency of the generated answer against the given context.
2. **Relevancy**: measures how relevant the retrieved contexts and the generated answer are to the question.

Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.

In practice, this breaks down into three scores:

1. `factuality`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm that first creates statements from the generated answer and then verifies each of these statements against the context. The score is scaled to the (0, 1) range; higher is better.
2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, implemented using a custom model. Values range over (0, 1); higher is better.
3. `context_relevancy`: measures how relevant the retrieved context is to the prompt. This is quantified using a custom-trained cross-encoder model. Values range over (0, 1); higher is better.

To read more about our metrics, check out the [docs](/docs/metrics.md).
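Because `ragas_score` is the harmonic mean of the individual dimensions, a weak score in either one pulls the overall score down sharply. Here is a minimal, illustrative sketch of that combination (plain Python, not the ragas implementation; the example score values are made up):

```python
# Illustrative only: combine per-dimension scores with a harmonic mean, as ragas_score does.
def harmonic_mean(scores):
    """Harmonic mean of a list of scores in (0, 1]; returns 0.0 if any score is 0."""
    if any(s == 0 for s in scores):
        return 0.0  # a single zero drags the harmonic mean to zero
    return len(scores) / sum(1.0 / s for s in scores)

print(harmonic_mean([0.9, 0.6]))  # 0.72: both dimensions are reasonably good
print(harmonic_mean([1.0, 0.0]))  # 0.0: a factual but irrelevant answer still scores zero overall
```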
## :question: How to use Ragas to improve your pipeline?

*"Measurement is the first step that leads to control and eventually to improvement." - James Harrington*

Here we assume that you already have your RAG pipeline ready. A RAG pipeline has two main parts - the retriever and the generator - and a change in either of them will also affect your pipeline's quality.

1. First, decide on one parameter that you're interested in adjusting, for example the number of retrieved documents, K.
2. Collect a set of sample prompts (at least 20) to form your test set.
3. Run your pipeline on the test set before and after the change. Each time, record the prompts together with the retrieved context and the generated output.
4. Run the ragas evaluation on each of them to generate evaluation scores.
5. Compare the scores to see how much the change has affected your pipeline's performance (see the sketch after this list).
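In code, the before/after experiment might look roughly like the sketch below. `run_pipeline` and `score_with_ragas` are hypothetical stand-ins for your own retriever + generator and for the ragas evaluation; they are not part of the ragas API:

```python
from statistics import mean

# Stand-in stubs so the sketch runs; replace them with your retriever + generator
# and with the ragas evaluation of your choice.
def run_pipeline(question: str, k: int):
    contexts = [f"context {i} for: {question}" for i in range(k)]
    answer = f"generated answer for: {question}"
    return contexts, answer

def score_with_ragas(question: str, contexts, answer: str) -> float:
    return 0.5  # placeholder score

test_questions = [
    "What is the capital of France?",
    "Who proposed the transformer architecture?",
    # ... use at least 20 prompts in practice
]

def evaluate_setting(k: int) -> float:
    """Average score over the test set for one pipeline configuration."""
    return mean(
        score_with_ragas(q, *run_pipeline(q, k=k)) for q in test_questions
    )

print("k=3:", evaluate_setting(k=3))
print("k=5:", evaluate_setting(k=5))
```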
## :raising_hand_man: FAQ

1. Why harmonic mean?

The harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (factuality = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, while the harmonic mean gives you 0.0.
📜 Check out [citations](./references.md) for related publications.