If you want a more in-depth explanation of the core components, check out our quick-start notebook.
## :luggage: Metrics
### :3rd_place_medal: Character based
- **Levenshtein distance**: the number of single-character edits (insertions, deletions or substitutions) required to change your generated text into the ground-truth text.
- **Levenshtein ratio**: obtained by dividing the Levenshtein distance by the sum of the number of characters in the generated text and the ground truth. This type of metric is suitable when you work with short and precise texts (see the sketch below).
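As a quick illustration, both character-based metrics can be computed with a few lines of plain Python. This is a minimal sketch following the definitions above, not code from ragas:

```python
# Minimal sketch of the character-based metrics described above (illustrative only).
def levenshtein_distance(generated: str, ground_truth: str) -> int:
    """Number of single-character insertions, deletions or substitutions between the two texts."""
    prev = list(range(len(ground_truth) + 1))
    for i, ch_a in enumerate(generated, start=1):
        curr = [i]
        for j, ch_b in enumerate(ground_truth, start=1):
            curr.append(min(prev[j] + 1,                      # deletion
                            curr[j - 1] + 1,                  # insertion
                            prev[j - 1] + (ch_a != ch_b)))    # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(generated: str, ground_truth: str) -> float:
    """Distance normalised by the combined length of both texts, as described above."""
    total_chars = len(generated) + len(ground_truth)
    return levenshtein_distance(generated, ground_truth) / total_chars if total_chars else 0.0

print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_ratio("kitten", "sitting"))     # 3 / 13 ≈ 0.23
```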
### :2nd_place_medal: N-Gram based
N-gram based metrics, as the name indicates, use n-grams to compare the generated answer with the ground truth. They are suitable for extractive and abstractive tasks, but have limitations with long free-form answers because the comparison is word based.
- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation):
  - **ROUGE-N** measures the number of matching n-grams between the generated text and the ground truth. These matches do not consider the ordering of words.
  - **ROUGE-L** measures the longest common subsequence (LCS) between the generated text and the ground truth, i.e. the longest sequence of tokens shared by both (see the sketch after this list).
- **BLEU** (BiLingual Evaluation Understudy):
  It measures precision by comparing clipped n-grams in the generated text to the ground-truth text. These matches do not consider the ordering of words.
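To make the n-gram comparison concrete, here is a minimal sketch (not the ragas implementation) of ROUGE-N recall and of the longest common subsequence behind ROUGE-L:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(generated: str, reference: str, n: int = 1) -> float:
    """Fraction of the reference's n-grams that also occur in the generated text (order ignored)."""
    gen_counts = Counter(ngrams(generated.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, gen_counts[gram]) for gram, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists (the core of ROUGE-L)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

generated = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n_recall(generated, reference, n=2))          # bigram overlap: 0.6
print(lcs_length(generated.split(), reference.split()))   # longest shared token sequence: 5
```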
### :1st_place_medal: Model Based
Model-based methods use language models combined with NLP techniques to compare the generated text with the ground truth. They are well suited to free-form answers, whether long or short.
- **BertScore**:
  BertScore measures the similarity between the ground-truth answer and the generated text using SBERT vector embeddings. The common choice of similarity measure is cosine similarity, whose values range from 0 to 1. It shows good correlation with human judgement (see the sketch after this list).
- **EntailmentScore**:
  Uses textual entailment to measure the factual consistency of the generated text given the ground truth. Scores range from 0 to 1, with 1 indicating perfect factual entailment for all samples. The entailment score is highly correlated with human judgement.
- **$Q^2$**:
  Best used to measure factual consistency between the ground truth and the generated text. Scores range from 0 to 1; a higher score indicates better factual consistency. It employs a QA-QG paradigm followed by NLI to compare the ground truth and the generated answer. The $Q^2$ score is highly correlated with human judgement. :warning: This is a time- and resource-hungry metric.
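For a concrete feel of the model-based family, here is a minimal embedding-similarity sketch in the spirit of BertScore above. It uses the `sentence-transformers` package; the model name is only an example and is not what ragas uses internally:

```python
# Illustrative embedding-similarity score (not the ragas implementation).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

def embedding_similarity(generated: str, ground_truth: str) -> float:
    """Cosine similarity between sentence embeddings; close to 1 for near-paraphrases."""
    embeddings = model.encode([generated, ground_truth], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(embedding_similarity("Paris is the capital of France.",
                           "The capital of France is Paris."))
```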
Ragas measures your pipeline's performance along two dimensions:

1. **Factuality**: measures the factual consistency of the generated answer against the given context.
2. **Relevancy**: measures how relevant the retrieved contexts and the generated answer are to the question.

Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.

In practice, this breaks down into three scores:

1. `factuality`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm that first creates statements from the generated answer and then verifies each of these statements against the context. The score is scaled to the (0, 1) range; higher is better.
2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, implemented using a custom model. Values range over (0, 1); higher is better.
3. `context_relevancy`: measures how relevant the retrieved context is to the prompt. This is quantified using a custom-trained cross-encoder model. Values range over (0, 1); higher is better.

To read more about our metrics, check out the [docs](/docs/metrics.md).
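Because `ragas_score` is the harmonic mean of the individual dimensions, a weak score in either one pulls the overall score down sharply. Here is a minimal, illustrative sketch of that combination (plain Python, not the ragas implementation; the example score values are made up):

```python
# Illustrative only: combine per-dimension scores with a harmonic mean, as ragas_score does.
def harmonic_mean(scores):
    """Harmonic mean of a list of scores in (0, 1]; returns 0.0 if any score is 0."""
    if any(s == 0 for s in scores):
        return 0.0  # a single zero drags the harmonic mean to zero
    return len(scores) / sum(1.0 / s for s in scores)

print(harmonic_mean([0.9, 0.6]))  # 0.72: both dimensions are reasonably good
print(harmonic_mean([1.0, 0.0]))  # 0.0: a factual but irrelevant answer still scores zero overall
```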
## :question: How to use Ragas to improve your pipeline?

*"Measurement is the first step that leads to control and eventually to improvement." - James Harrington*

Here we assume that you already have your RAG pipeline ready. A RAG pipeline has two main parts - the retriever and the generator - and a change in either of them will also affect your pipeline's quality.

1. First, decide on one parameter that you're interested in adjusting, for example the number of retrieved documents, K.
2. Collect a set of sample prompts (at least 20) to form your test set.
3. Run your pipeline on the test set before and after the change. Each time, record the prompts together with the retrieved context and the generated output.
4. Run the ragas evaluation on each of them to generate evaluation scores.
5. Compare the scores to see how much the change has affected your pipeline's performance (see the sketch after this list).
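In code, the before/after experiment might look roughly like the sketch below. `run_pipeline` and `score_with_ragas` are hypothetical stand-ins for your own retriever + generator and for the ragas evaluation; they are not part of the ragas API:

```python
from statistics import mean

# Stand-in stubs so the sketch runs; replace them with your retriever + generator
# and with the ragas evaluation of your choice.
def run_pipeline(question: str, k: int):
    contexts = [f"context {i} for: {question}" for i in range(k)]
    answer = f"generated answer for: {question}"
    return contexts, answer

def score_with_ragas(question: str, contexts, answer: str) -> float:
    return 0.5  # placeholder score

test_questions = [
    "What is the capital of France?",
    "Who proposed the transformer architecture?",
    # ... use at least 20 prompts in practice
]

def evaluate_setting(k: int) -> float:
    """Average score over the test set for one pipeline configuration."""
    return mean(
        score_with_ragas(q, *run_pipeline(q, k=k)) for q in test_questions
    )

print("k=3:", evaluate_setting(k=3))
print("k=5:", evaluate_setting(k=5))
```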
## :raising_hand_man: FAQ

1. Why harmonic mean?

The harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (factuality = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, while the harmonic mean gives you 0.0.
📜 Check out [citations](./references.md) for related publications.