
Commit 440d641

updated docs (#64)
## What

Updated metrics and README for improving docs and including changes in metrics.
1 parent 4e2d520 commit 440d641

File tree

2 files changed (+19, -11 lines)


README.md

Lines changed: 2 additions & 1 deletion
@@ -82,7 +82,8 @@ If you want a more in-depth explanation of core components, check out our [quick

Ragas measures your pipeline's performance against two dimensions:
1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
-2. **Relevancy**: measures how relevant the retrieved contexts and the generated answer are to the question. The presence of extra or redundant information is penalized.
+2. **Context Relevancy**: measures how relevant the retrieved contexts are to the question. Ideally, the context should contain only the information necessary to answer the question. The presence of redundant information in the context is penalized.
+3. **Answer Relevancy**: measures how relevant the generated answer is to the question. This does not ensure the factuality of the generated answer; rather, it penalizes the presence of redundant information in the generated answer.

Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
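For readers skimming the diff, here is a minimal sketch of the harmonic-mean combination mentioned in the last line above. The function, variable names, and example scores are illustrative assumptions, not code from the ragas package.

```python
# Illustrative sketch only: combining two dimension scores with a harmonic mean.
# The names and example values below are assumptions, not taken from ragas.
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores in the (0, 1] range."""
    return 2 * a * b / (a + b)

faithfulness_score = 0.90  # hypothetical pipeline score
relevancy_score = 0.70     # hypothetical pipeline score

ragas_score = harmonic_mean(faithfulness_score, relevancy_score)
print(ragas_score)  # ~0.79, lower than the arithmetic mean of 0.80
```

The harmonic mean pulls the combined score toward the weaker dimension, so a pipeline cannot hide poor retrieval behind strong generation (or vice versa).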

docs/metrics.md

Lines changed: 17 additions & 10 deletions
@@ -2,38 +2,45 @@

1. `faithfulness`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm that includes creating statements from the generated answer and then verifying each of these statements against the context. The answer is scaled to the (0,1) range; higher is better.
```python
-from ragas.metrics import faithfulness
+from ragas.metrics.factuality import Faithfulness
+faithfulness = Faithfulness()
# Dataset({
# features: ['question','contexts','answer'],
# num_rows: 25
# })
dataset: Dataset

-results = evaluate(dataset, metrics=[faithfulness])
+results = faithfulness.score(dataset)
```

-2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, implemented using a custom model. Values range over (0,1); higher is better.
+2. `context_relevancy`: measures how relevant the retrieved context is to the prompt. This is done using a combination of OpenAI models and cross-encoder models. To improve the score, one can try to optimize the amount of information present in the retrieved context.
```python
-from ragas.metrics import answer_relevancy
+from ragas.metrics.context_relevancy import ContextRelevancy
+context_rel = ContextRelevancy(strictness=3)
# Dataset({
-# features: ['question','answer'],
+# features: ['question','contexts'],
# num_rows: 25
# })
dataset: Dataset

-results = evaluate(dataset, metrics=[answer_relevancy])
+results = context_rel.score(dataset)
```

-3. `context_relevancy`: measures how relevant the retrieved context is to the prompt. This is quantified using a custom-trained cross-encoder model. Values range over (0,1); higher is better.
+3. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, implemented using a custom model. Values range over (0,1); higher is better.
```python
-from ragas.metrics import context_relevancy
+from ragas.metrics.answer_relevancy import AnswerRelevancy
+answer_relevancy = AnswerRelevancy(model_name="t5-small")
# Dataset({
-# features: ['question','contexts'],
+# features: ['question','answer'],
# num_rows: 25
# })
dataset: Dataset

-results = evaluate(dataset, metrics=[context_relevancy])
+results = answer_relevancy.score(dataset)
```

## Why is ragas better than scoring using GPT 3.5 directly?
LLMs like GPT 3.5 struggle when it comes to scoring generated text directly. For instance, these models tend to generate only integer scores, and those scores vary when invoked differently. Advanced paradigms and techniques that leverage LLMs to minimize this bias are the solution ragas presents.
<h1 align="center">
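Putting the updated `docs/metrics.md` snippets together, a minimal end-to-end sketch might look like the following. The metric classes, constructor arguments, and `.score()` calls are taken from the diff above; the use of `datasets.Dataset.from_dict` and the sample row are assumptions added here for illustration, not part of the commit.

```python
# Sketch assembled from the snippets in the updated docs; the example row and
# the Dataset construction are illustrative assumptions.
from datasets import Dataset

from ragas.metrics.factuality import Faithfulness
from ragas.metrics.context_relevancy import ContextRelevancy
from ragas.metrics.answer_relevancy import AnswerRelevancy

# A tiny evaluation dataset with the columns the metrics expect.
dataset = Dataset.from_dict(
    {
        "question": ["Who wrote 'Pride and Prejudice'?"],
        "contexts": [["'Pride and Prejudice' is an 1813 novel by Jane Austen."]],
        "answer": ["Jane Austen wrote 'Pride and Prejudice'."],
    }
)

faithfulness = Faithfulness()                              # uses question, contexts, answer
context_rel = ContextRelevancy(strictness=3)               # uses question, contexts
answer_relevancy = AnswerRelevancy(model_name="t5-small")  # uses question, answer

for metric in (faithfulness, context_rel, answer_relevancy):
    results = metric.score(dataset)  # per the docs, scores fall in the (0,1) range
    print(type(metric).__name__, results)
```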
