# Metrics

1. `factuality`: measures the factual consistency of the generated answer against the given context. This is done with a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the given context. The score is scaled to the (0,1) range; higher is better.
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import factuality

# Dataset({
#     features: ['question', 'contexts', 'answer'],
#     num_rows: 25
# })
dataset: Dataset

results = evaluate(dataset, metrics=[factuality])
```
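To make the multi-step paradigm concrete, here is a minimal, illustrative sketch in plain Python. The helper functions below use naive string heuristics as stand-ins for the LLM calls ragas actually makes; this is not ragas' implementation.

```python
from typing import List

def extract_statements(answer: str) -> List[str]:
    # Stand-in for "create statements from the generated answer":
    # naively split the answer into sentences.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(statement: str, context: str) -> bool:
    # Stand-in for "verify each statement against the context":
    # treat a statement as supported if most of its words appear in the context.
    words = statement.lower().split()
    hits = sum(w in context.lower() for w in words)
    return hits / max(len(words), 1) > 0.8

def factuality_score(answer: str, context: str) -> float:
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(is_supported(s, context) for s in statements)
    # Fraction of supported statements, which lands in [0, 1].
    return supported / len(statements)

print(factuality_score(
    "Paris is the capital of France.",
    "Paris is the capital of France and its largest city.",
))  # -> 1.0
```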
2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, implemented with a custom model. Values fall in the (0,1) range; higher is better.
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy

# Dataset({
#     features: ['question', 'answer'],
#     num_rows: 25
# })
dataset: Dataset

results = evaluate(dataset, metrics=[answer_relevancy])
```
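As a rough illustration of the conditional-likelihood idea (not ragas' actual model or prompt), one could measure how likely a seq2seq model is to regenerate the question when conditioned on the answer; `t5-base` below is only a placeholder model.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder model; ragas uses its own custom model for this metric.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def answer_relevancy_score(question: str, answer: str) -> float:
    # Condition the model on the answer and measure how likely it is
    # to generate the original question.
    inputs = tokenizer("generate question: " + answer, return_tensors="pt")
    labels = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean token-level NLL
    # exp(-NLL) is the per-token likelihood, which falls in (0, 1].
    return float(torch.exp(-loss))
```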

3. `context_relevancy`: measures how relevant the retrieved context is to the prompt. This is quantified using a custom-trained cross-encoder model. Values fall in the (0,1) range; higher is better.
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_relevancy

# Dataset({
#     features: ['question', 'contexts'],
#     num_rows: 25
# })
dataset: Dataset

results = evaluate(dataset, metrics=[context_relevancy])
```
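The metrics can also be passed together in a single `evaluate` call, assuming the dataset carries the union of the required columns (`question`, `contexts`, `answer`):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, factuality

# The dataset must provide 'question', 'contexts' and 'answer' so that
# every metric finds the columns it needs.
dataset: Dataset

results = evaluate(
    dataset,
    metrics=[factuality, answer_relevancy, context_relevancy],
)
print(results)  # one score per metric
```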
## Why is ragas better than scoring with GPT-3.5 directly?
LLMs like GPT-3.5 struggle when asked to score generated text directly. For instance, they tend to produce only integer scores, and those scores vary from one invocation to the next. The solution ragas presents is a set of paradigms and techniques that leverage LLMs while minimizing this bias.
<h1 align="center">
  <img style="vertical-align:middle" height="350"
  src="./assets/bar-graph.svg">
</h1>

Take a look at our experiments [here](/experiments/assesments/metrics_assesments.ipynb).