
Commit 5d3c459

Update README with explanation of Ragas metrics (#124)
Added a note to the README on which fields each metric is calculated from. Admittedly, this information can be found by reading deeper, but for those new to ragas it will make it easier to get started! If you think it is too much, please just close it! Thanks in advance. --------- Co-authored-by: Shahul ES <[email protected]>
1 parent ff015d6 commit 5d3c459

2 files changed: +10 -10 lines changed


README.md

Lines changed: 5 additions & 5 deletions
@@ -89,15 +89,15 @@ If you want a more in-depth explanation of core components, check out our [quick
Ragas measures your pipeline's performance against different dimensions:

- 1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
+ 1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized. It is calculated from `answer` and `retrieved context`.

- 2. **Context Relevancy**: measures how relevant the retrieved contexts are to the question. Ideally, the context should contain only the information necessary to answer the question; redundant information in the context is penalized.
+ 2. **Context Relevancy**: measures how relevant the retrieved contexts are to the question. Ideally, the context should contain only the information necessary to answer the question; redundant information in the context is penalized. It is calculated from `question` and `retrieved context`.

- 3. **Context Recall**: measures the recall of the retrieved context, using the annotated answer as ground truth. The annotated answer is taken as a proxy for the ground-truth context.
+ 3. **Context Recall**: measures the recall of the retrieved context, using the annotated answer as ground truth. The annotated answer is taken as a proxy for the ground-truth context. It is calculated from `ground truth` and `retrieved context`.

- 4. **Answer Relevancy**: refers to the degree to which a response directly addresses and is appropriate for a given question or context. This does not take the factuality of the answer into account, but penalizes the presence of redundant information or incomplete answers given a question.
+ 4. **Answer Relevancy**: refers to the degree to which a response directly addresses and is appropriate for a given question or context. This does not take the factuality of the answer into account, but penalizes the presence of redundant information or incomplete answers given a question. It is calculated from `question` and `answer`.

- 5. **Aspect Critiques**: designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of aspect critiques is always binary.
+ 5. **Aspect Critiques**: designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of aspect critiques is always binary. It is calculated from `answer`.

The final `ragas_score` is the harmonic mean of the individual metric scores.
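For intuition, here is a minimal sketch of how a harmonic mean combines per-metric scores into one figure (the scores below are made up, and `statistics.harmonic_mean` stands in for whatever ragas does internally):

```python
from statistics import harmonic_mean

# Hypothetical per-metric scores, for illustration only.
scores = {
    "faithfulness": 0.95,
    "context_relevancy": 0.81,
    "answer_relevancy": 0.89,
}

# A harmonic mean is pulled toward the lowest value, so one weak dimension
# drags the combined score down more than an arithmetic mean would.
ragas_score = harmonic_mean(scores.values())
print(f"ragas_score: {ragas_score:.3f}")  # ~0.880
```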

docs/metrics.md

Lines changed: 5 additions & 5 deletions
@@ -2,7 +2,7 @@
### `Faithfulness`

- This measures the factual consistency of the generated answer against the given context. It is computed in a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context. The score is scaled to the (0, 1) range; higher is better.
+ This measures the factual consistency of the generated answer against the given context. It is computed in a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context. It is calculated from `answer` and `retrieved context`. The score is scaled to the (0, 1) range; higher is better.
```python
from ragas.metrics.factuality import Faithfulness
faithfulness = Faithfulness()
@@ -17,7 +17,7 @@ results = faithfulness.score(dataset)
```
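For intuition, the statement-creation-and-verification paradigm described above might be sketched as below; both helpers are naive stand-ins for the LLM steps, not the actual ragas internals:

```python
from typing import List

def split_into_statements(answer: str) -> List[str]:
    # Naive stand-in for the LLM step that decomposes an answer into claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(statement: str, context: str) -> bool:
    # Naive stand-in for the LLM verdict on whether a claim follows from context.
    words = set(statement.lower().split())
    return len(words & set(context.lower().split())) >= len(words) // 2

def faithfulness_score(answer: str, context: str) -> float:
    statements = split_into_statements(answer)
    supported = sum(is_supported(s, context) for s in statements)
    # Fraction of claims supported by the retrieved context, in the (0, 1) range.
    return supported / len(statements) if statements else 0.0
```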
### `ContextRelevancy`

- This measures how relevant the retrieved context is to the prompt. This is done using a combination of OpenAI models and cross-encoder models. To improve the score, try to trim the retrieved context down to the information the question actually needs.
+ This measures how relevant the retrieved context is to the prompt. This is done using a combination of OpenAI models and cross-encoder models. To improve the score, try to trim the retrieved context down to the information the question actually needs. It is calculated from `question` and `retrieved context`.
```python
from ragas.metrics.context_relevancy import ContextRelevancy
context_rel = ContextRelevancy(strictness=3)
@@ -31,7 +31,7 @@ results = context_rel.score(dataset)
```
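As a rough sketch of the cross-encoder half of this (the model name, sentence splitting, and threshold below are assumptions for illustration, not the ragas configuration):

```python
from sentence_transformers import CrossEncoder

# A public MS MARCO cross-encoder, chosen here purely for illustration.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def context_relevancy_score(question: str, context: str, threshold: float = 0.5) -> float:
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    if not sentences:
        return 0.0
    # Score each (question, sentence) pair for relevance.
    scores = model.predict([(question, s) for s in sentences])
    relevant = sum(score > threshold for score in scores)
    # Fraction of context sentences judged relevant to the question.
    return relevant / len(sentences)
```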

### `Context Recall`

- This measures the recall of the retrieved context, using the annotated answer as ground truth. The annotated answer is taken as a proxy for the ground-truth context.
+ This measures the recall of the retrieved context, using the annotated answer as ground truth. The annotated answer is taken as a proxy for the ground-truth context. It is calculated from `ground truth` and `retrieved context`.

```python
from ragas.metrics.context_recall import ContextRecall
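# --- Editor's illustrative sketch, not part of the original snippet. ---
# Context recall, conceptually: split the annotated ground-truth answer into
# sentences and measure what fraction of them can be attributed to the
# retrieved context. Substring matching below is a naive stand-in for the
# LLM-based attribution step.
def context_recall_score(ground_truth: str, context: str) -> float:
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    attributed = sum(s.lower() in context.lower() for s in sentences)
    return attributed / len(sentences) if sentences else 0.0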
@@ -48,7 +48,7 @@ results = context_recall.score(dataset)

### `AnswerRelevancy`

- This measures how relevant the generated answer is to the prompt. If the generated answer is incomplete or contains redundant information, the score will be low. This is quantified by estimating the chance of an LLM generating the given question from the generated answer. Values range over (0, 1); higher is better.
+ This measures how relevant the generated answer is to the prompt. If the generated answer is incomplete or contains redundant information, the score will be low. This is quantified by estimating the chance of an LLM generating the given question from the generated answer. It is calculated from `question` and `answer`. Values range over (0, 1); higher is better.
```python
from ragas.metrics.answer_relevancy import AnswerRelevancy
answer_relevancy = AnswerRelevancy()
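# --- Editor's illustrative sketch, not part of the original snippet. ---
# Answer relevancy, conceptually: generate questions that the answer would
# satisfy, then compare them with the original question. Both helpers are
# naive stand-ins for the LLM and embedding-similarity steps.
from typing import List

def generate_questions(answer: str, n: int = 3) -> List[str]:
    # Stand-in for an LLM prompt such as "Write a question this text answers."
    return [f"What does the following describe? {answer}" for _ in range(n)]

def similarity(a: str, b: str) -> float:
    # Stand-in for cosine similarity between embeddings: word-overlap ratio.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def answer_relevancy_score(question: str, answer: str) -> float:
    generated = generate_questions(answer)
    return sum(similarity(question, g) for g in generated) / len(generated)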
@@ -64,7 +64,7 @@ results = answer_relevancy.score(dataset)

### `AspectCritique`

- `Aspect Critiques`: Critiques are LLM evaluators that judge your submission against a provided aspect. Several aspects, such as `correctness` and `harmfulness`, come predefined with Ragas critiques (check `SUPPORTED_ASPECTS` for the full list); you can also define your own. The `strictness` parameter is used to ensure a level of self-consistency in the prediction (ideal range 2-4). The output of aspect critiques is always binary, indicating whether the submission adhered to the given aspect definition. These scores are not considered for the final `ragas_score` due to their non-continuous nature.
+ `Aspect Critiques`: Critiques are LLM evaluators that judge your submission against a provided aspect. Several aspects, such as `correctness` and `harmfulness`, come predefined with Ragas critiques (check `SUPPORTED_ASPECTS` for the full list); you can also define your own. The `strictness` parameter is used to ensure a level of self-consistency in the prediction (ideal range 2-4). The output of aspect critiques is always binary, indicating whether the submission adhered to the given aspect definition. It is calculated from `answer`. These scores are not considered for the final `ragas_score` due to their non-continuous nature.
- List of predefined aspects:
  `correctness`, `harmfulness`, `coherence`, `conciseness`, `maliciousness`
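For intuition, the `strictness` self-consistency mechanism might be sketched as a majority vote over repeated binary verdicts (`llm_verdict` below is a hypothetical stand-in for a single LLM judgment, not the ragas implementation):

```python
import random

def llm_verdict(submission: str, aspect: str) -> bool:
    # Stand-in for one LLM judgment of the submission against the aspect.
    return random.random() > 0.5

def critique(submission: str, aspect: str, strictness: int = 3) -> bool:
    # Several independent verdicts, combined by majority vote, damp the
    # run-to-run noise of a single LLM call.
    verdicts = [llm_verdict(submission, aspect) for _ in range(strictness)]
    return sum(verdicts) > strictness / 2
```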

0 commit comments
