Commit f265cf5

docs: updated the formula for some metrics (#1834)
1 parent 91393e6 commit f265cf5

13 files changed: +168 -56 lines changed

docs/concepts/metrics/available_metrics/agents.md

Lines changed: 27 additions & 10 deletions
@@ -52,17 +52,24 @@ AIMessage(content="I found a great recipe for chocolate cake! Would you like the


 sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
-scorer = TopicAdherenceScore(mode="precision")
-scorer.llm = openai_model
+scorer = TopicAdherenceScore(llm = evaluator_llm, mode="precision")
 await scorer.multi_turn_ascore(sample)
 ```
+Output
+```
+0.6666666666444444
+```


 To change the mode to recall, set the `mode` parameter to `recall`.

 ```python
-scorer = TopicAdherenceScore(mode="recall")
+scorer = TopicAdherenceScore(llm = evaluator_llm, mode="recall")
 ```
+Output
+```
+0.99999999995
+```



@@ -96,10 +103,13 @@ sample = MultiTurnSample(
 ]
 )

-scorer = ToolCallAccuracy()
-scorer.llm = your_llm
+scorer = ToolCallAccuracy(llm = evaluator_llm)
 await scorer.multi_turn_ascore(sample)
 ```
+Output
+```
+1.0
+```

 The tool call sequence specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI does not match the order or sequence of the `reference_tool_calls`, the metric will return a score of 0. This helps to ensure that the AI is able to identify and call the required tools in the correct order to complete a given task.

@@ -109,7 +119,7 @@ By default the tool names and arguments are compared using exact string matching
 from ragas.metrics._string import NonLLMStringSimilarity
 from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

-metric = ToolCallAccuracy()
+metric = ToolCallAccuracy(llm = evaluator_llm)
 metric.arg_comparison_metric = NonLLMStringSimilarity()
 ```

@@ -146,10 +156,13 @@ sample = MultiTurnSample(user_input=[
 ],
 reference="Table booked at one of the chinese restaurants at 8 pm")

-scorer = AgentGoalAccuracyWithReference()
-scorer.llm = your_llm
+scorer = AgentGoalAccuracyWithReference(llm = evaluator_llm)
 await scorer.multi_turn_ascore(sample)

+```
+Output
+```
+1.0
 ```

 ### Without reference
@@ -181,7 +194,11 @@ sample = MultiTurnSample(user_input=[
 HumanMessage(content="thanks"),
 ])

-scorer = AgentGoalAccuracyWithoutReference()
-await metric.multi_turn_ascore(sample)
+scorer = AgentGoalAccuracyWithoutReference(llm = evaluator_llm)
+await scorer.multi_turn_ascore(sample)

 ```
+Output
+```
+1.0
+```

docs/concepts/metrics/available_metrics/answer_relevance.md

Lines changed: 21 additions & 15 deletions
@@ -1,27 +1,29 @@
 ## Response Relevancy

-`ResponseRelevancy` metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrived_contexts` and the `response`.
+The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.

-The Answer Relevancy is defined as the mean cosine similarity of the original `user_input` to a number of artificial questions, which where generated (reverse engineered) based on the `response`:
+This metric is calculated using the `user_input` and the `response` as follows:

-$$
-\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} cos(E_{g_i}, E_o)
-$$
+1. Generate a set of artificial questions (default is 3) based on the response. These questions are designed to reflect the content of the response.
+2. Compute the cosine similarity between the embedding of the user input ($E_o$) and the embedding of each generated question ($E_{g_i}$).
+3. Take the average of these cosine similarity scores to get the **Answer Relevancy**:

 $$
-\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\|E_o\|}
-$$
-
-Where:
+\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine similarity}(E_{g_i}, E_o)
+$$

-* $E_{g_i}$ is the embedding of the generated question $i$.
-* $E_o$ is the embedding of the original question.
-* $N$ is the number of generated questions, which is 3 default.
+$$
+\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}
+$$

-Please note, that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, due to the nature of the cosine similarity ranging from -1 to 1.
+Where:
+- $E_{g_i}$: Embedding of the $i^{th}$ generated question.
+- $E_o$: Embedding of the user input.
+- $N$: Number of generated questions (default is 3).

-An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
+**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.

+An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.

 ### Example

@@ -37,9 +39,13 @@ sample = SingleTurnSample(
 ]
 )

-scorer = ResponseRelevancy()
+scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embedding)
 await scorer.single_turn_ascore(sample)
 ```
+Output
+```
+0.9165088378587264
+```

 ### How It’s Calculated
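As a quick sanity check on the reworked formula above, here is a minimal standalone sketch of the averaging step. It is not part of this commit; the embedding vectors are hypothetical stand-ins for $E_o$ and $E_{g_i}$, and the LLM question-generation step is omitted.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity = dot product divided by the product of the norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: E_o for the user input, one E_g_i per generated question (N = 3 by default)
E_o = np.array([0.9, 0.1, 0.3])
E_g = [np.array([0.8, 0.2, 0.3]),
       np.array([0.9, 0.0, 0.4]),
       np.array([0.7, 0.3, 0.2])]

# Answer Relevancy = mean cosine similarity between each generated question and the user input
answer_relevancy = sum(cosine_similarity(e, E_o) for e in E_g) / len(E_g)
print(answer_relevancy)  # usually, but not necessarily, within [0, 1]
```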

docs/concepts/metrics/available_metrics/aspect_critic.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ scorer.llm = openai_model
 await scorer.single_turn_ascore(sample)
 ```

+
 ## Calculation

 Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:

docs/concepts/metrics/available_metrics/context_entities_recall.md

Lines changed: 21 additions & 12 deletions
@@ -2,10 +2,15 @@

 `ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities are recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.

-To compute this metric, we use two sets, $GE$ and $CE$, as set of entities present in `reference` and set of entities present in `retrieved_contexts` respectively. We then take the number of elements in intersection of these sets and divide it by the number of elements present in the $GE$, given by the formula:
+To compute this metric, we use two sets:
+
+- **$RE$**: The set of entities in the reference.
+- **$RCE$**: The set of entities in the retrieved contexts.
+
+We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:

 $$
-\text{context entity recall} = \frac{| CE \cap GE |}{| GE |}
+\text{Context Entity Recall} = \frac{\text{Number of common entities between $RCE$ and $RE$}}{\text{Total number of entities in $RE$}}
 $$


@@ -20,10 +25,14 @@ sample = SingleTurnSample(
 retrieved_contexts=["The Eiffel Tower is located in Paris."],
 )

-scorer = ContextEntityRecall()
+scorer = ContextEntityRecall(llm=evaluator_llm)

 await scorer.single_turn_ascore(sample)
 ```
+Output
+```
+0.999999995
+```

 ### How It’s Calculated

@@ -34,25 +43,25 @@ await scorer.single_turn_ascore(sample)
 **High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
 **Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

-Let us consider the ground truth and the contexts given above.
+Let us consider the reference and the retrieved contexts given above.

-- **Step-1**: Find entities present in the ground truths.
-- Entities in ground truth (GE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
-- **Step-2**: Find entities present in the context.
-- Entities in context (CE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
-- Entities in context (CE2) - ['Taj Mahal', 'UNESCO', 'India']
+- **Step-1**: Find entities present in the reference.
+- Entities in ground truth (RE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
+- **Step-2**: Find entities present in the retrieved contexts.
+- Entities in context (RCE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
+- Entities in context (RCE2) - ['Taj Mahal', 'UNESCO', 'India']
 - **Step-3**: Use the formula given above to calculate entity-recall

 $$
-\text{context entity recall 1} = \frac{| CE1 \cap GE |}{| GE |}
+\text{context entity recall 1} = \frac{| RCE1 \cap RE |}{| RE |}
 = 4/6
 = 0.666
 $$

 $$
-\text{context entity recall 2} = \frac{| CE2 \cap GE |}{| GE |}
+\text{context entity recall 2} = \frac{| RCE2 \cap RE |}{| RE |}
 = 1/6
 $$

-We can see that the first context had a high entity recall, because it has a better entity coverage given the ground truth. If these two contexts were fetched by two retrieval mechanisms on same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
+We can see that the first context had a high entity recall, because it has a better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
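The worked example in the hunk above can be reproduced with plain Python sets. This is only an illustrative sketch; entity extraction is an LLM step in ragas and is skipped here.

```python
# Entity sets taken from the worked example above
RE = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}  # entities in the reference
RCE1 = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}         # entities in retrieved context 1
RCE2 = {"Taj Mahal", "UNESCO", "India"}                                     # entities in retrieved context 2

# Context Entity Recall = |RCE ∩ RE| / |RE|
recall_1 = len(RCE1 & RE) / len(RE)
recall_2 = len(RCE2 & RE) / len(RE)
print(round(recall_1, 3), round(recall_2, 3))  # 0.667 and 0.167, matching 4/6 and 1/6
```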

docs/concepts/metrics/available_metrics/context_precision.md

Lines changed: 14 additions & 2 deletions
@@ -25,7 +25,7 @@ The following metrics uses LLM to identify if a retrieved context is relevant or
 from ragas import SingleTurnSample
 from ragas.metrics import LLMContextPrecisionWithoutReference

-context_precision = LLMContextPrecisionWithoutReference()
+context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

 sample = SingleTurnSample(
 user_input="Where is the Eiffel Tower located?",
@@ -36,6 +36,10 @@ sample = SingleTurnSample(

 await context_precision.single_turn_ascore(sample)
 ```
+Output
+```
+0.9999999999
+```

 ### Context Precision with reference

@@ -47,7 +51,7 @@ await context_precision.single_turn_ascore(sample)
 from ragas import SingleTurnSample
 from ragas.metrics import LLMContextPrecisionWithReference

-context_precision = LLMContextPrecisionWithReference()
+context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

 sample = SingleTurnSample(
 user_input="Where is the Eiffel Tower located?",
@@ -57,6 +61,10 @@ sample = SingleTurnSample(

 await context_precision.single_turn_ascore(sample)
 ```
+Output
+```
+0.9999999999
+```

 ## Non LLM Based Context Precision

@@ -80,4 +88,8 @@ sample = SingleTurnSample(
 )

 await context_precision.single_turn_ascore(sample)
+```
+Output
+```
+0.9999999999
 ```

docs/concepts/metrics/available_metrics/context_recall.md

Lines changed: 10 additions & 2 deletions
@@ -13,7 +13,7 @@ In short, recall is about not missing anything important. Since it is about not
 The formula for calculating context recall is as follows:

 $$
-\text{context recall} = {|\text{GT claims that can be attributed to context}| \over |\text{Number of claims in GT}|}
+\text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}
 $$

 ### Example
@@ -29,9 +29,13 @@ sample = SingleTurnSample(
 retrieved_contexts=["Paris is the capital of France."],
 )

-context_recall = LLMContextRecall()
+context_recall = LLMContextRecall(llm=evaluator_llm)
 await context_recall.single_turn_ascore(sample)

+```
+Output
+```
+1.0
 ```

 ## Non LLM Based Context Recall
@@ -61,4 +65,8 @@ context_recall = NonLLMContextRecall()
 await context_recall.single_turn_ascore(sample)


+```
+Output
+```
+0.5
 ```
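The updated Context Recall formula is a simple ratio over reference claims. A minimal sketch follows; claim splitting and attribution are LLM steps in ragas, so the verdicts below are hypothetical.

```python
# One verdict per claim in the reference:
# True if the claim can be attributed to the retrieved context, else False
reference_claim_supported = [True, False]

context_recall = sum(reference_claim_supported) / len(reference_claim_supported)
print(context_recall)  # 0.5 when one of two reference claims is covered by the retrieved context
```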

docs/concepts/metrics/available_metrics/factual_correctness.md

Lines changed: 14 additions & 3 deletions
@@ -42,15 +42,22 @@ sample = SingleTurnSample(
 reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
 )

-scorer = FactualCorrectness()
-scorer.llm = openai_model
+scorer = FactualCorrectness(llm = evaluator_llm)
 await scorer.single_turn_ascore(sample)
 ```
+Output
+```
+0.67
+```

 By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter.

 ```python
-scorer = FactualCorrectness(mode="precision")
+scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
+```
+Output
+```
+1.0
 ```

 ### Controlling the Number of Claims
@@ -63,6 +70,10 @@ Each sentence in the response and reference can be broken down into one or more
 ```python
 scorer = FactualCorrectness(mode="precision",atomicity="low")
 ```
+Output
+```
+1.0
+```


 #### Understanding Atomicity and Coverage
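The 0.67 and 1.0 outputs added above are consistent with how the F1, precision, and recall modes combine claim verdicts. Here is a rough sketch under the assumption that claim decomposition and verification (LLM steps in ragas) have already produced the counts for the Eiffel Tower example; the verdicts are illustrative, not taken from the library.

```python
# Hypothetical claim verdicts for the example above
tp = 1  # response claims supported by the reference ("The Eiffel Tower is located in Paris.")
fp = 0  # response claims not supported by the reference
fn = 1  # reference claims missing from the response (the height claim)

precision = tp / (tp + fp)                          # 1.0  -> mode="precision"
recall = tp / (tp + fn)                             # 0.5  -> mode="recall"
f1 = 2 * precision * recall / (precision + recall)  # ~0.67 -> default mode="F1"
print(precision, recall, round(f1, 2))
```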

docs/concepts/metrics/available_metrics/faithfulness.md

Lines changed: 14 additions & 5 deletions
@@ -1,11 +1,16 @@
 ## Faithfulness

-`Faithfulness` metric measures the factual consistency of the generated answer against the given context. It is calculated from answer and retrieved context. The answer is scaled to (0,1) range. Higher the better.
+The **Faithfulness** metric measures how factually consistent a `response` is with the `retrieved context`. It ranges from 0 to 1, with higher scores indicating better consistency.

-The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. The faithfulness score is given by:
+A response is considered **faithful** if all its claims can be supported by the retrieved context.
+
+To calculate this:
+1. Identify all the claims in the response.
+2. Check each claim to see if it can be inferred from the retrieved context.
+3. Compute the faithfulness score using the formula:

 $$
-\text{Faithfulness score} = {|\text{Number of claims in the generated answer that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}
+\text{Faithfulness Score} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
 $$


@@ -22,9 +27,13 @@ sample = SingleTurnSample(
 "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
 ]
 )
-scorer = Faithfulness()
+scorer = Faithfulness(llm=evaluator_llm)
 await scorer.single_turn_ascore(sample)
 ```
+Output
+```
+1.0
+```


 ## Faithfullness with HHEM-2.1-Open
@@ -43,7 +52,7 @@ sample = SingleTurnSample(
 "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
 ]
 )
-scorer = FaithfulnesswithHHEM()
+scorer = FaithfulnesswithHHEM(llm=evaluator_llm)
 await scorer.single_turn_ascore(sample)

 ```
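The three-step procedure added above boils down to a supported-claims ratio. A minimal sketch follows; claim extraction and verification are LLM calls in ragas, so the verdict list here is hypothetical.

```python
# One verdict per claim extracted from the response:
# True if the claim can be inferred from the retrieved context, else False
claim_supported = [True, True, True]

faithfulness_score = sum(claim_supported) / len(claim_supported)
print(faithfulness_score)  # 1.0 when every response claim is supported, as in the example output
```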
