
Commit 5cb5339

sanjeed5 authored and anistark committed
docs: complete collections API documentation for remaining metrics (#2420)
## Issue Link / Problem Description

- Follow-up to PR #2407
- Completes the migration to collections-based API documentation for metrics that were not covered in the initial PR

## Changes Made

- Updated **ContextRecall** documentation to showcase `ragas.metrics.collections.ContextRecall` as the primary example
- Updated **FactualCorrectness** documentation to showcase `ragas.metrics.collections.FactualCorrectness` with configuration options (mode, atomicity, coverage)
- Updated **ResponseGroundedness** documentation in nvidia_metrics.md to showcase `ragas.metrics.collections.ResponseGroundedness` as the primary example
- Moved all legacy API examples to "Legacy Metrics API" sections with deprecation warnings
- Added synchronous usage notes (`.score()` method) for all three metrics
- Preserved all conceptual explanations and "How It's Calculated" sections

## Testing

### How to Test

- [x] Automated tests added/updated: N/A (documentation only)
- [x] Manual testing steps:
    1. Verified `make build-docs` succeeds without errors ✓
    2. Tested all new code examples to ensure they work as documented
    3. Confirmed output values match expected results
    4. Verified consistency with PR #2407 documentation style

## References

- Related issues: Follow-up to PR #2407
- Documentation:
    - Updated: `docs/concepts/metrics/available_metrics/context_recall.md`
    - Updated: `docs/concepts/metrics/available_metrics/factual_correctness.md`
    - Updated: `docs/concepts/metrics/available_metrics/nvidia_metrics.md` (ResponseGroundedness section)
    - Pattern reference: PR #2407 (faithfulness.md, context_precision.md, answer_correctness.md)

## Screenshots/Examples (if applicable)

All three metrics now follow the consistent pattern:

1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts**: Implementation-agnostic explanation
3. **Synchronous Usage Note**: `.score()` method alternative
4. **Legacy Section**: Original API with deprecation timeline warnings
1 parent 1197ef9 commit 5cb5339
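As a quick orientation, here is a condensed sketch of the collections-based pattern this commit standardizes on, assembled from the documented examples in the diffs below (the `gpt-4o-mini` model name and `AsyncOpenAI` client come from those examples; the `asyncio.run` wrapper is only added here so the snippet runs as a standalone script):

```python
import asyncio

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ContextRecall

# Collections-based API: wrap an LLM once, then construct the metric with it
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
scorer = ContextRecall(llm=llm)


async def main() -> None:
    # Primary (async) usage shown in the updated docs
    result = await scorer.ascore(
        user_input="Where is the Eiffel Tower located?",
        retrieved_contexts=["Paris is the capital of France."],
        reference="The Eiffel Tower is located in Paris.",
    )
    print(f"Context Recall Score: {result.value}")


asyncio.run(main())

# Synchronous alternative noted on each updated page:
# result = scorer.score(...)  # same keyword arguments as .ascore()
```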

File tree

3 files changed: +233 -50 lines

docs/concepts/metrics/available_metrics/context_recall.md

Lines changed: 48 additions & 11 deletions
@@ -1,22 +1,59 @@
 # Context Recall
 
-Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.
-In short, recall is about not missing anything important. Since it is about not missing anything, calculating context recall always requires a reference to compare against.
-
-
-
-## LLM Based Context Recall
-
-`LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy to `reference_contexts` which also makes it easier to use as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
+Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. In short, recall is about not missing anything important.
 
+Since it is about not missing anything, calculating context recall always requires a reference to compare against. The LLM-based Context Recall metric uses `reference` as a proxy to `reference_contexts`, which makes it easier to use as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
 
 The formula for calculating context recall is as follows:
 
 $$
 \text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}
 $$
 
-### Example
+## Example
+
+```python
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import ContextRecall
+
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+
+# Create metric
+scorer = ContextRecall(llm=llm)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="Where is the Eiffel Tower located?",
+    retrieved_contexts=["Paris is the capital of France."],
+    reference="The Eiffel Tower is located in Paris."
+)
+print(f"Context Recall Score: {result.value}")
+```
+
+Output:
+
+```
+Context Recall Score: 1.0
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="Where is the Eiffel Tower located?",
+        retrieved_contexts=["Paris is the capital of France."],
+        reference="The Eiffel Tower is located in Paris."
+    )
+    ```
+
+## LLM Based Context Recall (Legacy API)
+
+!!! warning "Legacy API"
+    The following example uses the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above. This API will be deprecated in version 0.4 and removed in version 1.0.
 
 ```python
 from ragas.dataset_schema import SingleTurnSample
@@ -31,9 +68,9 @@ sample = SingleTurnSample(
 
 context_recall = LLMContextRecall(llm=evaluator_llm)
 await context_recall.single_turn_ascore(sample)
-
 ```
-Output
+
+Output:
 ```
 1.0
 ```
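The recall formula in this page reduces to a ratio over per-claim verdicts. A minimal illustrative helper (not the ragas implementation; in practice the verdicts come from the LLM attribution step described above):

```python
def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims attributable to the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# The single reference claim is supported by the retrieved context -> 1.0,
# matching the documented example output.
print(context_recall([True]))  # 1.0
```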

docs/concepts/metrics/available_metrics/factual_correctness.md

Lines changed: 125 additions & 30 deletions
@@ -2,6 +2,76 @@
 
 `FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.
 
+### Example
+
+```python
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import FactualCorrectness
+
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+
+# Create metric
+scorer = FactualCorrectness(llm=llm)
+
+# Evaluate
+result = await scorer.ascore(
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
+)
+print(f"Factual Correctness Score: {result.value}")
+```
+
+Output:
+
+```
+Factual Correctness Score: 0.67
+```
+
+By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter:
+
+```python
+# Precision mode - measures what fraction of response claims are supported by reference
+scorer = FactualCorrectness(llm=llm, mode="precision")
+result = await scorer.ascore(
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
+)
+print(f"Precision Score: {result.value}")
+```
+
+Output:
+
+```
+Precision Score: 1.0
+```
+
+You can also configure the claim decomposition granularity using `atomicity` and `coverage` parameters:
+
+```python
+# High granularity - more detailed claim decomposition
+scorer = FactualCorrectness(
+    llm=llm,
+    mode="f1",
+    atomicity="high",  # More atomic claims
+    coverage="high"    # Comprehensive coverage
+)
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        response="The Eiffel Tower is located in Paris.",
+        reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
+    )
+    ```
+
+### How It's Calculated
+
 The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:
 
 $$
@@ -30,36 +100,6 @@ $$
 \text{F1 Score} = {2 \times \text{Precision} \times \text{Recall} \over (\text{Precision} + \text{Recall})}
 $$
 
-### Example
-
-```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics._factual_correctness import FactualCorrectness
-
-
-sample = SingleTurnSample(
-    response="The Eiffel Tower is located in Paris.",
-    reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
-)
-
-scorer = FactualCorrectness(llm = evaluator_llm)
-await scorer.single_turn_ascore(sample)
-```
-Output
-```
-0.67
-```
-
-By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter.
-
-```python
-scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
-```
-Output
-```
-1.0
-```
-
 ### Controlling the Number of Claims
 
 Each sentence in the response and reference can be broken down into one or more claims. The number of claims that are generated from a single sentence is determined by the level of `atomicity` and `coverage` required for your application.
@@ -161,3 +201,58 @@ By adjusting both atomicity and coverage, you can customize the level of detail
 - Use **Low Atomicity and Low Coverage** when only the key information is necessary, such as for summarization.
 
 This flexibility in controlling the number of claims helps ensure that the information is presented at the right level of granularity for your application's requirements.
+
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with SingleTurnSample
+
+```python
+from ragas.dataset_schema import SingleTurnSample
+from ragas.metrics._factual_correctness import FactualCorrectness
+
+
+sample = SingleTurnSample(
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
+)
+
+scorer = FactualCorrectness(llm = evaluator_llm)
+await scorer.single_turn_ascore(sample)
+```
+
+Output:
+
+```
+0.67
+```
+
+### Changing the Mode
+
+By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter.
+
+```python
+scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
+```
+
+Output:
+
+```
+1.0
+```
+
+### Controlling Atomicity
+
+```python
+scorer = FactualCorrectness(mode="precision", atomicity="low")
+```
+
+Output:
+
+```
+1.0
+```
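For the "How It's Calculated" section in this page, precision, recall, and F1 follow the standard claim-count definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). A minimal illustrative sketch over claim counts, not the ragas implementation:

```python
def factual_correctness(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Score from claim-level counts: TP/FP over response claims, FN over reference claims."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    # F1 = 2 * P * R / (P + R)
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


# Documented example: 1 supported response claim, 0 unsupported, 1 reference claim not covered
print(round(factual_correctness(tp=1, fp=0, fn=1), 2))  # 0.67
```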

docs/concepts/metrics/available_metrics/nvidia_metrics.md

Lines changed: 60 additions & 9 deletions
@@ -248,28 +248,47 @@ Output:
 - **1** → The response is partially grounded.
 - **2** → The response is fully grounded (every statement can be found or inferred from the retrieved context).
 
+### Example
 
 ```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import ResponseGroundedness
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import ResponseGroundedness
 
-sample = SingleTurnSample(
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+
+# Create metric
+scorer = ResponseGroundedness(llm=llm)
+
+# Evaluate
+result = await scorer.ascore(
     response="Albert Einstein was born in 1879.",
     retrieved_contexts=[
         "Albert Einstein was born March 14, 1879.",
         "Albert Einstein was born at Ulm, in Württemberg, Germany.",
     ]
 )
-
-scorer = ResponseGroundedness(llm=evaluator_llm)
-score = await scorer.single_turn_ascore(sample)
-print(score)
+print(f"Response Groundedness Score: {result.value}")
 ```
-Output
+
+Output:
+
 ```
-1.0
+Response Groundedness Score: 1.0
 ```
 
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        response="Albert Einstein was born in 1879.",
+        retrieved_contexts=[...]
+    )
+    ```
+
 ### How It’s Calculated
 
 **Step 1:** The LLM is prompted with two distinct templates to evaluate the grounding of the response with respect to the retrieved contexts. Each prompt returns a grounding rating of **0**, **1**, or **2**.
@@ -299,3 +318,35 @@ In this example, the retrieved contexts provide both the birthdate and location
 - **Token Usage:** Faithfulness consumes more tokens, whereas Response Groundedness is more token-efficient.
 - **Explainability:** Faithfulness provides transparent, reasoning for each claim, while Response Groundedness provides a raw score.
 - **Robust Evaluation:** Faithfulness incorporates user input for a comprehensive assessment, whereas Response Groundedness ensures consistency through dual LLM evaluations.
+
+### Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+#### Example with SingleTurnSample
+
+```python
+from ragas.dataset_schema import SingleTurnSample
+from ragas.metrics import ResponseGroundedness
+
+sample = SingleTurnSample(
+    response="Albert Einstein was born in 1879.",
+    retrieved_contexts=[
+        "Albert Einstein was born March 14, 1879.",
+        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
+    ]
+)
+
+scorer = ResponseGroundedness(llm=evaluator_llm)
+score = await scorer.single_turn_ascore(sample)
+print(score)
+```
+
+Output:
+
+```
+1.0
+```
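The "How It's Calculated" section in this page describes two template ratings on a 0–2 scale. A minimal sketch of one plausible aggregation, assuming the two ratings are averaged and rescaled to [0, 1] (consistent with the 1.0 output above, but an assumption rather than the exact ragas code):

```python
def response_groundedness(rating_a: int, rating_b: int) -> float:
    """Average two 0-2 grounding ratings and normalize to [0, 1] (assumed aggregation)."""
    for rating in (rating_a, rating_b):
        if rating not in (0, 1, 2):
            raise ValueError("ratings must be 0, 1, or 2")
    return ((rating_a + rating_b) / 2) / 2


# Both templates rate the Einstein response as fully grounded (2) -> 1.0
print(response_groundedness(2, 2))  # 1.0
```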
