
Commit 5b64f89

docs: Add documentation for metrics.collections API (#2407)
## Summary

This PR updates the documentation for metrics to showcase the new `ragas.metrics.collections` API as the primary recommended approach, while preserving legacy API documentation for backward compatibility.

## Changes

### Metrics

- [x] AnswerAccuracy
- [x] AnswerCorrectness
- [x] AnswerRelevancy
- [x] AnswerSimilarity
- [x] BleuScore
- [x] ContextEntityRecall
- [x] ContextPrecision
- [x] ContextUtilization
- [x] Faithfulness
- [x] ContextRelevance
- [x] NoiseSensitivity
- [x] RougeScore
- [x] SemanticSimilarity
- [x] String metrics (ExactMatch, StringPresence, NonLLMStringSimilarity, DistanceMeasure)
- [x] SummaryScore

## Documentation Pattern

Each metric documentation follows this structure:

1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts/How It's Calculated**: Conceptual explanation (implementation-agnostic)
3. **Legacy Section**: Original API for backward compatibility

## Test Plan

- [x] Build docs locally to verify formatting
- [x] Test code examples to ensure they work
- [x] Verify all metrics from collections are documented
1 parent fd30d94 commit 5b64f89

File tree

11 files changed (+964 / -193 lines)


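The file diffs below each add a collections-based primary example with the same shape. As a condensed sketch of that pattern, wrapped in `asyncio.run` so it can run as a standalone script (the metric class, model names, and inputs are taken from the examples in this commit; an `OPENAI_API_KEY` in the environment is assumed):

```python
import asyncio

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics.collections import AnswerCorrectness


async def main() -> None:
    # Shared setup used by the new primary examples: an async OpenAI client
    # wrapped by the ragas LLM and embedding factories.
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o-mini", client=client)
    embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client)

    # Instantiate a collections-based metric and score a single example.
    scorer = AnswerCorrectness(llm=llm, embeddings=embeddings)
    result = await scorer.ascore(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967",
        reference="The first superbowl was held on January 15, 1967",
    )
    print(f"Answer Correctness Score: {result.value}")


asyncio.run(main())
```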
docs/concepts/metrics/available_metrics/answer_correctness.md

Lines changed: 58 additions & 11 deletions
@@ -16,20 +16,44 @@ Answer correctness encompasses two critical aspects: semantic similarity between
 ### Example
 
 ```python
-from datasets import Dataset
-from ragas.metrics import answer_correctness
-from ragas import evaluate
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.embeddings.base import embedding_factory
+from ragas.metrics.collections import AnswerCorrectness
+
+# Setup LLM and embeddings
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client)
+
+# Create metric
+scorer = AnswerCorrectness(llm=llm, embeddings=embeddings)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967",
+    reference="The first superbowl was held on January 15, 1967"
+)
+print(f"Answer Correctness Score: {result.value}")
+```
 
-data_samples = {
-    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
-    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
-    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
-}
-dataset = Dataset.from_dict(data_samples)
-score = evaluate(dataset,metrics=[answer_correctness])
-score.to_pandas()
+Output:
 
 ```
+Answer Correctness Score: 0.95
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967",
+        reference="The first superbowl was held on January 15, 1967"
+    )
+    ```
 
 ### Calculation
 
@@ -57,3 +81,26 @@ Next, we calculate the semantic similarity between the generated answer and the
 
 Once we have the semantic similarity, we take a weighted average of the semantic similarity and the factual similarity calculated above to arrive at the final score. You can adjust this weightage by modifying the `weights` parameter.
 
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with Dataset
+
+```python
+from datasets import Dataset
+from ragas.metrics import answer_correctness
+from ragas import evaluate
+
+data_samples = {
+    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
+    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
+    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
+}
+dataset = Dataset.from_dict(data_samples)
+score = evaluate(dataset,metrics=[answer_correctness])
+score.to_pandas()
+```

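The answer_correctness.md hunk above notes that the final score is a weighted average of factual similarity and semantic similarity, adjustable through the `weights` parameter. A minimal sketch of that aggregation, assuming the legacy `AnswerCorrectness` class accepts `weights` as documented (the component scores below are made-up numbers for illustration only):

```python
from ragas.metrics import AnswerCorrectness

# Legacy API: re-weight factuality vs. semantic similarity
# (the legacy metric's documented default is [0.75, 0.25]).
answer_correctness = AnswerCorrectness(weights=[0.4, 0.6])

# The aggregation itself is just a weighted average of the two component scores.
factuality = 0.80            # F1 over TP/FP/FN statement classification
semantic_similarity = 0.95   # embedding similarity between response and reference
weights = [0.4, 0.6]

final_score = sum(w * s for w, s in zip(weights, [factuality, semantic_similarity])) / sum(weights)
print(final_score)  # ~0.89
```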
docs/concepts/metrics/available_metrics/answer_relevance.md

Lines changed: 69 additions & 19 deletions
@@ -1,6 +1,8 @@
-## Response Relevancy
+## Answer Relevancy
 
-The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.
+The **Answer Relevancy** metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input.
+
+An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
 
 This metric is calculated using the `user_input` and the `response` as follows:
 
@@ -19,34 +21,50 @@
 Where:
 - $E_{g_i}$: Embedding of the $i^{th}$ generated question.
 - $E_o$: Embedding of the user input.
-- $N$: Number of generated questions (default is 3).
+- $N$: Number of generated questions (default is 3, configurable via `strictness` parameter).
 
 **Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.
 
-An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
-
 ### Example
 
 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import ResponseRelevancy
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.embeddings.base import embedding_factory
+from ragas.metrics.collections import AnswerRelevancy
+
+# Setup LLM and embeddings
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client, interface="modern")
+
+# Create metric
+scorer = AnswerRelevancy(llm=llm, embeddings=embeddings)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967"
+)
+print(f"Answer Relevancy Score: {result.value}")
+```
 
-sample = SingleTurnSample(
-    user_input="When was the first super bowl?",
-    response="The first superbowl was held on Jan 15, 1967",
-    retrieved_contexts=[
-        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
-    ]
-)
+Output:
 
-scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
-await scorer.single_turn_ascore(sample)
-```
-Output
 ```
-0.9165088378587264
+Answer Relevancy Score: 0.9165088378587264
 ```
 
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967"
+    )
+    ```
+
 ### How It’s Calculated
 
 !!! example
@@ -67,3 +85,35 @@ To calculate the relevance of the answer to the given question, we follow two st
 - **Step 2:** Calculate the mean cosine similarity between the generated questions and the actual question.
 
 The underlying concept is that if the answer correctly addresses the question, it is highly probable that the original question can be reconstructed solely from the answer.
+
+
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with SingleTurnSample
+
+```python
+from ragas import SingleTurnSample
+from ragas.metrics import ResponseRelevancy
+
+sample = SingleTurnSample(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967",
+    retrieved_contexts=[
+        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
+    ]
+)
+
+scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
+await scorer.single_turn_ascore(sample)
+```
+
+Output:
+
+```
+0.9165088378587264
+```

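To make the answer-relevancy formula in the hunk above concrete, the sketch below computes the mean cosine similarity between the embedding of the user input ($E_o$) and the embeddings of $N$ reverse-generated questions ($E_{g_i}$). The toy three-dimensional vectors stand in for real embedding-model output, and the commented question strings are only hypothetical examples of what the metric's question-generation step might produce:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Embedding of the original user input: "When was the first super bowl?"
E_o = np.array([0.90, 0.10, 0.30])

# Embeddings of N = 3 questions generated from the response, e.g.
# "When did the first super bowl take place?" and similar paraphrases.
E_g = [
    np.array([0.85, 0.15, 0.25]),
    np.array([0.88, 0.05, 0.35]),
    np.array([0.80, 0.20, 0.30]),
]

# Answer relevancy = (1 / N) * sum_i cos(E_g_i, E_o)
score = float(np.mean([cosine_similarity(e, E_o) for e in E_g]))
print(score)
```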
docs/concepts/metrics/available_metrics/context_entities_recall.md

Lines changed: 53 additions & 10 deletions
@@ -17,23 +17,40 @@
 ### Example
 
 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import ContextEntityRecall
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import ContextEntityRecall
 
-sample = SingleTurnSample(
-    reference="The Eiffel Tower is located in Paris.",
-    retrieved_contexts=["The Eiffel Tower is located in Paris."],
-)
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
 
-scorer = ContextEntityRecall(llm=evaluator_llm)
+# Create metric
+scorer = ContextEntityRecall(llm=llm)
 
-await scorer.single_turn_ascore(sample)
+# Evaluate
+result = await scorer.ascore(
+    reference="The Eiffel Tower is located in Paris.",
+    retrieved_contexts=["The Eiffel Tower is located in Paris."]
+)
+print(f"Context Entity Recall Score: {result.value}")
 ```
-Output
+
+Output:
 ```
-0.999999995
+Context Entity Recall Score: 0.999999995
 ```
 
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        reference="The Eiffel Tower is located in Paris.",
+        retrieved_contexts=["The Eiffel Tower is located in Paris."]
+    )
+    ```
+
 ### How It’s Calculated
 
 
@@ -65,3 +82,29 @@ Let us consider the reference and the retrieved contexts given above.
 
 We can see that the first context had a high entity recall, because it has a better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
 
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with SingleTurnSample
+
+```python
+from ragas import SingleTurnSample
+from ragas.metrics import ContextEntityRecall
+
+sample = SingleTurnSample(
+    reference="The Eiffel Tower is located in Paris.",
+    retrieved_contexts=["The Eiffel Tower is located in Paris."],
+)
+
+scorer = ContextEntityRecall(llm=evaluator_llm)
+
+await scorer.single_turn_ascore(sample)
+```
+Output:
+```
+0.999999995
+```

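As a worked illustration of the entity-recall computation described in the context_entities_recall.md hunk above: once entities have been extracted (the metric uses an LLM for this step; the sets below are written out by hand), the score reduces to a set-intersection ratio.

```python
# Entities extracted from the reference and from one retrieved context
# (hand-written here purely for illustration).
reference_entities = {"Eiffel Tower", "Paris"}
context_entities = {"Eiffel Tower", "Paris", "France"}

# Context Entity Recall = |CE ∩ RE| / |RE|
recall = len(context_entities & reference_entities) / len(reference_entities)
print(recall)  # 1.0: every reference entity appears in the retrieved context
```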
0 commit comments
