Merged
18 commits
c648f73
docs: update faithfulness documentation to showcase collections API
sanjeed5 Nov 6, 2025
c2f7acb
docs: add deprecation timeline for legacy metrics API
sanjeed5 Nov 7, 2025
13a588e
docs: fix model name to gpt-4o-mini in faithfulness example
sanjeed5 Nov 7, 2025
a700a0b
docs: update AnswerAccuracy documentation to showcase collections API
sanjeed5 Nov 7, 2025
9f5ae4d
docs: update BleuScore documentation to showcase collections API
sanjeed5 Nov 7, 2025
dc7f5a0
docs: update ContextEntityRecall documentation to showcase collection…
sanjeed5 Nov 7, 2025
c2edd85
docs: update AnswerSimilarity to collections-based API
sanjeed5 Nov 7, 2025
b67fef9
docs: update AnswerRelevancy documentation to showcase collections API
sanjeed5 Nov 7, 2025
f7ba1dc
formating fixes
sanjeed5 Nov 7, 2025
077361d
docs: update ContextPrecision to collections API
sanjeed5 Nov 7, 2025
850602f
docs: update Context Relevance to collections-based API
sanjeed5 Nov 7, 2025
a27f5cc
docs: update RougeScore to collections-based API
sanjeed5 Nov 7, 2025
48265b4
docs: update semantic_similarity.md to use collections API
sanjeed5 Nov 7, 2025
e967181
docs: update SummaryScore metric documentation to collections-based API
sanjeed5 Nov 7, 2025
5179ed0
docs: update NoiseSensitivity metric to collections API
sanjeed5 Nov 7, 2025
5046604
docs: update AnswerCorrectness to collections-based API
sanjeed5 Nov 7, 2025
20e3e2e
docs: update string metrics (ExactMatch, StringPresence, NonLLMString…
sanjeed5 Nov 7, 2025
b83ab5e
docs: fix import ordering in Opik integration guide
sanjeed5 Nov 7, 2025
69 changes: 58 additions & 11 deletions docs/concepts/metrics/available_metrics/answer_correctness.md
@@ -16,20 +16,44 @@ Answer correctness encompasses two critical aspects: semantic similarity between
### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics.collections import AnswerCorrectness

# Setup LLM and embeddings
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client)

# Create metric
scorer = AnswerCorrectness(llm=llm, embeddings=embeddings)

# Evaluate
result = await scorer.ascore(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    reference="The first superbowl was held on January 15, 1967"
)
print(f"Answer Correctness Score: {result.value}")
```

Output:

```
Answer Correctness Score: 0.95
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967",
        reference="The first superbowl was held on January 15, 1967"
    )
    ```

### Calculation

@@ -57,3 +81,26 @@ Next, we calculate the semantic similarity between the generated answer and the

Once we have the semantic similarity, we take a weighted average of it and the factual similarity calculated above to arrive at the final score. You can adjust this weighting by modifying the `weights` parameter.
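
As a sketch of how that weighting might be tuned, and assuming the collections-based `AnswerCorrectness` exposes the same `weights` parameter as the legacy metric (factuality weight first, semantic similarity second), a factuality-heavy configuration could look like this:

```python
from ragas.metrics.collections import AnswerCorrectness

# Sketch only: weights=[factual similarity, semantic similarity].
# The parameter name and ordering mirror the legacy metric's default of [0.75, 0.25];
# confirm both against your installed ragas version.
scorer = AnswerCorrectness(
    llm=llm,                # LLM and embeddings from the setup in the example above
    embeddings=embeddings,
    weights=[0.9, 0.1],     # emphasize factual overlap over semantic similarity
)
```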

## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with Dataset

```python
from datasets import Dataset
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_correctness])
score.to_pandas()
```
88 changes: 69 additions & 19 deletions docs/concepts/metrics/available_metrics/answer_relevance.md
@@ -1,6 +1,8 @@
## Answer Relevancy

The **Answer Relevancy** metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input.

An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.

This metric is calculated using the `user_input` and the `response` as follows:

@@ -19,34 +21,50 @@
Where:
- $E_{g_i}$: Embedding of the $i^{th}$ generated question.
- $E_o$: Embedding of the user input.
- $N$: Number of generated questions (default is 3, configurable via the `strictness` parameter; see the configuration sketch below).

**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.
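
The number of generated questions is controlled by `strictness`, as noted above. A minimal configuration sketch, assuming `strictness` is accepted at construction time and reusing the `llm` and `embeddings` objects created in the example below:

```python
from ragas.metrics.collections import AnswerRelevancy

# Sketch: generate 5 questions per response instead of the default 3.
# Assumes the collections-based class accepts `strictness` at construction,
# mirroring the legacy metric; confirm against your installed ragas version.
scorer = AnswerRelevancy(llm=llm, embeddings=embeddings, strictness=5)
```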

### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics.collections import AnswerRelevancy

# Setup LLM and embeddings
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client, interface="modern")

# Create metric
scorer = AnswerRelevancy(llm=llm, embeddings=embeddings)

# Evaluate
result = await scorer.ascore(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967"
)
print(f"Answer Relevancy Score: {result.value}")
```

Output:

```
Answer Relevancy Score: 0.9165088378587264
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967"
    )
    ```

### How It’s Calculated

!!! example
@@ -67,3 +85,35 @@ To calculate the relevance of the answer to the given question, we follow two steps:
- **Step 2:** Calculate the mean cosine similarity between the generated questions and the actual question.

The underlying concept is that if the answer correctly addresses the question, it is highly probable that the original question can be reconstructed solely from the answer.
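
For intuition, Step 2 is nothing more than an average of pairwise cosine similarities. A minimal sketch, assuming you already have the original question and the generated questions as NumPy embedding vectors (the metric itself obtains them from the configured embedding model):

```python
import numpy as np

def mean_cosine_similarity(original: np.ndarray, generated: list[np.ndarray]) -> float:
    """Average cosine similarity between the original question and each generated question."""
    sims = [
        float(np.dot(original, g) / (np.linalg.norm(original) * np.linalg.norm(g)))
        for g in generated
    ]
    return float(np.mean(sims))

# Toy 3-dimensional vectors for illustration; real embeddings have hundreds of dimensions.
original_q = np.array([0.9, 0.1, 0.0])
generated_qs = [np.array([0.8, 0.2, 0.1]), np.array([0.85, 0.15, 0.05]), np.array([0.7, 0.3, 0.2])]
print(mean_cosine_similarity(original_q, generated_qs))
```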


## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with SingleTurnSample

```python
from ragas import SingleTurnSample
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ]
)

scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
await scorer.single_turn_ascore(sample)
```

Output:

```
0.9165088378587264
```
63 changes: 53 additions & 10 deletions docs/concepts/metrics/available_metrics/context_entities_recall.md
@@ -17,23 +17,40 @@
### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ContextEntityRecall

# Setup LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create metric
scorer = ContextEntityRecall(llm=llm)

# Evaluate
result = await scorer.ascore(
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."]
)
print(f"Context Entity Recall Score: {result.value}")
```

Output:
```
Context Entity Recall Score: 0.999999995
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        reference="The Eiffel Tower is located in Paris.",
        retrieved_contexts=["The Eiffel Tower is located in Paris."]
    )
    ```

### How It’s Calculated


@@ -65,3 +82,29 @@ Let us consider the reference and the retrieved contexts given above.

We can see that the first context has a higher entity recall because it covers more of the entities in the reference. If these two contexts were fetched by two retrieval mechanisms over the same set of documents, we could say that the first mechanism was better than the other in use cases where entities are important.
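
For intuition, once the entities have been extracted (the metric uses an LLM for that step), the recall itself reduces to a simple set computation. A minimal sketch with hypothetical entity sets:

```python
def context_entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """Fraction of reference entities that also appear in the retrieved context."""
    if not reference_entities:
        return 0.0
    return len(reference_entities & context_entities) / len(reference_entities)

reference = {"Eiffel Tower", "Paris"}
context_a = {"Eiffel Tower", "Paris", "1889"}  # covers both reference entities -> 1.0
context_b = {"Eiffel Tower"}                   # covers one of two -> 0.5

print(context_entity_recall(reference, context_a))
print(context_entity_recall(reference, context_b))
```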

## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with SingleTurnSample

```python
from ragas import SingleTurnSample
from ragas.metrics import ContextEntityRecall

sample = SingleTurnSample(
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

scorer = ContextEntityRecall(llm=evaluator_llm)

await scorer.single_turn_ascore(sample)
```
Output:
```
0.999999995
```