If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

```python
result = scorer.score(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    reference="The first superbowl was held on January 15, 1967"
)
```

### Calculation

Next, we calculate the semantic similarity between the generated answer and the reference answer. Once we have the semantic similarity, we take a weighted average of the semantic similarity and the factual similarity calculated above to arrive at the final score. You can adjust this weightage by modifying the `weights` parameter.
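
As a rough sketch of how the two components are combined (the weight values below are only illustrative, not necessarily the library defaults):

```python
# Illustrative combination step: `factual_score` and `semantic_score` stand in
# for the factual similarity and semantic similarity computed above.
weights = [0.75, 0.25]        # example weights: [factuality, semantic similarity]
factual_score = 0.80
semantic_score = 0.95

answer_correctness = (
    weights[0] * factual_score + weights[1] * semantic_score
) / sum(weights)
print(answer_correctness)     # 0.8375 with these example numbers
```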
## Legacy Metrics API
The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
!!! warning "Deprecation Timeline"

    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
### Example with Dataset

```python
from datasets import Dataset
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

# Build a Dataset from the dict and evaluate it with the answer_correctness metric.
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_correctness])
score.to_pandas()
```
## Answer Relevancy

The **Answer Relevancy** metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input.

An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.

This metric is calculated using the `user_input` and the `response` as follows:

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)
$$

Where:

- $E_{g_i}$: Embedding of the $i^{th}$ generated question.
- $E_o$: Embedding of the user input.
- $N$: Number of generated questions (default is 3, configurable via the `strictness` parameter).

**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.
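
For example, with the default $N = 3$, if the cosine similarities between the three generated-question embeddings and the user-input embedding were 0.91, 0.89, and 0.93 (illustrative values), the score would be $(0.91 + 0.89 + 0.93)/3 = 0.91$.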
### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics.collections import AnswerRelevancy

# ... construct your evaluator LLM and embeddings with the factories above and
# create `scorer = AnswerRelevancy(...)` (setup lines are omitted in this excerpt) ...

result = await scorer.ascore(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967"
)
print(f"Answer Relevancy Score: {result.value}")
```
If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

```python
result = scorer.score(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967"
)
```
### How It’s Calculated

!!! example

To calculate the relevance of the answer to the given question, we follow two steps:

- **Step 1:** Generate a set of questions from the given answer using an LLM.
- **Step 2:** Calculate the mean cosine similarity between the generated questions and the actual question.

The underlying concept is that if the answer correctly addresses the question, it is highly probable that the original question can be reconstructed solely from the answer.
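
A minimal sketch of this procedure, assuming hypothetical `generate_questions` and `embed` helpers (they stand in for your LLM and embedding calls and are not part of the library API):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_answer_relevancy(user_input, response, generate_questions, embed, n=3):
    """Toy version of the metric: generate n questions from the answer,
    then average their cosine similarity to the original question."""
    questions = generate_questions(response, n=n)      # Step 1: LLM generates questions from the answer
    e_o = embed(user_input)                            # embedding of the original user input
    sims = [cosine(embed(q), e_o) for q in questions]  # Step 2: similarity per generated question
    return sum(sims) / len(sims)
```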
## Legacy Metrics API
The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
!!! warning "Deprecation Timeline"

    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
### Example with SingleTurnSample

```python
from ragas import SingleTurnSample
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ]
)

# Score the sample (assumes `evaluator_llm` and `evaluator_embeddings` are already configured).
scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
await scorer.single_turn_ascore(sample)
```
## Context Entities Recall

If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

```python
result = scorer.score(
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."]
)
```
### How It’s Calculated

Let us consider the reference and the retrieved contexts given above. We can see that the first context had a high entity recall, because it has better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use cases where entities are of importance.
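
As a rough sketch of the underlying idea (not the library's implementation), context entity recall is the fraction of the reference's entities that also appear in the retrieved context:

```python
def context_entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """Toy version: fraction of reference entities that the context covers."""
    if not reference_entities:
        return 0.0
    return len(reference_entities & context_entities) / len(reference_entities)

# Illustrative values, assuming an entity extractor has already been run:
reference_entities = {"Eiffel Tower", "Paris"}
context_1_entities = {"Eiffel Tower", "Paris"}   # covers both reference entities -> recall 1.0
context_2_entities = {"Eiffel Tower"}            # covers one of two -> recall 0.5
print(context_entity_recall(reference_entities, context_1_entities))
print(context_entity_recall(reference_entities, context_2_entities))
```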
## Legacy Metrics API
The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
!!! warning "Deprecation Timeline"

    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
### Example with SingleTurnSample

```python
from ragas import SingleTurnSample
from ragas.metrics import ContextEntityRecall

sample = SingleTurnSample(
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

# Score the sample (assumes `evaluator_llm` is already configured).
scorer = ContextEntityRecall(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```