
Commit 0a32987

nvidia docs (#1940)
1 parent 48b82ab commit 0a32987


5 files changed: +208 -3 lines changed


docs/concepts/metrics/available_metrics/index.md

Lines changed: 5 additions & 0 deletions
@@ -14,6 +14,11 @@ Each metric are essentially paradigms that are designed to evaluate a particular
- [Multimodal Faithfulness](multi_modal_faithfulness.md)
- [Multimodal Relevance](multi_modal_relevance.md)

## Nvidia Metrics
- [Answer Accuracy](nvidia_metrics.md#answer-accuracy)
- [Context Relevance](nvidia_metrics.md#context-relevance)
- [Response Groundedness](nvidia_metrics.md#response-groundedness)

## Agents or Tool use cases

- [Topic adherence](agents.md#topic_adherence)
docs/concepts/metrics/available_metrics/nvidia_metrics.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# Nvidia Metrics

## Answer Accuracy

**Answer Accuracy** measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-judge" prompts that each return a rating (0, 2, or 4). The metric converts these ratings onto a [0,1] scale and then takes the average of the two judges' scores. Higher scores indicate that the model’s answer closely matches the reference.

- **0** → The **response** is inaccurate or does not address the same question as the **reference**.
- **2** → The **response** partially aligns with the **reference**.
- **4** → The **response** exactly aligns with the **reference**.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AnswerAccuracy

sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879."
)

scorer = AnswerAccuracy(llm=evaluator_llm)  # evaluator_llm wrapped with ragas LLM Wrapper
score = await scorer.single_turn_ascore(sample)
print(score)
```
Output
```
1.0
```
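The snippet above assumes an `evaluator_llm` has already been constructed. As a minimal sketch (assuming an OpenAI chat model accessed through LangChain; the model name is only an example), it could be created like this:

```python
# Sketch only: any LangChain-compatible chat model wrapped with ragas'
# LangchainLLMWrapper can serve as the judge LLM.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # example model name
```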
### How It’s Calculated

**Step 1:** The LLM generates ratings using two distinct templates to ensure robustness:

- **Template 1:** The LLM compares the **response** with the **reference** and rates it on a scale of **0, 2, or 4**.
- **Template 2:** The LLM evaluates the same question again, but this time the roles of the **response** and the **reference** are swapped.

This dual-perspective approach helps ensure a fair assessment of the answer's accuracy.

**Step 2:** If both ratings are valid, the final score is the average of the two; otherwise, the single valid rating is used. A worked sketch of this conversion follows the example below.

**Example Calculation:**

- **User Input:** "When was Einstein born?"
- **Response:** "Albert Einstein was born in 1879."
- **Reference:** "Albert Einstein was born in 1879."

Assuming both templates return a rating of **4** (indicating an exact match), the conversion is as follows:

- A rating of **4** corresponds to **1** on the [0,1] scale.
- Averaging the two scores: (1 + 1) / 2 = **1**.

Thus, the final **Answer Accuracy** score is **1**.
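To make the aggregation concrete, here is a small illustrative sketch of the rating-to-score logic described in Steps 1 and 2 (not the library's internal implementation): ratings on the 0/2/4 scale map to 0.0/0.5/1.0, and the two judge scores are averaged, falling back to the single valid rating if the other is unusable.

```python
# Illustrative sketch of the Answer Accuracy aggregation, not ragas' internal code.
def answer_accuracy_score(rating1, rating2):
    """Combine two judge ratings on the 0/2/4 scale into a single [0, 1] score."""
    def normalize(rating):
        # 0 -> 0.0, 2 -> 0.5, 4 -> 1.0; anything else is treated as invalid.
        return rating / 4 if rating in (0, 2, 4) else None

    scores = [s for s in (normalize(rating1), normalize(rating2)) if s is not None]
    if not scores:
        return float("nan")           # neither judge returned a usable rating
    return sum(scores) / len(scores)  # average, or the single valid score


print(answer_accuracy_score(4, 4))  # 1.0, matching the worked example above
```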
### Similar Ragas Metrics

1. [Answer Correctness](answer_correctness.md): This metric gauges the accuracy of the generated answer compared to the ground truth by considering both semantic and factual similarity.

2. [Rubric Score](general_purpose.md#rubrics-based-criteria-scoring): The Rubric-Based Criteria Scoring Metric allows evaluations based on user-defined rubrics, where each rubric outlines specific scoring criteria. The LLM assesses responses according to these customized descriptions, ensuring a consistent and objective evaluation process.

### Comparison of Metrics

#### Answer Correctness vs. Answer Accuracy

- **LLM Calls:** Answer Correctness requires three LLM calls (two for decomposing the response and reference into standalone statements and one for classifying them), while Answer Accuracy uses two independent LLM judgments.
- **Token Usage:** Answer Correctness consumes far more tokens due to its detailed breakdown and classification process.
- **Explainability:** Answer Correctness offers high explainability by providing detailed insights into factual correctness and semantic similarity, whereas Answer Accuracy provides a straightforward raw score.
- **Robust Evaluation:** Answer Accuracy ensures consistency through dual LLM evaluations, while Answer Correctness offers a holistic view by deeply assessing the quality of the response.

#### Answer Accuracy vs. Rubric Score

- **LLM Calls**: Answer Accuracy makes two calls (one per LLM judge), while Rubric Score requires only one.
- **Token Usage**: Answer Accuracy is minimal since it outputs just a score, whereas Rubric Score generates reasoning, increasing token consumption.
- **Explainability**: Answer Accuracy provides a raw score without justification, while Rubric Score offers reasoning along with its verdict.
- **Efficiency**: Answer Accuracy is lightweight and works very well with smaller models.
## Context Relevance

**Context Relevance** evaluates whether the **retrieved_contexts** (chunks or passages) are pertinent to the **user_input**. This is done via two independent "LLM-as-a-judge" prompt calls that each rate the relevance on a scale of **0, 1, or 2**. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.

- **0** → The retrieved contexts are not relevant to the user’s query at all.
- **1** → The contexts are partially relevant.
- **2** → The contexts are completely relevant.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ContextRelevance

sample = SingleTurnSample(
    user_input="When and Where Albert Einstein was born?",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
    ]
)

scorer = ContextRelevance(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)
```
Output
```
1.0
```
### How It’s Calculated

**Step 1:** The LLM is prompted with two distinct templates (template_relevance1 and template_relevance2) to evaluate the relevance of the retrieved contexts with respect to the user's query. Each prompt returns a relevance rating of **0**, **1**, or **2**.

**Step 2:** Each rating is normalized to a [0,1] scale by dividing by 2. If both ratings are valid, the final score is the average of these normalized values; if only one is valid, that score is used. A short sketch of this normalization appears after the example below.

**Example Calculation:**

- **User Input:** "When and Where Albert Einstein was born?"
- **Retrieved Contexts:**
  - "Albert Einstein was born March 14, 1879."
  - "Albert Einstein was born at Ulm, in Württemberg, Germany."

In this example, the two retrieved contexts together fully address the user's query by providing both the birth date and location of Albert Einstein. Consequently, both prompts would rate the combined contexts as **2** (fully relevant). Normalizing each score yields **1.0** (2/2), and averaging the two results keeps the final Context Relevance score at **1**.
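The same averaging idea applies here, just on the 0/1/2 scale. A brief illustrative sketch (again, not the library's internal code) that also shows the fallback when one judge fails to return a usable rating:

```python
# Illustrative sketch of the Context Relevance aggregation on the 0/1/2 scale.
def context_relevance_score(rating1, rating2):
    valid = [r / 2 for r in (rating1, rating2) if r in (0, 1, 2)]
    if not valid:
        return float("nan")         # neither judge produced a usable rating
    return sum(valid) / len(valid)  # average, or the single valid score


print(context_relevance_score(2, 2))     # 1.0 -> both judges rate the contexts fully relevant
print(context_relevance_score(2, None))  # 1.0 -> falls back to the single valid rating
```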
### Similar Ragas Metrics

1. [Context Precision](context_precision.md): It measures the proportion of retrieved contexts that are relevant to answering a user's query. It is computed as the mean precision@k across all retrieved chunks, indicating how accurately the retrieval system ranks relevant information.

2. [Context Recall](context_recall.md): It quantifies the extent to which the relevant information is successfully retrieved. It is calculated as the ratio of the number of relevant claims (or contexts) found in the retrieved results to the total number of relevant claims in the reference, ensuring that important information is not missed.

3. [Rubric Score](general_purpose.md#rubrics-based-criteria-scoring): The Rubric-Based Criteria Scoring Metric evaluates responses based on user-defined rubrics with customizable scoring criteria, ensuring consistent and objective assessments. The scoring scale is flexible to suit user needs.

#### Context Precision and Context Recall vs. Context Relevance

- **LLM Calls:** Context Precision and Context Recall each require one LLM call: one verifies whether each context was useful for arriving at the reference (verdict "1" or "0"), and the other classifies each answer sentence as attributable to the context (binary 'Yes' (1) or 'No' (0)). Context Relevance instead uses two LLM calls for increased robustness.
- **Token Usage:** Context Precision and Context Recall consume far more tokens, whereas Context Relevance is more token-efficient.
- **Explainability:** Context Precision and Context Recall offer high explainability with detailed reasoning, while Context Relevance provides a raw score without explanations.
- **Robust Evaluation:** Context Relevance delivers a more robust evaluation through dual LLM judgments compared to the single-call approach of Context Precision and Context Recall.
## Response Groundedness

**Response Groundedness** measures how well a response is supported or "grounded" by the retrieved contexts. It assesses whether each claim in the response can be found, either wholly or partially, in the provided contexts.

- **0** → The response is **not** grounded in the context at all.
- **1** → The response is partially grounded.
- **2** → The response is fully grounded (every statement can be found or inferred from the retrieved context).

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseGroundedness

sample = SingleTurnSample(
    response="Albert Einstein was born in 1879.",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
    ]
)

scorer = ResponseGroundedness(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)
```
Output
```
1.0
```
### How It’s Calculated

**Step 1:** The LLM is prompted with two distinct templates to evaluate the grounding of the response with respect to the retrieved contexts. Each prompt returns a grounding rating of **0**, **1**, or **2**.

**Step 2:** Each rating is normalized to a [0,1] scale by dividing by 2 (i.e., 0 becomes 0.0, 1 becomes 0.5, and 2 becomes 1.0). If both ratings are valid, the final score is computed as the average of these normalized values; if only one is valid, that score is used.

**Example Calculation:**

- **Response:** "Albert Einstein was born in 1879."
- **Retrieved Contexts:**
  - "Albert Einstein was born March 14, 1879."
  - "Albert Einstein was born at Ulm, in Württemberg, Germany."

In this example, the retrieved contexts provide both the birth date and location of Albert Einstein. Since the response's claim is supported by the context (the response states only the year from the full date given in the context), both prompts would likely rate the grounding as **2** (fully grounded). Normalizing a score of 2 gives **1.0** (2/2), and averaging the two normalized ratings keeps the final Response Groundedness score at **1**.
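Since all three Nvidia metrics operate on single-turn samples, they can also be scored together over a dataset. A minimal sketch, assuming the same `evaluator_llm` as in the earlier examples and using purely illustrative sample values:

```python
# Sketch: scoring Answer Accuracy, Context Relevance, and Response Groundedness together.
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
from ragas.metrics import AnswerAccuracy, ContextRelevance, ResponseGroundedness

samples = [
    SingleTurnSample(
        user_input="When was Einstein born?",
        response="Albert Einstein was born in 1879.",
        reference="Albert Einstein was born in 1879.",
        retrieved_contexts=["Albert Einstein was born March 14, 1879."],
    ),
]

results = evaluate(
    dataset=EvaluationDataset(samples=samples),
    metrics=[
        AnswerAccuracy(llm=evaluator_llm),
        ContextRelevance(llm=evaluator_llm),
        ResponseGroundedness(llm=evaluator_llm),
    ],
)
print(results)  # per-metric scores aggregated over the dataset
```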
### Similar Ragas Metrics

1. [Faithfulness](faithfulness.md): This metric measures how factually consistent a response is with the retrieved context, ensuring that every claim in the response is supported by the provided information. The Faithfulness score ranges from 0 to 1, with higher scores indicating better consistency.

2. [Rubric Score](general_purpose.md#rubrics-based-criteria-scoring): This is a general-purpose metric that evaluates responses based on user-defined criteria and can be adapted to assess Answer Accuracy, Context Relevance, or Response Groundedness by aligning the rubric with the respective requirements.

### Comparison of Metrics

#### Faithfulness vs. Response Groundedness

- **LLM Calls:** Faithfulness requires two calls for detailed claim breakdown and verdicts, while Response Groundedness uses two independent LLM judgments.
- **Token Usage:** Faithfulness consumes more tokens, whereas Response Groundedness is more token-efficient.
- **Explainability:** Faithfulness provides transparent reasoning for each claim, while Response Groundedness provides only a raw score.
- **Robust Evaluation:** Faithfulness incorporates user input for a comprehensive assessment, whereas Response Groundedness ensures consistency through dual LLM evaluations.

mkdocs.yml

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@ nav:
- Noise Sensitivity: concepts/metrics/available_metrics/noise_sensitivity.md
- Response Relevancy: concepts/metrics/available_metrics/answer_relevance.md
- Faithfulness: concepts/metrics/available_metrics/faithfulness.md
- Nvidia Metrics:
  - Answer Accuracy: concepts/metrics/available_metrics/nvidia_metrics.md#answer-accuracy
  - Context Relevance: concepts/metrics/available_metrics/nvidia_metrics.md#context-relevance
  - Response Groundedness: concepts/metrics/available_metrics/nvidia_metrics.md#response-groundedness
- Agents or Tool Use Cases:
  - concepts/metrics/available_metrics/agents.md
  - Topic Adherence: concepts/metrics/available_metrics/agents/#topic-adherence

src/ragas/metrics/_nv_metrics.py

Lines changed: 3 additions & 3 deletions
@@ -79,7 +79,7 @@ class AnswerAccuracy(MetricWithLLM, SingleTurnMetric):
        "{answer1}: {sentence_true}\n\n"
        "Rating: "
    )
-   retry = 5 # Number of retries if rating is not in the first 8 tokens.
+   retry = 5 # Number of retries if rating is not in the first 8 tokens.

    def process_score(self, response):
        for i in range(5):
@@ -214,7 +214,7 @@ class ContextRelevance(MetricWithLLM, SingleTurnMetric):
        "Do not try to explain.\n"
        "Based on the provided Question and Context, the Relevance score is ["
    )
-   retry = 5 # Number of retries if rating is not in the first 8 tokens.
+   retry = 5 # Number of retries if rating is not in the first 8 tokens.

    def process_score(self, response):
        for i in [2, 1, 0]:
@@ -348,7 +348,7 @@ class ResponseGroundedness(MetricWithLLM, SingleTurnMetric):
        "Do not explain."
        "Based on the provided context and response, the Groundedness score is:"
    )
-   retry = 5 # Number of retries if rating is not in the first 8 tokens.
+   retry = 5 # Number of retries if rating is not in the first 8 tokens.

    def process_score(self, response):
        for i in [2, 1, 0]:

src/ragas/metrics/base.py

Lines changed: 1 addition & 0 deletions
@@ -732,6 +732,7 @@ def from_discrete(
    return verdict_agg

+
@t.runtime_checkable
class ModeMetric(t.Protocol):
    name: str
