From b9d1f47dc2ea0d8c5b8a6d7b311e18c5aa364f56 Mon Sep 17 00:00:00 2001 From: sanjeed5 Date: Wed, 12 Nov 2025 16:33:25 +0530 Subject: [PATCH 1/3] docs: update ResponseGroundedness metric documentation to Collections API - Added new primary example using collections-based API with ResponseGroundedness - Added synchronous usage note with .score() method - Moved legacy SingleTurnSample example to Legacy Metrics API section - Tested new example and verified it produces expected output (score: 1.0) --- .../available_metrics/nvidia_metrics.md | 69 ++++++++++++++++--- 1 file changed, 60 insertions(+), 9 deletions(-) diff --git a/docs/concepts/metrics/available_metrics/nvidia_metrics.md b/docs/concepts/metrics/available_metrics/nvidia_metrics.md index 7cdb7f202..56c1ba2b1 100644 --- a/docs/concepts/metrics/available_metrics/nvidia_metrics.md +++ b/docs/concepts/metrics/available_metrics/nvidia_metrics.md @@ -248,28 +248,47 @@ Output: - **1** → The response is partially grounded. - **2** → The response is fully grounded (every statement can be found or inferred from the retrieved context). +### Example ```python -from ragas.dataset_schema import SingleTurnSample -from ragas.metrics import ResponseGroundedness +from openai import AsyncOpenAI +from ragas.llms import llm_factory +from ragas.metrics.collections import ResponseGroundedness -sample = SingleTurnSample( +# Setup LLM +client = AsyncOpenAI() +llm = llm_factory("gpt-4o-mini", client=client) + +# Create metric +scorer = ResponseGroundedness(llm=llm) + +# Evaluate +result = await scorer.ascore( response="Albert Einstein was born in 1879.", retrieved_contexts=[ "Albert Einstein was born March 14, 1879.", "Albert Einstein was born at Ulm, in Württemberg, Germany.", ] ) - -scorer = ResponseGroundedness(llm=evaluator_llm) -score = await scorer.single_turn_ascore(sample) -print(score) +print(f"Response Groundedness Score: {result.value}") ``` -Output + +Output: + ``` -1.0 +Response Groundedness Score: 1.0 ``` +!!! note "Synchronous Usage" + If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`: + + ```python + result = scorer.score( + response="Albert Einstein was born in 1879.", + retrieved_contexts=[...] + ) + ``` + ### How It’s Calculated **Step 1:** The LLM is prompted with two distinct templates to evaluate the grounding of the response with respect to the retrieved contexts. Each prompt returns a grounding rating of **0**, **1**, or **2**. @@ -299,3 +318,35 @@ In this example, the retrieved contexts provide both the birthdate and location - **Token Usage:** Faithfulness consumes more tokens, whereas Response Groundedness is more token-efficient. - **Explainability:** Faithfulness provides transparent, reasoning for each claim, while Response Groundedness provides a raw score. - **Robust Evaluation:** Faithfulness incorporates user input for a comprehensive assessment, whereas Response Groundedness ensures consistency through dual LLM evaluations. + +### Legacy Metrics API + +The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above. + +!!! warning "Deprecation Timeline" + This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above. 
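
When migrating existing evaluation scripts, a common need is scoring many rows at once. The sketch below shows one way to do this with the collections-based scorer and `asyncio.gather`; it is a minimal sketch, and the `rows` list and its field names are illustrative assumptions, not part of the Ragas API.

```python
import asyncio

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ResponseGroundedness

client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
scorer = ResponseGroundedness(llm=llm)

# Illustrative rows; replace with your own data.
rows = [
    {
        "response": "Albert Einstein was born in 1879.",
        "retrieved_contexts": ["Albert Einstein was born March 14, 1879."],
    },
]

async def score_all(rows):
    # Run the async scorer over all rows concurrently.
    results = await asyncio.gather(*(scorer.ascore(**row) for row in rows))
    return [result.value for result in results]

print(asyncio.run(score_all(rows)))
```

In a notebook or any environment with a running event loop, you can `await score_all(rows)` directly instead of using `asyncio.run`.
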
+ +#### Example with SingleTurnSample + +```python +from ragas.dataset_schema import SingleTurnSample +from ragas.metrics import ResponseGroundedness + +sample = SingleTurnSample( + response="Albert Einstein was born in 1879.", + retrieved_contexts=[ + "Albert Einstein was born March 14, 1879.", + "Albert Einstein was born at Ulm, in Württemberg, Germany.", + ] +) + +scorer = ResponseGroundedness(llm=evaluator_llm) +score = await scorer.single_turn_ascore(sample) +print(score) +``` + +Output: + +``` +1.0 +``` From 78bfab11723df80df962bec780c68d0134019696 Mon Sep 17 00:00:00 2001 From: sanjeed5 Date: Wed, 12 Nov 2025 16:44:41 +0530 Subject: [PATCH 2/3] docs: update FactualCorrectness to collections-based API --- .../available_metrics/factual_correctness.md | 155 ++++++++++++++---- 1 file changed, 125 insertions(+), 30 deletions(-) diff --git a/docs/concepts/metrics/available_metrics/factual_correctness.md b/docs/concepts/metrics/available_metrics/factual_correctness.md index 70edd8d76..d74bdf97e 100644 --- a/docs/concepts/metrics/available_metrics/factual_correctness.md +++ b/docs/concepts/metrics/available_metrics/factual_correctness.md @@ -2,6 +2,76 @@ `FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter. +### Example + +```python +from openai import AsyncOpenAI +from ragas.llms import llm_factory +from ragas.metrics.collections import FactualCorrectness + +# Setup LLM +client = AsyncOpenAI() +llm = llm_factory("gpt-4o-mini", client=client) + +# Create metric +scorer = FactualCorrectness(llm=llm) + +# Evaluate +result = await scorer.ascore( + response="The Eiffel Tower is located in Paris.", + reference="The Eiffel Tower is located in Paris. It has a height of 1000ft." +) +print(f"Factual Correctness Score: {result.value}") +``` + +Output: + +``` +Factual Correctness Score: 0.67 +``` + +By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter: + +```python +# Precision mode - measures what fraction of response claims are supported by reference +scorer = FactualCorrectness(llm=llm, mode="precision") +result = await scorer.ascore( + response="The Eiffel Tower is located in Paris.", + reference="The Eiffel Tower is located in Paris. It has a height of 1000ft." +) +print(f"Precision Score: {result.value}") +``` + +Output: + +``` +Precision Score: 1.0 +``` + +You can also configure the claim decomposition granularity using `atomicity` and `coverage` parameters: + +```python +# High granularity - more detailed claim decomposition +scorer = FactualCorrectness( + llm=llm, + mode="f1", + atomicity="high", # More atomic claims + coverage="high" # Comprehensive coverage +) +``` + +!!! 
note "Synchronous Usage" + If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`: + + ```python + result = scorer.score( + response="The Eiffel Tower is located in Paris.", + reference="The Eiffel Tower is located in Paris. It has a height of 1000ft." + ) + ``` + +### How It's Calculated + The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows: $$ @@ -30,36 +100,6 @@ $$ \text{F1 Score} = {2 \times \text{Precision} \times \text{Recall} \over (\text{Precision} + \text{Recall})} $$ -### Example - -```python -from ragas.dataset_schema import SingleTurnSample -from ragas.metrics._factual_correctness import FactualCorrectness - - -sample = SingleTurnSample( - response="The Eiffel Tower is located in Paris.", - reference="The Eiffel Tower is located in Paris. I has a height of 1000ft." -) - -scorer = FactualCorrectness(llm = evaluator_llm) -await scorer.single_turn_ascore(sample) -``` -Output -``` -0.67 -``` - -By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter. - -```python -scorer = FactualCorrectness(llm = evaluator_llm, mode="precision") -``` -Output -``` -1.0 -``` - ### Controlling the Number of Claims Each sentence in the response and reference can be broken down into one or more claims. The number of claims that are generated from a single sentence is determined by the level of `atomicity` and `coverage` required for your application. @@ -161,3 +201,58 @@ By adjusting both atomicity and coverage, you can customize the level of detail - Use **Low Atomicity and Low Coverage** when only the key information is necessary, such as for summarization. This flexibility in controlling the number of claims helps ensure that the information is presented at the right level of granularity for your application's requirements. + +## Legacy Metrics API + +The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above. + +!!! warning "Deprecation Timeline" + This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above. + +### Example with SingleTurnSample + +```python +from ragas.dataset_schema import SingleTurnSample +from ragas.metrics._factual_correctness import FactualCorrectness + + +sample = SingleTurnSample( + response="The Eiffel Tower is located in Paris.", + reference="The Eiffel Tower is located in Paris. I has a height of 1000ft." +) + +scorer = FactualCorrectness(llm = evaluator_llm) +await scorer.single_turn_ascore(sample) +``` + +Output: + +``` +0.67 +``` + +### Changing the Mode + +By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter. 
+ +```python +scorer = FactualCorrectness(llm = evaluator_llm, mode="precision") +``` + +Output: + +``` +1.0 +``` + +### Controlling Atomicity + +```python +scorer = FactualCorrectness(mode="precision", atomicity="low") +``` + +Output: + +``` +1.0 +``` From 6168755486c3e4e34f3883fb8a03202136433366 Mon Sep 17 00:00:00 2001 From: sanjeed5 Date: Wed, 12 Nov 2025 19:32:21 +0530 Subject: [PATCH 3/3] docs: update ContextRecall metric documentation to collections API - Add primary example using ContextRecall from ragas.metrics.collections - Include synchronous usage note with .score() method - Move LLMContextRecall to legacy section with deprecation warning - Keep NonLLMContextRecall and IDBasedContextRecall as valid alternatives (no collections API equivalents) - Tested example and verified output --- .../available_metrics/context_recall.md | 59 +++++++++++++++---- 1 file changed, 48 insertions(+), 11 deletions(-) diff --git a/docs/concepts/metrics/available_metrics/context_recall.md b/docs/concepts/metrics/available_metrics/context_recall.md index 822d6a4c7..ebfd7eba3 100644 --- a/docs/concepts/metrics/available_metrics/context_recall.md +++ b/docs/concepts/metrics/available_metrics/context_recall.md @@ -1,14 +1,8 @@ # Context Recall -Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. -In short, recall is about not missing anything important. Since it is about not missing anything, calculating context recall always requires a reference to compare against. - - - -## LLM Based Context Recall - -`LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy to `reference_contexts` which also makes it easier to use as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context. +Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. In short, recall is about not missing anything important. +Since it is about not missing anything, calculating context recall always requires a reference to compare against. The LLM-based Context Recall metric uses `reference` as a proxy to `reference_contexts`, which makes it easier to use as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context. 
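
For example, if the reference decomposes into four claims and three of them can be attributed to the retrieved contexts, context recall is 3/4 = 0.75.
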
The formula for calculating context recall is as follows: @@ -16,7 +10,50 @@ $$ \text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}} $$ -### Example +## Example + +```python +from openai import AsyncOpenAI +from ragas.llms import llm_factory +from ragas.metrics.collections import ContextRecall + +# Setup LLM +client = AsyncOpenAI() +llm = llm_factory("gpt-4o-mini", client=client) + +# Create metric +scorer = ContextRecall(llm=llm) + +# Evaluate +result = await scorer.ascore( + user_input="Where is the Eiffel Tower located?", + retrieved_contexts=["Paris is the capital of France."], + reference="The Eiffel Tower is located in Paris." +) +print(f"Context Recall Score: {result.value}") +``` + +Output: + +``` +Context Recall Score: 1.0 +``` + +!!! note "Synchronous Usage" + If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`: + + ```python + result = scorer.score( + user_input="Where is the Eiffel Tower located?", + retrieved_contexts=["Paris is the capital of France."], + reference="The Eiffel Tower is located in Paris." + ) + ``` + +## LLM Based Context Recall (Legacy API) + +!!! warning "Legacy API" + The following example uses the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above. This API will be deprecated in version 0.4 and removed in version 1.0. ```python from ragas.dataset_schema import SingleTurnSample @@ -31,9 +68,9 @@ sample = SingleTurnSample( context_recall = LLMContextRecall(llm=evaluator_llm) await context_recall.single_turn_ascore(sample) - ``` -Output + +Output: ``` 1.0 ```
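
To make the context recall formula concrete, here is a minimal, illustrative sketch of how claim-level verdicts roll up into the final score. The claim decomposition and attribution are performed by the evaluator LLM; the hard-coded verdicts below are an assumption for illustration only and are not produced by any Ragas API.

```python
def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims that are supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Suppose the reference breaks down into two claims and the retrieved
# context supports only the first one (e.g. location but not completion date).
print(context_recall([True, False]))  # 0.5
```
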