
Commit 5cb5339

sanjeed5 authored and anistark committed
docs: complete collections API documentation for remaining metrics (#2420)
## Issue Link / Problem Description

- Follow-up to PR #2407
- Completes the migration to collections-based API documentation for metrics that were not covered in the initial PR

## Changes Made

- Updated **ContextRecall** documentation to showcase `ragas.metrics.collections.ContextRecall` as the primary example
- Updated **FactualCorrectness** documentation to showcase `ragas.metrics.collections.FactualCorrectness` with configuration options (mode, atomicity, coverage)
- Updated **ResponseGroundedness** documentation in nvidia_metrics.md to showcase `ragas.metrics.collections.ResponseGroundedness` as the primary example
- Moved all legacy API examples to "Legacy Metrics API" sections with deprecation warnings
- Added synchronous usage notes (`.score()` method) for all three metrics
- Preserved all conceptual explanations and "How It's Calculated" sections

## Testing

### How to Test

- [x] Automated tests added/updated: N/A (documentation only)
- [x] Manual testing steps:
    1. Verified `make build-docs` succeeds without errors ✓
    2. Tested all new code examples to ensure they work as documented
    3. Confirmed output values match expected results
    4. Verified consistency with PR #2407 documentation style

## References

- Related issues: Follow-up to PR #2407
- Documentation:
    - Updated: `docs/concepts/metrics/available_metrics/context_recall.md`
    - Updated: `docs/concepts/metrics/available_metrics/factual_correctness.md`
    - Updated: `docs/concepts/metrics/available_metrics/nvidia_metrics.md` (ResponseGroundedness section)
    - Pattern reference: PR #2407 (faithfulness.md, context_precision.md, answer_correctness.md)

## Screenshots/Examples (if applicable)

All three metrics now follow the consistent pattern:

1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts**: Implementation-agnostic explanation
3. **Synchronous Usage Note**: `.score()` method alternative
4. **Legacy Section**: Original API with deprecation timeline warnings
1 parent 1197ef9 commit 5cb5339
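As a quick orientation, here is a condensed sketch of the collections-based pattern this commit standardizes on, assembled from the documented examples in the diffs below (the `gpt-4o-mini` model name and `AsyncOpenAI` client come from those examples; the `asyncio.run` wrapper is only added here so the snippet runs as a standalone script):

```python
import asyncio

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ContextRecall

# Collections-based API: wrap an LLM once, then construct the metric with it
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
scorer = ContextRecall(llm=llm)


async def main() -> None:
    # Primary (async) usage shown in the updated docs
    result = await scorer.ascore(
        user_input="Where is the Eiffel Tower located?",
        retrieved_contexts=["Paris is the capital of France."],
        reference="The Eiffel Tower is located in Paris.",
    )
    print(f"Context Recall Score: {result.value}")


asyncio.run(main())

# Synchronous alternative noted on each updated page:
# result = scorer.score(...)  # same keyword arguments as .ascore()
```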

File tree

3 files changed: +233 -50 lines

docs/concepts/metrics/available_metrics/context_recall.md

Lines changed: 48 additions & 11 deletions
@@ -1,22 +1,59 @@
 # Context Recall
 
-Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.
-In short, recall is about not missing anything important. Since it is about not missing anything, calculating context recall always requires a reference to compare against.
-
-
-
-## LLM Based Context Recall
-
-`LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy to `reference_contexts` which also makes it easier to use as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
+Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. In short, recall is about not missing anything important.
 
+Since it is about not missing anything, calculating context recall always requires a reference to compare against. The LLM-based Context Recall metric uses `reference` as a proxy to `reference_contexts`, which makes it easier to use as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
 
 The formula for calculating context recall is as follows:
 
 $$
 \text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}
 $$
 
-### Example
+## Example
+
+```python
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import ContextRecall
+
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+
+# Create metric
+scorer = ContextRecall(llm=llm)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="Where is the Eiffel Tower located?",
+    retrieved_contexts=["Paris is the capital of France."],
+    reference="The Eiffel Tower is located in Paris."
+)
+print(f"Context Recall Score: {result.value}")
+```
+
+Output:
+
+```
+Context Recall Score: 1.0
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="Where is the Eiffel Tower located?",
+        retrieved_contexts=["Paris is the capital of France."],
+        reference="The Eiffel Tower is located in Paris."
+    )
+    ```
+
+## LLM Based Context Recall (Legacy API)
+
+!!! warning "Legacy API"
+    The following example uses the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above. This API will be deprecated in version 0.4 and removed in version 1.0.
 
 ```python
 from ragas.dataset_schema import SingleTurnSample
@@ -31,9 +68,9 @@ sample = SingleTurnSample(
 
 context_recall = LLMContextRecall(llm=evaluator_llm)
 await context_recall.single_turn_ascore(sample)
-
 ```
-Output
+
+Output:
 ```
 1.0
 ```
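The recall formula in this page reduces to a ratio over per-claim verdicts. A minimal illustrative helper (not the ragas implementation; in practice the verdicts come from the LLM attribution step described above):

```python
def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims attributable to the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# The single reference claim is supported by the retrieved context -> 1.0,
# matching the documented example output.
print(context_recall([True]))  # 1.0
```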

docs/concepts/metrics/available_metrics/factual_correctness.md

Lines changed: 125 additions & 30 deletions
@@ -2,6 +2,76 @@
 
 `FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.
 
+### Example
+
+```python
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import FactualCorrectness
+
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+
+# Create metric
+scorer = FactualCorrectness(llm=llm)
+
+# Evaluate
+result = await scorer.ascore(
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
+)
+print(f"Factual Correctness Score: {result.value}")
+```
+
+Output:
+
+```
+Factual Correctness Score: 0.67
+```
+
+By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter:
+
+```python
+# Precision mode - measures what fraction of response claims are supported by reference
+scorer = FactualCorrectness(llm=llm, mode="precision")
+result = await scorer.ascore(
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
+)
+print(f"Precision Score: {result.value}")
+```
+
+Output:
+
+```
+Precision Score: 1.0
+```
+
+You can also configure the claim decomposition granularity using `atomicity` and `coverage` parameters:
+
+```python
+# High granularity - more detailed claim decomposition
+scorer = FactualCorrectness(
+    llm=llm,
+    mode="f1",
+    atomicity="high",  # More atomic claims
+    coverage="high"    # Comprehensive coverage
+)
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        response="The Eiffel Tower is located in Paris.",
+        reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
+    )
+    ```
+
+### How It's Calculated
+
 The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:
 
 $$
@@ -30,36 +100,6 @@ $$
 \text{F1 Score} = {2 \times \text{Precision} \times \text{Recall} \over (\text{Precision} + \text{Recall})}
 $$
 
-### Example
-
-```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics._factual_correctness import FactualCorrectness
-
-
-sample = SingleTurnSample(
-    response="The Eiffel Tower is located in Paris.",
-    reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
-)
-
-scorer = FactualCorrectness(llm = evaluator_llm)
-await scorer.single_turn_ascore(sample)
-```
-Output
-```
-0.67
-```
-
-By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter.
-
-```python
-scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
-```
-Output
-```
-1.0
-```
-
 ### Controlling the Number of Claims
 
 Each sentence in the response and reference can be broken down into one or more claims. The number of claims that are generated from a single sentence is determined by the level of `atomicity` and `coverage` required for your application.
@@ -161,3 +201,58 @@ By adjusting both atomicity and coverage, you can customize the level of detail
 - Use **Low Atomicity and Low Coverage** when only the key information is necessary, such as for summarization.
 
 This flexibility in controlling the number of claims helps ensure that the information is presented at the right level of granularity for your application's requirements.
+
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with SingleTurnSample
+
+```python
+from ragas.dataset_schema import SingleTurnSample
+from ragas.metrics._factual_correctness import FactualCorrectness
+
+
+sample = SingleTurnSample(
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
+)
+
+scorer = FactualCorrectness(llm = evaluator_llm)
+await scorer.single_turn_ascore(sample)
+```
+
+Output:
+
+```
+0.67
+```
+
+### Changing the Mode
+
+By default, the mode is set to `F1`, you can change the mode to `precision` or `recall` by setting the `mode` parameter.
+
+```python
+scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
+```
+
+Output:
+
+```
+1.0
+```
+
+### Controlling Atomicity
+
+```python
+scorer = FactualCorrectness(mode="precision", atomicity="low")
+```
+
+Output:
+
+```
+1.0
+```
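For the "How It's Calculated" section in this page, precision, recall, and F1 follow the standard claim-count definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). A minimal illustrative sketch over claim counts, not the ragas implementation:

```python
def factual_correctness(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Score from claim-level counts: TP/FP over response claims, FN over reference claims."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    # F1 = 2 * P * R / (P + R)
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


# Documented example: 1 supported response claim, 0 unsupported, 1 reference claim not covered
print(round(factual_correctness(tp=1, fp=0, fn=1), 2))  # 0.67
```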

docs/concepts/metrics/available_metrics/nvidia_metrics.md

Lines changed: 60 additions & 9 deletions
@@ -248,28 +248,47 @@ Output:
 - **1** → The response is partially grounded.
 - **2** → The response is fully grounded (every statement can be found or inferred from the retrieved context).
 
+### Example
 
 ```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import ResponseGroundedness
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.metrics.collections import ResponseGroundedness
 
-sample = SingleTurnSample(
+# Setup LLM
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+
+# Create metric
+scorer = ResponseGroundedness(llm=llm)
+
+# Evaluate
+result = await scorer.ascore(
     response="Albert Einstein was born in 1879.",
     retrieved_contexts=[
         "Albert Einstein was born March 14, 1879.",
         "Albert Einstein was born at Ulm, in Württemberg, Germany.",
     ]
 )
-
-scorer = ResponseGroundedness(llm=evaluator_llm)
-score = await scorer.single_turn_ascore(sample)
-print(score)
+print(f"Response Groundedness Score: {result.value}")
 ```
-Output
+
+Output:
+
 ```
-1.0
+Response Groundedness Score: 1.0
 ```
 
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        response="Albert Einstein was born in 1879.",
+        retrieved_contexts=[...]
+    )
+    ```
+
 ### How It’s Calculated
 
 **Step 1:** The LLM is prompted with two distinct templates to evaluate the grounding of the response with respect to the retrieved contexts. Each prompt returns a grounding rating of **0**, **1**, or **2**.
@@ -299,3 +318,35 @@ In this example, the retrieved contexts provide both the birthdate and location
 - **Token Usage:** Faithfulness consumes more tokens, whereas Response Groundedness is more token-efficient.
 - **Explainability:** Faithfulness provides transparent, reasoning for each claim, while Response Groundedness provides a raw score.
 - **Robust Evaluation:** Faithfulness incorporates user input for a comprehensive assessment, whereas Response Groundedness ensures consistency through dual LLM evaluations.
+
+### Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+#### Example with SingleTurnSample
+
+```python
+from ragas.dataset_schema import SingleTurnSample
+from ragas.metrics import ResponseGroundedness
+
+sample = SingleTurnSample(
+    response="Albert Einstein was born in 1879.",
+    retrieved_contexts=[
+        "Albert Einstein was born March 14, 1879.",
+        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
+    ]
+)
+
+scorer = ResponseGroundedness(llm=evaluator_llm)
+score = await scorer.single_turn_ascore(sample)
+print(score)
+```
+
+Output:
+
+```
+1.0
+```
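The "How It's Calculated" section in this page describes two template ratings on a 0–2 scale. A minimal sketch of one plausible aggregation, assuming the two ratings are averaged and rescaled to [0, 1] (consistent with the 1.0 output above, but an assumption rather than the exact ragas code):

```python
def response_groundedness(rating_a: int, rating_b: int) -> float:
    """Average two 0-2 grounding ratings and normalize to [0, 1] (assumed aggregation)."""
    for rating in (rating_a, rating_b):
        if rating not in (0, 1, 2):
            raise ValueError("ratings must be 0, 1, or 2")
    return ((rating_a + rating_b) / 2) / 2


# Both templates rate the Einstein response as fully grounded (2) -> 1.0
print(response_groundedness(2, 2))  # 1.0
```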
