Merged
18 commits
c648f73
docs: update faithfulness documentation to showcase collections API
sanjeed5 Nov 6, 2025
c2f7acb
docs: add deprecation timeline for legacy metrics API
sanjeed5 Nov 7, 2025
13a588e
docs: fix model name to gpt-4o-mini in faithfulness example
sanjeed5 Nov 7, 2025
a700a0b
docs: update AnswerAccuracy documentation to showcase collections API
sanjeed5 Nov 7, 2025
9f5ae4d
docs: update BleuScore documentation to showcase collections API
sanjeed5 Nov 7, 2025
dc7f5a0
docs: update ContextEntityRecall documentation to showcase collection…
sanjeed5 Nov 7, 2025
c2edd85
docs: update AnswerSimilarity to collections-based API
sanjeed5 Nov 7, 2025
b67fef9
docs: update AnswerRelevancy documentation to showcase collections API
sanjeed5 Nov 7, 2025
f7ba1dc
formating fixes
sanjeed5 Nov 7, 2025
077361d
docs: update ContextPrecision to collections API
sanjeed5 Nov 7, 2025
850602f
docs: update Context Relevance to collections-based API
sanjeed5 Nov 7, 2025
a27f5cc
docs: update RougeScore to collections-based API
sanjeed5 Nov 7, 2025
48265b4
docs: update semantic_similarity.md to use collections API
sanjeed5 Nov 7, 2025
e967181
docs: update SummaryScore metric documentation to collections-based API
sanjeed5 Nov 7, 2025
5179ed0
docs: update NoiseSensitivity metric to collections API
sanjeed5 Nov 7, 2025
5046604
docs: update AnswerCorrectness to collections-based API
sanjeed5 Nov 7, 2025
20e3e2e
docs: update string metrics (ExactMatch, StringPresence, NonLLMString…
sanjeed5 Nov 7, 2025
b83ab5e
docs: fix import ordering in Opik integration guide
sanjeed5 Nov 7, 2025
69 changes: 58 additions & 11 deletions docs/concepts/metrics/available_metrics/answer_correctness.md
@@ -16,20 +16,44 @@ Answer correctness encompasses two critical aspects: semantic similarity between
### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics.collections import AnswerCorrectness

# Setup LLM and embeddings
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client)

# Create metric
scorer = AnswerCorrectness(llm=llm, embeddings=embeddings)

# Evaluate
result = await scorer.ascore(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    reference="The first superbowl was held on January 15, 1967"
)
print(f"Answer Correctness Score: {result.value}")
```

Output:

```
Answer Correctness Score: 0.95
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967",
        reference="The first superbowl was held on January 15, 1967"
    )
    ```

### Calculation

@@ -57,3 +81,26 @@ Next, we calculate the semantic similarity between the generated answer and the

Once we have the semantic similarity, we take a weighted average of it and the factual similarity calculated above to arrive at the final score. You can adjust this weighting by modifying the `weights` parameter.
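
As a sketch of how that weighting might be tuned, and assuming the collections-based `AnswerCorrectness` exposes the same `weights` parameter as the legacy metric (factuality weight first, semantic similarity second), a factuality-heavy configuration could look like this:

```python
from ragas.metrics.collections import AnswerCorrectness

# Sketch only: weights=[factual similarity, semantic similarity].
# The parameter name and ordering mirror the legacy metric's default of [0.75, 0.25];
# confirm both against your installed ragas version.
scorer = AnswerCorrectness(
    llm=llm,                # LLM and embeddings from the setup in the example above
    embeddings=embeddings,
    weights=[0.9, 0.1],     # emphasize factual overlap over semantic similarity
)
```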

## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with Dataset

```python
from datasets import Dataset
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_correctness])
score.to_pandas()
```
88 changes: 69 additions & 19 deletions docs/concepts/metrics/available_metrics/answer_relevance.md
@@ -1,6 +1,8 @@
## Answer Relevancy

The **Answer Relevancy** metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input.

An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.

This metric is calculated using the `user_input` and the `response` as follows:

@@ -19,34 +21,50 @@
Where:
- $E_{g_i}$: Embedding of the $i^{th}$ generated question.
- $E_o$: Embedding of the user input.
- $N$: Number of generated questions (default is 3, configurable via the `strictness` parameter; see the configuration sketch below).

**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.
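
The number of generated questions is controlled by `strictness`, as noted above. A minimal configuration sketch, assuming `strictness` is accepted at construction time and reusing the `llm` and `embeddings` objects created in the example below:

```python
from ragas.metrics.collections import AnswerRelevancy

# Sketch: generate 5 questions per response instead of the default 3.
# Assumes the collections-based class accepts `strictness` at construction,
# mirroring the legacy metric; confirm against your installed ragas version.
scorer = AnswerRelevancy(llm=llm, embeddings=embeddings, strictness=5)
```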

### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.embeddings.base import embedding_factory
from ragas.metrics.collections import AnswerRelevancy

# Setup LLM and embeddings
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client, interface="modern")

# Create metric
scorer = AnswerRelevancy(llm=llm, embeddings=embeddings)

# Evaluate
result = await scorer.ascore(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967"
)
print(f"Answer Relevancy Score: {result.value}")
```

Output:

```
Answer Relevancy Score: 0.9165088378587264
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967"
    )
    ```

### How It’s Calculated

!!! example
@@ -67,3 +85,35 @@ To calculate the relevance of the answer to the given question, we follow two steps:
- **Step 2:** Calculate the mean cosine similarity between the generated questions and the actual question.

The underlying concept is that if the answer correctly addresses the question, it is highly probable that the original question can be reconstructed solely from the answer.
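
For intuition, Step 2 is nothing more than an average of pairwise cosine similarities. A minimal sketch, assuming you already have the original question and the generated questions as NumPy embedding vectors (the metric itself obtains them from the configured embedding model):

```python
import numpy as np

def mean_cosine_similarity(original: np.ndarray, generated: list[np.ndarray]) -> float:
    """Average cosine similarity between the original question and each generated question."""
    sims = [
        float(np.dot(original, g) / (np.linalg.norm(original) * np.linalg.norm(g)))
        for g in generated
    ]
    return float(np.mean(sims))

# Toy 3-dimensional vectors for illustration; real embeddings have hundreds of dimensions.
original_q = np.array([0.9, 0.1, 0.0])
generated_qs = [np.array([0.8, 0.2, 0.1]), np.array([0.85, 0.15, 0.05]), np.array([0.7, 0.3, 0.2])]
print(mean_cosine_similarity(original_q, generated_qs))
```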


## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with SingleTurnSample

```python
from ragas import SingleTurnSample
from ragas.metrics import ResponseRelevancy

sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ]
)

scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
await scorer.single_turn_ascore(sample)
```

Output:

```
0.9165088378587264
```
63 changes: 53 additions & 10 deletions docs/concepts/metrics/available_metrics/context_entities_recall.md
@@ -17,23 +17,40 @@
### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ContextEntityRecall

# Setup LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create metric
scorer = ContextEntityRecall(llm=llm)

# Evaluate
result = await scorer.ascore(
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."]
)
print(f"Context Entity Recall Score: {result.value}")
```

Output:
```
Context Entity Recall Score: 0.999999995
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        reference="The Eiffel Tower is located in Paris.",
        retrieved_contexts=["The Eiffel Tower is located in Paris."]
    )
    ```

### How It’s Calculated


@@ -65,3 +82,29 @@ Let us consider the reference and the retrieved contexts given above.

We can see that the first context has a higher entity recall because it covers more of the entities in the reference. If these two contexts were fetched by two retrieval mechanisms over the same set of documents, we could say that the first mechanism was better than the other in use cases where entities are important.
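
For intuition, once the entities have been extracted (the metric uses an LLM for that step), the recall itself reduces to a simple set computation. A minimal sketch with hypothetical entity sets:

```python
def context_entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """Fraction of reference entities that also appear in the retrieved context."""
    if not reference_entities:
        return 0.0
    return len(reference_entities & context_entities) / len(reference_entities)

reference = {"Eiffel Tower", "Paris"}
context_a = {"Eiffel Tower", "Paris", "1889"}  # covers both reference entities -> 1.0
context_b = {"Eiffel Tower"}                   # covers one of two -> 0.5

print(context_entity_recall(reference, context_a))
print(context_entity_recall(reference, context_b))
```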

## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with SingleTurnSample

```python
from ragas import SingleTurnSample
from ragas.metrics import ContextEntityRecall

sample = SingleTurnSample(
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

scorer = ContextEntityRecall(llm=evaluator_llm)

await scorer.single_turn_ascore(sample)
```
Output:
```
0.999999995
```