
Commit 6105b31

Merge branch 'main' into rbhatnagar/migrate_metrics_7
2 parents: 4b094d9 + 09d22fc

72 files changed: +1860 / -703 lines changed

Makefile

Lines changed: 2 additions & 2 deletions
@@ -161,7 +161,7 @@ build-docs: ## Build all documentation
 	@echo "Converting ipynb notebooks to md files..."
 	$(Q)MKDOCS_CI=true uv run python $(GIT_ROOT)/docs/ipynb_to_md.py
 	@echo "Building ragas documentation..."
-	$(Q)uv run --group docs mkdocs build
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs build
 
 serve-docs: ## Build and serve documentation locally
-	$(Q)uv run --group docs mkdocs serve --dirtyreload
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs serve --dirtyreload

README.md

Lines changed: 24 additions & 6 deletions
@@ -97,21 +97,39 @@ Available templates:
 
 ### Evaluate your LLM App
 
-This is 5 main lines:
+This is a simple example evaluating a summary for accuracy:
 
 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import AspectCritic
+import asyncio
+from ragas.metrics.collections import AspectCritic
+from ragas.llms import llm_factory
 
+# Setup your LLM
+llm = llm_factory("gpt-4o")
+
+# Create a metric
+metric = AspectCritic(
+    name="summary_accuracy",
+    definition="Verify if the summary is accurate and captures key information.",
+    llm=llm
+)
+
+# Evaluate
 test_data = {
     "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
     "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
 }
-evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
-metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.")
-await metric.single_turn_ascore(SingleTurnSample(**test_data))
+
+score = await metric.ascore(
+    user_input=test_data["user_input"],
+    response=test_data["response"]
+)
+print(f"Score: {score.value}")
+print(f"Reason: {score.reason}")
 ```
 
+> **Note**: Make sure your `OPENAI_API_KEY` environment variable is set.
+
 Find the complete [Quickstart Guide](https://docs.ragas.io/en/latest/getstarted/evals)
 
 ## Want help in improving your AI application using evals?
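The committed snippet uses top-level `await`, which runs as-is in a notebook or async REPL. As a minimal sketch (not part of this commit, and built only from the imports and calls shown in the diff above), the same evaluation could be wrapped with `asyncio.run` for a plain Python script:

```python
import asyncio

from ragas.llms import llm_factory
from ragas.metrics.collections import AspectCritic


async def main() -> None:
    # Same setup as the README example above.
    llm = llm_factory("gpt-4o")
    metric = AspectCritic(
        name="summary_accuracy",
        definition="Verify if the summary is accurate and captures key information.",
        llm=llm,
    )

    # Shortened inputs for illustration; the README passes the full Q3 2024 summary text.
    score = await metric.ascore(
        user_input="summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market.",
        response="The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies.",
    )
    print(f"Score: {score.value}")
    print(f"Reason: {score.reason}")


if __name__ == "__main__":
    # Requires OPENAI_API_KEY to be set, as the README note points out.
    asyncio.run(main())
```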

docs/concepts/components/eval_dataset.md

Lines changed: 2 additions & 1 deletion
@@ -68,6 +68,7 @@ sample3 = SingleTurnSample(
 ```
 
 **Step 3:** Create the EvaluationDataset
+
 Create an EvaluationDataset by passing a list of SingleTurnSample instances.
 
 ```python
@@ -91,4 +92,4 @@ Load the dataset into a Ragas EvaluationDataset object.
 from ragas import EvaluationDataset
 
 eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
-```
+```
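The "Step 3" paragraph touched by this hunk describes building an `EvaluationDataset` by passing a list of `SingleTurnSample` instances. A minimal sketch of that pattern (illustrative only, not part of the commit; the sample contents here are made up):

```python
from ragas import EvaluationDataset
from ragas.dataset_schema import SingleTurnSample

# Hypothetical samples standing in for sample1..sample3 from the doc.
sample1 = SingleTurnSample(
    user_input="What caused the Q3 2024 revenue rise?",
    response="Strong performance in the Asian market.",
)
sample2 = SingleTurnSample(
    user_input="Is the growth expected to continue?",
    response="Yes, analysts expect the trend to continue next quarter.",
)

# Step 3: create the EvaluationDataset from a list of samples.
eval_dataset = EvaluationDataset(samples=[sample1, sample2])
print(len(eval_dataset.samples))  # 2
```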

docs/concepts/metrics/available_metrics/index.md

Lines changed: 3 additions & 1 deletion
@@ -21,8 +21,9 @@ Each metric are essentially paradigms that are designed to evaluate a particular
 
 ## Agents or Tool use cases
 
-- [Topic adherence](agents.md#topic_adherence)
+- [Topic adherence](agents.md#topic-adherence)
 - [Tool call Accuracy](agents.md#tool-call-accuracy)
+- [Tool Call F1](agents.md#tool-call-f1)
 - [Agent Goal Accuracy](agents.md#agent-goal-accuracy)
 
 ## Natural Language Comparison
@@ -31,6 +32,7 @@ Each metric are essentially paradigms that are designed to evaluate a particular
 - [Semantic Similarity](semantic_similarity.md)
 - [Non LLM String Similarity](traditional.md#non-llm-string-similarity)
 - [BLEU Score](traditional.md#bleu-score)
+- [CHRF Score](traditional.md#chrf-score)
 - [ROUGE Score](traditional.md#rouge-score)
 - [String Presence](traditional.md#string-presence)
 - [Exact Match](traditional.md#exact-match)

docs/concepts/metrics/available_metrics/nvidia_metrics.md

Lines changed: 6 additions & 3 deletions
@@ -78,10 +78,13 @@ Thus, the final **Answer Accuracy** score is **1**.
 
 **Context Relevance** evaluates whether the **retrieved_contexts** (chunks or passages) are pertinent to the **user_input**. This is done via two independent "LLM-as-a-Judge" prompt calls that each rate the relevance on a scale of **0, 1, or 2**. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.
 
-- **0** → The retrieved contexts are not relevant to the user’s query at all.
+- **0** → The retrieved contexts are not relevant to the user's query at all.
 - **1** → The contexts are partially relevant.
 - **2** → The contexts are completely relevant.
 
+### Implementation Note
+
+**Difference from Original Paper:** The original Ragas paper defines Context Relevance using sentence-level extraction (CR = number of relevant sentences / total sentences), but the current implementation uses a more robust discrete judgment approach. Each LLM is asked to rate overall context relevance on a 0-2 scale, which is more efficient and less prone to sentence boundary errors. This was an intentional design decision to improve reliability and reduce computational overhead while maintaining the core evaluation objective.
 
 ```python
 from ragas.dataset_schema import SingleTurnSample
@@ -104,9 +107,9 @@ Output
 1.0
 ```
 
-### How It’s Calculated
+### How It's Calculated
 
-**Step 1:** The LLM is prompted with two distinct templates (template_relevance1 and template_relevance2) to evaluate the relevance of the retrieved contexts concerning the user's query. Each prompt returns a relevance rating of **0**, **1**, or **2**.
+**Step 1:** The LLM is prompted with two distinct templates (template_relevance1 and template_relevance2) to evaluate the relevance of the retrieved contexts concerning the user's query. Each prompt returns a relevance rating of **0**, **1**, or **2**. Using two independent evaluations provides robustness and helps mitigate individual LLM biases.
 
 **Step 2:** Each rating is normalized to a [0,1] scale by dividing by 2. If both ratings are valid, the final score is the average of these normalized values; if only one is valid, that score is used.
 