Commit da87e80 ("main sync")

2 parents: cd28d56 + e86105b

File tree: 120 files changed (+8762 / -1417 lines)


Makefile

Lines changed: 2 additions & 2 deletions
@@ -161,7 +161,7 @@ build-docs: ## Build all documentation
 	@echo "Converting ipynb notebooks to md files..."
 	$(Q)MKDOCS_CI=true uv run python $(GIT_ROOT)/docs/ipynb_to_md.py
 	@echo "Building ragas documentation..."
-	$(Q)uv run --group docs mkdocs build
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs build

 serve-docs: ## Build and serve documentation locally
-	$(Q)uv run --group docs mkdocs serve --dirtyreload
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs serve --dirtyreload
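For context, `MKDOCS_CI` is an environment variable the Makefile now sets explicitly for both build and serve targets. The sketch below shows the usual pattern for reading such a flag inside a docs script such as `docs/ipynb_to_md.py`; the variable name comes from the diff, but the reading logic and the described behaviour are assumptions for illustration, not the actual script.

```python
import os

# Assumed pattern only: read the MKDOCS_CI flag set by the Makefile targets above.
# "true" would enable CI-only behaviour (e.g. lighter notebook conversion);
# "false" would keep the full local behaviour.
MKDOCS_CI = os.environ.get("MKDOCS_CI", "false").strip().lower() == "true"

if MKDOCS_CI:
    print("Running in CI mode.")
else:
    print("Running locally.")
```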

README.md

Lines changed: 47 additions & 7 deletions
@@ -48,7 +48,7 @@ Ragas is your ultimate toolkit for evaluating and optimizing Large Language Mode
 Don't have a test dataset ready? We also do production-aligned test set generation.

 > [!NOTE]
-> Need help setting up Evals for your AI application? We'd love to help! We are conducting Office Hours every week. You can sign up [here](https://cal.com/team/ragas/office-hours).
+> Need help setting up Evals for your AI application? We'd love to help! We are conducting Office Hours every week. You can sign up [here](https://cal.com/team/vibrantlabs/office-hours).

 ## Key Features

@@ -73,23 +73,63 @@ pip install git+https://github.com/explodinggradients/ragas

 ## :fire: Quickstart

+### Clone a Complete Example Project
+
+The fastest way to get started is to use the `ragas quickstart` command:
+
+```bash
+# List available templates
+ragas quickstart
+
+# Create a RAG evaluation project
+ragas quickstart rag_eval
+
+# Create an agent evaluation project
+ragas quickstart agent_evals -o ./my-project
+```
+
+Available templates:
+- `rag_eval` - Evaluate RAG systems
+- `agent_evals` - Evaluate AI agents
+- `benchmark_llm` - Benchmark and compare LLMs
+- `prompt_evals` - Evaluate prompt variations
+- `workflow_eval` - Evaluate complex workflows
+
 ### Evaluate your LLM App

-This is 5 main lines:
+This is a simple example evaluating a summary for accuracy:

 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import AspectCritic
+import asyncio
+from ragas.metrics.collections import AspectCritic
+from ragas.llms import llm_factory
+
+# Setup your LLM
+llm = llm_factory("gpt-4o")

+# Create a metric
+metric = AspectCritic(
+    name="summary_accuracy",
+    definition="Verify if the summary is accurate and captures key information.",
+    llm=llm
+)
+
+# Evaluate
 test_data = {
     "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
     "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
 }
-evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
-metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.")
-await metric.single_turn_ascore(SingleTurnSample(**test_data))
+
+score = await metric.ascore(
+    user_input=test_data["user_input"],
+    response=test_data["response"]
+)
+print(f"Score: {score.value}")
+print(f"Reason: {score.reason}")
 ```

+> **Note**: Make sure your `OPENAI_API_KEY` environment variable is set.
+
 Find the complete [Quickstart Guide](https://docs.ragas.io/en/latest/getstarted/evals)

 ## Want help in improving your AI application using evals?
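A note on the new README snippet: the top-level `await metric.ascore(...)` works in notebooks and other async contexts, but a plain Python script needs an event loop. A minimal sketch of running the same example from a script, reusing the `llm_factory` and `AspectCritic` calls shown in the diff above (the placeholder strings are illustrative):

```python
import asyncio

from ragas.llms import llm_factory
from ragas.metrics.collections import AspectCritic


async def main() -> None:
    # Same setup as the README example above.
    llm = llm_factory("gpt-4o")
    metric = AspectCritic(
        name="summary_accuracy",
        definition="Verify if the summary is accurate and captures key information.",
        llm=llm,
    )
    score = await metric.ascore(
        user_input="summarise given text\n<your text here>",
        response="<your summary here>",
    )
    print(score.value, score.reason)


if __name__ == "__main__":
    # asyncio.run drives the coroutine when no event loop is already running.
    asyncio.run(main())
```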

docs/concepts/components/eval_dataset.md

Lines changed: 2 additions & 1 deletion
@@ -68,6 +68,7 @@ sample3 = SingleTurnSample(
 ```

 **Step 3:** Create the EvaluationDataset
+
 Create an EvaluationDataset by passing a list of SingleTurnSample instances.

 ```python
@@ -91,4 +92,4 @@ Load the dataset into a Ragas EvaluationDataset object.
 from ragas import EvaluationDataset

 eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
-```
+```
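For readers following this doc change, here is a short sketch of the two construction paths the page describes: building an `EvaluationDataset` from `SingleTurnSample` objects and loading one from a Hugging Face split. The `EvaluationDataset(samples=...)` constructor is the commonly documented form, but treat the exact signature as an assumption; the sample contents are illustrative.

```python
from ragas import EvaluationDataset, SingleTurnSample

# Path 1 (the doc's Step 3): build the dataset from individual samples.
samples = [
    SingleTurnSample(
        user_input="What is the capital of France?",
        response="Paris is the capital of France.",
    ),
    SingleTurnSample(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967",
    ),
]
eval_dataset = EvaluationDataset(samples=samples)

# Path 2 (the doc's later step): load an existing Hugging Face split.
# eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
```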

docs/concepts/experimentation.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ graph LR

 ## Creating Experiments with Ragas

-Ragas provides an `@experiment` decorator to streamline the experiment creation process. If you prefer a hands-on intro first, see [Run your first experiment](../getstarted/experiments_quickstart.md).
+Ragas provides an `@experiment` decorator to streamline the experiment creation process. If you prefer a hands-on intro first, see the [Quick Start guide](../getstarted/quickstart.md).

 ### Basic Experiment Structure

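Since the changed paragraph only swaps a link, here is a rough sketch of what the `@experiment` decorator it refers to typically looks like. The `from ragas import experiment` import, the row shape, and the `arun` call are assumptions based on recent Ragas releases, not part of this diff; `my_app` is a stand-in for the application under test.

```python
from ragas import experiment  # assumed import path for the @experiment decorator
from ragas.llms import llm_factory
from ragas.metrics.collections import AspectCritic

llm = llm_factory("gpt-4o")
metric = AspectCritic(
    name="summary_accuracy",
    definition="Verify if the summary is accurate.",
    llm=llm,
)


def my_app(user_input: str) -> str:
    # Stand-in for the application under test.
    return "The company reported an 8% rise in Q3 2024."


@experiment()
async def summary_experiment(row: dict):
    # Assumed dict-like row; adapt the access pattern to your dataset schema.
    response = my_app(row["user_input"])
    score = await metric.ascore(user_input=row["user_input"], response=response)
    # The returned dict becomes one result row of the experiment.
    return {**row, "response": response, "score": score.value}

# Typical usage (assumed): results = await summary_experiment.arun(dataset)
```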

docs/concepts/index.md

Lines changed: 12 additions & 17 deletions
@@ -3,41 +3,36 @@

 <div class="grid cards" markdown>

-- :material-widgets:{ .lg .middle } [__Components Guides__](components/index.md)
+- :material-flask-outline:{ .lg .middle } [__Experimentation__](experimentation.md)

     ---

-    Discover the various components used within Ragas.
-
-    Components like [Prompt Object](components/prompt.md), [Evaluation Dataset](components/eval_dataset.md) and [more..](components/index.md)
+    Learn how to systematically evaluate your AI applications using experiments.

+    Track changes, measure improvements, and compare results across different versions of your application.

-- ::material-ruler-square:{ .lg .middle } [__Ragas Metrics__](metrics/index.md)
+- :material-database-export:{ .lg .middle } [__Datasets__](datasets.md)

     ---

-    Explore available metrics and understand how they work.
+    Understand how to create, manage, and use evaluation datasets.

-    Metrics for evaluating [RAG](metrics/available_metrics/index.md#retrieval-augmented-generation), [Agentic workflows](metrics/available_metrics/index.md#agents-or-tool-use-cases) and [more..](metrics/available_metrics/index.md#list-of-available-metrics).
+    Learn about dataset structure, storage backends, and best practices for maintaining your test data.

-- :material-database-plus:{ .lg .middle } [__Test Data Generation__](test_data_generation/index.md)
+- ::material-ruler-square:{ .lg .middle } [__Ragas Metrics__](metrics/index.md)

     ---

-    Generate high-quality datasets for comprehensive testing.
-
-    Algorithms for synthesizing data to test [RAG](test_data_generation/rag.md), [Agentic workflows](test_data_generation/agents.md)
+    Use our library of [available metrics](metrics/available_metrics/index.md) or create [custom metrics](metrics/overview/index.md) tailored to your use case.

+    Metrics for evaluating [RAG](metrics/available_metrics/index.md#retrieval-augmented-generation), [Agentic workflows](metrics/available_metrics/index.md#agents-or-tool-use-cases) and [more..](metrics/available_metrics/index.md#list-of-available-metrics).

-- :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback/index.md)
+- :material-database-plus:{ .lg .middle } [__Test Data Generation__](test_data_generation/index.md)

     ---

-    Leverage signals from production data to gain actionable insights.
-
-    Learn about to leveraging implicit and explicit signals from production data.
-
-
+    Generate high-quality datasets for comprehensive testing.

+    Algorithms for synthesizing data to test [RAG](test_data_generation/rag.md), [Agentic workflows](test_data_generation/agents.md)

 </div>

docs/concepts/metrics/available_metrics/answer_correctness.md

Lines changed: 58 additions & 11 deletions
@@ -16,20 +16,44 @@ Answer correctness encompasses two critical aspects: semantic similarity between
 ### Example

 ```python
-from datasets import Dataset
-from ragas.metrics import answer_correctness
-from ragas import evaluate
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.embeddings.base import embedding_factory
+from ragas.metrics.collections import AnswerCorrectness
+
+# Setup LLM and embeddings
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client)
+
+# Create metric
+scorer = AnswerCorrectness(llm=llm, embeddings=embeddings)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967",
+    reference="The first superbowl was held on January 15, 1967"
+)
+print(f"Answer Correctness Score: {result.value}")
+```

-data_samples = {
-    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
-    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
-    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
-}
-dataset = Dataset.from_dict(data_samples)
-score = evaluate(dataset,metrics=[answer_correctness])
-score.to_pandas()
+Output:

 ```
+Answer Correctness Score: 0.95
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967",
+        reference="The first superbowl was held on January 15, 1967"
+    )
+    ```

 ### Calculation

@@ -57,3 +81,26 @@ Next, we calculate the semantic similarity between the generated answer and the

 Once we have the semantic similarity, we take a weighted average of the semantic similarity and the factual similarity calculated above to arrive at the final score. You can adjust this weightage by modifying the `weights` parameter.

+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with Dataset
+
+```python
+from datasets import Dataset
+from ragas.metrics import answer_correctness
+from ragas import evaluate
+
+data_samples = {
+    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
+    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
+    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
+}
+dataset = Dataset.from_dict(data_samples)
+score = evaluate(dataset,metrics=[answer_correctness])
+score.to_pandas()
+```
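One detail worth making concrete from the context line kept in this hunk: the final answer correctness score is a weighted average of factual similarity and semantic similarity, controlled by the `weights` parameter. A tiny sketch of that combination; the 0.75/0.25 split mirrors the legacy metric's commonly documented default, but treat it as an assumption and the input scores as made-up numbers.

```python
# Minimal sketch of the weighted combination described above.
factuality_score = 0.80       # illustrative factual-similarity score
semantic_similarity = 1.00    # illustrative embedding-similarity score

weights = [0.75, 0.25]        # assumed default; adjustable via the `weights` parameter
answer_correctness = weights[0] * factuality_score + weights[1] * semantic_similarity
print(round(answer_correctness, 2))  # 0.85 with the numbers above
```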

docs/concepts/metrics/available_metrics/answer_relevance.md

Lines changed: 69 additions & 19 deletions
@@ -1,6 +1,8 @@
-## Response Relevancy
+## Answer Relevancy

-The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.
+The **Answer Relevancy** metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input.
+
+An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.

 This metric is calculated using the `user_input` and the `response` as follows:

@@ -19,34 +21,50 @@
 Where:
 - $E_{g_i}$: Embedding of the $i^{th}$ generated question.
 - $E_o$: Embedding of the user input.
-- $N$: Number of generated questions (default is 3).
+- $N$: Number of generated questions (default is 3, configurable via `strictness` parameter).

 **Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.

-An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
-
 ### Example

 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import ResponseRelevancy
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.embeddings.base import embedding_factory
+from ragas.metrics.collections import AnswerRelevancy
+
+# Setup LLM and embeddings
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client, interface="modern")
+
+# Create metric
+scorer = AnswerRelevancy(llm=llm, embeddings=embeddings)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967"
+)
+print(f"Answer Relevancy Score: {result.value}")
+```

-sample = SingleTurnSample(
-        user_input="When was the first super bowl?",
-        response="The first superbowl was held on Jan 15, 1967",
-        retrieved_contexts=[
-            "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
-        ]
-    )
+Output:

-scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
-await scorer.single_turn_ascore(sample)
-```
-Output
 ```
-0.9165088378587264
+Answer Relevancy Score: 0.9165088378587264
 ```

+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967"
+    )
+    ```
+
 ### How It’s Calculated

 !!! example
@@ -67,3 +85,35 @@ To calculate the relevance of the answer to the given question, we follow two st
 - **Step 2:** Calculate the mean cosine similarity between the generated questions and the actual question.

 The underlying concept is that if the answer correctly addresses the question, it is highly probable that the original question can be reconstructed solely from the answer.
+
+
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with SingleTurnSample
+
+```python
+from ragas import SingleTurnSample
+from ragas.metrics import ResponseRelevancy
+
+sample = SingleTurnSample(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967",
+        retrieved_contexts=[
+            "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
+        ]
+    )
+
+scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
+await scorer.single_turn_ascore(sample)
+```
+
+Output:
+
+```
+0.9165088378587264
+```
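To make the formula in this page's "Where:" block concrete: the metric embeds the N questions generated from the response and averages their cosine similarity against the embedding of the original user input. A small self-contained sketch of that averaging step; the embeddings here are toy vectors, not real model output.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# E_o: embedding of the original user input (toy 3-d vector for illustration).
user_input_emb = np.array([0.9, 0.1, 0.2])

# E_{g_i}: embeddings of N questions generated from the response (N defaults to 3).
generated_question_embs = [
    np.array([0.88, 0.12, 0.25]),
    np.array([0.85, 0.15, 0.18]),
    np.array([0.80, 0.20, 0.30]),
]

# Answer relevancy = mean cosine similarity, as in the formula above.
answer_relevancy = np.mean(
    [cosine_similarity(e, user_input_emb) for e in generated_question_embs]
)
print(f"Answer Relevancy (toy example): {answer_relevancy:.3f}")
```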
