
Commit 677808e

docs: add explanation (#2139)
1 parent f31ea42 commit 677808e

File tree

- docs/experimental/explanation/datasets.md
- docs/experimental/explanation/experimentation.md
- docs/experimental/explanation/index.md
- docs/experimental/explanation/metrics.md

4 files changed: +170 -11 lines changed
docs/experimental/explanation/datasets.md

Lines changed: 88 additions & 1 deletion

@@ -1 +1,88 @@
-# Dataset preparation for Evaluating AI Systems

# Datasets and Experiment Results

When we evaluate AI systems, we typically work with two main types of data:

1. **Evaluation Datasets**: These are stored under the `datasets` directory.
2. **Evaluation Results**: These are stored under the `experiments` directory.

## Evaluation Datasets

A dataset for evaluations contains:

1. Inputs: a set of inputs that the system will process.
2. Expected outputs (optional): the expected outputs or responses from the system for the given inputs.
3. Metadata (optional): additional information that can be stored alongside the dataset.

For example, in a Retrieval-Augmented Generation (RAG) system, a dataset might include the query (the input to the system), grading notes (used to grade the system's output), and metadata such as query complexity.

Metadata is particularly useful for slicing and dicing the dataset, allowing you to analyze results across different facets. For instance, you might want to see how your system performs on complex queries versus simple ones, or how it handles different languages.
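Continuing the RAG example, a single dataset row could be represented roughly like this (an illustrative sketch only; the field names `query`, `grading_notes`, `complexity`, and `language` are hypothetical examples, not a schema required by Ragas):

```python
# Illustrative only: one dataset row with an input, grading notes, and metadata.
dataset_row = {
    "query": "How do I reset my password?",  # input to the system
    "grading_notes": "Should mention the 'Forgot password' link and the email step.",  # used to grade the output
    "complexity": "simple",  # metadata for slicing results
    "language": "en",        # metadata for slicing results
}
```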

## Experiment Results

Experiment results include:

1. All attributes from the dataset.
2. The response from the evaluated system.
3. Results of metrics.
4. Optional metadata, such as a URI pointing to the system trace for a given input.

For example, in a RAG system the results might include the query, grading notes, response, accuracy score (metric), a link to the system trace, and so on.
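A result row therefore extends the dataset row sketched above with the system's output, metric values, and run metadata (again illustrative, with hypothetical field names):

```python
# Illustrative only: the dataset row plus response, metric result, and trace metadata.
result_row = {
    **dataset_row,
    "response": "Click 'Forgot password' on the login page and follow the email link.",
    "accuracy": 1.0,                                  # metric result
    "trace_uri": "file://traces/run-001/row-0.json",  # hypothetical trace location
}
```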

## Data Storage in Ragas

We understand that different teams have diverse preferences for organizing, updating, and maintaining data. For example:

- A single developer might store datasets as CSV files on the local filesystem.
- A small-to-medium team might use Google Sheets or Notion databases.
- Enterprise teams might rely on Box or Microsoft OneDrive, depending on their data storage and sharing policies.

Teams may also use various file formats like CSV, XLSX, or JSON. Among these, CSV or spreadsheet formats are often preferred for evaluation datasets due to their simplicity and smaller size compared to training datasets.

Ragas, as an evaluation framework, supports these diverse preferences by letting you use your preferred file systems and formats for storing and reading datasets and experiment results.

To achieve this, Ragas introduces the concept of **plug-and-play backends** for data storage:

- Ragas provides default backends like `local/csv` and `google_drive/csv`.
- These backends are extensible, allowing you to implement custom backends for any file system or format (e.g., `box/csv`).
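With the default `local/csv` backend, this maps to a simple directory layout under your project root (a sketch based on the loading behaviour described in the next section; the file names are the examples used there):

```
root_dir/
├── datasets/
│   └── test_dataset.csv       # evaluation dataset
└── experiments/
    └── first_experiment.csv   # experiment results
```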

## Using Datasets and Results via the API

### Loading a Dataset

```python
from ragas_experimental import Dataset

test_dataset = Dataset.load(name="test_dataset", backend="local/csv", root_dir=".")
```

This loads a dataset named `test_dataset.csv` from the `datasets` directory under `root_dir`. The backend can be any backend registered with Ragas.
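Once loaded, the dataset's rows can be inspected or fed to your application. A minimal sketch, assuming the loaded `Dataset` can be iterated row by row (row attributes such as `query` and `ground_truth` mirror the CSV column headers; this is an assumption for illustration, not a documented guarantee):

```python
# Sketch: inspect a loaded dataset, assuming Dataset supports iteration.
for row in test_dataset:
    print(row.query, "->", row.ground_truth)
```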

### Loading Experiment Results

```python
from ragas_experimental import Experiment

experiment_results = Experiment.load(name="first_experiment", backend="local/csv", root_dir=".")
```

This loads experiment results named `first_experiment.csv` from the `experiments` directory under `root_dir`. The backend can be any backend registered with Ragas.

## Data Validation Using Pydantic

Ragas provides data type validation via Pydantic. You can configure a preferred `data_model` for a dataset or for experiment results to ensure data is validated before it is read from or written to storage.

**Example**:

```python
from ragas_experimental import Dataset
from pydantic import BaseModel

class MyDataset(BaseModel):
    query: str
    ground_truth: str

test_dataset = Dataset.load(name="test_dataset", backend="local/csv", root_dir=".", data_model=MyDataset)
```

This ensures that the data meets the specified type requirements, preventing invalid data from being read or written.
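For instance, if a stored row were missing the `ground_truth` column or held the wrong type, loading with the `data_model` above would be expected to fail validation rather than silently return bad rows. A hedged sketch (the exact exception type surfaced by Ragas is not specified here, hence the broad catch):

```python
# Sketch only: a data_model guards reads; the precise error type is an assumption.
try:
    bad_dataset = Dataset.load(
        name="test_dataset", backend="local/csv", root_dir=".", data_model=MyDataset
    )
except Exception as err:  # e.g. a Pydantic validation error propagated by the backend
    print(f"Dataset failed validation: {err}")
```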
docs/experimental/explanation/experimentation.md

Lines changed: 65 additions & 1 deletion

@@ -1 +1,65 @@
-# Experimentation for Improving AI Systems

# Experiments

## What is an experiment?

An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses.

### Principles of a Good Experiment

1. **Define measurable metrics**: Use metrics like accuracy, precision, or recall to quantify the impact of your changes.
2. **Systematic result storage**: Ensure results are stored in an organized manner for easy comparison and tracking.
3. **Isolate changes**: Make one change at a time to identify its specific impact. Avoid making multiple changes simultaneously, as this can obscure the results.
4. **Iterative process**: Follow a structured loop: *make a change → run evaluations → observe results → hypothesize the next change*, as illustrated below.

```mermaid
graph LR
    A[Make a change] --> B[Run evaluations]
    B --> C[Observe results]
    C --> D[Hypothesize next change]
    D --> A
```

## Experiments in Ragas

### Components of an Experiment

1. **Test dataset**: The data used to evaluate the system.
2. **Application endpoint**: The application, component, or model being tested.
3. **Metrics**: Quantitative measures to assess performance.

### Execution Process

Running an experiment involves:

1. Executing the dataset against the application endpoint.
2. Calculating metrics to quantify performance.
3. Returning and storing the results.
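Conceptually, these three steps boil down to a loop like the following. This is a sketch of the idea only, not the actual Ragas execution engine; `my_app`, `my_metric`, and `dataset` are placeholders mirroring the decorator example in the next section:

```python
# Conceptual sketch -- not the real Ragas internals.
results = []
for row in dataset:                                       # 1. run each row against the endpoint
    response = my_app(row.query)
    score = my_metric.score(response, row.ground_truth)   # 2. calculate metrics
    results.append({**row, "response": response, "accuracy": score.value})
# 3. results are returned and persisted (e.g. to the `experiments` folder)
```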

## Using the `@experiment` Decorator

The `@experiment` decorator in Ragas simplifies the orchestration, scaling, and storage of experiments. Here's an example:

```python
from ragas_experimental import experiment

# Define your metric and dataset
my_metric = ...
dataset = ...

@experiment
async def my_experiment(row):
    # Process the query through your application
    response = my_app(row.query)

    # Calculate the metric
    metric = my_metric.score(response, row.ground_truth)

    # Return results
    return {**row, "response": response, "accuracy": metric.value}

# Run the experiment
my_experiment.arun(dataset)
```

## Result Storage

Once executed, Ragas processes each row in the dataset, runs it through the function, and stores the results in the `experiments` folder. The storage backend can be configured based on your preferences.
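Stored results can later be reloaded for analysis with the same backend, as described in the datasets documentation. A sketch (the exact name under which a run is saved depends on how the experiment run is named; `first_experiment` is an example):

```python
from ragas_experimental import Experiment

# Sketch: reload a previous run from the `experiments` folder.
results = Experiment.load(name="first_experiment", backend="local/csv", root_dir=".")
```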
docs/experimental/explanation/index.md

Lines changed: 3 additions & 3 deletions

@@ -1,5 +1,5 @@
# 📚 Explanation

-1. [Metrics for Evaluating AI systems](metrics.md)
-2. [Experimentation for improving AI systems](experimentation.md)
-3. [Datasets preparation for evaluating AI systems](datasets.md)
+1. [Metrics](metrics.md)
+2. [Datasets and Experiment Results](datasets.md)
+3. [Experiments](experimentation.md)

docs/experimental/explanation/metrics.md

Lines changed: 14 additions & 6 deletions
@@ -1,4 +1,4 @@
-# Metrics for evaluating AI Applications
+# Metrics

## Why Metrics Matter

@@ -65,11 +65,13 @@ from ragas_experimental.metrics import numeric_metric
@numeric_metric(name="response_accuracy", allowed_values=(0, 1))
def my_metric(predicted: float, expected: float) -> float:
    return abs(predicted - expected) / max(expected, 1e-5)
+
+my_metric.score(predicted=0.8, expected=1.0)  # Returns a float value
```

-### 3. Ranked Metrics
+### 3. Ranking Metrics

-These evaluate multiple outputs at once and return a ranked list based on a defined criterion. They are useful when the goal is to compare outputs relative to one another.
+These evaluate multiple outputs at once and return a ranked list based on a defined criterion. They are useful when the goal is to compare multiple outputs from the same pipeline relative to one another.

```python
from ragas_experimental.metrics import ranked_metric
@@ -78,6 +80,8 @@ def my_metric(responses: list) -> list:
    response_lengths = [len(response) for response in responses]
    sorted_indices = sorted(range(len(response_lengths)), key=lambda i: response_lengths[i])
    return sorted_indices
+
+my_metric.score(responses=["short", "a bit longer", "the longest response"])  # Returns a ranked list of indices
```

## LLM-based vs. Non-LLM-based Metrics
@@ -103,9 +107,13 @@ These leverage LLMs (Large Language Models) to evaluate outcomes, typically useful

Example:
```python
-def my_metric(predicted: str, expected: str) -> str:
-    response = llm.generate(f"Evaluate semantic similarity between '{predicted}' and '{expected}'")
-    return "pass" if response > 5 else "fail"
+from ragas_experimental.metrics import DiscreteMetric
+
+my_metric = DiscreteMetric(
+    name="response_quality",
+    prompt="Evaluate the response based on the pass criteria: {pass_criteria}. Does the response meet the criteria? Return 'pass' or 'fail'.\nResponse: {response}",
+    allowed_values=["pass", "fail"]
+)
```

When to use:
