
Commit 677808e

docs: add explanation (#2139)
1 parent f31ea42 commit 677808e

File tree

- docs/experimental/explanation/datasets.md
- docs/experimental/explanation/experimentation.md
- docs/experimental/explanation/index.md
- docs/experimental/explanation/metrics.md

4 files changed: +170 -11 lines changed
docs/experimental/explanation/datasets.md

Lines changed: 88 additions & 1 deletion

@@ -1 +1,88 @@
-# Dataset preparation for Evaluating AI Systems

# Datasets and Experiment Results

When we evaluate AI systems, we typically work with two main types of data:

1. **Evaluation Datasets**: These are stored under the `datasets` directory.
2. **Evaluation Results**: These are stored under the `experiments` directory.

## Evaluation Datasets

A dataset for evaluations contains:

1. Inputs: a set of inputs that the system will process.
2. Expected outputs (optional): the expected outputs or responses from the system for the given inputs.
3. Metadata (optional): additional information that can be stored alongside the dataset.

For example, in a Retrieval-Augmented Generation (RAG) system, a dataset might include the query (the input to the system), grading notes (used to grade the system's output), and metadata such as query complexity.

Metadata is particularly useful for slicing and dicing the dataset, allowing you to analyze results across different facets. For instance, you might want to see how your system performs on complex queries versus simple ones, or how it handles different languages.
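Continuing the RAG example, a single dataset row could be represented roughly like this (an illustrative sketch only; the field names `query`, `grading_notes`, `complexity`, and `language` are hypothetical examples, not a schema required by Ragas):

```python
# Illustrative only: one dataset row with an input, grading notes, and metadata.
dataset_row = {
    "query": "How do I reset my password?",  # input to the system
    "grading_notes": "Should mention the 'Forgot password' link and the email step.",  # used to grade the output
    "complexity": "simple",  # metadata for slicing results
    "language": "en",        # metadata for slicing results
}
```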

## Experiment Results

Experiment results include:

1. All attributes from the dataset.
2. The response from the evaluated system.
3. Results of metrics.
4. Optional metadata, such as a URI pointing to the system trace for a given input.

For example, in a RAG system the results might include the query, grading notes, response, accuracy score (metric), a link to the system trace, and so on.
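A result row therefore extends the dataset row sketched above with the system's output, metric values, and run metadata (again illustrative, with hypothetical field names):

```python
# Illustrative only: the dataset row plus response, metric result, and trace metadata.
result_row = {
    **dataset_row,
    "response": "Click 'Forgot password' on the login page and follow the email link.",
    "accuracy": 1.0,                                  # metric result
    "trace_uri": "file://traces/run-001/row-0.json",  # hypothetical trace location
}
```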

## Data Storage in Ragas

We understand that different teams have diverse preferences for organizing, updating, and maintaining data. For example:

- A single developer might store datasets as CSV files on the local filesystem.
- A small-to-medium team might use Google Sheets or Notion databases.
- Enterprise teams might rely on Box or Microsoft OneDrive, depending on their data storage and sharing policies.

Teams may also use various file formats like CSV, XLSX, or JSON. Among these, CSV or spreadsheet formats are often preferred for evaluation datasets due to their simplicity and smaller size compared to training datasets.

Ragas, as an evaluation framework, supports these diverse preferences by letting you use your preferred file systems and formats for storing and reading datasets and experiment results.

To achieve this, Ragas introduces the concept of **plug-and-play backends** for data storage:

- Ragas provides default backends like `local/csv` and `google_drive/csv`.
- These backends are extensible, allowing you to implement custom backends for any file system or format (e.g., `box/csv`).
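With the default `local/csv` backend, this maps to a simple directory layout under your project root (a sketch based on the loading behaviour described in the next section; the file names are the examples used there):

```
root_dir/
├── datasets/
│   └── test_dataset.csv       # evaluation dataset
└── experiments/
    └── first_experiment.csv   # experiment results
```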

## Using Datasets and Results via the API

### Loading a Dataset

```python
from ragas_experimental import Dataset

test_dataset = Dataset.load(name="test_dataset", backend="local/csv", root_dir=".")
```

This loads a dataset named `test_dataset.csv` from the `datasets` directory under `root_dir`. The backend can be any backend registered with Ragas.
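Once loaded, the dataset's rows can be inspected or fed to your application. A minimal sketch, assuming the loaded `Dataset` can be iterated row by row (row attributes such as `query` and `ground_truth` mirror the CSV column headers; this is an assumption for illustration, not a documented guarantee):

```python
# Sketch: inspect a loaded dataset, assuming Dataset supports iteration.
for row in test_dataset:
    print(row.query, "->", row.ground_truth)
```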

### Loading Experiment Results

```python
from ragas_experimental import Experiment

experiment_results = Experiment.load(name="first_experiment", backend="local/csv", root_dir=".")
```

This loads experiment results named `first_experiment.csv` from the `experiments` directory under `root_dir`. The backend can be any backend registered with Ragas.

## Data Validation Using Pydantic

Ragas provides data type validation via Pydantic. You can configure a preferred `data_model` for a dataset or for experiment results to ensure data is validated before it is read from or written to storage.

**Example**:

```python
from ragas_experimental import Dataset
from pydantic import BaseModel

class MyDataset(BaseModel):
    query: str
    ground_truth: str

test_dataset = Dataset.load(name="test_dataset", backend="local/csv", root_dir=".", data_model=MyDataset)
```

This ensures that the data meets the specified type requirements, preventing invalid data from being read or written.
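For instance, if a stored row were missing the `ground_truth` column or held the wrong type, loading with the `data_model` above would be expected to fail validation rather than silently return bad rows. A hedged sketch (the exact exception type surfaced by Ragas is not specified here, hence the broad catch):

```python
# Sketch only: a data_model guards reads; the precise error type is an assumption.
try:
    bad_dataset = Dataset.load(
        name="test_dataset", backend="local/csv", root_dir=".", data_model=MyDataset
    )
except Exception as err:  # e.g. a Pydantic validation error propagated by the backend
    print(f"Dataset failed validation: {err}")
```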
docs/experimental/explanation/experimentation.md

Lines changed: 65 additions & 1 deletion

@@ -1 +1,65 @@
-# Experimentation for Improving AI Systems

# Experiments

## What is an experiment?

An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses.

### Principles of a Good Experiment

1. **Define measurable metrics**: Use metrics like accuracy, precision, or recall to quantify the impact of your changes.
2. **Systematic result storage**: Ensure results are stored in an organized manner for easy comparison and tracking.
3. **Isolate changes**: Make one change at a time to identify its specific impact. Avoid making multiple changes simultaneously, as this can obscure the results.
4. **Iterative process**: Follow a structured loop: *make a change → run evaluations → observe results → hypothesize the next change*, as illustrated below.

```mermaid
graph LR
    A[Make a change] --> B[Run evaluations]
    B --> C[Observe results]
    C --> D[Hypothesize next change]
    D --> A
```

## Experiments in Ragas

### Components of an Experiment

1. **Test dataset**: The data used to evaluate the system.
2. **Application endpoint**: The application, component, or model being tested.
3. **Metrics**: Quantitative measures to assess performance.

### Execution Process

Running an experiment involves:

1. Executing the dataset against the application endpoint.
2. Calculating metrics to quantify performance.
3. Returning and storing the results.
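Conceptually, these three steps boil down to a loop like the following. This is a sketch of the idea only, not the actual Ragas execution engine; `my_app`, `my_metric`, and `dataset` are placeholders mirroring the decorator example in the next section:

```python
# Conceptual sketch -- not the real Ragas internals.
results = []
for row in dataset:                                       # 1. run each row against the endpoint
    response = my_app(row.query)
    score = my_metric.score(response, row.ground_truth)   # 2. calculate metrics
    results.append({**row, "response": response, "accuracy": score.value})
# 3. results are returned and persisted (e.g. to the `experiments` folder)
```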

## Using the `@experiment` Decorator

The `@experiment` decorator in Ragas simplifies the orchestration, scaling, and storage of experiments. Here's an example:

```python
from ragas_experimental import experiment

# Define your metric and dataset
my_metric = ...
dataset = ...

@experiment
async def my_experiment(row):
    # Process the query through your application
    response = my_app(row.query)

    # Calculate the metric
    metric = my_metric.score(response, row.ground_truth)

    # Return results
    return {**row, "response": response, "accuracy": metric.value}

# Run the experiment
my_experiment.arun(dataset)
```

## Result Storage

Once executed, Ragas processes each row in the dataset, runs it through the function, and stores the results in the `experiments` folder. The storage backend can be configured based on your preferences.
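Stored results can later be reloaded for analysis with the same backend, as described in the datasets documentation. A sketch (the exact name under which a run is saved depends on how the experiment run is named; `first_experiment` is an example):

```python
from ragas_experimental import Experiment

# Sketch: reload a previous run from the `experiments` folder.
results = Experiment.load(name="first_experiment", backend="local/csv", root_dir=".")
```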
docs/experimental/explanation/index.md

Lines changed: 3 additions & 3 deletions

@@ -1,5 +1,5 @@
# 📚 Explanation

-1. [Metrics for Evaluating AI systems](metrics.md)
-2. [Experimentation for improving AI systems](experimentation.md)
-3. [Datasets preparation for evaluating AI systems](datasets.md)
+1. [Metrics](metrics.md)
+2. [Datasets and Experiment Results](datasets.md)
+3. [Experiments](experimentation.md)

docs/experimental/explanation/metrics.md

Lines changed: 14 additions & 6 deletions
@@ -1,4 +1,4 @@
-# Metrics for evaluating AI Applications
+# Metrics

## Why Metrics Matter

@@ -65,11 +65,13 @@ from ragas_experimental.metrics import numeric_metric
@numeric_metric(name="response_accuracy", allowed_values=(0, 1))
def my_metric(predicted: float, expected: float) -> float:
    return abs(predicted - expected) / max(expected, 1e-5)
+
+my_metric.score(predicted=0.8, expected=1.0)  # Returns a float value
```

-### 3. Ranked Metrics
+### 3. Ranking Metrics

-These evaluate multiple outputs at once and return a ranked list based on a defined criterion. They are useful when the goal is to compare outputs relative to one another.
+These evaluate multiple outputs at once and return a ranked list based on a defined criterion. They are useful when the goal is to compare multiple outputs from the same pipeline relative to one another.

```python
from ragas_experimental.metrics import ranked_metric
@@ -78,6 +80,8 @@ def my_metric(responses: list) -> list:
    response_lengths = [len(response) for response in responses]
    sorted_indices = sorted(range(len(response_lengths)), key=lambda i: response_lengths[i])
    return sorted_indices
+
+my_metric.score(responses=["short", "a bit longer", "the longest response"])  # Returns a ranked list of indices
```

## LLM-based vs. Non-LLM-based Metrics
@@ -103,9 +107,13 @@ These leverage LLMs (Large Language Models) to evaluate outcomes, typically useful

Example:
```python
-def my_metric(predicted: str, expected: str) -> str:
-    response = llm.generate(f"Evaluate semantic similarity between '{predicted}' and '{expected}'")
-    return "pass" if response > 5 else "fail"
+from ragas_experimental.metrics import DiscreteMetric
+
+my_metric = DiscreteMetric(
+    name="response_quality",
+    prompt="Evaluate the response based on the pass criteria: {pass_criteria}. Does the response meet the criteria? Return 'pass' or 'fail'.\nResponse: {response}",
+    allowed_values=["pass", "fail"]
+)
```

When to use:
