
Commit ce380da

Migrate all leftover docs from experimental (#2243)
Migrate all remaining docs from experimental
1 parent e821105 commit ce380da

File tree

12 files changed: +643, -491 lines changed


docs/concepts/datasets.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
# Datasets and Experiment Results

When we evaluate AI systems, we typically work with two main types of data:

1. **Evaluation Datasets**: These are stored under the `datasets` directory.
2. **Evaluation Results**: These are stored under the `experiments` directory.
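
To make the split concrete, a project using the local backend with `root_dir="./data"` might end up with a layout along these lines. The file names below are purely illustrative; the exact files and extensions depend on the backend you use:

```
./data
├── datasets/
│   └── my_evaluation.csv    # evaluation dataset
└── experiments/
    └── baseline_v1.csv      # experiment results
```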
## Evaluation Datasets

A dataset for evaluations contains:

1. Inputs: a set of inputs that the system will process.
2. Expected outputs (optional): the expected outputs or responses from the system for the given inputs.
3. Metadata (optional): additional information that can be stored alongside the dataset.

For example, in a Retrieval-Augmented Generation (RAG) system, a dataset might include the query (the input to the system), grading notes (used to grade the system's output), and metadata such as query complexity.

Metadata is particularly useful for slicing and dicing the dataset, allowing you to analyze results across different facets. For instance, you might want to see how your system performs on complex queries versus simple ones, or how it handles different languages.
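
As a minimal sketch, a single sample for such a RAG dataset could look like the dictionary below; the field names (`grading_notes`, `complexity`, `language`) are illustrative rather than a required schema:

```python
# One illustrative RAG evaluation sample (field names are not prescribed by Ragas)
sample = {
    "query": "How do I rotate an API key without downtime?",   # input to the system
    "grading_notes": "Must mention issuing a new key before revoking the old one.",
    "metadata": {"complexity": "complex", "language": "en"},
}
```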
## Experiment Results

Experiment results include:

1. All attributes from the dataset.
2. The response from the evaluated system.
3. Results of metrics.
4. Optional metadata, such as a URI pointing to the system trace for a given input.

For example, in a RAG system, the results might include the query, grading notes, response, accuracy score (metric), a link to the system trace, and so on.
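
Continuing the RAG example, one result row might look roughly like this; again, the field names and values are illustrative:

```python
# One illustrative experiment-result row: the original sample plus the
# system response, a metric score, and a pointer to the trace.
result = {
    "query": "How do I rotate an API key without downtime?",
    "grading_notes": "Must mention issuing a new key before revoking the old one.",
    "response": "Create a new key, switch clients over, then revoke the old key.",
    "accuracy": 1.0,                      # metric result
    "trace_uri": "traces/run-042.json",   # optional metadata
}
```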
## Working with Datasets in Ragas

Ragas provides a `Dataset` class to work with evaluation datasets. Here's how you can use it:

### Creating a Dataset

```python
from ragas import Dataset

# Create a new dataset
dataset = Dataset(name="my_evaluation", backend="local/csv", root_dir="./data")

# Add a sample to the dataset
dataset.append({
    "id": "sample_1",
    "query": "What is the capital of France?",
    "expected_answer": "Paris",
    "metadata": {"complexity": "simple", "language": "en"}
})
```
### Loading an Existing Dataset

```python
# Load an existing dataset
dataset = Dataset.load(
    name="my_evaluation",
    backend="local/csv",
    root_dir="./data"
)
```
### Dataset Structure

Datasets in Ragas are flexible and can contain any fields you need for your evaluation. Common fields include:

- `id`: Unique identifier for each sample
- `query` or `input`: The input to your AI system
- `expected_output` or `ground_truth`: The expected response (if available)
- `metadata`: Additional information about the sample
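
For instance, building on the `append` call shown above, you could populate a dataset with samples that use these common fields. This is only a sketch, and none of the field names are mandatory:

```python
# Sketch: populate a dataset with a few samples using the common fields above.
samples = [
    {
        "id": "sample_1",
        "query": "What is the capital of France?",
        "expected_output": "Paris",
        "metadata": {"complexity": "simple", "language": "en"},
    },
    {
        "id": "sample_2",
        "query": "Quelle est la capitale de l'Allemagne ?",
        "expected_output": "Berlin",
        "metadata": {"complexity": "simple", "language": "fr"},
    },
]

for sample in samples:
    dataset.append(sample)
```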
### Best Practices for Dataset Creation

1. **Representative Samples**: Ensure your dataset represents the real-world scenarios your AI system will encounter.

2. **Balanced Distribution**: Include samples across different difficulty levels, topics, and edge cases.

3. **Quality Over Quantity**: It's better to have fewer high-quality, well-curated samples than many low-quality ones.

4. **Metadata Rich**: Include relevant metadata that allows you to analyze performance across different dimensions (see the sketch after this list).

5. **Version Control**: Track changes to your datasets over time to ensure reproducibility.
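
To illustrate practice 4, here is a minimal plain-Python sketch of slicing experiment results by a metadata facet. It assumes only that your results are dict-like rows with a `metadata` field and a numeric `accuracy` score, as in the earlier examples:

```python
from collections import defaultdict

# Group accuracy scores by query complexity and report the mean per group.
def mean_accuracy_by_complexity(results):
    groups = defaultdict(list)
    for row in results:
        complexity = row["metadata"].get("complexity", "unknown")
        groups[complexity].append(row["accuracy"])
    return {c: sum(scores) / len(scores) for c, scores in groups.items()}

# Example output: {"simple": 0.92, "complex": 0.71}
```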
## Dataset Storage and Management

### Local Storage

For local development and small datasets, you can use CSV files:

```python
dataset = Dataset(name="my_eval", backend="local/csv", root_dir="./datasets")
```

### Cloud Storage

For larger datasets or team collaboration, consider cloud backends:

```python
# Google Drive (experimental)
dataset = Dataset(name="my_eval", backend="gdrive", root_dir="folder_id")

# Other backends can be added as needed
```

### Dataset Versioning

Keep track of dataset versions for reproducible experiments:

```python
# Include version in dataset name
dataset = Dataset(name="my_eval_v1.2", backend="local/csv", root_dir="./datasets")
```
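
If you load datasets in several places, it can help to centralize this naming convention. The helper below is a hypothetical sketch built on the `Dataset.load` call shown earlier; it is not part of the Ragas API:

```python
from ragas import Dataset

# Hypothetical helper: encode the version in the dataset name so every run
# records exactly which dataset revision it evaluated against.
def load_versioned_dataset(name: str, version: str, root_dir: str = "./datasets") -> Dataset:
    return Dataset.load(name=f"{name}_v{version}", backend="local/csv", root_dir=root_dir)

dataset = load_versioned_dataset("my_eval", "1.2")
```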
## Integration with Evaluation Workflows

Datasets integrate seamlessly with Ragas evaluation workflows:

```python
from ragas import experiment, Dataset

# Load your dataset
dataset = Dataset.load(name="my_evaluation", backend="local/csv", root_dir="./data")

# Define your experiment
@experiment()
async def my_experiment(row):
    # Process the input through your AI system
    response = await my_ai_system(row["query"])

    # Return results for metric evaluation
    return {
        **row,  # Include original data
        "response": response,
        "experiment_name": "baseline_v1"
    }

# Run evaluation on the dataset
results = await my_experiment.arun(dataset)
```
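
Here `my_experiment` receives each dataset row as a dict-like object, and `my_ai_system` is a placeholder for whatever function or client calls your own application; substitute your own call when adapting the example.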
This integration allows you to maintain a clear separation between your test data (datasets) and your evaluation results (experiments), making it easier to track progress and compare different approaches.

0 commit comments
