# Datasets and Experiment Results

When we evaluate AI systems, we typically work with two main types of data:

1. **Evaluation Datasets**: These are stored under the `datasets` directory.
2. **Evaluation Results**: These are stored under the `experiments` directory.

## Evaluation Datasets

A dataset for evaluations contains:

1. Inputs: a set of inputs that the system will process.
2. Expected outputs (optional): the expected outputs or responses from the system for the given inputs.
3. Metadata (optional): additional information that can be stored alongside the dataset.

For example, in a Retrieval-Augmented Generation (RAG) system, a dataset might include a query (the input to the system), grading notes (used to grade the system's output), and metadata such as query complexity.
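
As a rough sketch (the field names here are illustrative, not a required schema), a single row of such a RAG dataset could look like this:

```python
# A hypothetical row from a RAG evaluation dataset.
# "query", "grading_notes", and the metadata keys are example names, not a fixed schema.
sample = {
    "query": "What is the refund policy for international orders?",
    "grading_notes": "Should mention the 30-day window and that shipping fees are non-refundable.",
    "metadata": {"complexity": "complex", "language": "en"},
}
```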

Metadata is particularly useful for slicing and dicing the dataset, allowing you to analyze results across different facets. For instance, you might want to see how your system performs on complex queries versus simple ones, or how it handles different languages.
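
As a small illustration of that kind of slicing (the rows, field names, and scores below are made up), you can group per-sample metric scores by a metadata facet with plain Python:

```python
from collections import defaultdict

# Hypothetical per-sample results: "complexity" comes from the dataset's metadata
# and "accuracy" from a metric. All values here are made up for illustration.
results = [
    {"metadata": {"complexity": "simple"}, "accuracy": 1.0},
    {"metadata": {"complexity": "complex"}, "accuracy": 0.5},
    {"metadata": {"complexity": "complex"}, "accuracy": 0.75},
]

# Group accuracy scores by the "complexity" facet and report the mean per group.
by_complexity = defaultdict(list)
for row in results:
    by_complexity[row["metadata"]["complexity"]].append(row["accuracy"])

for facet, scores in sorted(by_complexity.items()):
    print(f"{facet}: mean accuracy = {sum(scores) / len(scores):.2f}")
```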

## Experiment Results

Experiment results include:

1. All attributes from the dataset.
2. The response from the evaluated system.
3. The results of each metric.
4. Optional metadata, such as a URI pointing to the system trace for a given input.

For example, in a RAG system the results might include the query, grading notes, the response, an accuracy score (the metric result), and a link to the system trace.
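
A single result row could then look roughly like this (field names are illustrative, and the trace URI is a placeholder):

```python
# A hypothetical experiment result row: the original dataset fields plus the
# system's response, a metric result, and a pointer to the trace.
result_row = {
    "query": "What is the refund policy for international orders?",
    "grading_notes": "Should mention the 30-day window and that shipping fees are non-refundable.",
    "response": "Refunds are accepted within 30 days; original shipping fees are not refunded.",
    "accuracy": 1.0,  # metric result
    "trace_uri": "https://tracing.example.com/traces/abc123",  # placeholder link to the system trace
}
```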

## Working with Datasets in Ragas

Ragas provides a `Dataset` class to work with evaluation datasets. Here's how you can use it:

### Creating a Dataset

```python
from ragas import Dataset

# Create a new dataset
dataset = Dataset(name="my_evaluation", backend="local/csv", root_dir="./data")

# Add a sample to the dataset
dataset.append({
    "id": "sample_1",
    "query": "What is the capital of France?",
    "expected_answer": "Paris",
    "metadata": {"complexity": "simple", "language": "en"}
})
```

### Loading an Existing Dataset

```python
# Load an existing dataset
dataset = Dataset.load(
    name="my_evaluation",
    backend="local/csv",
    root_dir="./data"
)
```

### Dataset Structure

Datasets in Ragas are flexible and can contain any fields you need for your evaluation. Common fields include:

- `id`: Unique identifier for each sample
- `query` or `input`: The input to your AI system
- `expected_output` or `ground_truth`: The expected response (if available)
- `metadata`: Additional information about the sample
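
Because the schema is flexible, it can help to sanity-check samples before appending them. A minimal sketch (the required-field set below is just an example, not part of the Ragas API) might look like this:

```python
# Minimal sanity check before adding samples to a dataset.
# REQUIRED_FIELDS is an example; adjust it to whatever your evaluation needs.
REQUIRED_FIELDS = {"id", "query"}

def validate_sample(sample: dict) -> None:
    missing = REQUIRED_FIELDS - sample.keys()
    if missing:
        raise ValueError(f"sample {sample.get('id', '<no id>')!r} is missing fields: {sorted(missing)}")

validate_sample({"id": "sample_1", "query": "What is the capital of France?"})  # passes silently
```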

### Best Practices for Dataset Creation

1. **Representative Samples**: Ensure your dataset represents the real-world scenarios your AI system will encounter.

2. **Balanced Distribution**: Include samples across different difficulty levels, topics, and edge cases.

3. **Quality Over Quantity**: It's better to have fewer high-quality, well-curated samples than many low-quality ones.

4. **Rich Metadata**: Include relevant metadata that allows you to analyze performance across different dimensions.

5. **Version Control**: Track changes to your datasets over time to ensure reproducibility.

## Dataset Storage and Management

### Local Storage

For local development and small datasets, you can use CSV files:

```python
dataset = Dataset(name="my_eval", backend="local/csv", root_dir="./datasets")
```

### Cloud Storage

For larger datasets or team collaboration, consider cloud backends:

```python
# Google Drive (experimental)
dataset = Dataset(name="my_eval", backend="gdrive", root_dir="folder_id")

# Other backends can be added as needed
```

### Dataset Versioning

Keep track of dataset versions for reproducible experiments:

```python
# Include version in dataset name
dataset = Dataset(name="my_eval_v1.2", backend="local/csv", root_dir="./datasets")
```

## Integration with Evaluation Workflows

Datasets integrate seamlessly with Ragas evaluation workflows:

```python
from ragas import experiment, Dataset

# Load your dataset
dataset = Dataset.load(name="my_evaluation", backend="local/csv", root_dir="./data")

# Define your experiment
@experiment()
async def my_experiment(row):
    # Process the input through your AI system
    response = await my_ai_system(row["query"])

    # Return results for metric evaluation
    return {
        **row,  # Include original data
        "response": response,
        "experiment_name": "baseline_v1"
    }

# Run evaluation on the dataset
results = await my_experiment.arun(dataset)
```

This integration allows you to maintain a clear separation between your test data (datasets) and your evaluation results (experiments), making it easier to track progress and compare different approaches.