# Dataset Preset Testing

Unit tests for dataset preset transforms. These tests verify that presets correctly transform dataset columns without requiring end-to-end benchmark runs.

## Quick Start

```bash
# Run all preset tests
pytest tests/unit/dataset_manager/test_dataset_presets.py -v

# Run tests for a specific dataset
pytest tests/unit/dataset_manager/test_dataset_presets.py::TestCNNDailyMailPresets -v

# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v
```

## Preset Coverage

| Dataset | Presets | Tests |
|---------|---------|-------|
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 6 |
| AIME25 | `gptoss` | 3 |
| GPQA | `gptoss` | 3 |
| LiveCodeBench | `gptoss` | 3 |
| OpenOrca | `llama2_70b` | 3 |
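
If the per-dataset test classes grow repetitive, a single parametrized smoke test can cover instantiation for every preset in the table above. A sketch only: the stub classes below stand in for the real imports from `inference_endpoint.dataset_manager.predefined.*`, and the names are assumptions, not the project's API.

```python
import pytest

# Stand-in for a real dataset class; in the actual suite this would be
# imported from inference_endpoint.dataset_manager.predefined.* (assumed layout).
class _StubPresets:
    def gptoss(self):
        return ["stub-transform"]  # placeholder for a list of Transforms

class AIME25:
    PRESETS = _StubPresets()

# One (dataset class, preset name) pair per row of the coverage table.
PRESET_CASES = [(AIME25, "gptoss")]

@pytest.mark.parametrize("dataset_cls, preset_name", PRESET_CASES)
def test_preset_instantiates(dataset_cls, preset_name):
    """Smoke test: every listed preset is a zero-argument callable."""
    transforms = getattr(dataset_cls.PRESETS, preset_name)()
    assert transforms is not None
    assert len(transforms) > 0
```

This keeps the coverage table and the test file in sync with one list edit per new preset.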

## Adding Tests for New Presets

When adding a new dataset preset, add a test class to `tests/unit/dataset_manager/test_dataset_presets.py`:

```python
import pandas as pd
import pytest
from inference_endpoint.dataset_manager.transforms import apply_transforms
from inference_endpoint.dataset_manager.predefined.my_dataset import MyDataset


class TestMyDatasetPresets:
    @pytest.fixture
    def sample_data(self):
        """Minimal sample data matching dataset schema."""
        return pd.DataFrame({
            "input_col1": ["value1"],
            "input_col2": ["value2"],
        })

    def test_my_preset_instantiation(self):
        """Verify preset can be created."""
        transforms = MyDataset.PRESETS.my_preset()
        assert transforms is not None
        assert len(transforms) > 0

    def test_my_preset_transforms_apply(self, sample_data):
        """Verify transforms apply without errors."""
        transforms = MyDataset.PRESETS.my_preset()
        result = apply_transforms(sample_data, transforms)
        assert result is not None
        assert len(result) == len(sample_data)
        assert "prompt" in result.columns  # Expected output column

    def test_my_preset_output_format(self, sample_data):
        """Verify output has expected format."""
        transforms = MyDataset.PRESETS.my_preset()
        result = apply_transforms(sample_data, transforms)

        # Validate format-specific expectations
        assert len(result["prompt"][0]) > 0
```

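Conceptually, `apply_transforms` reduces a list of transforms over the input DataFrame, which is why tiny one-row fixtures are enough to exercise a preset. A self-contained sketch of that pattern — the `make_prompt` transform and this simplified `apply_transforms` are illustrative stand-ins, not the project's real implementation:

```python
import pandas as pd

def make_prompt(df: pd.DataFrame) -> pd.DataFrame:
    # Made-up transform: build a "prompt" column from an input column.
    out = df.copy()
    out["prompt"] = "Question: " + out["question"] + "\nAnswer:"
    return out

def apply_transforms(df: pd.DataFrame, transforms) -> pd.DataFrame:
    # Simplified model: each transform maps a DataFrame to a new DataFrame.
    for transform in transforms:
        df = transform(df)
    return df

data = pd.DataFrame({"question": ["What is AI?"]})
result = apply_transforms(data, [make_prompt])
assert "prompt" in result.columns and len(result) == 1
```

The unit tests above assert on exactly these properties: the output row count matches the input, and the expected columns exist.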
If the preset uses the `Harmonize` transform (which requires the `transformers` library), mark its tests as slow:

```python
@pytest.mark.slow
def test_my_preset_transforms_apply(self, sample_data):
    # Test that requires the transformers library
    pass
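The `-m "not slow"` filter in Quick Start relies on the `slow` marker being registered with pytest. If the project does not already declare it, a minimal sketch (file location and wording assumed):

```ini
# pytest.ini — or the equivalent [tool.pytest.ini_options] table in pyproject.toml
[pytest]
markers =
    slow: tests that need optional heavy dependencies (e.g. transformers)
```

Registering markers also lets `pytest --strict-markers` catch typos in marker names.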

## Test Scope

✅ **Tests verify:**
- Preset instantiation
- Transform application without errors
- Required output columns exist
- Data is properly transformed

❌ **Tests do NOT verify:**
- Model inference accuracy
- API endpoint compatibility
- Throughput/latency metrics
- Full benchmark runs

See `src/inference_endpoint/dataset_manager/README.md` for dataset schema and preset creation details.