Commit 3d6a02d
docs: Expand steps on adding new benchmarks (docs and tests).
1 parent f7d7b98

1 file changed: +38 -7 lines

docs/add_new_benchmark_guide.md (38 additions, 7 deletions)
@@ -5,10 +5,12 @@ This guide provides comprehensive instructions for adding new benchmarks to the
 ## Overview
 
 The eval-framework supports two response types:
+
 1. **Completion Tasks** - Generate text completions (e.g., math problems, code generation)
 2. **Loglikelihood Tasks** - Multiple choice questions where the model ranks answer options
 
 For detailed information about implementing each task type, please refer to:
+
 - [Completion Task Guide](completion_task_guide.md) - Comprehensive guide for text generation tasks
 - [Loglikelihood Task Guide](loglikelihood_task_guide.md) - Detailed guide for multiple choice tasks
 
@@ -95,7 +97,6 @@ def post_process_generated_completion(self, completion_text: str, sample: Sample
 
 This section provides a complete reference for all configurations available when creating benchmarks.
 
-
 ### Response Types
 
 The response type determines how your model interacts with the task and what type of output is expected.
@@ -114,7 +115,6 @@ RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS
 
 Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals.
 
-
 #### Completion Metrics
 
 These metrics work with generated text outputs from COMPLETION tasks:
@@ -175,7 +175,6 @@ from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMas
 
 These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks:
 
-
 ```python
 from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle
 # Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. (English and German)
@@ -221,7 +220,6 @@ from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKn
 
 ```
 
-
 ## Implementation Examples and Patterns
 
 ### Practical Example: GeographyQATask
@@ -267,7 +265,6 @@ class GeographyQATask(BaseTask[str]):
         return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot)
 ```
 
-
 ### Add to Task Registry
 
 Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`:
@@ -280,32 +277,66 @@ The task will now be available through `get_task("GeographyQA")`.
 
 ### Testing your benchmark
 
-All tasks automatically go through formatting tests to ensure proper prompt generation. However, if your benchmark has specific functionality that needs testing, create a dedicated test file.
+All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically.
 
 #### Automatic Formatting Tests
-All benchmarks are automatically tested for proper prompt formatting across different chat templates. No additional setup required.
+
+All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.
+
+The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`.
+
+When you add a new task:
+
+1. Run the formatter hash test once for your task to generate/check hashes.
+2. If your task hash is new, it will be added to `task-prompts-hashes.json`.
+3. Commit the updated JSON file together with your task changes.
+
+Run the formatter hash test only for your newly created task (replace `YourTaskName`):
+
+```bash
+uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName"
+```
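
For orientation, the `SPECIAL_ARGS` entry mentioned in the added lines above might look roughly like the sketch below. This is an assumption for illustration only: the actual key and value structure of `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py` may differ, and the task name and argument are placeholders.

```python
# Hypothetical illustration only: the real SPECIAL_ARGS mapping in
# tests/tests_eval_framework/tasks/test_all_formatters.py may use different
# keys or value types than shown here.
SPECIAL_ARGS = {
    # Task name -> non-default init arguments the formatter test should use
    # when instantiating this task.
    "GeographyQA": {"num_fewshot": 2},
}
```
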
 
 #### Custom Task Tests (Optional)
+
 If your benchmark has specific logic that needs testing, create a test file in `tests/tasks/` to test it.
 
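
As a rough illustration, such a test file (for example `tests/tasks/test_geography_qa.py`, a file name assumed here) could start from something like the sketch below. The import path for `get_task` and the assertion are assumptions, not the framework's actual test conventions.

```python
# Hypothetical sketch of a dedicated task test; the module path for get_task
# and the exact behaviour checked are assumptions for illustration.
from eval_framework.tasks.task_names import get_task  # get_task is referenced in this guide


def test_geography_qa_is_registered() -> None:
    # The task added via register_all_tasks should be resolvable by name.
    assert get_task("GeographyQA") is not None
```
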
+### Update benchmark documentation
+
+After adding a benchmark, you also need to update task documentation:
+
+1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them).
+2. Regenerate the task docs:
+
+```bash
+uv run -m eval_framework.utils.generate_task_docs
+```
+
+This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`.
+
 ## Benchmark Examples by Task Type
 
 Study these existing benchmarks in the codebase for more complex patterns:
 
 #### Simple Classification Tasks
+
 - **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods
 - **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects
 
 #### Reasoning Tasks
+
 - **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns
 
 #### Code Generation
+
 - **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation
 - **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation
 
 #### Long Context Tasks
+
 - **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks
 
 #### Custom Format Tasks
+
 - **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation
 - **JSON/CSV Tasks:** Custom format validation examples
