Metrics define how your task's outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals.
#### Completion Metrics
These metrics work with generated text outputs from COMPLETION tasks:
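The original import list is not reproduced here; the sketch below only illustrates the import-plus-comment pattern used for the other metric groups in this guide. The module path and class name are assumptions rather than the framework's confirmed API, so check `src/eval_framework/metrics/` for the completion metrics that actually exist.

```python
# Illustrative pattern only -- the module path and class name below are assumptions,
# not confirmed API. Browse src/eval_framework/metrics/ for the real completion metrics.
from eval_framework.metrics.completion.accuracy import AccuracyCompletion
# Scores whether the generated completion matches the ground-truth answer.
```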
#### LLM Judge Metrics
These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks:
```python
from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle
# Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. (English and German)

from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKnowledge
```
## Implementation Examples and Patterns
### Practical Example: GeographyQATask
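The full example is not reproduced here. The sketch below keeps the original class signature, `GeographyQATask(BaseTask[str])`, but everything else — the import path and the attribute and method names — is an assumption for orientation only; mirror an existing task such as `src/eval_framework/tasks/arc.py` for the real structure.

```python
# Sketch only: apart from the class signature, the import path and the attribute and
# method names here are assumptions -- copy the structure of an existing task instead.
from eval_framework.tasks.base import BaseTask  # assumed import location


class GeographyQATask(BaseTask[str]):
    """Toy QA benchmark: answer short geography questions with a single word or phrase."""

    NAME = "GeographyQA"                    # assumed attribute names throughout
    DATASET_PATH = "your-org/geography-qa"  # hypothetical dataset identifier
    SAMPLE_SPLIT = "test"
    FEWSHOT_SPLIT = "train"

    def _get_instruction_text(self, item: dict) -> str:
        # Build the question prompt from a raw dataset row.
        return f"Question: {item['question']}\nAnswer:"

    def _get_ground_truth(self, item: dict) -> str:
        # The expected short answer that the metrics compare against.
        return item["answer"]
```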
Add a registration call for your new benchmark to `register_all_tasks` in `src/eval_framework/tasks/task_names.py`, mirroring the entries already in that function.
The task will then be available through `get_task("GeographyQA")`.
### Testing your benchmark
All tasks automatically go through formatting tests to ensure proper prompt generation. The formatting test lives in `tests/tests_eval_framework/tasks/test_all_formatters.py` and runs all registered tasks automatically.
#### Automatic Formatting Tests
All benchmarks are automatically tested for proper prompt formatting across different chat templates. If your new task needs non-default initialization arguments (for example, a specific `num_fewshot`), add an entry for your task to `SPECIAL_ARGS` in `tests/tests_eval_framework/tasks/test_all_formatters.py`.
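For example, an entry could look roughly like this; the shape of the value is an assumption, so match the entries already present in `SPECIAL_ARGS`.

```python
# tests/tests_eval_framework/tasks/test_all_formatters.py -- illustrative entry only;
# the value shape is an assumption, so match the entries already in SPECIAL_ARGS.
SPECIAL_ARGS = {
    # ... existing entries ...
    "GeographyQA": {"num_fewshot": 3},  # hypothetical task name and argument value
}
```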
The expected formatter outputs are tracked as hashes in `tests/tests_eval_framework/tasks/task-prompts-hashes.json`.
When you add a new task:
1. Run the formatter hash test once for your task to generate/check hashes.
2. If your task hash is new, it will be added to `task-prompts-hashes.json`.
3. Commit the updated JSON file together with your task changes.
Run the formatter hash test only for your newly created task (replace `YourTaskName`):
```bash
uv run pytest tests/tests_eval_framework/tasks/test_all_formatters.py -m formatter_hash -k "YourTaskName"
```
#### Custom Task Tests (Optional)
If your benchmark has logic beyond prompt formatting that needs its own tests, create a dedicated test file in `tests/tasks/`.
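A minimal custom test might look like the sketch below. `get_task` is referenced earlier in this guide, but its import location, the constructor arguments, and the assertions here are assumptions to adapt to your own task.

```python
# tests/tasks/test_geography_qa.py -- sketch only; the import location of get_task, the
# constructor arguments, and the asserted behaviour are assumptions to adapt to your task.
from eval_framework.tasks.task_names import get_task  # assumed import location


def test_geography_qa_is_registered() -> None:
    task_class = get_task("GeographyQA")
    task = task_class(num_fewshot=0)  # assumed constructor signature
    # Replace with assertions that exercise your benchmark-specific logic,
    # e.g. answer post-processing or custom metric wiring.
    assert task is not None
```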
### Update benchmark documentation
After adding a benchmark, you also need to update task documentation:
1. Manually add the new benchmark name(s) to `docs/benchmarks_and_metrics.md` (including `*_IDK` variants if your benchmark has them).
2. Regenerate the task docs:
```bash
uv run -m eval_framework.utils.generate_task_docs
```
This updates `docs/tasks/README.md` and creates per-task documentation files for new tasks in `docs/tasks/`.
## Benchmark Examples by Task Type
Study these existing benchmarks in the codebase for more complex patterns:
#### Simple Classification Tasks
- **ARC** (`src/eval_framework/tasks/arc.py`): Multiple choice with loglikelihoods
- **MMLU** (`src/eval_framework/tasks/mmlu.py`): Multi-subject classification with enum subjects
#### Reasoning Tasks
- **GSM8K** (`src/eval_framework/tasks/gsm8k.py`): Math reasoning with answer extraction patterns
#### Code Generation
- **HumanEval** (`src/eval_framework/tasks/human_eval.py`): Code completion with execution validation
- **MBPP** (`src/eval_framework/tasks/mbpp.py`): Code generation with comprehensive test validation
#### Long Context Tasks
- **InfiniteBench** (`src/eval_framework/tasks/infinite_bench_tasks.py`): Long context reasoning tasks
#### Custom Format Tasks
- **IFEval** (`src/eval_framework/tasks/ifeval.py`): Instruction following with format validation
- **JSON/CSV Tasks:** Custom format validation examples