
Commit b961fe0

docs(zero-shot): Streamline evaluation documentation structure
- Consolidate 'When to Use' and 'How It Works' into concise 'Overview'
- Reorganize advanced topics into tabbed sections for better readability
- Simplify configuration examples and reduce redundant explanations
- Replace verbose 'Next Steps' with compact 'Related Topics' links
1 parent 0faaff8 commit b961fe0

File tree: 1 file changed, +85 -131 lines

docs/applications/zero_shot_evaluation.md

Lines changed: 85 additions & 131 deletions
@@ -3,31 +3,14 @@
 Automatically evaluate and compare multiple models or AI agents without pre-existing test data. This end-to-end pipeline generates test queries, collects responses, and ranks models through pairwise comparison.


-## When to Use
+## Overview

-Use zero-shot evaluation for:
-
-- **Model Comparison** — Compare different models on a specific task without preparing test data
-- **Agent Pipeline Testing** — Evaluate different agent configurations or workflows
-- **New Domain Evaluation** — Quickly assess model performance in new domains
-- **Rapid Prototyping** — Get quick feedback on model quality during development
-
-
-## How It Works
-
-Zero-shot evaluation automates the entire evaluation pipeline:
-
-1. **Generate Test Queries** — Create diverse, representative queries based on task description
-2. **Collect Responses** — Query all target models/agents to collect responses
-3. **Generate Rubrics** — Create evaluation criteria tailored to the task
-4. **Pairwise Comparison** — Compare all response pairs using a judge model
-5. **Rank Models** — Calculate win rates and produce final rankings
+Zero-shot evaluation is ideal for **model comparison**, **agent pipeline testing**, **new domain evaluation**, and **rapid prototyping**—all without preparing test data upfront.

 !!! tip "No Test Data Required"
     Unlike traditional evaluation, zero-shot evaluation generates its own test queries from the task description, eliminating the need for pre-existing test datasets.

-
-## Five-Step Pipeline
+The pipeline automates five steps: generate test queries → collect responses → create evaluation rubrics → run pairwise comparisons → produce rankings.

 | Step | Component | Description |
 |------|-----------|-------------|
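The ranking step this hunk summarizes (pairwise comparison → win rates → rankings) reduces to a simple aggregation over judge verdicts. Below is a minimal illustrative sketch of that idea, not the library's actual implementation; the function name and the comparison-tuple format are hypothetical simplifications.

```python
# Sketch: aggregate pairwise comparison outcomes into win rates and rank models.
from collections import defaultdict
from typing import Dict, List, Tuple

def rank_by_win_rate(comparisons: List[Tuple[str, str, str]]) -> List[Tuple[str, float]]:
    """comparisons: (model_a, model_b, winner) for every judged pair; ties count as no win."""
    wins: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for model_a, model_b, winner in comparisons:
        total[model_a] += 1
        total[model_b] += 1
        if winner in (model_a, model_b):
            wins[winner] += 1
    win_rates = {m: wins[m] / total[m] for m in total}
    # Best model first, mirroring the pipeline's final rankings
    return sorted(win_rates.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: one candidate wins 4 of 5 head-to-head comparisons.
print(rank_by_win_rate([
    ("qwen_candidate", "gpt4_baseline", "qwen_candidate"),
    ("qwen_candidate", "gpt4_baseline", "qwen_candidate"),
    ("qwen_candidate", "gpt4_baseline", "qwen_candidate"),
    ("qwen_candidate", "gpt4_baseline", "qwen_candidate"),
    ("qwen_candidate", "gpt4_baseline", "gpt4_baseline"),
]))
# [('qwen_candidate', 0.8), ('gpt4_baseline', 0.2)]
```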
@@ -102,10 +85,7 @@ Get started with Zero-Shot Evaluation in just a few lines of code. Choose the ap
 python -m cookbooks.zero_shot_evaluation --config config.yaml --queries_file queries.json --save
 ```

-
-## Configuration
-
-Create a YAML configuration file to define your evaluation:
+All methods require a YAML configuration file. Here's a complete example:

 ```yaml
 # Task description
@@ -159,9 +139,9 @@ output:
 Use `${ENV_VAR}` syntax to reference environment variables for sensitive data like API keys.


-## Step-by-Step Guide
+## Component Guide

-For fine-grained control over the evaluation process, you can use individual pipeline components directly. The workflow below shows how each component connects:
+For fine-grained control, use individual pipeline components directly. The workflow below shows how each component connects:

 <div class="workflow-single">
 <div class="workflow-header">Pipeline Components</div>
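The `${ENV_VAR}` convention mentioned above keeps API keys out of the config file. How the pipeline expands those references internally is not shown in this diff; the loader below is only an illustrative assumption using the standard library and PyYAML, with `MY_API_KEY` as a hypothetical variable name.

```python
# Sketch: load a YAML config and expand ${ENV_VAR} references from the environment.
import os
import yaml  # PyYAML

def load_config(path: str) -> dict:
    """Read a YAML file, substituting ${VAR} / $VAR with environment values."""
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()
    # os.path.expandvars performs the substitution; unset variables are left untouched.
    return yaml.safe_load(os.path.expandvars(raw))

# Hypothetical usage: a line like `api_key: ${MY_API_KEY}` in config.yaml
# resolves to the value of the MY_API_KEY environment variable.
# config = load_config("config.yaml")
```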
@@ -256,128 +236,108 @@ Use `ZeroShotPipeline` to orchestrate the full evaluation, comparing all respons
 ```


-## Understanding Results
+## Advanced Topics

-The `EvaluationResult` provides comprehensive ranking statistics:
+=== "Understanding Results"

-| Field | Type | Description |
-|-------|------|-------------|
-| `rankings` | `List[Tuple[str, float]]` | Models sorted by win rate (best first) |
-| `win_rates` | `Dict[str, float]` | Win rate for each model (0.0-1.0) |
-| `win_matrix` | `Dict[str, Dict[str, float]]` | Head-to-head win rates between models |
-| `best_pipeline` | `str` | Model with highest win rate |
-| `total_queries` | `int` | Total number of test queries |
-| `total_comparisons` | `int` | Total number of pairwise comparisons |
+The `EvaluationResult` provides comprehensive ranking statistics:

-!!! example "Sample Output"
-```
-============================================================
-ZERO-SHOT EVALUATION RESULTS
-============================================================
-Task: English to Chinese translation assistant...
-Queries: 20
-Comparisons: 80
-
-Rankings:
-1. qwen_candidate [################----] 80.0%
-2. gpt4_baseline [########------------] 40.0%
-
-Win Matrix (row vs column):
-qwen_cand gpt4_base
-qwen_candidate | -- 80.0%
-gpt4_baseline | 20.0% --
-
-Best Pipeline: qwen_candidate
-============================================================
-```
+| Field | Type | Description |
+|-------|------|-------------|
+| `rankings` | `List[Tuple[str, float]]` | Models sorted by win rate (best first) |
+| `win_rates` | `Dict[str, float]` | Win rate for each model (0.0-1.0) |
+| `win_matrix` | `Dict[str, Dict[str, float]]` | Head-to-head win rates between models |
+| `best_pipeline` | `str` | Model with highest win rate |
+| `total_queries` | `int` | Total number of test queries |
+| `total_comparisons` | `int` | Total number of pairwise comparisons |

+!!! example "Sample Output"
+```
+============================================================
+ZERO-SHOT EVALUATION RESULTS
+============================================================
+Task: English to Chinese translation assistant...
+Queries: 20 | Comparisons: 80

-## Advanced Configuration
+Rankings:
+1. qwen_candidate [################----] 80.0%
+2. gpt4_baseline [########------------] 40.0%

-Fine-tune query generation behavior through these configuration options:
+Best Pipeline: qwen_candidate
+============================================================
+```

-| Option | Default | Description |
-|--------|---------|-------------|
-| `num_queries` | 20 | Total number of queries to generate |
-| `queries_per_call` | 10 | Queries per API call (1-50) |
-| `num_parallel_batches` | 3 | Number of parallel generation batches |
-| `temperature` | 0.9 | Sampling temperature for diversity |
-| `max_similarity` | 0.85 | Deduplication similarity threshold |
-| `enable_evolution` | false | Enable Evol-Instruct complexity evolution |
-| `evolution_rounds` | 1 | Number of evolution rounds (0-3) |
+=== "Query Generation Options"

-??? tip "Enable Evol-Instruct for Harder Queries"
-Evol-Instruct progressively increases query complexity to stress-test your models. Enable it to generate more challenging test cases:
+Fine-tune query generation behavior:

-```yaml
-query_generation:
-enable_evolution: true
-evolution_rounds: 2
-complexity_levels:
-- "constraints" # Add time, scope, or condition constraints
-- "reasoning" # Require multi-step reasoning or comparison
-- "edge_cases" # Include edge cases and unusual conditions
-```
+| Option | Default | Description |
+|--------|---------|-------------|
+| `num_queries` | 20 | Total number of queries to generate |
+| `queries_per_call` | 10 | Queries per API call (1-50) |
+| `num_parallel_batches` | 3 | Number of parallel generation batches |
+| `temperature` | 0.9 | Sampling temperature for diversity |
+| `max_similarity` | 0.85 | Deduplication similarity threshold |
+| `enable_evolution` | false | Enable Evol-Instruct complexity evolution |
+| `evolution_rounds` | 1 | Number of evolution rounds (0-3) |

+??? tip "Enable Evol-Instruct for Harder Queries"
+Evol-Instruct progressively increases query complexity:

-## Evaluation Report
+```yaml
+query_generation:
+enable_evolution: true
+evolution_rounds: 2
+complexity_levels:
+- "constraints" # Add time, scope, or condition constraints
+- "reasoning" # Require multi-step reasoning
+- "edge_cases" # Include edge cases
+```

-The pipeline can generate a comprehensive Markdown report explaining the evaluation results with concrete examples. Enable it by adding a `report` section to your configuration:
-
-```yaml
-report:
-enabled: true # Enable report generation
-language: "zh" # Report language: "zh" (Chinese) or "en" (English)
-include_examples: 3 # Number of examples per section (1-10)
-```
+=== "Evaluation Report"

-The generated report includes four sections—**Executive Summary**, **Ranking Explanation**, **Model Analysis**, and **Representative Cases**—each produced in parallel for efficiency.
+Generate a comprehensive Markdown report with concrete examples:

-| Section | Description |
-|---------|-------------|
-| **Executive Summary** | Overview of evaluation purpose, methodology, and key findings |
-| **Ranking Explanation** | Detailed analysis of why models are ranked in this order |
-| **Model Analysis** | Per-model strengths, weaknesses, and improvement suggestions |
-| **Representative Cases** | Concrete comparison examples with evaluation reasons |
+```yaml
+report:
+enabled: true # Enable report generation
+language: "zh" # "zh" (Chinese) or "en" (English)
+include_examples: 3 # Examples per section (1-10)
+```

-When report generation is enabled, all results are saved to the output directory:
+The report includes **Executive Summary**, **Ranking Explanation**, **Model Analysis**, and **Representative Cases**.

-```
-evaluation_results/
-├── evaluation_report.md # Generated Markdown report
-├── comparison_details.json # All pairwise comparison details
-├── evaluation_results.json # Final rankings and statistics
-├── queries.json # Generated test queries
-├── responses.json # Model responses
-└── rubrics.json # Evaluation criteria
-```
+All results are saved to the output directory:

-!!! tip "Complete Example Report"
-View a real evaluation report: [Oncology Medical Translation Evaluation Report](sample_reports/oncology_translation_report.md)—comparing three models (qwen-plus, qwen3-32b, qwen-turbo) on Chinese-to-English translation in the medical oncology domain.
+```
+evaluation_results/
+├── evaluation_report.md # Generated Markdown report
+├── comparison_details.json # All pairwise comparison details
+├── evaluation_results.json # Final rankings and statistics
+├── queries.json # Generated test queries
+├── responses.json # Model responses
+└── rubrics.json # Evaluation criteria
+```

+!!! tip "Example Report"
+View a real report: [Oncology Medical Translation Evaluation](sample_reports/oncology_translation_report.md)

-## Checkpoint & Resume
+=== "Checkpoint & Resume"

-Evaluations automatically save checkpoints, allowing resumption after interruptions:
+Evaluations automatically save checkpoints for resumption after interruptions:

-```bash
-# First run (interrupted)
-python -m cookbooks.zero_shot_evaluation --config config.yaml --save
-# Progress saved at: ./evaluation_results/checkpoint.json
+```bash
+# First run (interrupted)
+python -m cookbooks.zero_shot_evaluation --config config.yaml --save

-# Resume from checkpoint (automatic)
-python -m cookbooks.zero_shot_evaluation --config config.yaml --save
-# Resumes from last completed step
+# Resume from checkpoint (automatic)
+python -m cookbooks.zero_shot_evaluation --config config.yaml --save

-# Start fresh (ignore checkpoint)
-python -m cookbooks.zero_shot_evaluation --config config.yaml --fresh --save
-```
+# Start fresh (ignore checkpoint)
+python -m cookbooks.zero_shot_evaluation --config config.yaml --fresh --save
+```

-!!! info "Checkpoint Stages"
-1. `QUERIES_GENERATED` — Test queries saved
-2. `RESPONSES_COLLECTED` — All responses saved
-3. `RUBRICS_GENERATED` — Evaluation rubrics saved
-4. `EVALUATION_COMPLETE` — Final results saved
+Checkpoint stages: `QUERIES_GENERATED` → `RESPONSES_COLLECTED` → `RUBRICS_GENERATED` → `EVALUATION_COMPLETE`


 ## Best Practices
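For reference, the `EvaluationResult` fields documented in this hunk can be pictured as a plain data structure. The dataclass below is a hypothetical stand-in that only mirrors the documented field names and types (the real class lives in the library and may differ), populated with the sample-output numbers shown above.

```python
# Sketch: a stand-in for the documented EvaluationResult shape.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class EvaluationResultSketch:
    rankings: List[Tuple[str, float]]        # models sorted by win rate, best first
    win_rates: Dict[str, float]              # per-model win rate, 0.0-1.0
    win_matrix: Dict[str, Dict[str, float]]  # head-to-head win rates between models
    best_pipeline: str                       # model with highest win rate
    total_queries: int
    total_comparisons: int

# Values taken from the sample output in the diff above.
result = EvaluationResultSketch(
    rankings=[("qwen_candidate", 0.8), ("gpt4_baseline", 0.4)],
    win_rates={"qwen_candidate": 0.8, "gpt4_baseline": 0.4},
    win_matrix={"qwen_candidate": {"gpt4_baseline": 0.8},
                "gpt4_baseline": {"qwen_candidate": 0.2}},
    best_pipeline="qwen_candidate",
    total_queries=20,
    total_comparisons=80,
)

for rank, (name, rate) in enumerate(result.rankings, start=1):
    print(f"{rank}. {name}: {rate:.1%}")
# 1. qwen_candidate: 80.0%
# 2. gpt4_baseline: 40.0%
```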
@@ -395,12 +355,6 @@ python -m cookbooks.zero_shot_evaluation --config config.yaml --fresh --save
 - Skip checkpoint resumption for long-running evaluations
 - Compare models with fundamentally different capabilities (e.g., text vs vision)

-
-## Next Steps
-
-- [Pairwise Evaluation](select_rank.md) — Compare models with pre-existing test data
-- [Refine Data Quality](data_refinement.md) — Use grader feedback to improve outputs
-- [Create Custom Graders](../building_graders/create_custom_graders.md) — Build specialized evaluation criteria
-- [Run Grading Tasks](../running_graders/run_tasks.md) — Scale evaluations with GradingRunner
+**Related Topics:** [Pairwise Evaluation](select_rank.md) · [Refine Data Quality](data_refinement.md) · [Create Custom Graders](../building_graders/create_custom_graders.md) · [Run Grading Tasks](../running_graders/run_tasks.md)
