- Consolidate 'When to Use' and 'How It Works' into concise 'Overview'
- Reorganize advanced topics into tabbed sections for better readability
- Simplify configuration examples and reduce redundant explanations
- Replace verbose 'Next Steps' with compact 'Related Topics' links

Automatically evaluate and compare multiple models or AI agents without pre-existing test data. This end-to-end pipeline generates test queries, collects responses, and ranks models through pairwise comparison.

## Overview

Zero-shot evaluation is ideal for **model comparison**, **agent pipeline testing**, **new domain evaluation**, and **rapid prototyping**—all without preparing test data upfront.
!!! tip "No Test Data Required"
    Unlike traditional evaluation, zero-shot evaluation generates its own test queries from the task description, eliminating the need for pre-existing test datasets.
The pipeline automates five steps: generate test queries → collect responses → create evaluation rubrics → run pairwise comparisons → produce rankings.
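
The final two steps reduce to simple bookkeeping over the judge's verdicts: count each model's pairwise wins and sort by win rate. A minimal, self-contained sketch of that calculation (the comparison data is invented for illustration; this is not the pipeline's actual code):

```python
from collections import defaultdict

# Hypothetical judge verdicts: one (model_a, model_b, winner) record per
# pairwise comparison. In the real pipeline these come from the judge model.
comparisons = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-c", "model-c"),
    ("model-b", "model-c", "model-c"),
    ("model-a", "model-b", "model-b"),
]

wins, games = defaultdict(int), defaultdict(int)
for a, b, winner in comparisons:
    games[a] += 1
    games[b] += 1
    wins[winner] += 1

# Win rate = wins / pairwise games played; higher is better.
ranking = sorted(games, key=lambda m: wins[m] / games[m], reverse=True)
for model in ranking:
    print(f"{model}: win rate {wins[model] / games[model]:.2f}")
```

The table below lists the component behind each step.
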
| Step | Component | Description |
|------|-----------|-------------|

Get started with Zero-Shot Evaluation in just a few lines of code.

All methods require a YAML configuration file. Here's a complete example:
```yaml
# Task description
# ...

output:
  # ...
```

Use `${ENV_VAR}` syntax to reference environment variables for sensitive data like API keys.
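
For reference, `${ENV_VAR}` placeholders work like shell-style variable expansion. The sketch below shows the general idea using Python's `os.path.expandvars`; the field names are made up, and this is not the tool's actual config loader:

```python
import os

import yaml  # PyYAML: pip install pyyaml

raw_config = """
judge:
  api_key: "${OPENAI_API_KEY}"   # hypothetical field; the value comes from the environment
"""

# Expand ${...} placeholders from the environment, then parse the YAML.
config = yaml.safe_load(os.path.expandvars(raw_config))
print(config["judge"]["api_key"])
```
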
## Component Guide
For fine-grained control, use individual pipeline components directly. The workflow below shows how each component connects:
- "constraints" # Add time, scope, or condition constraints
293
+
- "reasoning" # Require multi-step reasoning
294
+
- "edge_cases" # Include edge cases
295
+
```
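
Conceptually, these query types steer the query-generation prompt. A rough sketch of that idea (the helper and prompt wording below are invented for illustration, not the pipeline's actual prompts):

```python
def build_query_prompt(task: str, query_types: list[str], n: int = 10) -> str:
    """Compose a query-generation prompt from the task and the selected query types."""
    hints = {
        "constraints": "add time, scope, or condition constraints",
        "reasoning": "require multi-step reasoning",
        "edge_cases": "include edge cases",
    }
    wanted = "; ".join(hints[t] for t in query_types if t in hints)
    return (
        f"Generate {n} diverse, representative test queries for this task:\n"
        f"{task}\n"
        f"Make sure the queries {wanted}."
    )

print(build_query_prompt(
    "Translate oncology reports from Chinese to English",
    ["constraints", "edge_cases"],
))
```
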

=== "Evaluation Report"

    Generate a comprehensive Markdown report with concrete examples:

    ```yaml
    report:
      enabled: true          # Enable report generation
      language: "zh"         # Report language: "zh" (Chinese) or "en" (English)
      include_examples: 3    # Number of examples per section (1-10)
    ```

    The generated report includes four sections—**Executive Summary**, **Ranking Explanation**, **Model Analysis**, and **Representative Cases**—each produced in parallel for efficiency.

    | Section | Description |
    |---------|-------------|
    | **Executive Summary** | Overview of evaluation purpose, methodology, and key findings |
    | **Ranking Explanation** | Detailed analysis of why models are ranked in this order |

All results are saved to the output directory:

```
├── comparison_details.json   # All pairwise comparison details
├── evaluation_results.json   # Final rankings and statistics
├── queries.json              # Generated test queries
├── responses.json            # Model responses
└── rubrics.json              # Evaluation criteria
```
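
If you want to post-process the rankings yourself, the JSON files can be read directly. A minimal sketch (the directory name and the JSON schema are assumptions; check your own `output` settings and files):

```python
import json
from pathlib import Path

output_dir = Path("outputs")  # assumed path; use the directory configured in your YAML

results = json.loads((output_dir / "evaluation_results.json").read_text(encoding="utf-8"))
# Inspect the raw structure first; the exact schema is not documented in this excerpt.
print(json.dumps(results, indent=2, ensure_ascii=False))
```
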

!!! tip "Complete Example Report"
    View a real evaluation report: [Oncology Medical Translation Evaluation Report](sample_reports/oncology_translation_report.md)—comparing three models (qwen-plus, qwen3-32b, qwen-turbo) on Chinese-to-English translation in the medical oncology domain.