
Commit bd0eed4

docs: rename examples folder to sample_reports
1 parent 94e48ce

File tree

4 files changed: +277 -1 lines changed

4 files changed

+277
-1
lines changed
Lines changed: 241 additions & 0 deletions
@@ -0,0 +1,241 @@
```yaml
# =============================================================================
# Zero-Shot Evaluation Configuration
# =============================================================================
# This configuration file defines all settings for the zero-shot evaluation
# pipeline, including task definition, query generation, target endpoints,
# judge endpoint, evaluation parameters, and output settings.
#
# Environment variables can be referenced using ${VAR_NAME} syntax.
# =============================================================================

# =============================================================================
# Task Configuration
# =============================================================================
# Defines the task that the target models/agents will be evaluated on.

task:
  # [Required] A clear description of what the task is about.
  # This helps the query generator create relevant test queries.
  description: "English to Chinese translation assistant, helping users translate various types of English content into fluent and accurate Chinese"

  # [Optional] The usage scenario or context for this task.
  # Provides additional context for query generation.
  scenario: "Users need to translate English articles, documents, or text into Chinese"

# =============================================================================
# Query Generation Configuration
# =============================================================================
# Settings for automatic test query generation.

query_generation:
  # ---------------------------------------------------------------------------
  # Basic Settings
  # ---------------------------------------------------------------------------

  # [Optional, default=20] Total number of queries to generate.
  num_queries: 20

  # [Optional] Seed queries to guide the generation style and format.
  # These examples help the generator understand what kind of queries to create.
  seed_queries:
    - "Please translate the following paragraph into Chinese: 'The rapid advancement of artificial intelligence has transformed numerous industries.'"
    - "Translate this sentence to Chinese: 'Climate change poses significant challenges to global food security.'"

  # [Optional] Query categories with weights for stratified generation.
  # Each category can have a name, description, and weight.
  # If not specified, queries are generated without category constraints.
  # categories:
  #   - name: "technical"
  #     description: "Technical documents and papers"
  #     weight: 0.3
  #   - name: "literary"
  #     description: "Literary and creative content"
  #     weight: 0.3
  #   - name: "business"
  #     description: "Business and formal documents"
  #     weight: 0.4

  # ---------------------------------------------------------------------------
  # Custom Endpoint (Optional)
  # ---------------------------------------------------------------------------
  # If not specified, uses judge_endpoint for query generation.

  # endpoint:
  #   base_url: "https://api.openai.com/v1"
  #   api_key: "${OPENAI_API_KEY}"
  #   model: "gpt-4o"
  #   system_prompt: null  # Optional system prompt for query generation
  #   extra_params:        # Optional extra parameters
  #     temperature: 0.9

  # ---------------------------------------------------------------------------
  # Generation Control
  # ---------------------------------------------------------------------------

  # [Optional, default=10, range=1-50] Number of queries generated per API call.
  # Higher values are more efficient but may reduce diversity.
  queries_per_call: 10

  # [Optional, default=3, min=1] Number of parallel batches for generation.
  # Increases throughput but uses more API quota concurrently.
  num_parallel_batches: 3

  # [Optional, default=0.9, range=0.0-2.0] Sampling temperature.
  # Higher values increase diversity but may reduce quality.
  temperature: 0.9

  # [Optional, default=0.95, range=0.0-1.0] Top-p (nucleus) sampling.
  # Controls the cumulative probability threshold for token selection.
  top_p: 0.95

  # ---------------------------------------------------------------------------
  # Deduplication
  # ---------------------------------------------------------------------------

  # [Optional, default=0.85, range=0.0-1.0] Maximum similarity threshold.
  # Queries with similarity above this threshold are considered duplicates.
  # Lower values enforce stricter deduplication.
  max_similarity: 0.85

  # ---------------------------------------------------------------------------
  # Evol-Instruct Complexity Evolution
  # ---------------------------------------------------------------------------
  # Evol-Instruct progressively increases query complexity through
  # multiple evolution rounds.

  # [Optional, default=false] Enable complexity evolution.
  enable_evolution: false

  # [Optional, default=1, range=0-3] Number of evolution rounds.
  # Each round increases the complexity of queries.
  evolution_rounds: 1

  # [Optional] Complexity evolution strategies to apply.
  # Available strategies:
  # - "constraints": Add constraints and requirements
  # - "reasoning": Require multi-step reasoning
  # - "edge_cases": Include edge cases and corner scenarios
  # - "specificity": Make queries more specific and detailed
  # - "multi_step": Require multiple steps to complete
  complexity_levels:
    - "constraints"
    - "reasoning"
    - "edge_cases"

# =============================================================================
# Target Endpoints
# =============================================================================
# Define the models or agents to be evaluated. Each endpoint is identified
# by a unique name and configured with connection details.

target_endpoints:
  # Example: GPT-4 as baseline
  gpt4_baseline:
    # [Required] API base URL (OpenAI-compatible format)
    base_url: "https://api.openai.com/v1"

    # [Required] API key (supports ${ENV_VAR} format for security)
    api_key: "${OPENAI_API_KEY}"

    # [Required] Model name/identifier
    model: "gpt-4"

    # [Optional] System prompt to set the model's behavior
    system_prompt: "You are a professional English-Chinese translator. Provide accurate and fluent translations."

    # [Optional] Extra parameters passed to the API request
    extra_params:
      temperature: 0.7
      max_tokens: 2048

  # Example: Qwen model as candidate
  qwen_candidate:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen-max"
    system_prompt: "You are a professional English-Chinese translator. Provide accurate and fluent translations."
    extra_params:
      temperature: 0.7
      max_tokens: 2048

# =============================================================================
# Judge Endpoint
# =============================================================================
# The judge model evaluates and compares responses from target endpoints.
# It should be a capable model that can assess quality objectively.

judge_endpoint:
  # [Required] API base URL
  base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"

  # [Required] API key
  api_key: "${DASHSCOPE_API_KEY}"

  # [Required] Model name (recommend using a strong model for judging)
  model: "qwen-max"

  # [Optional] System prompt for the judge
  # If not specified, a default judging prompt will be used.
  system_prompt: null

  # [Optional] Extra parameters for the judge model
  # Lower temperature is recommended for more consistent judgments.
  extra_params:
    temperature: 0.1
    max_tokens: 4096

# =============================================================================
# Evaluation Configuration
# =============================================================================
# Settings that control the evaluation process.

evaluation:
  # [Optional, default=10] Maximum number of concurrent API requests.
  # Higher values increase throughput but may hit rate limits.
  max_concurrency: 10

  # [Optional, default=60] Request timeout in seconds.
  # Increase for complex tasks or slow endpoints.
  timeout: 60

  # [Optional, default=3] Number of retry attempts for failed requests.
  retry_times: 3

# =============================================================================
# Output Configuration
# =============================================================================
# Settings for saving evaluation results and intermediate data.

output:
  # [Optional, default=true] Save generated queries to a JSON file.
  save_queries: true

  # [Optional, default=true] Save all model responses to a JSON file.
  save_responses: true

  # [Optional, default=true] Save detailed evaluation results including
  # individual judgments and scores.
  save_details: true

  # [Optional, default="./evaluation_results"] Directory for output files.
  # Supports relative and absolute paths.
  output_dir: "./evaluation_results"

# =============================================================================
# Report Configuration
# =============================================================================
# Settings for generating evaluation reports. When enabled, a comprehensive
# Markdown report explaining the rankings with concrete examples is generated.

report:
  # [Optional, default=false] Enable report generation.
  # When true, generates a detailed Markdown report after evaluation.
  enabled: true

  # [Optional, default="zh"] Report language.
  # Supported values: "zh" (Chinese), "en" (English)
  language: "zh"

  # [Optional, default=3, range=1-10] Number of examples per section.
  # Controls how many concrete examples are included in the report.
  include_examples: 3
```
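
The `${VAR_NAME}` references in this file are resolved from the environment when the config is loaded. As a minimal sketch of how that substitution might work (this is not the pipeline's actual loader; `load_config` and the file name `zero_shot_config.yaml` are illustrative), the references can be expanded before the YAML is parsed:

```python
import os
import re

import yaml  # PyYAML

# Matches ${VAR_NAME}-style references, e.g. ${OPENAI_API_KEY}.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def load_config(path: str) -> dict:
    """Read a YAML config, expanding ${VAR_NAME} references before parsing."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()

    def _resolve(match: re.Match) -> str:
        name = match.group(1)
        value = os.environ.get(name)
        if value is None:
            # Failing early beats sending the literal string "${OPENAI_API_KEY}"
            # to the API as a key.
            raise KeyError(f"environment variable {name} is not set")
        return value

    return yaml.safe_load(_ENV_REF.sub(_resolve, raw))

config = load_config("zero_shot_config.yaml")  # hypothetical file name
```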
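
The `max_similarity` setting implies a greedy filter: a newly generated query is kept only if it is no more similar than the threshold to every query kept so far. The config does not say which similarity measure the pipeline uses, so the sketch below substitutes `difflib.SequenceMatcher` purely for illustration (real implementations often use embedding cosine similarity):

```python
from difflib import SequenceMatcher

def dedup_queries(queries: list[str], max_similarity: float = 0.85) -> list[str]:
    """Greedily drop queries too similar to one already kept.

    SequenceMatcher is a stand-in metric; the pipeline's actual
    similarity measure is not specified in the config.
    """
    kept: list[str] = []
    for query in queries:
        if all(SequenceMatcher(None, query, k).ratio() <= max_similarity
               for k in kept):
            kept.append(query)
    return kept
```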
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
```yaml
# =============================================================================
# Minimal Configuration Example
# =============================================================================
# This is the minimum required configuration for zero-shot evaluation.
# Only required fields are specified; all other settings use defaults.
# =============================================================================

# Task description (required)
task:
  description: "Academic GPT assistant for research and writing tasks"

# Target endpoints to evaluate (required, at least one)
target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"

  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"

# Judge endpoint for evaluation (required)
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"

# All other settings use defaults:
# - query_generation.num_queries: 20
# - query_generation.temperature: 0.9
# - evaluation.max_concurrency: 10
# - evaluation.timeout: 60
# - output.output_dir: "./evaluation_results"
```
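
To make "required" concrete: the three blocks above must be present, and every endpoint (targets and judge alike) needs `base_url`, `api_key`, and `model`. The `validate` helper below is hypothetical; it only mirrors the comments in this file, not the pipeline's real schema checks:

```python
REQUIRED_ENDPOINT_KEYS = {"base_url", "api_key", "model"}

def validate(config: dict) -> None:
    """Hypothetical check mirroring the 'required' comments above."""
    if not config.get("task", {}).get("description"):
        raise ValueError("task.description is required")

    targets = config.get("target_endpoints") or {}
    if not targets:
        raise ValueError("at least one target endpoint is required")

    # The judge endpoint must satisfy the same connection fields as targets.
    endpoints = dict(targets)
    endpoints["judge_endpoint"] = config.get("judge_endpoint") or {}
    for name, endpoint in endpoints.items():
        missing = REQUIRED_ENDPOINT_KEYS - endpoint.keys()
        if missing:
            raise ValueError(f"{name} is missing: {sorted(missing)}")
```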

docs/applications/examples/oncology_translation_report.md renamed to docs/applications/sample_reports/oncology_translation_report.md

File renamed without changes.

docs/applications/zero_shot_evaluation.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -386,7 +386,7 @@ The generated report includes four sections, each generated in parallel:
 ```
 
 !!! tip "Complete Example Report"
-    View a real evaluation report example: [Oncology Medical Translation Evaluation Report](examples/oncology_translation_report.md)
+    View a real evaluation report example: [Oncology Medical Translation Evaluation Report](sample_reports/oncology_translation_report.md)
 
 This example demonstrates a complete report generated by Zero-Shot Evaluation, comparing three models (qwen-plus, qwen3-32b, qwen-turbo) on Chinese-to-English translation in the medical oncology domain.
````
