Commit f90fbb6

docs: Model Performance Evaluation Guide (#136)
1 parent 581c401 commit f90fbb6

5 files changed: +328 -0 lines changed

Lines changed: 323 additions & 0 deletions
@@ -0,0 +1,323 @@

# Model Performance Evaluation

## Why evaluate?

Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

- Select the right model for each category and rank models into `categories.model_scores`
- Pick a sensible `default_model` based on overall performance
- Decide when CoT prompting is worth the latency/cost tradeoff
- Catch regressions when models, prompts, or parameters change
- Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve the quality, cost efficiency, and reliability of the router.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update `categories.model_scores` in the config.

See the code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval)

### What you'll run end-to-end

#### 1) Evaluate models

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize results

- bar/heatmap plot of per-category accuracies

![Bar](/img/bar.png)
![Heatmap](/img/heatmap.png)

#### 3) Generate an updated config.yaml

- Rank models per category into `categories.model_scores`
- Set `default_model` to the best average performer
- Keep or apply category-level reasoning settings

## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models
  - Endpoint URL like http://localhost:8000/v1
  - Optional API key if your endpoint requires one

```bash
# Terminal 1
vllm serve microsoft/phi-4 --port 11434 --served_model_name phi4

# Terminal 2
vllm serve Qwen/Qwen3-0.6B --port 11435 --served_model_name qwen3-0.6B
```
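
Before running anything, it helps to confirm each endpoint is reachable and serving the expected model name; the names returned here are what you pass to `--models` later. A minimal check, assuming the standard OpenAI-compatible `/v1/models` route and no API key:

```bash
# List the models each endpoint serves
curl -s http://localhost:11434/v1/models | python -m json.tool
curl -s http://localhost:11435/v1/models | python -m json.tool
```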

- Python packages for the evaluation scripts:
  - From the repo root: matplotlib in [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/requirements.txt)
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

```bash
# We will work in this directory throughout this guide
cd src/training/model_eval
pip install -r requirements.txt
```

**Optional tip:**

- Ensure your `config/config.yaml` includes your deployed model names under `vllm_endpoints[].models` and any pricing/policy under `model_config` if you plan to use the generated config directly.
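
A quick way to eyeball this (run from the repo root) is to grep the current config for those sections; this only prints what is already configured and is purely a convenience check:

```bash
# Show which endpoints/models and model_config entries the current config references
grep -n -A 5 "vllm_endpoints" config/config.yaml
grep -n -A 5 "model_config" config/config.yaml
```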

## 2. Evaluate on MMLU-Pro

See the script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py)

### Example usage patterns

```bash
# Evaluate a few models, a few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --samples-per-category 10

python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10 \
  --use-cot

# If you have set up Semantic Router properly, you can run both models in one go
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --samples-per-category 10
  # --use-cot  # Uncomment this line to use CoT
```

### Key flags

- **--endpoint**: vLLM OpenAI URL (default http://localhost:8000/v1)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
- **--categories**: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
- **--samples-per-category**: limit questions per category (useful for quick runs)
- **--use-cot**: enables Chain-of-Thought prompting variant; results are saved in a separate subfolder suffix (_cot vs _direct)
- **--concurrent-requests**: concurrency for throughput
- **--output-dir**: where results are saved (default results)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs
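
For example, these flags can be combined for a quick, reproducible pass over a single category. This is a sketch: `law` is assumed to be a valid MMLU-Pro category label, and the exact flag syntax can be confirmed with `python mmlu_pro_vllm_eval.py --help`:

```bash
# Quick, reproducible single-category run with higher concurrency
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --categories law \
  --samples-per-category 10 \
  --concurrent-requests 4 \
  --seed 42 \
  --output-dir results
```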

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts
  - **summary.json**: condensed metrics
  - **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- **Model naming**: slashes are replaced with underscores for folder names; e.g., gemma3:27b -> gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.
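
To spot-check a finished run without plotting anything, you can read analysis.json directly. A minimal sketch; the folder name below is an example and should be adjusted to your model and direct/cot variant:

```bash
# Print overall and per-category accuracy for one run
python -c '
import json
with open("results/qwen3-0.6B_direct/analysis.json") as f:
    analysis = json.load(f)
print("overall_accuracy:", analysis["overall_accuracy"])
for category, accuracy in sorted(analysis["category_accuracy"].items()):
    print(category, accuracy)
'
```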

## 3. Evaluate on ARC Challenge (optional, overall sanity check)

See the script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py)

### Example usage patterns

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --output-dir arc_results
```

### Key flags

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the **MMLU-Pro** script

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
  - **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- ARC results do not feed `categories[].model_scores` directly, but they can help spot regressions.
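
A quick way to compare ARC numbers across models is to loop over the analysis.json files in the output directory. This assumes the arc_results layout from the example above:

```bash
# Print overall ARC accuracy for every evaluated model
for f in arc_results/*/analysis.json; do
  python -c "import json, sys; d = json.load(open(sys.argv[1])); print(sys.argv[1], d['overall_accuracy'])" "$f"
done
```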

## 4. Visualize per-category performance

See the script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py)

### Example usage patterns

```bash
# Use results/ to generate a bar plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file results/bar.png

# Use results/ to generate a heatmap plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file results/heatmap.png

# Use sample data to generate an example plot
python plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file results/category_accuracies.png
```

### Key flags

- **--results-dir**: where the analysis.json files are
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does

- Finds all `results/**/analysis.json` and aggregates analysis["category_accuracy"] per model
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance

**Note**

- It treats `direct` and `cot` as distinct model variants by appending `:direct` or `:cot` to the label; the legend hides `:direct` for brevity.

## 5. Generate performance-based routing config

See the script in [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py)

### Example usage patterns

```bash
# Use results/ to generate a new config file (so your existing config.yaml is not overwritten)
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

# Set a custom similarity threshold
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml \
  --similarity-threshold 0.85

# Generate from a specific results folder
python src/training/model_eval/result_to_config.py \
  --results-dir results/mmlu_run_2025_09_10 \
  --output-file config/config.eval.yaml
```

### Key flags

- **--results-dir**: points to the folder where the analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does

- Reads all `analysis.json` files, extracting analysis["category_accuracy"]
- Constructs a new config:
  - **categories**: For each category present in results, ranks models by accuracy:
    - **category.model_scores** = `[{ model: "Model_Name", score: 0.87 }, ...]`, highest first
  - **default_model**: the best average performer across categories
  - **category reasoning settings**: auto-filled from a built-in mapping (you can adjust after generation)
    - math, physics, chemistry, CS, engineering -> high reasoning
    - others default -> low/medium
- Leaves out any special “auto” placeholder models if present
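
To make the ranking concrete, here is a rough sketch of the idea (not the script's actual code): collect category_accuracy from every analysis.json, then sort the models within each category from highest to lowest accuracy:

```bash
python -c '
import glob, json, os
from collections import defaultdict

scores = defaultdict(dict)  # category -> {model_variant: accuracy}
for path in glob.glob("results/**/analysis.json", recursive=True):
    model = os.path.basename(os.path.dirname(path))  # e.g. phi4_direct
    with open(path) as f:
        category_accuracy = json.load(f).get("category_accuracy", {})
    for category, accuracy in category_accuracy.items():
        scores[category][model] = accuracy

for category, per_model in sorted(scores.items()):
    ranked = sorted(per_model.items(), key=lambda kv: kv[1], reverse=True)
    print(category, ranked)
'
```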

### Schema alignment

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs

**Note**

- This script only works with results from the **MMLU-Pro** evaluation.
- An existing config.yaml can be overwritten. Consider writing to a temp file first and diffing:
  - `--output-file config/config.eval.yaml`
- If your production config.yaml carries **environment-specific settings (endpoints, pricing, policies)**, port the evaluated `categories[].model_scores` and `default_model` back into your canonical config.
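
Since the script writes a full config file, a safe workflow is to generate to a side file and review the differences before porting anything into your canonical config:

```bash
# Generate to a side file, then inspect the diff against the canonical config
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml
diff -u config/config.yaml config/config.eval.yaml | less
```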

### Example config.eval.yaml

See more about the config at [configuration](https://vllm-semantic-router.com/docs/getting-started/configuration)

```yaml
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
semantic_cache:
  enabled: true
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
tools:
  enabled: true
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: config/tools_db.json
  fallback_to_empty: true
prompt_guard:
  enabled: true
  use_modernbert: true
  model_id: models/jailbreak_classifier_modernbert-base_model
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json

# Endpoint config (vllm_endpoints) and model_config are not included here; add them for your environment as needed

classifier:
  category_model:
    model_id: models/category_classifier_modernbert-base_model
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: models/category_classifier_modernbert-base_model/category_mapping.json
  pii_model:
    model_id: models/pii_classifier_modernbert-base_presidio_token_model
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json
categories:
  - name: business
    use_reasoning: false
    reasoning_description: Business content is typically conversational
    reasoning_effort: low
    model_scores:
      - model: phi4
        score: 0.2
      - model: qwen3-0.6B
        score: 0.0
  - name: law
    use_reasoning: false
    reasoning_description: Legal content is typically explanatory
    reasoning_effort: medium
    model_scores:
      - model: phi4
        score: 0.8
      - model: qwen3-0.6B
        score: 0.2

  # Some categories are omitted here

  - name: engineering
    use_reasoning: true
    reasoning_description: Engineering problems require systematic problem-solving
    reasoning_effort: high
    model_scores:
      - model: phi4
        score: 0.6
      - model: qwen3-0.6B
        score: 0.2
default_reasoning_effort: medium
default_model: phi4
```
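
As a final check, you can confirm the generated file parses as valid YAML and see which default model it picked. A small sketch, assuming PyYAML is available in your environment:

```bash
# Validate the generated config and show the chosen default_model
python -c '
import yaml  # assumes PyYAML is installed
with open("config/config.eval.yaml") as f:
    cfg = yaml.safe_load(f)
print("default_model:", cfg.get("default_model"))
print("categories:", [c["name"] for c in cfg.get("categories", [])])
'
```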

website/docs/training/training-overview.md

Lines changed: 4 additions & 0 deletions
@@ -998,3 +998,7 @@ lora_training_infrastructure:
   lora_training: "$5-20 per model (reduced compute)"
   savings: "80-90% cost reduction"
 ```
+
+## Next
+
+- See: [Model Performance Evaluation](/docs/training/model-performance-eval)

website/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ const sidebars = {
       label: 'Model Training',
       items: [
         'training/training-overview',
+        'training/model-performance-eval',
       ],
     },
     {

website/static/img/bar.png (50.5 KB)

website/static/img/heatmap.png (62.7 KB)