
Commit f4d7de1

docs: add pngs and examples in doc & add doc to sidebar
Signed-off-by: JaredforReal <[email protected]>
1 parent 468fb15 commit f4d7de1

File tree

4 files changed: +151 -45 lines

website/docs/training/model_performance_eval.md renamed to website/docs/training/model-performance-eval.md

Lines changed: 150 additions & 44 deletions
@@ -12,12 +12,12 @@ In short, evaluation converts anecdotes into measurable signals that improve qua

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update `categories.model_scores` in config.

See code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval)

### What you'll run end-to-end
#### 1) Evaluate models

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy
@@ -26,8 +26,10 @@ see code in [/src/training/model_eval](https://github.com/vllm-project/semantic-

- bar/heatmap plot of per-category accuracies

![Bar](/img/bar.png)
![Heatmap](/img/heatmap.png)

#### 3) Generate an updated config.yaml

- Rank models per category into `categories.model_scores`
- Set default_model to the best average performer
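
At a glance, the three steps chain together roughly as in the sketch below; every command is explained in the sections that follow, and the endpoint URL, model name, and output path are placeholders for your own setup, not fixed values.

```bash
# Rough end-to-end pass, run from src/training/model_eval (see Prerequisites):
python mmlu_pro_vllm_eval.py --endpoint http://localhost:8000/v1 --models phi4 --samples-per-category 10
python plot_category_accuracies.py --results-dir results --plot-type heatmap --output-file results/heatmap.png
python result_to_config.py --results-dir results --output-file config.eval.yaml
```
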
@@ -38,11 +40,21 @@ see code in [/src/training/model_eval](https://github.com/vllm-project/semantic-
- A running vLLM-compatible OpenAI endpoint serving your models
  - Endpoint URL like http://localhost:8000/v1
  - Optional API key if your endpoint requires one

```bash
# Terminal 1
vllm serve microsoft/phi-4 --port 11434 --served_model_name phi4

# Terminal 2
vllm serve Qwen/Qwen3-0.6B --port 11435 --served_model_name qwen3-0.6B
```
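
Before running any evaluation, you can optionally confirm that each endpoint is reachable and reports the expected served model name. vLLM's OpenAI-compatible server exposes the standard `GET /v1/models` route, so a plain curl is enough (ports below match the example servers above):

```bash
# Each call should return a model list containing "phi4" or "qwen3-0.6B" respectively
curl http://localhost:11434/v1/models
curl http://localhost:11435/v1/models
```
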
- Python packages for evaluation scripts:
  - From the repo root: matplotlib in [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/requirements.txt)
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

```bash
# We will work in this directory for the rest of this guide
cd /src/training/model_eval
pip install -r requirements.txt
```
@@ -54,24 +66,36 @@ see code in [/src/training/model_eval](https://github.com/vllm-project/semantic-
## 2. Evaluate on MMLU-Pro
See script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py)

### Example usage patterns

```bash
# Evaluate models with a few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --samples-per-category 10

python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10 \
  --use-cot

# If you have set up Semantic Router properly, you can evaluate both models in one go
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --samples-per-category 10
  # --use-cot  # Uncomment this line to use CoT
```

### Key flags

- **--endpoint**: vLLM OpenAI URL (default http://localhost:8000/v1)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
@@ -82,45 +106,47 @@ python mmlu_pro_vllm_eval.py \
- **--output-dir**: where results are saved (default results)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs
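
For example, a run meant to be reproducible might pin these knobs explicitly; the values below are purely illustrative, not defaults taken from the script:

```bash
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --samples-per-category 10 \
  --max-tokens 512 \
  --temperature 0.0 \
  --seed 42
```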

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts
  - **summary.json**: condensed metrics
  - **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)
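
To spot-check a finished run, you can pretty-print the analysis file directly; the path below assumes a model named phi4 evaluated in direct mode, so adjust the folder name to match your own results directory:

```bash
# Shows overall_accuracy and the per-category accuracy map for one run
python -m json.tool results/phi4_direct/analysis.json
```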

**Note**

- **Model naming**: slashes are replaced with underscores for folder names; e.g., gemma3:27b -> gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.

## 3. Evaluate on ARC Challenge (optional, overall sanity check)
See script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py)

### Example usage patterns

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --output-dir arc_results
```

### Key flags

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the **MMLU-Pro** script

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
  - **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- ARC results do not feed `categories[].model_scores` directly, but they can help spot regressions.

## 4. Visualize per-category performance
See script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py)
@@ -129,45 +155,45 @@ see script in [plot_category_accuracies.py](https://github.com/vllm-project/sema

```bash
# Use results/ to generate bar plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file results/bar.png

# Use results/ to generate heatmap plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file results/heatmap.png

# Use sample data to generate an example plot
python plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file results/category_accuracies.png
```

### Key flags

- **--results-dir**: where analysis.json files are
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does

- Finds all `results/**/analysis.json`, aggregates `analysis["category_accuracy"]` per model
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance

**Note**

- It merges `direct` and `cot` as distinct model variants by appending `:direct` or `:cot` to the label; the legend hides `:direct` for brevity.

## 5. Generate performance-based routing config
See script in [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py)

### Example usage patterns

```bash
# Use results/ to generate a new config file (does not overwrite the existing config)
@@ -187,31 +213,111 @@ python src/training/model_eval/result_to_config.py \
  --output-file config/config.eval.yaml
```

### Key flags

- **--results-dir**: points to the folder where analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does

- Reads all `analysis.json` files, extracting `analysis["category_accuracy"]`
- Constructs a new config:
  - **categories**: For each category present in results, ranks models by accuracy:
    - **category.model_scores** = `[{ model: "Model_Name", score: 0.87 }, ...]`, highest first
  - **default_model**: the best average performer across categories
  - **category reasoning settings**: auto-filled from a built-in mapping (you can adjust after generation)
    - math, physics, chemistry, CS, engineering -> high reasoning
    - others default -> low/medium
- Leaves out any special “auto” placeholder models if present

### Schema alignment

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs

**Note**

- This script only works with results from the **MMLU-Pro** evaluation.
- The existing config.yaml can be overwritten. Consider writing to a temp file first and diffing:
  - `--output-file config/config.eval.yaml`
- If your production config.yaml carries **environment-specific settings (endpoints, pricing, policies)**, port the evaluated `categories[].model_scores` and `default_model` back into your canonical config.

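A minimal sketch of the write-then-diff workflow suggested above, assuming you run from the repo root and your evaluation results live under src/training/model_eval/results:

```bash
python src/training/model_eval/result_to_config.py \
  --results-dir src/training/model_eval/results \
  --output-file config/config.eval.yaml

# Review the ranking changes before porting model_scores / default_model back
diff -u config/config.yaml config/config.eval.yaml
```
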
### Example config.eval.yaml
See more about config at [configuration](https://vllm-semantic-router.com/docs/getting-started/configuration)

```yaml
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
semantic_cache:
  enabled: true
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
tools:
  enabled: true
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: config/tools_db.json
  fallback_to_empty: true
prompt_guard:
  enabled: true
  use_modernbert: true
  model_id: models/jailbreak_classifier_modernbert-base_model
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json

# Endpoint and model_config sections are omitted here; add them as needed for your environment

classifier:
  category_model:
    model_id: models/category_classifier_modernbert-base_model
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: models/category_classifier_modernbert-base_model/category_mapping.json
  pii_model:
    model_id: models/pii_classifier_modernbert-base_presidio_token_model
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json
categories:
  - name: business
    use_reasoning: false
    reasoning_description: Business content is typically conversational
    reasoning_effort: low
    model_scores:
      - model: phi4
        score: 0.2
      - model: qwen3-0.6B
        score: 0.0
  - name: law
    use_reasoning: false
    reasoning_description: Legal content is typically explanatory
    reasoning_effort: medium
    model_scores:
      - model: phi4
        score: 0.8
      - model: qwen3-0.6B
        score: 0.2

  # Some categories are omitted here

  - name: engineering
    use_reasoning: true
    reasoning_description: Engineering problems require systematic problem-solving
    reasoning_effort: high
    model_scores:
      - model: phi4
        score: 0.6
      - model: qwen3-0.6B
        score: 0.2
default_reasoning_effort: medium
default_model: phi4
```

website/sidebars.js

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ const sidebars = {
      label: 'Model Training',
      items: [
        'training/training-overview',
        'training/model-performance-eval',
      ],
    },
    {

website/static/img/bar.png (50.5 KB)

website/static/img/heatmap.png (62.7 KB)
