In short, evaluation converts anecdotes into measurable signals that improve quality.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update `categories.model_scores` in config.
see code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval)
### What you'll run end-to-end

#### 1) Evaluate models

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize per-category performance

- bar/heatmap plot of per-category accuracies
#### 3) Generate an updated config.yaml
- Rank models per category into `categories.model_scores`
- Set `default_model` to the best average performer

## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models
- From the repo root: matplotlib in [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/requirements.txt)
- From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)
```bash
# We will work in this directory throughout this guide
cd /src/training/model_eval
pip install -r requirements.txt
```
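
To confirm the endpoint is reachable before kicking off an evaluation, you can list the models it serves; the address below is an assumption, so substitute whatever host/port your vLLM-compatible server actually listens on.

```bash
# Assumed address; replace with your own endpoint.
curl http://localhost:8000/v1/models
```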
## 2. Evaluate on MMLU-Pro
see script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py)
### Example usage patterns
```bash
# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --samples-per-category 10

python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10
# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10 \
  --use-cot
# If you have set up Semantic Router properly, you can run in one go
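# (The exact command depends on your setup; the endpoint below is an assumed
# example pointing at the OpenAI-compatible address the router exposes, and
# the model list should match whatever it serves.)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B phi4 \
  --samples-per-category 10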
```

### What it outputs per model

- **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- **Model naming**: slashes are replaced with underscores for folder names; e.g., gemma3:27b -> gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.

## 3. Evaluate on ARC Challenge (optional, overall sanity check)
see script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py)
### Example usage patterns
```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --output-dir arc_results
```
### Key flags

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the **MMLU-Pro** script
### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with `is_correct` and category
  - **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- ARC results do not feed `categories[].model_scores` directly, but they can help spot regressions.
## 4. Visualize per-category performance
see script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py)
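
A minimal invocation might look like the sketch below; the flag names (`--results-dir`, `--output-file`) are assumptions for illustration, so check the script's help output for its actual options.

```bash
# Hypothetical flags: point the script at the MMLU-Pro results directory
# and choose where to write the bar/heatmap plot.
python plot_category_accuracies.py \
  --results-dir results \
  --output-file category_accuracies.png
```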

## 5. Generate an updated config.yaml

- **categories**: for each category present in the results, ranks models by accuracy
- **default_model**: the best average performer across categories
- **category reasoning settings**: auto-filled from a built-in mapping (you can adjust after generation)
  - math, physics, chemistry, CS, engineering -> high reasoning
  - others default -> low/medium
- Leaves out any special “auto” placeholder models if present
### Schema alignment

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (`semantic_cache`, `tools`, `classifier`, `prompt_guard`) with reasonable defaults; you can edit them post-generation if your environment differs
**Note**

- This script only works with results from the **MMLU-Pro** evaluation.
- Existing config.yaml can be overwritten. Consider writing to a temp file first and diffing:
  - `--output-file config/config.eval.yaml`
- If your production config.yaml carries **environment-specific settings (endpoints, pricing, policies)**, port the evaluated `categories[].model_scores` and `default_model` back into your canonical config.
### Example config.eval.yaml
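
The exact keys are best taken from the generator's own output; the snippet below is only an illustrative sketch of the shape described above, with placeholder model names and scores, and assumed field names (`model`, `score`, `use_reasoning`).

```yaml
# Illustrative sketch only; all values and field names are placeholders.
default_model: phi4
categories:
  - name: math
    use_reasoning: true        # assumed auto-filled "high reasoning" category
    model_scores:
      - model: phi4
        score: 0.82
      - model: qwen3-0.6B
        score: 0.55
  - name: history
    use_reasoning: false       # assumed low/medium reasoning category
    model_scores:
      - model: qwen3-0.6B
        score: 0.61
      - model: phi4
        score: 0.58
```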
see more about config at [configuration](https://vllm-semantic-router.com/docs/getting-started/configuration)