
Commit 2e78bc2 (parent 4a6c433)
docs: fix markdownlint
Signed-off-by: JaredforReal <[email protected]>

1 file changed: 217 additions, 0 deletions
# Model Performance Evaluation

## Why evaluate?

Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

- Select the right model for each category and rank models into `categories.model_scores`
- Pick a sensible `default_model` based on overall performance
- Decide when CoT prompting is worth the latency/cost tradeoff
- Catch regressions when models, prompts, or parameters change
- Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve the quality, cost efficiency, and reliability of the router.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update `categories.model_scores` in the config.

See the code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval).

### What you'll run end-to-end

#### 1) Evaluate models:

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize results

- Bar/heatmap plot of per-category accuracies

**TODO:** a figure is needed here.

#### 3) Generate an updated config.yaml:

- Rank models per category into `categories.model_scores`
- Set `default_model` to the best average performer
- Keep or apply category-level reasoning settings
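
Here are the three steps above condensed into commands: a sketch run from the repo root, where model names, sample counts, and output paths are placeholders drawn from the examples later in this guide; adjust them to your deployment.

```bash
# 0) Install evaluation dependencies (see Prerequisites below)
pip install -r src/training/model_eval/requirements.txt

# 1) Evaluate on MMLU-Pro (add --use-cot for the CoT variant)
python src/training/model_eval/mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 \
  --samples-per-category 10

# 2) Plot per-category accuracies from results/
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png

# 3) Generate a performance-based config to review before adopting it
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml
```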
## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models
  - Endpoint URL like `http://localhost:8000/v1`
  - Optional API key if your endpoint requires one
- Python packages for evaluation scripts:
  - From the repo root: matplotlib
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

```bash
cd /src/training/model_eval
pip install -r requirements.txt
```

**Optional tip:**

- Ensure your `config/config.yaml` includes your deployed model names under `vllm_endpoints[].models` and any pricing/policy under `model_config` if you plan to use the generated config directly.
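
Before kicking off a long run, it can help to confirm the endpoint is reachable and see the exact model names it serves (the eval scripts query the same /models route when --models is omitted). A minimal check; the API key header is only needed if your endpoint enforces one, and the environment variable name here is just an example:

```bash
# List the models served by the vLLM-compatible endpoint
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer ${VLLM_API_KEY:-dummy}" | python -m json.tool
```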
## 2. Evaluate on MMLU-Pro

See the script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py).

### Example usage patterns:

```bash
# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 mistral-small3.1 \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 mistral-small3.1 \
  --samples-per-category 10 \
  --use-cot
```

### Key flags:

- **--endpoint**: vLLM OpenAI URL (default `http://localhost:8000/v1`)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries `/models` from the endpoint
- **--categories**: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
- **--samples-per-category**: limit questions per category (useful for quick runs)
- **--use-cot**: enables the Chain-of-Thought prompting variant; results are saved under a separate subfolder suffix (`_cot` vs `_direct`)
- **--concurrent-requests**: concurrency for throughput
- **--output-dir**: where results are saved (default `results`)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs

A quick spot-check run that combines several of these flags is sketched below.
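
The run below is illustrative: the category names are placeholders that must match MMLU-Pro's category strings, and the exact argument format for --categories can be confirmed with --help.

```bash
# Quick, reproducible spot-check: restricted categories, comma-separated models,
# fixed seed, modest concurrency
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b,phi4 \
  --categories math physics \
  --samples-per-category 5 \
  --concurrent-requests 4 \
  --seed 42 \
  --output-dir results
```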
### What it outputs per model:

- **results/<model_name>_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts
  - **summary.json**: condensed metrics
  - **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)
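
To spot-check a single run without plotting, you can pretty-print its analysis.json directly. The folder name below assumes a gemma3:27b direct run (see the naming note in the next section); `jq` is optional:

```bash
# Overall and per-category accuracy for one run
python -m json.tool "results/gemma3:27b_direct/analysis.json"

# With jq installed, just the per-category map
jq '.category_accuracy' "results/gemma3:27b_direct/analysis.json"
```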
### Notes:

- Model naming: slashes are replaced with underscores for folder names; e.g., gemma3:27b maps to the gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.
## 3. Evaluate on ARC Challenge (optional, overall sanity check)

See the script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py).

### Example usage patterns:

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b,phi4:latest
```

### Key flags:

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the MMLU-Pro script

### What it outputs per model:

- **results/<model_name>_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
  - **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

### Note:

ARC results do not feed categories[].model_scores directly, but they can help spot regressions.
## 4. Visualize per-category performance

See the script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py).

### Example usage patterns:

```bash
# Use results/ to generate a bar plot
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file model_eval/category_accuracies.png

# Use results/ to generate a heatmap plot
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png

# Use sample data to generate an example plot
python src/training/model_eval/plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png
```

### Key flags:

- **--results-dir**: where the analysis.json files are
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does:

- Finds all results/**/analysis.json and aggregates analysis["category_accuracy"] per model
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance

### Note:

- It treats “direct” and “cot” runs as distinct model variants by appending :direct or :cot to the label; the legend hides “:direct” for brevity.
## 5. Generate performance-based routing config

See the script in [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py).

### Example usage patterns:

```bash
# Use results/ to generate a new config file (does not overwrite config/config.yaml)
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

# Adjust the similarity threshold
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml \
  --similarity-threshold 0.85

# Generate from a specific folder
python src/training/model_eval/result_to_config.py \
  --results-dir results/mmlu_run_2025_09_10 \
  --output-file config/config.eval.yaml
```

### Key flags:

- **--results-dir**: points to the folder where the analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does:

- Reads all analysis.json files, extracting analysis["category_accuracy"]
- Constructs a new config:
  - default_model: the best average performer across categories
  - categories: for each category present in the results, ranks models by accuracy:
    - category.model_scores = [{model: "<name>", score: <float>}, ...], highest first
    - category reasoning settings: auto-filled from a built-in mapping (math, physics, chemistry, CS, engineering -> high reasoning; others default to low/medium; you can adjust after generation)
- Leaves out any special “auto” placeholder models if present

### Schema alignment:

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs

### Note:

- An existing config.yaml can be overwritten. Consider writing to a separate file first and diffing (see the example below), e.g. --output-file config/config.eval.yaml
- If your production config.yaml carries environment-specific settings (endpoints, pricing, policies), port the evaluated categories[].model_scores and default_model back into your canonical config.
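
For example, a safe review loop using the filenames above:

```bash
# Write the generated config alongside the canonical one, then inspect the diff
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

diff -u config/config.yaml config/config.eval.yaml
```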
