# Model Performance Evaluation
## Why evaluate?
Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

- Select the right model for each category and rank them into `categories.model_scores`
- Pick a sensible `default_model` based on overall performance
- Decide when CoT prompting is worth the latency/cost tradeoff
- Catch regressions when models, prompts, or parameters change
- Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve the quality, cost efficiency, and reliability of the router.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update `categories.model_scores` in the config.

See the code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval)

### What you'll run end-to-end
#### 1) Evaluate models

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize results

- bar/heatmap plot of per-category accuracies

#### 3) Generate an updated config.yaml

- Rank models per category into `categories.model_scores`
- Set `default_model` to the best average performer
- Keep or apply category-level reasoning settings

## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models (a quick connectivity check follows this list)
  - Endpoint URL like http://localhost:8000/v1
  - Optional API key if your endpoint requires one

  ```bash
  # Terminal 1
  vllm serve microsoft/phi-4 --port 11434 --served_model_name phi4

  # Terminal 2
  vllm serve Qwen/Qwen3-0.6B --port 11435 --served_model_name qwen3-0.6B
  ```

- Python packages for the evaluation scripts:
  - From the repo root: matplotlib in [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/requirements.txt)
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

  ```bash
  # We will work from this directory in this guide
  cd src/training/model_eval
  pip install -r requirements.txt
  ```
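
Before running the evaluations, it helps to confirm that each endpoint is reachable and serving the expected model names. The snippet below is a minimal sketch (not part of the repo) that queries the OpenAI-compatible `/models` route, the same route the eval scripts fall back to when `--models` is omitted; it assumes the `requests` package is installed and that you started the two endpoints above.

```python
import requests

# Endpoints started in the terminals above; adjust to your deployment
endpoints = ["http://localhost:11434/v1", "http://localhost:11435/v1"]

for base in endpoints:
    # The OpenAI-compatible /models route lists the served model names
    resp = requests.get(f"{base}/models", timeout=10)
    resp.raise_for_status()
    served = [model["id"] for model in resp.json()["data"]]
    print(f"{base} -> {served}")
```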

**Optional tip:**

- Ensure your `config/config.yaml` includes your deployed model names under `vllm_endpoints[].models` and any pricing/policy under `model_config` if you plan to use the generated config directly.

## 2. Evaluate on MMLU-Pro
See the script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py)

### Example usage patterns

```bash
# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --samples-per-category 10

python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10 \
  --use-cot

# If you have set up Semantic Router properly, you can evaluate both models in one go
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --samples-per-category 10
  # Add --use-cot (with a trailing "\" on the line above) to enable CoT
```

### Key flags

- **--endpoint**: vLLM OpenAI URL (default http://localhost:8000/v1)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
- **--categories**: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
- **--samples-per-category**: limit questions per category (useful for quick runs)
- **--use-cot**: enables the Chain-of-Thought prompting variant; results are saved under a separate folder suffix (_cot vs _direct)
- **--concurrent-requests**: number of concurrent requests (raise for more throughput)
- **--output-dir**: where results are saved (default results)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts (see the loading sketch below)
  - **summary.json**: condensed metrics
- **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)
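
To inspect a single run programmatically, the sketch below loads one `analysis.json` and prints the overall and per-category numbers; it assumes the field names listed above and uses `results/phi4_direct/` as an example path (adjust it to your model and prompting mode).

```python
import json
from pathlib import Path

# Example results folder from the runs above; adjust to your model/mode
analysis_path = Path("results/phi4_direct/analysis.json")
analysis = json.loads(analysis_path.read_text())

print(f"overall_accuracy:  {analysis['overall_accuracy']:.3f}")
print(f"avg_response_time: {analysis['avg_response_time']:.2f}s")
for category, accuracy in sorted(analysis["category_accuracy"].items()):
    print(f"  {category:<20} {accuracy:.3f}")
```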

**Note**

- **Model naming**: slashes in model names are replaced with underscores, and the prompting mode is appended to the folder name; e.g., gemma3:27b evaluated with direct prompting goes to the gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.

## 3. Evaluate on ARC Challenge (optional, overall sanity check)
See the script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py)

### Example usage patterns

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --output-dir arc_results
```

### Key flags

- **--samples**: total questions to sample (default 20); ARC is not categorized in this script
- Other flags mirror the **MMLU-Pro** script

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
- **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- ARC results do not feed `categories[].model_scores` directly, but they can help spot regressions.

## 4. Visualize per-category performance
See the script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py)

### Example usage patterns

```bash
# Use results/ to generate a bar plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file results/bar.png

# Use results/ to generate a heatmap plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file results/heatmap.png

# Use --sample-data to generate an example plot without real results
python plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file results/category_accuracies.png
```

### Key flags

- **--results-dir**: directory containing the analysis.json files
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does

- Finds all `results/**/analysis.json` files and aggregates `analysis["category_accuracy"]` per model (a minimal version of this aggregation is sketched below)
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance
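
For a quick text-only view of the same aggregation (without the plot), the sketch below mirrors what the script does, assuming the `results/` layout described above; it is illustrative, not the script itself.

```python
import json
from pathlib import Path

# Collect per-category accuracy for every model/mode found under results/
table = {}
for analysis_file in sorted(Path("results").glob("*/analysis.json")):
    label = analysis_file.parent.name  # e.g. phi4_direct, qwen3-0.6B_cot
    accuracies = json.loads(analysis_file.read_text())["category_accuracy"]
    # "Overall" mirrors the average-across-categories column in the plot
    accuracies["Overall"] = sum(accuracies.values()) / len(accuracies)
    table[label] = accuracies

for label, accuracies in table.items():
    row = ", ".join(f"{cat}={acc:.2f}" for cat, acc in sorted(accuracies.items()))
    print(f"{label}: {row}")
```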

**Note**

- It treats `direct` and `cot` runs as distinct model variants by appending `:direct` or `:cot` to the label; the legend hides `:direct` for brevity.

## 5. Generate performance-based routing config
See the script in [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py)

### Example usage patterns

```bash
# Use results/ to generate a new config file (without overwriting config/config.yaml)
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

# Override the semantic-cache similarity threshold
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml \
  --similarity-threshold 0.85

# Generate from a specific results folder
python src/training/model_eval/result_to_config.py \
  --results-dir results/mmlu_run_2025_09_10 \
  --output-file config/config.eval.yaml
```

### Key flags

- **--results-dir**: points to the folder where the analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does

- Reads all `analysis.json` files, extracting `analysis["category_accuracy"]`
- Constructs a new config (the ranking logic is sketched below):
  - **categories**: for each category present in the results, ranks models by accuracy
    - **category.model_scores** = `[{ model: "Model_Name", score: 0.87 }, ...]`, highest first
  - **default_model**: the best average performer across categories
  - **category reasoning settings**: auto-filled from a built-in mapping (you can adjust after generation)
    - math, physics, chemistry, CS, engineering -> high reasoning
    - others default -> low/medium
  - Leaves out any special "auto" placeholder models if present
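
The ranking itself is straightforward. The sketch below is a rough illustration of that logic (not the script itself); the accuracy numbers are taken from the example config further down, and the dictionary would normally be loaded from the `analysis.json` files.

```python
# Per-category accuracies, as extracted from the analysis.json files
category_accuracy = {
    "phi4":       {"business": 0.2, "law": 0.8, "engineering": 0.6},
    "qwen3-0.6B": {"business": 0.0, "law": 0.2, "engineering": 0.2},
}

all_categories = sorted({c for accs in category_accuracy.values() for c in accs})

# categories[].model_scores: models ranked by accuracy, highest first
categories = []
for category in all_categories:
    ranked = sorted(
        ((m, accs[category]) for m, accs in category_accuracy.items() if category in accs),
        key=lambda pair: pair[1],
        reverse=True,
    )
    categories.append({
        "name": category,
        "model_scores": [{"model": m, "score": round(s, 4)} for m, s in ranked],
    })

# default_model: the best average performer across categories
default_model = max(
    category_accuracy,
    key=lambda m: sum(category_accuracy[m].values()) / len(category_accuracy[m]),
)

print(default_model)  # phi4
print(categories[0])  # ranking for the "business" category
```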

### Schema alignment

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs

**Note**

- This script only works with results from the **MMLU-Pro** evaluation.
- The existing config.yaml can be overwritten. Consider writing to a temp file first and diffing:
  - `--output-file config/config.eval.yaml`
- If your production config.yaml carries **environment-specific settings (endpoints, pricing, policies)**, port the evaluated `categories[].model_scores` and `default_model` back into your canonical config (see the sketch below).
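
A minimal sketch of that porting step, assuming PyYAML is installed and the file paths used in this guide; review the resulting diff before committing.

```python
import yaml  # PyYAML

# Load the generated config and your canonical, environment-specific config
with open("config/config.eval.yaml") as f:
    evaluated = yaml.safe_load(f)
with open("config/config.yaml") as f:
    canonical = yaml.safe_load(f)

# Port only the performance-derived fields; keep endpoints, pricing, policies as-is
canonical["categories"] = evaluated["categories"]
canonical["default_model"] = evaluated["default_model"]

with open("config/config.yaml", "w") as f:
    yaml.safe_dump(canonical, f, sort_keys=False)
```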

### Example config.eval.yaml
See more about the config in [configuration](https://vllm-semantic-router.com/docs/getting-started/configuration)

```yaml
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
semantic_cache:
  enabled: true
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
tools:
  enabled: true
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: config/tools_db.json
  fallback_to_empty: true
prompt_guard:
  enabled: true
  use_modernbert: true
  model_id: models/jailbreak_classifier_modernbert-base_model
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json

# The vllm_endpoints and model_config sections are omitted here; add them for your deployment as needed

classifier:
  category_model:
    model_id: models/category_classifier_modernbert-base_model
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: models/category_classifier_modernbert-base_model/category_mapping.json
  pii_model:
    model_id: models/pii_classifier_modernbert-base_presidio_token_model
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json
categories:
- name: business
  use_reasoning: false
  reasoning_description: Business content is typically conversational
  reasoning_effort: low
  model_scores:
  - model: phi4
    score: 0.2
  - model: qwen3-0.6B
    score: 0.0
- name: law
  use_reasoning: false
  reasoning_description: Legal content is typically explanatory
  reasoning_effort: medium
  model_scores:
  - model: phi4
    score: 0.8
  - model: qwen3-0.6B
    score: 0.2

# ... (additional categories omitted for brevity)

- name: engineering
  use_reasoning: true
  reasoning_description: Engineering problems require systematic problem-solving
  reasoning_effort: high
  model_scores:
  - model: phi4
    score: 0.6
  - model: qwen3-0.6B
    score: 0.2
default_reasoning_effort: medium
default_model: phi4
```