# Model Performance Evaluation
## Why evaluate?
Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

- Select the right model for each category and rank them into categories.model_scores
- Pick a sensible default_model based on overall performance
- Decide when CoT prompting is worth the latency/cost tradeoff
- Catch regressions when models, prompts, or parameters change
- Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve quality, cost efficiency, and reliability of the router.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update categories.model_scores in config.

see code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval)

### What you'll run end-to-end
#### 1) Evaluate models:

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize results

- bar/heatmap plot of per-category accuracies

**TODO**: add an example figure here.
#### 3) Generate an updated config.yaml:

- Rank models per category into categories.model_scores
- Set default_model to the best average performer
- Keep or apply category-level reasoning settings

## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models
  - Endpoint URL like http://localhost:8000/v1
  - Optional API key if your endpoint requires one
- Python packages for evaluation scripts:
  - From the repo root: matplotlib
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

  ```bash
  cd src/training/model_eval
  pip install -r requirements.txt
  ```

**Optional tip:**

- Ensure your `config/config.yaml` includes your deployed model names under `vllm_endpoints[].models` and any pricing/policy under `model_config` if you plan to use the generated config directly.
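
If you want to sanity-check that alignment before running an evaluation, you can compare what the endpoint serves against what the config declares. The snippet below is only a sketch: it assumes PyYAML is installed, the `vllm_endpoints[].models` layout mentioned above, and the standard OpenAI-compatible `/models` route; the endpoint URL and config path are placeholders for your setup.

```python
# Compare models served by the endpoint with config/config.yaml -- illustrative
# sketch; adjust the endpoint URL, config path, and auth to your deployment.
import json
import urllib.request

import yaml  # pip install pyyaml

ENDPOINT = "http://localhost:8000/v1"
CONFIG_PATH = "config/config.yaml"

# Models actually served (standard OpenAI-compatible /models route).
# Add an Authorization header here if your endpoint requires an API key.
with urllib.request.urlopen(f"{ENDPOINT}/models") as resp:
    served = {m["id"] for m in json.load(resp)["data"]}

# Models declared in the router config, using the vllm_endpoints[].models layout.
with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)
declared = {name for ep in cfg.get("vllm_endpoints", []) for name in ep.get("models", [])}

print("served but not in config:", sorted(served - declared))
print("in config but not served:", sorted(declared - served))
```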

## 2. Evaluate on MMLU-Pro
see script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py)

### Example usage patterns:

```bash
# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 mistral-small3.1 \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 mistral-small3.1 \
  --samples-per-category 10 \
  --use-cot
```

### Key flags:

- **--endpoint**: vLLM OpenAI URL (default http://localhost:8000/v1)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
- **--categories**: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
- **--samples-per-category**: limit questions per category (useful for quick runs)
- **--use-cot**: enables the Chain-of-Thought prompting variant; results are saved under a separate folder suffix (_cot vs _direct)
- **--concurrent-requests**: concurrency for throughput
- **--output-dir**: where results are saved (default results)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs

### What it outputs per model:

- **results/<model_name>_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts
  - **summary.json**: condensed metrics
- **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)
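
For a quick look at a single run without the plotting step, you can read analysis.json directly. A minimal sketch, assuming the field names listed above (`overall_accuracy`, `category_accuracy`, `avg_response_time`) and an example results folder name:

```python
# Inspect one model's MMLU-Pro results -- illustrative sketch.
import json

# Example path following the results/<model_name>_(direct|cot)/ layout above.
path = "results/gemma3:27b_direct/analysis.json"

with open(path) as f:
    analysis = json.load(f)

print(f"overall accuracy : {analysis['overall_accuracy']:.3f}")
print(f"avg response time: {analysis['avg_response_time']:.2f}s")

# Per-category accuracy, best to worst.
for category, acc in sorted(analysis["category_accuracy"].items(),
                            key=lambda kv: kv[1], reverse=True):
    print(f"  {category:<25} {acc:.3f}")
```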

### Notes:

- Model naming: slashes in model names are replaced with underscores, and the prompting approach is appended to the folder name; e.g., gemma3:27b -> gemma3:27b_direct.
- Category accuracy is computed on successful queries only; failed requests are excluded.

## 3. Evaluate on ARC Challenge (optional, overall sanity check)
see script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py)

### Example usage patterns:

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b,phi4:latest
```

### Key flags:

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the MMLU-Pro script

### What it outputs per model:

- **results/<model_name>_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
- **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

### Note:
ARC results do not feed categories[].model_scores directly, but they can help spot regressions.

## 4. Visualize per-category performance
see script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py)

### Example usage patterns:

```bash
# Use results/ to generate a bar plot
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file model_eval/category_accuracies.png

# Use results/ to generate a heatmap plot
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png

# Use --sample-data to generate an example plot
python src/training/model_eval/plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png
```

### Key flags:

- **--results-dir**: where analysis.json files are
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does:

- Finds all results/**/analysis.json, aggregates analysis["category_accuracy"] per model
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance
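
If you prefer numbers to a figure, the aggregation step is easy to reproduce on its own. The sketch below mirrors the behavior described above (collect category_accuracy per run, compute an Overall average); the real script's labels and edge-case handling may differ:

```python
# Text-only version of the aggregation -- simplified sketch of what
# plot_category_accuracies.py does before plotting.
import glob
import json
import os

rows = {}  # run label (e.g. gemma3:27b_direct) -> {category: accuracy}
for path in glob.glob("results/**/analysis.json", recursive=True):
    label = os.path.basename(os.path.dirname(path))
    with open(path) as f:
        per_category = json.load(f).get("category_accuracy", {})
    if per_category:
        rows[label] = per_category

for label, accs in sorted(rows.items()):
    overall = sum(accs.values()) / len(accs)  # the "Overall" column
    detail = "  ".join(f"{cat}={acc:.2f}" for cat, acc in sorted(accs.items()))
    print(f"{label:<35} overall={overall:.3f}  {detail}")
```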

### Note:

- It treats “direct” and “cot” runs as distinct model variants by appending :direct or :cot to the label; the legend hides “:direct” for brevity.

## 5. Generate performance-based routing config
see script in [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py)

### Example usage patterns:

```bash
# Use results/ to generate a new config file (does not overwrite config/config.yaml)
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

# Set a custom similarity threshold
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml \
  --similarity-threshold 0.85

# Generate from specific folder
python src/training/model_eval/result_to_config.py \
  --results-dir results/mmlu_run_2025_09_10 \
  --output-file config/config.eval.yaml
```

### Key flags:

- **--results-dir**: points to the folder where analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does:

- Reads all analysis.json files, extracting analysis["category_accuracy"]
- Constructs a new config:
  - default_model: the best average performer across categories
  - categories: For each category present in results, ranks models by accuracy:
    - category.model_scores = [{model: "<name>", score: <float>}, ...], highest first
    - category reasoning settings: auto-filled from a built-in mapping (math, physics, chemistry, CS, engineering -> high reasoning; others default to low/medium; you can adjust after generation)
  - Leaves out any special “auto” placeholder models if present

### Schema alignment:

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs
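
To make the mapping concrete, the sketch below reproduces just the ranking step: collect category_accuracy from each run, rank models per category, and pick the best average performer as default_model. It is illustrative only; the actual result_to_config.py also fills reasoning settings and the remaining config sections:

```python
# Build a categories[].model_scores / default_model fragment from analysis.json
# files -- illustrative sketch of the ranking logic described above.
import glob
import json
import os
from collections import defaultdict

import yaml  # pip install pyyaml


def strip_approach(label: str) -> str:
    """gemma3:27b_direct -> gemma3:27b (results folders append _direct/_cot)."""
    for suffix in ("_direct", "_cot"):
        if label.endswith(suffix):
            return label[: -len(suffix)]
    return label


scores = defaultdict(dict)  # category -> {model: accuracy}
for path in glob.glob("results/**/analysis.json", recursive=True):
    model = strip_approach(os.path.basename(os.path.dirname(path)))
    with open(path) as f:
        for category, acc in json.load(f).get("category_accuracy", {}).items():
            scores[category][model] = acc  # if direct and CoT both exist, the last one read wins

categories = [
    {
        "name": category,
        "model_scores": [
            {"model": model, "score": round(acc, 4)}
            for model, acc in sorted(by_model.items(), key=lambda kv: kv[1], reverse=True)
        ],
    }
    for category, by_model in sorted(scores.items())
]

# default_model: best average accuracy across the categories it was evaluated on.
averages = defaultdict(list)
for by_model in scores.values():
    for model, acc in by_model.items():
        averages[model].append(acc)
default_model = max(averages, key=lambda m: sum(averages[m]) / len(averages[m]))

print(yaml.safe_dump({"default_model": default_model, "categories": categories}, sort_keys=False))
```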

### Note:

- Existing config.yaml can be overwritten. Consider writing to a temp file first and diffing:
  - --output-file config/config.eval.yaml
- If your production config.yaml carries environment-specific settings (endpoints, pricing, policies), port the evaluated categories[].model_scores and default_model back into your canonical config.
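
One low-risk way to do that port is to script it instead of hand-editing. A minimal sketch, assuming PyYAML and the schema described above (default_model plus categories[].name / categories[].model_scores):

```python
# Port evaluated rankings into the canonical config -- illustrative sketch.
# Copies only default_model and categories[].model_scores, leaving
# environment-specific sections (endpoints, pricing, policies) untouched.
import yaml  # pip install pyyaml

with open("config/config.eval.yaml") as f:
    evaluated = yaml.safe_load(f)
with open("config/config.yaml") as f:
    canonical = yaml.safe_load(f)

canonical["default_model"] = evaluated["default_model"]

eval_scores = {c["name"]: c["model_scores"] for c in evaluated.get("categories", [])}
for category in canonical.get("categories", []):
    if category["name"] in eval_scores:
        category["model_scores"] = eval_scores[category["name"]]

with open("config/config.yaml", "w") as f:
    yaml.safe_dump(canonical, f, sort_keys=False)
```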