diff --git a/website/docs/training/model-performance-eval.md b/website/docs/training/model-performance-eval.md
new file mode 100644
index 00000000..87cd0a05
--- /dev/null
+++ b/website/docs/training/model-performance-eval.md
@@ -0,0 +1,323 @@

# Model Performance Evaluation

## Why evaluate?

Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

- Select the right model for each category and rank models into `categories[].model_scores`
- Pick a sensible `default_model` based on overall performance
- Decide when CoT prompting is worth the latency/cost tradeoff
- Catch regressions when models, prompts, or parameters change
- Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve the quality, cost efficiency, and reliability of the router.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update `categories[].model_scores` in the config.

See the code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval).

### What you'll run end-to-end

#### 1) Evaluate models

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize results

- bar/heatmap plots of per-category accuracies

![Bar](/img/bar.png)
![Heatmap](/img/heatmap.png)

#### 3) Generate an updated config.yaml

- Rank models per category into `categories[].model_scores`
- Set `default_model` to the best average performer
- Keep or apply category-level reasoning settings

## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models (a quick connectivity check is sketched at the end of this section)
  - Endpoint URL like http://localhost:8000/v1
  - Optional API key if your endpoint requires one

  ```bash
  # Terminal 1
  vllm serve microsoft/phi-4 --port 11434 --served_model_name phi4

  # Terminal 2
  vllm serve Qwen/Qwen3-0.6B --port 11435 --served_model_name qwen3-0.6B
  ```

- Python packages for the evaluation scripts:
  - From the repo root: matplotlib in [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/requirements.txt)
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

  ```bash
  # We will work in this directory throughout this guide
  cd /src/training/model_eval
  pip install -r requirements.txt
  ```

**Optional tip:**

- Ensure your `config/config.yaml` includes your deployed model names under `vllm_endpoints[].models`, plus any pricing/policy under `model_config`, if you plan to use the generated config directly.
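Before kicking off a long run, it can help to confirm that each endpoint is reachable and reports the served model names you plan to pass to `--models` (this is the same OpenAI-compatible `/v1/models` route the evaluation scripts query when `--models` is omitted). A minimal check, assuming `curl` and `jq` are installed and the two endpoints from the example above:

```bash
# List the model names each endpoint serves; these should match what you pass to --models
curl -s http://localhost:11434/v1/models | jq -r '.data[].id'   # expect: phi4
curl -s http://localhost:11435/v1/models | jq -r '.data[].id'   # expect: qwen3-0.6B
```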
## 2. Evaluate on MMLU-Pro

See the script in [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py).

### Example usage patterns

```bash
# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11434/v1 \
  --models phi4 \
  --samples-per-category 10

python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:11435/v1 \
  --models qwen3-0.6B \
  --samples-per-category 10 \
  --use-cot

# If you have set up Semantic Router properly, you can run everything in one go
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --samples-per-category 10
  # Append --use-cot to the command above to enable CoT
```

### Key flags

- **--endpoint**: vLLM OpenAI URL (default http://localhost:8000/v1)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
- **--categories**: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
- **--samples-per-category**: limit questions per category (useful for quick runs)
- **--use-cot**: enables the Chain-of-Thought prompting variant; results are saved under a separate subfolder suffix (_cot vs _direct)
- **--concurrent-requests**: concurrency for throughput
- **--output-dir**: where results are saved (default results)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts
  - **summary.json**: condensed metrics
- **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- **Model naming**: slashes are replaced with underscores in folder names, and the prompting approach is appended; e.g., gemma3:27b -> gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.

## 3. Evaluate on ARC Challenge (optional, overall sanity check)

See the script in [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py).

### Example usage patterns

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8801/v1 \
  --models qwen3-0.6B,phi4 \
  --output-dir arc_results
```

### Key flags

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the **MMLU-Pro** script

### What it outputs per model

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
- **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

**Note**

- ARC results do not feed `categories[].model_scores` directly, but they can help spot regressions; a quick way to compare runs is sketched below.
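Because ARC produces a single overall number per model, a lightweight way to watch for regressions is to print `overall_accuracy` from every `analysis.json` after each run and compare against earlier runs. A minimal sketch, assuming `jq` is installed and the output layout described above (`results/` for MMLU-Pro, `arc_results/` if you used `--output-dir` as in the example):

```bash
# Print "<run folder>  <overall accuracy>" for every completed evaluation
for f in results/*/analysis.json arc_results/*/analysis.json; do
  [ -f "$f" ] || continue
  printf '%s\t%s\n' "$(dirname "$f")" "$(jq -r '.overall_accuracy' "$f")"
done
```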
## 4. Visualize per-category performance

See the script in [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py).

### Example usage patterns

```bash
# Use results/ to generate a bar plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file results/bar.png

# Use results/ to generate a heatmap plot
python plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file results/heatmap.png

# Use sample data to generate an example plot
python plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file results/category_accuracies.png
```

### Key flags

- **--results-dir**: where the analysis.json files are
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does

- Finds all `results/**/analysis.json`, aggregates analysis["category_accuracy"] per model
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance

**Note**

- It treats `direct` and `cot` runs as distinct model variants by appending `:direct` or `:cot` to the label; the legend hides `:direct` for brevity.

## 5. Generate performance-based routing config

See the script in [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py).

### Example usage patterns

```bash
# Use results/ to generate a new config file (without overwriting config/config.yaml)
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

# Adjust the similarity threshold
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml \
  --similarity-threshold 0.85

# Generate from a specific results folder
python src/training/model_eval/result_to_config.py \
  --results-dir results/mmlu_run_2025_09_10 \
  --output-file config/config.eval.yaml
```

### Key flags

- **--results-dir**: points to the folder where the analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does

- Reads all `analysis.json` files, extracting analysis["category_accuracy"] (see the quick check below)
- Constructs a new config:
  - **categories**: for each category present in the results, ranks models by accuracy:
    - **category.model_scores** = `[{ model: "Model_Name", score: 0.87 }, ...]`, highest first
  - **default_model**: the best average performer across categories
  - **category reasoning settings**: auto-filled from a built-in mapping (you can adjust after generation)
    - math, physics, chemistry, CS, engineering -> high reasoning
    - others default -> low/medium
  - Leaves out any special "auto" placeholder models if present

### Schema alignment

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs
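Before generating a config, you can sanity-check the numbers the generator will rank by peeking at one run's per-category accuracies. A quick look, assuming `jq` is installed; the `phi4_direct` folder name is just an example, substitute one of your own run folders:

```bash
# Show the category -> accuracy map that result_to_config.py turns into model_scores
jq '.category_accuracy' results/phi4_direct/analysis.json   # adjust the folder name to one of your runs
```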
**Note**

- This script only works with results from the **MMLU-Pro** evaluation.
- An existing config.yaml can be overwritten (the default `--output-file` is config/config.yaml). Consider writing to a separate file first and diffing:
  - `--output-file config/config.eval.yaml`
- If your production config.yaml carries **environment-specific settings (endpoints, pricing, policies)**, port the evaluated `categories[].model_scores` and `default_model` back into your canonical config.

### Example config.eval.yaml

See more about configuration at [configuration](https://vllm-semantic-router.com/docs/getting-started/configuration).

```yaml
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
semantic_cache:
  enabled: true
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
tools:
  enabled: true
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: config/tools_db.json
  fallback_to_empty: true
prompt_guard:
  enabled: true
  use_modernbert: true
  model_id: models/jailbreak_classifier_modernbert-base_model
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json

# The vllm_endpoints and model_config sections are omitted here; add them as needed for your environment

classifier:
  category_model:
    model_id: models/category_classifier_modernbert-base_model
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: models/category_classifier_modernbert-base_model/category_mapping.json
  pii_model:
    model_id: models/pii_classifier_modernbert-base_presidio_token_model
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json
categories:
- name: business
  use_reasoning: false
  reasoning_description: Business content is typically conversational
  reasoning_effort: low
  model_scores:
  - model: phi4
    score: 0.2
  - model: qwen3-0.6B
    score: 0.0
- name: law
  use_reasoning: false
  reasoning_description: Legal content is typically explanatory
  reasoning_effort: medium
  model_scores:
  - model: phi4
    score: 0.8
  - model: qwen3-0.6B
    score: 0.2

# Some categories omitted here for brevity

- name: engineering
  use_reasoning: true
  reasoning_description: Engineering problems require systematic problem-solving
  reasoning_effort: high
  model_scores:
  - model: phi4
    score: 0.6
  - model: qwen3-0.6B
    score: 0.2
default_reasoning_effort: medium
default_model: phi4
```
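If your canonical config.yaml carries environment-specific settings, one low-risk workflow is to always generate to a separate file and review the differences before porting `categories[].model_scores` and `default_model` back by hand. A minimal sketch of that review step, assuming you run it from the repo root:

```bash
# Generate to a scratch file, then inspect what the evaluation would change
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml
diff -u config/config.yaml config/config.eval.yaml | less
```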
diff --git a/website/docs/training/training-overview.md b/website/docs/training/training-overview.md
index f4e0c476..1dee0d93 100644
--- a/website/docs/training/training-overview.md
+++ b/website/docs/training/training-overview.md
@@ -998,3 +998,7 @@ lora_training_infrastructure:
     lora_training: "$5-20 per model (reduced compute)"
     savings: "80-90% cost reduction"
 ```
+
+## Next
+
+- See: [Model Performance Evaluation](/docs/training/model-performance-eval)
diff --git a/website/sidebars.js b/website/sidebars.js
index b0397e97..edf8296d 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -47,6 +47,7 @@ const sidebars = {
       label: 'Model Training',
       items: [
         'training/training-overview',
+        'training/model-performance-eval',
       ],
     },
     {
diff --git a/website/static/img/bar.png b/website/static/img/bar.png
new file mode 100644
index 00000000..d369681d
Binary files /dev/null and b/website/static/img/bar.png differ
diff --git a/website/static/img/heatmap.png b/website/static/img/heatmap.png
new file mode 100644
index 00000000..06b4ea26
Binary files /dev/null and b/website/static/img/heatmap.png differ