
Commit c911705

asaadbalum (Asaad Balum) and rootfs authored
feat(bench): Add reasoning mode evaluation benchmark (Issue #42) (#791)
Implements comprehensive benchmarking for comparing standard vs reasoning mode, as specified in the Issue #42 acceptance criteria.

Key Features:
- New reasoning_mode_eval.py module for dedicated standard vs reasoning comparison
- Implements all Issue #42 metrics:
  - Response correctness (accuracy) on MMLU(-Pro) and non-MMLU test sets
  - Token usage ratio (completion_tokens / prompt_tokens)
  - Response time per output token (ms)
  - Per-category breakdown of all metrics
- Automatic plot generation comparing modes
- Comprehensive markdown report generation
- CLI integration via 'reasoning-eval' command

New Files:
- bench/vllm_semantic_router_bench/reasoning_mode_eval.py
- bench/reasoning_mode_eval.sh (convenience script)
- bench/test_mock_server.py (testing mock server)

Modified:
- bench/vllm_semantic_router_bench/cli.py (new command)
- bench/vllm_semantic_router_bench/__init__.py (exports)
- bench/pyproject.toml (entry point, version bump to 1.1.0)
- bench/README.md (documentation)

Usage:
  vllm-semantic-router-bench reasoning-eval --datasets mmlu gpqa --samples 10
  reasoning-mode-eval --datasets mmlu truthfulqa --samples-per-category 5

Closes #42

Signed-off-by: Asaad Balum <[email protected]>
Co-authored-by: Asaad Balum <[email protected]>
Co-authored-by: Huamin Chen <[email protected]>
1 parent 970f349 commit c911705

File tree

6 files changed: +1643 −4 lines changed


bench/README.md

Lines changed: 142 additions & 2 deletions
@@ -10,6 +10,7 @@ A comprehensive benchmark suite for evaluating **semantic router** performance a
 - **6 Major Reasoning Datasets**: MMLU-Pro, ARC, GPQA, TruthfulQA, CommonsenseQA, HellaSwag
 - **Router vs vLLM Comparison**: Side-by-side performance evaluation
 - **Multiple Evaluation Modes**: NR (neutral), XC (explicit CoT), NR_REASONING (auto-reasoning)
+- **Reasoning Mode Evaluation** (Issue #42): Dedicated standard vs reasoning mode comparison
 - **Research-Ready Output**: CSV files and publication-quality plots
 - **Dataset-Agnostic Architecture**: Easy to extend with new datasets
 - **CLI Tools**: Simple command-line interface for common operations
@@ -31,13 +32,67 @@ vllm-semantic-router-bench test --dataset mmlu --samples 5
 # Full comparison between router and vLLM
 vllm-semantic-router-bench compare --dataset arc --samples 10
 
+# Reasoning mode evaluation (Issue #42)
+vllm-semantic-router-bench reasoning-eval --datasets mmlu gpqa --samples 10
+
 # List available datasets
 vllm-semantic-router-bench list-datasets
 
 # Run comprehensive multi-dataset benchmark
 vllm-semantic-router-bench comprehensive
 ```
 
+### Reasoning Mode Evaluation (Issue #42)
+
+Dedicated benchmark comparing standard vs reasoning mode with key metrics:
+
+```bash
+# Run reasoning mode evaluation
+reasoning-mode-eval --datasets mmlu gpqa truthfulqa --samples-per-category 10
+
+# Or use the shell script
+./reasoning_mode_eval.sh
+```
+
+**Key Metrics Evaluated:**
+
+- **Response Correctness**: Accuracy on MMLU(-Pro) and non-MMLU test sets
+- **Token Usage Ratio**: `completion_tokens / prompt_tokens`
+- **Time per Output Token**: Response time efficiency metric (ms)
+
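The two efficiency metrics above are simple ratios over what an OpenAI-compatible endpoint already returns. A minimal sketch of computing them for a single request (the benchmark's own implementation lives in `reasoning_mode_eval.py`, which is not shown in this diff; endpoint URL and API key below reuse the defaults from `reasoning_mode_eval.sh`, and the model name is a placeholder):

```python
import time
from openai import OpenAI  # assumes the `openai` client package is installed

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="1234")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3-14b",  # placeholder model name
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
elapsed_ms = (time.perf_counter() - start) * 1000

usage = resp.usage
token_usage_ratio = usage.completion_tokens / usage.prompt_tokens
time_per_output_token_ms = elapsed_ms / max(usage.completion_tokens, 1)

print(f"ratio={token_usage_ratio:.2f}, ms/token={time_per_output_token_ms:.1f}")
```

Reasoning mode typically inflates both numbers, which is why the benchmark reports them alongside accuracy rather than accuracy alone.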
+**Automated vSR Config Generation:**
+
+The benchmark automatically generates vLLM Semantic Router (vSR) model configuration based on evaluation results:
+
+```bash
+# Generate vSR config with reasoning family specification
+reasoning-mode-eval \
+  --datasets mmlu gpqa \
+  --model qwen3-14b \
+  --reasoning-family qwen3 \
+  --samples-per-category 20
+```
+
+**Output includes:**
+
+- `vsr_model_config.yaml` - Ready-to-use YAML config snippet for `config/config.yaml`
+- `vsr_model_config_recommendation.json` - Detailed performance analysis and recommendations
+- Automatic recommendation based on accuracy vs. cost/latency trade-offs
+
+**Example generated config:**
+
+```yaml
+model_config:
+  qwen3-14b:
+    reasoning_family: qwen3
+```
+
+**Supported reasoning families:**
+
+- `qwen3` - For Qwen-3 models with `chat_template_kwargs`
+- `deepseek` - For DeepSeek-R1 models with `thinking` parameter
+- `gpt-oss` - For GPT-OSS models with `reasoning_effort`
+
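For orientation, the three families differ only in how the "think before answering" switch is passed to the model server. A minimal sketch of how such per-family request extras might be assembled; the field names follow the descriptions above and common vLLM conventions, and the exact payloads used by `reasoning_mode_eval.py` are not shown in this diff, so treat the inner keys (for example `enable_thinking`) as assumptions:

```python
def reasoning_request_extras(reasoning_family: str, effort: str = "medium") -> dict:
    """Extra request fields that switch a model into reasoning mode (sketch)."""
    if reasoning_family == "qwen3":
        # Qwen-3: toggled through chat template kwargs (inner kwarg name assumed)
        return {"chat_template_kwargs": {"enable_thinking": True}}
    if reasoning_family == "deepseek":
        # DeepSeek-R1: toggled through a `thinking` parameter (assumed shape)
        return {"thinking": True}
    if reasoning_family == "gpt-oss":
        # GPT-OSS: toggled through `reasoning_effort` ("low" / "medium" / "high")
        return {"reasoning_effort": effort}
    return {}  # standard mode: no extra fields
```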
 ### Python API
 
 ```python
@@ -104,10 +159,12 @@ The benchmark generates research-ready outputs:
 - **Plots**: Accuracy and token usage comparisons
 - **Summary Reports**: Markdown reports with key findings
 
-### Example Output Structure
+### Generated Output Structure
+
+**Note**: The following directory structure is created locally when you run the benchmark. These files are not committed to the repository.
 
 ```
-results/
+results/                                       # Created locally when running benchmarks
 ├── research_results_master.csv               # Main research data
 ├── comparison_20250115_143022/
 │   ├── router_mmlu/
@@ -118,6 +175,89 @@ results/
 │   │   ├── accuracy_comparison.png
 │   │   └── token_usage_comparison.png
 │   └── RESEARCH_SUMMARY.md
+└── reasoning_mode_eval/                       # Issue #42 evaluation results
+    ├── reasoning_mode_eval_summary.json       # Full evaluation summary with all metrics
+    ├── vsr_model_config.yaml                  # Ready-to-use vSR config snippet
+    ├── vsr_model_config_recommendation.json   # Detailed recommendation & analysis
+    ├── REASONING_MODE_EVALUATION_REPORT.md    # Human-readable report
+    ├── plots/
+    │   ├── MMLU-Pro_overall_comparison.png
+    │   ├── MMLU-Pro_category_accuracy.png
+    │   ├── MMLU-Pro_token_usage_ratio.png
+    │   └── MMLU-Pro_time_per_token.png
+    └── MMLU-Pro/
+        ├── detailed_results.csv
+        ├── standard_mode_results.csv
+        └── reasoning_mode_results.csv
+```
+
+## 🚀 Using Generated vSR Config in Production
+
+After running the reasoning mode evaluation, integrate the generated configuration into your semantic-router deployment:
+
+### 1. Review the Recommendation
+
+```bash
+# Check the detailed recommendation
+cat results/reasoning_mode_eval/vsr_model_config_recommendation.json
+
+# View the generated config
+cat results/reasoning_mode_eval/vsr_model_config.yaml
+```
+
+### 2. Integrate into config.yaml
+
+Copy the generated `model_config` section to your `config/config.yaml`:
+
+```yaml
+# config/config.yaml
+
+model_config:
+  qwen3-14b:
+    reasoning_family: qwen3              # From generated config
+    preferred_endpoints: ["endpoint1"]   # Optional: your endpoint configuration
+```
+
+### 3. Enable Reasoning for Categories (Optional)
+
+To enable reasoning mode for specific categories, update your intelligent routing configuration:
+
+```yaml
+# config/config.yaml
+
+default_reasoning_effort: "medium"  # or "low", "high"
+
+# OR enable per-category
+categories:
+  - name: math
+    reasoning_enabled: true   # Enable reasoning for complex math queries
+  - name: casual
+    reasoning_enabled: false  # Disable for casual conversations
+```
+
+### 4. End-to-End Pipeline Example
+
+```bash
+# 1. Run evaluation
+reasoning-mode-eval \
+  --datasets mmlu gpqa truthfulqa \
+  --model qwen3-14b \
+  --reasoning-family qwen3 \
+  --endpoint http://your-vllm-server:8000/v1 \
+  --samples-per-category 50
+
+# 2. Review results
+cat results/reasoning_mode_eval/REASONING_MODE_EVALUATION_REPORT.md
+
+# 3. If recommendation is positive, merge generated config
+cp results/reasoning_mode_eval/vsr_model_config.yaml config/model_config_addition.yaml
+
+# 4. Update your main config.yaml with the new model_config section
+
+# 5. Restart semantic-router with updated config
+kubectl rollout restart deployment semantic-router   # For K8s
+# OR
+docker-compose restart semantic-router               # For Docker Compose
 ```
 
 ## 🛠️ Development
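Step 4 of the pipeline above leaves the actual merge of the generated `model_config` snippet into `config/config.yaml` to the operator. One way to script it, as a minimal sketch (assumes PyYAML is available and the file paths used earlier in the pipeline; this merge helper is not part of this commit):

```python
import yaml  # PyYAML

with open("config/config.yaml") as f:
    config = yaml.safe_load(f) or {}

with open("config/model_config_addition.yaml") as f:
    addition = yaml.safe_load(f) or {}

# Merge the generated model_config entries into the existing section (if any).
config["model_config"] = {
    **(config.get("model_config") or {}),
    **(addition.get("model_config") or {}),
}

with open("config/config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```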

bench/pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "vllm-semantic-router-bench"
-version = "1.0.0"
+version = "1.1.0"
 description = "Comprehensive benchmark suite for semantic router vs direct vLLM evaluation across multiple reasoning datasets"
 readme = "README.md"
 requires-python = ">=3.8"
@@ -77,6 +77,7 @@ Repository = "https://github.com/vllm-project/semantic-router"
 vllm-semantic-router-bench = "vllm_semantic_router_bench.cli:main"
 router-bench = "vllm_semantic_router_bench.router_reason_bench_multi_dataset:main"
 bench-plot = "vllm_semantic_router_bench.bench_plot:main"
+reasoning-mode-eval = "vllm_semantic_router_bench.reasoning_mode_eval:main"
 
 [tool.setuptools.packages.find]
 where = ["."]
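Because the new command is registered under `[project.scripts]`, it only appears on `PATH` after the bench package is (re)installed. A minimal sketch, with paths assumed relative to the repository root:

```bash
# Reinstall the bench package so the new console script gets registered
pip install -e bench/

# The entry point then resolves to vllm_semantic_router_bench.reasoning_mode_eval:main
reasoning-mode-eval --datasets mmlu gpqa --samples-per-category 5
```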

bench/reasoning_mode_eval.sh

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+#!/bin/bash
+#
+# Reasoning Mode Evaluation Script
+# Issue #42: [v0.1]Bench: Reasoning mode evaluation
+#
+# Compares standard vs reasoning mode using:
+#   - Response correctness on MMLU(-Pro) and non-MMLU test sets
+#   - Token usage (completion_tokens/prompt_tokens ratio)
+#   - Response time per output token
+#
+# Usage:
+#   ./reasoning_mode_eval.sh [options]
+#
+# Environment Variables:
+#   VLLM_ENDPOINT - vLLM endpoint URL (default: http://127.0.0.1:8000/v1)
+#   VLLM_API_KEY  - API key for vLLM endpoint (default: 1234)
+#   MODEL         - Model to evaluate (fetches from endpoint if not set)
+#
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Default configuration
+VLLM_ENDPOINT="${VLLM_ENDPOINT:-http://127.0.0.1:8000/v1}"
+VLLM_API_KEY="${VLLM_API_KEY:-1234}"
+OUTPUT_DIR="${OUTPUT_DIR:-results/reasoning_mode_eval}"
+SAMPLES="${SAMPLES:-10}"
+CONCURRENT="${CONCURRENT:-1}"
+
+# Default datasets: MMLU (primary) and non-MMLU (for comparison)
+DATASETS="${DATASETS:-mmlu gpqa truthfulqa}"
+
+echo "=============================================="
+echo "🧠 Reasoning Mode Evaluation (Issue #42)"
+echo "=============================================="
+echo ""
+echo "Configuration:"
+echo "  Endpoint:   ${VLLM_ENDPOINT}"
+echo "  Datasets:   ${DATASETS}"
+echo "  Samples:    ${SAMPLES} per category"
+echo "  Concurrent: ${CONCURRENT} requests"
+echo "  Output:     ${OUTPUT_DIR}"
+echo ""
+
+# Build command
+CMD="python -m vllm_semantic_router_bench.reasoning_mode_eval \
+    --datasets ${DATASETS} \
+    --endpoint ${VLLM_ENDPOINT} \
+    --api-key ${VLLM_API_KEY} \
+    --samples-per-category ${SAMPLES} \
+    --concurrent-requests ${CONCURRENT} \
+    --output-dir ${OUTPUT_DIR}"
+
+# Add model if specified
+if [ -n "${MODEL}" ]; then
+    CMD="${CMD} --model ${MODEL}"
+    echo "  Model: ${MODEL}"
+fi
+
+echo ""
+echo "Running evaluation..."
+echo ""
+
+cd "${SCRIPT_DIR}"
+eval "${CMD}"
+
+echo ""
+echo "=============================================="
+echo "✅ Evaluation Complete"
+echo "=============================================="
+echo ""
+echo "Results saved to: ${OUTPUT_DIR}"
+echo ""
+echo "Key outputs:"
+echo "  - reasoning_mode_eval_summary.json (JSON summary)"
+echo "  - REASONING_MODE_EVALUATION_REPORT.md (Markdown report)"
+echo "  - plots/ (Visualization plots)"
+echo "  - <dataset>/detailed_results.csv (Per-question results)"
+echo ""
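The script is driven entirely by the environment variables documented in its header, so a typical invocation just overrides a few of them. For example (values below are placeholders; anything left unset falls back to the script defaults):

```bash
VLLM_ENDPOINT=http://127.0.0.1:8000/v1 \
MODEL=qwen3-14b \
SAMPLES=20 \
DATASETS="mmlu gpqa" \
./reasoning_mode_eval.sh
```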

bench/vllm_semantic_router_bench/__init__.py

Lines changed: 11 additions & 1 deletion
@@ -16,17 +16,24 @@
 - Dataset-agnostic architecture with factory pattern
 - Router vs direct vLLM comparison
 - Multiple evaluation modes (NR, XC, NR_REASONING)
+- Reasoning mode evaluation (Issue #42) - standard vs reasoning comparison
 - Comprehensive plotting and analysis tools
 - Research-ready CSV output
 - Configurable token limits per dataset
 """
 
-__version__ = "1.0.0"
+__version__ = "1.1.0"
 __author__ = "vLLM Semantic Router Team"
 
 from .dataset_factory import DatasetFactory, list_available_datasets
 from .dataset_interface import DatasetInfo, DatasetInterface, PromptFormatter, Question
 
+# Reasoning mode evaluation (Issue #42)
+from .reasoning_mode_eval import (
+    ReasoningModeMetrics,
+    ReasoningModeComparison,
+)
+
 # Make key classes available at package level
 __all__ = [
     "DatasetInterface",
@@ -35,5 +42,8 @@
     "PromptFormatter",
     "DatasetFactory",
     "list_available_datasets",
+    # Reasoning mode evaluation (Issue #42)
+    "ReasoningModeMetrics",
+    "ReasoningModeComparison",
     "__version__",
 ]
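With these exports in place, the Issue #42 classes are importable straight from the package. A minimal sketch (assumes the bench package is installed, e.g. `pip install -e bench/`; that `list_available_datasets()` takes no arguments is an assumption, and the constructors of the two result classes are not shown in this diff):

```python
from vllm_semantic_router_bench import (
    ReasoningModeComparison,   # exported by this commit
    ReasoningModeMetrics,      # exported by this commit
    __version__,
    list_available_datasets,
)

print(__version__)                 # "1.1.0" after this commit
print(list_available_datasets())   # dataset registry, e.g. mmlu, gpqa, truthfulqa, ...
```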
