
Commit c911705

asaadbalum (Asaad Balum) and rootfs authored
feat(bench): Add reasoning mode evaluation benchmark (Issue #42) (#791)
Implements comprehensive benchmarking for comparing standard vs reasoning mode, as specified in the Issue #42 acceptance criteria.

Key Features:
- New reasoning_mode_eval.py module for dedicated standard vs reasoning comparison
- Implements all Issue #42 metrics:
  - Response correctness (accuracy) on MMLU(-Pro) and non-MMLU test sets
  - Token usage ratio (completion_tokens / prompt_tokens)
  - Response time per output token (ms)
  - Per-category breakdown of all metrics
- Automatic plot generation comparing modes
- Comprehensive markdown report generation
- CLI integration via 'reasoning-eval' command

New Files:
- bench/vllm_semantic_router_bench/reasoning_mode_eval.py
- bench/reasoning_mode_eval.sh (convenience script)
- bench/test_mock_server.py (testing mock server)

Modified:
- bench/vllm_semantic_router_bench/cli.py (new command)
- bench/vllm_semantic_router_bench/__init__.py (exports)
- bench/pyproject.toml (entry point, version bump to 1.1.0)
- bench/README.md (documentation)

Usage:
  vllm-semantic-router-bench reasoning-eval --datasets mmlu gpqa --samples 10
  reasoning-mode-eval --datasets mmlu truthfulqa --samples-per-category 5

Closes #42

Signed-off-by: Asaad Balum <[email protected]>
Co-authored-by: Asaad Balum <[email protected]>
Co-authored-by: Huamin Chen <[email protected]>
1 parent 970f349 commit c911705

File tree

6 files changed: +1643 −4 lines changed


bench/README.md

Lines changed: 142 additions & 2 deletions
@@ -10,6 +10,7 @@ A comprehensive benchmark suite for evaluating **semantic router** performance a
 - **6 Major Reasoning Datasets**: MMLU-Pro, ARC, GPQA, TruthfulQA, CommonsenseQA, HellaSwag
 - **Router vs vLLM Comparison**: Side-by-side performance evaluation
 - **Multiple Evaluation Modes**: NR (neutral), XC (explicit CoT), NR_REASONING (auto-reasoning)
+- **Reasoning Mode Evaluation** (Issue #42): Dedicated standard vs reasoning mode comparison
 - **Research-Ready Output**: CSV files and publication-quality plots
 - **Dataset-Agnostic Architecture**: Easy to extend with new datasets
 - **CLI Tools**: Simple command-line interface for common operations
@@ -31,13 +32,67 @@ vllm-semantic-router-bench test --dataset mmlu --samples 5
 # Full comparison between router and vLLM
 vllm-semantic-router-bench compare --dataset arc --samples 10
 
+# Reasoning mode evaluation (Issue #42)
+vllm-semantic-router-bench reasoning-eval --datasets mmlu gpqa --samples 10
+
 # List available datasets
 vllm-semantic-router-bench list-datasets
 
 # Run comprehensive multi-dataset benchmark
 vllm-semantic-router-bench comprehensive
 ```
 
+### Reasoning Mode Evaluation (Issue #42)
+
+Dedicated benchmark comparing standard vs reasoning mode with key metrics:
+
+```bash
+# Run reasoning mode evaluation
+reasoning-mode-eval --datasets mmlu gpqa truthfulqa --samples-per-category 10
+
+# Or use the shell script
+./reasoning_mode_eval.sh
+```
+
+**Key Metrics Evaluated:**
+
+- **Response Correctness**: Accuracy on MMLU(-Pro) and non-MMLU test sets
+- **Token Usage Ratio**: `completion_tokens / prompt_tokens`
+- **Time per Output Token**: Response time efficiency metric (ms)
+
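The two efficiency metrics above are simple ratios over what an OpenAI-compatible endpoint already returns. A minimal sketch of computing them for a single request (the benchmark's own implementation lives in `reasoning_mode_eval.py`, which is not shown in this diff; endpoint URL and API key below reuse the defaults from `reasoning_mode_eval.sh`, and the model name is a placeholder):

```python
import time
from openai import OpenAI  # assumes the `openai` client package is installed

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="1234")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3-14b",  # placeholder model name
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
elapsed_ms = (time.perf_counter() - start) * 1000

usage = resp.usage
token_usage_ratio = usage.completion_tokens / usage.prompt_tokens
time_per_output_token_ms = elapsed_ms / max(usage.completion_tokens, 1)

print(f"ratio={token_usage_ratio:.2f}, ms/token={time_per_output_token_ms:.1f}")
```

Reasoning mode typically inflates both numbers, which is why the benchmark reports them alongside accuracy rather than accuracy alone.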
+**Automated vSR Config Generation:**
+
+The benchmark automatically generates vLLM Semantic Router (vSR) model configuration based on evaluation results:
+
+```bash
+# Generate vSR config with reasoning family specification
+reasoning-mode-eval \
+  --datasets mmlu gpqa \
+  --model qwen3-14b \
+  --reasoning-family qwen3 \
+  --samples-per-category 20
+```
+
+**Output includes:**
+
+- `vsr_model_config.yaml` - Ready-to-use YAML config snippet for `config/config.yaml`
+- `vsr_model_config_recommendation.json` - Detailed performance analysis and recommendations
+- Automatic recommendation based on accuracy vs. cost/latency trade-offs
+
+**Example generated config:**
+
+```yaml
+model_config:
+  qwen3-14b:
+    reasoning_family: qwen3
+```
+
+**Supported reasoning families:**
+
+- `qwen3` - For Qwen-3 models with `chat_template_kwargs`
+- `deepseek` - For DeepSeek-R1 models with `thinking` parameter
+- `gpt-oss` - For GPT-OSS models with `reasoning_effort`
+
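For orientation, the three families differ only in how the "think before answering" switch is passed to the model server. A minimal sketch of how such per-family request extras might be assembled; the field names follow the descriptions above and common vLLM conventions, and the exact payloads used by `reasoning_mode_eval.py` are not shown in this diff, so treat the inner keys (for example `enable_thinking`) as assumptions:

```python
def reasoning_request_extras(reasoning_family: str, effort: str = "medium") -> dict:
    """Extra request fields that switch a model into reasoning mode (sketch)."""
    if reasoning_family == "qwen3":
        # Qwen-3: toggled through chat template kwargs (inner kwarg name assumed)
        return {"chat_template_kwargs": {"enable_thinking": True}}
    if reasoning_family == "deepseek":
        # DeepSeek-R1: toggled through a `thinking` parameter (assumed shape)
        return {"thinking": True}
    if reasoning_family == "gpt-oss":
        # GPT-OSS: toggled through `reasoning_effort` ("low" / "medium" / "high")
        return {"reasoning_effort": effort}
    return {}  # standard mode: no extra fields
```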
 ### Python API
 
 ```python
@@ -104,10 +159,12 @@ The benchmark generates research-ready outputs:
 - **Plots**: Accuracy and token usage comparisons
 - **Summary Reports**: Markdown reports with key findings
 
-### Example Output Structure
+### Generated Output Structure
+
+**Note**: The following directory structure is created locally when you run the benchmark. These files are not committed to the repository.
 
 ```
-results/
+results/                                       # Created locally when running benchmarks
 ├── research_results_master.csv               # Main research data
 ├── comparison_20250115_143022/
 │   ├── router_mmlu/
@@ -118,6 +175,89 @@ results/
 │   │   ├── accuracy_comparison.png
 │   │   └── token_usage_comparison.png
 │   └── RESEARCH_SUMMARY.md
+└── reasoning_mode_eval/                       # Issue #42 evaluation results
+    ├── reasoning_mode_eval_summary.json       # Full evaluation summary with all metrics
+    ├── vsr_model_config.yaml                  # Ready-to-use vSR config snippet
+    ├── vsr_model_config_recommendation.json   # Detailed recommendation & analysis
+    ├── REASONING_MODE_EVALUATION_REPORT.md    # Human-readable report
+    ├── plots/
+    │   ├── MMLU-Pro_overall_comparison.png
+    │   ├── MMLU-Pro_category_accuracy.png
+    │   ├── MMLU-Pro_token_usage_ratio.png
+    │   └── MMLU-Pro_time_per_token.png
+    └── MMLU-Pro/
+        ├── detailed_results.csv
+        ├── standard_mode_results.csv
+        └── reasoning_mode_results.csv
+```
+
+## 🚀 Using Generated vSR Config in Production
+
+After running the reasoning mode evaluation, integrate the generated configuration into your semantic-router deployment:
+
+### 1. Review the Recommendation
+
+```bash
+# Check the detailed recommendation
+cat results/reasoning_mode_eval/vsr_model_config_recommendation.json
+
+# View the generated config
+cat results/reasoning_mode_eval/vsr_model_config.yaml
+```
+
+### 2. Integrate into config.yaml
+
+Copy the generated `model_config` section to your `config/config.yaml`:
+
+```yaml
+# config/config.yaml
+
+model_config:
+  qwen3-14b:
+    reasoning_family: qwen3              # From generated config
+    preferred_endpoints: ["endpoint1"]   # Optional: your endpoint configuration
+```
+
+### 3. Enable Reasoning for Categories (Optional)
+
+To enable reasoning mode for specific categories, update your intelligent routing configuration:
+
+```yaml
+# config/config.yaml
+
+default_reasoning_effort: "medium"  # or "low", "high"
+
+# OR enable per-category
+categories:
+  - name: math
+    reasoning_enabled: true   # Enable reasoning for complex math queries
+  - name: casual
+    reasoning_enabled: false  # Disable for casual conversations
+```
+
+### 4. End-to-End Pipeline Example
+
+```bash
+# 1. Run evaluation
+reasoning-mode-eval \
+  --datasets mmlu gpqa truthfulqa \
+  --model qwen3-14b \
+  --reasoning-family qwen3 \
+  --endpoint http://your-vllm-server:8000/v1 \
+  --samples-per-category 50
+
+# 2. Review results
+cat results/reasoning_mode_eval/REASONING_MODE_EVALUATION_REPORT.md
+
+# 3. If recommendation is positive, merge generated config
+cp results/reasoning_mode_eval/vsr_model_config.yaml config/model_config_addition.yaml
+
+# 4. Update your main config.yaml with the new model_config section
+
+# 5. Restart semantic-router with updated config
+kubectl rollout restart deployment semantic-router   # For K8s
+# OR
+docker-compose restart semantic-router               # For Docker Compose
 ```
 
 ## 🛠️ Development
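Step 4 of the pipeline above leaves the actual merge of the generated `model_config` snippet into `config/config.yaml` to the operator. One way to script it, as a minimal sketch (assumes PyYAML is available and the file paths used earlier in the pipeline; this merge helper is not part of this commit):

```python
import yaml  # PyYAML

with open("config/config.yaml") as f:
    config = yaml.safe_load(f) or {}

with open("config/model_config_addition.yaml") as f:
    addition = yaml.safe_load(f) or {}

# Merge the generated model_config entries into the existing section (if any).
config["model_config"] = {
    **(config.get("model_config") or {}),
    **(addition.get("model_config") or {}),
}

with open("config/config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```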

bench/pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "vllm-semantic-router-bench"
-version = "1.0.0"
+version = "1.1.0"
 description = "Comprehensive benchmark suite for semantic router vs direct vLLM evaluation across multiple reasoning datasets"
 readme = "README.md"
 requires-python = ">=3.8"
@@ -77,6 +77,7 @@ Repository = "https://github.com/vllm-project/semantic-router"
 vllm-semantic-router-bench = "vllm_semantic_router_bench.cli:main"
 router-bench = "vllm_semantic_router_bench.router_reason_bench_multi_dataset:main"
 bench-plot = "vllm_semantic_router_bench.bench_plot:main"
+reasoning-mode-eval = "vllm_semantic_router_bench.reasoning_mode_eval:main"
 
 [tool.setuptools.packages.find]
 where = ["."]
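Because the new command is registered under `[project.scripts]`, it only appears on `PATH` after the bench package is (re)installed. A minimal sketch, with paths assumed relative to the repository root:

```bash
# Reinstall the bench package so the new console script gets registered
pip install -e bench/

# The entry point then resolves to vllm_semantic_router_bench.reasoning_mode_eval:main
reasoning-mode-eval --datasets mmlu gpqa --samples-per-category 5
```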

bench/reasoning_mode_eval.sh

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+#!/bin/bash
+#
+# Reasoning Mode Evaluation Script
+# Issue #42: [v0.1]Bench: Reasoning mode evaluation
+#
+# Compares standard vs reasoning mode using:
+#   - Response correctness on MMLU(-Pro) and non-MMLU test sets
+#   - Token usage (completion_tokens/prompt_tokens ratio)
+#   - Response time per output token
+#
+# Usage:
+#   ./reasoning_mode_eval.sh [options]
+#
+# Environment Variables:
+#   VLLM_ENDPOINT - vLLM endpoint URL (default: http://127.0.0.1:8000/v1)
+#   VLLM_API_KEY  - API key for vLLM endpoint (default: 1234)
+#   MODEL         - Model to evaluate (fetches from endpoint if not set)
+#
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Default configuration
+VLLM_ENDPOINT="${VLLM_ENDPOINT:-http://127.0.0.1:8000/v1}"
+VLLM_API_KEY="${VLLM_API_KEY:-1234}"
+OUTPUT_DIR="${OUTPUT_DIR:-results/reasoning_mode_eval}"
+SAMPLES="${SAMPLES:-10}"
+CONCURRENT="${CONCURRENT:-1}"
+
+# Default datasets: MMLU (primary) and non-MMLU (for comparison)
+DATASETS="${DATASETS:-mmlu gpqa truthfulqa}"
+
+echo "=============================================="
+echo "🧠 Reasoning Mode Evaluation (Issue #42)"
+echo "=============================================="
+echo ""
+echo "Configuration:"
+echo "  Endpoint:   ${VLLM_ENDPOINT}"
+echo "  Datasets:   ${DATASETS}"
+echo "  Samples:    ${SAMPLES} per category"
+echo "  Concurrent: ${CONCURRENT} requests"
+echo "  Output:     ${OUTPUT_DIR}"
+echo ""
+
+# Build command
+CMD="python -m vllm_semantic_router_bench.reasoning_mode_eval \
+    --datasets ${DATASETS} \
+    --endpoint ${VLLM_ENDPOINT} \
+    --api-key ${VLLM_API_KEY} \
+    --samples-per-category ${SAMPLES} \
+    --concurrent-requests ${CONCURRENT} \
+    --output-dir ${OUTPUT_DIR}"
+
+# Add model if specified
+if [ -n "${MODEL}" ]; then
+    CMD="${CMD} --model ${MODEL}"
+    echo "  Model: ${MODEL}"
+fi
+
+echo ""
+echo "Running evaluation..."
+echo ""
+
+cd "${SCRIPT_DIR}"
+eval "${CMD}"
+
+echo ""
+echo "=============================================="
+echo "✅ Evaluation Complete"
+echo "=============================================="
+echo ""
+echo "Results saved to: ${OUTPUT_DIR}"
+echo ""
+echo "Key outputs:"
+echo "  - reasoning_mode_eval_summary.json (JSON summary)"
+echo "  - REASONING_MODE_EVALUATION_REPORT.md (Markdown report)"
+echo "  - plots/ (Visualization plots)"
+echo "  - <dataset>/detailed_results.csv (Per-question results)"
+echo ""
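The script is driven entirely by the environment variables documented in its header, so a typical invocation just overrides a few of them. For example (values below are placeholders; anything left unset falls back to the script defaults):

```bash
VLLM_ENDPOINT=http://127.0.0.1:8000/v1 \
MODEL=qwen3-14b \
SAMPLES=20 \
DATASETS="mmlu gpqa" \
./reasoning_mode_eval.sh
```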

bench/vllm_semantic_router_bench/__init__.py

Lines changed: 11 additions & 1 deletion
@@ -16,17 +16,24 @@
 - Dataset-agnostic architecture with factory pattern
 - Router vs direct vLLM comparison
 - Multiple evaluation modes (NR, XC, NR_REASONING)
+- Reasoning mode evaluation (Issue #42) - standard vs reasoning comparison
 - Comprehensive plotting and analysis tools
 - Research-ready CSV output
 - Configurable token limits per dataset
 """
 
-__version__ = "1.0.0"
+__version__ = "1.1.0"
 __author__ = "vLLM Semantic Router Team"
 
 from .dataset_factory import DatasetFactory, list_available_datasets
 from .dataset_interface import DatasetInfo, DatasetInterface, PromptFormatter, Question
 
+# Reasoning mode evaluation (Issue #42)
+from .reasoning_mode_eval import (
+    ReasoningModeMetrics,
+    ReasoningModeComparison,
+)
+
 # Make key classes available at package level
 __all__ = [
     "DatasetInterface",
@@ -35,5 +42,8 @@
     "PromptFormatter",
     "DatasetFactory",
     "list_available_datasets",
+    # Reasoning mode evaluation (Issue #42)
+    "ReasoningModeMetrics",
+    "ReasoningModeComparison",
     "__version__",
 ]
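With these exports in place, the Issue #42 classes are importable straight from the package. A minimal sketch (assumes the bench package is installed, e.g. `pip install -e bench/`; that `list_available_datasets()` takes no arguments is an assumption, and the constructors of the two result classes are not shown in this diff):

```python
from vllm_semantic_router_bench import (
    ReasoningModeComparison,   # exported by this commit
    ReasoningModeMetrics,      # exported by this commit
    __version__,
    list_available_datasets,
)

print(__version__)                 # "1.1.0" after this commit
print(list_available_datasets())   # dataset registry, e.g. mmlu, gpqa, truthfulqa, ...
```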
