# Reasoning Routing Quickstart

This short guide shows how to enable and verify “reasoning routing” in the Semantic Router:
- The minimal config.yaml fields you need
- Example requests and responses (OpenAI-compatible)
- A comprehensive evaluation command you can run

Prerequisites
- A running OpenAI-compatible backend for your models (e.g., vLLM or any other OpenAI-compatible server). It must be reachable at the addresses you configure under vllm_endpoints (address:port).
- Envoy + the router (see the “Start the router” section below)

1) Minimal configuration
Put this in config/config.yaml (or merge it into your existing config). It defines:
- Categories that require reasoning (e.g., math)
- Reasoning families that capture model syntax differences (DeepSeek/Qwen3 use chat_template_kwargs; GPT-OSS/GPT use reasoning_effort)
- Which concrete models belong to which reasoning family
- The classifier (required for category detection; without it, reasoning will not be enabled)
```yaml
# Category classifier (required for reasoning to trigger)
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"

# vLLM endpoints that host your models
vllm_endpoints:
  - name: "endpoint1"
    address: "127.0.0.1"
    port: 8000
    models: ["deepseek-v31", "qwen3-30b", "openai/gpt-oss-20b"]
    weight: 1

# Reasoning family configurations (how to express reasoning for a family)
reasoning_families:
  deepseek:
    type: "chat_template_kwargs"
    parameter: "thinking"
  qwen3:
    type: "chat_template_kwargs"
    parameter: "enable_thinking"
  gpt-oss:
    type: "reasoning_effort"
    parameter: "reasoning_effort"
  gpt:
    type: "reasoning_effort"
    parameter: "reasoning_effort"

# Default effort used when a category doesn’t specify one
default_reasoning_effort: medium  # low | medium | high

# Map concrete model names to a reasoning family
model_config:
  "deepseek-v31":
    reasoning_family: "deepseek"
    preferred_endpoints: ["endpoint1"]
  "qwen3-30b":
    reasoning_family: "qwen3"
    preferred_endpoints: ["endpoint1"]
  "openai/gpt-oss-20b":
    reasoning_family: "gpt-oss"
    preferred_endpoints: ["endpoint1"]

# Categories: which kinds of queries require reasoning and at what effort
categories:
- name: math
  use_reasoning: true
  reasoning_effort: high  # overrides default_reasoning_effort
  reasoning_description: "Mathematical problems require step-by-step reasoning"
  model_scores:
  - model: openai/gpt-oss-20b
    score: 1.0
  - model: deepseek-v31
    score: 0.8
  - model: qwen3-30b
    score: 0.8

# A safe default when no category is confidently selected
default_model: qwen3-30b
```
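Before starting the router, a quick parse check can save a restart cycle. This one-liner is only a convenience sketch and assumes a Python interpreter with PyYAML installed:

```bash
# Optional: confirm the config is valid YAML before the router loads it
# (assumes Python with PyYAML is available).
python -c "import yaml; yaml.safe_load(open('config/config.yaml')); print('config parses')"
```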
Notes
- Reasoning is controlled by categories.use_reasoning and, optionally, categories.reasoning_effort.
- A model only gets reasoning fields if its model_config entry has a reasoning_family that maps to a reasoning_families entry.
- DeepSeek/Qwen3 (chat_template_kwargs): the router injects chat_template_kwargs only when reasoning is enabled; when reasoning is disabled, no chat_template_kwargs are added.
- GPT/GPT-OSS (reasoning_effort): when reasoning is enabled, the router sets reasoning_effort from the category (falling back to default_reasoning_effort). When reasoning is disabled, the router preserves a reasoning_effort value already present in the request if the model’s family type is reasoning_effort; otherwise the field is absent. (Both behaviors are sketched below.)
- Category descriptions (description and reasoning_description) are informational only today; they do not affect routing or classification.
- Categories must currently come from MMLU-Pro; avoid free-form categories like "general". If you want generic categories, consider opening an issue to map them to MMLU-Pro.
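To make the chat_template_kwargs vs. reasoning_effort distinction concrete, here is a sketch of what the router’s rewrite for the math category roughly amounts to, expressed as equivalent requests sent directly to the backend. The parameter names come from the reasoning_families config above; the exact payload the router forwards may contain additional fields.

```bash
# If routing selects openai/gpt-oss-20b (reasoning_effort family),
# the forwarded body carries the math category's effort:
curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "reasoning_effort": "high",
    "messages": [{"role": "user", "content": "What is the derivative of x^3?"}]
  }' | jq

# If routing selects qwen3-30b (chat_template_kwargs family),
# the reasoning flag is injected via chat_template_kwargs instead:
curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b",
    "chat_template_kwargs": {"enable_thinking": true},
    "messages": [{"role": "user", "content": "What is the derivative of x^3?"}]
  }' | jq
```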

2) Start the router
Option A: Local build + Envoy
- Download classifier models and mappings (required)
  - make download-models
- Build and run the router
  - make build
  - make run-router
- Start Envoy (install func-e once with make prepare-envoy if needed)
  - func-e run --config-path config/envoy.yaml --component-log-level "ext_proc:trace,router:trace,http:trace"

Option B: Docker Compose
- docker compose up -d
  - Exposes Envoy at http://localhost:8801 (proxying /v1/* to backends via the router)

Note: Ensure your OpenAI-compatible backend is running and reachable (e.g., http://127.0.0.1:8000) so that vllm_endpoints address:port matches a live server. Without a running backend, routing will fail at the Envoy hop.
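A quick reachability check: the /v1/models listing is part of the OpenAI-compatible API, so any compliant backend should answer it.

```bash
# Should return a JSON model list including the names used in vllm_endpoints
curl -sS http://127.0.0.1:8000/v1/models | jq
```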

3) Send example requests
Math (reasoning should be ON and effort high)
```bash
curl -sS http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a math teacher."},
      {"role": "user",   "content": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"}
    ]
  }' | jq
```

General (reasoning should be OFF)
```bash
curl -sS http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "Who are you?"}
    ]
  }' | jq
```

Verify routing via response headers
The router does not inject routing metadata into the JSON body. Instead, inspect the response headers added by the router:
- X-Selected-Model
- X-Semantic-Destination-Endpoint

Example:
```bash
curl -i http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a math teacher."},
      {"role": "user",   "content": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"}
    ]
  }'
# In the response headers, look for:
#   X-Selected-Model: <your-selected-model>
#   X-Semantic-Destination-Endpoint: <address:port>
```
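To script this check, you can discard the body and print only the router-added headers; this uses plain curl flags, nothing router-specific:

```bash
# -o /dev/null drops the body; -D - writes response headers to stdout
curl -sS -o /dev/null -D - http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2^10?"}]
  }' | grep -i '^x-'
```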

4) Run a comprehensive evaluation
You can benchmark the router against a direct vLLM endpoint across categories using the included script, which runs an MMLU-Pro-based ReasoningBench and produces summaries and plots.

Quick start (router + vLLM):
```bash
SAMPLES_PER_CATEGORY=25 \
CONCURRENT_REQUESTS=4 \
ROUTER_MODELS="auto" \
VLLM_MODELS="openai/gpt-oss-20b" \
./bench/run_bench.sh
```

Router-only benchmark:
```bash
BENCHMARK_ROUTER_ONLY=true \
SAMPLES_PER_CATEGORY=25 \
CONCURRENT_REQUESTS=4 \
ROUTER_MODELS="auto" \
./bench/run_bench.sh
```

Direct invocation (advanced):
```bash
python bench/router_reason_bench.py \
  --run-router \
  --router-endpoint http://localhost:8801/v1 \
  --router-models auto \
  --run-vllm \
  --vllm-endpoint http://localhost:8000/v1 \
  --vllm-models openai/gpt-oss-20b \
  --samples-per-category 25 \
  --concurrent-requests 4 \
  --output-dir results/reasonbench
```
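After a run finishes, the summaries and plots land in the output directory (results/reasonbench for the direct invocation above; the wrapper script may use a different default). Exact filenames depend on the script version, so listing the directory is the easiest way to find them:

```bash
# Inspect whatever the benchmark wrote; filenames vary by script version
ls -R results/reasonbench
```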

Tips
- If your math request doesn’t enable reasoning, confirm that the classifier assigns the "math" category with sufficient confidence (see classifier.category_model.threshold) and that the target model has a reasoning_family.
- For models without a reasoning_family, the router will not inject reasoning fields even when the category requires reasoning (by design, to avoid sending invalid requests).
- You can override the effort per category via categories.reasoning_effort, or set a global default via default_reasoning_effort.
- Ensure your OpenAI-compatible backend is reachable at the configured vllm_endpoints (address:port). If it’s not running, routing will fail even though the router and Envoy are up.