# Reasoning Routing Quickstart

This short guide shows how to enable and verify “reasoning routing” in the Semantic Router, including:

- A comprehensive evaluation command you can run

## Prerequisites

- A running OpenAI-compatible backend for your models (e.g., vLLM or any OpenAI-compatible server). It must be reachable at the addresses you configure under `vllm_endpoints` (`address:port`); a quick check is sketched below.
- Envoy + the router (see the “Start the router” section below)
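
Before wiring in the router, it is worth confirming the backend answers on its own. A quick sanity check (a sketch, assuming the backend listens on 127.0.0.1:8000 as in the examples below):

```bash
# Any OpenAI-compatible server exposes /v1/models; a JSON list of
# model IDs confirms the vllm_endpoints target is live.
curl -s http://127.0.0.1:8000/v1/models
```
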

## 1) Minimal configuration

Put this in `config/config.yaml` (or merge it into your existing config). It defines:

- Categories that require reasoning (e.g., math)
- Reasoning families for model syntax differences (DeepSeek/Qwen3 use `chat_template_kwargs`; GPT-OSS/GPT use `reasoning_effort`)
- Which concrete models use which reasoning family
- The classifier (required for category detection; without it, reasoning will not be enabled)

```yaml
# Category classifier (required for reasoning to trigger)
# ... (the full config continues with categories, reasoning_families,
# model_config, and vllm_endpoints; the notes and the sketch below cover
# the fields that matter for reasoning)
```

Notes:

- Reasoning is controlled by `categories.use_reasoning` and, optionally, `categories.reasoning_effort`.
- A model only gets reasoning fields if it has a `model_config.<MODEL>.reasoning_family` that maps to a `reasoning_families` entry (sketched below).
- DeepSeek/Qwen3 (`chat_template_kwargs`): the router injects `chat_template_kwargs` only when reasoning is enabled. When disabled, no `chat_template_kwargs` are added.
- GPT/GPT-OSS (`reasoning_effort`): when reasoning is enabled, the router sets `reasoning_effort` based on the category (falling back to `default_reasoning_effort`). When reasoning is disabled, if the request already contains `reasoning_effort` and the model’s family type is `reasoning_effort`, the router preserves the original value; otherwise the field is absent.
- For more stable classification, you can add category descriptions in config and keep them semantically distinctive.
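
To make these notes concrete, here is a minimal illustrative sketch of the reasoning-related fields. It is an assumption, not the authoritative schema: the model names, threshold, effort values, and the `parameter` values are placeholders to adapt to your deployment.

```yaml
# Illustrative sketch only; all names and values below are placeholders.
classifier:
  category_model:
    threshold: 0.6                # minimum confidence for a category to apply

categories:
  - name: math
    use_reasoning: true           # math prompts get reasoning fields injected
    reasoning_effort: high        # optional per-category override
  - name: other
    use_reasoning: false

default_reasoning_effort: medium  # fallback when a category sets no effort

reasoning_families:
  deepseek:
    type: chat_template_kwargs    # DeepSeek/Qwen3-style toggle
    parameter: thinking           # kwarg key injected when reasoning is on
  gpt-oss:
    type: reasoning_effort        # GPT/GPT-OSS-style effort field

model_config:
  my-deepseek-model:
    reasoning_family: deepseek    # maps this model to a family above
  my-gpt-oss-model:
    reasoning_family: gpt-oss
```

With a config along these lines, a math prompt routed to `my-deepseek-model` would get `chat_template_kwargs: { thinking: true }` injected, while `my-gpt-oss-model` would get `reasoning_effort: high`.
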

## 2) Start the router

Option A: Local build + Envoy

- Download classifier models and mappings (required)
  - `make download-models`
- Build and run the router
  - `make build`
  - `make run-router`

Option B: Docker Compose

- `docker compose up -d`
- Exposes Envoy at http://localhost:8801 (proxying `/v1/*` to backends via the router)

Note: Ensure your OpenAI-compatible backend is running and reachable (e.g., http://127.0.0.1:8000) so that the `vllm_endpoints` `address:port` matches a live server. Without a running backend, routing will fail at the Envoy hop.
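
Once Envoy and the router are up, you can send a request through the stack and verify reasoning routing end to end. A sketch, assuming the Compose port above; `my-deepseek-model` is a placeholder for a model that exists in your `model_config`:

```bash
# A math prompt should be classified into the "math" category; if the target
# model has a reasoning_family, the router injects the reasoning fields.
curl -s http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-deepseek-model",
        "messages": [{"role": "user", "content": "Evaluate the integral of x^2 from 0 to 1."}]
      }'
```
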

## Troubleshooting

- If your math request doesn’t enable reasoning, confirm the classifier assigns the "math" category with sufficient confidence (see `classifier.category_model.threshold`) and that the target model has a `reasoning_family`.
- For models without a `reasoning_family`, the router will not inject reasoning fields even when the category requires reasoning (this is by design, to avoid invalid requests).
- You can override the effort per category via `categories.reasoning_effort` or set a global default via `default_reasoning_effort`.
- Ensure your OpenAI-compatible backend is reachable at the configured `vllm_endpoints` (`address:port`). If it’s not running, routing will fail even though the router and Envoy are up.
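
To observe the `reasoning_effort` preservation behavior described in the notes above, send a request that already carries the field for a prompt whose category has reasoning disabled (a sketch; `my-gpt-oss-model` is a placeholder and must map to a `reasoning_effort`-type family):

```bash
# For a non-reasoning category, an incoming reasoning_effort should pass
# through unchanged for models whose family type is reasoning_effort.
curl -s http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-gpt-oss-model",
        "reasoning_effort": "low",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```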