59 changes: 36 additions & 23 deletions website/docs/installation/configuration.md
@@ -60,6 +60,17 @@ model_config:
allow_by_default: true
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
preferred_endpoints: ["endpoint1"]
# Example: DeepSeek model with custom name
"ds-v31-custom":
reasoning_family: "deepseek" # Uses DeepSeek reasoning syntax
preferred_endpoints: ["endpoint1"]
# Example: Qwen3 model with custom name
"my-qwen3-model":
reasoning_family: "qwen3" # Uses Qwen3 reasoning syntax
preferred_endpoints: ["endpoint2"]
# Example: Model without reasoning support
"phi4":
preferred_endpoints: ["endpoint1"]

# Classification models
classifier:
@@ -154,24 +165,10 @@ reasoning_families:
# Global default reasoning effort level
default_reasoning_effort: "medium"

# Model configurations - assign reasoning families to specific models
model_config:
# Example: DeepSeek model with custom name
"ds-v31-custom":
reasoning_family: "deepseek" # This model uses DeepSeek reasoning syntax
preferred_endpoints: ["endpoint1"]

# Example: Qwen3 model with custom name
"my-qwen3-model":
reasoning_family: "qwen3" # This model uses Qwen3 reasoning syntax
preferred_endpoints: ["endpoint2"]

# Example: Model without reasoning support
"phi4":
# No reasoning_family field - this model doesn't support reasoning mode
preferred_endpoints: ["endpoint1"]
```

Assign reasoning families inside the same `model_config` block above by setting `reasoning_family` per model (see `ds-v31-custom` and `my-qwen3-model` in the example). Models without reasoning support simply omit the field (e.g., `phi4`).

## Configuration Recipes (presets)

We provide curated, versioned presets you can use directly or as a starting point:
@@ -205,6 +202,23 @@ vllm_endpoints:
model_config:
"llama2-7b": # Model name - must match vLLM --served-model-name
preferred_endpoints: ["my_endpoint"]
"qwen3": # Another model served by the same endpoint
preferred_endpoints: ["my_endpoint"]
```

### Example: Llama / Qwen Backend Configuration

```yaml
vllm_endpoints:
- name: "local-vllm"
address: "127.0.0.1"
port: 8000

model_config:
"llama2-7b":
preferred_endpoints: ["local-vllm"]
"qwen3":
preferred_endpoints: ["local-vllm"]
```

#### Address Format Requirements
@@ -240,20 +254,19 @@ address: "127.0.0.1:8080" # ❌ Use separate 'port' field
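
The hunk context above shows the rejected combined form; as a minimal sketch, the accepted layout splits the IP and the port into separate fields (the endpoint name here is illustrative):

```yaml
vllm_endpoints:
  - name: "my_endpoint"     # illustrative name
    address: "127.0.0.1"    # ✅ IP address only
    port: 8080              # ✅ port goes in its own field
```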

#### Model Name Consistency

The model names in the `models` array must **exactly match** the `--served-model-name` parameter used when starting your vLLM server:
Model names in `model_config` must **exactly match** the `--served-model-name` parameter used when starting your vLLM server:

```bash
# vLLM server command:
vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b
# vLLM server command (examples):
vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b --port 8000
vllm serve Qwen/Qwen3-1.8B --served-model-name qwen3 --port 8000

# config.yaml must reference the model in model_config:
model_config:
"llama2-7b": # ✅ Matches --served-model-name
preferred_endpoints: ["your-endpoint"]

vllm_endpoints:
"llama2-7b": # ✅ Matches --served-model-name
# ... configuration
"qwen3": # ✅ Matches --served-model-name
preferred_endpoints: ["your-endpoint"]
```
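
If you are unsure which name the server is exposing, the OpenAI-compatible `/v1/models` endpoint reports it; a quick check, with the address and port assumed from the examples above:

```bash
# List the model ids the vLLM server actually serves
curl -s http://127.0.0.1:8000/v1/models
# Each "id" in the response is the key to use under model_config
```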

### Model Settings
95 changes: 75 additions & 20 deletions website/docs/installation/installation.md
@@ -14,10 +14,10 @@ No GPU required - the router runs efficiently on CPU using optimized BERT models

Semantic Router depends on the following software:

- **Go**: V1.24.1 or higher (matches the module requirements)
- **Rust**: V1.90.0 or higher (for Candle bindings)
- **Python**: V3.8 or higher (for model downloads)
- **HuggingFace CLI**: Required for fetching models (`pip install huggingface_hub`)
- **Go**: v1.24.1 or higher (matches the module requirements)
- **Rust**: v1.90.0 or higher (for Candle bindings)
- **Python**: v3.8 or higher (for model downloads)
- **HuggingFace CLI**: Required for fetching models
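
If the `huggingface-cli` command is not already on your PATH, one common way to get it is via pip (a sketch; use whichever package manager fits your environment):

```bash
# Installs the Hugging Face Hub client library, which provides the huggingface-cli command
pip install huggingface_hub
```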

## Local Installation

@@ -102,7 +102,7 @@ This downloads the CPU-optimized BERT models for:

### 5. Configure Backend Endpoints

Edit `config/config.yaml` to point to your LLM endpoints:
Edit `config/config.yaml` to point to your vLLM or OpenAI-compatible backend:

```yaml
# Example: Configure your vLLM or Ollama endpoints
@@ -118,6 +118,8 @@ model_config:
allow_by_default: false # Deny all PII by default
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON", "GPE", "PHONE_NUMBER"] # Only allow these specific PII types
preferred_endpoints: ["your-endpoint"]

default_model: "your-model-name"
```

:::note[**Important: Address Format Requirements**]
Expand All @@ -138,26 +140,57 @@ The `address` field **must** contain a valid IP address (IPv4 or IPv6). Domain n
:::

:::note[**Important: Model Name Consistency**]
The model name in your configuration **must exactly match** the `--served-model-name` parameter used when starting your vLLM server:
The model name in `model_config` must **exactly match** the `--served-model-name` used when starting vLLM. If they don't match, the router won't route requests to your model.

If `--served-model-name` is not set, you can also use the default `id` returned by `/v1/models` (e.g., `Qwen/Qwen3-1.8B`) as the key in `model_config` and for `default_model`.
:::
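
To see which `id` the backend reports, query `/v1/models` on the serving port (the address and port below are assumptions that match the examples that follow):

```bash
# Check the served model name before editing model_config and default_model
curl -s http://127.0.0.1:8000/v1/models
```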

#### Example: Llama Model

```bash
# When starting vLLM server:
vllm serve microsoft/phi-4 --port 11434 --served-model-name your-model-name
# Start vLLM with Llama
vllm serve meta-llama/Llama-2-7b-hf --port 8000 --served-model-name llama2-7b
```

```yaml
# config.yaml
vllm_endpoints:
- name: "llama-endpoint"
address: "127.0.0.1"
port: 8000
weight: 1

# The config.yaml must reference the model in model_config:
model_config:
"your-model-name": # ✅ Must match --served-model-name
preferred_endpoints: ["your-endpoint"]
"llama2-7b": # Must match --served-model-name
preferred_endpoints: ["llama-endpoint"]

vllm_endpoints:
"your-model-name": # ✅ Must match --served-model-name
# ... configuration
default_model: "llama2-7b"
```

If these names don't match, the router won't be able to route requests to your model.
#### Example: Qwen Model

The default configuration includes example endpoints that you should update for your setup.
:::
```bash
# Start vLLM with Qwen
vllm serve Qwen/Qwen3-1.8B --port 8000 --served-model-name qwen3
```

```yaml
# config.yaml
vllm_endpoints:
- name: "qwen-endpoint"
address: "127.0.0.1"
port: 8000
weight: 1

model_config:
"qwen3": # Must match --served-model-name
reasoning_family: "qwen3" # Enable Qwen3 reasoning syntax
preferred_endpoints: ["qwen-endpoint"]

default_model: "qwen3"
```

For more configuration options, see the [Configuration Guide](configuration.md).

## Running the Router

@@ -192,10 +225,32 @@ curl -X POST http://localhost:8801/v1/chat/completions \
}'
```

:::tip[VSR Decision Tracking]
The router automatically adds response headers (`x-vsr-selected-category`, `x-vsr-selected-reasoning`, `x-vsr-selected-model`) to help you understand how requests are being processed. Use `curl -i` to see these headers in action. See [VSR Headers Documentation](../troubleshooting/vsr-headers.md) for details.
Using `"model": "MoM"` (Mixture of Models) lets the router automatically select the best model based on the query category.

:::tip[VSR Decision Headers]
Use `curl -i` to see routing decision headers (`x-vsr-selected-category`, `x-vsr-selected-model`). See [VSR Headers](../troubleshooting/vsr-headers.md) for details.
:::
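
A header-inspecting request might look like the sketch below; the header values in the trailing comment are illustrative, not guaranteed output:

```bash
curl -i -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is the derivative of x^2?"}]
  }'
# Illustrative response headers:
#   x-vsr-selected-category: math
#   x-vsr-selected-model: qwen3
```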

### 3. Monitoring (Optional)

By default, the router exposes Prometheus metrics at `:9190/metrics`. To disable monitoring:

**Option A: CLI flag**

```bash
./bin/router -metrics-port=0
```

**Option B: Configuration**

```yaml
observability:
metrics:
enabled: false
```

When disabled, the `/metrics` endpoint won't start, but all other functionality remains unaffected.
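
When metrics are left enabled, a quick way to confirm the endpoint is up (using the default port noted above):

```bash
# Spot-check the Prometheus metrics endpoint
curl -s http://localhost:9190/metrics | head -n 20
```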

## Next Steps

After successful installation:
Expand All @@ -206,7 +261,7 @@ After successful installation:

## Getting Help

- **Issues**: Report bugs on [GitHub Issues](https://github.com/your-org/semantic-router/issues)
- **Issues**: Report bugs on [GitHub Issues](https://github.com/vllm-project/semantic-router/issues)
- **Documentation**: Full documentation at [Read the Docs](https://vllm-semantic-router.com/)

You now have a working Semantic Router that runs entirely on CPU and intelligently routes requests to specialized models!