diff --git a/website/docs/installation/configuration.md b/website/docs/installation/configuration.md
index d39b6dfb6..9537939a9 100644
--- a/website/docs/installation/configuration.md
+++ b/website/docs/installation/configuration.md
@@ -60,6 +60,17 @@ model_config:
     allow_by_default: true
     pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
     preferred_endpoints: ["endpoint1"]
+  # Example: DeepSeek model with custom name
+  "ds-v31-custom":
+    reasoning_family: "deepseek" # Uses DeepSeek reasoning syntax
+    preferred_endpoints: ["endpoint1"]
+  # Example: Qwen3 model with custom name
+  "my-qwen3-model":
+    reasoning_family: "qwen3" # Uses Qwen3 reasoning syntax
+    preferred_endpoints: ["endpoint2"]
+  # Example: Model without reasoning support
+  "phi4":
+    preferred_endpoints: ["endpoint1"]
 
 # Classification models
 classifier:
@@ -154,24 +165,10 @@ reasoning_families:
 # Global default reasoning effort level
 default_reasoning_effort: "medium"
 
-# Model configurations - assign reasoning families to specific models
-model_config:
-  # Example: DeepSeek model with custom name
-  "ds-v31-custom":
-    reasoning_family: "deepseek" # This model uses DeepSeek reasoning syntax
-    preferred_endpoints: ["endpoint1"]
-
-  # Example: Qwen3 model with custom name
-  "my-qwen3-model":
-    reasoning_family: "qwen3" # This model uses Qwen3 reasoning syntax
-    preferred_endpoints: ["endpoint2"]
-
-  # Example: Model without reasoning support
-  "phi4":
-    # No reasoning_family field - this model doesn't support reasoning mode
-    preferred_endpoints: ["endpoint1"]
 ```
 
+Assign reasoning families inside the same `model_config` block above by setting `reasoning_family` per model (see `ds-v31-custom` and `my-qwen3-model` in the example). Models without reasoning support simply omit the field (e.g., `phi4`).
+
 ## Configuration Recipes (presets)
 
 We provide curated, versioned presets you can use directly or as a starting point:
@@ -205,6 +202,23 @@ vllm_endpoints:
 model_config:
   "llama2-7b": # Model name - must match vLLM --served-model-name
     preferred_endpoints: ["my_endpoint"]
+  "qwen3": # Another model served by the same endpoint
+    preferred_endpoints: ["my_endpoint"]
+```
+
+### Example: Llama / Qwen Backend Configuration
+
+```yaml
+vllm_endpoints:
+  - name: "local-vllm"
+    address: "127.0.0.1"
+    port: 8000
+
+model_config:
+  "llama2-7b":
+    preferred_endpoints: ["local-vllm"]
+  "qwen3":
+    preferred_endpoints: ["local-vllm"]
 ```
 
 #### Address Format Requirements
@@ -240,20 +254,19 @@ address: "127.0.0.1:8080" # ❌ Use separate 'port' field
 
 #### Model Name Consistency
 
-The model names in the `models` array must **exactly match** the `--served-model-name` parameter used when starting your vLLM server:
+Model names in `model_config` must **exactly match** the `--served-model-name` parameter used when starting your vLLM server:
 
 ```bash
-# vLLM server command:
-vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b
+# vLLM server command (examples):
+vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b --port 8000
+vllm serve Qwen/Qwen3-1.8B --served-model-name qwen3 --port 8000
 
 # config.yaml must reference the model in model_config:
 model_config:
   "llama2-7b": # ✅ Matches --served-model-name
     preferred_endpoints: ["your-endpoint"]
-
-vllm_endpoints:
-  "llama2-7b": # ✅ Matches --served-model-name
-    # ... configuration
+  "qwen3": # ✅ Matches --served-model-name
+    preferred_endpoints: ["your-endpoint"]
 ```
 
 ### Model Settings
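Before wiring model names into `model_config`, it helps to confirm the name the backend actually serves. A minimal check, assuming a vLLM (or other OpenAI-compatible) server listening on `127.0.0.1:8000` as in the examples above:

```bash
# List the models the backend reports; each "id" must match the key used in
# model_config (and --served-model-name, when that flag is set).
curl -s http://127.0.0.1:8000/v1/models | python3 -m json.tool
```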
diff --git a/website/docs/installation/installation.md b/website/docs/installation/installation.md
index 2f4e151a2..cd4fd5d25 100644
--- a/website/docs/installation/installation.md
+++ b/website/docs/installation/installation.md
@@ -14,10 +14,10 @@ No GPU required - the router runs efficiently on CPU using optimized BERT models
 
 Semantic Router depends on the following software:
 
-- **Go**: V1.24.1 or higher (matches the module requirements)
-- **Rust**: V1.90.0 or higher (for Candle bindings)
-- **Python**: V3.8 or higher (for model downloads)
-- **HuggingFace CLI**: Required for fetching models (`pip install huggingface_hub`)
+- **Go**: v1.24.1 or higher (matches the module requirements)
+- **Rust**: v1.90.0 or higher (for Candle bindings)
+- **Python**: v3.8 or higher (for model downloads)
+- **HuggingFace CLI**: Required for fetching models
 
 ## Local Installation
 
@@ -102,7 +102,7 @@ This downloads the CPU-optimized BERT models for:
 
 ### 5. Configure Backend Endpoints
 
-Edit `config/config.yaml` to point to your LLM endpoints:
+Edit `config/config.yaml` to point to your vLLM or OpenAI-compatible backend:
 
 ```yaml
 # Example: Configure your vLLM or Ollama endpoints
@@ -118,6 +118,8 @@ model_config:
     allow_by_default: false # Deny all PII by default
     pii_types_allowed: ["EMAIL_ADDRESS", "PERSON", "GPE", "PHONE_NUMBER"] # Only allow these specific PII types
     preferred_endpoints: ["your-endpoint"]
+
+default_model: "your-model-name"
 ```
 
 :::note[**Important: Address Format Requirements**]
@@ -138,26 +140,57 @@ The `address` field **must** contain a valid IP address (IPv4 or IPv6). Domain n
 :::
 
 :::note[**Important: Model Name Consistency**]
-The model name in your configuration **must exactly match** the `--served-model-name` parameter used when starting your vLLM server:
+The model name in `model_config` must **exactly match** the `--served-model-name` used when starting vLLM. If they don't match, the router won't route requests to your model.
+
+If `--served-model-name` is not set, you can also use the default `id` returned by `/v1/models` (e.g., `Qwen/Qwen3-1.8B`) as the key in `model_config` and for `default_model`.
+:::
+
+#### Example: Llama Model
 
 ```bash
-# When starting vLLM server:
-vllm serve microsoft/phi-4 --port 11434 --served-model-name your-model-name
+# Start vLLM with Llama
+vllm serve meta-llama/Llama-2-7b-hf --port 8000 --served-model-name llama2-7b
+```
+
+```yaml
+# config.yaml
+vllm_endpoints:
+  - name: "llama-endpoint"
+    address: "127.0.0.1"
+    port: 8000
+    weight: 1
 
-# The config.yaml must reference the model in model_config:
 model_config:
-  "your-model-name": # ✅ Must match --served-model-name
-    preferred_endpoints: ["your-endpoint"]
+  "llama2-7b": # Must match --served-model-name
+    preferred_endpoints: ["llama-endpoint"]
 
-vllm_endpoints:
-  "your-model-name": # ✅ Must match --served-model-name
-    # ... configuration
+default_model: "llama2-7b"
 ```
 
-If these names don't match, the router won't be able to route requests to your model.
+#### Example: Qwen Model
 
-The default configuration includes example endpoints that you should update for your setup.
-:::
+```bash
+# Start vLLM with Qwen
+vllm serve Qwen/Qwen3-1.8B --port 8000 --served-model-name qwen3
+```
+
+```yaml
+# config.yaml
+vllm_endpoints:
+  - name: "qwen-endpoint"
+    address: "127.0.0.1"
+    port: 8000
+    weight: 1
+
+model_config:
+  "qwen3": # Must match --served-model-name
+    reasoning_family: "qwen3" # Enable Qwen3 reasoning syntax
+    preferred_endpoints: ["qwen-endpoint"]
+
+default_model: "qwen3"
+```
+
+For more configuration options, see the [Configuration Guide](configuration.md).
 
 ## Running the Router
 
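The Llama and Qwen examples above each assume a single backend. If you run both models at once, the two endpoints can live in one config; a minimal sketch, assuming the second vLLM instance listens on port 8001 (endpoint names and the choice of `default_model` are illustrative):

```yaml
# config.yaml - two vLLM instances behind one router (illustrative sketch)
vllm_endpoints:
  - name: "llama-endpoint"
    address: "127.0.0.1"
    port: 8000
    weight: 1
  - name: "qwen-endpoint"
    address: "127.0.0.1"
    port: 8001 # assumed second instance; adjust to your setup
    weight: 1

model_config:
  "llama2-7b": # must match --served-model-name on port 8000
    preferred_endpoints: ["llama-endpoint"]
  "qwen3": # must match --served-model-name on port 8001
    reasoning_family: "qwen3" # enable Qwen3 reasoning syntax
    preferred_endpoints: ["qwen-endpoint"]

default_model: "qwen3"
```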
@@ -192,10 +225,32 @@ curl -X POST http://localhost:8801/v1/chat/completions \
   }'
 ```
 
-:::tip[VSR Decision Tracking]
-The router automatically adds response headers (`x-vsr-selected-category`, `x-vsr-selected-reasoning`, `x-vsr-selected-model`) to help you understand how requests are being processed. Use `curl -i` to see these headers in action. See [VSR Headers Documentation](../troubleshooting/vsr-headers.md) for details.
+Using `"model": "MoM"` (Mixture of Models) lets the router automatically select the best model based on the query category.
+
+:::tip[VSR Decision Headers]
+Use `curl -i` to see routing decision headers (`x-vsr-selected-category`, `x-vsr-selected-model`). See [VSR Headers](../troubleshooting/vsr-headers.md) for details.
 :::
 
+### 3. Monitoring (Optional)
+
+By default, the router exposes Prometheus metrics at `:9190/metrics`. To disable monitoring:
+
+**Option A: CLI flag**
+
+```bash
+./bin/router -metrics-port=0
+```
+
+**Option B: Configuration**
+
+```yaml
+observability:
+  metrics:
+    enabled: false
+```
+
+When disabled, the `/metrics` endpoint won't start, but all other functionality remains unaffected.
+
 ## Next Steps
 
 After successful installation:
@@ -206,7 +261,7 @@ After successful installation:
 
 ## Getting Help
 
-- **Issues**: Report bugs on [GitHub Issues](https://github.com/your-org/semantic-router/issues)
+- **Issues**: Report bugs on [GitHub Issues](https://github.com/vllm-project/semantic-router/issues)
 - **Documentation**: Full documentation at [Read the Docs](https://vllm-semantic-router.com/)
 
 You now have a working Semantic Router that runs entirely on CPU and intelligently routes requests to specialized models!
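Once the router is running, a quick smoke test ties the steps above together; a sketch that assumes the default ports from this guide (router on `8801`, metrics on `9190`):

```bash
# Send a request through the router and inspect the routing decision headers
# (x-vsr-selected-category, x-vsr-selected-model).
curl -i -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "What is 2+2?"}]}'

# Confirm Prometheus metrics are exposed (skip this if monitoring was disabled).
curl -s http://localhost:9190/metrics | head
```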