59 changes: 36 additions & 23 deletions website/docs/installation/configuration.md
@@ -60,6 +60,17 @@ model_config:
allow_by_default: true
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
preferred_endpoints: ["endpoint1"]
# Example: DeepSeek model with custom name
"ds-v31-custom":
reasoning_family: "deepseek" # Uses DeepSeek reasoning syntax
preferred_endpoints: ["endpoint1"]
# Example: Qwen3 model with custom name
"my-qwen3-model":
reasoning_family: "qwen3" # Uses Qwen3 reasoning syntax
preferred_endpoints: ["endpoint2"]
# Example: Model without reasoning support
"phi4":
preferred_endpoints: ["endpoint1"]

# Classification models
classifier:
@@ -154,24 +165,10 @@ reasoning_families:
# Global default reasoning effort level
default_reasoning_effort: "medium"

# Model configurations - assign reasoning families to specific models
model_config:
# Example: DeepSeek model with custom name
"ds-v31-custom":
reasoning_family: "deepseek" # This model uses DeepSeek reasoning syntax
preferred_endpoints: ["endpoint1"]

# Example: Qwen3 model with custom name
"my-qwen3-model":
reasoning_family: "qwen3" # This model uses Qwen3 reasoning syntax
preferred_endpoints: ["endpoint2"]

# Example: Model without reasoning support
"phi4":
# No reasoning_family field - this model doesn't support reasoning mode
preferred_endpoints: ["endpoint1"]
```

Assign reasoning families inside the same `model_config` block above by setting `reasoning_family` per model (see `ds-v31-custom` and `my-qwen3-model` in the example). Models without reasoning support simply omit the field (e.g., `phi4`).

## Configuration Recipes (presets)

We provide curated, versioned presets you can use directly or as a starting point:
@@ -205,6 +202,23 @@ vllm_endpoints:
model_config:
"llama2-7b": # Model name - must match vLLM --served-model-name
preferred_endpoints: ["my_endpoint"]
"qwen3": # Another model served by the same endpoint
preferred_endpoints: ["my_endpoint"]
```

### Example: Llama / Qwen Backend Configuration

```yaml
vllm_endpoints:
- name: "local-vllm"
address: "127.0.0.1"
port: 8000

model_config:
"llama2-7b":
preferred_endpoints: ["local-vllm"]
"qwen3":
preferred_endpoints: ["local-vllm"]
```

#### Address Format Requirements
@@ -240,20 +254,19 @@ address: "127.0.0.1:8080" # ❌ Use separate 'port' field
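
The hunk context above shows the rejected combined form; as a minimal sketch, the accepted layout splits the IP and the port into separate fields (the endpoint name here is illustrative):

```yaml
vllm_endpoints:
  - name: "my_endpoint"     # illustrative name
    address: "127.0.0.1"    # ✅ IP address only
    port: 8080              # ✅ port goes in its own field
```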

#### Model Name Consistency

The model names in the `models` array must **exactly match** the `--served-model-name` parameter used when starting your vLLM server:
Model names in `model_config` must **exactly match** the `--served-model-name` parameter used when starting your vLLM server:

```bash
# vLLM server command:
vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b
# vLLM server command (examples):
vllm serve meta-llama/Llama-2-7b-hf --served-model-name llama2-7b --port 8000
vllm serve Qwen/Qwen3-1.8B --served-model-name qwen3 --port 8000

# config.yaml must reference the model in model_config:
model_config:
"llama2-7b": # ✅ Matches --served-model-name
preferred_endpoints: ["your-endpoint"]

vllm_endpoints:
"llama2-7b": # ✅ Matches --served-model-name
# ... configuration
"qwen3": # ✅ Matches --served-model-name
preferred_endpoints: ["your-endpoint"]
```
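
If you are unsure which name the server is exposing, the OpenAI-compatible `/v1/models` endpoint reports it; a quick check, with the address and port assumed from the examples above:

```bash
# List the model ids the vLLM server actually serves
curl -s http://127.0.0.1:8000/v1/models
# Each "id" in the response is the key to use under model_config
```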

### Model Settings
95 changes: 75 additions & 20 deletions website/docs/installation/installation.md
@@ -14,10 +14,10 @@ No GPU required - the router runs efficiently on CPU using optimized BERT models

Semantic Router depends on the following software:

- **Go**: V1.24.1 or higher (matches the module requirements)
- **Rust**: V1.90.0 or higher (for Candle bindings)
- **Python**: V3.8 or higher (for model downloads)
- **HuggingFace CLI**: Required for fetching models (`pip install huggingface_hub`)
- **Go**: v1.24.1 or higher (matches the module requirements)
- **Rust**: v1.90.0 or higher (for Candle bindings)
- **Python**: v3.8 or higher (for model downloads)
- **HuggingFace CLI**: Required for fetching models
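
If the `huggingface-cli` command is not already on your PATH, one common way to get it is via pip (a sketch; use whichever package manager fits your environment):

```bash
# Installs the Hugging Face Hub client library, which provides the huggingface-cli command
pip install huggingface_hub
```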

## Local Installation

@@ -102,7 +102,7 @@ This downloads the CPU-optimized BERT models for:

### 5. Configure Backend Endpoints

Edit `config/config.yaml` to point to your LLM endpoints:
Edit `config/config.yaml` to point to your vLLM or OpenAI-compatible backend:

```yaml
# Example: Configure your vLLM or Ollama endpoints
@@ -118,6 +118,8 @@ model_config:
allow_by_default: false # Deny all PII by default
pii_types_allowed: ["EMAIL_ADDRESS", "PERSON", "GPE", "PHONE_NUMBER"] # Only allow these specific PII types
preferred_endpoints: ["your-endpoint"]

default_model: "your-model-name"
```

:::note[**Important: Address Format Requirements**]
Expand All @@ -138,26 +140,57 @@ The `address` field **must** contain a valid IP address (IPv4 or IPv6). Domain n
:::

:::note[**Important: Model Name Consistency**]
The model name in your configuration **must exactly match** the `--served-model-name` parameter used when starting your vLLM server:
The model name in `model_config` must **exactly match** the `--served-model-name` used when starting vLLM. If they don't match, the router won't route requests to your model.

If `--served-model-name` is not set, you can also use the default `id` returned by `/v1/models` (e.g., `Qwen/Qwen3-1.8B`) as the key in `model_config` and for `default_model`.
:::
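
To see which `id` the backend reports, query `/v1/models` on the serving port (the address and port below are assumptions that match the examples that follow):

```bash
# Check the served model name before editing model_config and default_model
curl -s http://127.0.0.1:8000/v1/models
```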

#### Example: Llama Model

```bash
# When starting vLLM server:
vllm serve microsoft/phi-4 --port 11434 --served-model-name your-model-name
# Start vLLM with Llama
vllm serve meta-llama/Llama-2-7b-hf --port 8000 --served-model-name llama2-7b
```

```yaml
# config.yaml
vllm_endpoints:
- name: "llama-endpoint"
address: "127.0.0.1"
port: 8000
weight: 1

# The config.yaml must reference the model in model_config:
model_config:
"your-model-name": # ✅ Must match --served-model-name
preferred_endpoints: ["your-endpoint"]
"llama2-7b": # Must match --served-model-name
preferred_endpoints: ["llama-endpoint"]

vllm_endpoints:
"your-model-name": # ✅ Must match --served-model-name
# ... configuration
default_model: "llama2-7b"
```

If these names don't match, the router won't be able to route requests to your model.
#### Example: Qwen Model

The default configuration includes example endpoints that you should update for your setup.
:::
```bash
# Start vLLM with Qwen
vllm serve Qwen/Qwen3-1.8B --port 8000 --served-model-name qwen3
```

```yaml
# config.yaml
vllm_endpoints:
- name: "qwen-endpoint"
address: "127.0.0.1"
port: 8000
weight: 1

model_config:
"qwen3": # Must match --served-model-name
reasoning_family: "qwen3" # Enable Qwen3 reasoning syntax
preferred_endpoints: ["qwen-endpoint"]

default_model: "qwen3"
```

For more configuration options, see the [Configuration Guide](configuration.md).

## Running the Router

@@ -192,10 +225,32 @@ curl -X POST http://localhost:8801/v1/chat/completions \
}'
```

:::tip[VSR Decision Tracking]
The router automatically adds response headers (`x-vsr-selected-category`, `x-vsr-selected-reasoning`, `x-vsr-selected-model`) to help you understand how requests are being processed. Use `curl -i` to see these headers in action. See [VSR Headers Documentation](../troubleshooting/vsr-headers.md) for details.
Using `"model": "MoM"` (Mixture of Models) lets the router automatically select the best model based on the query category.

:::tip[VSR Decision Headers]
Use `curl -i` to see routing decision headers (`x-vsr-selected-category`, `x-vsr-selected-model`). See [VSR Headers](../troubleshooting/vsr-headers.md) for details.
:::
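
A header-inspecting request might look like the sketch below; the header values in the trailing comment are illustrative, not guaranteed output:

```bash
curl -i -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is the derivative of x^2?"}]
  }'
# Illustrative response headers:
#   x-vsr-selected-category: math
#   x-vsr-selected-model: qwen3
```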

### 3. Monitoring (Optional)

By default, the router exposes Prometheus metrics at `:9190/metrics`. To disable monitoring:

**Option A: CLI flag**

```bash
./bin/router -metrics-port=0
```

**Option B: Configuration**

```yaml
observability:
metrics:
enabled: false
```

When disabled, the `/metrics` endpoint won't start, but all other functionality remains unaffected.
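
When metrics are left enabled, a quick way to confirm the endpoint is up (using the default port noted above):

```bash
# Spot-check the Prometheus metrics endpoint
curl -s http://localhost:9190/metrics | head -n 20
```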

## Next Steps

After successful installation:
Expand All @@ -206,7 +261,7 @@ After successful installation:

## Getting Help

- **Issues**: Report bugs on [GitHub Issues](https://github.com/your-org/semantic-router/issues)
- **Issues**: Report bugs on [GitHub Issues](https://github.com/vllm-project/semantic-router/issues)
- **Documentation**: Full documentation at [Read the Docs](https://vllm-semantic-router.com/)

You now have a working Semantic Router that runs entirely on CPU and intelligently routes requests to specialized models!