190 changes: 190 additions & 0 deletions website/docs/troubleshooting/container-connectivity.md
@@ -0,0 +1,190 @@
---
title: Container Connectivity Troubleshooting
sidebar_label: Container Connectivity
---

This guide summarizes common connectivity issues we hit when running the router with Docker Compose or Kubernetes and how we fixed them. It also covers the “No data” problem in Grafana and how to validate the full metrics chain.

## 1. Use IPv4 addresses for backend endpoints
> **Collaborator:** can you check if similar kubernetes networking diagnostics help?
> https://goteleport.com/blog/troubleshooting-kubernetes-networking/

> **Contributor (author):** Thanks for sharing the link! I’ve checked it, and while it might be helpful, it doesn’t fully cover the issues I ran into on my lab servers. This PR is based on those practical troubleshooting experiences. Do you think it would make sense to keep both references, so the docs can complement each other?

> **Collaborator:** Sure, let's keep your inline diagnostics.


**Symptoms**

- Router/Envoy timeouts, 5xx responses, or targets flapping between up and down in Prometheus; `curl` from inside containers/pods fails.

**Root causes**

- Backend bound only to 127.0.0.1 (not reachable from containers/pods).
- Using IPv6 or hostnames that resolve to IPv6 where IPv6 is disabled/blocked.
- Using localhost/127.0.0.1 in the router config, which refers to the container itself, not the host.

**Fixes**

- Ensure backends bind to all interfaces: `0.0.0.0`.
- In Docker Compose, configure the router to call the host via a reachable IPv4 address.
  - On macOS, `host.docker.internal` usually works; if not, use the host’s LAN IPv4 address.
  - On Linux or custom networks, use the Docker host gateway IPv4 for your network (see the Compose sketch after this list).
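
On Linux, a convenient way to get a stable name for the host is Compose’s `host-gateway` mapping. A minimal sketch, assuming Docker Engine 20.10+ (the service name mirrors the examples below):

```yaml
services:
  semantic-router:
    # ...
    extra_hosts:
      # host.docker.internal will then resolve to the Docker host's gateway IPv4
      - "host.docker.internal:host-gateway"
```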

Example: start vLLM on the host

```bash
# Make vLLM listen on all interfaces
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 11434 \
  --served-model-name phi4
```
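
Before involving containers, confirm the bind address on the host. A quick check on Linux (`ss` from iproute2):

```bash
# The local address should be 0.0.0.0:11434 (or *:11434), not 127.0.0.1:11434
ss -ltnp | grep 11434
```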

Router config example (Docker Compose)

```yaml
# config/config.yaml (snippet)
llm_backends:
  - name: phi4
    # Use a reachable IPv4; replace with your host’s IP
    address: http://172.28.0.1:11434
```

Kubernetes recommended pattern: use a Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-vllm
spec:
  selector:
    app: my-vllm
  ports:
    - name: http
      port: 8000
      targetPort: 8000
```

The router config then points at the Service DNS name: `http://my-vllm.default.svc.cluster.local:8000`.
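
A sketch of the matching router config, reusing the `llm_backends` shape from the Compose example above:

```yaml
# config/config.yaml (snippet)
llm_backends:
  - name: phi4
    # The Service DNS name resolves to the Service's IPv4 ClusterIP
    address: http://my-vllm.default.svc.cluster.local:8000
```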

**Tip**: discover the host gateway from inside a container (mostly Linux)

```bash
# Inside the container/pod
ip route | awk '/default/ {print $3}'
```
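
For Kubernetes, a similar spot check can be run from a throwaway pod. A sketch, assuming the `curlimages/curl` image is pullable and the Service from the example above:

```bash
kubectl run net-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sS http://my-vllm.default.svc.cluster.local:8000/v1/models
```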

## 2. Host firewall blocking container/pod traffic

**Symptoms**

- Host can curl the backend, but containers/pods time out until the firewall is opened.

**Fixes**

- macOS: System Settings → Network → Firewall. Allow incoming connections for the backend process (e.g., Python/uvicorn) or temporarily disable the firewall to test.
- Linux examples:

```bash
# UFW (Ubuntu/Debian)
sudo ufw allow 11434/tcp
sudo ufw allow 11435/tcp

# firewalld (RHEL/CentOS/Fedora)
sudo firewall-cmd --add-port=11434/tcp --permanent
sudo firewall-cmd --add-port=11435/tcp --permanent
sudo firewall-cmd --reload
```

- Cloud hosts: also open security group/ACL rules.

Validate from the container/pod:

```bash
docker compose exec semantic-router curl -sS http://<IPv4>:11434/v1/models
```
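
The Kubernetes counterpart, assuming the router runs as a Deployment named `semantic-router` and its image includes `curl` (both assumptions; adjust to your setup):

```bash
kubectl exec -it deploy/semantic-router -- curl -sS http://<IPv4>:11434/v1/models
```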

## 3. Docker Compose: publish the router’s ports (not just expose)

**Symptoms**

- Can’t access `/metrics` or the API from the host; `docker ps` shows no published ports.

**Root cause**

- Using `expose` only keeps ports internal to the Compose network; it doesn’t publish to the host.

**Fix**

- Map the needed ports with `ports:`.

Example docker-compose.yml snippet

```yaml
services:
  semantic-router:
    # ...
    ports:
      - "9190:9190"   # Prometheus /metrics
      - "50051:50051" # gRPC/HTTP API (use your actual service port)
```

Validate from the host:

```bash
curl -sS http://localhost:9190/metrics | head -n 5
```
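
To double-check that the ports were actually published (not just exposed), list the mappings. The container name filter is an assumption; adjust to yours:

```bash
# Expect 0.0.0.0:9190->9190/tcp and 0.0.0.0:50051->50051/tcp in the output
docker ps --filter "name=semantic-router" --format '{{.Names}}: {{.Ports}}'
```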

## 4. Grafana dashboard shows “No data”

Common causes and fixes

- Metrics not emitted yet
  - Some panels are empty until code paths are hit. Examples:
    - Cost: `llm_model_cost_total{currency="USD"}` grows only when cost is recorded.
    - Refusals: `llm_request_errors_total{reason=~"pii_policy_denied|jailbreak_block"}` grows only when policies block requests.
  - Generate relevant traffic or enable filters/policies to see data.

- Panel query nuances
  - The classification bar gauge often needs an instant query.
  - Quantiles require histogram buckets (`histogram_quantile` over `_bucket` series).

Useful PromQL examples (for Explore)

```promql
# Category classification (instant)
sum by (category) (llm_category_classifications_count)

# Cost rate (USD/sec)
sum by (model) (rate(llm_model_cost_total{currency="USD"}[5m]))

# Refusals per model
sum by (model) (rate(llm_request_errors_total{reason=~"pii_policy_denied|jailbreak_block"}[5m]))

# Refusal rate percentage
100 * sum by (model) (rate(llm_request_errors_total{reason=~"pii_policy_denied|jailbreak_block"}[5m]))
  / sum by (model) (rate(llm_model_requests_total[5m]))

# Latency p95
histogram_quantile(0.95, sum by (le) (rate(llm_model_completion_latency_seconds_bucket[5m])))
```

Prometheus scrape config (verify targets are UP)

```yaml
scrape_configs:
  - job_name: semantic-router
    static_configs:
      - targets: ["semantic-router:9190"]

  - job_name: envoy
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ["envoy-proxy:19000"]
```
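
Target health can also be pulled from the Prometheus HTTP API, which is handy in scripts. A sketch, assuming Prometheus is published on `localhost:9090` and `jq` is installed:

```bash
curl -sS http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'
```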

Time range & refresh

- Select a window that includes your recent traffic (Last 5–15 minutes) and refresh the dashboard after sending test requests.

## Quick checklist

- Backends listen on `0.0.0.0`; the router uses a reachable IPv4 address (or a k8s Service DNS name that resolves to IPv4).
- Host firewall allows the backend ports; cloud security group/ACL rules are opened if applicable.
- In Docker Compose, router ports are published (e.g., 9190 for `/metrics`, the service port for the API).
- Prometheus targets for `semantic-router:9190` and `envoy-proxy:19000` are UP.
- Send traffic that triggers the metrics you expect (cost/refusals) and adjust the panel query mode (instant vs. range) where needed.
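
A minimal end-to-end smoke test that ties the checklist together. This is a sketch: the OpenAI-style `/v1/chat/completions` path, the API port 50051, and the model name `phi4` are assumptions carried over from the examples above; substitute your actual values:

```bash
# 1. Send one request through the router (assumed OpenAI-compatible HTTP endpoint)
curl -sS http://localhost:50051/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "phi4", "messages": [{"role": "user", "content": "ping"}]}' > /dev/null

# 2. Confirm the request counter moved (metric name from the dashboard queries above)
curl -sS http://localhost:9190/metrics | grep llm_model_requests_total
```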
2 changes: 2 additions & 0 deletions website/sidebars.ts
@@ -113,6 +113,8 @@ const sidebars: SidebarsConfig = {
       label: 'Troubleshooting',
       items: [
         'troubleshooting/network-tips',
+        'troubleshooting/container-connectivity',
+        'troubleshooting/vsr-headers',
       ],
     },
   ],