docs: container connectivity troubleshooting #346
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,190 @@ | ||
--- | ||
title: Container Connectivity Troubleshooting | ||
sidebar_label: Container Connectivity | ||
--- | ||
|
||
This guide summarizes common connectivity issues we hit when running the router with Docker Compose or Kubernetes and how we fixed them. It also covers the “No data” problem in Grafana and how to validate the full metrics chain. | ||
|
||
## 1. Use IPv4 addresses for backend endpoints | ||
|
||
Symptoms | ||
|
||
- Router/Envoy timeouts, 5xx responses, or targets flapping between up and down in Prometheus. `curl` from inside containers/pods fails. | ||
|
||
Root causes | ||
|
||
- Backend bound only to 127.0.0.1 (not reachable from containers/pods). | ||
- Using IPv6 or hostnames that resolve to IPv6 where IPv6 is disabled/blocked. | ||
- Using localhost/127.0.0.1 in the router config, which refers to the container itself, not the host. | ||
|
||
Fixes | ||
|
||
- Ensure backends bind to all interfaces: 0.0.0.0. | ||
- In Docker Compose, configure the router to call the host via a reachable IPv4 address. | ||
- On macOS, host.docker.internal usually works; if not, use the host’s LAN IPv4 address. | ||
- On Linux or custom networks, use the gateway IPv4 address of the Docker network the router is attached to. | ||
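The root causes above can be caught before the router ever starts. Below is a minimal sketch, assuming you pass it only the host part of the backend address (no scheme, no port). `classify_backend_host` is a hypothetical helper name, not part of the router, and the IPv4 check is deliberately loose (it does not validate octet ranges).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the host part of a backend address.
# Prints one of: loopback, ipv6, ipv4, hostname.
# Loose check only: it does not validate octet ranges or resolve DNS.
classify_backend_host() {
  case "$1" in
    localhost|::1) echo "loopback" ;;   # refers to the container itself
    *:*)           echo "ipv6" ;;       # IPv6 literal; may be blocked or disabled
    *[!0-9.]*)     echo "hostname" ;;   # contains non-numeric characters
    127.*.*.*)     echo "loopback" ;;   # 127.0.0.0/8
    *.*.*.*)       echo "ipv4" ;;       # dotted numeric address
    *)             echo "hostname" ;;
  esac
}
```

A result of `loopback` or `ipv6` matches the failure modes listed above. A result of `hostname` still needs a check that the name resolves to an IPv4 address from inside the container.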
|
||
Example: start vLLM on the host | ||
|
||
```bash | ||
# Make vLLM listen on all interfaces | ||
python -m vllm.entrypoints.openai.api_server \ | ||
  --host 0.0.0.0 --port 11434 \ | ||
  --served-model-name phi4 | ||
``` | ||
|
||
Router config example (Docker Compose) | ||
|
||
```yaml | ||
# config/config.yaml (snippet) | ||
llm_backends: | ||
  - name: phi4 | ||
    # Use a reachable IPv4; replace with your host’s IP | ||
    address: http://172.28.0.1:11434 | ||
``` | ||
|
||
Kubernetes recommended pattern: use a Service | ||
|
||
```yaml | ||
apiVersion: v1 | ||
kind: Service | ||
metadata: | ||
  name: my-vllm | ||
spec: | ||
  selector: | ||
    app: my-vllm | ||
  ports: | ||
    - name: http | ||
      port: 8000 | ||
      targetPort: 8000 | ||
``` | ||
|
||
Router config then uses: http://my-vllm.default.svc.cluster.local:8000 | ||
|
||
**Tip**: discover the host gateway from inside a container (mostly Linux) | ||
|
||
```bash | ||
# Inside the container/pod | ||
ip route | awk '/default/ {print $3}' | ||
``` | ||
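The one-liner above can be wrapped so the gateway feeds straight into a backend URL. This is a sketch under the same "mostly Linux" assumption. `default_gateway_from_routes` and `backend_url` are hypothetical helper names, and port 11434 is the example port used in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: read `ip route` output on stdin and print the
# gateway of the first default route, then build a backend URL from it.
default_gateway_from_routes() {
  awk '/^default/ {print $3; exit}'
}

backend_url() {
  # $1 = IPv4 address, $2 = port
  printf 'http://%s:%s\n' "$1" "$2"
}

# Usage inside the container/pod (not executed here):
#   gw="$(ip route | default_gateway_from_routes)"
#   curl -sS "$(backend_url "$gw" 11434)/v1/models"
```

On Docker Desktop the default gateway is usually not the host itself, so treat the result as a candidate address to test rather than a guaranteed answer.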
|
||
## 2. Host firewall blocking container/pod traffic | ||
|
||
Symptoms | ||
|
||
- Host can curl the backend, but containers/pods time out until the firewall is opened. | ||
|
||
Fixes | ||
|
||
- macOS: System Settings → Network → Firewall. Allow incoming connections for the backend process (e.g., Python/uvicorn) or temporarily disable the firewall to test. | ||
- Linux examples: | ||
|
||
```bash | ||
# UFW (Ubuntu/Debian) | ||
sudo ufw allow 11434/tcp | ||
sudo ufw allow 11435/tcp | ||
|
||
# firewalld (RHEL/CentOS/Fedora) | ||
sudo firewall-cmd --add-port=11434/tcp --permanent | ||
sudo firewall-cmd --add-port=11435/tcp --permanent | ||
sudo firewall-cmd --reload | ||
``` | ||
|
||
- Cloud hosts: also open security group/ACL rules. | ||
|
||
Validate from the container/pod: | ||
|
||
```bash | ||
docker compose exec semantic-router curl -sS http://<IPv4>:11434/v1/models | ||
``` | ||
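When the validation command fails, curl's exit code narrows down which of the earlier sections applies. The mapping below is a rough sketch: the exit codes (6, 7, 28) are curl's documented values, but the suggested causes are only likely ones, and `explain_curl_exit` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a curl exit code to the most likely cause.
# Codes 6, 7, and 28 are documented by curl; the hints are heuristics.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: backend reachable" ;;
    6)  echo "dns: could not resolve host; check the hostname or use an IPv4 address" ;;
    7)  echo "connect: failed to connect; check the backend is running and bound to 0.0.0.0" ;;
    28) echo "timeout: no response in time; check host firewall or security group rules" ;;
    *)  echo "other: curl exit code $1; see the curl man page" ;;
  esac
}

# Usage from the host (not executed here); --connect-timeout makes a
# silently dropped connection fail fast with exit code 28:
#   docker compose exec semantic-router \
#     curl -sS --connect-timeout 5 -o /dev/null http://<IPv4>:11434/v1/models
#   explain_curl_exit $?
```

A firewall that rejects rather than drops packets can also produce exit code 7, so a `connect` result does not rule out the firewall.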
|
||
## 3. Docker Compose: publish the router’s ports (not just expose) | ||
|
||
Symptoms | ||
|
||
- Can’t access /metrics or the API from the host. `docker ps` shows no published ports. | ||
|
||
Root cause | ||
|
||
- Using only `expose` keeps ports internal to the Compose network; it doesn’t publish them to the host. | ||
|
||
Fix | ||
|
||
- Map the needed ports with `ports:`. | ||
|
||
Example docker-compose.yml snippet | ||
|
||
```yaml | ||
services: | ||
  semantic-router: | ||
    # ... | ||
    ports: | ||
      - "9190:9190" # Prometheus /metrics | ||
      - "50051:50051" # gRPC/HTTP API (use your actual service port) | ||
``` | ||
|
||
Validate from the host: | ||
|
||
```bash | ||
curl -sS http://localhost:9190/metrics | head -n 5 | ||
``` | ||
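If the curl check fails, it helps to confirm whether the port is published at all. In the Ports column of `docker ps`, a published port typically appears as `0.0.0.0:9190->9190/tcp`, while an exposed-only port appears as `9190/tcp`. The sketch below tests for that pattern; `port_is_published` is a hypothetical helper name, and the container name filter assumes the service is called `semantic-router` as in the snippet above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the Ports string from `docker ps`
# shows the given host port mapped into the container.
# $1 = Ports string, $2 = host port
port_is_published() {
  case "$1" in
    *":$2->"*) return 0 ;;   # e.g. "0.0.0.0:9190->9190/tcp"
    *)         return 1 ;;   # e.g. "9190/tcp" (exposed only)
  esac
}

# Usage on the host (not executed here):
#   ports="$(docker ps --filter name=semantic-router --format '{{.Ports}}')"
#   port_is_published "$ports" 9190 && echo "9190 is published"
```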
|
||
## 4. Grafana dashboard shows “No data” | ||
|
||
Common causes and fixes | ||
|
||
- Metrics not emitted yet | ||
  - Some panels are empty until code paths are hit. Examples: | ||
    - Cost: `llm_model_cost_total{currency="USD"}` grows only when cost is recorded. | ||
    - Refusals: `llm_request_errors_total{reason=~"pii_policy_denied|jailbreak_block"}` grows only when policies block requests. | ||
  - Generate relevant traffic or enable filters/policies to see data. | ||
|
||
- Panel query nuances | ||
  - The classification bar gauge often needs an instant query. | ||
  - Quantiles require histogram buckets: `histogram_quantile` works on `_bucket` series, so quantile panels stay empty until the histogram has recorded observations. | ||
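To tell "not emitted yet" apart from a scrape or query problem, we can check the router's /metrics output directly for a given series. This is a minimal sketch that reads Prometheus text exposition on stdin. `has_metric` is a hypothetical helper name, and `localhost:9190` assumes the port is published as in section 3.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the exposition text on stdin contains
# at least one sample line for the exact metric name in $1.
# "# HELP" / "# TYPE" comment lines do not count as samples.
has_metric() {
  grep -Eq "^$1[{ ]"
}

# Usage on the host (not executed here):
#   curl -sS http://localhost:9190/metrics | has_metric llm_model_cost_total \
#     && echo "emitted" || echo "not emitted yet"
```

If the series is present here but the panel is still empty, the problem is more likely in the Prometheus scrape config or the panel query than in the router.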
|
||
Useful PromQL examples (for Explore) | ||
|
||
```promql | ||
# Category classification (instant) | ||
sum by (category) (llm_category_classifications_count) | ||
|
||
# Cost rate (USD/sec) | ||
sum by (model) (rate(llm_model_cost_total{currency="USD"}[5m])) | ||
|
||
# Refusals per model | ||
sum by (model) (rate(llm_request_errors_total{reason=~"pii_policy_denied|jailbreak_block"}[5m])) | ||
|
||
# Refusal rate percentage | ||
100 * sum by (model) (rate(llm_request_errors_total{reason=~"pii_policy_denied|jailbreak_block"}[5m])) | ||
/ sum by (model) (rate(llm_model_requests_total[5m])) | ||
|
||
# Latency p95 | ||
histogram_quantile(0.95, sum by (le) (rate(llm_model_completion_latency_seconds_bucket[5m]))) | ||
``` | ||
|
||
Prometheus scrape config (verify targets are UP) | ||
|
||
```yaml | ||
scrape_configs: | ||
  - job_name: semantic-router | ||
    static_configs: | ||
      - targets: ["semantic-router:9190"] | ||
|
||
  - job_name: envoy | ||
    metrics_path: /stats/prometheus | ||
    static_configs: | ||
      - targets: ["envoy-proxy:19000"] | ||
``` | ||
|
||
Time range & refresh | ||
|
||
- Select a window that includes your recent traffic (Last 5–15 minutes) and refresh the dashboard after sending test requests. | ||
|
||
## Quick checklist | ||
|
||
- Backends listen on 0.0.0.0; router uses a reachable IPv4 address (or k8s Service DNS that resolves to IPv4). | ||
- Host firewall allows the backend ports; cloud SG/ACL opened if applicable. | ||
- In Docker Compose, router ports are published (e.g., 9190 for /metrics, service port for API). | ||
- Prometheus targets for `semantic-router:9190` and `envoy-proxy:19000` are UP. | ||
- Send traffic that triggers the metrics you expect (cost/refusals) and adjust panel query mode (instant vs. range) where needed. |
can you check if similar kubernetes networking diagnostics help?
https://goteleport.com/blog/troubleshooting-kubernetes-networking/
Thanks for sharing the link! I’ve checked it, and while it might be helpful, it doesn’t fully cover the issues I ran into on my lab servers. This PR is based on those practical troubleshooting experiences. Do you think it would make sense to keep both references, so the docs can complement each other?
Sure, let's keep your inline diagnostics