Requirement
Add an OpenTelemetry (OTEL) distributed tracing integration example to illustrate end-to-end observability from client applications through the router to vLLM backends. This will provide comprehensive visibility into request flows, routing decisions, performance bottlenecks, and error propagation across the entire LLM inference pipeline.
Motivation
Currently, semantic-router lacks distributed tracing capabilities, making it difficult to:
- Debug performance issues across the application → semantic-router → vLLM chain
- Monitor routing decisions and their impact on latency/quality
- Correlate errors between different components in the stack
- Optimize model selection based on end-to-end performance data
- Track cache hit/miss patterns in relation to overall request performance
- Measure Time-to-First-Token (TTFT) and completion latencies in context
OpenAI Python Client
from openai import OpenAI
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Auto-instrument for automatic trace header injection.
# The OpenAI v1 client uses httpx under the hood, so instrument it as well;
# requests instrumentation covers any direct requests-based calls.
RequestsInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
OpenAIInstrumentor().instrument()

client = OpenAI(
    base_url="http://semantic-router:8000",
    api_key="EMPTY",  # placeholder; the router handles auth, if any
)
response = client.chat.completions.create(
    model="auto",  # Triggers semantic routing
    messages=[{"role": "user", "content": "What is quantum computing?"}],
)
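For the auto-instrumentation above to actually export spans, the client application also needs a configured tracer provider and exporter. A minimal sketch, assuming an OTLP collector reachable at otel-collector:4317 (the endpoint and service name are placeholders, not part of this proposal):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify this client application in the trace backend (Jaeger, Tempo, etc.)
resource = Resource.create({"service.name": "llm-client-app"})

provider = TracerProvider(resource=resource)
# Batch spans and ship them to the collector over OTLP/gRPC
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

Setting the global provider before calling the instrumentors keeps all client, requests/httpx, and OpenAI spans on the same export pipeline.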
Trace Context Flow
Application Request
↓ (HTTP headers: traceparent, tracestate)
Semantic Router ExtProc
↓ (Extract trace context)
Processing Spans (classification, routing, etc.)
↓ (Inject trace context)
vLLM Backend Request
↓ (HTTP headers: traceparent, tracestate)
vLLM Processing (if OTEL-enabled)
↓
OTLP Collector / Jaeger
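The router-side steps in this flow map directly onto the standard W3C propagation API: extract the incoming traceparent/tracestate headers, open child spans for the processing work, and re-inject the context on the upstream request. A rough Python sketch of that shape (span names, carrier dicts, and the function signature are illustrative only, not existing semantic-router code):

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("semantic-router.extproc")

def process_request(incoming_headers: dict, body: bytes) -> dict:
    # Continue the caller's trace from the traceparent/tracestate headers
    parent_ctx = extract(incoming_headers)

    with tracer.start_as_current_span("extproc.process_request", context=parent_ctx) as span:
        span.set_attribute("request.body_size", len(body))
        # ... classification, cache lookup, PII check, model selection ...

        upstream_headers = dict(incoming_headers)
        # Re-inject the (now deeper) trace context for the vLLM backend hop
        inject(upstream_headers)
        return upstream_headers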
Personas
For Developers
- End-to-end visibility from application to vLLM
- Performance debugging with detailed timing breakdowns
- Error correlation across service boundaries
- Routing decision analysis with context
For Operations
- SLA monitoring with distributed latency tracking
- Capacity planning based on actual usage patterns
- Incident response with complete request traces
- Cost optimization through routing efficiency analysis
For Product Teams
- User experience insights with real performance data
- A/B testing of routing strategies with trace correlation
- Quality metrics tied to specific routing decisions
Example Trace Visualization
Trace: user-query-quantum-computing (2.3s total)
├── app.chat_completion (2.3s)
│ └── HTTP POST /v1/chat/completions (2.2s)
│ ├── extproc.process_request (45ms)
│ │ ├── extproc.handle_request_headers (2ms)
│ │ └── extproc.handle_request_body (43ms)
│ │ ├── classification.classify_intent (15ms) [category=science]
│ │ ├── cache.lookup (3ms) [cache_miss=true]
│ │ ├── security.check_pii (2ms) [pii_detected=false]
│ │ └── routing.select_model (23ms) [selected=llama-3.1-70b]
│ └── vllm.chat_completion (2.1s)
│ ├── vllm.process_request (50ms)
│ ├── vllm.generate_tokens (2.0s) [tokens=156]
│ └── vllm.format_response (5ms)
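Inside the router, the processing spans shown above would be nested child spans with routing metadata attached as span attributes. A hypothetical sketch of how they could be produced (span names and attribute keys mirror the visualization; classify, cache_lookup, and select_model are placeholder helpers, not existing APIs):

from opentelemetry import trace

tracer = trace.get_tracer("semantic-router.extproc")

def handle_request_body(request):
    with tracer.start_as_current_span("extproc.handle_request_body"):
        with tracer.start_as_current_span("classification.classify_intent") as span:
            category = classify(request)          # hypothetical helper
            span.set_attribute("category", category)

        with tracer.start_as_current_span("cache.lookup") as span:
            hit = cache_lookup(request)           # hypothetical helper
            span.set_attribute("cache_miss", not hit)

        with tracer.start_as_current_span("routing.select_model") as span:
            model = select_model(category)        # hypothetical helper
            span.set_attribute("selected", model)
        return model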