Add OpenTelemetry (OTEL) distributed tracing integration #328

@rootfs

Description

Requirement

Add an OpenTelemetry (OTEL) distributed tracing integration example illustrating end-to-end observability from client applications through the router to vLLM backends. This will provide comprehensive visibility into request flows, routing decisions, performance bottlenecks, and error propagation across the entire LLM inference pipeline.

Motivation

Currently, semantic-router lacks distributed tracing capabilities, making it difficult to:

  • Debug performance issues across the application → semantic-router → vLLM chain
  • Monitor routing decisions and their impact on latency/quality
  • Correlate errors between different components in the stack
  • Optimize model selection based on end-to-end performance data
  • Track cache hit/miss patterns in relation to overall request performance
  • Measure Time-to-First-Token (TTFT) and completion latencies in context

OpenAI Python Client

from openai import OpenAI
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Auto-instrument for automatic trace header (traceparent/tracestate) injection.
# The openai package (v1+) sends its requests over httpx, so httpx is the HTTP
# client that must be instrumented for header propagation on OpenAI calls;
# requests instrumentation covers any other outbound requests-based traffic.
RequestsInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
OpenAIInstrumentor().instrument()

client = OpenAI(
    base_url="http://semantic-router:8000/v1",  # router's OpenAI-compatible endpoint
    api_key="unused",  # the client requires a key even if the router ignores it
)
response = client.chat.completions.create(
    model="auto",  # triggers semantic routing
    messages=[{"role": "user", "content": "What is quantum computing?"}],
)

Trace Context Flow

Application Request
    ↓ (HTTP headers: traceparent, tracestate)
Semantic Router ExtProc
    ↓ (Extract trace context)
Processing Spans (classification, routing, etc.)
    ↓ (Inject trace context)
vLLM Backend Request
    ↓ (HTTP headers: traceparent, tracestate)
vLLM Processing (if OTEL-enabled)
    ↓
OTLP Collector / Jaeger
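The hop-by-hop propagation above is the W3C Trace Context convention: each service parses the inbound traceparent header, keeps the trace ID, and re-emits the header with its own span ID as the new parent. A stdlib-only sketch of that mechanic (the function names here are illustrative, not semantic-router APIs; real deployments would use an OTEL propagator):

```python
import secrets

def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: version-trace_id-span_id-flags."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,        # 16 bytes, shared by every hop in the trace
        "parent_span_id": span_id,   # 8 bytes, identifies the caller's span
        "flags": flags,              # e.g. "01" = sampled
    }

def child_traceparent(inbound: str) -> str:
    """Keep the trace ID, mint a new span ID for the outbound hop (e.g. to vLLM)."""
    ctx = parse_traceparent(inbound)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outbound = child_traceparent(inbound)
```

Because the trace ID is preserved at every hop, the collector can stitch the application, router, and vLLM spans into one trace.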

Personas

For Developers

  • End-to-end visibility from application to vLLM
  • Performance debugging with detailed timing breakdowns
  • Error correlation across service boundaries
  • Routing decision analysis with context

For Operations

  • SLA monitoring with distributed latency tracking
  • Capacity planning based on actual usage patterns
  • Incident response with complete request traces
  • Cost optimization through routing efficiency analysis

For Product Teams

  • User experience insights with real performance data
  • A/B testing of routing strategies with trace correlation
  • Quality metrics tied to specific routing decisions

Example Trace Visualization

Trace: user-query-quantum-computing (2.3s total)
├── app.chat_completion (2.3s)
│   └── HTTP POST /v1/chat/completions (2.2s)
│       ├── extproc.process_request (45ms)
│       │   ├── extproc.handle_request_headers (2ms)
│       │   └── extproc.handle_request_body (43ms)
│       │       ├── classification.classify_intent (15ms) [category=science]
│       │       ├── cache.lookup (3ms) [cache_miss=true]
│       │       ├── security.check_pii (2ms) [pii_detected=false]
│       │       └── routing.select_model (23ms) [selected=llama-3.1-70b]
│       └── vllm.chat_completion (2.1s)
│           ├── vllm.process_request (50ms)
│           ├── vllm.generate_tokens (2.0s) [tokens=156]
│           └── vllm.format_response (5ms)
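A breakdown like the one above can be recomputed from exported span data, for example to flag which stage dominates end-to-end latency. A toy sketch, assuming spans are available as (name, start_ms, end_ms) tuples with names taken from the tree above (a real consumer would query the trace backend instead):

```python
# Hypothetical span timings, loosely matching the example trace.
spans = [
    ("extproc.process_request", 0, 45),
    ("classification.classify_intent", 2, 17),
    ("cache.lookup", 17, 20),
    ("routing.select_model", 22, 45),
    ("vllm.generate_tokens", 95, 2095),
]

# Rank stages by duration to find the bottleneck.
durations = sorted(
    ((name, end - start) for name, start, end in spans),
    key=lambda item: item[1],
    reverse=True,
)
bottleneck = durations[0]  # (name, duration_ms) of the slowest stage
```

Here token generation dominates, which is the expected shape for LLM inference: routing overhead (tens of milliseconds) is small relative to generation time (seconds).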
