@r33drichards
Collaborator

Summary

This PR adds comprehensive OpenTelemetry (OTEL) instrumentation to the CUA platform to monitor the Four Golden Signals (Latency, Traffic, Errors, and Saturation) across all three services: cua-docs, cua-mcp-server, and cua-docs-indexer.

Key Changes

Documentation & Guidance

  • docs/GRAFANA_DASHBOARD_PROMPT.md (new): Complete guide for generating Grafana dashboards with metric naming conventions, dashboard requirements, alert rules, and example PromQL queries for all three services

CUA Docs (Next.js)

  • docs/src/instrumentation.ts (new): Next.js instrumentation entry point that initializes the OTEL SDK on server startup
  • docs/src/lib/otel/instrumentation.ts (new): OTEL SDK configuration with OTLP exporter to otel.cua.ai, resource attributes, and auto-instrumentation setup
  • docs/src/lib/otel/index.ts (new): Four Golden Signals metrics implementation with helpers for recording requests, CopilotKit interactions, tool executions, and saturation metrics
  • docs/src/app/api/copilotkit/route.ts (modified): Instrumented CopilotKit endpoint to track latency, traffic, errors, and concurrent request saturation
  • docs/next.config.mjs (modified): Enabled instrumentationHook and added OTEL packages to serverExternalPackages
  • docs/package.json (modified): Added OTEL dependencies (@opentelemetry/sdk-node, exporters, auto-instrumentations, semantic-conventions)

CUA Docs Indexer (Modal/Python)

  • docs/scripts/modal_app.py (modified):
    • Added OTEL dependencies to the Modal image
    • Implemented init_otel() function to initialize the metrics exporter
    • Created IndexerMetrics class with Four Golden Signals for crawling, indexing, and querying
    • Instrumented all indexing jobs (crawl_docs, generate_vector_db, generate_sqlite_db, index_component) with latency and traffic metrics
    • Instrumented all query tools (query_docs_db, query_docs_vectors, query_code_db, query_code_vectors) with latency, traffic, and error tracking
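
The query-tool instrumentation described above can be sketched roughly as a decorator. This is a dependency-free illustration, not the code in modal_app.py: InMemoryMetrics and instrumented are hypothetical names, plain dicts stand in for the OTEL counters and histograms, and the query_docs_db body is a placeholder.

```python
import functools
import time
from collections import defaultdict


class InMemoryMetrics:
    """Hypothetical stand-in for OTEL instruments; values live in plain dicts."""

    def __init__(self):
        self.calls = defaultdict(int)       # traffic: call count per tool
        self.errors = defaultdict(int)      # errors: count per (tool, error type)
        self.durations = defaultdict(list)  # latency: seconds per tool


def instrumented(metrics, tool_name):
    """Wrap a function so every call records traffic, latency, and errors."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            metrics.calls[tool_name] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Categorize by exception type, then re-raise unchanged.
                metrics.errors[(tool_name, type(exc).__name__)] += 1
                raise
            finally:
                # Runs on success and failure, so failed calls are timed too.
                metrics.durations[tool_name].append(time.perf_counter() - start)

        return wrapper

    return decorator


metrics = InMemoryMetrics()


@instrumented(metrics, "query_docs_db")
def query_docs_db(sql):
    # Placeholder body for illustration only; not the real tool.
    if not sql.strip():
        raise ValueError("empty query")
    return [("row", sql)]
```

Recording the duration in a finally block means slow failures show up in the latency data as well as in the error counts.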

Implementation Details

Metric Naming Convention

All metrics follow the pattern: {service_prefix}.{signal_category}.{metric_name}

  • cua.docs: Next.js documentation service
  • cua.mcp_server: Python MCP server (instrumentation in separate PR)
  • cua.indexer: Modal-based indexing service
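
As an illustration of the naming pattern, a small helper could build and validate metric names. This is a sketch under assumptions: the category names in SIGNAL_CATEGORIES and the example metric crawl_duration are hypothetical and not taken from this PR; only the three service prefixes come from the list above.

```python
SERVICE_PREFIXES = {"cua.docs", "cua.mcp_server", "cua.indexer"}

# Assumed category names; the PR only states that a signal category is used.
SIGNAL_CATEGORIES = {"latency", "traffic", "errors", "saturation"}


def metric_name(service_prefix, signal_category, name):
    """Build '{service_prefix}.{signal_category}.{metric_name}' with basic checks."""
    if service_prefix not in SERVICE_PREFIXES:
        raise ValueError(f"unknown service prefix: {service_prefix!r}")
    if signal_category not in SIGNAL_CATEGORIES:
        raise ValueError(f"unknown signal category: {signal_category!r}")
    if not name or "." in name or " " in name:
        raise ValueError(f"invalid metric name: {name!r}")
    return f"{service_prefix}.{signal_category}.{name}"


# Hypothetical metric, shown only to illustrate the resulting shape.
example = metric_name("cua.indexer", "latency", "crawl_duration")
```

Centralizing name construction like this would keep dashboards and alert rules from drifting out of sync with the emitted metric names.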

Four Golden Signals Coverage

Latency (Histograms)

  • Request/operation duration in milliseconds or seconds
  • Includes CopilotKit response times, tool execution times, and job durations

Traffic (Counters)

  • Total requests, messages, tool calls, pages crawled, chunks indexed
  • Tracks volume of operations across all services

Errors (Counters)

  • Error counts by type and source
  • Includes error categorization for debugging

Saturation (Gauges/UpDownCounters)

  • Concurrent requests, active jobs, queue depths
  • Memory usage tracking
  • Resource utilization indicators
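
The saturation signal for concurrent requests behaves like an up/down counter: increment when work starts, decrement when it ends. A minimal sketch, assuming a hypothetical ConcurrencyTracker class with an in-memory counter in place of an OTEL UpDownCounter:

```python
import threading
from contextlib import contextmanager


class ConcurrencyTracker:
    """Hypothetical in-memory analogue of an up/down counter for in-flight work."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0  # operations in flight right now
        self.peak = 0     # highest concurrency observed so far

    @contextmanager
    def track(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            # Always decrement, even if the tracked operation raises.
            with self._lock:
                self.current -= 1


tracker = ConcurrencyTracker()

with tracker.track():
    with tracker.track():
        in_flight = tracker.current  # two operations are active here
```

The decrement sits in a finally block so that a failed request cannot leave the gauge permanently inflated.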

Export Configuration

  • OTEL endpoint: https://otel.cua.ai (configurable via OTEL_EXPORTER_OTLP_ENDPOINT)
  • Export interval: 15 seconds for metrics
  • Includes resource attributes (service name, version, environment, namespace)
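
The export settings above could be resolved along these lines. This is a sketch, not the PR's init_otel(): ExportConfig and load_export_config are hypothetical names, and the resource attribute keys follow common OTEL semantic-convention names as an assumption, since the PR description does not list the exact keys.

```python
import os
from dataclasses import dataclass

DEFAULT_ENDPOINT = "https://otel.cua.ai"
EXPORT_INTERVAL_MS = 15_000  # 15-second metric export interval


@dataclass(frozen=True)
class ExportConfig:
    endpoint: str
    export_interval_ms: int
    resource_attributes: dict


def load_export_config(service_name, version, environment, namespace, env=None):
    """Resolve export settings, letting OTEL_EXPORTER_OTLP_ENDPOINT override the default."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or DEFAULT_ENDPOINT
    return ExportConfig(
        endpoint=endpoint.rstrip("/"),
        export_interval_ms=EXPORT_INTERVAL_MS,
        # Attribute keys are assumed, not confirmed by the PR.
        resource_attributes={
            "service.name": service_name,
            "service.version": version,
            "deployment.environment": environment,
            "service.namespace": namespace,
        },
    )
```

With the OpenTelemetry Python SDK, the interval would typically be handed to PeriodicExportingMetricReader through its export_interval_millis argument.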

Testing & Verification

The included Grafana dashboard prompt provides:

  • 5 comprehensive dashboards (Platform Overview, CopilotKit Deep Dive, MCP Operations, Indexing Pipeline, SLI/SLO Tracking)
  • Alert rules for high error rates, latency, saturation, and job failures
  • Example PromQL queries for validation
  • Environment variable setup instructions

Notes

  • MCP server instrumentation is in a separate PR
  • Metrics are exported via OTLP

Implement comprehensive OTEL instrumentation across all CUA services to
monitor the Four Golden Signals (Latency, Traffic, Errors, Saturation)
with metrics exported to otel.cua.ai.

Components instrumented:
- CopilotKit agent (docs/src/app/api/copilotkit/route.ts)
- MCP server (libs/python/mcp-server)
- Modal indexing services (docs/scripts/modal_app.py)

New files:
- docs/src/lib/otel/ - TypeScript OTEL library for Next.js
- docs/src/instrumentation.ts - Next.js instrumentation hook
- libs/python/mcp-server/mcp_server/otel.py - Python OTEL library
- docs/GRAFANA_DASHBOARD_PROMPT.md - Prompt for Grafana agent

Metrics tracked:
- Latency: request/response times, tool execution duration
- Traffic: requests, messages, tool calls, jobs
- Errors: by type, source, and severity
- Saturation: concurrent requests, active sessions, queue depth
@vercel
Contributor

vercel bot commented Jan 24, 2026

The latest updates on your projects.

Project Deployment Review Updated (UTC)
docs Error Error Jan 24, 2026 8:55pm


@sentry

sentry bot commented Jan 24, 2026

Codecov Report

❌ Patch coverage is 28.50467% with 153 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| libs/python/mcp-server/mcp_server/otel.py | 32.75% | 117 Missing ⚠️ |
| libs/python/mcp-server/mcp_server/server.py | 10.00% | 36 Missing ⚠️ |


Update pnpm-lock.yaml to include the OpenTelemetry packages added in
the previous commit. Fixes Vercel build failure due to frozen lockfile.

Co-authored-by: Claude <noreply@anthropic.com>