observability: kps + otel-collector + p95 budgets + labels#569

Merged
shayancoin merged 1 commit into main from
codex/add-production-ready-observability-stack
Oct 22, 2025
Conversation

@shayancoin shayancoin commented Oct 22, 2025

Summary

  • add a monitoring helmfile for kube-prometheus-stack with dashboards and latency alerting
  • introduce an OpenTelemetry Collector chart that emits span metrics to Prometheus and forwards traces to Tempo
  • propagate route, tenant, and service labels from backend/frontend code and document the observability rollout steps
  • define an API p95 budget and a validation workflow for observability assets
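
For readers unfamiliar with span metrics, the idea behind the second bullet can be sketched in a few lines of Python. This is only an illustration of how finished spans become latency histogram buckets keyed by service, route, and tenant; it is not the collector's implementation, and the bucket bounds, the `record_span` and `cumulative` helpers, and the sample spans are all assumptions made up for the example.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; the chart's real values are not shown in this PR.
BUCKETS = (0.05, 0.1, 0.3, 0.5, 1.0, float("inf"))


def record_span(histograms, span):
    """Add one finished span to the per-label histogram (non-cumulative counts)."""
    key = (span["service"], span["route"], span["tenant_id"])
    # bisect_left finds the first bucket whose upper bound is >= the duration.
    index = bisect_left(BUCKETS, span["duration_s"])
    histograms[key][index] += 1


def cumulative(counts):
    """Convert per-bucket counts to Prometheus-style cumulative `le` counts."""
    total, out = 0, []
    for count in counts:
        total += count
        out.append(total)
    return out


histograms = defaultdict(lambda: [0] * len(BUCKETS))
spans = [
    {"service": "paform-api", "route": "/configurator", "tenant_id": "t1", "duration_s": 0.04},
    {"service": "paform-api", "route": "/configurator", "tenant_id": "t1", "duration_s": 0.25},
    {"service": "paform-api", "route": "/configurator", "tenant_id": "t1", "duration_s": 0.70},
]
for span in spans:
    record_span(histograms, span)

series = cumulative(histograms[("paform-api", "/configurator", "t1")])
```

Each distinct (service, route, tenant_id) combination produces its own series, so a high-cardinality tenant label multiplies the number of series Prometheus has to store.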

Testing

  • pnpm --dir frontend lint

https://chatgpt.com/codex/tasks/task_e_68f8aea896f483309c40ed2ebbba75c3

Summary by CodeRabbit

  • New Features

    • Integrated OpenTelemetry to capture and trace requests end-to-end across the platform
    • Added Prometheus and Grafana monitoring with pre-configured dashboards for service latency visibility
    • Established service-level objectives for tracking API performance against targets
  • Documentation

    • Added comprehensive observability deployment and management procedures to the runbook

vercel bot commented Oct 22, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| paform | Ready | Preview | Comment | Oct 22, 2025 10:36am |


coderabbitai bot commented Oct 22, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This PR introduces comprehensive OpenTelemetry observability integration across the full stack: a GitHub Actions validation workflow, backend request tracing middleware, frontend page load tracing, Kubernetes Helm charts for OpenTelemetry Collector deployment, Prometheus/Grafana monitoring stack configuration, and observability SLO budgets targeting API p95 latency metrics.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Actions Workflow: `.github/workflows/obs-validate.yml` | New validation workflow with helm_lint, jsonlint, and yq-validate jobs to check Helm charts, Grafana dashboards, and observability budgets. |
| Backend OpenTelemetry Integration: `backend/api/main.py`, `backend/api/middleware/observability.py` | Sets default OTEL_SERVICE_NAME, registers ObservabilityMiddleware to extract tracing context from headers and annotate spans with service name, http.route, and tenant_id. |
| Frontend OpenTelemetry Integration: `frontend/src/app/configurator/page.tsx`, `frontend/src/lib/otel-route.ts` | Adds page load tracing initialization on ConfiguratorPage mount; exposes initOtelRoute() function to start page_load span with service and route attributes sourced from localStorage tenant_id. |
| OpenTelemetry Collector Helm Chart: `ops/helm/otel-collector/Chart.yaml`, `ops/helm/otel-collector/templates/_helpers.tpl`, `ops/helm/otel-collector/templates/configmap.yaml`, `ops/helm/otel-collector/templates/deployment.yaml`, `ops/helm/otel-collector/templates/service.yaml`, `ops/helm/otel-collector/templates/serviceaccount.yaml`, `ops/helm/otel-collector/templates/servicemonitor.yaml`, `ops/helm/otel-collector/values.yaml` | Complete Helm chart for deploying OpenTelemetry Collector with OTLP receivers, batch/memory-limiter processors, spanmetrics connectors (computing service, route, tenant_id dimensions), and dual-pipeline export (traces to Tempo, metrics to Prometheus). |
| Monitoring Stack Configuration: `ops/helm/monitoring/helmfile.yaml`, `ops/helm/monitoring/kps-values.yaml` | Helmfile and values for kube-prometheus-stack v55.10.0; includes embedded Grafana dashboard for paform-api latency visualization and Prometheus recording/alert rules for p95 latency SLO. |
| Observability Budgets & Prod Config: `observability-budgets.yml`, `ops/helm/values-prod.example.yaml` | Replaces multi-metric budgets with single budget api_p95_route_configurator targeting 0.300s threshold over 5m using Tempo spanmetrics; adds tempoEndpoint configuration to prod values. |
| Deployment Documentation: `docs/runbooks/deploy.md` | New Observability section (section 9) detailing Prometheus, Alertmanager, Grafana, and OpenTelemetry Collector deployment steps including Tempo wiring and SLI guidance. |
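
The backend row above describes extracting tracing context from headers and annotating spans with service, http.route, and tenant_id. The middleware source is not shown in this conversation, so the following is only a stdlib sketch of that idea, assuming the W3C `traceparent` header is the propagation format. The `x-tenant-id` header name, the `"unknown"` fallback, and both function names are assumptions for illustration, and header keys are assumed to be lowercased already.

```python
import re

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>".
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def extract_trace_context(headers):
    """Return (trace_id, parent_span_id, sampled), or None if the header is absent or invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", "").strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent ids are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def span_attributes(service_name, route, headers):
    """Build the attribute set the walkthrough lists: service, http.route, tenant_id."""
    return {
        "service": service_name,
        "http.route": route,
        # "x-tenant-id" is an assumed header name, not taken from the PR.
        "tenant_id": headers.get("x-tenant-id", "unknown"),
    }


headers = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "x-tenant-id": "acme",
}
context = extract_trace_context(headers)
attributes = span_attributes("paform-api", "/configurator", headers)
```

Using the route template (for example `/configurator`) rather than the raw request path keeps the `http.route` label bounded, which matters once the collector turns it into a metric dimension.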

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client / Browser
    participant Frontend as Frontend App
    participant Backend as Backend API
    participant Collector as OT Collector
    participant Tempo as Tempo
    participant Prometheus as Prometheus
    participant Grafana as Grafana

    Note over Frontend,Collector: Initialization & Request Phase
    Frontend->>Frontend: initOtelRoute() on mount
    activate Frontend
    Frontend->>Frontend: Start page_load span<br/>(service, route, tenant_id)
    Frontend->>Frontend: End page_load span
    deactivate Frontend

    Client->>Frontend: HTTP Request
    Frontend->>Backend: HTTP Request + Trace Headers

    Note over Backend,Collector: Backend Request Tracing
    Backend->>Backend: ObservabilityMiddleware<br/>Extract trace context
    activate Backend
    Backend->>Backend: Create/link span<br/>(service, http.route, tenant_id)
    Backend->>Backend: Process request
    Backend->>Backend: End span
    deactivate Backend
    Backend->>Client: HTTP Response

    Note over Collector,Prometheus: Telemetry Pipeline
    Backend->>Collector: Send traces (OTLP)
    Collector->>Collector: spanmetrics connector<br/>(extract service, route, tenant_id)
    Collector->>Tempo: Export traces
    Collector->>Prometheus: Export metrics<br/>(p95 latency histogram)

    Note over Prometheus,Grafana: Alerting & Visualization
    Prometheus->>Prometheus: Recording rule:<br/>paform:api_p95_5m
    Grafana->>Prometheus: Query metrics
    Grafana->>Grafana: Display latency dashboard<br/>Alert on threshold breach
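
The p95 value that the recording rule and dashboard rely on is estimated from cumulative histogram buckets, not from raw durations. The sketch below mirrors the linear interpolation that PromQL's `histogram_quantile` documents, in plain Python. The bucket bounds and counts are invented for the example, and this standalone function is an illustration, not code from the PR.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs sorted by bound.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls past the highest finite bucket: report that bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count


# Invented cumulative counts for one (service, route, tenant_id) series.
buckets = [(0.05, 10), (0.1, 40), (0.3, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)  # about 0.425 s for these counts
over_budget = p95 > 0.300
```

Because the estimate is interpolated within a bucket, its accuracy near the 0.300 s target depends on having a bucket boundary at or close to that value.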

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Rationale: The PR spans heterogeneous components (backend middleware, frontend hooks, Kubernetes manifests, Helm charts, GitHub Actions, YAML configs) with varying logic density. The backend middleware introduces context propagation logic requiring careful review of OpenTelemetry semantics; the Helm chart templates involve multiple interdependent resources (Deployment, ConfigMap, Service, ServiceAccount, ServiceMonitor) with conditional rendering and port/secret wiring; the monitoring stack chains Prometheus rules, Grafana dashboards, and Tempo integration; observability budgets introduce SLO thresholds and query logic. While individual sections are not overly complex, the breadth and interdependencies demand multi-pass reasoning across domains.

Poem

🐰 Hops through the traces we now see,
Spans of latency, flowing free,
From frontend hops to backend bounds,
OTEL metrics dance all around!
Grafana's dashboards glow so bright,
Observability done just right!

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 28bd780 and 66809c4.

📒 Files selected for processing (18)
  • .github/workflows/obs-validate.yml (1 hunks)
  • backend/api/main.py (5 hunks)
  • backend/api/middleware/observability.py (1 hunks)
  • docs/runbooks/deploy.md (1 hunks)
  • frontend/src/app/configurator/page.tsx (1 hunks)
  • frontend/src/lib/otel-route.ts (1 hunks)
  • observability-budgets.yml (1 hunks)
  • ops/helm/monitoring/helmfile.yaml (1 hunks)
  • ops/helm/monitoring/kps-values.yaml (1 hunks)
  • ops/helm/otel-collector/Chart.yaml (1 hunks)
  • ops/helm/otel-collector/templates/_helpers.tpl (1 hunks)
  • ops/helm/otel-collector/templates/configmap.yaml (1 hunks)
  • ops/helm/otel-collector/templates/deployment.yaml (1 hunks)
  • ops/helm/otel-collector/templates/service.yaml (1 hunks)
  • ops/helm/otel-collector/templates/serviceaccount.yaml (1 hunks)
  • ops/helm/otel-collector/templates/servicemonitor.yaml (1 hunks)
  • ops/helm/otel-collector/values.yaml (1 hunks)
  • ops/helm/values-prod.example.yaml (1 hunks)

@shayancoin shayancoin merged commit 975d110 into main Oct 22, 2025
5 of 13 checks passed
coderabbitai bot commented Oct 22, 2025

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #570

coderabbitai bot added a commit that referenced this pull request Oct 22, 2025
Docstrings generation was requested by @shayancoin.

* #569 (comment)

The following files were modified:

* `backend/api/main.py`
* `backend/api/middleware/observability.py`
* `frontend/src/app/configurator/page.tsx`
* `frontend/src/lib/otel-route.ts`
shayancoin pushed a commit that referenced this pull request Oct 22, 2025
…570)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +1 to +9
budgets:
  - name: api_p95_route_configurator
    target: 0.300
    query: >
      histogram_quantile(0.95, sum by (le, service, route, tenant_id)(
      rate(traces_spanmetrics_latency_bucket{service="paform-api", route="/configurator"}[5m])
      ))
    window: 5m
    action_on_violation: fail

P1: Config update disables observability budget checks

The new observability budget file now defines a budgets array without the prometheus/tempo sections or the global window/baseline keys that tools/ci/check_observability_budgets.py consumes. The Python check still expects those keys and iterates over providers named prometheus and tempo; when run against this file it prints "No observability providers configured." and exits successfully, so no thresholds are ever evaluated. This effectively turns off the SLO gate for deployments. Either keep the previous schema or update the script and workflow to read the new structure.
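
One way to avoid this kind of silent pass is for the checker to treat an unrecognized or empty configuration as an error. The sketch below assumes the new `budgets` list schema quoted above. It is not the contents of `tools/ci/check_observability_budgets.py`, which are not shown here; `validate_budgets`, `violations`, and the `observed` mapping are hypothetical names, and fetching the observed values from Prometheus is left out.

```python
REQUIRED_KEYS = {"name", "target", "query", "window", "action_on_violation"}


def validate_budgets(config):
    """Return the list of budgets, raising instead of silently passing on an unknown schema."""
    budgets = config.get("budgets")
    if not isinstance(budgets, list) or not budgets:
        raise ValueError("no budgets configured; refusing to pass the gate")
    for budget in budgets:
        missing = REQUIRED_KEYS - set(budget)
        if missing:
            raise ValueError(f"budget {budget.get('name', '?')} is missing {sorted(missing)}")
    return budgets


def violations(budgets, observed):
    """Compare observed values (name -> seconds) against each budget's target."""
    failed = []
    for budget in budgets:
        value = observed.get(budget["name"])
        # A missing measurement counts as a violation rather than a pass.
        if value is None or value > budget["target"]:
            failed.append(budget["name"])
    return failed


new_config = {
    "budgets": [
        {
            "name": "api_p95_route_configurator",
            "target": 0.300,
            "query": "histogram_quantile(0.95, ...)",  # abbreviated placeholder
            "window": "5m",
            "action_on_violation": "fail",
        }
    ]
}
budgets = validate_budgets(new_config)
failed = violations(budgets, {"api_p95_route_configurator": 0.425})
```

With this shape, a file in the old provider-based layout raises `ValueError` instead of exiting successfully, so a schema change cannot quietly disable the gate.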
