Add observability dashboards and alerting config#76

Merged
shayancoin merged 3 commits into main from codex/create-json-dashboards-and-configure-alerts
Oct 16, 2025
Conversation

@shayancoin (Owner) commented Oct 16, 2025

Summary

  • add Grafana dashboards for backend golden signals, sync pipeline KPIs, frontend UX, and tracing exploration
  • wire dashboards into Grafana provisioning and configure alert contact points
  • define Prometheus alert rules for error rate, slow p95 latency, and degraded frontend LCP

Testing

  • not run (not applicable)

https://chatgpt.com/codex/tasks/task_e_68f12adca0c88330a336cb42934b4279

Summary by CodeRabbit

  • New Features
    • Added multiple Grafana dashboards and provisioning for automatic updates
    • Added alerting notifications (Slack, Email) and Prometheus alert rules
    • Added backend request instrumentation and an observability ingest endpoint
    • Added frontend LCP reporting that sends web-vitals to the backend
  • Tests
    • Added tests validating observability ingestion and HTTP metrics
  • Chores
    • Updated frontend dependencies and dev tooling configuration

coderabbitai bot (Contributor) commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds Grafana provisioning (dashboards + alerting notifications), four Grafana dashboards, Prometheus alerting rules, Prometheus client metrics + middleware in the backend, a backend observability ingest route, frontend web-vitals reporting, and tests validating metrics ingestion and HTTP instrumentation.

Changes

Cohort / File(s) Summary
Grafana provisioning & alerting config
dashboards/dashboards.yaml, ops/grafana/provisioning/alerting/notifications.yaml
Added a Grafana dashboards provider config and alerting notification/contact-point policy (Slack and Email receivers) with routing rules and grouping.
Grafana dashboard JSONs
ops/grafana/provisioning/dashboards/backend_golden_signals.json, ops/grafana/provisioning/dashboards/frontend_ux_metrics.json, ops/grafana/provisioning/dashboards/sync_pipeline_kpis.json, ops/grafana/provisioning/dashboards/tracing_explorer.json
Added four dashboard definitions (backend golden signals, frontend UX metrics, sync pipeline KPIs, tracing explorer) targeting Prometheus/Tempo/Loki with templating, panels, and metadata.
Prometheus alert rules
ops/prometheus/alerts.yml
Added alert groups and rules for backend (HighErrorRate, SlowP95API) and frontend (FrontendLCPDegraded) with labels, annotations and runbook URLs.
Backend metrics, middleware, and routes
backend/api/metrics.py, backend/api/main.py, backend/api/routes_observability.py
Introduced Prometheus metrics (counter, histogram with buckets, web_vitals_lcp), observe_* APIs, a PrometheusInstrumentationMiddleware that records per-request metrics (skips /metrics), route label resolver, and an /observability POST endpoint for ingesting LCP web-vitals.
Backend tests
backend/tests/test_observability_metrics.py
Added tests exercising LCP ingestion, rejection of unsupported metrics, and http_requests_total increment for an example request.
Frontend web-vitals reporting & deps
frontend/src/app/reportWebVitals.ts, frontend/package.json
Added client-side reportWebVitals that sends LCP to backend (beacon -> fetch fallback) with env-driven endpoint resolution; updated frontend dependencies/devDependencies (added @react-three/*, ts-node adjustments).

Sequence Diagram(s)

sequenceDiagram
  participant Browser
  participant Frontend as Next.js app
  participant Backend as FastAPI app
  participant PromRegistry as Prometheus REGISTRY
  Note over Browser,Frontend #D3E4CD: LCP observed in browser
  Browser->>Frontend: reportWebVitals(LCP) (sendBeacon / fetch)
  Frontend->>Backend: POST /observability/web-vitals {name:"LCP", value, app}
  Backend->>PromRegistry: observe_lcp(app, seconds)
  PromRegistry-->>Backend: updated histogram
  Note over Browser,Backend #F7E9D9: HTTP requests instrumented
  Browser->>Backend: GET /healthcheck
  Backend->>Backend: PrometheusInstrumentationMiddleware measures duration/status
  Backend->>PromRegistry: observe_http_request(service, route, method, status, duration)
  PromRegistry-->>Backend: updated counter & histogram
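The ingest step in the diagram — POST /observability/web-vitals with a {name, value, app} payload — amounts to a validation-and-conversion function. A hypothetical sketch: the field names follow the diagram, while treating value as milliseconds (web-vitals reports LCP in ms) and converting to seconds for the histogram is an assumption not confirmed by the PR.

```python
SUPPORTED_METRICS = {"LCP"}


def ingest_web_vital(payload: dict) -> float:
    """Validate a web-vitals payload and return the value in seconds.

    Raises ValueError for unsupported metric names, mirroring the test
    coverage described above (rejection of unsupported metrics).
    """
    name = payload.get("name")
    if name not in SUPPORTED_METRICS:
        raise ValueError(f"unsupported metric: {name!r}")
    value_ms = float(payload["value"])
    if value_ms < 0:
        raise ValueError("metric value must be non-negative")
    return value_ms / 1000.0
```

The returned seconds value is what would then be passed to observe_lcp(app, seconds) in the diagram's next step.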

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Reason: Multiple heterogeneous changes across infra (Grafana, Prometheus), backend instrumentation/middleware, HTTP route + metrics API, frontend reporting, and tests. Review requires validating config semantics, metric names/buckets, middleware correctness (skipping /metrics, error handling), and test robustness.

Possibly related PRs

  • Add ts-node to frontend dev dependencies #20 — touches frontend/package.json and reintroduces ts-node in devDependencies; likely related to the frontend dependency adjustments here.
  • Paform #7 — introduces backend observability changes (metrics and instrumentation) similar to the backend additions in this PR.

Poem

🐰 I watched dashboards sprout where carrots once lay,

Grafana gardens hum, and metrics dance all day,
Alerts nudge my whiskers when thresholds hop high,
LCP sent by beacon beneath a blue sky,
A rabbit cheers—observability’s pie! 🥕📈

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning: The pull request description does not follow the repository's required template; it lacks the mandated "PR Type," "Short Description," and "Tests Added" sections and instead uses the custom headings "Summary" and "Testing." Although it outlines the changes and notes testing status, the missing template headings make it harder to identify the type of change and test coverage at a glance. Resolution: update the PR description to match the repository template — add a "PR Type" section (e.g., Feature), a "Short Description" summarizing the change in one sentence, and a "Tests Added" section listing the tests added or updated, using the exact headings from the template.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 25.00%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title clearly and concisely summarizes the primary changes (adding observability dashboards and configuring alerting), matching the main objectives of the pull request without unnecessary detail or noise.
✨ Finishing touches
  • 📝 Docstrings were successfully generated.
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch codex/create-json-dashboards-and-configure-alerts

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines 5 to 24
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{service="backend",status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{service="backend"}[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
      service: backend
      team: platform
    annotations:
      summary: "Backend error rate is above 5%"
      description: "The backend service is returning 5xx responses at >5% of total traffic over the last 5 minutes. Investigate recent deploys or upstream dependencies."
      runbook_url: "https://runbooks.example.com/backend/high-error-rate"
  - alert: SlowP95API
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="backend",route!=""}[5m])) by (route, le)) > 0.8
    for: 10m
    labels:
      severity: warning
      service: backend
      team: platform


P1 — Alert queries target nonexistent backend metric labels

Both backend alert expressions rely on http_requests_total and http_request_duration_seconds_bucket being labelled with service and route, but the backend only registers a few custom Counter metrics and never initializes the prometheus-fastapi-instrumentator (or any middleware that would emit these series). As written, the PromQL filters will always return an empty vector and neither alert will ever fire, so operators receive no signal even if the API is failing. Consider emitting these metrics or adjusting the queries to match the labels that actually exist (e.g. job, handler).

Useful? React with 👍 / 👎.

Comment on lines 32 to 34
  - alert: FrontendLCPDegraded
    expr: histogram_quantile(0.95, sum(rate(web_vitals_lcp_bucket{app="frontend"}[5m])) by (le)) > 4
    for: 15m


P1 — Frontend LCP alert references metrics never produced

The FrontendLCPDegraded rule assumes a Prometheus histogram named web_vitals_lcp_bucket with an app label, but the repo contains no instrumentation or frontend code that exports Web Vitals metrics. Because the source metric does not exist, the alert expression evaluates to an empty result and user experience regressions will go undetected. Either add instrumentation that publishes web_vitals_lcp_bucket or update the alert to target an existing metric.

Useful? React with 👍 / 👎.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2a35c4 and c3fb7df.

📒 Files selected for processing (7)
  • dashboards/dashboards.yaml (1 hunks)
  • ops/grafana/provisioning/alerting/notifications.yaml (1 hunks)
  • ops/grafana/provisioning/dashboards/backend_golden_signals.json (1 hunks)
  • ops/grafana/provisioning/dashboards/frontend_ux_metrics.json (1 hunks)
  • ops/grafana/provisioning/dashboards/sync_pipeline_kpis.json (1 hunks)
  • ops/grafana/provisioning/dashboards/tracing_explorer.json (1 hunks)
  • ops/prometheus/alerts.yml (1 hunks)

Comment on lines +25 to +37
    groupBy:
      - alertname
      - service
    matchers:
      - severity =~ ".*"
    routes:
      - receiver: email_sre
        groupBy:
          - alertname
        objectMatchers:
          - - severity
            - '='
            - warning

⚠️ Potential issue | 🟠 Major

Use Grafana’s expected snake_case keys in notification policies

Grafana provisioning only understands group_by and object_matchers. Using the camelCase variants (groupBy, objectMatchers) means your policy won’t group alerts or fan out warnings to email.

Update the block to:

-    groupBy:
+    group_by:
       - alertname
       - service
...
-      - receiver: email_sre
-        groupBy:
+      - receiver: email_sre
+        group_by:
           - alertname
-        objectMatchers:
+        object_matchers:
           - - severity
             - '='
             - warning
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

    groupBy:
      - alertname
      - service
    matchers:
      - severity =~ ".*"
    routes:
      - receiver: email_sre
        groupBy:
          - alertname
        objectMatchers:
          - - severity
            - '='
            - warning

After:

    group_by:
      - alertname
      - service
    matchers:
      - severity =~ ".*"
    routes:
      - receiver: email_sre
        group_by:
          - alertname
        object_matchers:
          - - severity
            - '='
            - warning
🤖 Prompt for AI Agents
In ops/grafana/provisioning/alerting/notifications.yaml around lines 25 to 37,
the route uses CamelCase keys that Grafana doesn't recognize; change "groupBy"
to "group_by" and "objectMatchers" to "object_matchers" under the route
(preserving the nested list structure of the object matchers), so the routing
will correctly group alerts and fan out warnings to the email_sre receiver.

Comment on lines +187 to +225
"datasource": {
"type": "prometheus",
"uid": "$prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 4,
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"targets": [
{
"expr": "sum(rate(frontend_api_requests_total{outcome=\"success\"}[5m])) by (route)",
"legendFormat": "{{route}}",
"refId": "A"
}
],
"title": "Frontend API Success Rate",
"type": "timeseries"
},

⚠️ Potential issue | 🟠 Major

Fix the “Frontend API Success Rate” query

This panel is titled as a success rate but the query only counts successful requests per second, so we never notice when the success percentage drops. Please turn it into a true ratio and report it as a percentage.

Apply:

-          "unit": "reqps"
+          "unit": "percentunit"
...
-          "expr": "sum(rate(frontend_api_requests_total{outcome=\"success\"}[5m])) by (route)",
+          "expr": "sum(rate(frontend_api_requests_total{outcome=\"success\"}[5m])) by (route)\n            /\n            clamp_min(sum(rate(frontend_api_requests_total[5m])) by (route), 1e-6)",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"datasource": {
"type": "prometheus",
"uid": "$prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 4,
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"targets": [
{
"expr": "sum(rate(frontend_api_requests_total{outcome=\"success\"}[5m])) by (route)",
"legendFormat": "{{route}}",
"refId": "A"
}
],
"title": "Frontend API Success Rate",
"type": "timeseries"
},
"datasource": {
"type": "prometheus",
"uid": "$prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "percentunit"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 4,
"options": {
"legend": {
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "multi"
}
},
"targets": [
{
"expr": "sum(rate(frontend_api_requests_total{outcome=\"success\"}[5m])) by (route)\n /\n clamp_min(sum(rate(frontend_api_requests_total[5m])) by (route), 1e-6)",
"legendFormat": "{{route}}",
"refId": "A"
}
],
"title": "Frontend API Success Rate",
"type": "timeseries"
},
🤖 Prompt for AI Agents
In ops/grafana/provisioning/dashboards/frontend_ux_metrics.json around lines 187
to 225, the panel titled "Frontend API Success Rate" currently uses a query that
only counts successful requests/sec; replace the expr with a PromQL ratio that
divides successful requests by total requests over the same window (e.g.
sum(rate(frontend_api_requests_total{outcome="success"}[5m])) /
sum(rate(frontend_api_requests_total[5m])) ), multiply by 100 or set the panel
unit to percent so the panel shows percentage values, and ensure
legend/legendFormat remains appropriate (e.g. {{route}}) while keeping the
datasource/type unchanged.
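The clamp_min guard in the suggested expression prevents a divide-by-zero when a route has no traffic in the window. In plain arithmetic the behavior looks like this (an illustrative sketch only — the real computation happens in PromQL; the function name and floor value here are assumptions):

```python
def success_ratio(success_rate: float, total_rate: float, floor: float = 1e-6) -> float:
    # Mirrors: success / clamp_min(total, 1e-6). With zero traffic the
    # denominator is clamped to the floor instead of dividing by zero,
    # so the ratio collapses to 0 rather than producing NaN.
    return success_rate / max(total_rate, floor)
```

With mixed traffic this yields the expected fraction; with no traffic at all it degrades gracefully to zero.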

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3fb7df and 2ec0890.

⛔ Files ignored due to path filters (1)
  • frontend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (6)
  • backend/api/main.py (3 hunks)
  • backend/api/metrics.py (2 hunks)
  • backend/api/routes_observability.py (1 hunks)
  • backend/tests/test_observability_metrics.py (1 hunks)
  • frontend/package.json (2 hunks)
  • frontend/src/app/reportWebVitals.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/api/routes_observability.py (2)
backend/api/main.py (1)
  • metrics (150-152)
backend/api/metrics.py (1)
  • observe_lcp (77-80)
backend/api/main.py (1)
backend/api/metrics.py (1)
  • observe_http_request (59-74)
🔇 Additional comments (1)
frontend/package.json (1)

23-32: Dependency additions align with the new observability + 3D work

@react-three/drei and @react-three/fiber land with compatible versions alongside three, and web-vitals ^4.2.4 is the right pick for piping LCP into the new reporting hook. No issues spotted here.

Comment on lines +66 to +93
class PrometheusInstrumentationMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):  # type: ignore[override]
        if request.url.path == "/metrics":
            return await call_next(request)

        start = perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            duration = perf_counter() - start
            observe_http_request(
                service="backend",
                route=_resolve_route_label(request),
                method=request.method,
                status="500",
                duration_seconds=duration,
            )
            raise

        duration = perf_counter() - start
        observe_http_request(
            service="backend",
            route=_resolve_route_label(request),
            method=request.method,
            status=str(response.status_code),
            duration_seconds=duration,
        )
        return response

⚠️ Potential issue | 🟠 Major

Record the real status code for HTTPException responses.

call_next raises HTTPException for anticipated errors (e.g. 404, 422), but our broad except Exception block records every one of them as status 500. That skews the metrics for client errors and validation failures. Handle HTTPException separately (use its status_code) before catching generic exceptions.

Apply this diff to record accurate status codes:

-from starlette.middleware.base import BaseHTTPMiddleware
+from starlette.middleware.base import BaseHTTPMiddleware
+from fastapi import HTTPException
@@
-        try:
-            response = await call_next(request)
-        except Exception:
+        try:
+            response = await call_next(request)
+        except HTTPException as exc:
+            duration = perf_counter() - start
+            observe_http_request(
+                service="backend",
+                route=_resolve_route_label(request),
+                method=request.method,
+                status=str(exc.status_code),
+                duration_seconds=duration,
+            )
+            raise
+        except Exception:
             duration = perf_counter() - start
             observe_http_request(
                 service="backend",
                 route=_resolve_route_label(request),
                 method=request.method,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

class PrometheusInstrumentationMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):  # type: ignore[override]
        if request.url.path == "/metrics":
            return await call_next(request)

        start = perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            duration = perf_counter() - start
            observe_http_request(
                service="backend",
                route=_resolve_route_label(request),
                method=request.method,
                status="500",
                duration_seconds=duration,
            )
            raise

        duration = perf_counter() - start
        observe_http_request(
            service="backend",
            route=_resolve_route_label(request),
            method=request.method,
            status=str(response.status_code),
            duration_seconds=duration,
        )
        return response

After:

from starlette.middleware.base import BaseHTTPMiddleware
from fastapi import HTTPException


class PrometheusInstrumentationMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):  # type: ignore[override]
        if request.url.path == "/metrics":
            return await call_next(request)

        start = perf_counter()
        try:
            response = await call_next(request)
        except HTTPException as exc:
            duration = perf_counter() - start
            observe_http_request(
                service="backend",
                route=_resolve_route_label(request),
                method=request.method,
                status=str(exc.status_code),
                duration_seconds=duration,
            )
            raise
        except Exception:
            duration = perf_counter() - start
            observe_http_request(
                service="backend",
                route=_resolve_route_label(request),
                method=request.method,
                status="500",
                duration_seconds=duration,
            )
            raise

        duration = perf_counter() - start
        observe_http_request(
            service="backend",
            route=_resolve_route_label(request),
            method=request.method,
            status=str(response.status_code),
            duration_seconds=duration,
        )
        return response
🤖 Prompt for AI Agents
In backend/api/main.py around lines 66 to 93, the middleware’s broad except
block treats all raised HTTPException instances as 500s; update the error
handling to catch fastapi.exceptions.HTTPException (or
starlette.exceptions.HTTPException) separately, call observe_http_request using
the exception’s status_code, then re-raise it, and keep a separate generic
except Exception block that records status "500" and re-raises; ensure the
/metrics bypass remains unchanged and that duration is measured before recording
in both exception branches.
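The branching the suggestion introduces reduces to a small status-resolution rule: anticipated errors carry their own status code, everything else is a 500. A hypothetical stdlib sketch (HTTPExceptionLike stands in for FastAPI's HTTPException, which carries a status_code attribute — the class name here is invented for illustration):

```python
class HTTPExceptionLike(Exception):
    """Stand-in for FastAPI's HTTPException: an anticipated error with a status code."""

    def __init__(self, status_code: int):
        super().__init__(status_code)
        self.status_code = status_code


def status_label(exc: BaseException) -> str:
    # Anticipated errors (404, 422, ...) keep their real status code in the
    # metric label; any other exception is recorded as a server-side 500.
    if isinstance(exc, HTTPExceptionLike):
        return str(exc.status_code)
    return "500"
```

With this rule in the middleware, client errors and validation failures no longer inflate the 5xx series that the HighErrorRate alert divides by.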

Comment on lines +11 to +18
  const protocol = process.env.NEXT_PUBLIC_BACKEND_PROTOCOL ?? window.location.protocol ?? 'http:';
  const host = process.env.NEXT_PUBLIC_BACKEND_HOST ?? window.location.hostname;
  const port = process.env.NEXT_PUBLIC_BACKEND_PORT ?? '';
  const endpoint = process.env.NEXT_PUBLIC_WEB_VITALS_ENDPOINT ?? DEFAULT_ENDPOINT;

  const portSegment = port ? `:${port}` : '';
  return `${protocol}//${host}${portSegment}${endpoint}`;
}

⚠️ Potential issue | 🟠 Major

Normalize protocol when building the reporting URL.

If NEXT_PUBLIC_BACKEND_PROTOCOL is set to common values like "https", the computed endpoint becomes https//… (missing the colon) because we directly interpolate the raw string. That breaks both sendBeacon and the fetch fallback. Please normalize the protocol (ensure it ends with :) or build the URL via new URL(...) so env-provided values work.

Apply this diff to normalize the protocol before interpolation:

-  const protocol = process.env.NEXT_PUBLIC_BACKEND_PROTOCOL ?? window.location.protocol ?? 'http:';
+  const protocol = process.env.NEXT_PUBLIC_BACKEND_PROTOCOL ?? window.location.protocol ?? 'http:';
+  const normalizedProtocol = protocol.endsWith(':') ? protocol : `${protocol}:`;
@@
-  return `${protocol}//${host}${portSegment}${endpoint}`;
+  return `${normalizedProtocol}//${host}${portSegment}${endpoint}`;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

  const protocol = process.env.NEXT_PUBLIC_BACKEND_PROTOCOL ?? window.location.protocol ?? 'http:';
  const host = process.env.NEXT_PUBLIC_BACKEND_HOST ?? window.location.hostname;
  const port = process.env.NEXT_PUBLIC_BACKEND_PORT ?? '';
  const endpoint = process.env.NEXT_PUBLIC_WEB_VITALS_ENDPOINT ?? DEFAULT_ENDPOINT;

  const portSegment = port ? `:${port}` : '';
  return `${protocol}//${host}${portSegment}${endpoint}`;
}

After:

  const protocol = process.env.NEXT_PUBLIC_BACKEND_PROTOCOL ?? window.location.protocol ?? 'http:';
  const normalizedProtocol = protocol.endsWith(':') ? protocol : `${protocol}:`;
  const host = process.env.NEXT_PUBLIC_BACKEND_HOST ?? window.location.hostname;
  const port = process.env.NEXT_PUBLIC_BACKEND_PORT ?? '';
  const endpoint = process.env.NEXT_PUBLIC_WEB_VITALS_ENDPOINT ?? DEFAULT_ENDPOINT;

  const portSegment = port ? `:${port}` : '';
  return `${normalizedProtocol}//${host}${portSegment}${endpoint}`;
}
🤖 Prompt for AI Agents
In frontend/src/app/reportWebVitals.ts around lines 11 to 18, the protocol value
from NEXT_PUBLIC_BACKEND_PROTOCOL may be missing the trailing colon (e.g.
"https") which produces URLs like "https//..."; normalize the protocol before
building the URL by trimming whitespace, checking if it is non-empty and
endsWith(':') and if not appending ':' (or alternately construct the full URL
with new URL by combining host/port/endpoint), then use that normalized protocol
when interpolating so the final URL is valid for sendBeacon/fetch.

coderabbitai bot (Contributor) commented Oct 16, 2025

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #133

coderabbitai bot added a commit that referenced this pull request Oct 16, 2025
Docstrings generation was requested by @shayancoin.

* #76 (comment)

The following files were modified:

* `backend/api/main.py`
* `backend/api/metrics.py`
* `backend/api/routes_observability.py`
* `backend/tests/test_observability_metrics.py`
* `frontend/src/app/reportWebVitals.ts`
@shayancoin shayancoin merged commit f6d661f into main Oct 16, 2025
2 of 7 checks passed
shayancoin pushed a commit that referenced this pull request Oct 18, 2025
…s` (#132)

Docstrings generation was requested by @shayancoin.

* #76 (comment)

The following files were modified:

* `backend/api/main.py`
* `backend/api/metrics.py`
* `backend/api/routes_observability.py`
* `backend/tests/test_observability_metrics.py`
* `frontend/src/app/reportWebVitals.ts`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
shayancoin added a commit that referenced this pull request Oct 18, 2025
…s` (#133)

Docstrings generation was requested by @shayancoin.

* #76 (comment)

The following files were modified:

* `backend/api/main.py`
* `backend/api/metrics.py`
* `backend/api/routes_observability.py`
* `backend/tests/test_observability_metrics.py`
* `frontend/src/app/reportWebVitals.ts`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Shayan <shayan@coin.link>