
Add observability dashboards and alerting configuration #98

Merged

shayancoin merged 3 commits into main from codex/create-grafana-dashboards-and-alerts on Oct 17, 2025

Conversation

@shayancoin
Owner

shayancoin commented Oct 16, 2025

Summary

  • add Grafana dashboards for backend golden signals, sync pipeline KPIs, frontend UX metrics, and tracing exploration
  • configure Grafana provisioning for dashboards plus Slack and email alert contact points
  • add Prometheus alert rules to align with Grafana alerting conventions

Testing

  • not run (configuration changes only)

https://chatgpt.com/codex/tasks/task_e_68f12ebc874c8330aaac4ab465b80e6b

Summary by CodeRabbit

  • New Features

    • Added observability dashboards for backend golden signals, frontend UX, sync pipeline KPIs, and tracing explorer; dashboards are auto-provisioned from the repo structure.
    • Added Prometheus alert rules covering backend errors, API latency, and frontend LCP.
  • New Features / Alerting

    • Configured alerting contact points (Slack and email) and notification policies routing alerts by severity; a contact-point sketch follows this list.
  • Documentation

    • Added Grafana provisioning and alerting setup documentation.
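
For reference, a minimal sketch of what provisioning/alerting/contact-points.yaml could look like; the receiver names slack-oncall and email-sre come from this PR, while the webhook URL, email address, and uid values are placeholders:

# Sketch only: receiver names are from the PR; url, addresses, and uids are placeholders.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-oncall
    receivers:
      - uid: slack-oncall
        type: slack
        settings:
          url: ${SLACK_WEBHOOK_URL}   # placeholder webhook, typically injected via env/secret
  - orgId: 1
    name: email-sre
    receivers:
      - uid: email-sre
        type: email
        settings:
          addresses: sre@example.com  # placeholder address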

@coderabbitai
Contributor

coderabbitai bot commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds Grafana provisioning and dashboard files, alerting contact points and notification policies, and Prometheus alerting rules to enable file-backed dashboards, Grafana alert routing (Slack/email), and Prometheus rule evaluation for backend/frontend alerts.

Changes

  • Grafana provisioning index & docs (dashboards/dashboards.yaml, ops/grafana/README.md): adds a Grafana provisioning entry enabling file-based dashboards (path: ops/grafana/provisioning/dashboards, foldersFromFilesStructure) and documents provisioning scope, alerting, and Prometheus alert integration; a provider sketch follows this list.
  • Grafana alerting configuration (ops/grafana/provisioning/alerting/contact-points.yaml, ops/grafana/provisioning/alerting/notification-policies.yaml): introduces contact points for Slack (slack-oncall) and email (email-sre), plus two notification policies routing critical alerts to Slack and warning alerts to email, grouped by alertname and service.
  • Grafana dashboards, provisioned files (ops/grafana/provisioning/dashboards/backend-golden-signals.json, frontend-ux-metrics.json, sync-pipeline-kpis.json, tracing-explorer.json): adds four JSON dashboards (Backend Golden Signals, Frontend UX Metrics, Sync Pipeline KPIs, and Tracing Explorer) with panels using Prometheus/Tempo/Loki datasources and templating variables.
  • Prometheus alerting rules (ops/prometheus/alerts.yml): adds alerting rules for backend HighErrorRate and SlowP95API and frontend FrontendLCPDegraded, with severity labels, durations, annotations, and runbook URLs.
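
For orientation, a file-based dashboard provider matching the first entry above might look roughly like this; the path and foldersFromFilesStructure option come from the summary, while the provider name and refresh interval are assumptions:

# dashboards/dashboards.yaml sketch: provider name and updateIntervalSeconds
# are assumed; path and foldersFromFilesStructure follow the summary above.
apiVersion: 1
providers:
  - name: paform-dashboards        # assumed provider name
    type: file
    allowUiUpdates: false
    updateIntervalSeconds: 30
    options:
      path: ops/grafana/provisioning/dashboards
      foldersFromFilesStructure: true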

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Grafana
    participant Filesystem as "File storage\nops/grafana/provisioning/dashboards"
    participant GrafanaProv as "Grafana\nProvisioning (dashboards.yaml)"
    participant AlertingProv as "Grafana Alerting\n(contact-points, policies)"
    participant Prometheus
    participant Alertmanager

    Note over Grafana,GrafanaProv: Grafana startup provisioning flow
    Grafana ->> GrafanaProv: read `dashboards/dashboards.yaml`
    GrafanaProv ->> Filesystem: load dashboards (foldersFromFilesStructure)
    Filesystem -->> Grafana: JSON dashboard files
    Grafana ->> Grafana: provision dashboards

    Note over Grafana,AlertingProv: Alerting provisioning
    Grafana ->> AlertingProv: load `contact-points.yaml` & `notification-policies.yaml`
    AlertingProv -->> Grafana: receivers configured (Slack, Email)

    Note over Prometheus,Alertmanager: Prometheus alert flow (added rules)
    Prometheus ->> Prometheus: evaluate `ops/prometheus/alerts.yml`
    Prometheus ->> Alertmanager: send alerts
    Alertmanager ->> Grafana: (optional) route alerts or Grafana uses its contact points
    Alertmanager ->> Slack: deliver critical (via configured webhook)
    Alertmanager ->> Email: deliver warning (via configured SMTP/email)
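
For completeness, the rule file added here is normally referenced from the Prometheus server configuration; a minimal sketch, where the alertmanager:9093 target is an assumption and only the rule file path is taken from this PR:

# prometheus.yml excerpt (sketch): the Alertmanager target is assumed.
rule_files:
  - ops/prometheus/alerts.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]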

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through JSON, YAML, and light,

Dashboards arranged in tidy sight,
Slack bells and emails ready to chime,
Metrics and traces dancing in time,
Observability nibbles carrots of rhyme.

Pre-merge checks

❌ Failed checks (1 warning)

  • Description Check (⚠️ Warning): the pull request description does not follow the repository's required template because it lacks the "PR Type," "Short Description," and "Tests Added" sections and instead uses "Summary" and "Testing" headings that do not match the template. Resolution: update the description to include the required "PR Type," "Short Description," and "Tests Added" sections using the repository's template headings and format, ensuring all template fields are filled out.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): the title succinctly and accurately describes the primary change of adding observability dashboards and alerting configuration, which aligns with the pull request's main updates.
  • Docstring Coverage (✅ Passed): no functions found in the changes; docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81d4ed0 and b602d20.

📒 Files selected for processing (2)
  • ops/grafana/provisioning/alerting/notification-policies.yaml (1 hunks)
  • ops/grafana/provisioning/dashboards/tracing-explorer.json (1 hunks)

Comment @coderabbitai help to get the list of available commands and usage tips.


chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines 8 to 9
matching_labels:
  severity: critical


P1: Use valid Grafana matcher keys in notification policies

The alert routing file uses a matching_labels map, but Grafana’s provisioning schema only recognizes matchers/object_matchers inside notification policy routes. Because the key is ignored, the first policy becomes the default route and both warning and critical alerts will be sent to Slack instead of splitting Slack vs email as intended. Update the configuration to use Grafana’s supported matcher fields (and, if necessary, nest the routes) so label-based routing actually takes effect across both entries.
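
A minimal sketch of the routing expressed with Grafana's supported matcher keys, assuming the slack-oncall and email-sre receiver names from contact-points.yaml and treating email-sre as the default route (an assumption):

# notification-policies.yaml sketch: object_matchers replaces matching_labels;
# the default receiver and group_by values are assumptions.
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-sre              # assumed default route
    group_by: ["alertname", "service"]
    routes:
      - receiver: slack-oncall
        object_matchers:
          - ["severity", "=", "critical"]
      - receiver: email-sre
        object_matchers:
          - ["severity", "=", "warning"]

With this shape, label-based routing applies to both entries and the Slack/email split behaves as intended.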

Useful? React with 👍 / 👎.

Comment on lines 36 to 39
"targets": [
{
"query": "{ service.name = \"$service\" } | trace_id != ''",
"refId": "A"


P1: Make tracing queries work when the Service variable is "All"

Every Tempo/Loki query compares service with = (e.g. { service.name = "$service" }), yet the variable is configured with includeAll: true and defaults to an empty/regex value. When the dashboard first loads the expression becomes service.name="", which matches nothing, so all panels render empty until a specific service is manually chosen. Switch the filters to regex matching (=~"$service") or remove the All option so the dashboard shows data by default and the multi-select works.
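
A sketch of the regex form for one target; pairing it with an allValue of ".*" on the service variable (an assumption about how the variable would be adjusted) lets the All option expand to a match-everything pattern:

"targets": [
  {
    "query": "{ service.name =~ \"$service\" }",
    "refId": "A"
  }
]

With "includeAll": true and "allValue": ".*" on the service template variable, the dashboard shows data on first load; multi-value selections then interpolate as a regex alternation where the datasource supports it.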

Useful? React with 👍 / 👎.

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2a35c4 and 81d4ed0.

📒 Files selected for processing (9)
  • dashboards/dashboards.yaml (1 hunks)
  • ops/grafana/README.md (1 hunks)
  • ops/grafana/provisioning/alerting/contact-points.yaml (1 hunks)
  • ops/grafana/provisioning/alerting/notification-policies.yaml (1 hunks)
  • ops/grafana/provisioning/dashboards/backend-golden-signals.json (1 hunks)
  • ops/grafana/provisioning/dashboards/frontend-ux-metrics.json (1 hunks)
  • ops/grafana/provisioning/dashboards/sync-pipeline-kpis.json (1 hunks)
  • ops/grafana/provisioning/dashboards/tracing-explorer.json (1 hunks)
  • ops/prometheus/alerts.yml (1 hunks)
🧰 Additional context used
🪛 LanguageTool (ops/grafana/README.md)

  • [grammar] ~1-~1: Use correct spacing. Context: "# Grafana Provisioning This directory contains provisioning fil..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~3-~3: Use correct spacing. Context: "...d alerting configuration used by Paform. ## Dashboards * provisioning/dashboards/..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~5-~5: There might be a problem here. Context: "...iguration used by Paform. ## Dashboards * provisioning/dashboards/ includes JSON definitions for backend, s..." (QB_NEW_EN_MERGED_MATCH)
  • [grammar] ~7-~7: There might be a mistake here. Context: "...end UX, and tracing explorer dashboards. * The root dashboards/dashboards.yaml fi..." (QB_NEW_EN)
  • [grammar] ~8-~8: Use correct spacing. Context: "...rds are loaded automatically on startup. ## Alerting Contact Points * `provisioning..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~10-~10: Use correct spacing. Context: "... on startup. ## Alerting Contact Points * provisioning/alerting/contact-points.yaml defines Slack and email receivers. Set ..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~13-~13: Use correct spacing. Context: "...ing/alerting/notification-policies.yaml routes critical alerts to Slack and w..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~13-~13: Use correct spacing. Context: "...label matching if your topology changes. ## Prometheus Alerts * `../prometheus/aler..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~15-~15: Use correct spacing. Context: "... topology changes. ## Prometheus Alerts * ../prometheus/alerts.yml contains alerting rules that surface in..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
🔇 Additional comments (1)
ops/prometheus/alerts.yml (1)

15-23: Fix the SlowP95API histogram aggregation

histogram_quantile needs per-series bucket counts. You’re summing http_request_duration_seconds_bucket{job="backend"} by (le, route) but the denominator omits route, so the buckets no longer represent a single route and p95 becomes garbage. Either add route to the group-by (and match it in the legend) or drop the label entirely.
Example fix:

-        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
+        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8

…but make sure you actually keep route in both numerator and denominator. Right now it’s missing, so the alert will evaluate incorrectly. Please correct the aggregation.

Likely an incorrect or invalid review comment.
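
Setting aside whether that comment applies, a rule of this shape in ops/prometheus/alerts.yml might look roughly as follows for orientation; the expr is the one quoted above, while the for duration, severity, and annotation values are assumptions and the runbook URL is a placeholder:

groups:
  - name: backend-alerts               # group name assumed
    rules:
      - alert: SlowP95API
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
        for: 10m                       # duration assumed
        labels:
          severity: warning            # severity assumed
        annotations:
          summary: "p95 latency above 800ms on {{ $labels.route }}"
          runbook_url: https://example.com/runbooks/slow-p95-api   # placeholder

Because the aggregation keeps route alongside le, histogram_quantile evaluates one quantile per route and the route label remains available for annotations.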

Comment on lines +99 to +107
"targets": [
{
"expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
"legendFormat": "Failure rate",
"refId": "A"
}
],
"title": "Failure Rate",
"type": "timeseries"
Contributor


⚠️ Potential issue | 🟠 Major

Fix failure rate formula to report a true ratio.

The panel labeled “Failure Rate” plots sum(rate(sync_job_failures_total{…}[5m])), which is an absolute throughput of failing jobs (failures/sec). That value grows with traffic volume and is not a failure rate or percentage, so the panel misleads operators about pipeline health. Divide the failure counter by the total job counter over the same window (and scale/format as needed) to present an actual failure rate.

-          "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
-          "legendFormat": "Failure rate",
+          "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
+          "legendFormat": "Failure rate",
+          "unit": "percent",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"targets": [
{
"expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
"legendFormat": "Failure rate",
"refId": "A"
}
],
"title": "Failure Rate",
"type": "timeseries"
"targets": [
{
"expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
"legendFormat": "Failure rate",
"unit": "percent",
"refId": "A"
}
],
"title": "Failure Rate",
"type": "timeseries"
🤖 Prompt for AI Agents
In ops/grafana/provisioning/dashboards/sync-pipeline-kpis.json around lines 99
to 107, the "Failure Rate" panel currently plots the absolute failure throughput
(sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m]))) instead of a
true failure ratio; replace the expression with the ratio of failing jobs to
total jobs over the same window (e.g.
sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m])) /
sum(rate(sync_jobs_total{pipeline=~"$pipeline"}[5m]))) and adjust the panel’s
unit/scale (multiply by 100 and set percent formatting if you want a percentage)
so the chart shows a true failure rate rather than raw failures/sec.
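
As a follow-on note, when formatting the ratio as a percentage it is common to scale by 100 and set the unit in the panel's field configuration (recent Grafana dashboard schemas keep units under fieldConfig.defaults rather than on the target); a sketch, still assuming sync_jobs_total as the total-jobs counter:

"fieldConfig": {
  "defaults": { "unit": "percent" }
},
"targets": [
  {
    "expr": "100 * sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
    "legendFormat": "Failure rate",
    "refId": "A"
  }
]

Grafana's percent unit expects values in the 0-100 range, while percentunit expects a 0-1 ratio, so keep the expression and the chosen unit consistent.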

shayancoin merged commit ce6a12d into main on Oct 17, 2025
0 of 5 checks passed
