
Add observability dashboards and alerting configuration #98

Merged

shayancoin merged 3 commits into main from codex/create-grafana-dashboards-and-alerts on Oct 17, 2025

Conversation

@shayancoin
Owner

shayancoin commented Oct 16, 2025

Summary

  • add Grafana dashboards for backend golden signals, sync pipeline KPIs, frontend UX metrics, and tracing exploration
  • configure Grafana provisioning for dashboards plus Slack and email alert contact points
  • add Prometheus alert rules to align with Grafana alerting conventions

Testing

  • not run (configuration changes only)

https://chatgpt.com/codex/tasks/task_e_68f12ebc874c8330aaac4ab465b80e6b

Summary by CodeRabbit

  • New Features

    • Added observability dashboards for backend golden signals, frontend UX, sync pipeline KPIs, and tracing explorer; dashboards are auto-provisioned from the repo structure.
    • Added Prometheus alert rules covering backend errors, API latency, and frontend LCP.
  • New Features / Alerting

    • Configured alerting contact points (Slack and email) and notification policies routing alerts by severity; a contact-point sketch follows this list.
  • Documentation

    • Added Grafana provisioning and alerting setup documentation.
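
For reference, a minimal sketch of what provisioning/alerting/contact-points.yaml could look like; the receiver names slack-oncall and email-sre come from this PR, while the webhook URL, email address, and uid values are placeholders:

# Sketch only: receiver names are from the PR; url, addresses, and uids are placeholders.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-oncall
    receivers:
      - uid: slack-oncall
        type: slack
        settings:
          url: ${SLACK_WEBHOOK_URL}   # placeholder webhook, typically injected via env/secret
  - orgId: 1
    name: email-sre
    receivers:
      - uid: email-sre
        type: email
        settings:
          addresses: sre@example.com  # placeholder address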

@coderabbitai
Contributor

coderabbitai bot commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds Grafana provisioning and dashboard files, alerting contact points and notification policies, and Prometheus alerting rules to enable file-backed dashboards, Grafana alert routing (Slack/email), and Prometheus rule evaluation for backend/frontend alerts.

Changes

  • Grafana provisioning index & docs (dashboards/dashboards.yaml, ops/grafana/README.md): adds a Grafana provisioning entry enabling file-based dashboards (path: ops/grafana/provisioning/dashboards, foldersFromFilesStructure) and documents provisioning scope, alerting, and Prometheus alert integration; a provider sketch follows this list.
  • Grafana alerting configuration (ops/grafana/provisioning/alerting/contact-points.yaml, ops/grafana/provisioning/alerting/notification-policies.yaml): introduces contact points for Slack (slack-oncall) and email (email-sre), plus two notification policies routing critical alerts to Slack and warning alerts to email, grouped by alertname and service.
  • Grafana dashboards, provisioned files (ops/grafana/provisioning/dashboards/backend-golden-signals.json, frontend-ux-metrics.json, sync-pipeline-kpis.json, tracing-explorer.json): adds four JSON dashboards (Backend Golden Signals, Frontend UX Metrics, Sync Pipeline KPIs, and Tracing Explorer) with panels using Prometheus/Tempo/Loki datasources and templating variables.
  • Prometheus alerting rules (ops/prometheus/alerts.yml): adds alerting rules for backend HighErrorRate and SlowP95API and frontend FrontendLCPDegraded, with severity labels, durations, annotations, and runbook URLs.
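
For orientation, a file-based dashboard provider matching the first entry above might look roughly like this; the path and foldersFromFilesStructure option come from the summary, while the provider name and refresh interval are assumptions:

# dashboards/dashboards.yaml sketch: provider name and updateIntervalSeconds
# are assumed; path and foldersFromFilesStructure follow the summary above.
apiVersion: 1
providers:
  - name: paform-dashboards        # assumed provider name
    type: file
    allowUiUpdates: false
    updateIntervalSeconds: 30
    options:
      path: ops/grafana/provisioning/dashboards
      foldersFromFilesStructure: true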

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Grafana
    participant Filesystem as "File storage\nops/grafana/provisioning/dashboards"
    participant GrafanaProv as "Grafana\nProvisioning (dashboards.yaml)"
    participant AlertingProv as "Grafana Alerting\n(contact-points, policies)"
    participant Prometheus
    participant Alertmanager

    Note over Grafana,GrafanaProv: Grafana startup provisioning flow
    Grafana ->> GrafanaProv: read `dashboards/dashboards.yaml`
    GrafanaProv ->> Filesystem: load dashboards (foldersFromFilesStructure)
    Filesystem -->> Grafana: JSON dashboard files
    Grafana ->> Grafana: provision dashboards

    Note over Grafana,AlertingProv: Alerting provisioning
    Grafana ->> AlertingProv: load `contact-points.yaml` & `notification-policies.yaml`
    AlertingProv -->> Grafana: receivers configured (Slack, Email)

    Note over Prometheus,Alertmanager: Prometheus alert flow (added rules)
    Prometheus ->> Prometheus: evaluate `ops/prometheus/alerts.yml`
    Prometheus ->> Alertmanager: send alerts
    Alertmanager ->> Grafana: (optional) route alerts or Grafana uses its contact points
    Alertmanager ->> Slack: deliver critical (via configured webhook)
    Alertmanager ->> Email: deliver warning (via configured SMTP/email)
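
For completeness, the rule file added here is normally referenced from the Prometheus server configuration; a minimal sketch, where the alertmanager:9093 target is an assumption and only the rule file path is taken from this PR:

# prometheus.yml excerpt (sketch): the Alertmanager target is assumed.
rule_files:
  - ops/prometheus/alerts.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]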

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through JSON, YAML, and light,

Dashboards arranged in tidy sight,
Slack bells and emails ready to chime,
Metrics and traces dancing in time,
Observability nibbles carrots of rhyme.

Pre-merge checks

❌ Failed checks (1 warning)

  • Description Check (⚠️ Warning): the pull request description does not follow the repository's required template because it lacks the "PR Type," "Short Description," and "Tests Added" sections and instead uses "Summary" and "Testing" headings that do not match the template. Resolution: update the description to include the required "PR Type," "Short Description," and "Tests Added" sections using the repository's template headings and format, ensuring all template fields are filled out.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): the title succinctly and accurately describes the primary change of adding observability dashboards and alerting configuration, which aligns with the pull request's main updates.
  • Docstring Coverage (✅ Passed): no functions found in the changes; docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81d4ed0 and b602d20.

📒 Files selected for processing (2)
  • ops/grafana/provisioning/alerting/notification-policies.yaml (1 hunks)
  • ops/grafana/provisioning/dashboards/tracing-explorer.json (1 hunks)

Comment @coderabbitai help to get the list of available commands and usage tips.


chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines 8 to 9
matching_labels:
  severity: critical


P1: Use valid Grafana matcher keys in notification policies

The alert routing file uses a matching_labels map, but Grafana’s provisioning schema only recognizes matchers/object_matchers inside notification policy routes. Because the key is ignored, the first policy becomes the default route and both warning and critical alerts will be sent to Slack instead of splitting Slack vs email as intended. Update the configuration to use Grafana’s supported matcher fields (and, if necessary, nest the routes) so label-based routing actually takes effect across both entries.
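
A minimal sketch of the routing expressed with Grafana's supported matcher keys, assuming the slack-oncall and email-sre receiver names from contact-points.yaml and treating email-sre as the default route (an assumption):

# notification-policies.yaml sketch: object_matchers replaces matching_labels;
# the default receiver and group_by values are assumptions.
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-sre              # assumed default route
    group_by: ["alertname", "service"]
    routes:
      - receiver: slack-oncall
        object_matchers:
          - ["severity", "=", "critical"]
      - receiver: email-sre
        object_matchers:
          - ["severity", "=", "warning"]

With this shape, label-based routing applies to both entries and the Slack/email split behaves as intended.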

Useful? React with 👍 / 👎.

Comment on lines 36 to 39
"targets": [
{
"query": "{ service.name = \"$service\" } | trace_id != ''",
"refId": "A"


P1: Make tracing queries work when the Service variable is "All"

Every Tempo/Loki query compares service with = (e.g. { service.name = "$service" }), yet the variable is configured with includeAll: true and defaults to an empty/regex value. When the dashboard first loads the expression becomes service.name="", which matches nothing, so all panels render empty until a specific service is manually chosen. Switch the filters to regex matching (=~"$service") or remove the All option so the dashboard shows data by default and the multi-select works.
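
A sketch of the regex form for one target; pairing it with an allValue of ".*" on the service variable (an assumption about how the variable would be adjusted) lets the All option expand to a match-everything pattern:

"targets": [
  {
    "query": "{ service.name =~ \"$service\" }",
    "refId": "A"
  }
]

With "includeAll": true and "allValue": ".*" on the service template variable, the dashboard shows data on first load; multi-value selections then interpolate as a regex alternation where the datasource supports it.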

Useful? React with 👍 / 👎.

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2a35c4 and 81d4ed0.

📒 Files selected for processing (9)
  • dashboards/dashboards.yaml (1 hunks)
  • ops/grafana/README.md (1 hunks)
  • ops/grafana/provisioning/alerting/contact-points.yaml (1 hunks)
  • ops/grafana/provisioning/alerting/notification-policies.yaml (1 hunks)
  • ops/grafana/provisioning/dashboards/backend-golden-signals.json (1 hunks)
  • ops/grafana/provisioning/dashboards/frontend-ux-metrics.json (1 hunks)
  • ops/grafana/provisioning/dashboards/sync-pipeline-kpis.json (1 hunks)
  • ops/grafana/provisioning/dashboards/tracing-explorer.json (1 hunks)
  • ops/prometheus/alerts.yml (1 hunks)
🧰 Additional context used
🪛 LanguageTool (ops/grafana/README.md)

  • [grammar] ~1-~1: Use correct spacing. Context: "# Grafana Provisioning This directory contains provisioning fil..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~3-~3: Use correct spacing. Context: "...d alerting configuration used by Paform. ## Dashboards * provisioning/dashboards/..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~5-~5: There might be a problem here. Context: "...iguration used by Paform. ## Dashboards * provisioning/dashboards/ includes JSON definitions for backend, s..." (QB_NEW_EN_MERGED_MATCH)
  • [grammar] ~7-~7: There might be a mistake here. Context: "...end UX, and tracing explorer dashboards. * The root dashboards/dashboards.yaml fi..." (QB_NEW_EN)
  • [grammar] ~8-~8: Use correct spacing. Context: "...rds are loaded automatically on startup. ## Alerting Contact Points * `provisioning..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~10-~10: Use correct spacing. Context: "... on startup. ## Alerting Contact Points * provisioning/alerting/contact-points.yaml defines Slack and email receivers. Set ..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~13-~13: Use correct spacing. Context: "...ing/alerting/notification-policies.yaml routes critical alerts to Slack and w..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~13-~13: Use correct spacing. Context: "...label matching if your topology changes. ## Prometheus Alerts * `../prometheus/aler..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
  • [grammar] ~15-~15: Use correct spacing. Context: "... topology changes. ## Prometheus Alerts * ../prometheus/alerts.yml contains alerting rules that surface in..." (QB_NEW_EN_OTHER_ERROR_IDS_5)
🔇 Additional comments (1)
ops/prometheus/alerts.yml (1)

15-23: Fix the SlowP95API histogram aggregation

histogram_quantile needs per-series bucket counts. You’re summing http_request_duration_seconds_bucket{job="backend"} by (le, route) but the denominator omits route, so the buckets no longer represent a single route and p95 becomes garbage. Either add route to the group-by (and match it in the legend) or drop the label entirely.
Example fix:

-        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
+        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8

…but make sure you actually keep route in both numerator and denominator. Right now it’s missing, so the alert will evaluate incorrectly. Please correct the aggregation.

Likely an incorrect or invalid review comment.
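
Setting aside whether that comment applies, a rule of this shape in ops/prometheus/alerts.yml might look roughly as follows for orientation; the expr is the one quoted above, while the for duration, severity, and annotation values are assumptions and the runbook URL is a placeholder:

groups:
  - name: backend-alerts               # group name assumed
    rules:
      - alert: SlowP95API
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
        for: 10m                       # duration assumed
        labels:
          severity: warning            # severity assumed
        annotations:
          summary: "p95 latency above 800ms on {{ $labels.route }}"
          runbook_url: https://example.com/runbooks/slow-p95-api   # placeholder

Because the aggregation keeps route alongside le, histogram_quantile evaluates one quantile per route and the route label remains available for annotations.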

Comment on lines +99 to +107
"targets": [
{
"expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
"legendFormat": "Failure rate",
"refId": "A"
}
],
"title": "Failure Rate",
"type": "timeseries"
Contributor


⚠️ Potential issue | 🟠 Major

Fix failure rate formula to report a true ratio.

The panel labeled “Failure Rate” plots sum(rate(sync_job_failures_total{…}[5m])), which is an absolute throughput of failing jobs (failures/sec). That value grows with traffic volume and is not a failure rate or percentage, so the panel misleads operators about pipeline health. Divide the failure counter by the total job counter over the same window (and scale/format as needed) to present an actual failure rate.

-          "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
-          "legendFormat": "Failure rate",
+          "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
+          "legendFormat": "Failure rate",
+          "unit": "percent",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"targets": [
{
"expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
"legendFormat": "Failure rate",
"refId": "A"
}
],
"title": "Failure Rate",
"type": "timeseries"
"targets": [
{
"expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
"legendFormat": "Failure rate",
"unit": "percent",
"refId": "A"
}
],
"title": "Failure Rate",
"type": "timeseries"
🤖 Prompt for AI Agents
In ops/grafana/provisioning/dashboards/sync-pipeline-kpis.json around lines 99
to 107, the "Failure Rate" panel currently plots the absolute failure throughput
(sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m]))) instead of a
true failure ratio; replace the expression with the ratio of failing jobs to
total jobs over the same window (e.g.
sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m])) /
sum(rate(sync_jobs_total{pipeline=~"$pipeline"}[5m]))) and adjust the panel’s
unit/scale (multiply by 100 and set percent formatting if you want a percentage)
so the chart shows a true failure rate rather than raw failures/sec.
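
As a follow-on note, when formatting the ratio as a percentage it is common to scale by 100 and set the unit in the panel's field configuration (recent Grafana dashboard schemas keep units under fieldConfig.defaults rather than on the target); a sketch, still assuming sync_jobs_total as the total-jobs counter:

"fieldConfig": {
  "defaults": { "unit": "percent" }
},
"targets": [
  {
    "expr": "100 * sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
    "legendFormat": "Failure rate",
    "refId": "A"
  }
]

Grafana's percent unit expects values in the 0-100 range, while percentunit expects a 0-1 ratio, so keep the expression and the chosen unit consistent.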

shayancoin merged commit ce6a12d into main on Oct 17, 2025
0 of 5 checks passed
