Add observability dashboards and alerting configuration #98
Conversation
> **Note:** Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

**Walkthrough:** Adds Grafana provisioning and dashboard files, alerting contact points and notification policies, and Prometheus alerting rules to enable file-backed dashboards, Grafana alert routing (Slack/email), and Prometheus rule evaluation for backend/frontend alerts.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    participant Grafana
    participant Filesystem as File storage<br/>ops/grafana/provisioning/dashboards
    participant GrafanaProv as Grafana<br/>Provisioning (dashboards.yaml)
    participant AlertingProv as Grafana Alerting<br/>(contact-points, policies)
    participant Prometheus
    participant Alertmanager
    Note over Grafana,GrafanaProv: Grafana startup provisioning flow
    Grafana ->> GrafanaProv: read dashboards/dashboards.yaml
    GrafanaProv ->> Filesystem: load dashboards (foldersFromFilesStructure)
    Filesystem -->> Grafana: JSON dashboard files
    Grafana ->> Grafana: provision dashboards
    Note over Grafana,AlertingProv: Alerting provisioning
    Grafana ->> AlertingProv: load contact-points.yaml & notification-policies.yaml
    AlertingProv -->> Grafana: receivers configured (Slack, Email)
    Note over Prometheus,Alertmanager: Prometheus alert flow (added rules)
    Prometheus ->> Prometheus: evaluate ops/prometheus/alerts.yml
    Prometheus ->> Alertmanager: send alerts
    Alertmanager ->> Grafana: (optional) route alerts or Grafana uses its contact points
    Alertmanager ->> Slack: deliver critical (via configured webhook)
    Alertmanager ->> Email: deliver warning (via configured SMTP/email)
```
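The provisioning flow above relies on a file-based dashboard provider. A minimal sketch of what `dashboards/dashboards.yaml` might contain follows; the provider name and path are assumptions, while `foldersFromFilesStructure` comes from the diagram:

```yaml
# dashboards/dashboards.yaml (sketch, not the actual file from this PR)
apiVersion: 1
providers:
  - name: paform-dashboards        # assumed provider name
    type: file
    options:
      # assumed container path where the JSON dashboards are mounted
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true
```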
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
**Pre-merge checks and finishing touches**

- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)

📜 Recent review details — Configuration used: CodeRabbit UI · Review profile: ASSERTIVE · Plan: Pro · 📒 Files selected for processing (2)
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting
```yaml
matching_labels:
  severity: critical
```
Use valid Grafana matcher keys in notification policies
The alert routing file uses a `matching_labels` map, but Grafana's provisioning schema only recognizes `matchers`/`object_matchers` inside notification policy routes. Because the key is ignored, the first policy becomes the default route and both warning and critical alerts will be sent to Slack instead of splitting Slack vs email as intended. Update the configuration to use Grafana's supported matcher fields (and, if necessary, nest the routes) so label-based routing actually takes effect across both entries.
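A minimal sketch of label-based routing using Grafana's supported `object_matchers` field. The receiver names (`slack-critical`, `email-warning`) are assumptions; substitute the contact points actually defined in `contact-points.yaml`:

```yaml
# ops/grafana/provisioning/alerting/notification-policies.yaml (sketch)
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-warning          # assumed default receiver name
    routes:
      - receiver: slack-critical     # assumed receiver name
        object_matchers:
          - ["severity", "=", "critical"]
      - receiver: email-warning
        object_matchers:
          - ["severity", "=", "warning"]
```

`object_matchers` takes `[label, operator, value]` triples, so each route only fires for alerts carrying the matching `severity` label.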
```json
"targets": [
  {
    "query": "{ service.name = \"$service\" } | trace_id != ''",
    "refId": "A"
  }
]
```
Make tracing queries work when the Service variable is “All”
Every Tempo/Loki query compares service with `=` (e.g. `{ service.name = "$service" }`), yet the variable is configured with `includeAll: true` and defaults to an empty/regex value. When the dashboard first loads the expression becomes `service.name=""`, which matches nothing, so all panels render empty until a specific service is manually chosen. Switch the filters to regex matching (`=~"$service"`) or remove the All option so the dashboard shows data by default and the multi-select works.
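As a sketch, the regex form of the Tempo target would look like this; only the operator changes, and the surrounding panel JSON is abbreviated:

```json
"targets": [
  {
    "query": "{ service.name =~ \"$service\" } | trace_id != ''",
    "refId": "A"
  }
]
```

With `=~`, a multi-select or "All" value (which Grafana expands to a regex alternation like `svc-a|svc-b`) still matches, so the panels populate on first load.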
Actionable comments posted: 1
📜 Review details

- Configuration used: CodeRabbit UI
- Review profile: ASSERTIVE
- Plan: Pro

📒 Files selected for processing (9)

- `dashboards/dashboards.yaml` (1 hunks)
- `ops/grafana/README.md` (1 hunks)
- `ops/grafana/provisioning/alerting/contact-points.yaml` (1 hunks)
- `ops/grafana/provisioning/alerting/notification-policies.yaml` (1 hunks)
- `ops/grafana/provisioning/dashboards/backend-golden-signals.json` (1 hunks)
- `ops/grafana/provisioning/dashboards/frontend-ux-metrics.json` (1 hunks)
- `ops/grafana/provisioning/dashboards/sync-pipeline-kpis.json` (1 hunks)
- `ops/grafana/provisioning/dashboards/tracing-explorer.json` (1 hunks)
- `ops/prometheus/alerts.yml` (1 hunks)
🧰 Additional context used
🪛 LanguageTool — `ops/grafana/README.md`

- [grammar] ~1: Use correct spacing — "# Grafana Provisioning This directory contains provisioning fil…" (QB_NEW_EN_OTHER_ERROR_IDS_5)
- [grammar] ~3: Use correct spacing — "…d alerting configuration used by Paform. ## Dashboards * provisioning/dashboards/…" (QB_NEW_EN_OTHER_ERROR_IDS_5)
- [grammar] ~5: There might be a problem here — "…iguration used by Paform. ## Dashboards * `provisioning/dashboards/` includes JSON definitions for backend, s…" (QB_NEW_EN_MERGED_MATCH)
- [grammar] ~7: There might be a mistake here — "…end UX, and tracing explorer dashboards. * The root `dashboards/dashboards.yaml` fi…" (QB_NEW_EN)
- [grammar] ~8: Use correct spacing — "…rds are loaded automatically on startup. ## Alerting Contact Points * provisioning…" (QB_NEW_EN_OTHER_ERROR_IDS_5)
- [grammar] ~10: Use correct spacing — "… on startup. ## Alerting Contact Points * `provisioning/alerting/contact-points.yaml` defines Slack and email receivers. Set …" (QB_NEW_EN_OTHER_ERROR_IDS_5)
- [grammar] ~13: Use correct spacing — "…ing/alerting/notification-policies.yaml` routes `critical` alerts to Slack and `w…" (QB_NEW_EN_OTHER_ERROR_IDS_5)
- [grammar] ~13: Use correct spacing — "…label matching if your topology changes. ## Prometheus Alerts * `../prometheus/aler…" (QB_NEW_EN_OTHER_ERROR_IDS_5)
- [grammar] ~15: Use correct spacing — "… topology changes. ## Prometheus Alerts * `../prometheus/alerts.yml` contains alerting rules that surface in…" (QB_NEW_EN_OTHER_ERROR_IDS_5)
🔇 Additional comments (1)
ops/prometheus/alerts.yml (1)
**15-23: Fix the SlowP95API histogram aggregation**

`histogram_quantile` needs per-series bucket counts. You're summing `http_request_duration_seconds_bucket{job="backend"}` by `(le, route)` but the denominator omits `route`, so the buckets no longer represent a single route and p95 becomes garbage. Either add `route` to the group-by (and match it in the legend) or drop the label entirely.

Example fix:

```diff
- expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
+ expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
```

…but make sure you actually keep `route` in both numerator and denominator. Right now it's missing, so the alert will evaluate incorrectly. Please correct the aggregation.

Likely an incorrect or invalid review comment.
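For reference, a per-route p95 alert would be shaped roughly as below. The metric name and `> 0.8` threshold come from the review excerpt; the rule wrapper (`for`, labels, annotations) is an assumption about how `ops/prometheus/alerts.yml` is structured:

```yaml
# ops/prometheus/alerts.yml (sketch) — keep `le` in the group-by so
# histogram_quantile still sees per-bucket series, alongside `route`
groups:
  - name: backend
    rules:
      - alert: SlowP95API
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="backend"}[5m])) by (le, route)) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 800ms on {{ $labels.route }}"
```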
```json
"targets": [
  {
    "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
    "legendFormat": "Failure rate",
    "refId": "A"
  }
],
"title": "Failure Rate",
"type": "timeseries"
```
Fix failure rate formula to report a true ratio.
The panel labeled "Failure Rate" plots `sum(rate(sync_job_failures_total{…}[5m]))`, which is an absolute throughput of failing jobs (failures/sec). That value grows with traffic volume and is not a failure rate or percentage, so the panel misleads operators about pipeline health. Divide the failure counter by the total job counter over the same window (and scale/format as needed) to present an actual failure rate.
```diff
- "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m]))",
- "legendFormat": "Failure rate",
+ "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
+ "legendFormat": "Failure rate",
+ "unit": "percent",
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```json
"targets": [
  {
    "expr": "sum(rate(sync_job_failures_total{pipeline=~\"$pipeline\"}[5m])) / sum(rate(sync_jobs_total{pipeline=~\"$pipeline\"}[5m]))",
    "legendFormat": "Failure rate",
    "unit": "percent",
    "refId": "A"
  }
],
"title": "Failure Rate",
"type": "timeseries"
```
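If a percentage display is preferred over Grafana's unit formatting, the ratio can be scaled in the query itself. This is a sketch that assumes a `sync_jobs_total` counter exists alongside the failure counter, as the suggestion above does:

```promql
# percentage of sync jobs failing over the last 5 minutes
100 *
  sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m]))
/
  sum(rate(sync_jobs_total{pipeline=~"$pipeline"}[5m]))
```

Note the window (`[5m]`) must match in numerator and denominator so both rates cover the same interval.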
🤖 Prompt for AI Agents
```text
In ops/grafana/provisioning/dashboards/sync-pipeline-kpis.json around lines 99
to 107, the "Failure Rate" panel currently plots the absolute failure throughput
(sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m]))) instead of a
true failure ratio; replace the expression with the ratio of failing jobs to
total jobs over the same window (e.g.
sum(rate(sync_job_failures_total{pipeline=~"$pipeline"}[5m])) /
sum(rate(sync_jobs_total{pipeline=~"$pipeline"}[5m]))) and adjust the panel's
unit/scale (multiply by 100 and set percent formatting if you want a percentage)
so the chart shows a true failure rate rather than raw failures/sec.
```
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_68f12ebc874c8330aaac4ab465b80e6b
**Summary by CodeRabbit**

- New Features
- New Features / Alerting
- Documentation