
Alerting rules #19

Merged
lgahdl merged 2 commits into jefferson/cow-616-cow-593-task-2-remaining-dashboards-resources-comparison from jefferson/cow-598-13-alerting-rules
Feb 25, 2026

Conversation

@jeffersonBastos
Collaborator

@jeffersonBastos jeffersonBastos commented Feb 13, 2026

Summary

Implements Prometheus alerting rules for the CoW Performance Testing Suite (COW-598). Adds 7 core alerts that fire when performance degrades, error rates spike, or resource utilization exceeds thresholds during performance testing. Firing alerts are visible in Prometheus and as annotations on the Grafana dashboard.

Approach: Option A (Prometheus alerting rules + Grafana visualization) - no Alertmanager required.

Alerts Implemented

Alert                      Severity  Trigger
HighSubmissionLatency      Warning   P95 > 5s for 2m
CriticalSubmissionLatency  Critical  P95 > 10s for 1m
HighErrorRate              Critical  Error rate > 5% for 1m
LowThroughput              Warning   Actual < 80% target for 2m
TestStalled                Critical  No orders for 1m during active test
HighCPUUsage               Warning   CPU > 80% for 5m
CriticalMemoryUsage        Critical  Memory > 95% for 2m
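
Each trigger in the table pairs a threshold with a hold duration ("for 2m"). The sketch below is a toy model of that behavior, not the rule file from this PR: the condition must stay true for the whole duration before the alert moves from pending to firing, and any sample below the threshold resets the timer. The class name `ForClauseAlert` and the sample data are hypothetical, and real Prometheus evaluates rules on its own evaluation interval rather than per sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForClauseAlert:
    """Toy model of an alerting rule with a hold duration (hypothetical helper)."""

    threshold: float                        # condition is: value > threshold
    for_seconds: float                      # how long the condition must hold
    breach_started: Optional[float] = None  # timestamp when the breach began

    def observe(self, timestamp: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this sample."""
        if value <= self.threshold:
            # Condition cleared: the pending timer resets.
            self.breach_started = None
            return "inactive"
        if self.breach_started is None:
            self.breach_started = timestamp
        held_for = timestamp - self.breach_started
        return "firing" if held_for >= self.for_seconds else "pending"


# HighSubmissionLatency from the table: P95 > 5s for 2m.
alert = ForClauseAlert(threshold=5.0, for_seconds=120.0)

# (timestamp in seconds, P95 latency in seconds); made-up samples.
samples = [(0, 4.0), (15, 6.0), (75, 7.0), (105, 4.5), (120, 6.5), (240, 6.5)]
states = [alert.observe(t, v) for t, v in samples]
print(states)
```

In this run the brief dip at t=105 resets the timer, so the alert only reaches firing once the breach that began at t=120 has lasted the full two minutes.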

Changes

New Files

  • configs/prometheus/alerts/performance-testing.yml - Alert rules with documented parameters at top for easy customization (TODO(COW-617) for future configurability)

Modified Files

  • configs/prometheus.yml - Enable rule_files: to load alert rules
  • docker-compose.yml - Mount alerts directory in Prometheus container
  • configs/dashboards/performance.json - Add alert annotations to show firing alerts
  • src/cow_performance/prometheus/metrics.py - Add cow_perf_container_memory_percent gauge
  • src/cow_performance/prometheus/exporter.py - Export memory percentage metric
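
The PR does not show how the exporter derives the value behind cow_perf_container_memory_percent. A minimal sketch, assuming the percentage is computed as usage divided by the container limit; the function name `memory_percent`, the clamping, and the zero-limit handling are assumptions for illustration, not code from this repository:

```python
def memory_percent(usage_bytes: int, limit_bytes: int) -> float:
    """Return memory usage as a percentage of the limit, clamped to 0..100."""
    if limit_bytes <= 0:
        # Assumed handling when no limit is reported: avoid dividing by zero.
        return 0.0
    return max(0.0, min(100.0, 100.0 * usage_bytes / limit_bytes))


# Threshold taken from the CriticalMemoryUsage alert (Memory > 95%).
CRITICAL_MEMORY_THRESHOLD = 95.0

# Made-up example: 1950 MiB used out of a 2048 MiB limit.
value = memory_percent(usage_bytes=1950 * 1024**2, limit_bytes=2048 * 1024**2)
breached = value > CRITICAL_MEMORY_THRESHOLD
print(f"{value:.2f}% breached={breached}")
```

A value like this would be what the gauge exposes, and the alert rule would compare it against the 95% threshold over its 2m window.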

Documentation

  • thoughts/plans/2026-02-13-cow-598-alerting-rules.md - Implementation plan
  • thoughts/tickets/COW-598-alerting-rules.md - Updated with implementation notes
  • thoughts/INDEX.md - Updated with plan reference

How to Test

  1. Start the monitoring stack:

    docker compose --profile monitoring up -d
  2. Verify alert rules loaded in Prometheus:

    curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
    # Expected: "cow_performance_testing"
  3. Check the Prometheus alerts page:

    open http://localhost:9090/alerts
  4. Run a test to generate metrics and verify Grafana annotations:

    cow-perf run --prometheus-port 9091 --duration 120
    open http://localhost:3000  # Navigate to Performance Overview dashboard
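
Step 2 can also be scripted, for example in a smoke test. The sketch below mirrors the curl and jq command using only the Python standard library; the helper names are hypothetical, and it assumes Prometheus is reachable at the same localhost:9090 address used above.

```python
import json
from urllib.request import urlopen


def rule_group_names(payload: dict) -> list[str]:
    """Extract rule group names from a /api/v1/rules response body."""
    return [group["name"] for group in payload.get("data", {}).get("groups", [])]


def fetch_rule_group_names(base_url: str = "http://localhost:9090") -> list[str]:
    """Query a running Prometheus and return the names of loaded rule groups."""
    with urlopen(f"{base_url}/api/v1/rules", timeout=5) as response:
        return rule_group_names(json.load(response))


# Offline example with a trimmed payload in the shape the rules API returns.
sample = {
    "status": "success",
    "data": {"groups": [{"name": "cow_performance_testing", "rules": []}]},
}
names = rule_group_names(sample)
print(names)
```

With the monitoring stack running, `"cow_performance_testing" in fetch_rule_group_names()` would be the scripted equivalent of the expected output in step 2.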

Checklist

  • Tests pass (poetry run pytest tests/unit/)
  • Linting passes (poetry run ruff check src/cow_performance/prometheus/)
  • Type checking passes (poetry run mypy src/cow_performance/prometheus/)
  • YAML config files valid
  • JSON dashboard files valid
  • Docker Compose config valid
  • Manual: Prometheus loads alert rules (requires Docker services)
  • Manual: Grafana shows alert annotations (requires Docker services)

Scope Decisions

Reduced scope (7 alerts instead of 15+) based on:

  • Grant proposal only specifies "alerting rules" without detail
  • Core alerts cover the most critical conditions
  • Alertmanager not required (no Slack/email notifications)
  • Thresholds hardcoded but organized for easy modification

Deferred to COW-617: Configurable thresholds via TOML/env variables
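
As a rough illustration of what the deferred work might look like, the sketch below overlays environment variables on the hardcoded defaults from this PR. It is not part of this change: the key names, the `COW_PERF_ALERT_` prefix, and the function `load_thresholds` are all hypothetical, and how the resulting values would be written into the rule file is left out.

```python
import os
from typing import Mapping

# Defaults taken from the alert table in this PR; key names are hypothetical.
DEFAULT_THRESHOLDS = {
    "high_latency_seconds": 5.0,
    "critical_latency_seconds": 10.0,
    "error_rate_percent": 5.0,
    "throughput_percent_of_target": 80.0,
    "cpu_percent": 80.0,
    "memory_percent": 95.0,
}


def load_thresholds(
    env: Mapping[str, str] = os.environ,
    prefix: str = "COW_PERF_ALERT_",
) -> dict[str, float]:
    """Return the defaults, overridden by any PREFIX + KEY variables in env."""
    thresholds = dict(DEFAULT_THRESHOLDS)
    for key in thresholds:
        raw = env.get(prefix + key.upper())
        if raw is not None:
            # float() raises ValueError on malformed input, surfacing bad config early.
            thresholds[key] = float(raw)
    return thresholds


# Example: raise only the CPU threshold, keep everything else at its default.
overridden = load_thresholds({"COW_PERF_ALERT_CPU_PERCENT": "90"})
print(overridden["cpu_percent"], overridden["memory_percent"])
```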

Breaking Changes

None

Related Issues

  • Implements: COW-598
  • Related: COW-617 (threshold configuration - future work)
  • Depends on: COW-591 (Prometheus Exporters), COW-593 (Grafana Dashboards)

🤖 Generated with Claude Code

jeffersonBastos and others added 2 commits February 13, 2026 17:35
- Create 7 core alerting rules (latency, error rate, throughput, resources, test execution)
- Enable rule_files in Prometheus configuration
- Add alerts volume mount in Docker Compose
- Add Grafana annotations to show firing alerts on dashboard
- Add container_memory_percent metric for CriticalMemoryUsage alert
- Add implementation plan: thoughts/plans/2026-02-13-cow-598-alerting-rules.md
- Add implementation notes to ticket file documenting scope decisions
- Update INDEX.md with plan entry and document cluster reference

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@jeffersonBastos jeffersonBastos force-pushed the jefferson/cow-598-13-alerting-rules branch from 58b6816 to 721d397 Compare February 13, 2026 21:09
@jeffersonBastos jeffersonBastos marked this pull request as ready for review February 13, 2026 21:09
@lgahdl lgahdl force-pushed the jefferson/cow-598-13-alerting-rules branch from d1e8e58 to 721d397 Compare February 16, 2026 23:06
@lgahdl lgahdl merged commit 5184210 into jefferson/cow-616-cow-593-task-2-remaining-dashboards-resources-comparison Feb 25, 2026
17 checks passed
