Commit 5184210

Merge pull request #19 from bleu/jefferson/cow-598-13-alerting-rules
Alerting rules
2 parents cc00167 + 721d397 commit 5184210

9 files changed: +997 -5 lines changed

configs/dashboards/performance.json

Lines changed: 26 additions & 1 deletion
@@ -1,6 +1,31 @@
 {
   "annotations": {
-    "list": []
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": {
+          "type": "grafana",
+          "uid": "-- Grafana --"
+        },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      },
+      {
+        "datasource": {
+          "type": "prometheus",
+          "uid": "prometheus"
+        },
+        "enable": true,
+        "expr": "ALERTS{alertstate=\"firing\", component=\"cow-performance-testing\"}",
+        "iconColor": "red",
+        "name": "Firing Alerts",
+        "tagKeys": "alertname,severity",
+        "titleFormat": "{{ alertname }}"
+      }
+    ]
   },
   "editable": true,
   "fiscalYearStartMonth": 0,

configs/prometheus.yml

Lines changed: 5 additions & 4 deletions
@@ -70,11 +70,12 @@ scrape_configs:
     # Fail gracefully if exporter not running
     scrape_timeout: 5s
 
-# Optional: Add alerting rules
-# rule_files:
-#   - "/etc/prometheus/alerts/*.yml"
+# Alert rule files
+rule_files:
+  - "/etc/prometheus/alerts/*.yml"
 
-# Optional: Configure Alertmanager
+# Note: Alertmanager not configured - alerts visible in Prometheus UI and Grafana only
+# To enable Alertmanager notifications, uncomment below and add alertmanager service:
 # alerting:
 #   alertmanagers:
 #     - static_configs:
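
Because Alertmanager is not configured, firing alerts have to be read from Prometheus itself. The following is a minimal Python sketch of doing that through the standard `/api/v1/alerts` HTTP endpoint. It assumes Prometheus is reachable at `http://localhost:9090` (the port comes from `docker-compose.yml`, the host is an assumption), and the helper names `firing_alerts` and `fetch_alerts` are hypothetical, not part of this commit.

```python
import json
import urllib.request

# Assumption: Prometheus is published on localhost:9090.
PROMETHEUS_URL = "http://localhost:9090"


def firing_alerts(payload: dict, component: str) -> list:
    """Return the firing alerts whose `component` label matches.

    `payload` is the decoded JSON body of GET /api/v1/alerts.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("component") == component
    ]


def fetch_alerts(base_url: str = PROMETHEUS_URL) -> dict:
    """Fetch the current alert list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


# Offline example: a hand-written payload in the shape the API returns.
sample_payload = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "HighErrorRate", "component": "cow-performance-testing"},
                "state": "firing",
            },
            {
                "labels": {"alertname": "HighCPUUsage", "component": "cow-performance-testing"},
                "state": "pending",
            },
            {
                "labels": {"alertname": "Unrelated", "component": "other"},
                "state": "firing",
            },
        ]
    },
}
firing_names = [
    alert["labels"]["alertname"]
    for alert in firing_alerts(sample_payload, "cow-performance-testing")
]
```

Against a running stack, `firing_alerts(fetch_alerts(), "cow-performance-testing")` should return the same alerts that the `ALERTS{alertstate="firing", ...}` annotation query in the dashboard selects.
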
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
+# =============================================================================
+# CoW Performance Testing Suite - Prometheus Alert Rules
+# =============================================================================
+#
+# This file defines alerting rules for the CoW Performance Testing Suite.
+# Alerts are evaluated by Prometheus and can be viewed in the Prometheus UI
+# or visualized in Grafana dashboards.
+#
+# =============================================================================
+# ALERT PARAMETERS - Reference only; edit the matching values in the rules below
+# =============================================================================
+#
+# TODO(COW-617): Move these thresholds to configurable TOML/env variables
+#
+# LATENCY THRESHOLDS (seconds):
+#   submission_latency_warning_threshold: 5     # P95 > 5s triggers warning
+#   submission_latency_critical_threshold: 10   # P95 > 10s triggers critical
+#
+# ERROR RATE THRESHOLDS (decimal, where 0.05 = 5%):
+#   error_rate_critical_threshold: 0.05         # > 5% error rate
+#
+# THROUGHPUT THRESHOLDS (ratio, where 0.8 = 80%):
+#   throughput_low_threshold: 0.8               # < 80% of target rate
+#
+# RESOURCE THRESHOLDS (percentage):
+#   cpu_warning_threshold: 80                   # CPU > 80%
+#   memory_critical_threshold: 95               # Memory > 95%
+#
+# ALERT DURATIONS (prevents flapping):
+#   latency_warning_for: 2m
+#   latency_critical_for: 1m
+#   error_rate_for: 1m
+#   throughput_for: 2m
+#   cpu_for: 5m
+#   memory_for: 2m
+#   test_stalled_for: 1m
+#
+# =============================================================================
+
+groups:
+  - name: cow_performance_testing
+    # Evaluation interval inherited from global config (5s)
+    rules:
+      # =========================================================================
+      # LATENCY ALERTS
+      # =========================================================================
+
+      # High Submission Latency (Warning)
+      # Triggers when P95 submission latency exceeds warning threshold
+      - alert: HighSubmissionLatency
+        expr: |
+          histogram_quantile(0.95,
+            sum(rate(cow_perf_submission_latency_seconds_bucket[1m])) by (le, scenario)
+          ) > 5
+        for: 2m
+        labels:
+          severity: warning
+          component: cow-performance-testing
+          category: latency
+        annotations:
+          summary: "High submission latency detected"
+          description: "P95 submission latency is {{ $value | printf \"%.2f\" }}s (threshold: 5s) for scenario {{ $labels.scenario }}"
+          runbook: "Check API logs, verify network connectivity, review recent code changes"
+
+      # Critical Submission Latency (Critical)
+      # Triggers when P95 submission latency exceeds critical threshold
+      - alert: CriticalSubmissionLatency
+        expr: |
+          histogram_quantile(0.95,
+            sum(rate(cow_perf_submission_latency_seconds_bucket[1m])) by (le, scenario)
+          ) > 10
+        for: 1m
+        labels:
+          severity: critical
+          component: cow-performance-testing
+          category: latency
+        annotations:
+          summary: "Critical submission latency - immediate attention required"
+          description: "P95 submission latency is {{ $value | printf \"%.2f\" }}s (threshold: 10s) for scenario {{ $labels.scenario }}"
+          runbook: "Immediate action: Check API health, container resources, database connections"
+
+      # =========================================================================
+      # ERROR RATE ALERTS
+      # =========================================================================
+
+      # High Error Rate (Critical)
+      # Triggers when order failure rate exceeds threshold
+      - alert: HighErrorRate
+        expr: |
+          (
+            sum(rate(cow_perf_orders_failed_total[5m])) by (scenario)
+            /
+            sum(rate(cow_perf_orders_submitted_total[5m])) by (scenario)
+          ) > 0.05
+        for: 1m
+        labels:
+          severity: critical
+          component: cow-performance-testing
+          category: errors
+        annotations:
+          summary: "High error rate detected"
+          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%) for scenario {{ $labels.scenario }}"
+          runbook: "Check order validation errors, API error responses, contract state"
+
+      # =========================================================================
+      # THROUGHPUT ALERTS
+      # =========================================================================
+
+      # Low Throughput (Warning)
+      # Triggers when actual throughput falls below target
+      - alert: LowThroughput
+        expr: |
+          (
+            cow_perf_actual_rate
+            /
+            cow_perf_target_rate
+          ) < 0.8
+          and cow_perf_target_rate > 0
+        for: 2m
+        labels:
+          severity: warning
+          component: cow-performance-testing
+          category: throughput
+        annotations:
+          summary: "Low throughput - not meeting target rate"
+          description: "Actual throughput is {{ $value | humanizePercentage }} of target for scenario {{ $labels.scenario }}"
+          runbook: "Check for bottlenecks: API rate limits, network latency, resource constraints"
+
+      # =========================================================================
+      # TEST EXECUTION ALERTS
+      # =========================================================================
+
+      # Test Stalled (Critical)
+      # Triggers when no orders are being submitted during an active test (progress is the left-hand operand so $value reports it)
+      - alert: TestStalled
+        expr: |
+          cow_perf_test_progress_percent > 0
+          and
+          cow_perf_test_progress_percent < 100
+          and
+          rate(cow_perf_orders_submitted_total[1m]) == 0
+        for: 1m
+        labels:
+          severity: critical
+          component: cow-performance-testing
+          category: test-execution
+        annotations:
+          summary: "Performance test appears to be stalled"
+          description: "No orders submitted in the last minute for scenario {{ $labels.scenario }} (progress: {{ $value }}%)"
+          runbook: "Check test process, verify API connectivity, review error logs"
+
+      # =========================================================================
+      # RESOURCE ALERTS
+      # =========================================================================
+
+      # High CPU Usage (Warning)
+      # Triggers when container CPU usage is high
+      - alert: HighCPUUsage
+        expr: |
+          cow_perf_container_cpu_percent > 80
+        for: 5m
+        labels:
+          severity: warning
+          component: cow-performance-testing
+          category: resources
+        annotations:
+          summary: "High CPU usage on {{ $labels.container }}"
+          description: "CPU usage is {{ $value | printf \"%.1f\" }}% (threshold: 80%) on container {{ $labels.container }}"
+          runbook: "Consider scaling resources, check for inefficient operations, review container limits"
+
+      # Critical Memory Usage (Critical)
+      # Triggers when container memory usage approaches limit
+      - alert: CriticalMemoryUsage
+        expr: |
+          cow_perf_container_memory_percent > 95
+        for: 2m
+        labels:
+          severity: critical
+          component: cow-performance-testing
+          category: resources
+        annotations:
+          summary: "Critical memory usage on {{ $labels.container }}"
+          description: "Memory usage is {{ $value | printf \"%.1f\" }}% (threshold: 95%) on container {{ $labels.container }}"
+          runbook: "Immediate action: Check for memory leaks, increase container memory limit, restart if necessary"
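
The latency rules rely on `histogram_quantile(0.95, ...)`, which estimates a percentile from cumulative bucket counts rather than from raw samples. Below is a simplified Python sketch of that estimate, assuming linear interpolation inside the selected bucket, as Prometheus does for classic histograms. The bucket boundaries and counts are made-up illustration values, not the exporter's real buckets, and edge cases such as `q = 0` or NaN inputs are left out.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket, so report the
                # highest finite bound instead of extrapolating.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Illustrative cumulative counts for 100 observations (assumed values).
example_buckets = [(1.0, 50), (2.5, 80), (5.0, 90), (10.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)  # about 8.125 seconds
```

With these example counts the estimate is about 8.125s, which would satisfy the `> 5` warning condition but not the `> 10` critical one. The result can never be more precise than the bucket layout, so thresholds that sit between two bucket bounds depend on the interpolation.
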

docker-compose.yml

Lines changed: 1 addition & 0 deletions
@@ -245,6 +245,7 @@ services:
       - "9090:9090"
     volumes:
       - ./configs/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - ./configs/prometheus/alerts:/etc/prometheus/alerts:ro
       - prometheus_data:/prometheus
     profiles:
       - monitoring

src/cow_performance/prometheus/exporter.py

Lines changed: 6 additions & 0 deletions
@@ -272,6 +272,9 @@ def _update_resource_metrics(self, metric: object) -> None:
             self._metrics.container_memory_bytes.labels(container=container_name).set(
                 sample.memory_bytes
             )
+            self._metrics.container_memory_percent.labels(container=container_name).set(
+                sample.memory_percent
+            )
             self._metrics.container_network_rx_bytes.labels(container=container_name).set(
                 sample.network_rx_bytes
             )
@@ -406,10 +409,13 @@ def update_container_resources(
         memory_bytes: int,
         network_rx_bytes: int = 0,
         network_tx_bytes: int = 0,
+        memory_percent: float | None = None,
     ) -> None:
         """Update resource metrics for a container."""
         self._metrics.container_cpu_percent.labels(container=container).set(cpu_percent)
         self._metrics.container_memory_bytes.labels(container=container).set(memory_bytes)
+        if memory_percent is not None:
+            self._metrics.container_memory_percent.labels(container=container).set(memory_percent)
         self._metrics.container_network_rx_bytes.labels(container=container).set(network_rx_bytes)
         self._metrics.container_network_tx_bytes.labels(container=container).set(network_tx_bytes)
 
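
`update_container_resources` now accepts an optional `memory_percent`, so a caller has to derive that value from usage and limit. The following is a minimal sketch of one way to do it, assuming the caller knows the container's memory limit in bytes. `memory_percent_of_limit` is a hypothetical helper, not part of this commit, and clamping to the 0-100 range is a design assumption based on the gauge's help text.

```python
from typing import Optional


def memory_percent_of_limit(memory_bytes: int, limit_bytes: Optional[int]) -> Optional[float]:
    """Return memory usage as a 0-100 percentage of the limit.

    Returns None when no usable limit is known, so the caller can skip
    the percentage gauge instead of exporting a misleading value.
    """
    if limit_bytes is None or limit_bytes <= 0:
        return None
    percent = 100.0 * memory_bytes / limit_bytes
    # Clamp to the documented 0-100 range (assumption, see lead-in).
    return max(0.0, min(100.0, percent))


half_used = memory_percent_of_limit(512 * 1024**2, 1024**3)
no_limit = memory_percent_of_limit(512 * 1024**2, None)
```

Passing the result as `memory_percent=` fits the `None` default in the hunk above: when no limit is known, the `if memory_percent is not None` guard leaves the gauge untouched.
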

src/cow_performance/prometheus/metrics.py

Lines changed: 6 additions & 0 deletions
@@ -211,6 +211,12 @@ def _init_resource_metrics(self) -> None:
             ["container"],
             registry=self.registry,
         )
+        self.container_memory_percent = Gauge(
+            "cow_perf_container_memory_percent",
+            "Container memory usage as percentage of limit (0-100)",
+            ["container"],
+            registry=self.registry,
+        )
         self.container_network_rx_bytes = Gauge(
             "cow_perf_container_network_rx_bytes",
             "Container network bytes received",

thoughts/INDEX.md

Lines changed: 2 additions & 0 deletions
@@ -122,6 +122,7 @@ Detailed implementation approaches for tickets. Read these before implementing t
 | [2026-02-02-cow-588-baseline-snapshot-system.md](plans/2026-02-02-cow-588-baseline-snapshot-system.md) | COW-588 | ✅ Complete | BaselineManager, git-info, UUID-index, serialization |
 | [2026-02-03-cow-589-comparison-engine.md](plans/2026-02-03-cow-589-comparison-engine.md) | COW-589 | ✅ Complete | ComparisonEngine, regression, statistics, p-value, Cohen's-d |
 | [2026-02-03-cow-590-automated-reporting.md](plans/2026-02-03-cow-590-automated-reporting.md) | COW-590 | ✅ Complete | ReportGenerator, formatters, CSV, recommendations, CLI |
+| [2026-02-13-cow-598-alerting-rules.md](plans/2026-02-13-cow-598-alerting-rules.md) | COW-598 | 🔲 Ready | Prometheus alerts, alerting rules, thresholds, Grafana annotations |
 
 ---
 
@@ -227,6 +228,7 @@ tickets/COW-593-grafana-dashboards.md
 ### Alerting Rules (COW-598) — M3
 ```
 tickets/COW-598-alerting-rules.md
+└── plans/2026-02-13-cow-598-alerting-rules.md (execution plan)
 ```
 
 ---
