COW-591 phase 2 extended prometheus metrics#16
Merged
lgahdl merged 13 commits into jefferson/cow-591-11-prometheus-exporters on Feb 26, 2026
Add API, resource, per-trader, and baseline comparison metrics to complete COW-591 Prometheus exporter deliverable:
- API metrics: requests counter, response time histogram, errors counter
- Resource metrics: container CPU, memory, network gauges
- Per-trader metrics: orders submitted/filled by trader index
- Comparison metrics: baseline percent change, regression detection
Update MetricsStore to pass container name with resource callbacks. Add 21 new unit tests covering all Phase 2 functionality.

Add two Grafana dashboards for monitoring performance tests:
- Overview dashboard: test progress, order rates, latency distributions
- API Performance dashboard: response times, throughput, error rates
Configure dashboard provisioning via docker-compose volume mount and add explicit UID to Prometheus datasource for dashboard compatibility.

Add upload_app_data_with_retry() and get_open_order_count() methods that were missing from the instrumented wrapper, causing AttributeError when used in place of the underlying OrderbookClient.

Add three new dashboards completing the Grafana visualization suite:
- Resources dashboard: CPU, memory, network monitoring per container
- Comparison dashboard: baseline vs current with regression indicators
- Trader Activity dashboard: per-trader statistics and activity patterns
Update existing dashboards with cross-navigation links to all 5 dashboards.

… COW-593 Document Prometheus exporter phases and Grafana dashboard implementation plans to track progress on metrics infrastructure work.

- Add prometheus_port config field with default 9091
- CLI uses config default, --prometheus-port 0 to disable
- Enhance order timeout logging with status, age, token pair, lifecycle
- Improve monitoring output with status breakdown counts
- Show all terminal states in final summary (filled/expired/failed/cancelled)
- Update README and CLI docs with monitoring instructions

Add concurrent Prometheus metrics update loop that exports test progress and throughput metrics every second during performance test runs. This fixes "No Data" panels in the Overview dashboard. Remove redundant P50 delta panels from the comparison dashboard and adjust grid positions for cleaner layout.

- Create 7 core alerting rules (latency, error rate, throughput, resources, test execution)
- Enable rule_files in Prometheus configuration
- Add alerts volume mount in Docker Compose
- Add Grafana annotations to show firing alerts on dashboard
- Add container_memory_percent metric for CriticalMemoryUsage alert

- Add implementation plan: thoughts/plans/2026-02-13-cow-598-alerting-rules.md
- Add implementation notes to ticket file documenting scope decisions
- Update INDEX.md with plan entry and document cluster reference
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…aining-dashboards-resources-comparison COW-593 task 2 remaining dashboards resources comparison
…ential-dashboards-overview-api feat(grafana): add performance and API monitoring dashboards
e4b069e into jefferson/cow-591-11-prometheus-exporters
10 checks passed
Review Focus
Line count is high due to tests and metric declarations. Focus review on: (1) trader index-based cardinality management in exporter.py:200-217 to avoid high-cardinality address labels, (2) API error classification logic in _classify_api_error(), and (3) the (container_name, sample) tuple format change in MetricsStore callbacks.
Summary
Implements Phase 2 of COW-591 Prometheus exporter, adding API performance, container resource, per-trader, and baseline comparison metrics. This completes the full Prometheus metrics deliverable for the grant.
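For reviewers unfamiliar with the cardinality concern in review focus item (1), here is a minimal sketch of the general idea: label per-trader metrics with a small integer index rather than the raw address. It is not the code in exporter.py; the class name `TraderIndexRegistry`, the `max_traders` cap, and the `"other"` overflow label are all assumptions made for illustration.

```python
from typing import Dict


class TraderIndexRegistry:
    """Hypothetical helper: map trader addresses to small, stable indices."""

    def __init__(self, max_traders: int = 100) -> None:
        self._index_by_address: Dict[str, int] = {}
        self._max_traders = max_traders  # assumed cap on distinct label values

    def label_for(self, address: str) -> str:
        """Return the label value to use for this trader's metrics."""
        key = address.lower()  # treat addresses as case-insensitive
        if key not in self._index_by_address:
            if len(self._index_by_address) >= self._max_traders:
                # Past the cap, fold new traders into one shared bucket so
                # the number of time series stays bounded.
                return "other"
            self._index_by_address[key] = len(self._index_by_address)
        return str(self._index_by_address[key])


registry = TraderIndexRegistry(max_traders=2)
labels = [registry.label_for(a) for a in ("0xAAA", "0xBBB", "0xaaa", "0xCCC")]
print(labels)  # ['0', '1', '0', 'other']
```

With this approach the number of label values is bounded by the cap plus one, regardless of how many distinct addresses appear during a test run.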
Changes
New Metrics Added
API Metrics
- cow_perf_api_requests_total - Counter for API requests by endpoint/method/status
- cow_perf_api_response_time_seconds - Histogram of API response times
- cow_perf_api_errors_total - Counter for API errors by type (client_error, server_error, timeout, connection_error)
Container Resource Metrics
- cow_perf_container_cpu_percent - Gauge for container CPU usage
- cow_perf_container_memory_bytes - Gauge for container memory usage
- cow_perf_container_network_rx_bytes - Gauge for network bytes received
- cow_perf_container_network_tx_bytes - Gauge for network bytes transmitted
Per-Trader Metrics
- cow_perf_trader_orders_submitted_total - Counter for orders submitted by trader index
- cow_perf_trader_orders_filled_total - Counter for orders filled by trader index
- cow_perf_traders_active - Gauge for count of currently active traders
Baseline Comparison Metrics
- cow_perf_baseline_comparison_percent - Gauge for percentage change from baseline
- cow_perf_regression_detected - Gauge for regression counts by severity
- cow_perf_regressions_total - Counter for total regressions detected
Code Changes
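The `(container_name, sample)` callback change in this PR could look roughly like the sketch below. This is an illustration only: `MetricsStoreSketch`, `ResourceSample`, and `register_resource_callback` are hypothetical names, and the sketch assumes the callback receives a single tuple argument, as the PR description suggests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ResourceSample:
    """Hypothetical resource sample; the real type may carry more fields."""
    cpu_percent: float
    memory_bytes: int


# Callbacks take one argument: a (container_name, sample) tuple.
ResourceCallback = Callable[[Tuple[str, ResourceSample]], None]


class MetricsStoreSketch:
    """Stand-in for MetricsStore, showing only the callback hand-off."""

    def __init__(self) -> None:
        self._callbacks: List[ResourceCallback] = []

    def register_resource_callback(self, callback: ResourceCallback) -> None:
        self._callbacks.append(callback)

    def add_resource_sample(self, container_name: str, sample: ResourceSample) -> None:
        # Previously callbacks would have received just `sample`; now the
        # container name travels with it so per-container metrics can be labeled.
        for callback in self._callbacks:
            callback((container_name, sample))


cpu_by_container: Dict[str, float] = {}


def on_resource(item: Tuple[str, ResourceSample]) -> None:
    container_name, sample = item
    cpu_by_container[container_name] = sample.cpu_percent


store = MetricsStoreSketch()
store.register_resource_callback(on_resource)
store.add_resource_sample("orderbook", ResourceSample(cpu_percent=12.5, memory_bytes=1024))
print(cpu_by_container)  # {'orderbook': 12.5}
```

Existing callbacks written for the old signature would need to unpack the tuple, as `on_resource` does here.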
- … (_update_api_metrics) and resource (_update_resource_metrics) types
- … (container_name, sample) tuple
How to Test
Run the test suite:
Start exporter and verify metrics:
Run with CLI:
Checklist
- … (poetry run pytest tests/unit/prometheus/ - 56 passed)
- … (poetry run ruff check .)
- … (poetry run mypy src/cow_performance/prometheus/)
- … (thoughts/plans/2026-02-06-cow-591-phase-2-prometheus-exporter.md)
Breaking Changes
Minor: MetricsStore.add_resource_sample() now passes a (container_name, sample) tuple to callbacks instead of just sample. This only affects code that registers callbacks for resource metrics.
Related Issues
🤖 Generated with Claude Code