feat(grafana): add performance and API monitoring dashboards #17

Merged
lgahdl merged 10 commits into jefferson/cow-614-cow-591-phase-2-extended-prometheus-metrics from jefferson/cow-615-cow-593-task-1-essential-dashboards-overview-api on Feb 25, 2026
@jeffersonBastos (Collaborator) commented Feb 10, 2026

Summary

Add Grafana dashboards for real-time monitoring of performance tests, completing Task 1 of COW-593. This includes an Overview dashboard for test progress and order metrics, plus an API Performance dashboard for detailed endpoint analysis.

Changes

Grafana Dashboards

  • Overview Dashboard (configs/dashboards/performance.json):

    • Test overview panel: scenario name, duration, trader count, progress gauge
    • Order submission rate: actual vs target rate comparison, cumulative orders
    • Latency distribution: submission/settlement heatmaps, percentile tracking (P50/P90/P95/P99)
    • Order status: pie chart distribution, success rate, counts by status
  • API Performance Dashboard (configs/dashboards/api-performance.json):

    • Response time panels: P95 by endpoint, heatmap distribution, P50/P99 stats
    • Throughput panels: requests/sec by endpoint, total requests, breakdown by HTTP method
    • Error tracking: error rate over time, pie chart by type, error breakdown table
    • Dashboard linking: navigation link to Overview dashboard

Infrastructure

  • Grafana datasource (configs/grafana-datasource.yml): Added explicit uid: prometheus for dashboard compatibility
  • Docker Compose (docker-compose.yml): Added volume mount for dashboard provisioning
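
The explicit `uid: prometheus` matters because provisioned dashboards reference their data source by UID. One way to catch a mismatch before opening Grafana is to scan the dashboard JSON for datasource references. The following is a minimal sketch, not part of this PR: the helper names are hypothetical, and it assumes panels use the object form `{"type": ..., "uid": ...}` and that Grafana's built-in annotation source uses the UID `-- Grafana --`.

```python
import json
from pathlib import Path
from typing import Any, FrozenSet, Iterator, Set

# Assumption: the only UIDs the dashboards should reference.
ALLOWED_UIDS: FrozenSet[str] = frozenset({"prometheus", "-- Grafana --"})


def iter_datasource_uids(node: Any) -> Iterator[str]:
    """Yield every datasource uid found anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        datasource = node.get("datasource")
        if isinstance(datasource, dict) and "uid" in datasource:
            yield datasource["uid"]
        for value in node.values():
            yield from iter_datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_datasource_uids(item)


def unexpected_uids(dashboard: dict, allowed: FrozenSet[str] = ALLOWED_UIDS) -> Set[str]:
    """Return the datasource uids in the dashboard that are not in `allowed`."""
    return {uid for uid in iter_datasource_uids(dashboard) if uid not in allowed}


def check_dashboard_file(path: str) -> Set[str]:
    """Load a dashboard JSON file and report unexpected datasource uids."""
    return unexpected_uids(json.loads(Path(path).read_text()))
```

Calling `check_dashboard_file("configs/dashboards/performance.json")` would return an empty set when every reference resolves to the provisioned data source.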

Bug Fix

  • Instrumented client (src/cow_performance/api/instrumented_client.py):
    • Added upload_app_data_with_retry() method with exponential backoff
    • Added get_open_order_count() delegation method
    • Fixes AttributeError when using instrumented client in place of OrderbookClient
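
For readers unfamiliar with the pattern, the fix amounts to a wrapper that forwards calls to the wrapped client and retries one of them with exponential backoff. The sketch below is illustrative only and is not the code in `instrumented_client.py`: the class name, the synchronous style, the `upload_app_data` method on the inner client, and the retry parameters are all assumptions.

```python
import time
from typing import Any, Callable


class InstrumentedClientSketch:
    """Hypothetical wrapper that delegates to an inner client."""

    def __init__(
        self,
        inner: Any,
        max_attempts: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._inner = inner
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep  # injectable so tests do not actually wait

    def get_open_order_count(self, *args: Any, **kwargs: Any) -> Any:
        # Plain delegation: without this method, callers that expect the
        # inner client's interface would hit AttributeError on the wrapper.
        return self._inner.get_open_order_count(*args, **kwargs)

    def upload_app_data_with_retry(self, *args: Any, **kwargs: Any) -> Any:
        # Retry the (assumed) inner upload call, doubling the delay each time:
        # base_delay, 2 * base_delay, 4 * base_delay, ...
        for attempt in range(self._max_attempts):
            try:
                return self._inner.upload_app_data(*args, **kwargs)
            except Exception:
                if attempt == self._max_attempts - 1:
                    raise  # out of attempts: surface the last error
                self._sleep(self._base_delay * 2**attempt)
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real delays; a production version would likely also restrict which exception types are retried.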

How to Test

  1. Start the Docker services:

    docker compose up -d prometheus grafana

  2. Access Grafana at http://localhost:3000 (admin/admin)

  3. Navigate to Dashboards → Browse and verify:

    • "CoW Performance Testing - Overview" dashboard loads
    • "CoW Performance Testing - API Performance" dashboard loads
    • Both dashboards show Prometheus as the data source
  4. Run a performance test to generate metrics and verify panels populate
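
Step 3 can also be checked from a script instead of the browser. The sketch below is an optional convenience, not part of this PR: it queries Grafana's `/api/search` endpoint with the default credentials listed above and reports which of the two expected dashboard titles are missing. The function names are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Iterable, List

EXPECTED_TITLES = (
    "CoW Performance Testing - Overview",
    "CoW Performance Testing - API Performance",
)


def fetch_dashboards(
    base_url: str = "http://localhost:3000",
    user: str = "admin",
    password: str = "admin",
) -> List[dict]:
    """Return the dashboard entries listed by Grafana's search API."""
    request = urllib.request.Request(f"{base_url}/api/search?type=dash-db")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def missing_dashboards(
    search_results: Iterable[dict],
    expected: Iterable[str] = EXPECTED_TITLES,
) -> List[str]:
    """Return the expected titles that do not appear in the search results."""
    found = {item.get("title") for item in search_results}
    return [title for title in expected if title not in found]
```

With the services running, `missing_dashboards(fetch_dashboards())` should return an empty list once both dashboards have been provisioned.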

Checklist

  • Tests pass (poetry run pytest)
  • Linting passes (poetry run ruff check .)
  • Type checking passes (poetry run mypy .)
  • Dashboards manually verified in Grafana
  • Breaking changes documented (if any)

Breaking Changes

None

Related Issues

  • COW-593: Grafana Dashboards (Task 1: Essential Dashboards)
  • COW-615: COW-593 Task 1 - Essential Dashboards (Overview + API)

Add two Grafana dashboards for monitoring performance tests:
- Overview dashboard: test progress, order rates, latency distributions
- API Performance dashboard: response times, throughput, error rates

Configure dashboard provisioning via docker-compose volume mount and
add explicit UID to Prometheus datasource for dashboard compatibility.

Add upload_app_data_with_retry() and get_open_order_count() methods
that were missing from the instrumented wrapper. Their absence caused
an AttributeError when the wrapper was used in place of the underlying
OrderbookClient.
linear bot commented Feb 10, 2026

@jeffersonBastos changed the title from "Jefferson/cow 615 cow 593 task 1 essential dashboards overview api" to "feat(grafana): add performance and API monitoring dashboards" on Feb 10, 2026
Add three new dashboards completing the Grafana visualization suite:

- Resources dashboard: CPU, memory, network monitoring per container
- Comparison dashboard: baseline vs current with regression indicators
- Trader Activity dashboard: per-trader statistics and activity patterns

Update existing dashboards with cross-navigation links to all 5 dashboards.
@jeffersonBastos marked this pull request as ready for review February 10, 2026 18:46
jeffersonBastos and others added 7 commits February 10, 2026 15:51
… COW-593

Document Prometheus exporter phases and Grafana dashboard implementation
plans to track progress on metrics infrastructure work.

- Add prometheus_port config field with default 9091
- CLI uses config default, --prometheus-port 0 to disable
- Enhance order timeout logging with status, age, token pair, lifecycle
- Improve monitoring output with status breakdown counts
- Show all terminal states in final summary (filled/expired/failed/cancelled)
- Update README and CLI docs with monitoring instructions
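
The port handling described in this commit can be pictured as follows. This is a minimal sketch under stated assumptions, not the project's actual config code: `MonitoringConfig` and `resolve_prometheus_port` are hypothetical names, and treating a configured value of 0 as "disabled" (not only the CLI flag) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitoringConfig:
    """Hypothetical config section; only the field and default come from the commit."""

    prometheus_port: int = 9091


def resolve_prometheus_port(
    config: MonitoringConfig,
    cli_port: Optional[int] = None,
) -> Optional[int]:
    """Pick the exporter port, or return None when the exporter is disabled.

    The CLI value wins when given; otherwise the config default applies.
    A port of 0 means "do not start the exporter".
    """
    port = config.prometheus_port if cli_port is None else cli_port
    return None if port == 0 else port
```

Returning `None` for the disabled case lets the caller skip starting the exporter with a single `if port is not None` check.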

Add concurrent Prometheus metrics update loop that exports test progress
and throughput metrics every second during performance test runs. This
fixes "No Data" panels in the Overview dashboard.
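
The idea behind a concurrent update loop is that a background task publishes a progress snapshot on a fixed interval while the test itself runs, so panels have data to show before the run finishes. The sketch below uses plain `asyncio` and is not the PR's implementation: all function names are hypothetical, and `publish` stands in for whatever sets the real Prometheus gauges.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def metrics_update_loop(
    read_progress: Callable[[], Any],
    publish: Callable[[Any], None],
    stop: asyncio.Event,
    interval: float = 1.0,
) -> None:
    """Publish a progress snapshot every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        publish(read_progress())
        try:
            # Sleep for one interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    publish(read_progress())  # final snapshot so the last values are exported


async def run_with_metrics(
    run_test: Callable[[], Awaitable[Any]],
    read_progress: Callable[[], Any],
    publish: Callable[[Any], None],
    interval: float = 1.0,
) -> Any:
    """Run the test coroutine while the update loop runs alongside it."""
    stop = asyncio.Event()
    updater = asyncio.create_task(
        metrics_update_loop(read_progress, publish, stop, interval)
    )
    try:
        return await run_test()
    finally:
        stop.set()
        await updater
```

Waiting on the stop event with a timeout, instead of calling `asyncio.sleep`, lets the loop shut down promptly when the test ends instead of finishing a full interval first.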

Remove redundant P50 delta panels from the comparison dashboard and
adjust grid positions for cleaner layout.

- Create 7 core alerting rules (latency, error rate, throughput, resources, test execution)
- Enable rule_files in Prometheus configuration
- Add alerts volume mount in Docker Compose
- Add Grafana annotations to show firing alerts on dashboard
- Add container_memory_percent metric for CriticalMemoryUsage alert
- Add implementation plan: thoughts/plans/2026-02-13-cow-598-alerting-rules.md
- Add implementation notes to ticket file documenting scope decisions
- Update INDEX.md with plan entry and document cluster reference

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…aining-dashboards-resources-comparison

COW-593 task 2 remaining dashboards resources comparison
@lgahdl merged commit 5b5298a into jefferson/cow-614-cow-591-phase-2-extended-prometheus-metrics Feb 25, 2026
10 checks passed