| 1 | +# Metrics & Monitoring |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This adds two kinds of observability to the REST API: |
| 6 | + |
| 7 | +* **Operational metrics (Prometheus):** requests/second, latencies, error rates, exposed at **`/metrics`** and scraped by Prometheus; visualized in Grafana. |
| 8 | +* **Product usage (MySQL):** the middleware writes one row per “asset-shaped” request to **`asset_access_log`** so we can query **top assets** (popularity) and build dashboards. The aggregated counts are served by **`/stats/top/{resource_type}`**. |
| 9 | + |
| 10 | +Low-coupling design: a small middleware observes the path and logs access; routers are unchanged. Path parsing is centralized to handle version prefixes. |
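The low-coupling idea can be sketched as a pure ASGI middleware that wraps the app, notes the response status, and records the hit once the response has completed. This is a minimal illustration and not the code in `src/middleware/access_log.py`: the class name, the `record` callback, and the demo app are all assumptions made for the example.

```python
import asyncio


class AccessLogMiddleware:
    """Pure-ASGI sketch: call record(path, status) after the response completes."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # hypothetical callback, e.g. a database insert

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        status = None

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]  # remember the status code
            await send(message)

        await self.app(scope, receive, send_wrapper)
        if status is not None:
            self.record(scope["path"], status)


async def demo_app(scope, receive, send):
    # Stand-in for the real application: always answers 200.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _send(message):
    pass


hits = []
wrapped = AccessLogMiddleware(demo_app, lambda path, status: hits.append((path, status)))
asyncio.run(wrapped({"type": "http", "path": "/datasets/abc"}, _receive, _send))
print(hits)
```

Because the wrapper only observes `scope` and the outgoing messages, the routers need no changes, which is the point of the design.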
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Components |
| 15 | + |
| 16 | +* **apiserver** — FastAPI app exposing: |
| 17 | + |
| 18 | + * **`/metrics`** (Prometheus exposition via `prometheus_fastapi_instrumentator`) |
| 19 | + * **`/stats/top/{resource_type}`** (JSON; success hits only) |
| 20 | +* **MySQL** — table `asset_access_log` stores per-request asset hits |
| 21 | +* **Prometheus** — scrapes apiserver’s `/metrics` |
| 22 | +* **Grafana** — visualizes Prometheus (traffic) + MySQL (popularity) |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Endpoints (apiserver) |
| 27 | + |
| 28 | +* **GET `/metrics`** |
| 29 | + Exposes Prometheus metrics. Example series: `http_requests_total`, `http_request_duration_seconds`, process/python metrics, etc. |
| 30 | + |
| 31 | +* **GET `/stats/top/{resource_type}?limit=10`** |
| 32 | + Returns an array of objects: |
| 33 | + |
| 34 | + ```json |
| 35 | + [ |
| 36 | + { "asset_id": "data_p7v02a70CbBGKk29T8przBjf", "hits": 42 }, |
| 37 | + { "asset_id": "data_g8912mLHg8i2hsJblKu6G78i", "hits": 17 } |
| 38 | + ] |
| 39 | + ``` |
| 40 | + |
| 41 | + * Reports only successful requests (status code 200). |
| 42 | + * `resource_type` is a resource collection name such as `datasets` or `models`. |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## What gets logged (middleware) |
| 47 | + |
| 48 | +Requests to “asset-shaped” paths are logged after the response completes. These are endpoints that start with a resource type such as `/datasets` or `/models`, as well as `/assets`. Requests to other endpoints, such as `/metrics` or `/docs`, are not logged by the middleware. This also works if the API is deployed with a path prefix, and access is captured regardless of which version of the API is used (e.g., `/v2` or latest). The middleware does *not* record *who* accessed an asset (the webserver itself does log incoming requests, but those logs are not stored in the database). |
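One way such prefix-tolerant parsing could work is sketched below. It is an assumption-laden illustration, not the logic in `src/middleware/path_parse.py`: the function name, the set of resource types, and the rule that version segments look like `v<digits>` are all hypothetical, and the sketch only handles paths that include an identifier.

```python
import re
from typing import Optional, Tuple

# Assumption: an illustrative subset of resource types, not the project's actual list.
KNOWN_RESOURCE_TYPES = {"datasets", "models", "assets"}
# Assumption: version markers look like "v1", "v2", ...
VERSION_SEGMENT = re.compile(r"^v\d+$")


def parse_asset_path(path: str) -> Optional[Tuple[str, str]]:
    """Return (resource_type, identifier) for asset-shaped paths, else None."""
    # Drop empty segments and version markers, wherever they appear in the path.
    segments = [s for s in path.split("/") if s and not VERSION_SEGMENT.match(s)]
    for index, segment in enumerate(segments):
        # The first known resource type followed by another segment wins;
        # anything before it is treated as a deployment prefix.
        if segment in KNOWN_RESOURCE_TYPES and index + 1 < len(segments):
            return segment, segments[index + 1]
    return None


for example in ("/datasets/abc", "/datasets/v1/1", "/v2/models/bert", "/metrics"):
    print(example, "->", parse_asset_path(example))
```

Keeping this in one function means the middleware never has to know how versions or deployment prefixes are laid out.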
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## Table schema: `asset_access_log` |
| 53 | + |
| 54 | +* `id` (PK) |
| 55 | +* `asset_id` (string) — the identifier of the asset, e.g., `data_f8aa9...`. |
| 56 | +* `resource_type` (string) — e.g. `datasets`, `models`, etc. |
| 57 | +* `status` (int) — HTTP status code from the response |
| 58 | +* `accessed_at` (UTC timestamp, indexed) |
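To make the schema concrete, the sketch below recreates the table in an in-memory SQLite database and runs the kind of aggregation that a top-assets endpoint needs. SQLite stands in for MySQL here purely so the example is self-contained; the column types, the index name, the helper functions, and the asset identifiers are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE asset_access_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        asset_id TEXT NOT NULL,
        resource_type TEXT NOT NULL,
        status INTEGER NOT NULL,
        accessed_at TEXT NOT NULL
    )
    """
)
# Hypothetical index name; the source only states that accessed_at is indexed.
conn.execute("CREATE INDEX ix_accessed_at ON asset_access_log (accessed_at)")


def log_access(asset_id, resource_type, status):
    """Insert one row per asset-shaped request, timestamped in UTC."""
    conn.execute(
        "INSERT INTO asset_access_log (asset_id, resource_type, status, accessed_at) "
        "VALUES (?, ?, ?, ?)",
        (asset_id, resource_type, status, datetime.now(timezone.utc).isoformat()),
    )


def top_assets(resource_type, limit=10):
    """Count successful (200) hits per asset, most popular first."""
    rows = conn.execute(
        """
        SELECT asset_id, COUNT(*) AS hits
        FROM asset_access_log
        WHERE resource_type = ? AND status = 200
        GROUP BY asset_id
        ORDER BY hits DESC
        LIMIT ?
        """,
        (resource_type, limit),
    )
    return [{"asset_id": asset_id, "hits": hits} for asset_id, hits in rows]


# Made-up identifiers, for demonstration only.
log_access("data_a", "datasets", 200)
log_access("data_a", "datasets", 200)
log_access("data_b", "datasets", 200)
log_access("data_b", "datasets", 404)  # stored, but excluded from the ranking
log_access("mdl_c", "models", 200)
print(top_assets("datasets"))
```

Storing the status code on every row keeps failed lookups available for analysis while letting the ranking filter on `status = 200`.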
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## Where the code lives |
| 63 | + |
| 64 | +* Middleware: **`src/middleware/access_log.py`** |
| 65 | +* Path parsing (version/deployment prefixes): **`src/middleware/path_parse.py`** |
| 66 | +* Top-assets router: **`src/routers/access_stats_router.py`** |
| 67 | +* Wiring (include router, add middleware, expose /metrics): **`src/main.py`** |
| 68 | + |
| 69 | +--- |
| 70 | + |
| 71 | +## Run it |
| 72 | + |
| 73 | +Start the API + monitoring stack (Prometheus, Grafana): |
| 74 | + |
| 75 | +```bash |
| 76 | +# helper |
| 77 | +scripts/up.sh monitoring |
| 78 | + |
| 79 | +# or directly |
| 80 | +docker compose --env-file=.env --env-file=override.env \ |
| 81 | + -f docker-compose.yaml -f docker-compose.dev.yaml \ |
| 82 | + --profile monitoring up -d |
| 83 | +``` |
| 84 | + |
| 85 | +Open: |
| 86 | + |
| 87 | +* API Docs: `http://localhost:8000/docs` |
| 88 | +* Metrics: `http://localhost:8000/metrics` |
| 89 | +* Prometheus: `http://localhost:${PROMETHEUS_HOST_PORT:-9090}` |
| 90 | +* Grafana: `http://localhost:${GRAFANA_HOST_PORT:-3000}` |
| 91 | + |
| 92 | +Generate some traffic: |
| 93 | + |
| 94 | +```bash |
| 95 | +curl -s http://localhost:8000/datasets/abc >/dev/null |
| 96 | +curl -s http://localhost:8000/datasets/v1/1 >/dev/null |
| 97 | +curl -s http://localhost:8000/v2/models/bert >/dev/null |
| 98 | +``` |
| 99 | + |
| 100 | +Check top assets (datasets): |
| 101 | + |
| 102 | +```bash |
| 103 | +curl -s "http://localhost:8000/stats/top/datasets?limit=5" | jq . |
| 104 | +``` |
| 105 | + |
| 106 | +--- |
| 107 | + |
| 108 | +## Grafana: quick setup |
| 109 | + |
| 110 | +Configure two data sources: |
| 111 | + |
| 112 | +1. **Prometheus** |
| 113 | + |
| 114 | + * URL: `http://prometheus:9090` |
| 115 | + |
| 116 | +2. **MySQL** (popularity) |
| 117 | + |
| 118 | + * Host: `sqlserver` |
| 119 | + * Port: `3306` |
| 120 | + * Database: `aiod` |
| 121 | + * User/password: from `.env` |
| 122 | + |
| 123 | +**PromQL (traffic/latency examples):** |
| 124 | + |
| 125 | +```promql |
| 126 | +# Requests per endpoint (1m rate) |
| 127 | +sum by (handler) (rate(http_requests_total[1m])) |
| 128 | + |
| 129 | +# P95 latency by handler (5m window) |
| 130 | +histogram_quantile( |
| 131 | + 0.95, |
| 132 | + sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])) |
| 133 | +) |
| 134 | + |
| 135 | +# Error rate (4xx/5xx) per endpoint |
| 136 | +sum by (handler) (rate(http_requests_total{status=~"4..|5.."}[5m])) |
| 137 | +``` |
| 138 | + |
| 139 | +**MySQL (popularity examples):** |
| 140 | + |
| 141 | +```sql |
| 142 | +-- Top datasets (all time) |
| 143 | +SELECT asset_id AS asset, COUNT(*) AS hits |
| 144 | +FROM asset_access_log |
| 145 | +WHERE resource_type='datasets' AND status=200 |
| 146 | +GROUP BY asset |
| 147 | +ORDER BY hits DESC |
| 148 | +LIMIT 10; |
| 149 | + |
| 150 | +-- All assets by type |
| 151 | +SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits |
| 152 | +FROM asset_access_log |
| 153 | +WHERE status=200 |
| 154 | +GROUP BY type, asset |
| 155 | +ORDER BY hits DESC; |
| 156 | + |
| 157 | +-- Top assets last 24h |
| 158 | +SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits |
| 159 | +FROM asset_access_log |
| 160 | +WHERE status=200 AND accessed_at >= NOW() - INTERVAL 1 DAY |
| 161 | +GROUP BY type, asset |
| 162 | +ORDER BY hits DESC |
| 163 | +LIMIT 20; |
| 164 | +``` |
| 165 | + |
| 166 | +(Optional) Default data sources and dashboards can be provisioned from these files in the repo: |
| 167 | + |
| 168 | +``` |
| 169 | +grafana/provisioning/datasources/datasources.yml |
| 170 | +grafana/provisioning/dashboards/dashboards.yml |
| 171 | +grafana/provisioning/dashboards/aiod-metrics.json |
| 172 | +``` |
| 173 | + |
| 174 | +--- |
| 175 | + |
| 176 | +## Tests |
| 177 | + |
| 178 | +Focused middleware tests live under `src/tests/middleware/`: |
| 179 | + |
| 180 | +```bash |
| 181 | +PYTHONPATH=src pytest -q \ |
| 182 | + src/tests/middleware/test_path_parse.py \ |
| 183 | + src/tests/middleware/test_access_log_middleware.py |
| 184 | +``` |
| 185 | + |
| 186 | +They cover: |
| 187 | + |
| 188 | +* Path parsing of `/datasets/abc`, `/datasets/v1/1`, `/v2/models/bert`, etc. |
| 189 | +* That asset hits are written for 200s and 404s. |
| 190 | +* That excluded paths (e.g., `/metrics`) are ignored. |
| 191 | + |
| 192 | +--- |
| 193 | + |
| 194 | +## Which service exposes `/stats`? |
| 195 | + |
| 196 | +The **apiserver** (REST API) exposes `/stats/top/{resource_type}`. It’s mounted with the other routers in `src/main.py`. |
| 197 | + |
| 198 | +--- |