Commit 4f169ab

mardalla (Marco Dalla) and PGijsbers authored
Add /stats and /metrics endpoints and monitoring services (#581)
* docs: add metrics spec for Prometheus phase-1
* chore: add baseline Prometheus scrape config
* Add Prometheus metrics, per-asset access logging, and Grafana panels
* Review fixes: remove user_id, monitoring profile, env ports, doc, session helper
* Expand resource type detection to all router groups
* metrics: versioned /stats/v1/top, robust access logging via parser, tests; revert config permissions
* middleware:fix asset path, tests:addition of test access log
* metrics: fix pre-commit issues, correct access_stats router typing, drop unused resource_types
* main: use importlib.metadata instead of pkg_resources (fix mypy)
* metrics: expand docs; add/default Grafana provisioning; stats router & parser tweaks
* Pre-commit missed by GitHub merge
* Only keep the identifier of the asset, without path info
* Do not register the middleware with subapps since it leads to duplicates
* type resolution from identifier prefix; grafana: default API metrics dashboard; tests: expand path parser
* grafana and prometheus bind mounts
* Update monitoring information
* Update table definition
* Allow arbitrary url_prefix based on the server configuration
* Update path parsing to be more strict and use whitelists
* Add constraint to length of identifier

---------

Co-authored-by: Marco Dalla <mdalla@cs.ucc.ie>
Co-authored-by: PGijsbers <p.gijsbers@tue.nl>
1 parent 2ca0ebb commit 4f169ab

20 files changed: +648 -15 lines changed


.env

Lines changed: 25 additions & 0 deletions
@@ -44,3 +44,28 @@ AIOD_NGINX_PORT=80
 #DATA STORAGE
 DATA_PATH=./data
 BACKUP_PATH=./data/backups
+
+#PROMETHEUS and GRAFANA
+AIOD_PROMETHEUS_PORT=9090
+AIOD_GRAFANA_PORT=3000
+
+# MONITORING STORAGE
+PROMETHEUS_DATA_PATH=./data/prometheus
+GRAFANA_DATA_PATH=./data/grafana
+
+# PROMETHEUS RETENTION
+PROMETHEUS_RETENTION=7d
+
+# GRAFANA DEMOS AUTH
+GF_AUTH_ANONYMOUS_ENABLED=true
+GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
+GF_SECURITY_ADMIN_PASSWORD=admin
+
+# GRAFANA
+GRAFANA_PROMETHEUS_URL=http://prometheus:9090
+
+GRAFANA_MYSQL_HOST=sqlserver
+GRAFANA_MYSQL_PORT=3306
+GRAFANA_MYSQL_DB=aiod
+GRAFANA_MYSQL_USER=root
+GRAFANA_MYSQL_PASSWORD=${MYSQL_ROOT_PASSWORD}

docker-compose.dev.yaml

Lines changed: 22 additions & 0 deletions
@@ -58,3 +58,25 @@ services:
     stdin_open: true
     volumes:
       - ./src:/app:ro
+
+  prometheus:
+    volumes:
+      - ${PROMETHEUS_DATA_PATH:-./data/prometheus}:/prometheus
+    command:
+      - --config.file=/etc/prometheus/prometheus.yml
+      - --storage.tsdb.path=/prometheus
+      - --storage.tsdb.retention.time=${PROMETHEUS_RETENTION:-7d}
+
+  grafana:
+    environment:
+      - GF_AUTH_ANONYMOUS_ENABLED=${GF_AUTH_ANONYMOUS_ENABLED:-false}
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GF_AUTH_ANONYMOUS_ORG_ROLE:-Viewer}
+      - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:-admin}
+      - GRAFANA_PROMETHEUS_URL=${GRAFANA_PROMETHEUS_URL:-http://prometheus:9090}
+      - GRAFANA_MYSQL_HOST=${GRAFANA_MYSQL_HOST:-sqlserver}
+      - GRAFANA_MYSQL_PORT=${GRAFANA_MYSQL_PORT:-3306}
+      - GRAFANA_MYSQL_DB=${GRAFANA_MYSQL_DB:-aiod}
+      - GRAFANA_MYSQL_USER=${GRAFANA_MYSQL_USER:-root}
+      - GRAFANA_MYSQL_PASSWORD=${MYSQL_ROOT_PASSWORD}
+    volumes:
+      - ${GRAFANA_DATA_PATH:-./data/grafana}:/var/lib/grafana

docker-compose.yaml

Lines changed: 22 additions & 0 deletions
@@ -237,6 +237,28 @@ services:
       es_logstash_setup:
         condition: service_completed_successfully

+  prometheus:
+    profiles: ["monitoring"]
+    image: prom/prometheus:latest
+    volumes:
+      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
+    ports:
+      - "${AIOD_PROMETHEUS_PORT:-9090}:9090"
+    restart: unless-stopped
+
+  grafana:
+    profiles: ["monitoring"]
+    image: grafana/grafana:latest
+    depends_on:
+      - prometheus
+    volumes:
+      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
+      - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards:ro
+      - ./grafana/dashboards:/etc/grafana/dashboards:ro
+    ports:
+      - "${AIOD_GRAFANA_PORT:-3000}:3000"
+    restart: unless-stopped
+
   taxonomy:
     profiles: ["taxonomy"]
     container_name: taxonomy

docs/hosting/metrics.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
+# Metrics & Monitoring
+
+## Overview
+
+This adds two kinds of observability to the REST API:
+
+* **Operational metrics (Prometheus):** requests/second, latencies, error rates, exposed at **`/metrics`** and scraped by Prometheus; visualized in Grafana.
+* **Product usage (MySQL):** the middleware writes one row per “asset-shaped” request to **`asset_access_log`** so we can query **top assets** (popularity) and build dashboards. Returned via **`/stats/top/{resource_type}`**.
+
+Low-coupling design: a small middleware observes the path and logs access; routers are unchanged. Path parsing is centralized to handle version prefixes.
+
+---
+
+## Components
+
+* **apiserver** — FastAPI app exposing:
+
+  * **`/metrics`** (Prometheus exposition via `prometheus_fastapi_instrumentator`)
+  * **`/stats/top/{resource_type}`** (JSON; success hits only)
+* **MySQL** — table `asset_access_log` stores per-request asset hits
+* **Prometheus** — scrapes apiserver’s `/metrics`
+* **Grafana** — visualizes Prometheus (traffic) + MySQL (popularity)
+
+---
+
+## Endpoints (apiserver)
+
+* **GET `/metrics`**
+  Exposes Prometheus metrics. Example series: `http_requests_total`, `http_request_duration_seconds`, process/python metrics, etc.
+
+* **GET `/stats/top/{resource_type}?limit=10`**
+  Returns an array of objects:
+
+  ```json
+  [
+    { "asset_id": "data_p7v02a70CbBGKk29T8przBjf", "hits": 42 },
+    { "asset_id": "data_g8912mLHg8i2hsJblKu6G78i", "hits": 17 }
+  ]
+  ```
+
+  * Reports only successful requests (status code 200).
+  * `resource_type` is something like `datasets`, `models`, etc.
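
To make the `/metrics` output more concrete, here is a minimal sketch of how a client might total `http_requests_total` per handler. It assumes the endpoint returns the standard Prometheus text exposition format. The sample payload, its values, and the helper `sum_by_handler` are hypothetical and not part of this commit; label parsing is deliberately simplified and does not handle escaped quotes.

```python
import re
from collections import defaultdict

# Hypothetical sample of a /metrics response (values invented for illustration).
SAMPLE = '''\
# HELP http_requests_total Total number of requests by method, status and handler.
# TYPE http_requests_total counter
http_requests_total{handler="/datasets/{identifier}",method="GET",status="2xx"} 12.0
http_requests_total{handler="/datasets/{identifier}",method="GET",status="4xx"} 3.0
http_requests_total{handler="/metrics",method="GET",status="2xx"} 40.0
'''

# One sample line: metric name, a label set in braces, then the value.
SERIES = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
# Simplified label matcher: assumes label values contain no escaped quotes.
LABEL = re.compile(r'(\w+)="([^"]*)"')


def sum_by_handler(text, metric="http_requests_total"):
    """Sum all samples of `metric` per `handler` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SERIES.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("handler", "")] += float(match.group("value"))
    return dict(totals)


print(sum_by_handler(SAMPLE))
```

This is roughly what the PromQL expression `sum by (handler) (...)` does server-side, minus the `rate()` step over time.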
+
+---
+
+## What gets logged (middleware)
+
+“Asset-shaped” paths are logged after the response completes: any endpoint that starts with a resource type such as `/datasets` or `/models`, as well as `/assets`. Access to other endpoints, such as `/metrics` or `/docs`, is not logged by the middleware. This also works if the API is deployed with a path prefix, and access is captured regardless of which version of the API is used (e.g., `/v2` or latest). The middleware does *not* record *who* accessed an asset; the webserver itself does log incoming requests, but those logs are not stored in the database.
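
The parsing rules described above can be sketched as follows. This is an illustrative approximation, not the code in `src/middleware/path_parse.py`: the function name `parse_asset_path`, the whitelist contents, and the length limit are all assumptions. It also assumes that a version segment may appear either before or directly after the resource type, and that only paths carrying an identifier count as asset hits.

```python
import re

# Hypothetical whitelist; the real set of router groups lives in the code base.
RESOURCE_TYPES = {"datasets", "models", "publications", "assets"}
VERSION_SEGMENT = re.compile(r"^v\d+$")
MAX_IDENTIFIER_LENGTH = 64  # assumed limit, for illustration only


def _skip_version(segments):
    """Drop one leading version segment such as 'v2', if present."""
    if segments and VERSION_SEGMENT.match(segments[0]):
        return segments[1:]
    return segments


def parse_asset_path(path, url_prefix=""):
    """Return (resource_type, identifier) for asset-shaped paths, else None."""
    segments = [s for s in path.split("/") if s]
    prefix = [s for s in url_prefix.split("/") if s]
    if segments[: len(prefix)] != prefix:
        return None  # request is outside the configured deployment prefix
    segments = _skip_version(segments[len(prefix):])
    if not segments or segments[0] not in RESOURCE_TYPES:
        return None  # e.g. /metrics or /docs
    resource_type = segments[0]
    rest = _skip_version(segments[1:])
    if not rest or len(rest[0]) > MAX_IDENTIFIER_LENGTH:
        return None
    # Keep only the identifier, without any trailing path info.
    return resource_type, rest[0]


print(parse_asset_path("/v2/models/bert"))             # ('models', 'bert')
print(parse_asset_path("/api/datasets/abc", "/api"))   # ('datasets', 'abc')
print(parse_asset_path("/metrics"))                    # None
```

Using a whitelist rather than a catch-all pattern keeps unrelated endpoints out of the log even if new routes are added later.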
+
+---
+
+## Table schema: `asset_access_log`
+
+* `id` (PK)
+* `asset_id` (string) — the identifier of the asset, e.g., `data_f8aa9...`.
+* `resource_type` (string) — e.g. `datasets`, `models`, etc.
+* `status` (int) — HTTP status code from the response
+* `accessed_at` (UTC timestamp, indexed)
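
As a rough illustration of this schema and the query behind the top-assets endpoint, the sketch below uses an in-memory SQLite database as a stand-in for MySQL. The column types, the index name, and the helper names `create_schema` and `top_assets` are assumptions; the actual table definition and router code may differ.

```python
import sqlite3


def create_schema(conn):
    # Assumed column types; the real table is defined in the project's ORM models.
    conn.execute(
        """
        CREATE TABLE asset_access_log (
            id INTEGER PRIMARY KEY,
            asset_id TEXT NOT NULL,
            resource_type TEXT NOT NULL,
            status INTEGER NOT NULL,
            accessed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.execute("CREATE INDEX ix_accessed_at ON asset_access_log (accessed_at)")


def top_assets(conn, resource_type, limit=10):
    """Count successful (status 200) hits per asset for one resource type."""
    rows = conn.execute(
        """
        SELECT asset_id, COUNT(*) AS hits
        FROM asset_access_log
        WHERE resource_type = ? AND status = 200
        GROUP BY asset_id
        ORDER BY hits DESC, asset_id
        LIMIT ?
        """,
        (resource_type, limit),
    ).fetchall()
    return [{"asset_id": asset_id, "hits": hits} for asset_id, hits in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.executemany(
    "INSERT INTO asset_access_log (asset_id, resource_type, status) VALUES (?, ?, ?)",
    [
        ("data_a", "datasets", 200),
        ("data_a", "datasets", 200),
        ("data_b", "datasets", 200),
        ("data_b", "datasets", 404),  # logged, but not counted as a hit
        ("mdl_c", "models", 200),
    ],
)
print(top_assets(conn, "datasets", limit=5))
# [{'asset_id': 'data_a', 'hits': 2}, {'asset_id': 'data_b', 'hits': 1}]
```

The 404 row shows why `status` is stored: failed lookups are recorded but filtered out when ranking popularity.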
+
+---
+
+## Where the code lives
+
+* Middleware: **`src/middleware/access_log.py`**
+* Path parsing (version/deployment prefixes): **`src/middleware/path_parse.py`**
+* Top-assets router: **`src/routers/access_stats_router.py`**
+* Wiring (include router, add middleware, expose /metrics): **`src/main.py`**
+
+---
+
+## Run it
+
+Start the API + monitoring stack (Prometheus, Grafana):
+
+```bash
+# helper
+scripts/up.sh monitoring
+
+# or directly
+docker compose --env-file=.env --env-file=override.env \
+  -f docker-compose.yaml -f docker-compose.dev.yaml \
+  --profile monitoring up -d
+```
+
+Open:
+
+* API Docs: `http://localhost:8000/docs`
+* Metrics: `http://localhost:8000/metrics`
+* Prometheus: `http://localhost:${AIOD_PROMETHEUS_PORT:-9090}`
+* Grafana: `http://localhost:${AIOD_GRAFANA_PORT:-3000}`
+
+Generate some traffic:
+
+```bash
+curl -s http://localhost:8000/datasets/abc >/dev/null
+curl -s http://localhost:8000/datasets/v1/1 >/dev/null
+curl -s http://localhost:8000/v2/models/bert >/dev/null
+```
+
+Check top assets (datasets):
+
+```bash
+curl -s "http://localhost:8000/stats/top/datasets?limit=5" | jq .
+```
+
+---
+
+## Grafana: quick setup
+
+Configure two data sources:
+
+1. **Prometheus**
+
+   * URL: `http://prometheus:9090`
+
+2. **MySQL** (popularity)
+
+   * Host: `sqlserver`
+   * Port: `3306`
+   * Database: `aiod`
+   * User/password: from `.env`
+
+**PromQL (traffic/latency examples):**
+
+```promql
+# Requests per endpoint (1m rate)
+sum by (handler) (rate(http_requests_total[1m]))
+
+# P95 latency by handler (5m window)
+histogram_quantile(
+  0.95,
+  sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m]))
+)
+
+# Error rate (4xx/5xx) per endpoint
+sum by (handler) (rate(http_requests_total{status=~"4..|5.."}[5m]))
+```
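
To show what the P95 query computes, here is a simplified sketch of how `histogram_quantile` estimates a quantile from cumulative buckets by linear interpolation. It covers only classic histograms with non-negative bounds and ignores several edge cases that Prometheus handles; the bucket values are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile `q` from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, mirroring the `le` label.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: (le, cumulative request count).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the result is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies of interest.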
+
+**MySQL (popularity examples):**
+
+```sql
+-- Top datasets (all time)
+SELECT asset_id AS asset, COUNT(*) AS hits
+FROM asset_access_log
+WHERE resource_type='datasets' AND status=200
+GROUP BY asset
+ORDER BY hits DESC
+LIMIT 10;
+
+-- All assets by type
+SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits
+FROM asset_access_log
+WHERE status=200
+GROUP BY type, asset
+ORDER BY hits DESC;
+
+-- Top assets last 24h
+SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits
+FROM asset_access_log
+WHERE status=200 AND accessed_at >= NOW() - INTERVAL 1 DAY
+GROUP BY type, asset
+ORDER BY hits DESC
+LIMIT 20;
+```
+
+(Optional) Provision defaults in repo:
+
+```
+grafana/provisioning/datasources/datasources.yml
+grafana/provisioning/dashboards/dashboards.yml
+grafana/provisioning/dashboards/aiod-metrics.json
+```
+
+---
+
+## Tests
+
+Focused middleware tests live under `src/tests/middleware/`:
+
+```bash
+PYTHONPATH=src pytest -q \
+  src/tests/middleware/test_path_parse.py \
+  src/tests/middleware/test_access_log_middleware.py
+```
+
+They cover:
+
+* Path parsing of `/datasets/abc`, `/datasets/v1/1`, `/v2/models/bert`, etc.
+* That asset hits are written for 200s and 404s.
+* That excluded paths (e.g., `/metrics`) are ignored.
+
+---
+
+## Which service exposes `/stats`?
+
+The **apiserver** (REST API) exposes `/stats/top/{resource_type}`. It’s mounted with the other routers in `src/main.py`.
+
+---
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+{
+  "title": "AIoD API Metrics",
+  "uid": "aiod-api-metrics",
+  "timezone": "browser",
+  "schemaVersion": 38,
+  "version": 2,
+  "refresh": "5s",
+  "panels": [
+    {
+      "type": "timeseries",
+      "title": "Requests per endpoint",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "targets": [
+        {
+          "expr": "sum by (handler) (rate(http_requests_total[1m]))",
+          "legendFormat": "{{handler}}"
+        }
+      ],
+      "gridPos": { "x": 0, "y": 0, "w": 24, "h": 9 }
+    },
+    {
+      "type": "table",
+      "title": "Top assets per type (top 20 each)",
+      "datasource": { "type": "mysql", "uid": "mysql" },
+      "targets": [
+        {
+          "format": "table",
+          "rawSql": "WITH ranked AS (\n SELECT\n resource_type,\n asset_id,\n COUNT(*) AS hits,\n ROW_NUMBER() OVER (\n PARTITION BY resource_type\n ORDER BY COUNT(*) DESC\n ) AS r\n FROM asset_access_log\n WHERE status = 200\n GROUP BY resource_type, asset_id\n)\nSELECT\n resource_type AS type,\n asset_id AS asset,\n hits\nFROM ranked\nWHERE r <= 20\nORDER BY type, hits DESC;"
+        }
+      ],
+      "gridPos": { "x": 0, "y": 9, "w": 24, "h": 9 },
+      "options": { "showHeader": true }
+    },
+    {
+      "type": "table",
+      "title": "Top assets overall (top 20)",
+      "datasource": { "type": "mysql", "uid": "mysql" },
+      "targets": [
+        {
+          "format": "table",
+          "rawSql": "SELECT\n CONCAT(resource_type, '/', asset_id) AS identifier,\n COUNT(*) AS hits\nFROM asset_access_log\nWHERE status = 200\nGROUP BY resource_type, asset_id\nORDER BY hits DESC\nLIMIT 20;"
+        }
+      ],
+      "gridPos": { "x": 0, "y": 18, "w": 24, "h": 8 },
+      "options": { "showHeader": true }
+    }
+  ]
+}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+apiVersion: 1
+providers:
+  - name: default
+    orgId: 1
+    folder: ""
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 10
+    options:
+      path: /etc/grafana/dashboards
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+apiVersion: 1
+datasources:
+  - uid: mysql
+    name: API MySQL
+    type: mysql
+    access: proxy
+    url: ${GRAFANA_MYSQL_HOST}:${GRAFANA_MYSQL_PORT}
+    database: ${GRAFANA_MYSQL_DB}
+    user: ${GRAFANA_MYSQL_USER}
+    secureJsonData:
+      password: ${GRAFANA_MYSQL_PASSWORD}
+    isDefault: false
+    editable: false
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+apiVersion: 1
+datasources:
+  - uid: prometheus
+    name: Prometheus
+    type: prometheus
+    access: proxy
+    url: ${GRAFANA_PROMETHEUS_URL}
+    isDefault: true
+    editable: false

mkdocs.yaml

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ nav:
     - 'Authentication': hosting/authentication.md
     - 'Connectors': hosting/connectors.md
     - 'Synchronization': hosting/synchronization.md
+    - 'Monitoring': hosting/metrics.md
   - 'Developer Resources':
     - developer/index.md
     - 'Authentication': developer/authentication.md

prometheus.yml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+global:
+  scrape_interval: 1s
+scrape_configs:
+  - job_name: 'aiod_rest_api'
+    metrics_path: /metrics
+    static_configs:
+      - targets: ['app:8000']
