Commit c1a27a1

feat(sprint-10): implement Story 10.1 - Prometheus Metrics Integration (8 points)

SUMMARY: Implemented comprehensive Prometheus metrics integration for system monitoring, ML operations tracking, and business metrics collection. All metrics are exposed via the /metrics endpoint for Prometheus scraping.

IMPLEMENTATION:
1. Request Metrics Middleware (app/middleware/metrics.py):
   - Request latency histogram with buckets [0.1, 0.5, 1, 2, 5, 10] seconds
   - Request counter by endpoint, method, and status code
   - Active requests gauge tracking concurrent requests
   - Automatic exclusion of /metrics endpoint from tracking
2. ML Metrics Collector (app/services/metrics_collector.py):
   - Training duration histogram with context manager
   - Prediction latency histogram with context manager
   - Model accuracy gauge for performance tracking
   - Dataset size histogram for data profiling
3. Business Metrics Collector (app/services/metrics_collector.py):
   - Active users gauge with time window support (day/week/month)
   - Datasets created counter by user
   - Models trained counter by model type and user
   - Predictions made counter by model
4. Integration (app/main.py):
   - Registered MetricsMiddleware in FastAPI app
   - Added /metrics endpoint for Prometheus scraping
   - Exposed metrics in Prometheus text exposition format

TESTING:
- Created 35 comprehensive tests with 100% code coverage
  - tests/test_middleware/test_metrics.py (16 tests)
  - tests/test_services/test_metrics_collector.py (19 tests)
- All 205 unit tests passing (100% pass rate)

DEPENDENCIES:
- Added prometheus-client>=0.19.0 to pyproject.toml

SPRINT TRACKING:
- Sprint 10 Story 10.1 COMPLETE (8/27 points, 30%)
- Updated SPRINT_10.md with implementation details
- All acceptance criteria met

METRICS NAMING CONVENTIONS:
- http_request_duration_seconds (histogram)
- http_requests_total (counter)
- http_requests_active (gauge)
- ml_training_duration_seconds (histogram)
- ml_prediction_latency_seconds (histogram)
- ml_model_accuracy (gauge)
- ml_dataset_size_rows (histogram)
- business_active_users (gauge)
- business_datasets_created_total (counter)
- business_models_trained_total (counter)
- business_predictions_made_total (counter)
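These metric names follow the Prometheus conventions of base units (`_seconds`) and the `_total` suffix for counters. A quick standalone way to see the text exposition format such names produce (using `prometheus_client` directly, outside the app; the sample labels here are illustrative) is:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Isolated registry so this demo does not touch the global default registry
registry = CollectorRegistry()

requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status_code"],
    registry=registry,
)

# Record one sample, then render the registry in the text exposition format
requests_total.labels(method="GET", endpoint="/health", status_code="200").inc()
text = generate_latest(registry).decode()
print(text)
```

The output contains `# HELP`/`# TYPE` comment lines followed by one sample line per label combination, which is exactly what Prometheus reads when it scrapes `/metrics`.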
1 parent d61ba69 commit c1a27a1

File tree

9 files changed (+1275, -0 lines)

SPRINT_10.md

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# Sprint 10: Monitoring & Documentation

**Sprint Duration**: Oct 9-11, 2025 (3 days - accelerated)
**Sprint Goal**: Implement comprehensive monitoring with Prometheus and Grafana, complete API documentation with OpenAPI specs, and fill integration test coverage gaps
**Velocity Target**: 27 story points
**Points Completed**: 8/27 (30%)
**Risk Level**: Low (non-breaking improvements)
**Status**: 🚧 **IN PROGRESS** - Story 10.1 complete

---
## Sprint Overview

### Capacity Planning

- **Team Size**: 1 developer
- **Velocity Target**: 27 story points
- **Focus**: Observability & Documentation
- **Risk Level**: Low (non-breaking improvements)

### Sprint Goals

1. ✅ Prometheus metrics integration with system, ML, and business metrics
2. ⏳ Grafana dashboards for visualization
3. ⏳ Complete OpenAPI documentation for all endpoints
4. ⏳ Integration test coverage >80%
5. ⏳ Monitoring runbook for on-call engineers

---
## Stories

### Story 10.1: Prometheus Metrics Integration (Priority: 🟡, Points: 8)

**Status**: ✅ **COMPLETE**
**Started**: 2025-10-09
**Completed**: 2025-10-09

**As an** SRE
**I want** Prometheus metrics for system and ML operations
**So that** I can monitor performance and identify issues proactively

**Acceptance Criteria:**
- [x] System metrics exposed: request latency, error rates, throughput
- [x] ML metrics exposed: training duration, prediction latency, model accuracy
- [x] Business metrics exposed: active users, datasets created, models trained
- [x] Metrics endpoint `/metrics` accessible
- [x] Metrics follow Prometheus naming conventions
**Technical Tasks:**

1. Install and configure Prometheus client - 1.5h
   - Files: `apps/backend/requirements.txt`, `apps/backend/app/middleware/metrics.py` (new)
   - Add prometheus-client library: `prometheus-client==0.19.0`
   - Create metrics registry
   - Configure metrics HTTP handler at `/metrics`

2. Implement request metrics middleware - 2h
   - File: `apps/backend/app/middleware/metrics.py:20-80`
   - Track request latency histogram (buckets: 0.1, 0.5, 1, 2, 5, 10s)
   - Track request count by endpoint and status code
   - Track active request gauge
   - Add labels: endpoint, method, status_code

3. Implement ML operation metrics - 2.5h
   - File: `apps/backend/app/services/metrics_collector.py` (new)
   - Track training duration histogram
   - Track prediction latency histogram
   - Track model accuracy gauge (updated after training)
   - Track dataset size histogram
   - Add labels: model_type, problem_type, dataset_id

4. Implement business metrics - 2h
   - File: `apps/backend/app/services/metrics_collector.py:100-150`
   - Track active users gauge (from auth)
   - Track datasets created counter
   - Track models trained counter
   - Track predictions made counter
   - Add time-based labels (day, week, month)
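The context-manager approach from task 3 can be sketched as follows. The collector file itself is not in this diff, so the `track_training` helper and its labels are illustrative assumptions; only the metric names come from the commit's naming conventions:

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Gauge, Histogram

# Isolated registry for this sketch
registry = CollectorRegistry()

ml_training_duration = Histogram(
    "ml_training_duration_seconds",
    "Model training duration in seconds",
    labelnames=["model_type", "problem_type"],
    registry=registry,
)

ml_model_accuracy = Gauge(
    "ml_model_accuracy",
    "Latest model accuracy by model type",
    labelnames=["model_type"],
    registry=registry,
)


@contextmanager
def track_training(model_type: str, problem_type: str):
    """Time the wrapped block and record it in the training histogram."""
    start = time.perf_counter()
    try:
        yield
    finally:
        ml_training_duration.labels(
            model_type=model_type, problem_type=problem_type
        ).observe(time.perf_counter() - start)


# Usage: wrap a training call, then update the accuracy gauge afterwards
with track_training("random_forest", "classification"):
    time.sleep(0.01)  # stand-in for model.fit(...)
ml_model_accuracy.labels(model_type="random_forest").set(0.93)
```

Note that `prometheus_client` also ships a built-in `Histogram.time()` context manager, which could replace the hand-rolled timing if no extra bookkeeping is needed.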
**Dependencies:**
- Sprint 8: API versioning (metrics paths)

**Risks:**
- Metrics overhead may impact performance
- High cardinality labels could cause issues

**Progress:**
- ✅ Installed prometheus-client library (v0.23.1)
- ✅ Created request metrics middleware in `app/middleware/metrics.py`
  - Request latency histogram (buckets: 0.1, 0.5, 1, 2, 5, 10s)
  - Request count by endpoint, method, and status code
  - Active requests gauge
- ✅ Created ML and business metrics collector in `app/services/metrics_collector.py`
  - ML metrics: training duration, prediction latency, model accuracy, dataset size
  - Business metrics: active users, datasets created, models trained, predictions made
- ✅ Registered metrics middleware and `/metrics` endpoint in `app/main.py`
- ✅ Created comprehensive tests with 100% coverage (35 tests)
  - `tests/test_middleware/test_metrics.py` (16 tests)
  - `tests/test_services/test_metrics_collector.py` (19 tests)
- ✅ All 205 unit tests pass (100% pass rate)

**Story 10.1 Status**: ✅ **COMPLETE**

---
### Story 10.2: Grafana Dashboards (Priority: 🟡, Points: 5)

**Status**: ⏳ **PENDING**

**As an** operator
**I want** Grafana dashboards for system visualization
**So that** I can understand system health at a glance

**Acceptance Criteria:**
- [ ] System health dashboard: latency, errors, throughput
- [ ] ML operations dashboard: training metrics, model performance
- [ ] Business metrics dashboard: user activity, usage trends
- [ ] Dashboards auto-refresh every 30s
- [ ] Alerts configured for critical thresholds

**Dependencies:**
- Story 10.1: Prometheus metrics

---
### Story 10.3: OpenAPI Spec Completion (Priority: 🟡, Points: 8)

**Status**: ⏳ **PENDING**

**As an** API consumer
**I want** complete OpenAPI documentation
**So that** I can integrate with the API easily

**Acceptance Criteria:**
- [ ] All endpoints documented with request/response schemas
- [ ] Authentication documented (JWT bearer token)
- [ ] Error responses documented with examples
- [ ] Interactive API docs at `/docs` endpoint
- [ ] OpenAPI spec downloadable as JSON/YAML

**Dependencies:**
- Sprint 8: API versioning complete

---
### Story 10.4: Complete Integration Tests (Priority: 🟡, Points: 5)

**Status**: ⏳ **PENDING**

**As a** developer
**I want** complete integration test coverage
**So that** service interactions are validated

**Acceptance Criteria:**
- [ ] S3 integration tests cover upload/download/delete
- [ ] MongoDB integration tests cover CRUD operations
- [ ] OpenAI integration tests cover analysis workflows
- [ ] End-to-end workflow integration tests pass
- [ ] Integration test coverage >80%

**Dependencies:**
- Sprint 9.3: Integration test fixtures

---
### Story 10.5: Monitoring Runbook (Priority: 🟢, Points: 1)

**Status**: ⏳ **PENDING**

**As an** on-call engineer
**I want** monitoring runbook documentation
**So that** I can respond to incidents effectively

**Acceptance Criteria:**
- [ ] Alert response procedures documented
- [ ] Common issue troubleshooting guide
- [ ] Metric interpretation guide
- [ ] Escalation procedures defined

**Dependencies:**
- Stories 10.1 and 10.2 complete

---
## Sprint Validation Gates

- [ ] Prometheus metrics exposed and accurate
- [ ] 3 Grafana dashboards operational
- [ ] OpenAPI spec complete at `/docs`
- [ ] Integration tests passing with >80% coverage
- [ ] Monitoring runbook reviewed by team
- [ ] All documentation updated

## Progress Tracking

**Daily Updates:**

### Day 1 (2025-10-09)
- ✅ Completed Story 10.1: Prometheus Metrics Integration (8 points)
  - Implemented request metrics middleware with latency, count, and active request tracking
  - Implemented ML metrics collector for training, prediction, and model metrics
  - Implemented business metrics collector for users, datasets, models, and predictions
  - Created `/metrics` endpoint for Prometheus scraping
  - Wrote 35 comprehensive tests with 100% code coverage
  - All 205 unit tests passing
- Next: Story 10.2 (Grafana Dashboards) or Story 10.3 (OpenAPI Spec Completion)

---
## Sprint Retrospective (To be completed)

**What went well:**
- TBD

**What to improve:**
- TBD

**Action items for Sprint 11:**
- TBD

---

**Last Updated**: 2025-10-09
**Maintained By**: Development team

apps/backend/.coverage

0 Bytes
Binary file not shown.

apps/backend/app/main.py

Lines changed: 16 additions & 0 deletions

```diff
@@ -32,8 +32,11 @@
 from motor.motor_asyncio import AsyncIOMotorClient
 from fastapi import FastAPI
 from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import Response

 from app.middleware.api_version import APIVersionMiddleware
+from app.middleware.metrics import MetricsMiddleware, get_metrics
+from prometheus_client import CONTENT_TYPE_LATEST
 from app.api.routes import (
     health,
     user_data,
@@ -127,6 +130,9 @@ async def lifespan(app: FastAPI):
 # ✅ Apply API versioning middleware
 app.add_middleware(APIVersionMiddleware)

+# ✅ Apply Prometheus metrics middleware
+app.add_middleware(MetricsMiddleware)
+
 # ✅ Include routers
 # Health check routes at root level (no version prefix)
 app.include_router(health.router, tags=["health"])
@@ -225,3 +231,13 @@ async def lifespan(app: FastAPI):
 @app.get("/")
 async def root():
     return {"message": "Welcome to the Narrative Modeling API"}
+
+
+@app.get("/metrics")
+async def metrics():
+    """
+    Prometheus metrics endpoint.
+
+    Returns metrics in Prometheus text exposition format for scraping.
+    """
+    return Response(content=get_metrics(), media_type=CONTENT_TYPE_LATEST)
```
apps/backend/app/middleware/metrics.py

Lines changed: 127 additions & 0 deletions

```python
# apps/backend/app/middleware/metrics.py
"""
Prometheus metrics middleware for FastAPI.

Tracks:
- Request latency histogram (buckets: 0.1, 0.5, 1, 2, 5, 10s)
- Request count by endpoint, method, and status code
- Active request gauge

Usage:
    from app.middleware.metrics import MetricsMiddleware, metrics_registry

    app.add_middleware(MetricsMiddleware)

    @app.get("/metrics")
    async def metrics():
        return Response(
            content=generate_latest(metrics_registry),
            media_type=CONTENT_TYPE_LATEST,
        )
"""

import time
from typing import Callable

from prometheus_client import (
    CollectorRegistry,
    Counter,
    Histogram,
    Gauge,
    generate_latest,
    CONTENT_TYPE_LATEST,
)
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response
from starlette.types import ASGIApp

# Create a custom registry for application metrics
metrics_registry = CollectorRegistry()

# Request latency histogram with specified buckets
request_latency = Histogram(
    name="http_request_duration_seconds",
    documentation="HTTP request latency in seconds",
    labelnames=["method", "endpoint", "status_code"],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0),
    registry=metrics_registry,
)

# Request counter by endpoint, method, and status
request_count = Counter(
    name="http_requests_total",
    documentation="Total HTTP requests",
    labelnames=["method", "endpoint", "status_code"],
    registry=metrics_registry,
)

# Active requests gauge
active_requests = Gauge(
    name="http_requests_active",
    documentation="Number of active HTTP requests",
    labelnames=["method", "endpoint"],
    registry=metrics_registry,
)


class MetricsMiddleware(BaseHTTPMiddleware):
    """
    Middleware to collect Prometheus metrics for HTTP requests.

    Tracks request latency, count, and active requests for all endpoints
    except the /metrics endpoint itself to avoid metric pollution.
    """

    def __init__(self, app: ASGIApp):
        super().__init__(app)

    async def dispatch(self, request: Request, call_next: Callable) -> Response:
        # Skip metrics collection for the metrics endpoint itself
        if request.url.path == "/metrics":
            return await call_next(request)

        # Extract endpoint path (use route path if available, otherwise URL path)
        endpoint = request.url.path
        method = request.method

        # Track active requests
        active_requests.labels(method=method, endpoint=endpoint).inc()

        # Start timing
        start_time = time.time()

        try:
            # Process the request
            response = await call_next(request)
            status_code = response.status_code
        except Exception:
            # Track errors as 500
            status_code = 500
            raise
        finally:
            # Calculate latency
            latency = time.time() - start_time

            # Record metrics
            request_latency.labels(
                method=method, endpoint=endpoint, status_code=status_code
            ).observe(latency)

            request_count.labels(
                method=method, endpoint=endpoint, status_code=status_code
            ).inc()

            # Decrement active requests
            active_requests.labels(method=method, endpoint=endpoint).dec()

        return response


def get_metrics() -> bytes:
    """
    Generate Prometheus metrics in the exposition format.

    Returns:
        bytes: Metrics in Prometheus text format
    """
    return generate_latest(metrics_registry)
```
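The business-metrics half of `app/services/metrics_collector.py` is not shown in this diff. Based on the commit description, a minimal sketch might look like the following; the helper function names are assumptions, and only the metric names come from the commit's naming conventions:

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

# Isolated registry for this sketch
registry = CollectorRegistry()

business_active_users = Gauge(
    "business_active_users",
    "Active users within a time window",
    labelnames=["window"],  # day / week / month
    registry=registry,
)

business_datasets_created = Counter(
    "business_datasets_created_total",
    "Datasets created, by user",
    labelnames=["user_id"],
    registry=registry,
)


def set_active_users(window: str, count: int) -> None:
    """Set the active-users gauge for a given window (day/week/month)."""
    business_active_users.labels(window=window).set(count)


def record_dataset_created(user_id: str) -> None:
    """Increment the per-user dataset-creation counter."""
    business_datasets_created.labels(user_id=user_id).inc()


# Usage
set_active_users("day", 42)
record_dataset_created("user-123")
```

Note the tie-in with the Risks item in the sprint doc: a per-user label like `user_id` is high-cardinality, so a production version would need to bound or bucket it.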
