Story Statement
As an enterprise procurement manager
I want SLA reporting and service level guarantees
So that I can validate platform performance against contract requirements
Where: Knowledge service (SLA reporting layer)
Epic Context
Parent Epic: Platform Hardening & Enterprise Readiness #68
Status: Refined
Priority: P2 (Could-Have)
Status Workflow
Refined: Story is detailed, estimated, and ready for development
In Progress: Story is actively being developed
Done: Story delivered and accepted
Acceptance Criteria
Functional Requirements
Given the knowledge service has been running for 30 days
When an admin calls GET /api/v1/organizations/acme/sla/current
Then the service returns:
{ "uptime_percent": 99.95, "target": 99.9, "status": "meeting_sla", "period": "2026-02-01/2026-02-28", "incidents": 1, "total_downtime_minutes": 22 }
Given an admin requests an SLA report
When they call GET /api/v1/organizations/acme/sla/report?period=2026-Q1
Then the service returns: monthly uptime breakdown, latency SLA compliance (p95 < target), incident summary (count, duration, impact), error rate compliance, overall SLA score
Given uptime drops below 99.9% in the current period
When the SLA calculation runs
Then the dashboard shows "SLA breach" status and an alert fires with severity "critical"
Given the service experiences an outage
When an incident occurs (health probe fails for >1 minute)
Then the incident is automatically recorded: start time, end time, duration, affected endpoints, root cause (if known)
Given an admin needs to share the SLA report
When they call GET /api/v1/organizations/acme/sla/report?period=2026-Q1&format=pdf
Then the service generates a downloadable PDF report with charts and summary
Business Rules
SLA breach triggers critical alert + incident record
Service credit calculation: for each 0.1% below target, X% credit (configurable)
Reports available for current month and up to 12 months history
Edge Cases and Error Handling
No data for requested period: Return "No SLA data available for this period"
Partial month (current month): Calculate pro-rata uptime from month start to now
Planned downtime window: Exclude from downtime calculation; require admin to pre-register planned maintenance
Monitoring data gaps: Flag gaps as "unknown" in SLA report; don't count as uptime or downtime
Definition of Done Checklist
Development Completion
All 5 acceptance criteria implemented and verified
SLA current status endpoint
SLA report generation (JSON + PDF)
Incident auto-detection and recording
SLA breach alerting
Planned downtime registration and exclusion
Unit tests for SLA calculation, credit computation
Integration tests for incident detection and reporting
Quality Assurance
SLA calculation accuracy verified against manual computation
PDF report renders correctly with charts
Planned downtime correctly excluded from calculations
Story Sizing and Sprint Readiness
Refined Story Points
Final Story Points: L(5)
Confidence Level: Medium
Sizing Justification: SLA calculation from monitoring data, report template, incident recording, PDF generation. Moderate effort with known patterns. PDF generation may add complexity.
Sprint Capacity Validation
Sprint Fit Assessment: Fits in single sprint
Total Effort Assessment: Yes
Dependencies and Coordination
Story Dependencies
Prerequisite Stories: #164 (Monitoring — health probes, metrics), #168 (Performance Analytics — shares analytics infra)
Dependent Stories: None
Validation and Testing Strategy
Acceptance Testing Approach
Testing Methods: Unit tests for SLA math (uptime %, credit calculation); integration tests: simulate outage → verify incident recorded → verify SLA report reflects downtime
Test Data Requirements: Simulated health probe failures, planned downtime records
Environment Requirements: Mock monitoring data, PDF generation library
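The credit calculation named in the unit-test scope (for each 0.1% below target, X% credit) could be sketched as below. This is only an illustration under stated assumptions: the function name, the parameter names, and the choice to count only completed 0.1% steps are hypothetical and not fixed by the story. Decimal arithmetic is used because binary floats can misjudge a step boundary such as 99.9 - 99.7.

```python
from decimal import Decimal, ROUND_FLOOR


def service_credit_percent(uptime_percent, target_percent, credit_per_step,
                           step_percent="0.1"):
    """Return the service credit (in percent) owed for an SLA shortfall.

    Assumption: only completed steps of `step_percent` below the target
    earn credit; a shortfall smaller than one step earns nothing.
    """
    uptime = Decimal(str(uptime_percent))
    target = Decimal(str(target_percent))
    shortfall = target - uptime
    if shortfall <= 0:
        return Decimal("0")  # SLA met, no credit owed
    steps = (shortfall / Decimal(str(step_percent))).to_integral_value(
        rounding=ROUND_FLOOR
    )
    return steps * Decimal(str(credit_per_step))


# Two completed 0.1% steps below a 99.9% target at 5% credit per step.
credit = service_credit_percent(99.7, 99.9, 5)  # Decimal("10")
```

Whether a partial step should round up instead is a contract question; switching the rounding mode is a one-line change.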
Notes
Refinement Insights: SLA reporting is primarily a presentation layer over monitoring and incident data. The hardest part is incident auto-detection and planned downtime exclusion logic.
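A minimal sketch of the planned downtime exclusion logic, assuming outages and planned maintenance windows are available as (start, end) datetime pairs. The function names are hypothetical. The sketch follows the story's formula, (total_minutes - unplanned_downtime_minutes) / total_minutes * 100, keeps planned windows in the total, and does not handle "unknown" monitoring gaps.

```python
from datetime import datetime, timezone

Interval = tuple[datetime, datetime]


def _clip_and_merge(intervals: list[Interval], lo: datetime, hi: datetime) -> list[Interval]:
    """Clip intervals to [lo, hi] and merge any that overlap."""
    clipped = sorted((max(s, lo), min(e, hi)) for s, e in intervals if s < hi and e > lo)
    merged: list[Interval] = []
    for s, e in clipped:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def _total_minutes(intervals: list[Interval]) -> float:
    return sum((e - s).total_seconds() for s, e in intervals) / 60


def _overlap_minutes(a: list[Interval], b: list[Interval]) -> float:
    """Minutes shared by two lists of disjoint intervals."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo < hi:
                total += (hi - lo).total_seconds() / 60
    return total


def uptime_percent(period_start: datetime, period_end: datetime,
                   outages: list[Interval], planned_windows: list[Interval]) -> float:
    """Uptime for the period, ignoring downtime inside planned windows."""
    total_minutes = (period_end - period_start).total_seconds() / 60
    if total_minutes <= 0:
        raise ValueError("period must have positive length")
    down = _clip_and_merge(outages, period_start, period_end)
    planned = _clip_and_merge(planned_windows, period_start, period_end)
    unplanned = _total_minutes(down) - _overlap_minutes(down, planned)
    return (total_minutes - unplanned) / total_minutes * 100


# Usage: one 22-minute outage in February 2026, no planned maintenance.
feb_start = datetime(2026, 2, 1, tzinfo=timezone.utc)
feb_end = datetime(2026, 3, 1, tzinfo=timezone.utc)
outage = (datetime(2026, 2, 10, 10, 0, tzinfo=timezone.utc),
          datetime(2026, 2, 10, 10, 22, tzinfo=timezone.utc))
february_uptime = round(uptime_percent(feb_start, feb_end, [outage], []), 2)
```

Passing the current time as period_end gives the pro-rata figure for a partial month.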
Technical Analysis
Implementation Approach
Technical Strategy: Calculate SLA from health probe history (Prometheus up metric). Incident detection: detect consecutive health probe failures >1 min. Report generation: aggregate metrics into report template. PDF: use puppeteer or pdfkit for PDF rendering.
Key Components: SLA calculator, incident detector/recorder, planned downtime registry, report generator (JSON + PDF), SLA breach alerter
Data Flow: Prometheus health metrics → SLA calculator → check breach → alert if needed. On report request: aggregate → render → respond.
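The incident detection step in the strategy above might look like the following sketch. It assumes probe results arrive as time-ordered (timestamp, ok) pairs, for example sampled from the Prometheus up metric. The Incident class, the function name, and the rule that an incident ends at the first successful probe are illustrative assumptions, not part of the story.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Incident:
    started_at: datetime
    ended_at: Optional[datetime]  # None while the outage is still ongoing
    auto_detected: bool = True


def detect_incidents(samples: list[tuple[datetime, bool]],
                     min_duration: timedelta = timedelta(minutes=1)) -> list[Incident]:
    """Turn runs of failed probes longer than min_duration into incidents.

    Assumption: a run starts at its first failed probe and ends at the
    first successful probe that follows it.
    """
    incidents: list[Incident] = []
    fail_start: Optional[datetime] = None
    for ts, ok in samples:
        if not ok and fail_start is None:
            fail_start = ts  # first failure of a new run
        elif ok and fail_start is not None:
            if ts - fail_start > min_duration:
                incidents.append(Incident(fail_start, ts))
            fail_start = None  # run is over, short blips are dropped
    # A failing run that reaches the last sample is reported as ongoing.
    if fail_start is not None and samples[-1][0] - fail_start > min_duration:
        incidents.append(Incident(fail_start, None))
    return incidents
```

Fields such as affected endpoints and root cause from the incidents table would be filled in by the recorder; this sketch covers only the detection rule.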
Technical Requirements
incidents(id UUID PK, org_id UUID FK, started_at TIMESTAMPTZ, ended_at TIMESTAMPTZ, duration_minutes INTEGER, affected_endpoints TEXT[], auto_detected BOOLEAN, root_cause TEXT, created_at TIMESTAMPTZ)
planned_downtime(id UUID PK, org_id UUID FK, start TIMESTAMPTZ, end TIMESTAMPTZ, reason TEXT, created_by UUID)
(total_minutes - unplanned_downtime_minutes) / total_minutes * 100
puppeteer for HTML-to-PDF or pdfkit for programmatic generation
Technical Risks and Mitigation
Spike Requirements
Required Spikes: Evaluate PDF generation approach: puppeteer vs pdfkit vs defer PDF to v2