
SLA Reporting and Service Level Guarantees #169

@rucka

Description


Story Statement

As an enterprise procurement manager
I want SLA reporting and service level guarantees
So that I can validate platform performance against contract requirements

Where: Knowledge service — SLA reporting layer

Epic Context

Parent Epic: Platform Hardening & Enterprise Readiness #68
Status: Refined
Priority: P2 (Could-Have)

Status Workflow

  • Refined: Story is detailed, estimated, and ready for development
  • In Progress: Story is actively being developed
  • Done: Story delivered and accepted

Acceptance Criteria

Functional Requirements

  1. Given the knowledge service has been running for 30 days
    When an admin calls GET /api/v1/organizations/acme/sla/current
    Then the service returns: { "uptime_percent": 99.95, "target": 99.9, "status": "meeting_sla", "period": "2026-02-01/2026-02-28", "incidents": 1, "total_downtime_minutes": 22 }

  2. Given an admin requests an SLA report
    When they call GET /api/v1/organizations/acme/sla/report?period=2026-Q1
    Then the service returns: monthly uptime breakdown, latency SLA compliance (p95 < target), incident summary (count, duration, impact), error rate compliance, overall SLA score

  3. Given uptime drops below 99.9% in the current period
    When the SLA calculation runs
    Then the dashboard shows "SLA breach" status and an alert fires with severity "critical"

  4. Given the service experiences an outage
    When an incident occurs (health probe fails for >1 minute)
    Then the incident is automatically recorded: start time, end time, duration, affected endpoints, root cause (if known)

  5. Given an admin needs to share the SLA report
    When they call GET /api/v1/organizations/acme/sla/report?period=2026-Q1&format=pdf
    Then the service generates a downloadable PDF report with charts and summary

Business Rules

  • SLA metrics: uptime % (target 99.9%), latency p95 (target <500ms), error rate (target <1%)
  • Measurement window: calendar month (SLA resets on 1st of each month)
  • Planned downtime excluded from SLA calculation (must be pre-announced)
  • Incident auto-detection: based on health probe failures (from Production Monitoring and Alerting #164)
  • SLA breach triggers critical alert + incident record
  • Service credit calculation: for each 0.1% below target, X% credit (configurable)
  • Reports available for current month and up to 12 months history
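The service-credit rule above ("for each 0.1% below target, X% credit") can be sketched as a small pure function. The function name, the 5%-per-tenth default standing in for the configurable "X%", and the 100% cap are illustrative assumptions, not part of the story:

```typescript
// Hypothetical sketch of the service-credit business rule: for each
// FULL 0.1% of uptime below target, grant `creditPerTenthPercent`
// percent credit, capped at `capPercent`. Defaults are assumptions.
function serviceCreditPercent(
  uptimePercent: number,
  targetPercent: number = 99.9,
  creditPerTenthPercent: number = 5, // the configurable "X%"
  capPercent: number = 100,
): number {
  if (uptimePercent >= targetPercent) return 0; // SLA met: no credit
  const shortfall = targetPercent - uptimePercent;
  // Round at the third decimal before flooring to avoid float drift
  // (e.g. 99.9 - 99.75 = 0.1499999... must still count as one tenth).
  const fullTenths = Math.floor(Math.round(shortfall * 1000) / 100);
  return Math.min(fullTenths * creditPerTenthPercent, capPercent);
}
```

For example, 99.75% uptime against a 99.9% target is one full tenth of a percent short, so with X = 5 the credit is 5%.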

Edge Cases and Error Handling

  • No data for requested period: Return "No SLA data available for this period"
  • Partial month (current month): Calculate pro-rata uptime from month start to now
  • Planned downtime window: Exclude from downtime calculation; require admin to pre-register planned maintenance
  • Monitoring data gaps: Flag gaps as "unknown" in SLA report; don't count as uptime or downtime
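One way to honor all four edge cases in a single pass is to classify each minute of the period and compute uptime only over observed minutes; a minimal sketch, where the type and function names are assumptions:

```typescript
type MinuteState = "up" | "down" | "planned" | "unknown";

// Uptime over observed minutes only: planned-downtime minutes are
// excluded per the business rules, and monitoring gaps ("unknown")
// count as neither uptime nor downtime. For the partial (current)
// month, callers pass only minutes from month start to now, which
// yields the pro-rata figure. Returns null when there is no data,
// mapping to the "No SLA data available for this period" response.
function uptimePercent(states: MinuteState[]): number | null {
  const observed = states.filter((s) => s === "up" || s === "down");
  if (observed.length === 0) return null;
  const up = observed.filter((s) => s === "up").length;
  return (up / observed.length) * 100;
}
```

Note that excluding "unknown" minutes from the denominator is what keeps monitoring gaps from inflating or deflating the reported figure.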

Definition of Done Checklist

Development Completion

  • All 5 acceptance criteria implemented and verified
  • SLA current status endpoint
  • SLA report generation (JSON + PDF)
  • Incident auto-detection and recording
  • SLA breach alerting
  • Planned downtime registration and exclusion
  • Unit tests for SLA calculation, credit computation
  • Integration tests for incident detection and reporting

Quality Assurance

  • SLA calculation accuracy verified against manual computation
  • PDF report renders correctly with charts
  • Planned downtime correctly excluded from calculations

Story Sizing and Sprint Readiness

Refined Story Points

Final Story Points: 5 (L)
Confidence Level: Medium
Sizing Justification: SLA calculation from monitoring data, report template, incident recording, PDF generation. Moderate effort with known patterns. PDF generation may add complexity.

Sprint Capacity Validation

Sprint Fit Assessment: Fits in single sprint
Total Effort Within Sprint Capacity: Yes

Dependencies and Coordination

Story Dependencies

Prerequisite Stories: #164 (Monitoring — health probes, metrics), #168 (Performance Analytics — shares analytics infra)
Dependent Stories: None

Validation and Testing Strategy

Acceptance Testing Approach

Testing Methods: Unit tests for SLA math (uptime %, credit calculation); integration tests: simulate outage → verify incident recorded → verify SLA report reflects downtime
Test Data Requirements: Simulated health probe failures, planned downtime records
Environment Requirements: Mock monitoring data, PDF generation library

Notes

Refinement Insights: SLA reporting is primarily a presentation layer over monitoring and incident data. The hardest part is incident auto-detection and planned downtime exclusion logic.

Technical Analysis

Implementation Approach

Technical Strategy: Calculate SLA from health probe history (Prometheus up metric). Incident detection: flag runs of consecutive health probe failures lasting >1 minute. Report generation: aggregate metrics into a report template. PDF: render via puppeteer (HTML-to-PDF) or pdfkit (programmatic).
Key Components: SLA calculator, incident detector/recorder, planned downtime registry, report generator (JSON + PDF), SLA breach alerter
Data Flow: Prometheus health metrics → SLA calculator → check breach → alert if needed. On report request: aggregate → render → respond.
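The incident-detection step in this flow could look like the following; the sample shape, interface names, and default threshold are assumptions, with only the ">1 minute of consecutive failures" rule fixed by AC #4:

```typescript
interface ProbeSample { ts: number; healthy: boolean } // ts in epoch ms
interface DetectedIncident { startedAt: number; endedAt: number; durationMinutes: number }

// Scan chronological health-probe samples and record an incident for
// each run of failures lasting longer than thresholdMs (default
// 1 minute, per AC #4). A run still open at the end of the series is
// skipped here; a real detector would track it as an ongoing incident.
function detectIncidents(samples: ProbeSample[], thresholdMs = 60_000): DetectedIncident[] {
  const incidents: DetectedIncident[] = [];
  let runStart: number | null = null;
  for (const s of samples) {
    if (!s.healthy) {
      if (runStart === null) runStart = s.ts; // failure run begins
    } else if (runStart !== null) {
      if (s.ts - runStart > thresholdMs) {
        incidents.push({
          startedAt: runStart,
          endedAt: s.ts,
          durationMinutes: Math.round((s.ts - runStart) / 60_000),
        });
      }
      runStart = null; // recovered; blips below threshold are dropped
    }
  }
  return incidents;
}
```

Requiring the run to exceed the threshold before recording anything is what implements the false-positive mitigation listed under Technical Risks.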

Technical Requirements

  • incidents(id UUID PK, org_id UUID FK, started_at TIMESTAMPTZ, ended_at TIMESTAMPTZ, duration_minutes INTEGER, affected_endpoints TEXT[], auto_detected BOOLEAN, root_cause TEXT, created_at TIMESTAMPTZ)
  • planned_downtime(id UUID PK, org_id UUID FK, start TIMESTAMPTZ, end TIMESTAMPTZ, reason TEXT, created_by UUID)
  • Uptime formula: (total_minutes - unplanned_downtime_minutes) / total_minutes * 100
  • PDF: puppeteer for HTML-to-PDF or pdfkit for programmatic generation
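A quick numeric check of the uptime formula above against the AC #1 sample response (February 2026 has 28 days):

```typescript
// Uptime formula from the requirements:
// (total_minutes - unplanned_downtime_minutes) / total_minutes * 100
function monthlyUptimePercent(totalMinutes: number, unplannedDowntimeMinutes: number): number {
  return ((totalMinutes - unplannedDowntimeMinutes) / totalMinutes) * 100;
}

const febMinutes = 28 * 24 * 60; // 40,320 minutes in February 2026
const uptime = monthlyUptimePercent(febMinutes, 22); // 22 downtime minutes, as in AC #1
// rounded to two decimals this is 99.95, matching the sample response
```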

Technical Risks and Mitigation

  • PDF generation adds a heavy dependency (puppeteer). Impact: Medium; Probability: Medium; Mitigation: consider pdfkit (lighter) or defer PDF and ship JSON-only in v1
  • Incident-detection false positives. Impact: Medium; Probability: Medium; Mitigation: require consecutive failures for more than 1 minute to reduce noise

Spike Requirements

Required Spikes: Evaluate PDF generation approach: puppeteer vs pdfkit vs defer PDF to v2

Metadata


Assignees: No one assigned

Labels: user story (work item representing a user story)
