
SLA Reporting and Service Level Guarantees #169

@rucka

Description


Story Statement

As an enterprise procurement manager
I want SLA reporting and service level guarantees
So that I can validate platform performance against contract requirements

Where: Knowledge service — SLA reporting layer

Epic Context

Parent Epic: Platform Hardening & Enterprise Readiness #68
Status: Refined
Priority: P2 (Could-Have)

Status Workflow

  • Refined: Story is detailed, estimated, and ready for development
  • In Progress: Story is actively being developed
  • Done: Story delivered and accepted

Acceptance Criteria

Functional Requirements

  1. Given the knowledge service has been running for 30 days
    When an admin calls GET /api/v1/organizations/acme/sla/current
    Then the service returns: { "uptime_percent": 99.95, "target": 99.9, "status": "meeting_sla", "period": "2026-02-01/2026-02-28", "incidents": 1, "total_downtime_minutes": 22 }

  2. Given an admin requests an SLA report
    When they call GET /api/v1/organizations/acme/sla/report?period=2026-Q1
    Then the service returns: monthly uptime breakdown, latency SLA compliance (p95 < target), incident summary (count, duration, impact), error rate compliance, overall SLA score

  3. Given uptime drops below 99.9% in the current period
    When the SLA calculation runs
    Then the dashboard shows "SLA breach" status and an alert fires with severity "critical"

  4. Given the service experiences an outage
    When an incident occurs (health probe fails for >1 minute)
    Then the incident is automatically recorded: start time, end time, duration, affected endpoints, root cause (if known)

  5. Given an admin needs to share the SLA report
    When they call GET /api/v1/organizations/acme/sla/report?period=2026-Q1&format=pdf
    Then the service generates a downloadable PDF report with charts and summary

Business Rules

  • SLA metrics: uptime % (target 99.9%), latency p95 (target <500ms), error rate (target <1%)
  • Measurement window: calendar month (SLA resets on 1st of each month)
  • Planned downtime excluded from SLA calculation (must be pre-announced)
  • Incident auto-detection: based on health probe failures (from Production Monitoring and Alerting #164)
  • SLA breach triggers critical alert + incident record
  • Service credit calculation: for each 0.1% below target, X% credit (configurable)
  • Reports available for current month and up to 12 months history
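The service-credit rule above ("for each 0.1% below target, X% credit") can be sketched as a small pure function. The function name, the 5%-per-tenth default standing in for the configurable "X%", and the 100% cap are illustrative assumptions, not part of the story:

```typescript
// Hypothetical sketch of the service-credit business rule: for each
// FULL 0.1% of uptime below target, grant `creditPerTenthPercent`
// percent credit, capped at `capPercent`. Defaults are assumptions.
function serviceCreditPercent(
  uptimePercent: number,
  targetPercent: number = 99.9,
  creditPerTenthPercent: number = 5, // the configurable "X%"
  capPercent: number = 100,
): number {
  if (uptimePercent >= targetPercent) return 0; // SLA met: no credit
  const shortfall = targetPercent - uptimePercent;
  // Round at the third decimal before flooring to avoid float drift
  // (e.g. 99.9 - 99.75 = 0.1499999... must still count as one tenth).
  const fullTenths = Math.floor(Math.round(shortfall * 1000) / 100);
  return Math.min(fullTenths * creditPerTenthPercent, capPercent);
}
```

For example, 99.75% uptime against a 99.9% target is one full tenth of a percent short, so with X = 5 the credit is 5%.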

Edge Cases and Error Handling

  • No data for requested period: Return "No SLA data available for this period"
  • Partial month (current month): Calculate pro-rata uptime from month start to now
  • Planned downtime window: Exclude from downtime calculation; require admin to pre-register planned maintenance
  • Monitoring data gaps: Flag gaps as "unknown" in SLA report; don't count as uptime or downtime
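One way to honor all four edge cases in a single pass is to classify each minute of the period and compute uptime only over observed minutes; a minimal sketch, where the type and function names are assumptions:

```typescript
type MinuteState = "up" | "down" | "planned" | "unknown";

// Uptime over observed minutes only: planned-downtime minutes are
// excluded per the business rules, and monitoring gaps ("unknown")
// count as neither uptime nor downtime. For the partial (current)
// month, callers pass only minutes from month start to now, which
// yields the pro-rata figure. Returns null when there is no data,
// mapping to the "No SLA data available for this period" response.
function uptimePercent(states: MinuteState[]): number | null {
  const observed = states.filter((s) => s === "up" || s === "down");
  if (observed.length === 0) return null;
  const up = observed.filter((s) => s === "up").length;
  return (up / observed.length) * 100;
}
```

Note that excluding "unknown" minutes from the denominator is what keeps monitoring gaps from inflating or deflating the reported figure.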

Definition of Done Checklist

Development Completion

  • All 5 acceptance criteria implemented and verified
  • SLA current status endpoint
  • SLA report generation (JSON + PDF)
  • Incident auto-detection and recording
  • SLA breach alerting
  • Planned downtime registration and exclusion
  • Unit tests for SLA calculation, credit computation
  • Integration tests for incident detection and reporting

Quality Assurance

  • SLA calculation accuracy verified against manual computation
  • PDF report renders correctly with charts
  • Planned downtime correctly excluded from calculations

Story Sizing and Sprint Readiness

Refined Story Points

Final Story Points: 5 (L)
Confidence Level: Medium
Sizing Justification: SLA calculation from monitoring data, report template, incident recording, PDF generation. Moderate effort with known patterns. PDF generation may add complexity.

Sprint Capacity Validation

Sprint Fit Assessment: Fits in single sprint
Total Effort Within Sprint Capacity: Yes

Dependencies and Coordination

Story Dependencies

Prerequisite Stories: #164 (Monitoring — health probes, metrics), #168 (Performance Analytics — shares analytics infra)
Dependent Stories: None

Validation and Testing Strategy

Acceptance Testing Approach

Testing Methods: Unit tests for SLA math (uptime %, credit calculation); integration tests: simulate outage → verify incident recorded → verify SLA report reflects downtime
Test Data Requirements: Simulated health probe failures, planned downtime records
Environment Requirements: Mock monitoring data, PDF generation library

Notes

Refinement Insights: SLA reporting is primarily a presentation layer over monitoring and incident data. The hardest part is incident auto-detection and planned downtime exclusion logic.

Technical Analysis

Implementation Approach

Technical Strategy: Calculate SLA from health probe history (Prometheus up metric). Incident detection: flag runs of consecutive health probe failures lasting >1 minute. Report generation: aggregate metrics into a report template. PDF: render via puppeteer (HTML-to-PDF) or pdfkit (programmatic).
Key Components: SLA calculator, incident detector/recorder, planned downtime registry, report generator (JSON + PDF), SLA breach alerter
Data Flow: Prometheus health metrics → SLA calculator → check breach → alert if needed. On report request: aggregate → render → respond.
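The incident-detection step in this flow could look like the following; the sample shape, interface names, and default threshold are assumptions, with only the ">1 minute of consecutive failures" rule fixed by AC #4:

```typescript
interface ProbeSample { ts: number; healthy: boolean } // ts in epoch ms
interface DetectedIncident { startedAt: number; endedAt: number; durationMinutes: number }

// Scan chronological health-probe samples and record an incident for
// each run of failures lasting longer than thresholdMs (default
// 1 minute, per AC #4). A run still open at the end of the series is
// skipped here; a real detector would track it as an ongoing incident.
function detectIncidents(samples: ProbeSample[], thresholdMs = 60_000): DetectedIncident[] {
  const incidents: DetectedIncident[] = [];
  let runStart: number | null = null;
  for (const s of samples) {
    if (!s.healthy) {
      if (runStart === null) runStart = s.ts; // failure run begins
    } else if (runStart !== null) {
      if (s.ts - runStart > thresholdMs) {
        incidents.push({
          startedAt: runStart,
          endedAt: s.ts,
          durationMinutes: Math.round((s.ts - runStart) / 60_000),
        });
      }
      runStart = null; // recovered; blips below threshold are dropped
    }
  }
  return incidents;
}
```

Requiring the run to exceed the threshold before recording anything is what implements the false-positive mitigation listed under Technical Risks.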

Technical Requirements

  • incidents(id UUID PK, org_id UUID FK, started_at TIMESTAMPTZ, ended_at TIMESTAMPTZ, duration_minutes INTEGER, affected_endpoints TEXT[], auto_detected BOOLEAN, root_cause TEXT, created_at TIMESTAMPTZ)
  • planned_downtime(id UUID PK, org_id UUID FK, start TIMESTAMPTZ, end TIMESTAMPTZ, reason TEXT, created_by UUID)
  • Uptime formula: (total_minutes - unplanned_downtime_minutes) / total_minutes * 100
  • PDF: puppeteer for HTML-to-PDF or pdfkit for programmatic generation
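A quick numeric check of the uptime formula above against the AC #1 sample response (February 2026 has 28 days):

```typescript
// Uptime formula from the requirements:
// (total_minutes - unplanned_downtime_minutes) / total_minutes * 100
function monthlyUptimePercent(totalMinutes: number, unplannedDowntimeMinutes: number): number {
  return ((totalMinutes - unplannedDowntimeMinutes) / totalMinutes) * 100;
}

const febMinutes = 28 * 24 * 60; // 40,320 minutes in February 2026
const uptime = monthlyUptimePercent(febMinutes, 22); // 22 downtime minutes, as in AC #1
// rounded to two decimals this is 99.95, matching the sample response
```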

Technical Risks and Mitigation

  • PDF generation adds a heavy dependency (puppeteer). Impact: Medium; Probability: Medium; Mitigation: consider pdfkit (lighter) or defer PDF and ship JSON-only in v1
  • Incident-detection false positives. Impact: Medium; Probability: Medium; Mitigation: require consecutive failures for more than 1 minute to reduce noise

Spike Requirements

Required Spikes: Evaluate PDF generation approach: puppeteer vs pdfkit vs defer PDF to v2

Metadata


Assignees: No one assigned

Labels: user story (work item representing a user story)
