Milestones

[Discovery] Devise an approach to reduce AWS admin access
## Objective

Come up with a way to reduce admin access to any of GOV.UK's AWS accounts, while maintaining operational effectiveness.

[This is the discovery phase. Implementation of an approach will fall under a separate milestone.]

## Key results

- we understand current usage patterns of admin access and the needs of users for it
- we have clear objectives for implementation phase (including identifying any "quick wins")
- we have an estimate of how long implementation will take, and what people and budget we will need
No due date
8/9 issues closed
88% complete, 1 open, 8 closed
Build Concourse pipeline for GOV.UK CI/CD
Objective
---
Launch a dependable Concourse path with paved-road pipelines and delivery telemetry, so we can confidently measure and improve flow, reliability, and security.

Key Results (how we’ll know we’ve won)
---
* **Reliability** – 30 consecutive days without a Redis‑related high‑severity incident in migrated tenants.
* **Adoption** – ≥ 90 % of Redis‑using **application teams in all environments** adopt the Valkey ElastiCache pattern by **30 Sep 2025**.

Deliverables / Workstreams
---
* **Baseline & Telemetry**
  Establish a baseline (last 90 days) for:
  - DORA metrics for ≥ 80 % of services: deployment frequency, lead time for change, change failure rate, MTTR
  - Data quality: ≥ 95 % event match rate (commit, deploy), gaps documented

  Live team and service dashboards showing last deployment, commit message and user.
* **Concourse pipeline ready**
  - High-availability control plane, with worker autoscaling
  - Backup & restore drill completed
  - Migration runbooks published
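As an illustration of the baseline described under **Baseline & Telemetry**, the four DORA metrics could be derived from deployment events roughly as follows. This is a minimal sketch, not the team's telemetry pipeline: the `Deployment` record and its field names are hypothetical, and it assumes each event already carries the commit time, the deploy time and (for failed changes) the time service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical event record; the field names are assumptions."""
    service: str
    commit_time: datetime                      # when the deployed commit was made
    deploy_time: datetime                      # when it reached the environment
    failed: bool = False                       # did this change cause a failure?
    restored_time: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int = 90) -> dict:
    """Compute the four DORA metrics over a fixed baseline window."""
    if not deployments:
        raise ValueError("no deployments in the baseline window")

    lead_times = [d.deploy_time - d.commit_time for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Only failures with a recorded restore time contribute to MTTR.
    recoveries = [
        d.restored_time - d.deploy_time
        for d in failures
        if d.restored_time is not None
    ]

    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr": sum(recoveries, timedelta()) / len(recoveries) if recoveries else None,
    }
```

Using the median for lead time keeps a single slow change from skewing the baseline; whether the real dashboards use a median or a mean is a choice the milestone does not state.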
No due date
10/19 issues closed
52% complete, 9 open, 10 closed
♻️ Replace in-cluster Redis with ElastiCache Valkey
Objective
---
Make application caching on **GOV.UK** so resilient and hands‑off that Platform Engineering never has to nurse a Redis pod again, and application teams confidently cache at scale.

Key Results (how we’ll know we’ve won)
---
* **Reliability** – 30 consecutive days without a Redis‑related high‑severity incident in migrated tenants.
* **Adoption** – ≥ 90 % of Redis‑using **application teams in all environments** adopt the Valkey ElastiCache pattern by **30 Sep 2025**.

Deliverables / Workstreams
---
* **Baseline & Telemetry**
  Record the current Redis landscape (instances, versions, incident history). Ship Prometheus/Grafana dashboards.
* **Valkey ElastiCache Terraform module**
  Multi‑AZ, encrypted, tagged; CI tests & example usage. *Issue #1995* merged & validated in integration/staging.
* **Docs & Comms**
  “How to adopt Valkey” guide, migration playbook and rollback instructions. RFC → ADR with architecture‑board **sign‑off by end of Q2 (30 Sep 2025)**. Early comms so tenants can plan Q3 work.
* **Advocacy & Dogfooding**
  Platform engineers migrate their own in‑cluster Redis workloads first, then pair with first‑wave application teams; share wins & pain in fortnightly show‑and‑tell.

Out of scope (for now)
---
No changes to non‑Redis stateful services; no forced migration timelines; no alternative cache engines (Memcached, DAX).

---

### Progress snapshot (20 Jun 2025)
* **Terraform module partially validated** (*Issue #1995*).
* **Integration/staging migrations live** for Publishing API, Whitehall admin, Static, Search API – publishing tests green.
* Self‑service production migrations to follow once documentation is published.

## Key docs
- [migration process](https://docs.google.com/document/d/1C0XzBxVhgihRwtynskazclrGSwOi0IZ2UeRYtlG6p-M/edit?tab=t.0)
- [discovery](https://docs.google.com/document/d/1a5PZ8hRGdqnGijRnRjwrxBt-TNnNGdCfx36HNOKYjyU/edit?pli=1&tab=t.0)

---
No due date
2/7 issues closed
28% complete, 5 open, 2 closed
♻️ Move off Terraform Cloud
Objective
---
Eliminate vendor lock-in and cut Terraform IaC platform spend by ≥ 99 % by replacing Terraform Cloud with an encrypted, versioned S3 backend across all environments within six to nine months.

| #        | Key Result                                      | Baseline & Target                                             |
| -------- | ----------------------------------------------- | ------------------------------------------------------------- |
| **KR 1** | Reduce annual IaC cost                          |                                                               |
| **KR 2** | Migrate all **93** workspaces                   | *93/93* state files stored in S3; TF Cloud workspaces deleted |
| **KR 3** | Replace or retire every TF Cloud feature in use | Gap-analysis checklist 100 % green                            |
| **KR 4** | Ship docs & run-book, team confidence ≥ 8/10    | Survey of Platform Engineering (n ≤ 10)                       |

Context
---
[Migrate off Terraform Cloud](https://docs.google.com/document/d/1vTZ8qFpRcKKe2KfoIXJiosefNUt_n3OnFb-7yexlols/edit?pli=1&tab=t.0#heading=h.6riin7urbue3) and [ADR0014](https://github.com/alphagov/govuk-infrastructure/blob/main/docs/architecture/decisions/0014-replace-terraform-cloud.md) set out the context for the decision we've made, and how we'll go about performing the migration.

The very short version is this:

1. Audit
2. Prepare
3. Migrate
4. Educate
No due date
3/6 issues closed
50% complete, 3 open, 3 closed
♻️ [Discovery]: Local Development Experience Uplift
Make the GOV.UK local-development stack (govuk-docker ± supporting tooling) so intuitive, fast and trustworthy that all platform-team engineers adopt it daily and tenant teams rate it a joy to use.

Key Results (how we’ll know we’ve won)
---
- Adoption: % of platform engineers who use the stack at least once per working day
- Satisfaction: improvement in the application teams' survey score (NPS) against a baseline (we have to take a baseline survey first)
- External contributions merged into any local-dev repo (govuk-docker, sample apps, docs)
- Confidence: all Docker images covered by CI tests (build + healthcheck)

Deliverables / workstreams
---
- Baseline & telemetry: Issue #️⃣ Create survey
- Image & compose refactor: smaller layers, dependency pinning, unified healthchecks.
- Developer ergonomics: VS Code Dev Containers, default Makefile targets, one-command bootstrap script.
- Docs & demos: self-guided “Getting started in 5 minutes”, recorded demo.
- Dogfooding & advocacy: each platform engineer builds one real feature using only the stack; share wins + pain in fortnightly show-and-tell.

Out of scope (for now)
---
No wholesale move away from Docker; no live-cluster architecture changes. (Everything else is fair game: tooling, docs, sample apps, tests.)
No due date
1/2 issues closed
50% complete, 1 open, 1 closed
🔨 Improve Platform Operations and Trust
By the end of Q4 2025, make our platform the service that application teams instinctively trust to stay up, stay responsive, and stay out of their way.

Success Indicators

- Monitor Mean Time To Recover (MTTR)
- Keep-the-Lights-On tickets ≤ **30 %** of stories in any sprint (rolling average).
- 100 % of re-architecture / “big bet” decisions have an ADR or decision log published ≤ 5 days after kickoff.

Notes

* The milestone is open-ended (no fixed due date); teams progress it alongside normal roadmap work.
* KTLO ratio script runs after every sprint retrospective, and the team is informed.
* Future (out of scope): cost-of-downtime KPI, change-failure-rate SLO, automated post-incident survey.
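The KTLO indicator above can be read in more than one way, so here is a minimal sketch of one interpretation: the mean of per-sprint KTLO ratios over the most recent sprints, compared against the 30 % target. The `ktlo` label name, the three-sprint window and the story shape (a dict with a `labels` list) are all assumptions for illustration, not a description of the script the team runs.

```python
KTLO_LABEL = "ktlo"   # hypothetical label used to mark Keep-the-Lights-On stories
TARGET = 0.30         # target from the success indicator: KTLO <= 30 % of stories


def ktlo_ratio(stories: list[dict]) -> float:
    """Fraction of one sprint's stories that carry the KTLO label."""
    if not stories:
        return 0.0
    ktlo = sum(1 for story in stories if KTLO_LABEL in story.get("labels", []))
    return ktlo / len(stories)


def rolling_ktlo_ratio(sprints: list[list[dict]], window: int = 3) -> float:
    """Mean KTLO ratio across the most recent `window` sprints."""
    if window < 1 or not sprints:
        raise ValueError("need at least one sprint and a window of at least 1")
    recent = sprints[-window:]
    return sum(ktlo_ratio(sprint) for sprint in recent) / len(recent)


def within_target(sprints: list[list[dict]], window: int = 3) -> bool:
    """True when the rolling KTLO ratio is at or below the target."""
    return rolling_ktlo_ratio(sprints, window) <= TARGET
```

Averaging per-sprint ratios weights every sprint equally regardless of size; pooling all stories across the window would be an equally defensible reading of "rolling average".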
No due date
83/105 issues closed
79% complete, 22 open, 83 closed
✨ Ephemeral GOV.UK Kubernetes Clusters (MVP)
### **Objective**
Provide a self‑service mechanism for platform engineers (and eventually product teams) to spin up short‑lived “ephemeral” GOV.UK Kubernetes clusters that are representative of live, suitable for risky app changes and add‑on experiments, and cost < £50 per day. Clusters must be easy to create, verify and destroy, without impacting the permanent prod/staging/integration fleet.

---

### **Success Indicators**
1. **Cluster Creation & Destruction**
   * A single `make epi-cluster` command provisions an AWS EKS cluster via Terraform in < 30 mins.
   * Creators can destroy their cluster with `make epi-destroy`, and no cluster survives the weekend (flagged every Friday 15:00‑17:00).
2. **App Works**
   * All core GOV.UK apps (frontend, publishing‑api, content‑store, search, router, etc.) deploy successfully and return valid responses using synthetic data.
3. **Observability Parity**
   * Prometheus, Grafana, Alertmanager, Argo CD, external‑dns, and cert‑manager are installed automatically. Alerts remain local (muted Slack channel `#ephemeral-alerts`).
4. **Cost & Tagging**
   * Cost allocation tags (`ephemeral=true`, owner, creation‑timestamp) are applied; daily burn rate remains < £50 and < 5 concurrent clusters.
5. **Documentation Shipped**
   * ADR, runbook and architecture diagram published in the platform repo.
6. **CI Green Path**
   * CI pipeline proves cluster bootstrap by running smoke tests and producing a green GitHub Action check.

---

### **Acceptance Criteria**
* **EC‑01 Terraform Module** – Parameterised module builds an EKS cluster + scratch AWS resources (RDS, S3, etc.) in a single AZ.
* **EC‑02 Make Wrapper** – `make epi-cluster / epi-destroy` wraps Terraform; outputs kubeconfig and cluster URL with real Route 53 hostname `epi-<user>-<date>.govuk.test.gov.uk`.
* **EC‑03 DNS & TLS** – external‑dns manages records in the existing Route 53 zone; cert‑manager issues Let’s Encrypt *staging* certificates.
* **EC‑04 App Deployment** – Helm/Argo CD sync deploys all core apps; synthetic seed data committed in‑repo.
* **EC‑05 Observability Stack** – Prometheus, Grafana dashboards, Alertmanager routed to `#ephemeral-alerts` Slack webhook.
* **EC‑06 Cost & TTL Alerting** – Cost tags applied; Friday‑afternoon GitHub Action posts a warning if any cluster age > 72 h.
* **EC‑07 CI Proof‑of‑Concept** – GitHub Action job that spins a cluster on a feature branch, runs smoke tests, and destroys on success.
* **EC‑08 Docs** – ADR explaining the need, runbook for users, and architecture diagram merged.

Completion of EC‑01 → EC‑08 constitutes fulfilment; Lead Product Manager and Lead Engineer sign off.

---

### **Notes**
* Integration remains a **production** environment for the platform; the milestone provides a safe pre‑integration playground.
* Initial cluster lifetime policy is **flag‑only**; automatic deletion can be revisited once user trust grows.
* Future work (out of scope): approval workflow, multi‑AZ resilience, log aggregation, stubbed third‑party services.
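The flag-only TTL check in EC‑06 could look something like the sketch below: given each cluster's tags, report the ephemeral ones older than 72 hours. It is an assumption-laden illustration rather than the actual GitHub Action. The cluster dict shape and the ISO 8601 format of the `creation-timestamp` tag value are hypothetical, and fetching the clusters and posting the warning are left out.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=72)  # threshold from EC-06


def stale_clusters(clusters: list[dict], now: datetime) -> list[str]:
    """Return the names of ephemeral clusters older than MAX_AGE.

    Each cluster is a dict with "name" and "tags" keys (a hypothetical shape).
    Clusters without the ephemeral=true tag are never flagged.
    """
    stale = []
    for cluster in clusters:
        tags = cluster.get("tags", {})
        if tags.get("ephemeral") != "true":
            continue  # permanent clusters are out of scope for this check
        # Assumes the tag holds an ISO 8601 timestamp with a UTC offset.
        created = datetime.fromisoformat(tags["creation-timestamp"])
        if now - created > MAX_AGE:
            stale.append(cluster["name"])
    return stale
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test; a scheduled job would call it with `datetime.now(timezone.utc)`.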
No due date
14/21 issues closed
66% complete, 7 open, 14 closed
♻️ Establish Sustainable PostgreSQL RDS Upgrade Process
## Objective

Implement a repeatable, clearly documented, and automated process to discover, assess, plan, and upgrade GOV.UK PostgreSQL RDS instances, reducing operational overhead and ensuring no instance runs unsupported or EOL database versions.

## Success Indicators

**1. Complete Visibility**
- All PostgreSQL 13 RDS instances identified, inventoried, and documented - DONE
- Inventory mechanism automated and running on a regular schedule.

**2. Defined and Approved Upgrade Workflow**
- Upgrade strategy (including decision criteria, methods such as blue/green vs. in-place, rollback plan, and communications strategy) is documented and approved by the Lead SRE - DONE

**3. Proven Upgrade Capability**
- At least one PostgreSQL 13 instance successfully upgraded in a test environment following the approved documented workflow. - DGU DONE
- Process validated through peer review and signed off by the Lead SRE. - IN PROGRESS

**4. Clear Documentation and Automation**
- Process clearly documented in run-books, easily accessible to the team. - DONE [developer runbook](https://docs.google.com/document/d/1mBx9wV4r55SXTD4Nag-ADljk8Hvx_4hqoYJ1kFgt5vc/edit?pli=1&tab=t.0#heading=h.fp6hiaqcco5f) & [sre runbook](https://docs.google.com/document/d/1j-Hhh2k-HyPuCjVgI5QbUq9U1JuQNC4AcWmk6zwY2wQ/edit?tab=t.0#heading=h.fp6hiaqcco5f)

**5. Ready for Future Upgrades**
- Evidence that the established upgrade process can be applied easily for future PostgreSQL major version upgrades (14, 15, 16, and beyond), with minimal adjustment.

## Acceptance Criteria

- [ ] **Discovery Tickets** are complete and signed off by Lead SRE - DONE
- [ ] **Upgrade Process Documentation** reviewed, approved, and published.
- [ ] **Automation Tooling** delivered and merged into mainline CI/CD pipelines.
- [ ] **Pilot Upgrade** successfully executed, demonstrating process effectiveness.
- [ ] **Service Teams** have been briefed and understand their responsibilities within the upgrade process.

## Notes

- This milestone aims not just to retire PostgreSQL 13 but to embed capability for continuous database upgrades.
- The measure of success includes establishing long-term operational resilience rather than just immediate compliance.
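The automated inventory under "Complete Visibility" might reduce to a filter like the one below once instance metadata has been fetched. This is a sketch only: the dicts mirror the `DBInstanceIdentifier`, `Engine` and `EngineVersion` fields of the RDS `DescribeDBInstances` response, the instance names are invented, and both the fetching step and the choice of minimum supported major version are assumptions left to the caller.

```python
def instances_needing_upgrade(instances: list[dict], min_supported_major: int) -> list[str]:
    """Return identifiers of PostgreSQL instances below the minimum major version.

    `instances` is a list of dicts shaped like entries in the RDS
    DescribeDBInstances response; non-PostgreSQL engines are ignored.
    """
    flagged = []
    for instance in instances:
        if instance["Engine"] != "postgres":
            continue  # only PostgreSQL RDS instances are in scope
        # EngineVersion looks like "13.15"; the major version is the first part.
        major = int(instance["EngineVersion"].split(".")[0])
        if major < min_supported_major:
            flagged.append(instance["DBInstanceIdentifier"])
    return sorted(flagged)


# Hypothetical inventory, for illustration only.
inventory = [
    {"DBInstanceIdentifier": "example-app-a", "Engine": "postgres", "EngineVersion": "13.15"},
    {"DBInstanceIdentifier": "example-app-b", "Engine": "postgres", "EngineVersion": "14.9"},
    {"DBInstanceIdentifier": "example-app-c", "Engine": "mysql", "EngineVersion": "8.0.35"},
]
print(instances_needing_upgrade(inventory, min_supported_major=14))
```

Taking the threshold as a parameter means the same check can be reused for later major versions with minimal adjustment, in line with the "Ready for Future Upgrades" indicator.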
No due date
16/16 issues closed
100% complete, 0 open, 16 closed
📈 Improve the Security Posture of the GOV.UK Platform
**Objective**
Enhance the GOV.UK Kubernetes platform’s security by completing the six tasks/epics, reducing the likelihood of breaches, and ensuring compliance with government standards.

**Success Indicators**
1. **No Critical or High-Risk Findings Remain Unaddressed**
   - All pen test issues categorised as high or critical have documented mitigations or accepted risk sign-offs.
2. **Established Threat Model and Risk Review Process**
   - A routine risk review forum is in place and actively maintained.
   - Threat modelling artefacts are finalised, shared, and regularly revisited.
3. **Documented and Communicated Policies**
   - A clear data retention policy is published and understood by all relevant teams.
   - RBAC and network segregation policies follow the principle of least privilege.
4. **Validation via Follow-Up Penetration Test**
   - A subsequent pen test is scheduled and completed, confirming the efficacy of the implemented changes.
   - Any newly identified risks are logged and prioritised without leaving critical gaps.
5. **Positive Security Posture Metrics**
   - Reduced open security issues in the backlog (especially high/medium priority).
   - Satisfactory internal/external audit results (where applicable).
6. **Domain Management is Migrated to Another Provider**
   - Improve manageability of the GOV.UK TLD and related domains by migrating to a more suitable Registrar.

**Acceptance Criteria**
- The six epics (Pen Test Findings, Threat Modelling, Regular Risk Review, Data Retention Policy, Kubernetes RBAC Review, and Scheduling Another Pen Test) are all completed or have accepted action plans.
- The new pen test confirms a tangible reduction in exploitable vulnerabilities.
- The security team and stakeholders sign off that the platform meets or exceeds baseline security expectations.

**Notes**
- Success is determined not just by closing existing gaps but also by creating processes (risk reviews, routine threat modelling) to maintain security momentum.
- This milestone aims to provide ongoing operational resilience, rather than a one-off “tick-box” compliance activity.
No due date
43/72 issues closed
59% complete, 29 open, 43 closed