Open
Description
Context
The GPU Control Plane repo now has a packaging and release path, but a real-cluster e2e test still needs a deployable AllocDB service.
Related docs:
- docs/real-cluster-e2e-roadmap.md
- docs/operator-runbook.md
Roadmap
- package a deployable service shape for the replicated node
- make persistence layout explicit for WAL, snapshots, and replica metadata
- expose cluster health clearly through readiness/liveness and metrics
- document startup, restart, isolate/heal, failover, and rejoin flows
- prove a minimal in-cluster smoke test covering submit/read and restart/rejoin safety
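The persistence and health items above could be pinned down along these lines. This is only a sketch under stated assumptions: the directory names (`wal`, `snapshots`, `replica/metadata.json`), the `ReplicaStatus` fields, and the liveness/readiness rules are hypothetical and are not taken from AllocDB itself.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PersistenceLayout:
    """Hypothetical on-disk layout for one replica (all names are assumptions)."""

    root: Path

    @property
    def wal_dir(self) -> Path:
        return self.root / "wal"

    @property
    def snapshot_dir(self) -> Path:
        return self.root / "snapshots"

    @property
    def replica_metadata_file(self) -> Path:
        return self.root / "replica" / "metadata.json"

    def ensure(self) -> None:
        """Create the directories so the layout is explicit before startup."""
        for directory in (
            self.wal_dir,
            self.snapshot_dir,
            self.replica_metadata_file.parent,
        ):
            directory.mkdir(parents=True, exist_ok=True)


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical health inputs; a real node would derive these itself."""

    process_running: bool
    storage_writable: bool
    recovered: bool  # assumed: WAL replay / snapshot load has finished
    isolated: bool   # assumed: operator-isolated or partitioned from peers


def is_live(status: ReplicaStatus) -> bool:
    # Liveness (assumed rule): the process is up and can write its storage.
    return status.process_running and status.storage_writable


def is_ready(status: ReplicaStatus) -> bool:
    # Readiness (assumed rule): live, fully recovered, and not isolated.
    return is_live(status) and status.recovered and not status.isolated
```

Keeping liveness and readiness as separate predicates means an isolated replica would report live but not ready, so it could be taken out of rotation without being restarted.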
Acceptance
- AllocDB can be deployed from documented manifests or overlays
- the deployed service survives restart with durable state intact
- a minimal smoke test passes against the deployed service
- the operational runbook matches the deployed shape
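As a rough illustration of the restart and smoke criteria, the check could have the following shape. The `submit`, `read`, and `restart` methods are hypothetical and do not reflect the actual AllocDB API. `FakeDurableService` is an in-memory stand-in, so the sketch runs without a cluster. Rejoin is not modeled here.

```python
from typing import Dict, List, Optional


class FakeDurableService:
    """In-memory stand-in (assumption): `_disk` survives restart, `_memory` does not."""

    def __init__(self) -> None:
        self._disk: Dict[str, str] = {}
        self._memory: Dict[str, str] = {}

    def submit(self, key: str, value: str) -> None:
        # Persist first, then update the in-memory view.
        self._disk[key] = value
        self._memory[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._memory.get(key)

    def restart(self) -> None:
        # Simulate a process restart: rebuild memory from durable state only.
        self._memory = dict(self._disk)


def run_smoke(client) -> List[str]:
    """Run submit/read, restart, then read again; return a list of failures."""
    failures: List[str] = []

    client.submit("smoke-key", "v1")
    if client.read("smoke-key") != "v1":
        failures.append("read after submit returned unexpected value")

    client.restart()
    if client.read("smoke-key") != "v1":
        failures.append("durable state lost across restart")

    return failures


if __name__ == "__main__":
    problems = run_smoke(FakeDurableService())
    print("smoke passed" if not problems else f"smoke failed: {problems}")
```

Against a real deployment, the fake would be replaced by a client for the deployed service, and `restart` would trigger an actual restart of the replica. The pass condition stays the same: an empty failure list.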