Define the minimum AllocDB work needed to support a real-cluster end-to-end test with gpu_control_plane.
- AllocDB issue: skel84/allocdb#99
- GPU Control Plane issue: skel84/gpu_control_plane#35
The first real-cluster e2e should prove this path:
- A workload enters Kueue and waits for quota.
- gpu_control_plane asks AllocDB to commit ownership.
- AllocDB commits ownership and exposes it through its deployed service.
- gpu_control_plane materializes the matching DRA objects.
- The workload progresses, runs, and eventually releases or revokes cleanly.
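
One way to make the path above checkable in a test harness is to record the steps as they are observed and assert that they arrive in order. The sketch below is illustrative only: the `Step` names and the `first_violation` helper are hypothetical and do not come from AllocDB or gpu_control_plane.

```python
from enum import Enum


class Step(Enum):
    """Hypothetical labels for the steps of the e2e path, in expected order."""
    QUEUED_IN_KUEUE = 1        # workload entered Kueue and waits for quota
    COMMIT_REQUESTED = 2       # gpu_control_plane asked AllocDB to commit ownership
    OWNERSHIP_COMMITTED = 3    # AllocDB committed and exposes the ownership
    DRA_MATERIALIZED = 4       # gpu_control_plane created the matching DRA objects
    RUNNING = 5                # workload progressed and is running
    RELEASED_OR_REVOKED = 6    # ownership was released or revoked cleanly


def first_violation(observed):
    """Return the index of the first missing, out-of-order, or extra step.

    Returns None when the observed steps match the expected path exactly.
    """
    expected = list(Step)
    for i, want in enumerate(expected):
        if i >= len(observed) or observed[i] is not want:
            return i
    # Every expected step matched; anything beyond that is an extra step.
    return None if len(observed) == len(expected) else len(expected)
```

A harness could collect `Step` values from whatever signals the cluster exposes and fail the test when `first_violation` returns an index, which also identifies where the path broke.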
The test should use:
- one Kubernetes cluster
- one deployed AllocDB service
- one deployed gpu_control_plane
- one deterministic GPU pool mapping
- one repeatable test workload
- provide a container image for the replicated node service
- provide a manifest or overlay that starts the service in-cluster
- keep the runtime shape aligned with the current replicated node implementation
- define the PVC or volume layout for WAL, snapshots, and replica metadata
- keep startup and restart paths tied to the durable workspace layout
- document the data retained across restart and rejoin
- add readiness and liveness probes that reflect real service state
- keep metrics and operator-visible health signals available from the deployed service
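
To illustrate what "probes that reflect real service state" could mean, the sketch below separates liveness (the process can still respond) from readiness (the node has recovered its durable state and holds a serving role). Everything here is an assumption for illustration: the `NodeStatus` fields, the role names, and the choice to treat backups as ready are hypothetical and not taken from the AllocDB implementation.

```python
from dataclasses import dataclass

# Hypothetical roles in which a node is allowed to report ready.
SERVING_ROLES = {"primary", "backup"}


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical snapshot of a replicated node's internal state."""
    responsive: bool          # the process can still answer requests
    recovery_complete: bool   # durable state (WAL, snapshots) has been replayed
    role: str                 # e.g. "primary", "backup", "recovering", "faulted"


def is_live(status: NodeStatus) -> bool:
    """Liveness: restart the container only if the process is stuck."""
    return status.responsive


def is_ready(status: NodeStatus) -> bool:
    """Readiness: accept traffic only after recovery, in a serving role."""
    return (
        is_live(status)
        and status.recovery_complete
        and status.role in SERVING_ROLES
    )
```

Keeping the two checks separate means a node that is still replaying its log stays live but not ready, so it is not restarted in the middle of recovery and does not receive traffic before it can serve it.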
- startup
- restart
- isolate and heal
- failover and rejoin
- controlled shutdown
- service starts from the packaged manifest
- client submit works against the deployed service
- primary reads work
- restart and rejoin preserve durable state
- failover stays fail-closed and does not violate ownership safety
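
The submit, read, and ownership-safety checks above could be expressed as a small smoke routine written against a client interface. This is a sketch under stated assumptions: `AllocClient`, its `submit` and `read_owner` methods, and the `InMemoryFake` stand-in are hypothetical and do not describe the real AllocDB client API. The fake exists only so the routine can be exercised without a cluster.

```python
from typing import Dict, Optional, Protocol


class AllocClient(Protocol):
    """Hypothetical client surface; the real AllocDB API may differ."""
    def submit(self, resource: str, holder: str) -> bool: ...
    def read_owner(self, resource: str) -> Optional[str]: ...


class InMemoryFake:
    """Stand-in used to exercise the smoke logic locally."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self.available = True  # set to False to simulate "no primary"

    def submit(self, resource: str, holder: str) -> bool:
        if not self.available:
            # Fail closed: refuse rather than risk an unsafe commit.
            raise ConnectionError("no primary available")
        current = self._owners.get(resource)
        if current is not None and current != holder:
            return False  # resource is already owned by someone else
        self._owners[resource] = holder
        return True

    def read_owner(self, resource: str) -> Optional[str]:
        if not self.available:
            raise ConnectionError("no primary available")
        return self._owners.get(resource)


def run_smoke(client: AllocClient, resource: str = "gpu-0",
              holder: str = "workload-a", rival: str = "workload-b") -> Dict[str, bool]:
    """Run the minimal submit/read checks and return one result per check."""
    return {
        "submit": client.submit(resource, holder),
        "read": client.read_owner(resource) == holder,
        "no_double_ownership": client.submit(resource, rival) is False,
        "owner_unchanged": client.read_owner(resource) == holder,
    }
```

Against a deployed service, the same `run_smoke` routine would be handed a client that talks to the in-cluster endpoint. The restart and failover checks would then repeat the `read` assertion after the pod is restarted or the primary is isolated.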
This repo is ready for the real-cluster e2e when:
- AllocDB can be deployed from a documented manifest or overlay
- the deployed service survives restart with durable state intact
- a minimal smoke proves submit/read behavior in-cluster
- the operational runbook matches the deployed shape