Prepare deployable cluster shape for gpu_control_plane e2e #99

@skel84

Context

The GPU Control Plane repo now has a packaging and release path, but a real-cluster e2e test still needs a deployable AllocDB service.

Related docs:

  • docs/real-cluster-e2e-roadmap.md
  • docs/operator-runbook.md

Roadmap

  • package a deployable service shape for the replicated node
  • make persistence layout explicit for WAL, snapshots, and replica metadata
  • expose cluster health clearly through readiness/liveness and metrics
  • document startup, restart, isolate/heal, failover, and rejoin flows
  • prove a minimal in-cluster smoke test for submit/read and restart/rejoin safety
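
As an illustration of the "explicit persistence layout" and readiness items above, here is a minimal sketch. It assumes a single data root with one directory each for WAL, snapshots, and replica metadata. The class name `PersistenceLayout` and the directory names `wal`, `snapshots`, and `replica-meta` are hypothetical and not taken from the AllocDB repo.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class PersistenceLayout:
    """Hypothetical on-disk layout; all directory names are assumptions."""

    root: Path

    @property
    def wal_dir(self) -> Path:
        return self.root / "wal"

    @property
    def snapshot_dir(self) -> Path:
        return self.root / "snapshots"

    @property
    def replica_meta_dir(self) -> Path:
        return self.root / "replica-meta"

    def all_dirs(self) -> tuple:
        return (self.wal_dir, self.snapshot_dir, self.replica_meta_dir)

    def ensure(self) -> None:
        # Create every directory up front so startup fails early on bad mounts.
        for directory in self.all_dirs():
            directory.mkdir(parents=True, exist_ok=True)

    def is_ready(self) -> bool:
        # Readiness-style check: all three directories must exist.
        return all(directory.is_dir() for directory in self.all_dirs())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        layout = PersistenceLayout(Path(tmp) / "data")
        layout.ensure()
        print(layout.is_ready())
```

Keeping the three paths in one place means the manifests, the runbook, and a readiness probe can all refer to the same names. Whether the real node splits its state this way is an open question for this issue.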

Acceptance

  • AllocDB can be deployed from documented manifests or overlays
  • the deployed service survives restart with durable state intact
  • a minimal smoke test passes against the deployed service
  • the operational runbook matches the deployed shape
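
The restart and smoke criteria above could be checked by a test shaped roughly like the sketch below. `FakeNode` is a hypothetical stand-in that appends writes to a local log file and replays it on start. It is not the AllocDB API, and a real smoke test would submit and read through the deployed service instead. The key `resource-1` and value `reserved` are placeholder data.

```python
import json
import tempfile
from pathlib import Path


class FakeNode:
    """Hypothetical stand-in for a deployed node, backed by an append-only log."""

    def __init__(self, log_path: Path):
        self.log_path = log_path
        self.state = {}
        # Replay the log on startup, as a restarted node would recover state.
        if log_path.exists():
            for line in log_path.read_text().splitlines():
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def submit(self, key: str, value: str) -> None:
        # Append to the log before updating in-memory state.
        with self.log_path.open("a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.state[key] = value

    def read(self, key: str):
        return self.state.get(key)


def smoke(log_path: Path) -> bool:
    """Submit, read back, simulate a restart, and read again."""
    node = FakeNode(log_path)
    node.submit("resource-1", "reserved")
    if node.read("resource-1") != "reserved":
        return False
    restarted = FakeNode(log_path)  # a new instance stands in for a restart
    return restarted.read("resource-1") == "reserved"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(smoke(Path(tmp) / "node.log"))
```

The sketch does not fsync and runs a single node, so it only shows the submit, restart, read shape of the check. It does not cover failover or rejoin.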
