feat: ReaperOverlay CRD for overlay lifecycle management #47

Merged
miguelgila merged 14 commits into main from fix/overlay-cleanup-crd
Mar 21, 2026

@miguelgila (Owner) commented Mar 19, 2026

Summary

  • Adds ReaperOverlay CRD — a PVC-like resource that decouples overlay lifecycle from pod lifecycle
  • Enables Kubernetes-native overlay creation, inspection, reset, and deletion without requiring direct node access
  • ReaperPods with overlayName now block (stay Pending) until a matching ReaperOverlay exists and is Ready, like Pods with unbound PVCs

Changes

CRD (src/crds/reaper_overlay.rs)

  • ReaperOverlaySpec: resetPolicy (Manual/OnFailure/OnDelete), resetGeneration (monotonic counter)
  • ReaperOverlayStatus: phase, observedResetGeneration, per-node status array, message
  • Generated CRD YAML in deploy/kubernetes/crds/ and deploy/helm/reaper/crds/
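
The field list above can be pictured with a simplified, std-only sketch. This is not the code in src/crds/reaper_overlay.rs: the integer and string types, the NodeStatus fields (node_name, ready), and the reset_pending helper are assumptions made for illustration, and the real types presumably carry kube/serde derives that are left out here. The sketch assumes a reset is outstanding whenever resetGeneration is ahead of observedResetGeneration.

```rust
/// Reset policy variants as named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResetPolicy {
    Manual,
    OnFailure,
    OnDelete,
}

/// Simplified mirror of ReaperOverlaySpec (field types are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReaperOverlaySpec {
    pub reset_policy: ResetPolicy,
    /// Monotonic counter; a user bumps it to request a reset.
    pub reset_generation: u64,
}

/// Hypothetical per-node entry; the real fields are not listed in the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeStatus {
    pub node_name: String,
    pub ready: bool,
}

/// Simplified mirror of ReaperOverlayStatus (field types are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReaperOverlayStatus {
    pub phase: String,
    /// Last resetGeneration the controller has acted on (assumed semantics).
    pub observed_reset_generation: u64,
    pub nodes: Vec<NodeStatus>,
    pub message: Option<String>,
}

/// Hypothetical helper: true while the spec requests a reset that the
/// status has not recorded yet.
pub fn reset_pending(spec: &ReaperOverlaySpec, status: &ReaperOverlayStatus) -> bool {
    spec.reset_generation > status.observed_reset_generation
}
```

Under these assumptions, patching resetGeneration from 0 to 1 makes reset_pending return true until the controller writes 1 back into observedResetGeneration.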

Controller (src/bin/reaper-controller/)

  • New overlay_reconciler.rs: watches ReaperOverlay objects, manages finalizer for cleanup, handles reset via generation counter, discovers and calls agents
  • Modified reconciler.rs: PVC-like blocking — checks for Ready ReaperOverlay before creating Pods
  • Modified main.rs: runs both ReaperPod and ReaperOverlay controllers concurrently
  • Added reqwest dependency for controller-to-agent HTTP communication
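
The PVC-like blocking check in reconciler.rs can be sketched as a pure decision function. This is a minimal illustration, not the reconciler itself: PodAction and decide are hypothetical names, the phase is modeled as a plain string, and the sketch assumes that a ReaperPod without overlayName is never blocked.

```rust
/// What the ReaperPod reconciler does next (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum PodAction {
    /// Go ahead and create the backing Pod.
    CreatePod,
    /// Keep the ReaperPod Pending, with a human-readable reason.
    StayPending(String),
}

/// `overlay_name` is the ReaperPod's spec.overlayName, if set.
/// `overlay_phase` is the phase of the matching ReaperOverlay,
/// or None when no such object exists.
pub fn decide(overlay_name: Option<&str>, overlay_phase: Option<&str>) -> PodAction {
    match (overlay_name, overlay_phase) {
        // No overlay requested: nothing to wait for (assumption).
        (None, _) => PodAction::CreatePod,
        // Overlay exists and is Ready: unblock.
        (Some(_), Some("Ready")) => PodAction::CreatePod,
        // Overlay exists but is in some other phase: keep waiting.
        (Some(name), Some(phase)) => PodAction::StayPending(format!(
            "ReaperOverlay {} is {}, not Ready",
            name, phase
        )),
        // Overlay missing entirely: keep waiting, like an unbound PVC.
        (Some(name), None) => {
            PodAction::StayPending(format!("ReaperOverlay {} not found", name))
        }
    }
}
```

Keeping the decision separate from the API calls makes it easy to unit-test; the real reconciler would look up the ReaperOverlay first and then act on the result.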

Agent (src/bin/reaper-agent/)

  • New overlay_api.rs: list_overlays(), get_overlay(), delete_overlay() functions
  • New HTTP routes: GET /api/v1/overlays, GET/DELETE /api/v1/overlays/{namespace}/{name}
  • Public wrappers in overlay_gc.rs for reusing existing cleanup logic
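
The three agent routes above can be illustrated with a small std-only matcher. The agent's actual HTTP framework and handler wiring are not shown in this PR description, so OverlayRoute and match_route are hypothetical names; only the methods and paths come from the list above.

```rust
/// The overlay routes listed in the PR, as a hypothetical enum.
#[derive(Debug, PartialEq, Eq)]
pub enum OverlayRoute<'a> {
    /// GET /api/v1/overlays
    List,
    /// GET /api/v1/overlays/{namespace}/{name}
    Get { namespace: &'a str, name: &'a str },
    /// DELETE /api/v1/overlays/{namespace}/{name}
    Delete { namespace: &'a str, name: &'a str },
}

/// Map an HTTP method and path to a route, or None if nothing matches.
pub fn match_route<'a>(method: &str, path: &'a str) -> Option<OverlayRoute<'a>> {
    let rest = path.strip_prefix("/api/v1/overlays")?;
    // Reject paths such as /api/v1/overlaysfoo that only share the prefix.
    if !(rest.is_empty() || rest.starts_with('/')) {
        return None;
    }
    let segments: Vec<&str> = rest.split('/').filter(|s| !s.is_empty()).collect();
    match (method, segments.as_slice()) {
        ("GET", []) => Some(OverlayRoute::List),
        ("GET", [namespace, name]) => Some(OverlayRoute::Get {
            namespace: *namespace,
            name: *name,
        }),
        ("DELETE", [namespace, name]) => Some(OverlayRoute::Delete {
            namespace: *namespace,
            name: *name,
        }),
        _ => None,
    }
}
```

In this sketch, match_route("DELETE", "/api/v1/overlays/default/slurm") yields the Delete variant, which would be dispatched to something like delete_overlay().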

Helm & RBAC

  • Controller RBAC updated with reaperoverlays and reaperoverlays/status permissions
  • Static manifest (deploy/kubernetes/reaper-controller.yaml) updated
  • CRD generation script updated to produce both CRDs

Usage

# 1. Create overlay (like creating a PVC)
apiVersion: reaper.io/v1alpha1
kind: ReaperOverlay
metadata:
  name: slurm
spec:
  resetPolicy: Manual

---
# 2. Reference it in a ReaperPod (blocks until overlay is Ready)
apiVersion: reaper.io/v1alpha1
kind: ReaperPod
metadata:
  name: install-slurm
spec:
  overlayName: slurm
  command: ["bash", "-c", "apt-get update && apt-get install -y slurm-wlm"]

---
# 3. Reset after corruption
# kubectl patch reaperoverlays slurm --type merge -p '{"spec":{"resetGeneration":1}}'

Related

Test plan

  • cargo clippy --workspace --all-targets -- -D warnings passes
  • cargo clippy --target x86_64-unknown-linux-gnu --all-targets -- -D warnings passes
  • cargo test --workspace — all 163 tests pass
  • CRD generation script produces both CRDs correctly
  • CI integration tests (Kind cluster: ReaperOverlay lifecycle, PVC-like blocking, reset)

🤖 Generated with Claude Code

miguelgila and others added 2 commits March 19, 2026 18:17
PVC-like CRD that decouples overlay lifecycle from pod lifecycle.
Covers CRD design, controller reconciliation, agent reset endpoints,
PVC-like blocking for ReaperPods, and integration test strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a PVC-like CRD that decouples overlay lifecycle from pod lifecycle,
enabling Kubernetes-native overlay creation, reset, and deletion without
requiring direct node access.

Changes:
- ReaperOverlay CRD types (spec: resetPolicy, resetGeneration; status:
  phase, observedResetGeneration, per-node status)
- Overlay controller reconciler with finalizer-based cleanup and
  reset via generation counter
- PVC-like blocking: ReaperPods with overlayName stay Pending until
  a matching ReaperOverlay is Ready
- Agent HTTP endpoints: GET/DELETE /api/v1/overlays/{namespace}/{name}
  for overlay inspection and reset
- RBAC updates for controller to manage reaperoverlays resources
- CRD generation script updated for both CRDs
- reqwest added as controller dependency for agent communication

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@miguelgila miguelgila changed the title docs: ReaperOverlay CRD implementation plan feat: ReaperOverlay CRD for overlay lifecycle management Mar 19, 2026
codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.01%. Comparing base (59ee7c2) to head (187f8cf).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #47   +/-   ##
=======================================
  Coverage   85.01%   85.01%           
=======================================
  Files           6        6           
  Lines         307      307           
=======================================
  Hits          261      261           
  Misses         46       46           

miguelgila and others added 12 commits March 19, 2026 22:01
- Unit tests for ReaperOverlaySpec/Status/NodeStatus serialization and
  defaults (12 tests in src/crds/reaper_overlay.rs)
- Unit tests for overlay_api list/get functions (7 tests in
  src/bin/reaper-agent/overlay_api.rs)
- Kind integration tests (Phase 4c): CRD install, status, kubectl
  columns, PVC-like blocking, reset generation, delete cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Controller sets ReaperOverlay phase to Ready immediately on creation
  (overlays are lazily created by runtime, not pre-provisioned)
- Remove unused update_overlay_status and check_agent_overlay_exists
- Fix annotations test: create ReaperOverlay before ReaperPod with
  overlayName (PVC-like blocking requires matching overlay)
- Fix shell arithmetic error in delete test (tr -d '[:space:]')
- Defer controller cleanup to after Phase 4c so overlay tests have
  a running controller

All 14 Kind integration tests pass (Phase 4b + 4c).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add shortName "rovl" to ReaperOverlay (kubectl get rovl)
- Add shortName "rpod" to ReaperPod (kubectl get rpod)
- Regenerate CRD YAML with shortNames
- Update docs/book CRDs reference page with ReaperOverlay section
- Rename SUMMARY.md entry to "Custom Resource Definitions"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add shortName verification to kubectl column tests:
- kubectl get rpod returns expected columns
- kubectl get rovl returns expected columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add slurm-overlay.yaml (ReaperOverlay CRD for the shared slurm overlay)
- Update README with overlay creation step and troubleshooting section
  for resetting corrupt overlays via kubectl patch rovl
- Update slurmd-daemonset.yaml comment (issue #41 is fixed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All example setup scripts now print `export KUBECONFIG=...` in their
summary output so users know how to connect to the cluster from a
fresh shell.

Also:
- Slurm setup script installs ReaperOverlay CRD (idempotent) to
  support the slurm-overlay.yaml resource
- Slurm README lists ReaperOverlay CRD as prerequisite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace undefined if_log with LOG_FILE redirect. This ensures the
ReaperOverlay CRD is installed even when using --release with a
published Helm chart that predates the CRD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The static slurm-config.yaml with placeholder values (CPU_COUNT,
COMPUTE_NODE_LIST) was being applied by `kubectl apply -f examples/10-slurm-hpc/`,
overwriting the correctly generated ConfigMap from setup.sh.

- Rename slurm-config.yaml → slurm-config.yaml.template
- Update setup.sh summary to list individual files instead of directory
- Update README deploy instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use Kubernetes Job instead of Pod for the test
- Use sbatch --wait so the job blocks until completion
- Print job stdout/stderr output directly in the logs
- Show clear PASSED/FAILED verdict
- Update README and setup summary with job/test-slurm-job log command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sbatch writes output to the compute node's filesystem, not the
submitter's. srun streams output back directly so job results
are visible in kubectl logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
srun hangs because it needs bidirectional communication between the
submitter and compute node. Use sbatch --parsable + scontrol polling
instead, which works reliably across node boundaries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
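
The sbatch --parsable plus scontrol polling approach comes down to two parsing steps, sketched here in Rust for illustration (the test in this PR is a shell script inside a Kubernetes Job, and its exact commands are not shown). The function names are hypothetical. The sketch relies on sbatch --parsable printing the job id, optionally followed by ";" and a cluster name, and on scontrol show job output containing a JobState=... field. The terminal-state list is an illustrative subset, not a complete one.

```rust
/// Extract the numeric job id from `sbatch --parsable` output,
/// which is "<jobid>" or "<jobid>;<cluster>".
pub fn parse_sbatch_parsable(output: &str) -> Option<u64> {
    output.trim().split(';').next()?.parse().ok()
}

/// Extract the JobState value from `scontrol show job <id>` output.
pub fn parse_job_state(scontrol_output: &str) -> Option<&str> {
    scontrol_output
        .split_whitespace()
        .find_map(|field| field.strip_prefix("JobState="))
}

/// True once polling can stop (illustrative subset of Slurm job states).
pub fn is_terminal(state: &str) -> bool {
    matches!(
        state,
        "COMPLETED" | "FAILED" | "CANCELLED" | "TIMEOUT" | "NODE_FAIL" | "OUT_OF_MEMORY"
    )
}
```

A polling loop would call scontrol show job with the parsed id, pass the output to parse_job_state, and sleep between attempts until is_terminal returns true.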
@miguelgila miguelgila merged commit b890734 into main Mar 21, 2026
20 checks passed
@miguelgila miguelgila deleted the fix/overlay-cleanup-crd branch March 21, 2026 15:39
Development

Successfully merging this pull request may close these issues.

No mechanism to reset/clean overlay namespaces without direct node access
