This is the canonical status document for ongoing work. It consolidates historical handoff/checklist/backlog files into one place.
- Core source tree audit completed.
- NVML process-info struct mismatch fixed and validated.
- Cgroup filtering behavior tightened for SLURM isolation.
- dlsym resolution and lock-recovery hardening implemented.
- NVML caching and process-query reliability improvements merged.
- `nvidia-smi` wrapper behavior updated and stress-validated.
- Finalize Warewulf persistence for the `nvidia-smi` wrapper
  - A runtime copy is active on the test node, but persistence across reprovision/reboot still needs to be finalized.
- Stabilize `.2` SM acceptance gates
  - Keep time-sliced average behavior and gate with statistical windows (e.g., median/p90 bands), not single-sample checks.
- Harden direct-suite artifact capture
  - Improve retries/capture for transient `no softmig log` outcomes so artifact misses are not misclassified as core hook failures.
- Run and evaluate overnight soak/stress cycles
  - Continue validating `.2` variance, direct-suite stability, and wrapper leak regression behavior over long windows.
- Calibrate `suite_soak.sh` thresholds with real data
  - Tune `shrreg_delta` and FD stability thresholds based on overnight runs.
- Continue low-risk dead-code/header cleanup
  - Keep as non-blocking unless correctness paths are touched.
- `nvsmi` leak stress remains `others=0` under pressure.
- `.2` SM metrics stay within agreed statistical tolerance bands.
- OOM enforcement remains PASS across CUDA versions and slices.
- Direct-linked hook registration remains stable.
- Wrapper persistence verified after node reboot/reprovision.
- High-level release history: `CHANGES.md`
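
The statistical-window gating proposed for the `.2` SM acceptance gates could look roughly like the sketch below. It is illustrative only: the function name `gate_sm_window`, the tolerance parameters, and the reading of `.2` as a target SM utilization fraction of 0.2 are assumptions, not part of the existing suite. The agreed tolerance bands still need to come from real overnight data.

```python
from statistics import median, quantiles


def gate_sm_window(samples, target, median_tol, p90_tol):
    """Gate a window of SM utilization samples against statistical bands.

    Hypothetical helper: passes when the window median lies within
    +/- median_tol of target AND the 90th percentile does not exceed
    target + p90_tol. A single outlier sample does not fail the gate.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples for a windowed gate")
    med = median(samples)
    # quantiles(n=10) returns 9 cut points; index 8 is the 90th percentile.
    p90 = quantiles(samples, n=10, method="inclusive")[8]
    return abs(med - target) <= median_tol and p90 <= target + p90_tol


# Example window (made-up values): one transient spike at 0.35.
window = [0.18, 0.19, 0.20, 0.21, 0.22, 0.20, 0.19, 0.21, 0.20, 0.35]
passed = gate_sm_window(window, target=0.20, median_tol=0.02, p90_tol=0.05)
```

In this made-up window a single-sample check would fail on the 0.35 spike, while the windowed gate passes because the median and p90 both stay inside their bands. A window that is consistently off target (for example, all samples at 0.30) still fails on the median band.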