Commit cd47649

docs: outline gpu isolation hardening tasks
1 parent 60ea5e6 commit cd47649

1 file changed

+34
-0
lines changed
docs/gpu-isolation.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# GPU Isolation Considerations

Nimbus currently provisions GPU workloads via the Docker-based executor, but several hardening tasks remain before multi-tenant usage can be considered safe.

## Current State

- All GPUs are discovered via `nvidia-smi` and exposed to the executor through Docker’s `device_requests`. There is no MIG partitioning, per-job cgroup, or NVML scoping.
- Containers receive `CUDA_VISIBLE_DEVICES` limited to the allocated devices, but NVML queries can still reveal global device information.
- There is no admission control around MIG/MPS configuration; all jobs assume exclusive access.
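As an illustration of the device exposure described above, the sketch below builds the `DeviceRequests` fragment of a Docker Engine API `HostConfig` that pins a container to explicit GPU ids. The function name `build_gpu_host_config` is hypothetical and not taken from Nimbus; how the executor actually assembles its request is not shown in this document.

```python
from __future__ import annotations

from typing import Any


def build_gpu_host_config(device_ids: list[str]) -> dict[str, Any]:
    """Build a HostConfig fragment that requests specific GPUs (hypothetical helper).

    The field names (Driver, DeviceIDs, Capabilities) follow the Docker Engine
    API DeviceRequest object; everything else here is an assumption.
    """
    if not device_ids:
        raise ValueError("at least one GPU id must be allocated")
    if len(set(device_ids)) != len(device_ids):
        raise ValueError("duplicate GPU ids in allocation")
    return {
        "DeviceRequests": [
            {
                "Driver": "nvidia",
                # Explicit ids are used instead of a Count so the job only
                # receives the devices the scheduler allocated to it.
                "DeviceIDs": list(device_ids),
                "Capabilities": [["gpu"]],
            }
        ]
    }


if __name__ == "__main__":
    print(build_gpu_host_config(["0", "1"]))
```

A request like this only controls which devices the runtime attaches to the container. It does not by itself address the NVML visibility concern listed above.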

## Required Work

1. **MIG & MPS Strategy**
   - Decide on a sharing model (exclusive GPU vs MIG vs CUDA MPS).
   - Document supported configurations and required driver settings.

2. **Per-job Isolation**
   - Configure `nvidia-container-runtime` with per-job device cgroups.
   - Restrict `/dev/nvidia*` device nodes to allocated instances only.
   - Implement NVML filtering (e.g., via container runtime args or `LD_PRELOAD`) to prevent topology leakage.

3. **Scheduling & Labels**
   - Extend labels to express MIG profiles (e.g., `gpu:mig-1g.5gb`).
   - Ensure the scheduler prevents over-commit by tracking available partitions.

4. **Attestation & Monitoring**
   - Capture MIG/GDS state in agent telemetry.
   - Alert on unexpected configuration changes or high utilisation.

5. **Testing**
   - Add integration tests that run concurrent GPU jobs and verify isolation (no cross-job visibility).
   - Include red-team scenarios (bus probing, NVML enumeration, PCI scans) to verify controls.
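The label and over-commit items under Scheduling & Labels could be sketched roughly as follows. This assumes labels follow the `gpu:mig-1g.5gb` shape given in the list; `parse_mig_label` and `PartitionTracker` are hypothetical names, and the regular expression only covers the basic `<N>g.<M>gb` profile form, not every profile a driver may expose.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass, field

# Matches labels shaped like "gpu:mig-1g.5gb" (assumed label format).
_MIG_LABEL = re.compile(r"^gpu:mig-(\d+)g\.(\d+)gb$")


def parse_mig_label(label: str) -> tuple[int, int]:
    """Return (compute_slices, memory_gb) parsed from a MIG profile label."""
    match = _MIG_LABEL.match(label)
    if match is None:
        raise ValueError(f"not a MIG profile label: {label!r}")
    return int(match.group(1)), int(match.group(2))


@dataclass
class PartitionTracker:
    """Counts free partitions per label so the scheduler cannot over-commit."""

    capacity: Counter
    in_use: Counter = field(default_factory=Counter)

    def try_reserve(self, label: str) -> bool:
        """Reserve one partition; return False when none are left."""
        parse_mig_label(label)  # reject malformed labels early
        if self.in_use[label] >= self.capacity[label]:
            return False
        self.in_use[label] += 1
        return True

    def release(self, label: str) -> None:
        """Return a previously reserved partition to the pool."""
        if self.in_use[label] <= 0:
            raise ValueError(f"no reservation to release for {label!r}")
        self.in_use[label] -= 1


if __name__ == "__main__":
    tracker = PartitionTracker(capacity=Counter({"gpu:mig-1g.5gb": 2}))
    print([tracker.try_reserve("gpu:mig-1g.5gb") for _ in range(3)])
```

Tracking counts per label keeps the admission check to a single comparison. A real scheduler would likely also need to record which physical GPU each partition belongs to.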

Until these items are addressed, document that GPU workloads must run on dedicated hosts with no untrusted tenants.
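For the concurrent-job integration tests listed under Testing, one possible form of the "no cross-job visibility" assertion is sketched below. It assumes a test harness (not shown, and not part of this document) has already collected the device identifiers each job could enumerate from inside its container; `find_isolation_violations` and the identifier strings are hypothetical.

```python
from __future__ import annotations


def find_isolation_violations(
    allocated: dict[str, set[str]],
    observed: dict[str, set[str]],
) -> list[str]:
    """Return human-readable violations; an empty list means the check passed.

    allocated: job name -> device identifiers the scheduler assigned.
    observed:  job name -> device identifiers the job could actually see.
    """
    violations: list[str] = []

    # No two jobs may be allocated the same device or partition.
    jobs = sorted(allocated)
    for i, first in enumerate(jobs):
        for second in jobs[i + 1:]:
            shared = allocated[first] & allocated[second]
            if shared:
                violations.append(
                    f"{first} and {second} share allocation: {sorted(shared)}"
                )

    # A job must not observe anything outside its own allocation.
    for job in sorted(observed):
        leaked = observed[job] - allocated.get(job, set())
        if leaked:
            violations.append(
                f"{job} can see unallocated devices: {sorted(leaked)}"
            )

    return violations


if __name__ == "__main__":
    allocated = {"job-a": {"GPU-aaa"}, "job-b": {"GPU-bbb"}}
    observed = {"job-a": {"GPU-aaa", "GPU-bbb"}, "job-b": {"GPU-bbb"}}
    for line in find_isolation_violations(allocated, observed):
        print(line)
```

Returning a list of violations rather than a boolean lets a failing test report every leak at once, which may be useful when the red-team scenarios exercise several enumeration paths.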
