Description
Goal
Extend the provider inventory operator (cluster/inventory.go) so that it reports which GPU driver is installed per node, starting with NVIDIA and keeping the design open for AMD/other drivers in the future.
Requirements
- Detect GPU driver family and version on each GPU node (e.g. via nvidia-smi, device‑plugin labels, or similar), with a simple abstraction so new driver types can be plugged in later.
- Add fields like driver_family (nvidia, amd-rocm, etc.) and driver_version (535.129.03, 1.3.0, etc.) to the inventory model and ensure they are returned by the inventory service APIs used by the cluster service.
- Reuse the existing inventory poll loop; on detection failure, log a warning and keep the rest of the inventory updates working.
- Keep implementation lightweight (no heavy dependencies, no long‑running extra processes).
Deliverables
- Code changes + unit tests for driver detection and inventory serialization.
- Brief design note in the PR explaining detection method and how to extend for other drivers.
Proposed solution:
GPU driver detection (via k8s API + runtime)
Core idea
The inventory operator detects driver_family and driver_version directly from the Kubernetes API and the node runtime, not from Akash-specific config, so providers cannot spoof the values.
Detection flow
- Read node labels via the Kubernetes API.
  For each GPU node, use the k8s client to get the Node object and inspect node.ObjectMeta.Labels.
  Look for NVIDIA GPU Operator / GFD labels, e.g. nvidia.com/cuda.driver-version.full or the triple nvidia.com/cuda.driver.major, nvidia.com/cuda.driver.minor, nvidia.com/cuda.driver.rev.
  If present and valid, set:
  - driver_family = "nvidia"
  - driver_version = "<full or composed value>" (e.g. 535.129.03)
- Fallback: runtime detection (no provider input).
  If labels are missing or clearly bogus, run a small helper (a DaemonSet, or the existing inventory pod with node access) that executes nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 and uses that output as driver_version.
  This reads from the actual driver on the node, not from any user-supplied config.
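The two steps above could be sketched as follows. This is a hedged illustration: the helper names are made up, the label keys are the ones gpu-feature-discovery publishes, and the node labels are passed in as a plain map rather than fetched via client-go to keep the sketch self-contained:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// nvidiaVersionFromLabels implements step 1: read GFD / GPU Operator
// labels from the Node object (represented here as a plain map).
// Prefers the full version label, else composes major.minor.rev.
func nvidiaVersionFromLabels(labels map[string]string) (string, bool) {
	if full := labels["nvidia.com/cuda.driver-version.full"]; full != "" {
		return full, true
	}
	major := labels["nvidia.com/cuda.driver.major"]
	minor := labels["nvidia.com/cuda.driver.minor"]
	rev := labels["nvidia.com/cuda.driver.rev"]
	if major != "" && minor != "" && rev != "" {
		return fmt.Sprintf("%s.%s.%s", major, minor, rev), true
	}
	return "", false
}

// nvidiaVersionFromSMI implements step 2: query the driver itself.
// It returns an error instead of aborting, so the poll loop can log
// a warning and keep the rest of the inventory update working.
func nvidiaVersionFromSMI() (string, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=driver_version", "--format=csv,noheader", "--id=0").Output()
	if err != nil {
		return "", fmt.Errorf("nvidia-smi probe failed: %w", err)
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	labels := map[string]string{
		"nvidia.com/cuda.driver.major": "535",
		"nvidia.com/cuda.driver.minor": "129",
		"nvidia.com/cuda.driver.rev":   "03",
	}
	if v, ok := nvidiaVersionFromLabels(labels); ok {
		fmt.Println("driver_version:", v) // driver_version: 535.129.03
		return
	}
	if v, err := nvidiaVersionFromSMI(); err == nil {
		fmt.Println("driver_version (smi):", v)
	} else {
		fmt.Println("warning:", err) // log and continue; inventory stays up to date
	}
}
```

Trying the labels first keeps the common path free of exec calls; the nvidia-smi probe only runs when the labels are absent or implausible.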
Extensibility
Implement this through a DriverDetector interface that the inventory operator calls per node. Register a detector for nvidia now; detectors for amd-rocm, intel, etc. can be added later, each combining k8s node labels with runtime probes so values stay out of provider-controlled config.