Add GPU driver info to provider inventory #428

@vertex451

Description

Goal
Extend the provider inventory operator (cluster/inventory.go) so that it reports which GPU driver is installed per node, starting with NVIDIA and keeping the design open for AMD/other drivers in the future.

Requirements

  • Detect GPU driver family and version on each GPU node (e.g. via nvidia-smi, device‑plugin labels, or similar), with a simple abstraction so new driver types can be plugged in later.
  • Add fields such as driver_family (nvidia, amd-rocm, etc.) and driver_version (535.129.03, 1.3.0, etc.) to the inventory model, and ensure they are returned by the inventory service APIs used by the cluster service.
  • Reuse the existing inventory poll loop; on detection failure, log a warning and keep the rest of inventory updates working.
  • Keep implementation lightweight (no heavy dependencies, no long‑running extra processes).

Deliverables

  • Code changes + unit tests for driver detection and inventory serialization.
  • Brief design note in the PR explaining detection method and how to extend for other drivers.

Proposed solution:
GPU driver detection (via k8s API + runtime)

Core idea
The inventory operator detects driver_family and driver_version directly from the Kubernetes API and the node's runtime, not from Akash‑specific config, so providers cannot spoof these values.

Detection flow

  1. Read node labels via Kubernetes API
    For each GPU node, use the k8s client to get the Node object and inspect node.ObjectMeta.Labels.
    Look for NVIDIA GPU Operator / GFD labels, e.g.: nvidia.com/cuda.driver-version.full or the triple nvidia.com/cuda.driver.major, nvidia.com/cuda.driver.minor, nvidia.com/cuda.driver.rev.
    If present and valid, set:
    driver_family = "nvidia"
    driver_version = "<full or composed value>" (e.g. 535.129.03).

  2. Fallback: runtime detection (no provider input)
    If labels are missing or clearly bogus, run a small helper (a DaemonSet, or the existing inventory pod with node access) that executes:
    nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
    and use its output as driver_version.

This reads from the actual driver on the node, not from any user‑supplied config.
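
The two-step flow above can be sketched as follows. The label keys are the ones published by NVIDIA GPU Feature Discovery / the GPU Operator; the function name and error handling are illustrative assumptions, not code from the repo:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// detectNvidiaDriverVersion tries GFD labels first and falls back to
// invoking nvidia-smi on the node (sketch, not the actual implementation).
func detectNvidiaDriverVersion(labels map[string]string) (string, error) {
	// 1. Preferred: the full driver-version label.
	if v := labels["nvidia.com/cuda.driver-version.full"]; v != "" {
		return v, nil
	}
	// 2. Compose from the major/minor/rev triple if present.
	major := labels["nvidia.com/cuda.driver.major"]
	minor := labels["nvidia.com/cuda.driver.minor"]
	rev := labels["nvidia.com/cuda.driver.rev"]
	if major != "" && minor != "" {
		v := major + "." + minor
		if rev != "" {
			v += "." + rev
		}
		return v, nil
	}
	// 3. Fallback: query the driver itself via nvidia-smi.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=driver_version", "--format=csv,noheader", "--id=0").Output()
	if err != nil {
		return "", fmt.Errorf("nvidia driver not detected: %w", err)
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	labels := map[string]string{
		"nvidia.com/cuda.driver.major": "535",
		"nvidia.com/cuda.driver.minor": "129",
		"nvidia.com/cuda.driver.rev":   "03",
	}
	v, _ := detectNvidiaDriverVersion(labels)
	fmt.Println(v)
}
```

In the real operator, the labels map would come from node.ObjectMeta.Labels fetched via the k8s client; per the requirement above, a detection error would be logged as a warning without aborting the rest of the inventory update.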

Extensibility
Implement detection behind a DriverDetector interface that the inventory operator calls per node. Register a detector for nvidia now, and later add amd-rocm, intel, etc., each also using k8s node labels plus runtime probes to avoid provider‑controlled spoofing.
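
A minimal sketch of that interface and registry follows; all names here are illustrative assumptions rather than existing provider-services identifiers:

```go
package main

import "fmt"

// DriverDetector is a hypothetical per-family detection hook the
// inventory poll loop would call for each GPU node.
type DriverDetector interface {
	Family() string // e.g. "nvidia", "amd-rocm"
	// Detect returns the driver version for the node, or an error
	// if this family's driver is not present.
	Detect(nodeLabels map[string]string) (string, error)
}

// nvidiaDetector reads GFD labels; the runtime (nvidia-smi) fallback
// is elided here for brevity.
type nvidiaDetector struct{}

func (nvidiaDetector) Family() string { return "nvidia" }

func (nvidiaDetector) Detect(labels map[string]string) (string, error) {
	if v := labels["nvidia.com/cuda.driver-version.full"]; v != "" {
		return v, nil
	}
	return "", fmt.Errorf("no nvidia driver labels")
}

// detectors is the registry; supporting amd-rocm later means
// appending one more implementation, with no changes to the loop.
var detectors = []DriverDetector{nvidiaDetector{}}

func main() {
	labels := map[string]string{"nvidia.com/cuda.driver-version.full": "535.129.03"}
	for _, d := range detectors {
		if v, err := d.Detect(labels); err == nil {
			fmt.Printf("%s %s\n", d.Family(), v)
		}
	}
}
```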

Metadata

Assignees

Labels

GPU, repo/provider (Akash provider-services repo issues)

Type

No type

Projects

Status

Backlog (not prioritized)

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions