Metadata Collector

Overview

The Metadata Collector gathers comprehensive GPU and NVSwitch topology information from nodes and writes it to a local file. This metadata is consumed by health monitors (like Syslog Health Monitor) to correlate errors with specific hardware components.

Think of it as a hardware inventory scanner - it catalogs all GPUs, their connections, and NVSwitch fabric topology, making this information available for error analysis and troubleshooting.

In addition to persisting the GPU and NVSwitch topology information from nodes in a local file, the Metadata Collector will also expose the pod-to-GPU mapping as an annotation on each pod requesting GPUs. This allows components running externally to the node to discover this device mapping through the Kubernetes API.

Why Do You Need This?

Health monitors need detailed hardware information to create accurate health events:

Error correlation: Map PCI addresses and NVLink IDs to specific GPUs
Topology awareness: Understand GPU interconnect fabric for SXID error analysis
Hardware identification: Track GPU UUIDs, serial numbers, and device names
NVSwitch mapping: Identify which NVSwitches connect which GPUs

Without metadata collection, health monitors can only report generic errors without knowing which specific GPU or NVLink is affected. Additionally, the node drainer module needs the pod-to-GPU mapping to determine which set of pods is impacted by a given health event:

Partial drains: For GPU faults requiring component resets, the node drainer module will reference this mapping to only drain pods leveraging that GPU

How It Works

GPU and NVSwitch topology information collection:

Initializes NVML (NVIDIA Management Library)
Queries GPU information (UUID, PCI address, serial number, device name)
Parses NVLink topology from nvidia-smi
Builds NVSwitch fabric map
Writes comprehensive metadata to JSON file

The JSON file persists on the node and is read by health monitors via a shared volume.

GPU-to-pod mapping annotation:

To discover all pods running on the given node, this component will call the Kubelet /pods HTTPS endpoint.
To discover the GPU devices allocated to each pod, this component will leverage the Kubelet PodResourcesLister gRPC service.
If any pod has a change in its GPU device allocation, we will update the tracking annotation on the pod object.
The Metadata Collector will run this logic in a loop on a fixed threshold to continually update the mapping for new and existing pods.

Configuration

Configure the Metadata Collector through Helm values:

metadata-collector:
  enabled: true
  
  # Runtime class for GPU access (omit for CRI-O environments)
  runtimeClassName: "nvidia"

Configuration Options

Runtime Class: Specify runtime class name for GPU access (typically "nvidia" for containerd). For CRI-O environments, do not set this field.
Output Path: Path where metadata JSON is written (default: /var/lib/nvsentinel/gpu_metadata.json)

What It Collects

The metadata collector gathers:

GPU Information

GPU UUID (unique identifier)
PCI address
Serial number
Device name/model
GPU index

Pod Information

GPU UUIDs allocated to each pod

NVLink Topology

NVLink connections between GPUs
Remote GPU endpoints for each link
Link status and capability
Peer-to-peer connectivity map

NVSwitch Information

NVSwitch PCI addresses
Which GPUs connect through each switch
Fabric topology

System Information

Node name (hostname)
Chassis serial number (if available)
Timestamp of collection

Key Features

NVML-Based Collection

Uses NVIDIA Management Library for reliable, direct hardware queries without external dependencies.

Topology Parsing

Parses nvidia-smi output to build complete NVLink topology map showing GPU interconnections.

Shared Volume

Writes metadata to shared volume accessible by health monitor sidecars for error correlation.

JSON Output

Structured JSON format for easy parsing and consumption by health monitors.

Kubernetes API

The pod-to-GPU mapping is exposed on pods objects as an annotation which can be consumed by external components.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metadata Collector

Overview

Why Do You Need This?

How It Works

Configuration

Configuration Options

What It Collects

GPU Information

Pod Information

NVLink Topology

NVSwitch Information

System Information

Key Features

NVML-Based Collection

Topology Parsing

Shared Volume

JSON Output

Kubernetes API

FilesExpand file tree

metadata-collector.md

Latest commit

History

metadata-collector.md

File metadata and controls

Metadata Collector

Overview

Why Do You Need This?

How It Works

Configuration

Configuration Options

What It Collects

GPU Information

Pod Information

NVLink Topology

NVSwitch Information

System Information

Key Features

NVML-Based Collection

Topology Parsing

Shared Volume

JSON Output

Kubernetes API