Releases: polarsignals/gpu-metrics-agent

v0.1.3

08 Jan 10:12
02fd691

What's Changed

  • Add ability to attach arbitrary grpc headers by @brancz in #30
  • Turn some frequent DGX errors into warnings by @gnurizen in #29
  • Fix artifacts workflow to use actual git tags by @metalmatze in #32
  • Add ARM64 support to snapshot image push by @metalmatze in #33

New Contributors

Full Changelog: v0.1.2...v0.1.3

v0.1.2

17 Nov 18:02
5a67814

What's Changed

New Contributors

Full Changelog: v0.1.1...v0.1.2

v0.1.1

09 Jul 13:15
36da4c6

What's Changed

Full Changelog: v0.1.0...v0.1.1

v0.1.0

18 Jun 12:45

GPU Metrics Agent v0.1.0

We're excited to announce the first release of GPU Metrics Agent, a high-performance, purpose-built agent that collects comprehensive NVIDIA GPU metrics and exports them via the OpenTelemetry Arrow (OTAP) protocol.

🎯 Key Features

  • Comprehensive GPU Metrics Collection

    • GPU and memory utilization (device and per-process level)
    • Power consumption and limits
    • Clock speeds (graphics, SM, memory, video)
    • Temperature monitoring
    • PCIe throughput (transmit/receive)
  • Process-Level GPU Tracking

    • Monitor GPU usage per process with PID and process name
    • Essential for multi-tenant environments and resource allocation
  • High-Performance Architecture

    • Concurrent metric collection with optimized intervals
    • Efficient OpenTelemetry Arrow protocol for data export
    • Minimal overhead on GPU operations
  • Production-Ready

    • Graceful handling of process termination
    • Automatic multi-GPU detection and monitoring
    • Secure gRPC communication with TLS support
    • Configurable authentication via bearer tokens

📊 Metrics Overview

| Metric                         | Description               | Interval |
|--------------------------------|---------------------------|----------|
| gpu_utilization_percent        | GPU compute utilization   | 5s       |
| gpu_utilization_memory_percent | GPU memory utilization    | 5s       |
| gpu_power_watt                 | Current power consumption | 1s       |
| gpu_power_limit_watt           | Maximum power limit       | 1s       |
| gpu_clock_hertz                | Clock speeds              | 1s       |
| gpu_temperature_celsius        | GPU temperature           | 1s       |
| gpu_pcie_throughput_*_bytes    | PCIe throughput           | 100ms    |

🚀 Use Cases

  • GPU Resource Optimization: Identify underutilized resources and optimize workload scheduling
  • Multi-tenant GPU Sharing: Track per-process GPU usage for fair resource allocation
  • Performance Troubleshooting: Detect thermal throttling and bandwidth constraints
  • Cost Management: Monitor power consumption for operational cost optimization
  • ML/AI Workload Monitoring: Ensure optimal resource allocation during training and inference
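
As a rough illustration of the cost-management use case, the sketch below turns a series of `gpu_power_watt` samples into energy and an estimated cost. It assumes evenly spaced samples that each hold for the full interval. The function name `energyKWh`, the constant 300 W load, and the electricity price are made up for the example and do not come from the agent.

```go
package main

import "fmt"

// energyKWh integrates evenly spaced power samples (in watts) into
// kilowatt-hours, assuming each sample holds for intervalSeconds.
func energyKWh(wattSamples []float64, intervalSeconds float64) float64 {
	var joules float64
	for _, w := range wattSamples {
		joules += w * intervalSeconds // watts * seconds = joules
	}
	return joules / 3.6e6 // 1 kWh = 3.6e6 J
}

func main() {
	// One hour of hypothetical 1s samples at a constant 300 W.
	samples := make([]float64, 3600)
	for i := range samples {
		samples[i] = 300
	}

	kwh := energyKWh(samples, 1)
	pricePerKWh := 0.25 // hypothetical electricity price per kWh
	fmt.Printf("energy: %.3f kWh, estimated cost: %.4f\n", kwh, kwh*pricePerKWh)
}
```

In practice the samples would come from whatever backend stores the exported metrics, and gaps between samples would need handling that this sketch omits.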

📦 Installation

See our Kubernetes setup guide for detailed installation instructions.

🛠️ Requirements

  • NVIDIA GPU with driver version 390.x or newer
  • Linux operating system
  • NVIDIA Management Library (NVML)
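
The driver requirement can be checked ahead of deployment with a small helper like the one below. This is a sketch, not code from the agent: `driverSupported` and `minDriverMajor` are hypothetical names, and the check assumes the driver version string (however it is obtained, for example from NVML or `nvidia-smi`) begins with the major version followed by a dot.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minDriverMajor reflects the "390.x or newer" requirement stated above.
const minDriverMajor = 390

// driverSupported (hypothetical helper) reports whether a driver version
// string such as "535.104.05" meets the minimum major version.
func driverSupported(version string) (bool, error) {
	// Take everything before the first dot as the major version.
	major, _, _ := strings.Cut(strings.TrimSpace(version), ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return false, fmt.Errorf("unparseable driver version %q: %w", version, err)
	}
	return n >= minDriverMajor, nil
}

func main() {
	// Example inputs only; real values depend on the installed driver.
	for _, v := range []string{"535.104.05", "390.48", "384.81"} {
		ok, err := driverSupported(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("driver %s supported: %t\n", v, ok)
	}
}
```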

🙏 Acknowledgments

This release represents a focused effort to provide best-in-class GPU observability. Special thanks to all contributors who helped shape this initial release.


Full Changelog: https://github.com/polarsignals/gpu-metrics-agent/commits/v0.1.0