Releases: polarsignals/gpu-metrics-agent
v0.1.3
What's Changed
- Add ability to attach arbitrary grpc headers by @brancz in #30
- Turn some frequent DGX errors into warnings by @gnurizen in #29
- Fix artifacts workflow to use actual git tags by @metalmatze in #32
- Add ARM64 support to snapshot image push by @metalmatze in #33
Full Changelog: v0.1.2...v0.1.3
v0.1.2
v0.1.1
What's Changed
- Add LICENSE by @metalmatze in #25
- nvidia: Collect graphicsRunningProcesses by @metalmatze in #26
Full Changelog: v0.1.0...v0.1.1
v0.1.0
GPU Metrics Agent v0.1.0
We're excited to announce the first release of GPU Metrics Agent, a high-performance, purpose-built agent that collects comprehensive NVIDIA GPU metrics and exports them via the OpenTelemetry Arrow (OTAP) protocol.
🎯 Key Features
- Comprehensive GPU Metrics Collection
  - GPU and memory utilization (device and per-process level)
  - Power consumption and limits
  - Clock speeds (graphics, SM, memory, video)
  - Temperature monitoring
  - PCIe throughput (transmit/receive)
- Process-Level GPU Tracking
  - Monitor GPU usage per process with PID and process name
  - Essential for multi-tenant environments and resource allocation
- High-Performance Architecture
  - Concurrent metric collection with optimized intervals
  - Efficient OpenTelemetry Arrow protocol for data export
  - Minimal overhead on GPU operations
- Production-Ready
  - Graceful handling of process termination
  - Automatic multi-GPU detection and monitoring
  - Secure gRPC communication with TLS support
  - Configurable authentication via bearer tokens
📊 Metrics Overview
| Metric | Description | Interval |
|---|---|---|
| `gpu_utilization_percent` | GPU compute utilization | 5s |
| `gpu_utilization_memory_percent` | GPU memory utilization | 5s |
| `gpu_power_watt` | Current power consumption | 1s |
| `gpu_power_limit_watt` | Maximum power limit | 1s |
| `gpu_clock_hertz` | Clock speeds | 1s |
| `gpu_temperature_celsius` | GPU temperature | 1s |
| `gpu_pcie_throughput_*_bytes` | PCIe throughput | 100ms |
🚀 Use Cases
- GPU Resource Optimization: Identify underutilized resources and optimize workload scheduling
- Multi-tenant GPU Sharing: Track per-process GPU usage for fair resource allocation
- Performance Troubleshooting: Detect thermal throttling and bandwidth constraints
- Cost Management: Monitor power consumption for operational cost optimization
- ML/AI Workload Monitoring: Ensure optimal resource allocation during training and inference
📦 Installation
See our Kubernetes setup guide for detailed installation instructions.
🛠️ Requirements
- NVIDIA GPU with driver version 390.x or newer
- Linux operating system
- NVIDIA Management Library (NVML)
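As a rough pre-flight check for the driver requirement, the Go sketch below compares the major component of a driver version string against 390. The `driverSupported` function and the sample version strings are illustrative assumptions, not part of the agent; on a real host the version string could be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minDriverMajor is the minimum major driver version listed in the requirements.
const minDriverMajor = 390

// driverSupported reports whether a driver version string such as
// "535.129.03" has a major component of at least minDriverMajor.
func driverSupported(version string) (bool, error) {
	major, _, _ := strings.Cut(strings.TrimSpace(version), ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return false, fmt.Errorf("parse driver version %q: %w", version, err)
	}
	return n >= minDriverMajor, nil
}

func main() {
	// Illustrative version strings, not taken from the release notes.
	for _, v := range []string{"535.129.03", "390.48", "384.111"} {
		ok, err := driverSupported(v)
		fmt.Println(v, ok, err)
	}
}
```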
🙏 Acknowledgments
This release represents a focused effort to provide best-in-class GPU observability. Special thanks to all contributors who helped shape this initial release.
Full Changelog: https://github.com/polarsignals/gpu-metrics-agent/commits/v0.1.0