GPU Metrics Agent v0.1.0

We're excited to announce the first release of GPU Metrics Agent, a high-performance, purpose-built agent that collects comprehensive NVIDIA GPU metrics and exports them via the OpenTelemetry Arrow (OTAP) protocol.

🎯 Key Features

Comprehensive GPU Metrics Collection
- GPU and memory utilization (device and per-process level)
- Power consumption and limits
- Clock speeds (graphics, SM, memory, video)
- Temperature monitoring
- PCIe throughput (transmit/receive)

Process-Level GPU Tracking
- Monitor GPU usage per process with PID and process name
- Essential for multi-tenant environments and resource allocation

High-Performance Architecture
- Concurrent metric collection with optimized intervals
- Efficient OpenTelemetry Arrow protocol for data export
- Minimal overhead on GPU operations

Production-Ready
- Graceful handling of process termination
- Automatic multi-GPU detection and monitoring
- Secure gRPC communication with TLS support
- Configurable authentication via bearer tokens
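The metric names in this release carry base units (`gpu_power_watt`, `gpu_clock_hertz`), while NVML reports power in milliwatts and clocks in MHz. The following is a minimal Python sketch of that unit normalization, not the agent's own code: `RawSample`, `normalize`, and `read_raw` are hypothetical names, and it is an assumption that the agent converts values this way. `read_raw` relies on the optional `pynvml` bindings and needs an NVIDIA GPU to run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawSample:
    """One reading in the units NVML reports (hypothetical container)."""
    power_milliwatts: int      # from nvmlDeviceGetPowerUsage
    sm_clock_mhz: int          # from nvmlDeviceGetClockInfo(NVML_CLOCK_SM)
    temperature_celsius: int   # from nvmlDeviceGetTemperature


def normalize(sample: RawSample) -> dict:
    """Convert NVML units to the base units implied by the metric names."""
    return {
        "gpu_power_watt": sample.power_milliwatts / 1000.0,   # mW -> W
        "gpu_clock_hertz": sample.sm_clock_mhz * 1_000_000,   # MHz -> Hz
        "gpu_temperature_celsius": sample.temperature_celsius,
    }


def read_raw(index: int = 0) -> RawSample:
    """Read one sample from a GPU; requires pynvml and an NVIDIA driver."""
    import pynvml  # imported lazily so normalize() works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return RawSample(
            power_milliwatts=pynvml.nvmlDeviceGetPowerUsage(handle),
            sm_clock_mhz=pynvml.nvmlDeviceGetClockInfo(
                handle, pynvml.NVML_CLOCK_SM
            ),
            temperature_celsius=pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            ),
        )
    finally:
        pynvml.nvmlShutdown()
```

On a machine with a GPU, `normalize(read_raw(0))` would return one dictionary per call; the conversion itself can be exercised without hardware by constructing a `RawSample` by hand.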

📊 Metrics Overview

| Metric | Description | Interval |
| --- | --- | --- |
| `gpu_utilization_percent` | GPU compute utilization | 5s |
| `gpu_utilization_memory_percent` | GPU memory utilization | 5s |
| `gpu_power_watt` | Current power consumption | 1s |
| `gpu_power_limit_watt` | Maximum power limit | 1s |
| `gpu_clock_hertz` | Clock speeds | 1s |
| `gpu_temperature_celsius` | GPU temperature | 1s |
| `gpu_pcie_throughput_*_bytes` | PCIe throughput | 100ms |
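To make the three collection intervals concrete, here is a small Python sketch that encodes the table as a dictionary and works out which metrics fall due at a given moment and how many samples each produces per minute. It only illustrates the arithmetic; `INTERVALS_MS`, `due_metrics`, and `samples_per_minute` are hypothetical names, and the agent's actual scheduling may differ.

```python
# Collection intervals from the table above, in milliseconds.
INTERVALS_MS = {
    "gpu_utilization_percent": 5000,
    "gpu_utilization_memory_percent": 5000,
    "gpu_power_watt": 1000,
    "gpu_power_limit_watt": 1000,
    "gpu_clock_hertz": 1000,
    "gpu_temperature_celsius": 1000,
    "gpu_pcie_throughput_*_bytes": 100,
}


def due_metrics(elapsed_ms: int) -> list:
    """Return the metrics whose interval divides the elapsed time evenly."""
    return sorted(
        name
        for name, interval in INTERVALS_MS.items()
        if elapsed_ms % interval == 0
    )


def samples_per_minute(name: str) -> int:
    """Number of samples one metric produces per GPU in 60 seconds."""
    return 60_000 // INTERVALS_MS[name]
```

Under these intervals, PCIe throughput yields 600 samples per minute per GPU, the 1s metrics yield 60, and the utilization metrics yield 12, which is worth keeping in mind when sizing downstream storage.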

🚀 Use Cases

- GPU Resource Optimization: Identify underutilized resources and optimize workload scheduling
- Multi-tenant GPU Sharing: Track per-process GPU usage for fair resource allocation
- Performance Troubleshooting: Detect thermal throttling and bandwidth constraints
- Cost Management: Monitor power consumption for operational cost optimization
- ML/AI Workload Monitoring: Ensure optimal resource allocation during training and inference

📦 Installation

See our Kubernetes setup guide for detailed installation instructions.

🛠️ Requirements

- NVIDIA GPU with driver version 390.x or newer
- Linux operating system
- NVIDIA Management Library (NVML)
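The driver requirement can be checked before deployment. The sketch below is a hypothetical pre-flight helper, not part of the agent: it compares the major component of a driver version string against 390. The string itself could come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` or from `pynvml.nvmlSystemGetDriverVersion()`; how you obtain it is left to the caller.

```python
MIN_DRIVER_MAJOR = 390  # minimum major version listed in the requirements


def driver_is_supported(version: str) -> bool:
    """Return True if the driver's major version is 390 or newer."""
    major = version.strip().split(".", 1)[0]
    if not major.isdigit():
        raise ValueError(f"unrecognized driver version: {version!r}")
    return int(major) >= MIN_DRIVER_MAJOR
```

For example, `driver_is_supported("535.104.05")` is true, while a 384-series driver is rejected.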

🙏 Acknowledgments

This release represents a focused effort to provide best-in-class GPU observability. Special thanks to all contributors who helped shape this initial release.

Full Changelog: https://github.com/polarsignals/gpu-metrics-agent/commits/v0.1.0
