A real-time GPU monitoring tool for AMD GPUs using ROCm, with a web-based dashboard for visualization and data export. It also provides a /metrics endpoint for Prometheus, which Grafana dashboards can consume.
The tool also checks ROCm-related driver availability, and the test output reveals further details.

- ✅ Real-time GPU monitoring (temperature, power, usage, VRAM)
- ✅ Multi-GPU support with individual GPU selection
- ✅ Web-based dashboard with interactive charts
- ✅ ROCm System Diagnostics - Comprehensive ROCm installation testing
- ✅ Data export (CSV, JSON, Prometheus metrics)
- ✅ Configurable monitoring intervals
- ✅ Time-windowed data views (5min, 15min, 30min, 1h, all)
- ✅ Persistent settings - Interval and window settings saved across browser sessions
- ✅ Client-side data filtering - Change time windows without losing historical data
- ✅ Dark/Light theme support
- ✅ REST API for programmatic access
- ✅ Graceful error handling and retry logic
- ✅ Command-line configuration options
- Ubuntu 24.04 LTS (or compatible Linux distribution)
- ROCm 6.4.x installed and configured
- AMD GPU(s) supported by ROCm
- Go 1.19+ (for building from source)
- rocm-smi command available in PATH
# Clone the repository
git clone https://github.com/smarttechlabs-projects/strix-halo-rocm_info.git
cd rocm_monitor
# Build the application
go build -o rocm-monitor
# Run the monitor
./rocm-monitor

# Build
make build
# Run
make run
# Clean
make clean

# Start with default settings
./rocm-monitor
# Custom port
./rocm-monitor -port 9090
# Custom monitoring interval
./rocm-monitor -interval 10s
# Enable Prometheus metrics
./rocm-monitor -metrics
# Restrict CORS origin
./rocm-monitor -cors "http://localhost:3000"

-port int
HTTP server port (default 8080)
-interval duration
Collection interval (default 5s)
-history int
Maximum history size (default 1000)
-cors string
CORS allowed origin (default "*")
-metrics
Enable Prometheus metrics endpoint
- GET /api/stats - Get full history of GPU statistics
- GET /api/stats?window=5m - Get statistics for a specific time window
- GET /api/latest - Get only the latest data point
- GET /api/health - Health check endpoint
- GET /api/config - Get current configuration
- POST /api/config - Update configuration (interval)
- GET /api/export.csv - Export data as CSV
- GET /api/export.json - Export data as JSON
- GET /metrics - Comprehensive Prometheus metrics for Grafana integration (if enabled)
- POST /api/rocm-test - Run comprehensive ROCm system diagnostics
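The endpoints above can also be called from a script. The following is a minimal Python sketch, assuming the monitor listens on http://localhost:8080 (the documented default port). The helper names `build_url` and `fetch_json` are hypothetical, and because the response schema is not described here, the code only decodes the JSON body without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: default port from the -port flag


def build_url(path, base_url=BASE_URL, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def fetch_json(path, timeout=5.0, **params):
    """GET an endpoint and decode the JSON body (no schema is assumed)."""
    with urllib.request.urlopen(build_url(path, **params), timeout=timeout) as resp:
        return json.load(resp)
```

With the monitor running, `fetch_json("/api/stats", window="5m")` would issue the same request as the windowed curl example.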
# Get latest GPU data
curl http://localhost:8080/api/latest
# Get last 5 minutes of data
curl "http://localhost:8080/api/stats?window=5m"
# Update monitoring interval
curl -X POST http://localhost:8080/api/config \
-H "Content-Type: application/json" \
-d '{"interval": "10s"}'
# Export data as CSV
curl http://localhost:8080/api/export.csv > gpu_data.csv

Access the comprehensive monitoring dashboard at http://localhost:8080
The web interface provides real-time monitoring with professional charts and comprehensive system diagnostics.
- 🔧 Test ROCm - Launch comprehensive system diagnostics
- 📥 Export CSV - Download monitoring data in CSV format
- 📥 Export JSON - Download monitoring data in JSON format
- Interval Selector - Choose collection frequency (1s, 5s, 10s, 30s, 60s)
- Settings persist across browser sessions via localStorage
- Syncs with backend configuration on page load
- Time Window - Select data range (5min, 15min, 30min, 1h, all)
- Client-side filtering preserves historical data when changing windows
- Settings persist across browser sessions
- 🌓 Theme - Toggle between dark and light themes
- Connection Status - Green dot indicates active connection to ROCm monitor
- GPU Count - Shows number of detected GPUs (e.g., "1 GPU detected")
- Error Messages - Red banner appears if connection or data issues occur
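The client-side time-window filtering in the list above keeps the full history and only changes which slice is displayed. A rough Python sketch of that idea follows. The dashboard itself runs in the browser, so this is an illustration only: the `(timestamp, value)` tuple layout and the function name `filter_window` are assumptions.

```python
# Window labels taken from the Time Window selector; values are seconds.
WINDOW_SECONDS = {"5min": 300, "15min": 900, "30min": 1800, "1h": 3600, "all": None}


def filter_window(points, window, now):
    """Return the points inside the window; the input list is left untouched."""
    span = WINDOW_SECONDS[window]
    if span is None:  # "all" shows the complete history
        return list(points)
    cutoff = now - span
    return [p for p in points if p[0] >= cutoff]
```

Because filtering returns a new list, switching from a short window back to "all" needs no new request to the backend.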
Displays static GPU information when available:
- Product Name - GPU model (e.g., "AMD Radeon Graphics")
- Vendor - GPU manufacturer
- Serial Number - Hardware serial (if available)
- VRAM Vendor - Memory manufacturer
- Bus Info - PCIe bus location
- Firmware Versions - Various firmware component versions
🌡️ Temperature Chart
- Displays GPU edge temperature in Celsius
- Typical range: 30-85°C
- Red zone: >80°C indicates potential thermal issues
- Multiple GPU support with color-coded lines
⚡ Power Consumption Chart
- Shows real-time power usage in Watts
- Includes socket-level power measurements
- Useful for monitoring power efficiency and thermal design power (TDP)
- Helps identify power-hungry workloads
🎮 GPU Usage Chart
- GPU utilization percentage (0-100%)
- Indicates how busy the GPU cores are
- Useful for performance monitoring and bottleneck identification
- Shows compute workload intensity
🖥️ CPU Usage Chart
- Overall system CPU utilization percentage
- Helps correlate GPU and CPU workloads
- Useful for identifying system bottlenecks
- Single line showing aggregate CPU usage
⏱️ GPU Clock Frequencies Chart
- SCLK - System/Shader Clock (solid lines)
- MCLK - Memory Clock (dashed lines)
- Frequencies shown in MHz
- Indicates performance states and boost behavior
- Multiple clock domains for different GPU functions
💾 VRAM Usage Chart
- Video memory utilization in GB
- Shows used vs. total VRAM capacity
- Critical for memory-intensive applications
- Helps prevent out-of-memory conditions
Multi-GPU Selection
- Checkbox controls appear when multiple GPUs detected
- "All" checkbox toggles all GPUs simultaneously
- Individual GPU checkboxes for selective monitoring
- Real-time chart updates when selection changes
Chart Interactions
- Hover - Shows exact values and timestamps in tooltips
- Responsive Design - Adapts to screen size and mobile devices
- Smooth Updates - Real-time data with minimal animation delay
- Auto-Scaling - Y-axis automatically adjusts to data ranges
Quick Access - Click the prominent "🔧 Test ROCm" button in the dashboard header
What It Does:
- Runs 11 comprehensive diagnostic tests
- Validates ROCm installation and configuration
- Identifies common issues and provides solutions
- Tests take 10-30 seconds to complete
- Results displayed in professional modal popup
When to Use:
- After ROCm installation or updates
- When experiencing GPU performance issues
- Before deploying GPU workloads
- When troubleshooting system problems
- For system validation and health checks
The built-in ROCm Test feature provides comprehensive diagnostics to verify your ROCm installation and identify configuration issues. Access via the "🔧 Test ROCm" button in the dashboard or API endpoint.
The diagnostics run 11 comprehensive tests covering all aspects of ROCm functionality:
1. ROCm Info - rocminfo
- Purpose: Validates HSA (Heterogeneous System Architecture) runtime and device detection
- What it checks:
- ROCk kernel module loading status
- HSA runtime version and capabilities
- GPU agents and their properties
- Memory pools and accessibility
- Instruction Set Architecture (ISA) support
- Success indicators:
- "ROCk module is loaded" message present
- HSA Agents section with detected devices
- GPU device type properly identified
- Runtime version information available
- Typical duration: 100-200ms
- Critical for: Verifying basic ROCm installation
2. ROCm SMI - rocm-smi
- Purpose: Basic GPU information and operational status
- What it checks:
- GPU device detection and enumeration
- Basic metrics availability (temperature, power, usage)
- Device power state and performance levels
- Overall system health
- Success indicators:
- GPU devices listed with IDs
- Temperature readings available
- Power consumption data
- Performance metrics accessible
- Common warnings: "GPU in low-power state" (normal when idle)
- Typical duration: 50-100ms
- Critical for: Basic GPU functionality verification
3. ROCm SMI Detailed - rocm-smi -a
- Purpose: Comprehensive GPU metrics and detailed hardware information
- What it checks:
- All available GPU sensors and metrics
- Hardware identification (Device ID, VBIOS, PCIe info)
- Clock frequencies and voltage information
- Memory subsystem details
- Firmware versions and capabilities
- Success indicators:
- Detailed GPU information displayed
- Multiple metric categories available
- Hardware identifiers present
- Common informational messages (not errors):
- "Clock exists but EMPTY! Likely driver error!" - normal on APUs
- "Not supported on the given system" - expected for many APU features
- "Failed to retrieve GPU metrics" - normal when metric version unsupported
- Typical duration: 100-150ms
- Critical for: Deep hardware analysis and troubleshooting
4. GPU List - rocm-smi -l
- Purpose: Enumerate all available ROCm-compatible GPU devices
- What it checks:
- Device discovery and listing
- GPU accessibility by ROCm stack
- Device power profiles and capabilities
- Success indicators: Clean device enumeration
- Typical duration: 50-80ms
- Critical for: Multi-GPU system validation
5. Temperature Monitoring - rocm-smi -t
- Purpose: Validate thermal sensor functionality
- What it checks:
- Temperature sensor availability
- Thermal reading accuracy
- Sensor communication with driver
- Success indicators: Temperature values in reasonable range (20-90°C)
- Typical duration: 50-80ms
- Critical for: Thermal management verification
6. Power Monitoring - rocm-smi -p
- Purpose: Verify power measurement capabilities
- What it checks:
- Power sensor functionality
- Performance level reporting
- Power management features
- Success indicators: Power readings and performance levels
- Typical duration: 50-80ms
- Critical for: Power management validation
7. Clock Frequency Monitoring - rocm-smi -c
- Purpose: Validate clock frequency reporting and control
- What it checks:
- System clock (SCLK) frequency reporting
- Memory clock (MCLK) frequency reporting
- System-on-chip clock (SOCCLK) information
- Dynamic frequency scaling
- Success indicators: Clock frequencies reported in MHz
- Typical duration: 50-80ms
- Critical for: Performance monitoring capabilities
8. Memory Usage Monitoring - rocm-smi -u
- Purpose: Validate VRAM usage reporting
- What it checks:
- Video memory utilization reporting
- VRAM capacity detection
- Memory controller communication
- Success indicators: VRAM usage percentages
- Typical duration: 50-80ms
- Critical for: Memory management validation
9. HIP Version Check - hipconfig --version
- Purpose: Verify HIP (Heterogeneous-Compute Interface for Portability) runtime
- What it checks:
- HIP runtime installation
- Version compatibility
- Runtime library availability
- Expected output: Version number format (e.g., "7.1.25424-4179531dcd")
- Success indicators: Valid version string returned
- Typical duration: 30-50ms
- Critical for: HIP application compatibility
10. HIP Platform Detection - hipconfig --platform
- Purpose: Validate HIP platform configuration
- What it checks:
- Platform backend detection (AMD vs NVIDIA)
- Runtime configuration validity
- Backend library availability
- Expected output: Platform identifier ("amd", "nvidia", etc.)
- Success indicators: Valid platform name returned
- Typical duration: 30-50ms
- Critical for: Platform-specific optimization
11. Device Capability Query - rocminfo (second run)
- Purpose: Detailed device capabilities and architecture verification
- What it checks:
- Complete device feature enumeration
- Architecture-specific capabilities
- Memory hierarchy and access patterns
- Compute unit organization
- Success indicators: Detailed capability information
- Typical duration: 100-200ms
- Critical for: Application optimization and compatibility
- Open Dashboard - Navigate to http://localhost:8080
- Click Test Button - Click "🔧 Test ROCm" in the header
- Wait for Results - Tests run for 10-30 seconds
- Review Output - Click individual tests to expand details
# Run diagnostics via API
curl -X POST http://localhost:8080/api/rocm-test
# Example response structure:
{
  "overall_success": true,
  "summary": "✅ All ROCm tests passed - Tests: 11 total, 11 passed, 0 failed (completed in 1240ms)",
  "test_results": [
    {
      "command": "rocminfo",
      "success": true,
      "output": "[full command output]",
      "duration_ms": 156,
      "issues": [],
      "summary": "✅ ROCm Info - Success"
    }
  ]
}

The test modal displays a comprehensive summary at the top:
✅ Green Summary Examples:
- "✅ All ROCm tests passed - Tests: 11 total, 11 passed, 0 failed (completed in 1240ms)"
- Indicates: ROCm is fully functional and properly configured
- "⚠️ ROCm tests passed with warnings - Tests: 11 total, 11 passed, 0 failed, 3 with warnings (completed in 1340ms)"
- Indicates: ROCm works but has minor issues or informational warnings (usually normal on APUs)
❌ Red Summary Examples:
- "❌ ROCm tests failed - Tests: 11 total, 7 passed, 4 failed (completed in 890ms)"
- Indicates: Critical issues requiring immediate attention
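The summary lines shown above follow a regular pattern, so a script can pull the counters out of them. The sketch below assumes the wording stays exactly as in these examples. The function name `parse_summary` is hypothetical, and the structured `overall_success` field of the API response remains the more robust signal.

```python
import re

# Pattern derived from the example summary strings; the warnings part is optional.
SUMMARY_RE = re.compile(
    r"Tests: (\d+) total, (\d+) passed, (\d+) failed"
    r"(?:, (\d+) with warnings)? \(completed in (\d+)ms\)"
)


def parse_summary(summary):
    """Extract the counters from a summary line, or return None if it does not match."""
    m = SUMMARY_RE.search(summary)
    if m is None:
        return None
    total, passed, failed, warned, ms = m.groups()
    return {
        "total": int(total),
        "passed": int(passed),
        "failed": int(failed),
        "warnings": int(warned or 0),
        "duration_ms": int(ms),
    }
```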
Each test result is displayed as an expandable card showing:
Test Header (Always Visible):
- Status Icon: ✅ (Success), ❌ (Failure), ⚠️ (Warning)
- Test Name: Descriptive name (e.g., "ROCm Info - Success")
- Warning Indicator: Additional ⚠️ if issues detected but test passed
- Execution Time: Duration in milliseconds (e.g., "156ms")
- Expand Arrow: ▼ (collapsed) / ▲ (expanded)
Expandable Details (Click to View):
- Command: Exact command executed (e.g., rocminfo or rocm-smi -a)
- Issues Detected: Specific warnings or problems found (if any)
- Output: Complete raw output from the command
- Error Output: Error messages if command failed
✅ Success with No Issues
✅ ROCm SMI Temperature - Success 78ms ▼
- Command executed successfully
- No warnings or issues detected
- Output contains expected data
✅ Success with Warnings
✅ ROCm SMI Detailed - Success ⚠️ 118ms ▼
Issues detected:
⚠️ Some metrics unavailable (temperature, power, etc.)
- Command executed successfully
- Minor issues detected (often normal on APUs)
- Functionality still working correctly
❌ Failure
❌ HIP Version - Failed 45ms ▼
Issues detected:
❌ Command not found - ROCm may not be installed
- Command failed to execute
- Critical issue requiring attention
- Specific guidance provided in issues section
Execution Times:
- Fast (< 50ms): hipconfig commands, simple queries
- Normal (50-150ms): rocm-smi commands, system queries
- Slower (> 150ms): rocminfo commands with full device enumeration
Abnormal Timing Indicators:
- > 1000ms: May indicate system performance issues
- Timeout (30s): Critical system problems, driver issues
- Variable timing: Inconsistent performance, potential instability
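The timing bands above translate directly into a small classifier, which may be handy when post-processing the `duration_ms` values from the API response. This is a sketch only: the band names are taken from the lists above, and the function name `classify_duration` is hypothetical.

```python
def classify_duration(duration_ms):
    """Map a test duration in milliseconds to the timing bands listed above."""
    if duration_ms > 1000:
        return "abnormal"  # may indicate system performance issues
    if duration_ms > 150:
        return "slower"    # typical for rocminfo with full device enumeration
    if duration_ms >= 50:
        return "normal"    # typical for rocm-smi commands
    return "fast"          # typical for hipconfig commands
```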
Modal Layout:
- Header: Test title with close button (×)
- Summary Bar: Color-coded overall result
- Test List: Expandable cards for each test
- Scrollable Content: Handle large output easily
Color Coding:
- Green: Success states, healthy systems
- Yellow: Warning states, minor issues
- Red: Error states, critical issues
- Blue: Informational elements, neutral states
Interactive Elements:
- Click Test Cards: Expand/collapse detailed information
- Hover Effects: Visual feedback on interactive elements
- Responsive Design: Works on desktop and mobile devices
- Keyboard Support: Accessible via keyboard navigation
1. Launch Application
./rocm-monitor
2. Open Dashboard
- Navigate to http://localhost:8080
- Wait for "Connected" status indicator
3. Verify System Health
- Click "🔧 Test ROCm" for comprehensive diagnostics
- Review any warnings or issues
4. Monitor Real-Time Data
- Observe temperature, power, and usage trends
- Adjust time window and refresh interval as needed
For Development/Testing:
- Interval: 1-5 seconds for responsive monitoring
- Time Window: 5-15 minutes for recent trends
- Export: Regular JSON exports for analysis
For Production Monitoring:
- Interval: 10-30 seconds to reduce overhead
- Time Window: 30 minutes to 1 hour for trend analysis
- Metrics: Enable Prometheus endpoint (-metrics flag)
For Troubleshooting:
- Interval: 1 second for maximum responsiveness
- Time Window: 5 minutes for immediate issue correlation
- Testing: Run ROCm tests during problem periods
Temperature Monitoring:
- Normal Range: 30-70°C for most workloads
- Concerning: >80°C sustained temperatures
- Critical: >90°C temperatures (thermal throttling likely)
Power Monitoring:
- Baseline: Note idle power consumption (typically 5-15W)
- Load Testing: Monitor power spikes during workloads
- Efficiency: Correlate power usage with performance metrics
Memory Monitoring:
- Capacity Planning: Keep VRAM usage <80% for performance
- Leak Detection: Watch for gradually increasing memory usage
- Allocation Patterns: Monitor usage spikes during operations
Performance Correlation:
- GPU + CPU: High GPU usage should correlate with application CPU usage
- Clock Scaling: Observe frequency changes under load
- Thermal Throttling: Watch for clock reduction when temperature rises
Development Workflow:
- Start monitoring before launching GPU workloads
- Run ROCm tests after driver updates
- Export data for performance regression testing
- Use real-time monitoring during development
DevOps Integration:
- Monitor during deployment and scaling
- Set up automated ROCm testing in CI/CD
- Export metrics to monitoring infrastructure
- Create alerts for temperature/power thresholds
Research/Analysis Workflow:
- Baseline system before experiments
- Monitor continuously during long-running tasks
- Export detailed data for analysis
- Correlate performance with hardware metrics
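For the automated testing mentioned in the workflows above, a script can turn a decoded /api/rocm-test response into an exit code plus a short report. The sketch below relies only on the fields shown in the example response (`overall_success`, `summary`, `test_results`, `command`, `success`, `issues`). It assumes that entries in `issues` are plain strings, and the function name `evaluate_rocm_test` is hypothetical.

```python
def evaluate_rocm_test(response):
    """Return (exit_code, report_lines) for a decoded /api/rocm-test response."""
    lines = [response.get("summary", "")]
    for result in response.get("test_results", []):
        # Report tests that failed, and tests that passed with issues attached.
        if not result.get("success", False) or result.get("issues"):
            issues = "; ".join(str(i) for i in result.get("issues", [])) or "no details"
            lines.append(f"{result.get('command', '?')}: {issues}")
    exit_code = 0 if response.get("overall_success") else 1
    return exit_code, lines
```

A CI job could print the report lines and pass the exit code to `sys.exit`, so a failed diagnostic run fails the pipeline.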
❌ "Command not found" Errors
Issue: rocminfo: command not found
Solution: Install ROCm runtime
# Ubuntu 24.04 installation
sudo apt update
sudo apt install rocminfo rocm-smi-lib hip-runtime-dev

❌ "No AMD/ROCm GPU devices detected"
Issue: No compatible GPU found
Solutions:
1. Verify GPU compatibility: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html
2. Check if GPU is detected: lspci | grep AMD
3. Install GPU drivers: sudo apt install amdgpu-dkms
❌ "Permission denied - add user to render group"
Issue: User lacks GPU access permissions
Solution: Add user to render group
sudo usermod -a -G render $USER
sudo usermod -a -G video $USER
# Log out and back in, or reboot

Issue: ROCm kernel driver not loaded
Solutions:
1. Restart system after ROCm installation
2. Load modules manually: sudo modprobe amdgpu
3. Check module status: lsmod | grep amdgpu
Issue: HSA runtime not properly installed
Solution: Install HSA runtime
sudo apt install hsa-rocr-dev hsa-runtime-dev

Issue: GPU not detected by HSA runtime
Solutions:
1. Check ROCm compatibility
2. Verify driver installation
3. Restart system
❌ "HIP not properly installed"
Issue: HIP runtime missing or broken
Solution: Install/reinstall HIP
sudo apt install hip-runtime-dev hipcc

Issue: Test was expecting wrong output format from hipconfig commands
Note: hipconfig --version returns "7.1.25424-4179531dcd" (version numbers)
hipconfig --platform returns "amd" (platform name)
Solution: Update to latest version - this was a test detection bug
Issue: GPU running in power-saving mode
Note: This is normal when GPU is idle
Action: No action required unless performance is poor
Issue: Certain GPU sensors not supported
Note: Common on APUs and some GPU models
Action: Normal behavior, not a critical issue
Issue: Empty clock frequency domains (dcefclk, mclk)
Note: Normal on APU systems where certain clocks are managed differently
Examples: "Clock [mclk] on device [0] exists but EMPTY! Likely driver error!"
Action: No action required - this is expected behavior on APUs
Issue: GPU metrics version not supported for this device
Note: Common when monitoring interface version doesn't match GPU generation
Action: No action required - basic monitoring still works
❌ "Critical driver error detected"
Issue: Driver malfunction or corruption
Solutions:
1. Restart system
2. Reinstall ROCm: sudo apt remove --purge 'rocm-*' && sudo apt install rocm
3. Check system logs: dmesg | grep amdgpu
❌ "Fatal error detected"
Issue: System-level failure
Solutions:
1. Check system logs: journalctl -b | grep rocm
2. Verify hardware: memtest86+
3. Reinstall drivers and ROCm
1. Verify Installation
which rocminfo rocm-smi hipconfig
2. Check Groups
groups $USER # Should include 'render' and 'video'
3. Test Basic Commands
rocminfo | head -10
rocm-smi

# Check ROCm installation status
dpkg -l | grep rocm
# Verify GPU detection at hardware level
lspci -nn | grep AMD
# Check kernel modules
lsmod | grep amdgpu
# Review system logs for errors
dmesg | grep -i rocm
journalctl -b | grep -i amd

# Run tests every hour via cron
0 * * * * curl -X POST http://localhost:8080/api/rocm-test > /var/log/rocm-test.log 2>&1

#!/bin/bash
# ROCm validation script for CI/CD
response=$(curl -s -X POST http://localhost:8080/api/rocm-test)
success=$(echo "$response" | jq -r '.overall_success')
if [ "$success" = "true" ]; then
    echo "✅ ROCm tests passed"
    exit 0
else
    echo "❌ ROCm tests failed"
    echo "$response" | jq '.summary'
    exit 1
fi

The application is structured into modular components:
- main.go - HTTP server and route handlers
- collector.go - Data collection service with rocm-smi integration
- rocm_data.go - Data structures and parsing logic
- exporter.go - Export functionality (CSV, JSON, Prometheus)
- test_rocm.go - ROCm diagnostics and system testing
- static/index.html - Web dashboard
- Command execution timeout (3 seconds)
- Input validation for all parameters
- Configurable CORS origins
- Graceful error handling
- No shell injection vulnerabilities
- Pre-compiled regex patterns
- Efficient circular buffer for history
- Minimal memory allocations
- Concurrent-safe data access
- Optimized chart updates
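The bounded history in the list above can be pictured with a short Python sketch. It illustrates the circular-buffer idea and is not the Go implementation: the class name `History` is hypothetical, the default of 1000 mirrors the documented -history default, and the locking needed for concurrent access is left out.

```python
from collections import deque


class History:
    """Bounded history: once full, each append evicts the oldest sample."""

    def __init__(self, max_size=1000):  # 1000 mirrors the -history default
        self._samples = deque(maxlen=max_size)

    def add(self, sample):
        self._samples.append(sample)

    def snapshot(self):
        """Return a copy so callers never observe later mutations."""
        return list(self._samples)
```

Memory use stays constant after the buffer fills, which is why a fixed maximum history size pairs well with a long-running monitor.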
Ensure ROCm is properly installed and rocm-smi is in your PATH:
which rocm-smi

The application needs permission to execute rocm-smi. Run with appropriate permissions or add your user to the video group:
sudo usermod -a -G video $USER

Check if your GPU is detected by ROCm:
rocm-smi

Contributions are welcome! Please ensure:
- Code follows Go best practices
- Tests are included for new features
- Documentation is updated
- Security considerations are addressed
ROCm Monitor provides comprehensive observability capabilities for GPU systems through multiple interfaces:
- 📊 Real-time Dashboard - Interactive web interface with live charts
- 📈 Prometheus Metrics - Enterprise-grade metrics export for Grafana
- 🔍 ROCm Diagnostics - Comprehensive system testing and validation
- 📁 Data Export - Historical data export in CSV/JSON formats
- ⚡ REST API - Programmatic access to all monitoring data
- GPU Metrics: Temperature, power, utilization, clock frequencies
- Memory Metrics: VRAM usage, capacity, utilization percentages
- Thermal Metrics: Temperature thresholds and thermal management
- Performance Metrics: Clock speeds, performance states, boost behavior
- Host Metrics: CPU utilization, system resource usage
- Collection Health: Data collection performance and reliability
- Service Metrics: Monitor uptime, memory usage, error tracking
- Connectivity: GPU detection and communication health
- ROCm Stack: Runtime validation, driver status, platform detection
- Diagnostic Results: Test execution metrics, validation outcomes
- Historical Analysis: Trend analysis and capacity planning data
- Alert Conditions: Threshold monitoring and anomaly detection
# Enable comprehensive metrics
./rocm-monitor -metrics
# Access metrics endpoint
curl http://localhost:8080/metrics

21+ Production Metrics:
- GPU hardware metrics with rich labels
- System health and performance indicators
- Collection reliability and error tracking
- Performance threshold monitoring
- Multi-GPU support with device identification
# Export historical data
curl http://localhost:8080/api/export.csv > gpu_data.csv

Includes:
- Timestamp-indexed data points
- All GPU metrics per collection interval
- System CPU utilization correlation
- Clock frequency tracking over time
# Export with metadata
curl http://localhost:8080/api/export.json > gpu_data.json

Structured Format:
- Export metadata and statistics
- Complete historical dataset
- Collector performance metrics
- Data validation and integrity info
The enhanced /metrics endpoint provides 21+ metrics for complete observability:
GPU Hardware Metrics:
- rocm_gpu_temperature_celsius - GPU edge temperature in Celsius
- rocm_gpu_power_watts - GPU power consumption in watts
- rocm_gpu_usage_percent - GPU compute utilization percentage
- rocm_gpu_vram_usage_gb / rocm_gpu_vram_total_gb - VRAM capacity metrics
- rocm_gpu_vram_utilization_percent - VRAM utilization percentage
- rocm_gpu_sclk_mhz / rocm_gpu_mclk_mhz - System and memory clock frequencies
- rocm_gpu_fan_speed_percent - GPU fan speed percentage
System Metrics:
- rocm_system_cpu_usage_percent - System CPU utilization
- rocm_system_gpu_count - Number of detected GPUs
Monitoring Health Metrics:
- rocm_monitor_collection_errors_total - Total collection errors (counter)
- rocm_monitor_collection_duration_ms - Collection time in milliseconds
- rocm_monitor_data_points_total - Total data points collected (counter)
- rocm_monitor_uptime_seconds - Monitor uptime in seconds
- rocm_monitor_memory_usage_mb - Monitor memory usage in MB
- rocm_monitor_history_size_points - Historical data points stored
Performance Thresholds:
- rocm_gpu_temperature_warning_threshold - Temperature warning (>70°C)
- rocm_gpu_temperature_critical_threshold - Temperature critical (>80°C)
- rocm_gpu_vram_high_utilization - VRAM high usage alert (>80%)
ROCm Test Metrics (when tests are executed):
- rocm_test_suite_success - Overall test suite success status
- rocm_test_suite_duration_ms - Total test execution time
- rocm_test_suite_total_tests / rocm_test_suite_passed_tests - Test counts
- rocm_test_success - Individual test success status
- rocm_test_duration_ms - Individual test execution times
- rocm_test_issues_count - Number of issues detected per test
Rich Labels for Multi-GPU Support: All GPU metrics include comprehensive labels:
- gpu_id - GPU identifier (0, 1, 2...)
- product_name - GPU model name (AMD Radeon Graphics)
- vendor - GPU vendor (AMD, NVIDIA)
- serial_number - Hardware serial number
- vram_vendor - VRAM manufacturer
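To consume these labeled metrics without a Prometheus server, a script can parse the text served at /metrics. The sketch below handles only simple `name{labels} value` lines in the Prometheus text exposition format. It skips comment lines and lines carrying a trailing timestamp, and it does not unescape label values. The function name `parse_metrics` is hypothetical, and the exact output of the monitor is an assumption.

```python
import re

# name, optional {label block}, then a single value token at end of line
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        m = LINE_RE.match(line)
        if m is None:  # unsupported shape, e.g. a trailing timestamp
            continue
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Filtering the result on `labels.get("gpu_id")` gives per-device values on multi-GPU systems.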
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'rocm-monitor'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s
    honor_labels: true

For rapid deployment, use the included Docker Compose stack:
# Start ROCm Monitor with metrics
./rocm-monitor -metrics &
# Launch complete monitoring stack
cd container
./start.sh
# Access dashboards
# Grafana: http://localhost:3000 (admin/admin123)
# Prometheus: http://localhost:9090

Included Components:
- Prometheus - Metrics collection with 7-day retention
- Grafana - Pre-configured ROCm GPU dashboard
- Alertmanager - Production-ready alert rules
- Custom Dashboard - Temperature, power, VRAM, performance metrics
See container/README.md for detailed setup and configuration options.
GPU Performance Overview:
# Temperature monitoring
rocm_gpu_temperature_celsius{gpu_id="0"}
# Power consumption trends
rocm_gpu_power_watts{gpu_id="0"}
# GPU utilization
rocm_gpu_usage_percent{gpu_id="0"}
# VRAM utilization
rocm_gpu_vram_utilization_percent{gpu_id="0"}
Multi-GPU Monitoring:
# All GPUs temperature
rocm_gpu_temperature_celsius{gpu_id=~".*"}
# Average GPU utilization across all devices
avg(rocm_gpu_usage_percent)
# Total system power consumption
sum(rocm_gpu_power_watts)
System Health Dashboard:
# Collection reliability
rate(rocm_monitor_collection_errors_total[5m])
# Monitor performance
rocm_monitor_collection_duration_ms
# System uptime
rocm_monitor_uptime_seconds
# Data collection rate
rate(rocm_monitor_data_points_total[5m])
Performance Analysis:
# Clock frequency trends
rocm_gpu_sclk_mhz{gpu_id="0"}
rocm_gpu_mclk_mhz{gpu_id="0"}
# Power efficiency (performance per watt)
rocm_gpu_usage_percent{gpu_id="0"} / rocm_gpu_power_watts{gpu_id="0"}
# Thermal efficiency (performance vs temperature)
rocm_gpu_usage_percent{gpu_id="0"} / rocm_gpu_temperature_celsius{gpu_id="0"}
groups:
  - name: rocm_critical_alerts
    rules:
      - alert: GPUTemperatureCritical
        expr: rocm_gpu_temperature_celsius > 85
        for: 2m
        labels:
          severity: critical
          component: hardware
        annotations:
          summary: "GPU {{ $labels.gpu_id }} temperature critically high"
          description: "GPU temperature {{ $value }}°C exceeds safe operating limits"
      - alert: VRAMExhaustion
        expr: rocm_gpu_vram_utilization_percent > 95
        for: 1m
        labels:
          severity: critical
          component: memory
        annotations:
          summary: "GPU {{ $labels.gpu_id }} VRAM near exhaustion"
          description: "VRAM utilization {{ $value }}% may cause out-of-memory errors"
  - name: rocm_warning_alerts
    rules:
      - alert: GPUTemperatureWarning
        expr: rocm_gpu_temperature_celsius > 75
        for: 5m
        labels:
          severity: warning
          component: hardware
        annotations:
          summary: "GPU {{ $labels.gpu_id }} running warm"
          description: "GPU temperature {{ $value }}°C approaching thermal limits"
      - alert: VRAMHighUtilization
        expr: rocm_gpu_vram_utilization_percent > 80
        for: 3m
        labels:
          severity: warning
          component: memory
        annotations:
          summary: "GPU {{ $labels.gpu_id }} VRAM utilization high"
          description: "VRAM utilization {{ $value }}% may impact performance"
      - alert: ROCmCollectionErrors
        expr: increase(rocm_monitor_collection_errors_total[10m]) > 5
        for: 2m
        labels:
          severity: warning
          component: monitoring
        annotations:
          summary: "ROCm data collection experiencing errors"
          description: "{{ $value }} collection errors in the last 10 minutes"
      - alert: MonitorPerformanceDegraded
        expr: rocm_monitor_collection_duration_ms > 1000
        for: 5m
        labels:
          severity: warning
          component: monitoring
        annotations:
          summary: "ROCm monitor collection performance degraded"
          description: "Collection duration {{ $value }}ms exceeds normal range"
  - name: rocm_info_alerts
    rules:
      - alert: ROCmTestFailure
        expr: rocm_test_suite_success == 0
        for: 0s
        labels:
          severity: info
          component: diagnostics
        annotations:
          summary: "ROCm diagnostic tests failed"
          description: "ROCm system validation detected issues requiring attention"

Real-time Monitoring:
- Use 15-second intervals for production workloads
- Monitor temperature and power consumption continuously
- Track VRAM utilization during compute-intensive tasks
- Set up immediate alerts for critical thresholds
Historical Analysis:
- Retain 24-48 hours of high-resolution data
- Export daily summaries for long-term trending
- Analyze performance patterns and capacity planning
- Correlate GPU metrics with application performance
Health Monitoring:
- Monitor collector reliability and performance
- Track ROCm stack health with periodic diagnostic tests
- Validate driver stability and runtime functionality
- Monitor for hardware degradation over time
Executive Dashboard:
- System overview with key health indicators
- Multi-GPU summary with aggregate metrics
- Alert status and system uptime tracking
- Performance trending and capacity utilization
Operations Dashboard:
- Detailed GPU metrics with drill-down capability
- Real-time troubleshooting and diagnostic tools
- Collection health and monitoring system status
- Historical data analysis and export capabilities
Engineering Dashboard:
- Clock frequency analysis and performance tuning
- Thermal management and power efficiency metrics
- VRAM allocation patterns and optimization opportunities
- ROCm stack validation and compatibility tracking
Alert Hierarchy:
- Critical - Immediate response required (temperature >85°C, VRAM >95%)
- Warning - Action needed soon (temperature >75°C, VRAM >80%)
- Info - Awareness alerts (test failures, performance changes)
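The critical and warning thresholds in the hierarchy above can be expressed as a small function, for example to label exported data points offline. This is a sketch: the function name `alert_severity` is hypothetical, and the hold durations (the `for:` clauses of the Prometheus rules) are deliberately ignored.

```python
def alert_severity(temperature_c, vram_percent):
    """Return 'critical', 'warning', or 'ok' using the thresholds listed above."""
    if temperature_c > 85 or vram_percent > 95:
        return "critical"
    if temperature_c > 75 or vram_percent > 80:
        return "warning"
    return "ok"
```

Checking the critical band first means a reading that crosses both thresholds is reported once, at the higher severity.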
Escalation Procedures:
- Critical alerts: Immediate notification and automated response
- Warning alerts: Team notification within 15 minutes
- Info alerts: Daily digest and trend analysis
Alert Fatigue Prevention:
- Use proper thresholds based on workload characteristics
- Implement alert suppression during maintenance windows
- Group related alerts to avoid notification spam
- Regular review and tuning of alert sensitivity
[Your License Here]
- ROCm team for the GPU driver and tools
- Chart.js for the visualization library
