Description
Background
MCV currently uses the rocm-smi command-line tool via exec.Command to query AMD GPU information. While functional, this approach has several limitations:
- Subprocess overhead for each CLI invocation
- JSON parsing from command output
- Less type-safe than native Go bindings
- No compile-time validation of API usage
- Error handling relies on parsing CLI output
AMD now provides official Go bindings for the AMD SMI library that offer native Go integration with better performance and type safety.
Current Implementation
File: pkg/accelerator/devices/rocm.go
Currently uses two CLI commands:
// GPU information
exec.CommandContext(ctx, "rocm-smi", "--json", "--showproductname", "--showuniqueid", "--showserial", "--showmeminfo", "all")
// System/driver information
exec.CommandContext(ctx, "rocm-smi", "--json", "--showdriverversion")
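To make the JSON-parsing limitation concrete, here is a minimal sketch of the decoding step that follows those commands. It assumes the output is a JSON object keyed by card ID; the key names ("Card series", "Unique ID", "Serial Number") and the helpers `cardInfo`, `parseROCmSMI`, and `sortedCardIDs` are illustrative assumptions, not the code in rocm.go, and the real key names may vary between ROCm releases.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// cardInfo holds the fields of interest for one GPU.
// The JSON key names are assumptions about rocm-smi output.
type cardInfo struct {
	Series   string `json:"Card series"`
	UniqueID string `json:"Unique ID"`
	Serial   string `json:"Serial Number"`
}

// parseROCmSMI decodes output shaped like {"card0": {...}, "card1": {...}}.
// Unknown keys are ignored; a type mismatch surfaces only at runtime.
func parseROCmSMI(out []byte) (map[string]cardInfo, error) {
	cards := make(map[string]cardInfo)
	if err := json.Unmarshal(out, &cards); err != nil {
		return nil, fmt.Errorf("parsing rocm-smi output: %w", err)
	}
	return cards, nil
}

// sortedCardIDs returns the card keys in a stable order.
func sortedCardIDs(cards map[string]cardInfo) []string {
	ids := make([]string, 0, len(cards))
	for id := range cards {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Hand-written sample, not captured from a real system.
	sample := []byte(`{"card0": {"Card series": "Example GPU", "Unique ID": "0xabc", "Serial Number": "S123"}}`)
	cards, err := parseROCmSMI(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, id := range sortedCardIDs(cards) {
		fmt.Printf("%s: %+v\n", id, cards[id])
	}
}
```

A renamed or missing key in the CLI output leaves the corresponding field silently empty, which is the kind of failure a typed API would avoid.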
Proposed Solution
Migrate to the AMD SMI Go library: https://github.com/ROCm/amdsmi
Package: github.com/ROCm/amdsmi
Documentation: https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-go-api.html
AMD SMI Go Library Features
Initialization
import goamdsmi "github.com/ROCm/amdsmi"
// Initialize GPU library
goamdsmi.GO_gpu_init()
// Cleanup when done
defer goamdsmi.GO_gpu_shutdown()
Key Functions Available:
- Device Enumeration:
  - GO_gpu_num_monitor_devices() - Get number of GPUs
- Device Information:
  - GO_gpu_dev_name_get(index) - GPU name
  - GO_gpu_dev_id_get(index) - Device ID
  - GO_gpu_dev_pci_id_get(index) - PCI ID
  - GO_gpu_dev_vbios_version_get(index) - VBIOS version
  - GO_gpu_dev_vendor_name_get(index) - Vendor name
- Memory Information:
  - GO_gpu_dev_gpu_memory_total_get(index) - Total VRAM
  - GO_gpu_dev_gpu_memory_usage_get(index) - Used VRAM
  - GO_gpu_dev_gpu_memory_busy_percent_get(index) - Memory utilization
- Performance Monitoring:
  - GO_gpu_dev_gpu_busy_percent_get(index) - GPU utilization
  - GO_gpu_dev_temp_metric_get(index, sensor, metric) - Temperature
  - Clock frequency queries
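One way to keep the refactor testable without AMD hardware is to put the calls listed above behind a small interface. The sketch below is illustrative only: `smiBackend`, `gpuInfo`, `listGPUs`, and `fakeBackend` are hypothetical names, and it assumes initialization success can be reported as a boolean. A real implementation would wrap the GO_gpu_* functions, whose exact return types should be checked against the Go API documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// smiBackend is a hypothetical abstraction over the GO_gpu_* calls;
// it is not part of the AMD SMI package.
type smiBackend interface {
	Init() bool
	Shutdown()
	NumDevices() int
	Name(index int) string
	MemoryTotal(index int) uint64
}

// gpuInfo is an illustrative result type, not the project's TritonGPUInfo.
type gpuInfo struct {
	Index       int
	Name        string
	MemoryTotal uint64
}

var errInitFailed = errors.New("AMD SMI initialization failed")

// listGPUs initializes the backend, enumerates devices, and always
// shuts the backend down again once initialization has succeeded.
func listGPUs(b smiBackend) ([]gpuInfo, error) {
	if !b.Init() {
		return nil, errInitFailed
	}
	defer b.Shutdown()

	n := b.NumDevices()
	if n < 0 {
		n = 0
	}
	gpus := make([]gpuInfo, 0, n)
	for i := 0; i < n; i++ {
		gpus = append(gpus, gpuInfo{Index: i, Name: b.Name(i), MemoryTotal: b.MemoryTotal(i)})
	}
	return gpus, nil
}

// fakeBackend stands in for the real library in unit tests.
type fakeBackend struct {
	initOK        bool
	names         []string
	memTotals     []uint64
	shutdownCalls int
}

func (f *fakeBackend) Init() bool                   { return f.initOK }
func (f *fakeBackend) Shutdown()                    { f.shutdownCalls++ }
func (f *fakeBackend) NumDevices() int              { return len(f.names) }
func (f *fakeBackend) Name(index int) string        { return f.names[index] }
func (f *fakeBackend) MemoryTotal(index int) uint64 { return f.memTotals[index] }

func main() {
	fake := &fakeBackend{
		initOK:    true,
		names:     []string{"GPU A", "GPU B"},
		memTotals: []uint64{1 << 30, 2 << 30},
	}
	gpus, err := listGPUs(fake)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, g := range gpus {
		fmt.Printf("%d %s %d bytes\n", g.Index, g.Name, g.MemoryTotal)
	}
}
```

With this shape, the production backend is a thin adapter over the library, and the enumeration logic can be exercised on machines without a GPU.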
Implementation Plan
- Update dependencies (go.mod)
  - Add: github.com/ROCm/amdsmi
  - Minimum Go version: 1.20
- Update Dockerfile (images/amd64.dockerfile)
  - Already installs amd-smi-lib and rocm-smi-lib (line 71)
  - Ensure library paths are set: /opt/rocm/lib:/opt/rocm/lib64
- Refactor pkg/accelerator/devices/rocm.go
  - Replace initROCmLib() to use goamdsmi.GO_gpu_init()
  - Update Init() method to query devices using Go API
  - Replace JSON parsing with direct API calls
  - Add proper shutdown handling
  - Maintain backward compatibility with existing TritonGPUInfo structure
- Update build configuration
  - Ensure CGO_ENABLED=1 (already set in Dockerfile)
  - Set LD_LIBRARY_PATH for AMD SMI libraries
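Because the cgo bindings depend on a shared library being discoverable at runtime, a small preflight check can give a clearer error than a loader failure. The following is a sketch under assumptions: `candidatePaths` and `findLibrary` are hypothetical helpers, and `libamd_smi.so` is an assumed file name that should be confirmed against what the amd-smi-lib package actually installs.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// candidatePaths expands a colon-separated search path (LD_LIBRARY_PATH
// style) into full candidate file paths for libName, skipping empty entries.
func candidatePaths(searchPath, libName string) []string {
	var out []string
	for _, dir := range strings.Split(searchPath, ":") {
		if dir == "" {
			continue
		}
		out = append(out, path.Join(dir, libName))
	}
	return out
}

// findLibrary returns the first candidate that exists as a regular file.
// os.Stat follows symlinks, so versioned .so symlinks are accepted.
func findLibrary(searchPath, libName string) (string, bool) {
	for _, p := range candidatePaths(searchPath, libName) {
		if info, err := os.Stat(p); err == nil && info.Mode().IsRegular() {
			return p, true
		}
	}
	return "", false
}

func main() {
	// Assumed library file name; verify before relying on it.
	const libName = "libamd_smi.so"

	searchPath := os.Getenv("LD_LIBRARY_PATH")
	if searchPath == "" {
		// Default taken from the paths named in the plan above.
		searchPath = "/opt/rocm/lib:/opt/rocm/lib64"
	}
	if p, ok := findLibrary(searchPath, libName); ok {
		fmt.Println("found", p)
	} else {
		fmt.Println(libName, "not found in", searchPath)
	}
}
```

This only mirrors part of what the dynamic loader does (it ignores ld.so.conf and rpath), so it is a diagnostic aid rather than a guarantee that loading will succeed.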
Benefits
- Performance: No subprocess overhead, direct library calls
- Type Safety: Compile-time checking of API usage
- Better Error Handling: Native Go errors instead of parsing CLI output
- Maintainability: Cleaner code without JSON marshaling/unmarshaling
- Consistency: Similar pattern to NVML usage in pkg/accelerator/devices/nvml.go
Example Migration
Before (CLI approach):
cmd := exec.CommandContext(ctx, "rocm-smi", "--json", "--showmeminfo", "all")
output, err := cmd.Output()
if err != nil {
	return err
}
return json.Unmarshal(output, &gpuInfo)
After (Go library):
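// Initialize once and release the library on exit.
goamdsmi.GO_gpu_init()
defer goamdsmi.GO_gpu_shutdown()
numGPUs := int(goamdsmi.GO_gpu_num_monitor_devices())
for i := 0; i < numGPUs; i++ {
	name := goamdsmi.GO_gpu_dev_name_get(i)
	memTotal := goamdsmi.GO_gpu_dev_gpu_memory_total_get(i)
	// GO_gpu_dev_unique_id_get is not in the function list above; confirm it exists in the bindings.
	uuid := goamdsmi.GO_gpu_dev_unique_id_get(i)
	log.Printf("gpu %d: name=%v memTotal=%v uuid=%v", i, name, memTotal, uuid)
}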
Files to Update
- go.mod - Add AMD SMI Go dependency
- pkg/accelerator/devices/rocm.go - Main implementation
- pkg/accelerator/devices/device.go - Update initialization if needed
- images/amd64.dockerfile - Verify library paths
- Unit tests for new implementation
- Documentation updates
References
- AMD SMI GitHub: https://github.com/ROCm/amdsmi
- Go API Documentation: https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-go-api.html
- AMD SMI Exporter (example usage): https://github.com/amd/amd_smi_exporter
Testing
- Verify functionality on systems with AMD GPUs
- Test with multiple GPU configurations
- Ensure backward compatibility with existing cache detection
- Validate Triton cache preflight checks still work
- Test on ROCm 6.x and 7.x versions
Notes
- The library requires AMD GPU drivers to be installed
- Must maintain compatibility with existing device detection flow
- Consider keeping the amd-smi CLI as a fallback for environments where the Go library isn't available
- The AMD type in pkg/accelerator/devices/amd.go already uses the amd-smi CLI (a different tool from rocm-smi) and should remain unchanged
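The fallback idea in these notes could be structured as an ordered list of detection strategies. This is a minimal sketch, not the project's design: `gpu`, `detector`, `detectWithFallback`, and `errLibraryUnavailable` are hypothetical names, and the two detectors in `main` are stubs standing in for a library-backed and a CLI-backed implementation. It uses `errors.Join`, which requires Go 1.20.

```go
package main

import (
	"errors"
	"fmt"
)

// gpu is an illustrative result type.
type gpu struct {
	Name string
}

// detector is a hypothetical function type: one implementation would
// wrap the Go library, another the CLI.
type detector func() ([]gpu, error)

var errLibraryUnavailable = errors.New("AMD SMI library unavailable")

// detectWithFallback tries each detector in order and returns the first
// successful result. If all fail, the individual errors are joined.
func detectWithFallback(detectors ...detector) ([]gpu, error) {
	if len(detectors) == 0 {
		return nil, errors.New("no detectors configured")
	}
	var errs []error
	for _, d := range detectors {
		gpus, err := d()
		if err == nil {
			return gpus, nil
		}
		errs = append(errs, err)
	}
	return nil, errors.Join(errs...)
}

func main() {
	// Stub detectors for demonstration only.
	viaLibrary := func() ([]gpu, error) { return nil, errLibraryUnavailable }
	viaCLI := func() ([]gpu, error) { return []gpu{{Name: "Example GPU"}}, nil }

	gpus, err := detectWithFallback(viaLibrary, viaCLI)
	if err != nil {
		fmt.Println("detection failed:", err)
		return
	}
	fmt.Println("detected:", gpus)
}
```

Joining the errors keeps the library failure visible even when the CLI also fails, which helps when diagnosing why the preferred path was skipped.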
This migration replaces CLI output parsing in the ROCm GPU detection code with native library calls, matching the pattern already used for NVML in pkg/accelerator/devices/nvml.go.