This is an unofficial interface to the AMD ROCm SMI library for Golang applications. It is heavily
inspired by go-nvml and likewise uses cgo, c-for-go and go-nvml's dlopen wrapper.
This Golang interface is planned to be used in cc-metric-collector.
Disclaimer: These bindings are created without any collaboration with AMD. Use them as you like, but we, the developers of these bindings, are not responsible for any damage or other consequences caused by them. If you want official Golang bindings for the ROCm SMI library, use this package.
package main

import (
	"fmt"
	"log"

	"github.com/ClusterCockpit/go-rocm-smi/pkg/rocm_smi"
)

func main() {
	ret := rocm_smi.Init()
	if ret != rocm_smi.STATUS_SUCCESS {
		log.Fatalf("Unable to initialize ROCM SMI: %v", rocm_smi.StatusStringNoError(ret))
	}
	defer func() {
		ret := rocm_smi.Shutdown()
		if ret != rocm_smi.STATUS_SUCCESS {
			log.Fatalf("Unable to shutdown ROCM SMI: %v", rocm_smi.StatusStringNoError(ret))
		}
	}()

	count, ret := rocm_smi.NumMonitorDevices()
	if ret != rocm_smi.STATUS_SUCCESS {
		log.Fatalf("Unable to get device count: %v", rocm_smi.StatusStringNoError(ret))
	}

	for i := 0; i < count; i++ {
		device, ret := rocm_smi.DeviceGetHandleByIndex(i)
		if ret != rocm_smi.STATUS_SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, rocm_smi.StatusStringNoError(ret))
		}

		uuid, ret := device.GetUniqueId()
		if ret != rocm_smi.STATUS_SUCCESS {
			log.Fatalf("Unable to get uuid of device at index %d: %v", i, rocm_smi.StatusStringNoError(ret))
		}

		fmt.Printf("%v\n", uuid)
	}
}
The librocm_smi64.so library is dynamically loaded by the rocm_smi package. Make sure that the directory containing this library is in your LD_LIBRARY_PATH.
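If Init() fails, it can help to check whether the library is reachable through LD_LIBRARY_PATH at all. The following is a minimal sketch of such a check, not part of the bindings; the helper names candidatePaths and findLibrary are made up for illustration. Note that the dynamic loader also consults other locations (for example its cache and default directories), so a miss here does not necessarily mean that loading fails.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths (hypothetical helper) lists the file paths at which a library
// called name would sit in the directories of an LD_LIBRARY_PATH-style list.
// Empty entries, which the dynamic loader treats as the current directory,
// are skipped here for simplicity.
func candidatePaths(ldLibraryPath, name string) []string {
	var out []string
	for _, dir := range filepath.SplitList(ldLibraryPath) {
		if dir == "" {
			continue
		}
		out = append(out, filepath.Join(dir, name))
	}
	return out
}

// findLibrary (hypothetical helper) returns the first candidate that exists
// as a regular file. os.Stat follows symlinks, so versioned .so links work.
func findLibrary(ldLibraryPath, name string) (string, bool) {
	for _, p := range candidatePaths(ldLibraryPath, name) {
		if info, err := os.Stat(p); err == nil && info.Mode().IsRegular() {
			return p, true
		}
	}
	return "", false
}

func main() {
	path, ok := findLibrary(os.Getenv("LD_LIBRARY_PATH"), "librocm_smi64.so")
	if !ok {
		fmt.Println("librocm_smi64.so not found in LD_LIBRARY_PATH")
		return
	}
	fmt.Println("found", path)
}
```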
See pkg.go.dev.
There are three ROCm SMI headers, all located at rocm_smi/rocm_smi:
- rocm_smi.h
- rocm_smi64Config.h
- kfd_ioctl.h
The files are copied from ROCm 5.1.0. For the generation, the rocm_smi.h header is changed to support c-for-go's parser:
- All occurrences of uint64_t are changed to unsigned long long, otherwise c-for-go wouldn't use Golang's uint64 type.
- All occurrences of int64_t are changed to long long, otherwise c-for-go wouldn't use Golang's int64 type.
- The union id is renamed to union id_rename to avoid problems with clang. The type is never addressed with the name id but with a typedef name.
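The three header changes listed above are plain textual substitutions. The repository performs them in its Makefile; the sketch below only illustrates the same transformation in Go, and the function name rewriteHeader is an assumption made for this example. Word boundaries keep the int64_t rule from touching uint64_t and make the union rename safe to apply twice.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	reUint64  = regexp.MustCompile(`\buint64_t\b`)
	reInt64   = regexp.MustCompile(`\bint64_t\b`)
	reUnionID = regexp.MustCompile(`\bunion id\b`)
)

// rewriteHeader (hypothetical name) applies the three substitutions to the
// text of a C header and returns the modified text.
func rewriteHeader(src string) string {
	src = reUint64.ReplaceAllString(src, "unsigned long long")
	src = reInt64.ReplaceAllString(src, "long long")
	src = reUnionID.ReplaceAllString(src, "union id_rename")
	return src
}

func main() {
	decl := "rsmi_status_t rsmi_dev_unique_id_get(uint32_t dv_ind, uint64_t *id);"
	fmt.Println(rewriteHeader(decl))
}
```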
The bindings are generated by calling c-for-go with rocm_smi.yml as input.
After the generation, the types.go file still contains the C types, but it is preferable to have
Golang types for them. Luckily, cgo has a bootstrapping option, -godefs, to
generate the Go types.
Before:
type RSMI_pcie_bandwidth C.rsmi_pcie_bandwidth_t
After:
type RSMI_pcie_bandwidth struct {
	Rate  RSMI_frequencies
	Lanes [32]uint32
}
In the end, the generated functions are wrapped to give them a more idiomatic Golang style. This is similar to the
wrappers created in go-nvml. Most of them are straightforward
with a little bit of casting.
// rocm_smi.DeviceGetSerial()
func DeviceGetSerial(Device DeviceHandle) (string, RSMI_status) {
	// Buffer that the C function fills with the serial number.
	var Serial []byte = make([]byte, 100)
	sptr := &Serial[0]
	ret := rsmi_dev_serial_number_get(Device.index, sptr, 100)
	return bytes2String(Serial), ret
}

// Method variant that forwards to the package-level function.
func (Device DeviceHandle) DeviceGetSerial() (string, RSMI_status) {
	return DeviceGetSerial(Device)
}
For most libraries which handle multiple devices (go-nvml is an example), the user first requests a handle for each device, usually through the logical index in the list of available devices. The official rocm_smi library uses the logical index directly instead, but in order to get everything right, you have to do quite some work to know what is supported. The rocm_smi library provides a feature (APISupport in rocm_smi.h) to determine which functions are supported for a device and, if a function accepts arguments, which ones are valid for this device. An example would be the function to get the firmware version and the list of GPU parts that provide such a version. The go-rocm-smi bindings introduce a virtual type DeviceHandle, retrievable through the logical index with DeviceGetHandleByIndex() (so similar to go-nvml), which encapsulates the APISupport lookup. The DeviceHandle is used for all device-related calls in go-rocm-smi. You can get the logical index with deviceHandle.Index(), the non-unique ID of a GPU with deviceHandle.ID() and the list of supported functions with deviceHandle.Supported().
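To illustrate the idea of a handle that carries the support information with it, here is a small self-contained sketch. It is not the DeviceHandle implementation from go-rocm-smi: the type handle, its fields, the methods Supports and SupportsVariant and the variant labels are all hypothetical and only model the concept of looking up supported functions and their valid arguments once and querying them later.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for a device handle. It stores the
// logical index and, per supported function name, the argument variants
// that are valid for this device.
type handle struct {
	index     uint32
	supported map[string][]string
}

// Index returns the logical index the handle was created for.
func (h handle) Index() uint32 { return h.index }

// Supports reports whether the named function is available for the device.
func (h handle) Supports(fn string) bool {
	_, ok := h.supported[fn]
	return ok
}

// SupportsVariant reports whether the named function accepts the given
// argument variant on this device.
func (h handle) SupportsVariant(fn, variant string) bool {
	for _, v := range h.supported[fn] {
		if v == variant {
			return true
		}
	}
	return false
}

func main() {
	// Made-up support table; "block-A" is a placeholder variant label.
	h := handle{
		index: 0,
		supported: map[string][]string{
			"rsmi_dev_unique_id_get":        nil,
			"rsmi_dev_firmware_version_get": {"block-A"},
		},
	}
	fmt.Println(h.Supports("rsmi_dev_unique_id_get"))
	fmt.Println(h.SupportsVariant("rsmi_dev_firmware_version_get", "block-B"))
}
```

The design choice sketched here is that the caller checks the handle before issuing a call instead of interpreting a "not supported" status afterwards.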
- One big problem is currently that c-for-go does not generate uint64 types for the C type uint64_t. It is one of the main data types used in the ROCm SMI headers. While I was able to generate the underlying code for uint64_t, the Golang function still uses uint32:
  rsmi_status_t rsmi_dev_unique_id_get(uint32_t dv_ind, uint64_t *id);
  Output:
  func rsmi_dev_unique_id_get(Dv_ind uint32, Id *uint32) RSMI_status {
  	cDv_ind, cDv_indAllocMap := (C.uint32_t)(Dv_ind), cgoAllocsUnknown
  	cId, cIdAllocMap := (*C.uint64_t)(unsafe.Pointer(Id)), cgoAllocsUnknown
  	__ret := C.rsmi_dev_unique_id_get(cDv_ind, cId)
  	runtime.KeepAlive(cIdAllocMap)
  	runtime.KeepAlive(cDv_indAllocMap)
  	__v := (RSMI_status)(__ret)
  	return __v
  }
  One can see that cId is cast to *C.uint64_t, but the Id variable used by the function is *uint32. I was not able to persuade c-for-go to use uint64. See also xlab/c-for-go#120. As a workaround, uint64_t gets replaced by unsigned long long and int64_t gets replaced by long long, see Makefile. Interestingly, the translation of the C types to Golang types with cgo generates uint64 without the type exchange in the header. If we didn't use unsigned long long, the uint32 generated by c-for-go would clash with the uint64 generated by cgo.
- The symbol rsmi_dev_sku_get is defined by the rocm_smi.h header, but on the test system with ROCm 5.1.0 the symbol lookup fails. There is now an updateFunctionPointers() function that is called at Init(). This is quite similar to the function updateVersionedSymbols() in go-nvml. The APISupport feature of the rocm_smi library shows that rsmi_dev_sku_get is supported by the device.
- The function rsmi_status_string cannot use the wrapper generated by c-for-go because it requires a pointer to a char array while c-for-go wants to use the char array directly. There is a manually created version to get the status string, StatusString(). One issue arises when using it in prints (see example) because rsmi_status_string accepts a status and returns a new status and the string. To drop the new status, use StatusStringNoError().
- I haven't found a way to access the Build field in RSMI_version. It is a char* in rocm_smi but c-for-go generates an *int8 entry for it.
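Regarding the last item: in general, a NUL-terminated C string that is exposed to Go as *int8 can be copied into a Go string by walking the bytes. The sketch below shows this with a Go-allocated buffer; the helper name int8PtrToString is made up, it needs Go 1.17 or newer for unsafe.Add, and whether the approach works for the Build field of the generated RSMI_version struct has not been verified.

```go
package main

import (
	"fmt"
	"unsafe"
)

// int8PtrToString (hypothetical helper) copies the NUL-terminated string that
// p points to into a Go string. The caller must guarantee that the memory is
// valid and terminated by a zero byte.
func int8PtrToString(p *int8) string {
	if p == nil {
		return ""
	}
	var out []byte
	for ptr := unsafe.Pointer(p); *(*int8)(ptr) != 0; ptr = unsafe.Add(ptr, 1) {
		out = append(out, byte(*(*int8)(ptr)))
	}
	return string(out)
}

func main() {
	// Stand-in for a C string; a real char* would come from the library.
	buf := []int8{'5', '.', '1', '.', '0', 0}
	fmt.Println(int8PtrToString(&buf[0]))
}
```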