On the Python side, we have GPUUtils: https://pypi.org/project/gpuutils/
On the C/C++ side, we have NVML's API: https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html
Either route may be challenging: we'll need to detect the number of GPUs per node (hwloc doesn't provide this easily) and then obtain the relevant per-device utilization number for each one.
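For what it's worth, NVML itself can report the device count (nvmlDeviceGetCount), which may sidestep the hwloc gap. Below is a rough sketch of both steps in Python. It assumes the pynvml bindings to NVML, which are not one of the packages linked above, and the function name and the `nvml` parameter are hypothetical, there only so the logic can be exercised on a machine without a GPU.

```python
from typing import Any, List, Optional, Tuple


def gpu_utilization(nvml: Optional[Any] = None) -> List[Tuple[int, int]]:
    """Return (device_index, gpu_utilization_percent) for each GPU on this node.

    `nvml` is a hypothetical injection point: any object exposing the same
    NVML function names. When omitted, the pynvml bindings are used.
    """
    if nvml is None:
        # Assumption: pynvml is installed; it is not mentioned in the note above.
        import pynvml as nvml

    nvml.nvmlInit()
    try:
        # Step 1: number of GPUs NVML can see on this node.
        count = nvml.nvmlDeviceGetCount()

        # Step 2: per-device utilization, as reported by NVML.
        result = []
        for index in range(count):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            rates = nvml.nvmlDeviceGetUtilizationRates(handle)
            result.append((index, int(rates.gpu)))  # .gpu is a percentage
        return result
    finally:
        # Always release NVML, even if a query fails.
        nvml.nvmlShutdown()
```

Calling gpu_utilization() with no arguments on a node with an NVIDIA driver and pynvml installed should give one (index, percent) pair per visible GPU. The C/C++ version would follow the same sequence of NVML calls.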