Commit 5619704

GPUM: changes and additions in gpu metrics (#21148)
* added new gpu memory metrics
* updated memory.usage and core.usage metrics names
1 parent f624b89 commit 5619704

2 files changed, +11 -2 lines changed

gpu/CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,12 @@
 # CHANGELOG - GPU
 
+## 0.4.1
+
+***Added***:
+
+* Added GPU device level memory metrics: `gpu.memory.free`, `gpu.memory.reserved`.
+* Renamed gpu.core.usage to gpu.process.core.usage for naming consistency.
+* Renamed gpu.memory.usage to gpu.process.memory.usage for naming consistency.
+
 ## 0.4.0
 
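Because the 0.4.1 entry renames two metrics, anything that references the old names (saved queries, alert definitions) would need updating. The following is a minimal sketch of such a migration, not part of the commit. The mapping comes straight from the changelog entry; the helper name `migrate_query` and the example query string are hypothetical.

```python
import re

# Old -> new metric names, taken from the 0.4.1 changelog entry.
RENAMED_METRICS = {
    "gpu.core.usage": "gpu.process.core.usage",
    "gpu.memory.usage": "gpu.process.memory.usage",
}

# Match an old name only when it stands alone, i.e. it is not a fragment
# of a longer dotted metric name such as "gpu.memory.usage.other".
_PATTERN = re.compile(
    r"(?<![\w.])("
    + "|".join(re.escape(name) for name in RENAMED_METRICS)
    + r")(?![\w.])"
)


def migrate_query(query: str) -> str:
    """Return `query` with renamed GPU metrics replaced by their new names.

    Hypothetical helper for illustration only.
    """
    return _PATTERN.sub(lambda m: RENAMED_METRICS[m.group(1)], query)


if __name__ == "__main__":
    # Hypothetical query string; the syntax is an assumption.
    print(migrate_query("avg:gpu.core.usage{host:a}"))
```

Queries that already use the new names pass through unchanged, so the helper can be run more than once on the same input.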

gpu/metadata.csv

Lines changed: 4 additions & 2 deletions
@@ -15,7 +15,6 @@ gpu.clock.throttle_reasons.sw_power_cap,gauge,,,,GPU clocks that are throttled t
 gpu.clock.throttle_reasons.sw_thermal_slowdown,gauge,,,,GPU clocks that are throttled to avoid exceeding temperaturelimits,0,gpu,clock.throttle_reasons.sw_thermal_slowdown,,
 gpu.clock.throttle_reasons.sync_boost,gauge,,,,GPU clocks that are throttled to match clock speed of another GPU in the current sync boost group,0,gpu,clock.throttle_reasons.sync_boost,,
 gpu.core.limit,gauge,,core,,Number of GPU cores that the process/container/device has available,0,gpu,core.limit,,
-gpu.core.usage,gauge,,core,,Average number of GPU cores that a process was using in the interval. Only emitted when processes are active.,0,gpu,core.usage,,
 gpu.decoder_utilization,gauge,,percent,,Percentage of time the decoder was active,0,gpu,decoder_utilization,,
 gpu.device.total,gauge,,,,Number of devices active in the host,0,gpu,device.total,,
 gpu.dram_active,gauge,,percent,,Percentage of time the DRAM was active,0,gpu,dram_active,,
@@ -29,9 +28,10 @@ gpu.integer_active,gauge,,percent,,Percentage of the time that the integer calcu
 gpu.memory.bar1.free,gauge,,byte,,Unallocated BAR1 memory (in bytes),0,gpu,memory.bar1.free,,
 gpu.memory.bar1.total,gauge,,byte,,Total BAR1 memory (in bytes).,0,gpu,memory.bar1.total,,
 gpu.memory.bar1.used,gauge,,byte,,Allocated used memory (in bytes),0,gpu,memory.bar1.used,,
+gpu.memory.free,gauge,,byte,,Unallocated device memory (in bytes).,0,gpu,memory.free,,
 gpu.memory.limit,gauge,,byte,,The maximum amount of memory a process/container/device could allocate,0,gpu,memory.limit,,
+gpu.memory.reserved,gauge,,byte,,Device memory (in bytes) reserved for system use (driver or firmware)..,0,gpu,memory.reserved,,
 gpu.memory.temperature,gauge,,degree celsius,,Temperature of the memory chip,0,gpu,memory.temperature,,
-gpu.memory.usage,gauge,,byte,,The memory used by this process at the point the metric was given. Only emitted when processes are active.,0,gpu,memory.usage,,
 gpu.nvlink.count.active,gauge,,,,Number of active nvlinks for the device,0,gpu,,,
 gpu.nvlink.count.inactive,gauge,,,,Number of inactive nvlinks for the device,0,gpu,,,
 gpu.nvlink.count.total,gauge,,,,Number of total nvlinks for the device,0,gpu,,,
@@ -52,9 +52,11 @@ gpu.pci.throughput.tx,gauge,,byte,second,Bytes transmitted through PCI to the GP
 gpu.performance_state,gauge,,,,Returns the current performance state of the device,0,gpu,performance_state,,
 gpu.power.management_limit,gauge,,milliwatt,,Upper boundary for the device power draw.,0,gpu,power.management_limit,,
 gpu.power.usage,gauge,,milliwatt,,"Power usage for the GPU device. On GA100 and older architectures this is the instantaneous power at that moment, in newer ones it represents the average power draw over one second",0,gpu,power.usage,,
+gpu.process.core.usage,gauge,,core,,Average number of GPU cores that a process was using in the interval. Only emitted when processes are active.,0,gpu,process.core.usage,,
 gpu.process.decoder_utilization,gauge,,percent,,Percentage of time the decoder was active for a specific process,0,gpu,process.decoder_utilization,,
 gpu.process.dram_active,gauge,,percent,,Percentage of time the DRAM was active for a specific process,0,gpu,process.dram_active,,
 gpu.process.encoder_utilization,gauge,,percent,,Percentage of time the encoder was active for a specific process,0,gpu,process.encoder_utilization,,
+gpu.process.memory.usage,gauge,,byte,,The memory used by this process at the point the metric was given. Only emitted when processes are active.,0,gpu,process.memory.usage,,
 gpu.process.sm_active,gauge,,percent,,Percentage of time the streaming multiprocessor was active for a specific process,0,gpu,process.sm_active,,
 gpu.remapped_rows.correctable,count,,,,Number of rows remapped due to correctable errors,0,gpu,remapped_rows.correctable,,
 gpu.remapped_rows.failed,count,,,,Number of rows that failed remapping,0,gpu,remapped_rows.failed,,
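The four rows added to `gpu/metadata.csv` can be sanity-checked with a short script. This is a sketch under stated assumptions: the header row is not visible in the diff, so the column names in `COLUMNS` are assumed, and the check that `short_name` equals the metric name without its `gpu.` prefix is only a pattern observed in the added rows, not a documented rule. The helper names are hypothetical.

```python
import csv
import io

# ASSUMPTION: column names and order; the CSV header is not shown in the diff.
COLUMNS = [
    "metric_name", "metric_type", "interval", "unit_name", "per_unit_name",
    "description", "orientation", "integration", "short_name",
    "curated_metric", "sample_tags",
]

# The four rows added by this commit, copied verbatim from the diff.
ADDED_ROWS = """\
gpu.memory.free,gauge,,byte,,Unallocated device memory (in bytes).,0,gpu,memory.free,,
gpu.memory.reserved,gauge,,byte,,Device memory (in bytes) reserved for system use (driver or firmware)..,0,gpu,memory.reserved,,
gpu.process.core.usage,gauge,,core,,Average number of GPU cores that a process was using in the interval. Only emitted when processes are active.,0,gpu,process.core.usage,,
gpu.process.memory.usage,gauge,,byte,,The memory used by this process at the point the metric was given. Only emitted when processes are active.,0,gpu,process.memory.usage,,
"""


def parse_rows(text: str) -> list:
    """Parse header-less metadata rows into dicts keyed by COLUMNS."""
    return list(csv.DictReader(io.StringIO(text), fieldnames=COLUMNS))


def mismatched_short_names(rows: list, prefix: str = "gpu.") -> list:
    """Return metric names whose short_name is not the name minus `prefix`."""
    bad = []
    for row in rows:
        name = row["metric_name"]
        expected = name[len(prefix):] if name.startswith(prefix) else name
        if row["short_name"] != expected:
            bad.append(name)
    return bad


if __name__ == "__main__":
    rows = parse_rows(ADDED_ROWS)
    for row in rows:
        print(row["metric_name"], row["metric_type"], row["unit_name"])
    print("mismatched short names:", mismatched_short_names(rows))
```

The same check would flag the unchanged `gpu.nvlink.count.*` rows, whose `short_name` field is empty in the diff context.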

0 commit comments