# TUI use only
# NOTE: This is used as a TUI-only yaml file for the beta release of the new performance metric organization
Panel Config:
  id : 3300
  title : Compute Throughput
  metrics_description :
    VALU FLOPs : ' The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs : ' The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (F8) : ' The total number of 8-bit floating point MFMA operations
      executed per second. Note: this does not include any 8-bit floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical F8 MFMA operations achievable on the specific accelerator.
      It is supported on AMD Instinct MI300 series and later only.'
    MFMA FLOPs (BF16) : ' The total number of 16-bit brain floating point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16) : ' The total number of 16-bit floating point MFMA operations executed
      per second. Note: this does not include any 16-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32) : ' The total number of 32-bit floating point MFMA operations executed
      per second. Note: this does not include any 32-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64) : ' The total number of 64-bit floating point MFMA operations executed
      per second. Note: this does not include any 64-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (Int8) : ' The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    SALU Utilization : Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
    VALU Utilization : Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    MFMA Utilization : Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA was busy over the total CU cycles.
    VMEM Utilization : Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization : Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    IPC : The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      IPC achievable on the specific accelerator.
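# The descriptions above map to the counter expressions in the table below. As a
# minimal sketch of the VALU FLOPs formula (assuming the SQ_INSTS_VALU_* counters
# count wavefront-level instructions with a wave size of 64, an FMA counts as two
# ops per lane, and the timestamps are in nanoseconds so ops/ns lands directly in
# GFLOP/s — all assumptions, not guarantees of the tool's internals):

```python
WAVE_SIZE = 64  # lanes per wavefront; each VALU instruction does up to 64 ops

def valu_gflops(counters, start_ts_ns, end_ts_ns):
    """VALU floating-point rate in GFLOP/s over the kernel window.

    Mirrors the 'value' expression in the table below: per precision,
    adds + muls + transcendentals count once per lane, FMAs count twice.
    Counter names are taken from this config; values are hypothetical.
    """
    total_ops = 0
    for prec in ("F16", "F32", "F64"):
        adds = counters.get(f"SQ_INSTS_VALU_ADD_{prec}", 0)
        muls = counters.get(f"SQ_INSTS_VALU_MUL_{prec}", 0)
        trans = counters.get(f"SQ_INSTS_VALU_TRANS_{prec}", 0)
        fmas = counters.get(f"SQ_INSTS_VALU_FMA_{prec}", 0)
        total_ops += WAVE_SIZE * (adds + muls + trans + 2 * fmas)
    # ops / ns == Gops / s under the nanosecond-timestamp assumption
    return total_ops / (end_ts_ns - start_ts_ns)
```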
  data source :
  - metric_table :
      id : 3301
      title : Compute Throughput
      header :
        metric : Metric
        value : Avg
        unit : Unit
        peak : Peak
        pop : Pct of Peak
      metric :
        VALU FLOPs :
          value : AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
          unit : GFLOP/s
          peak : (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop : ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
            + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
            / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        VALU IOPs :
          value : AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))
          unit : GIOP/s
          peak : (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop : ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        MFMA FLOPs (F8) :
          value : AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))
          unit : GFLOP/s
          peak : ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
          pop : ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
        MFMA FLOPs (BF16) :
          value : AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit : GFLOP/s
          peak : ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
          pop : ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
        MFMA FLOPs (F16) :
          value : AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit : GFLOP/s
          peak : ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
          pop : ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
        MFMA FLOPs (F32) :
          value : AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
          unit : GFLOP/s
          peak : ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop : ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA FLOPs (F64) :
          value : AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
          unit : GFLOP/s
          peak : ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop : ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA IOPs (Int8) :
          value : AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
          unit : GIOP/s
          peak : ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
          pop : ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
        SALU Utilization :
          value : AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit : pct
          peak : 100
          pop : AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        VALU Utilization :
          value : AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit : pct
          peak : 100
          pop : AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        MFMA Utilization :
          value : AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu) * 4)))
          unit : pct
          peak : 100
          pop : AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu) * 4)))
        VMEM Utilization :
          value : AVG((((100 * (SQ_ACTIVE_INST_FLAT + SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
          unit : pct
          peak : 100
          pop : AVG((((100 * (SQ_ACTIVE_INST_FLAT + SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
        Branch Utilization :
          value : AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit : pct
          peak : 100
          pop : AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
        IPC :
          value : AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          unit : Instr/cycle
          peak : 5
          pop : ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
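# To read the peak/pop columns above: peak VALU throughput is the max engine
# clock times the CU count times 64 lanes times 2 ops per FMA, scaled by 1000
# into GFLOP/s, and pop is the measured average as a percentage of that peak.
# A minimal sketch, assuming $max_sclk is in MHz and $cu_per_gpu is the CU
# count (the example numbers below are hypothetical, not any specific GPU):

```python
def peak_valu_gflops(max_sclk_mhz, cu_per_gpu):
    # (((max_sclk * cu_per_gpu) * 64) * 2) / 1000 from the table above;
    # MHz * ops/cycle divided by 1000 comes out in GFLOP/s.
    return max_sclk_mhz * cu_per_gpu * 64 * 2 / 1000

def pct_of_peak(avg_value, peak_value):
    # the 'pop' column: measured average as a percentage of theoretical peak
    return 100 * avg_value / peak_value
```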