Commit 8e2a597

[AMD] Added a script which computes TFLOP/s using perf. counters

1 parent 48455d1 commit 8e2a597

2 files changed: +391 -0 lines changed

Lines changed: 135 additions & 0 deletions
### Install Latest rocprofv3 from sources

Compile the latest rocprofv3 from source (use the `amd-staging` branch):

```bash
cd ~
mkdir -p ~/usr/rocprofv3
INSTALL_DIR=$(realpath ~/usr/rocprofv3)
git clone https://github.com/rocm/rocprofiler-sdk
cd rocprofiler-sdk
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} -DROCPROFILER_BUILD_TESTS=ON -DROCPROFILER_BUILD_SAMPLES=ON
make -j
make install
```

Set the corresponding environment variables:

```bash
$ cat ~/load.rocprofv3.sh
#!/bin/bash

INSTALL_DIR=$(realpath ~/usr/rocprofv3)

export PATH=${INSTALL_DIR}/bin:${PATH}
export LD_LIBRARY_PATH=${INSTALL_DIR}/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=${INSTALL_DIR}/lib:${LIBRARY_PATH}

$ source ~/load.rocprofv3.sh
```

### Adjust Triton Source Code

The `flash-attention.py` kernel comes with auto-tuning. In this example, we want to measure the performance of the best-performing FA configuration. Run the kernel with the auto-tuner enabled:

```bash
$ TRITON_PRINT_AUTOTUNING=1 python3 ./flash-attention.py -b 2 -hq 16 -hk 16 -sq 8192 -sk 8192 -d 128 -causal -layout thd
Autotuning kernel attn_fwd with config BLOCK_M: 128, BLOCK_N: 128, waves_per_eu: 2, PRE_LOAD_V: False, GRID_CU_MULTIP: 2, instruction_sched_variant: none, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel attn_fwd with config BLOCK_M: 128, BLOCK_N: 64, waves_per_eu: 2, PRE_LOAD_V: False, GRID_CU_MULTIP: 2, instruction_sched_variant: none, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel attn_fwd with config BLOCK_M: 128, BLOCK_N: 64, waves_per_eu: 3, PRE_LOAD_V: False, GRID_CU_MULTIP: 2, instruction_sched_variant: none, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel attn_fwd with config BLOCK_M: 128, BLOCK_N: 64, waves_per_eu: 1, PRE_LOAD_V: False, GRID_CU_MULTIP: 2, instruction_sched_variant: none, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel attn_fwd with config BLOCK_M: 128, BLOCK_N: 32, waves_per_eu: 2, PRE_LOAD_V: False, GRID_CU_MULTIP: 2, instruction_sched_variant: none, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function attn_fwd finished after 15.06s; best config selected: BLOCK_M: 128, BLOCK_N: 64, waves_per_eu: 2, PRE_LOAD_V: False, GRID_CU_MULTIP: 2, instruction_sched_variant: none, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
fused-attention-fwd-d128-layoutthd:
   BATCH    HQ    HK  N_CTX_Q  N_CTX_K      triton      torch
0    2.0  16.0  16.0   8192.0   8192.0  221.869662  17.140226
```

Open the script and find the function that sets the tuning parameters (i.e., `get_cdna_autotune_configs`). It returns a list of candidate configs to the tuner. Comment out everything except the winning config found in the previous step. For example:

```python
def get_cdna_autotune_configs():
    return [
        #triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'waves_per_eu': 2, 'PRE_LOAD_V': False, 'GRID_CU_MULTIP': 2},
        #              num_stages=1, num_warps=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64, 'waves_per_eu': 2, 'PRE_LOAD_V': False, 'GRID_CU_MULTIP': 2},
                      num_stages=1, num_warps=4),
        #triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64, 'waves_per_eu': 3, 'PRE_LOAD_V': False, 'GRID_CU_MULTIP': 2},
        #              num_stages=1, num_warps=4),
        #triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64, 'waves_per_eu': 1, 'PRE_LOAD_V': False, 'GRID_CU_MULTIP': 2},
        #              num_stages=1, num_warps=4),
        #triton.Config({'BLOCK_M': 128, 'BLOCK_N': 32, 'waves_per_eu': 2, 'PRE_LOAD_V': False, 'GRID_CU_MULTIP': 2},
        #              num_stages=1, num_warps=4),
    ], ['IS_CAUSAL', 'dropout_p', 'MAX_SEQLENS_Q', 'MAX_SEQLENS_K', 'ACTUAL_BLOCK_DMODEL', 'VARLEN', 'HQ', 'HK']
```

### Collect Performance Data

Make a symlink to `rocm-triton-prof.py` in the directory where you perform the test. For example:

```bash
ln -s <rocm-triton-dir>/python/perf-kernels/tools/rocm-triton-prof/rocm-triton-prof.py rocm-triton-prof.py
```

Run the tool as follows:

```bash
$ python3 ./rocm-triton-prof.py --kernel attn_fwd --cmd python3 ./flash-attention.py -b 2 -hq 16 -hk 16 -sq 8192 -sk 8192 -d 128 -causal -layout thd
Timing info in `nsec`:
count       269.000000
mean     326119.100372
std        7120.765559
min      304946.000000
25%      322147.000000
50%      327960.000000
75%      331047.000000
max      352857.000000
dtype: float64

NON-FLOP related data:
   Counter Name        Max        Min          Mean     Median
0    GRBM_COUNT  8955952.0  4043501.0  4.284156e+06  4261916.0
1   TCC_HIT_sum  5347185.0  4074880.0  4.112117e+06  4107955.0
2  TCC_MISS_sum  5932281.0  3526537.0  3.572396e+06  3556786.5

FLOP related data:
                    Counter Name     Raw Data          FLOP  Relative FLOP, %
0          SQ_INSTS_VALU_ADD_F16          0.0  0.000000e+00          0.000000
1          SQ_INSTS_VALU_MUL_F16          0.0  0.000000e+00          0.000000
2          SQ_INSTS_VALU_FMA_F16     192512.0  2.464154e+07          0.030844
3        SQ_INSTS_VALU_TRANS_F16          0.0  0.000000e+00          0.000000
4          SQ_INSTS_VALU_ADD_F32    4898176.0  3.134833e+08          0.392393
5          SQ_INSTS_VALU_MUL_F32    2411456.0  1.543332e+08          0.193182
6          SQ_INSTS_VALU_FMA_F32    2486720.0  3.183002e+08          0.398422
7        SQ_INSTS_VALU_TRANS_F32    2489728.0  1.593426e+08          0.199452
8          SQ_INSTS_VALU_ADD_F64          0.0  0.000000e+00          0.000000
9          SQ_INSTS_VALU_MUL_F64          0.0  0.000000e+00          0.000000
10         SQ_INSTS_VALU_FMA_F64          0.0  0.000000e+00          0.000000
11       SQ_INSTS_VALU_TRANS_F64          0.0  0.000000e+00          0.000000
12   SQ_INSTS_VALU_MFMA_MOPS_F16  154140672.0  7.892002e+10         98.785706
13  SQ_INSTS_VALU_MFMA_MOPS_BF16          0.0  0.000000e+00          0.000000
14   SQ_INSTS_VALU_MFMA_MOPS_F32          0.0  0.000000e+00          0.000000
15   SQ_INSTS_VALU_MFMA_MOPS_F64          0.0  0.000000e+00          0.000000

Performance info in TFLOP/s:
count    269.000000
mean     245.090089
std        5.420713
min      226.409352
25%      241.325627
50%      243.597161
75%      247.992764
max      261.981219
dtype: float64
```

### Known limits

The tool currently supports only FP64, FP32, and FP16 operations.
Note that it can be extended to support other data types.
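
Such an extension amounts to adding another counter group (pass) with the right per-instruction FLOP factors. A hedged sketch — the BF16 VALU counter names below are hypothetical and must be checked against the counters your GPU actually exposes before use:

```python
from collections import OrderedDict

# Hypothetical extra pass for BF16 VALU instructions; validate these counter
# names and factors against your GPU before relying on them.
pmc_bf16 = OrderedDict()
pmc_bf16['SQ_INSTS_VALU_ADD_BF16'] = {'value': 0, 'factor': 64, 'flop': 0}
pmc_bf16['SQ_INSTS_VALU_MUL_BF16'] = {'value': 0, 'factor': 64, 'flop': 0}
pmc_bf16['SQ_INSTS_VALU_FMA_BF16'] = {'value': 0, 'factor': 128, 'flop': 0}  # FMA = 2 FLOP

# In get_perf_metrics() this would become one more entry in `jobs`, and
# num_passes in process_files() would grow from 5 to 6.
num_passes = 5 + 1
```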
Lines changed: 256 additions & 0 deletions
#!/usr/bin/python3

import argparse
import os
import pandas as pd
import yaml
import subprocess
import shutil
import re
from collections import OrderedDict


def get_perf_metrics():
    pmc0 = OrderedDict()
    pmc0['SQ_INSTS_VALU_ADD_F16'] = {'value': 0, 'factor': 64, 'flop': 0}
    pmc0['SQ_INSTS_VALU_MUL_F16'] = {'value': 0, 'factor': 64, 'flop': 0}
    pmc0['SQ_INSTS_VALU_FMA_F16'] = {'value': 0, 'factor': 128, 'flop': 0}
    pmc0['SQ_INSTS_VALU_TRANS_F16'] = {'value': 0, 'factor': 64, 'flop': 0}

    pmc1 = OrderedDict()
    pmc1['SQ_INSTS_VALU_ADD_F32'] = {'value': 0, 'factor': 64, 'flop': 0}
    pmc1['SQ_INSTS_VALU_MUL_F32'] = {'value': 0, 'factor': 64, 'flop': 0}
    pmc1['SQ_INSTS_VALU_FMA_F32'] = {'value': 0, 'factor': 128, 'flop': 0}
    pmc1['SQ_INSTS_VALU_TRANS_F32'] = {'value': 0, 'factor': 64, 'flop': 0}

    pmc2 = OrderedDict()
    pmc2['SQ_INSTS_VALU_ADD_F64'] = {'value': 0, 'factor': 64, 'flop': 0}
    pmc2['SQ_INSTS_VALU_MUL_F64'] = {'value': 0, 'factor': 64, 'flop': 0}
    pmc2['SQ_INSTS_VALU_FMA_F64'] = {'value': 0, 'factor': 128, 'flop': 0}
    pmc2['SQ_INSTS_VALU_TRANS_F64'] = {'value': 0, 'factor': 64, 'flop': 0}

    pmc3 = OrderedDict()
    pmc3['SQ_INSTS_VALU_MFMA_MOPS_F16'] = {'value': 0, 'factor': 512, 'flop': 0}
    pmc3['SQ_INSTS_VALU_MFMA_MOPS_BF16'] = {'value': 0, 'factor': 512, 'flop': 0}
    pmc3['SQ_INSTS_VALU_MFMA_MOPS_F32'] = {'value': 0, 'factor': 512, 'flop': 0}
    pmc3['SQ_INSTS_VALU_MFMA_MOPS_F64'] = {'value': 0, 'factor': 512, 'flop': 0}

    pmc4 = OrderedDict()
    pmc4['GRBM_COUNT'] = {'value': 0}
    pmc4['TCC_HIT_sum'] = {'value': 0}
    pmc4['TCC_MISS_sum'] = {'value': 0}

    jobs = OrderedDict()
    jobs[0] = pmc0
    jobs[1] = pmc1
    jobs[2] = pmc2
    jobs[3] = pmc3
    jobs[4] = pmc4
    return jobs


def get_metrics_as_yaml():
    perf_metrics = get_perf_metrics()
    pmcs = [
        {'pmc': list(perf_metrics[0].keys())},
        {'pmc': list(perf_metrics[1].keys())},
        {'pmc': list(perf_metrics[2].keys())},
        {'pmc': list(perf_metrics[3].keys())},
        {'pmc': list(perf_metrics[4].keys())},
    ]

    spec = {}
    spec['jobs'] = pmcs

    spec_str = yaml.dump(spec)
    return spec_str


def run_external_binary(binary_path, arguments=[], verbose=False):
    try:
        # Run the external binary and capture its standard output
        cmd = [binary_path] + arguments if binary_path else arguments
        if verbose:
            print(f"CURR.CMD: {' '.join(cmd)}")
        result = subprocess.run(cmd, capture_output=True, text=True)

        # Check if the process was successful
        if result.returncode == 0:
            if result.stderr:
                print(result.stderr.strip())
            return result.stdout.strip()
        else:
            cmd = ' '.join(cmd)
            raise RuntimeError(f'Error: The external binary returned non-zero exit code {result.returncode}. '
                               f'Attempted command:\n{cmd}')
    except FileNotFoundError:
        raise RuntimeError(f'Error: The binary could not be found - i.e., {binary_path}')
    except Exception as err:
        raise RuntimeError(f'Error: {str(err)}')


def check_rocprofv3():
    run_external_binary('which', ['rocprofv3'])


def find_file(rootdir, regex):
    for root, _, files in os.walk(rootdir):
        for file in files:
            if regex.match(file):
                return os.path.join(root, file)


# Note: intentionally named `filter`; it shadows the built-in within this module.
def filter(df, name):
    return df[df['Kernel_Name'] == name]


def process_files(metrics_dir, timing_dir, kernel_name, verbose):
    timing_file = find_file(timing_dir, re.compile(r'.*kernel_trace.csv'))
    df = pd.read_csv(timing_file)
    df = filter(df, kernel_name)
    timing = df['End_Timestamp'] - df['Start_Timestamp']
    print('Timing info in `nsec`:')
    print(timing.describe())
    print()

    # post-process all passes
    num_flop_sum = 0
    perf_metrics = get_perf_metrics()
    num_passes = 5
    metrics_file_regex = re.compile(r'.*counter_collection.csv')
    for pass_id in range(1, num_passes + 1):
        search_dir = os.path.join(metrics_dir, f'pass_{pass_id}')
        metrics_file = find_file(search_dir, metrics_file_regex)
        df = pd.read_csv(metrics_file)
        df = filter(df, kernel_name)

        curr_metrics = perf_metrics[pass_id - 1]
        curr_metrics_names = list(curr_metrics.keys())
        for name in curr_metrics_names:
            data = df[df['Counter_Name'] == name]
            value = data['Counter_Value'].mean()

            if 'flop' in curr_metrics[name].keys():
                curr_metrics[name]['value'] = value
                num_flops = value * curr_metrics[name]['factor']

                num_flop_sum += num_flops
                curr_metrics[name]['flop'] = num_flops
            else:
                curr_metrics[name]['value'] = value
    print()

    # Print data from non-flop passes
    print('NON-FLOP related data:')
    table = {'Counter Name': [], 'Max': [], 'Min': [], 'Mean': [], 'Median': []}
    non_flop_passes = [5]
    for pass_id in non_flop_passes:
        search_dir = os.path.join(metrics_dir, f'pass_{pass_id}')
        metrics_file = find_file(search_dir, metrics_file_regex)
        df = pd.read_csv(metrics_file)
        df = filter(df, kernel_name)

        curr_metrics = perf_metrics[pass_id - 1]
        curr_metrics_names = list(curr_metrics.keys())
        for name in curr_metrics_names:
            data = df[df['Counter_Name'] == name]
            values = data['Counter_Value']
            table['Counter Name'].append(name)
            table['Max'].append(values.max())
            table['Min'].append(values.min())
            table['Mean'].append(values.mean())
            table['Median'].append(values.median())
    print(pd.DataFrame(table))
    print()

    # Print data from flop passes
    print('FLOP related data:')
    table = {'Counter Name': [], 'Raw Data': [], 'FLOP': [], 'Relative FLOP, %': []}
    flop_passes = [1, 2, 3, 4]
    for pass_id in flop_passes:
        search_dir = os.path.join(metrics_dir, f'pass_{pass_id}')
        metrics_file = find_file(search_dir, metrics_file_regex)
        df = pd.read_csv(metrics_file)
        df = filter(df, kernel_name)

        curr_metrics = perf_metrics[pass_id - 1]
        curr_metrics_names = list(curr_metrics.keys())
        for name in curr_metrics_names:
            data = df[df['Counter_Name'] == name]
            value = data['Counter_Value'].mean()

            num_flops = curr_metrics[name]['flop']
            relative_value = 100 * num_flops / num_flop_sum
            table['Counter Name'].append(name)
            table['Raw Data'].append(value)
            table['FLOP'].append(num_flops)
            table['Relative FLOP, %'].append(relative_value)
    print(pd.DataFrame(table))
    print()

    print('Performance info in TFLOP/s:')
    # timing is in nsec, so FLOP / (nsec * 1e3) yields TFLOP/s
    performance = num_flop_sum / (timing * 1000)
    print(performance.describe())
    print()


def main(args):
    check_rocprofv3()

    curr_dir = os.path.dirname(os.path.abspath(__file__))
    metrics_spec = get_metrics_as_yaml()
    metrics_spec_path = os.path.join(curr_dir, "metrics_spec.yaml")
    with open(metrics_spec_path, 'w') as file:
        file.write(metrics_spec)

    metrics_dir = os.path.join(curr_dir, "metrics_dir")
    timing_dir = os.path.join(curr_dir, "timing_dir")

    if not args.display_only:
        # test the original command
        if args.verbose:
            print('running original program...')
        user_cmd = args.cmd
        output = run_external_binary([], user_cmd, args.verbose)

        if args.verbose:
            print(output)

        # collect performance metrics
        if args.verbose:
            print('running rocprofv3 passes...')

        if os.path.exists(metrics_dir):
            shutil.rmtree(metrics_dir)

        rocprof_cmd = ['rocprofv3', '-i', metrics_spec_path, '-d', metrics_dir, '--']
        output = run_external_binary([], rocprof_cmd + user_cmd, args.verbose)

        if args.verbose:
            print(output)

        # collect timing
        if args.verbose:
            print('running rocprofv3 for timing info...')

        if os.path.exists(timing_dir):
            shutil.rmtree(timing_dir)

        rocprof_cmd = ['rocprofv3', '--kernel-trace', '-d', timing_dir, '--']
        output = run_external_binary([], rocprof_cmd + user_cmd, args.verbose)

        if args.verbose:
            print(output)

    process_files(metrics_dir, timing_dir, args.kernel, args.verbose)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", "--kernel", type=str, required=True, help="name of a kernel")
    parser.add_argument('-c', '--cmd', required=True, nargs=argparse.REMAINDER, help='user command')
    parser.add_argument("--display-only", action='store_true', help='display info without running')
    parser.add_argument("-v", "--verbose", action='store_true', help='verbose output')
    args = parser.parse_args()

    main(args)
