Commit b215000

[rocprof-compute] Generalize Roofline
* per kernel analysis Roofline
* added per-kernel eval_metric calculation with display
* fixed typo
* updated tty.py show_all()
* formatting
* fixed ctest failures and updated equations
* formatting
* updated metric descriptions
* review tweaks
* update docs
* added roofline gui analysis
* updated GUI docs
* updated print statement
* comment tweaks and ran ruff formatting

[rocm-systems] ROCm/rocm-systems#325 (commit 5840940)
1 parent 6dd2b1d commit b215000

26 files changed: +2612 -158 lines changed

docs/how-to/analyze/cli.rst

Lines changed: 66 additions & 2 deletions
@@ -19,6 +19,9 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
 * :ref:`Filtering <cli-analysis-options>`: Hone in on a particular kernel,
 GPU ID, or dispatch ID via post-process filtering.
 
+* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
+intensity and performance analysis for individual kernels.
+
 Run ``rocprof-compute analyze -h`` for more details.
 
 .. _cli-walkthrough:
@@ -32,7 +35,7 @@ There are three high-level GPU analysis views:
 
 * System Speed-of-Light: Key GPU performance metrics to show overall GPU performance and utilization.
 * Memory chart: Shows memory transactions and throughput on each cache hierarchical level.
-* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM).
+* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM). When combined with kernel filtering, provides detailed per-kernel arithmetic intensity analysis and performance breakdowns.
 
 **System Speed-of-Light:**
 
@@ -67,7 +70,7 @@ There are three high-level GPU analysis views:
 .. note::
 * Visualized memory chart and Roofline chart are only supported in single run analysis. In multiple runs comparison mode, both are switched back to basic table view.
 * Visualized memory chart requires the width of the terminal output to be greater than or equal to 234 to display the whole chart properly.
-* Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect.
+* Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect. Roofline analysis provides detailed, structured table output with measured empirical peak values for comparison.
 
 .. _cli-list-metrics:
 
@@ -309,6 +312,67 @@ Filter kernels
 You should see your filtered kernels indicated by an asterisk in the **Top
 Stats** table.
 
+.. _per-kernel-roofline:
+
+Per-kernel roofline analysis
+When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
+
+.. code-block:: shell-session
+$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
+This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
+
+.. code-block:: text
+================================================================================
+4. Roofline
+================================================================================
+(4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points
+--------------------------------------------------------------------------------
+Kernel 0: vecCopy(double*, double*, double*, int, int) (100.0%)
+|
+├─ 4.1 Roofline Rate Metrics:
+| ╒═════════════╤════════════════════╤═══════════════════╤═════════╤════════════════════╕
+| │ Metric_ID │ Metric │ Value │ Unit │ Peak (Empirical) │
+| ╞═════════════╪════════════════════╪═══════════════════╪═════════╪════════════════════╡
+| │ 4.1.0 │ VALU FLOPs │ │ Gflop/s │ 61286.40 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.1 │ MFMA FLOPs (F64) │ │ Gflop/s │ 108544.33 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.2 │ MFMA FLOPs (F32) │ │ Gflop/s │ 104531.42 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.3 │ MFMA FLOPs (F16) │ │ Gflop/s │ 709169.38 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.4 │ MFMA FLOPs (BF16) │ 0.0 │ Gflop/s │ 388161.09 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.5 │ MFMA FLOPs (F8) │ 0.0 │ Gflop/s │ 1446089.60 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.6 │ MFMA IOPs (Int8) │ │ Giop/s │ 737317.94 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.7 │ HBM Bandwidth │ │ Gb/s │ 3231.95 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.8 │ L2 Cache Bandwidth │ │ Gb/s │ 19096.81 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.9 │ L1 Cache Bandwidth │ 3880.358726762844 │ Gb/s │ 25006.24 │
+| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+| │ 4.1.10 │ LDS Bandwidth │ │ Gb/s │ 54920.88 │
+| ╘═════════════╧════════════════════╧═══════════════════╧═════════╧════════════════════╛
+├─ 4.2 Roofline AI Plot Points:
+| ╒═════════════╤══════════════════════╤═════════╤════════════╕
+| │ Metric_ID │ Metric │ Value │ Unit │
+| ╞═════════════╪══════════════════════╪═════════╪════════════╡
+| │ 4.2.0 │ AI HBM │ │ Flops/byte │
+| ├─────────────┼──────────────────────┼─────────┼────────────┤
+| │ 4.2.1 │ AI L2 │ │ Flops/byte │
+| ├─────────────┼──────────────────────┼─────────┼────────────┤
+| │ 4.2.2 │ AI L1 │ │ Flops/byte │
+| ├─────────────┼──────────────────────┼─────────┼────────────┤
+| │ 4.2.3 │ Performance (GFLOPs) │ │ Gflop/s │
+| ╘═════════════╧══════════════════════╧═════════╧════════════╛
+The per-kernel analysis uses YAML-based metric evaluation for accurate calculations.
+
+Analyze multiple kernels for comparison:
+
+.. code-block:: shell-session
+$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4
 
 Baseline comparison
 .. code-block:: shell
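
The 4.1 and 4.2 tables in the documentation hunk above pair each measured rate with an empirical peak and report arithmetic intensity in Flops/byte. As a rough illustration of how those quantities relate under the textbook roofline model (which may differ from the YAML equations this commit ships), the sketch below computes arithmetic intensity, the attainable rate, and the fraction of peak reached. The function names and the per-kernel FLOP and byte counts are hypothetical; the peak and measured values are copied from the sample table, and the sketch assumes the table's "Gb/s" means gigabytes per second.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Return FLOPs per byte, or 0.0 when no bytes were moved."""
    return flops / bytes_moved if bytes_moved else 0.0


def attainable_gflops(ai: float, peak_gflops: float, peak_bw_gbs: float) -> float:
    """Textbook roofline bound: min(compute peak, AI * bandwidth peak)."""
    return min(peak_gflops, ai * peak_bw_gbs)


def fraction_of_peak(value: float, peak: float) -> float:
    """Return the share of the empirical peak that was achieved."""
    return value / peak if peak else 0.0


# Empirical peaks and the one measured rate, copied from the sample table.
VALU_PEAK_GFLOPS = 61286.40
HBM_PEAK_GBS = 3231.95
L1_PEAK_GBS = 25006.24
L1_MEASURED_GBS = 3880.358726762844

l1_utilization = fraction_of_peak(L1_MEASURED_GBS, L1_PEAK_GBS)

# Hypothetical kernel: 2 FLOPs for every 16 bytes of HBM traffic.
ai_hbm = arithmetic_intensity(flops=2.0, bytes_moved=16.0)
hbm_bound = attainable_gflops(ai_hbm, VALU_PEAK_GFLOPS, HBM_PEAK_GBS)

print(f"L1 bandwidth utilization: {l1_utilization:.1%}")
print(f"AI (HBM): {ai_hbm} Flops/byte, attainable: {hbm_bound:.2f} Gflop/s")
```

A kernel whose attainable rate is set by the bandwidth term sits on the sloped part of the roofline (memory bound); one limited by the compute peak sits on the flat part.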

docs/how-to/analyze/standalone-gui.rst

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@ application's profiling data:
 #. Top Stats (Top Kernel Statistics)
 #. System Info
 #. System Speed-of-Light
+#. Roofline AI Data Metrics
 
 To dive deeper, use the dropdown menus at the top of the screen to isolate
 particular kernels or dispatches. You should see the web page update with

src/argparser.py

Lines changed: 1 addition & 1 deletion
@@ -307,7 +307,7 @@ def validate_block(value):
             "\t\t\t For stochastic sampling, the interval is in cycles.\n"
             "\t\t\t For host_trap sampling, the interval is in microsecond "
             "(DEFAULT: 1048576)."
-        )
+        ),
     )
 
     profile_group.add_argument(

src/config.py

Lines changed: 1 addition & 1 deletion
@@ -32,6 +32,6 @@
 HIDDEN_COLUMNS = ["coll_level"]
 HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]
 HIDDEN_COLUMNS_TUI = ["Description", "coll_level"]
-HIDDEN_SECTIONS = [400, 1900, 2000]
+HIDDEN_SECTIONS = [1900, 2000]
 
 TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}

src/rocprof_compute_analyze/analysis_base.py

Lines changed: 24 additions & 1 deletion
@@ -30,8 +30,16 @@
 from collections import OrderedDict
 from pathlib import Path
 
+import pandas as pd
+
 from utils import file_io, parser, schema
-from utils.logger import console_debug, console_error, console_log, demarcate
+from utils.logger import (
+    console_debug,
+    console_error,
+    console_log,
+    console_warning,
+    demarcate,
+)
 from utils.utils import is_workload_empty, merge_counters_spatial_multiplex
 
 
@@ -189,6 +197,21 @@ def initalize_runs(self, normalization_filter=None):
                 else file_io.find_1st_sub_dir(d[0])
             )
             w.sys_info = file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv"))
+
+            if not getattr(self.get_args(), "no_roof", False):
+                try:
+                    roofline_path = sysinfo_path.joinpath("roofline.csv")
+                    roofline_df = pd.read_csv(roofline_path)
+
+                    # use original column names from roofline.csv directly
+                    w.roofline_peaks = roofline_df
+
+                except FileNotFoundError:
+                    console_warning("roofline.csv not found.")
+                    w.roofline_peaks = pd.DataFrame()
+            else:
+                w.roofline_peaks = pd.DataFrame()
+
             arch = w.sys_info.iloc[0]["gpu_arch"]
             mspec = self.get_socs()[arch]._mspec
             if self.__args.specs_correction:
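
The second hunk above loads `roofline.csv` unless a `no_roof` option is set, and falls back to an empty table when the file is missing. A minimal standalone sketch of that load-or-fallback pattern follows. To stay dependency-free it uses the standard library's `csv.DictReader` as a stand-in for `pd.read_csv` and an empty list as a stand-in for `pd.DataFrame()`; the helper name `load_roofline_peaks` and the CSV column names are hypothetical.

```python
import argparse
import csv
import tempfile
from pathlib import Path


def load_roofline_peaks(sysinfo_path: Path, args: argparse.Namespace) -> list:
    """Return the rows of roofline.csv, or [] when disabled or missing."""
    # getattr with a default tolerates argument sets that never define no_roof.
    if getattr(args, "no_roof", False):
        return []
    try:
        with open(sysinfo_path / "roofline.csv", newline="") as handle:
            return list(csv.DictReader(handle))
    except FileNotFoundError:
        print("roofline.csv not found.")
        return []


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # No file yet: the helper warns and returns the empty fallback.
    missing = load_roofline_peaks(workdir, argparse.Namespace())
    # Hypothetical column names, used only to exercise the happy path.
    (workdir / "roofline.csv").write_text("device,HBMBw\n0,3231.95\n")
    loaded = load_roofline_peaks(workdir, argparse.Namespace())
    # An explicit no_roof flag skips the read even though the file exists.
    skipped = load_roofline_peaks(workdir, argparse.Namespace(no_roof=True))
```

Returning an empty container in every failure path means downstream code can test for emptiness instead of checking whether the attribute exists.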

src/rocprof_compute_analyze/analysis_cli.py

Lines changed: 31 additions & 33 deletions
@@ -40,8 +40,9 @@ def pre_processing(self):
         if self.get_args().random_port:
             console_error("--gui flag is required to enable --random-port")
         for d in self.get_args().path:
+            workload = self._runs[d[0]]
             # create 'mega dataframe'
-            self._runs[d[0]].raw_pmc = file_io.create_df_pmc(
+            workload.raw_pmc = file_io.create_df_pmc(
                 d[0],
                 self.get_args().nodes,
                 self.get_args().spatial_multiplexing,
@@ -51,29 +52,27 @@ def pre_processing(self):
             )
 
             if self.get_args().spatial_multiplexing:
-                self._runs[d[0]].raw_pmc = self.spatial_multiplex_merge_counters(
-                    self._runs[d[0]].raw_pmc
+                workload.raw_pmc = self.spatial_multiplex_merge_counters(
+                    workload.raw_pmc
                 )
 
             file_io.create_df_kernel_top_stats(
-                df_in=self._runs[d[0]].raw_pmc,
+                df_in=workload.raw_pmc,
                 raw_data_dir=d[0],
-                filter_gpu_ids=self._runs[d[0]].filter_gpu_ids,
-                filter_dispatch_ids=self._runs[d[0]].filter_dispatch_ids,
-                filter_nodes=self._runs[d[0]].filter_nodes,
+                filter_gpu_ids=workload.filter_gpu_ids,
+                filter_dispatch_ids=workload.filter_dispatch_ids,
+                filter_nodes=workload.filter_nodes,
                 time_unit=self.get_args().time_unit,
                 max_stat_num=self.get_args().max_stat_num,
                 kernel_verbose=self.get_args().kernel_verbose,
             )
 
             # demangle and overwrite original 'Kernel_Name'
-            kernel_name_shortener(
-                self._runs[d[0]].raw_pmc, self.get_args().kernel_verbose
-            )
+            kernel_name_shortener(workload.raw_pmc, self.get_args().kernel_verbose)
 
             # create the loaded table
             parser.load_table_data(
-                workload=self._runs[d[0]],
+                workload=workload,
                 dir=d[0],
                 is_gui=False,
                 args=self.get_args(),
@@ -85,42 +84,41 @@ def run_analysis(self):
         """Run CLI analysis."""
         super().run_analysis()
 
+        workload_path = self.get_args().path[0][0]
+        workload = self._runs[workload_path]
+        gpu_arch = workload.sys_info.iloc[0]["gpu_arch"]
+        arch_config = self._arch_configs[gpu_arch]
+
         if self.get_args().list_stats:
             tty.show_kernel_stats(
                 self.get_args(),
                 self._runs,
-                self._arch_configs[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ],
+                arch_config,
                 self._output,
             )
         else:
             roof_plot = None
             # 1. check if not baseline && compatible soc:
-            if (len(self.get_args().path)) == 1 and self._runs[
-                self.get_args().path[0][0]
-            ].sys_info.iloc[0]["gpu_arch"] in [
-                "gfx90a",
-                "gfx940",
-                "gfx941",
-                "gfx942",
-                "gfx950",
-            ]:
-                # add roofline plot to cli output
-                roof_obj = self.get_socs()[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ].roofline_obj
+            if (len(self.get_args().path)) == 1:
+                if gpu_arch in ["gfx90a", "gfx940", "gfx941", "gfx942", "gfx950"]:
+                    roof_obj = self.get_socs()[gpu_arch].roofline_obj
+
+                    if roof_obj:
+                        # store path in workload for calc_ai_analyze
+                        workload.path = workload_path
 
-                if roof_obj:
-                    # NOTE: using default data type
-                    roof_plot = roof_obj.cli_generate_plot(roof_obj.get_dtype()[0])
+                        # NOTE: using default data type
+                        roof_plot = roof_obj.cli_generate_plot(
+                            dtype=roof_obj.get_dtype()[0],
+                            workload=workload,
+                            config=self._profiling_config,
+                            arch_config=arch_config,
+                        )
 
             tty.show_all(
                 self.get_args(),
                 self._runs,
-                self._arch_configs[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ],
+                arch_config,
                 self._output,
                 self._profiling_config,
                 roof_plot=roof_plot,
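
The `run_analysis` hunk above replaces one long compound condition with two nested checks: a single workload path (no baseline comparison) and a GPU architecture from the supported list. The sketch below isolates that gate to show the two forms decide the same thing. The helper name `roofline_plot_allowed` and the constant `ROOFLINE_ARCHS` are hypothetical; the architecture names are the ones listed in the diff, and `paths` mimics the list-of-lists shape the diff indexes as `path[0][0]`.

```python
# Architectures for which the diff builds a roofline plot.
ROOFLINE_ARCHS = ("gfx90a", "gfx940", "gfx941", "gfx942", "gfx950")


def roofline_plot_allowed(paths: list, gpu_arch: str) -> bool:
    """Return True for a single-run analysis on a supported architecture."""
    # More than one path means baseline comparison, which uses the table view.
    if len(paths) != 1:
        return False
    return gpu_arch in ROOFLINE_ARCHS


single_run = [["workloads/vcopy/MI200/"]]
with_baseline = [["workloads/vcopy/MI200/"], ["workloads/vcopy_v2/MI200/"]]

print(roofline_plot_allowed(single_run, "gfx90a"))
print(roofline_plot_allowed(with_baseline, "gfx90a"))
print(roofline_plot_allowed(single_run, "gfx906"))
```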

src/rocprof_compute_analyze/analysis_webui.py

Lines changed: 2 additions & 3 deletions
@@ -48,7 +48,7 @@ def __init__(self, args, supported_archs):
         self.dest_dir = str(Path(args.path[0][0]).absolute().resolve())
         self.arch = None
 
-        self.__hidden_sections = ["Memory Chart", "Roofline"]
+        self.__hidden_sections = ["Memory Chart"]
         self.__hidden_columns = HIDDEN_COLUMNS
         # define different types of bar charts
         self.__barchart_elements = {
@@ -151,7 +151,7 @@ def generate_from_filter(
             # Only display basic metrics if no filters are applied
             if not (disp_filt or kernel_filter or gcd_filter):
                 temp = {}
-                keep = [1, 2, 101, 201, 301, 401]
+                keep = [1, 2, 101, 201, 301, 401, 402]
                 for key in base_data[base_run].dfs:
                     if keep.count(key) != 0:
                         temp[key] = base_data[base_run].dfs[key]
@@ -219,7 +219,6 @@ def generate_from_filter(
                     .lower()
                 )
                 html_section = []
-
                 if panel["title"] not in self.__hidden_sections:
                     # Iterate over each table per section
                     for data_source in panel["data source"]:
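
The `keep` change in the second hunk above widens the set of tables shown when no filter is applied. The sketch below reproduces that selection on a toy dictionary to show the effect of adding `402`. The table labels are placeholder strings standing in for DataFrames, the mapping of IDs 401 and 402 to the 4.1 and 4.2 roofline tables is an assumption based on the Metric_ID prefixes in the documentation, and `select_basic_tables` is a hypothetical helper. It uses `key in keep`, which selects the same keys as the diff's `keep.count(key) != 0`.

```python
def select_basic_tables(dfs: dict, keep: list) -> dict:
    """Return only the tables whose ID appears in keep, preserving order."""
    return {key: table for key, table in dfs.items() if key in keep}


# Placeholder strings stand in for the per-table DataFrames.
dfs = {
    1: "top stats",
    2: "system info",
    201: "system speed-of-light",
    401: "roofline rate metrics",
    402: "roofline AI plot points",
    501: "some deeper panel",
}

old_keep = [1, 2, 101, 201, 301, 401]
new_keep = [1, 2, 101, 201, 301, 401, 402]

before = select_basic_tables(dfs, old_keep)
after = select_basic_tables(dfs, new_keep)
print(sorted(set(after) - set(before)))
```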

src/rocprof_compute_base.py

Lines changed: 1 addition & 1 deletion
@@ -289,7 +289,7 @@ def list_sets(self):
         if sets_info:
             first_set = next(iter(sets_info.keys()))
             print(f" rocprof-compute profile --set {first_set} # Profile this set")
-            print(f" rocprof-compute profile --list-sets # Show this help")
+            print(" rocprof-compute profile --list-sets # Show this help")
         print()
 
         sys.exit(0)
