diff --git a/docs/how-to/analyze/cli.rst b/docs/how-to/analyze/cli.rst
Lines changed: 66 additions & 2 deletions
diff --git a/docs/how-to/analyze/standalone-gui.rst b/docs/how-to/analyze/standalone-gui.rst
Lines changed: 1 addition & 0 deletions
diff --git a/src/argparser.py b/src/argparser.py
Lines changed: 1 addition & 1 deletion
diff --git a/src/config.py b/src/config.py
Lines changed: 1 addition & 1 deletion
diff --git a/src/rocprof_compute_analyze/analysis_base.py b/src/rocprof_compute_analyze/analysis_base.py
Lines changed: 24 additions & 1 deletion
diff --git a/src/rocprof_compute_analyze/analysis_cli.py b/src/rocprof_compute_analyze/analysis_cli.py
Lines changed: 31 additions & 33 deletions
diff --git a/src/rocprof_compute_analyze/analysis_webui.py b/src/rocprof_compute_analyze/analysis_webui.py
Lines changed: 2 additions & 3 deletions
diff --git a/src/rocprof_compute_base.py b/src/rocprof_compute_base.py
Lines changed: 1 addition & 1 deletion
@@ -19,6 +19,9 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
 * :ref:`Filtering <cli-analysis-options>`: Hone in on a particular kernel,
   GPU ID, or dispatch ID via post-process filtering.
 
+* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
+  intensity and performance analysis for individual kernels.
+
 Run ``rocprof-compute analyze -h`` for more details.
 
 .. _cli-walkthrough:
@@ -32,7 +35,7 @@ There are three high-level GPU analysis views:
 
 * System Speed-of-Light: Key GPU performance metrics to show overall GPU performance and utilization.
 * Memory chart: Shows memory transactions and throughput on each cache hierarchical level.
-* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM).
+* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM). When combined with kernel filtering, this view also provides detailed per-kernel arithmetic intensity analysis and performance breakdowns.
 
 **System Speed-of-Light:**
 
@@ -67,7 +70,7 @@ There are three high-level GPU analysis views:
 .. note::
    * Visualized memory chart and Roofline chart are only supported in single run analysis. In multiple runs comparison mode, both are switched back to basic table view.
    * Visualized memory chart requires the width of the terminal output to be greater than or equal to 234 to display the whole chart properly.
-   * Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect.
+   * Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect. Roofline analysis provides detailed, structured table output with measured empirical peak values for comparison.
 
 .. _cli-list-metrics:
 
@@ -309,6 +312,67 @@ Filter kernels
   You should see your filtered kernels indicated by an asterisk in the **Top
   Stats** table.
 
+.. _per-kernel-roofline:
+
+Per-kernel roofline analysis
+  When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
+
+  .. code-block:: shell-session
+     $ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
+  This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
+
+  .. code-block:: text
+   ================================================================================
+   4. Roofline
+   ================================================================================
+   (4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points
+   --------------------------------------------------------------------------------
+   Kernel 0: vecCopy(double*, double*, double*, int, int) (100.0%)
+      |
+      ├─ 4.1 Roofline Rate Metrics:
+      |   ╒═════════════╤════════════════════╤═══════════════════╤═════════╤════════════════════╕
+      |   │ Metric_ID   │ Metric             │ Value             │ Unit    │   Peak (Empirical) │
+      |   ╞═════════════╪════════════════════╪═══════════════════╪═════════╪════════════════════╡
+      |   │ 4.1.0       │ VALU FLOPs         │                   │ Gflop/s │           61286.40 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.1       │ MFMA FLOPs (F64)   │                   │ Gflop/s │          108544.33 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.2       │ MFMA FLOPs (F32)   │                   │ Gflop/s │          104531.42 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.3       │ MFMA FLOPs (F16)   │                   │ Gflop/s │          709169.38 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.4       │ MFMA FLOPs (BF16)  │ 0.0               │ Gflop/s │          388161.09 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.5       │ MFMA FLOPs (F8)    │ 0.0               │ Gflop/s │         1446089.60 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.6       │ MFMA IOPs (Int8)   │                   │ Giop/s  │          737317.94 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.7       │ HBM Bandwidth      │                   │ Gb/s    │            3231.95 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.8       │ L2 Cache Bandwidth │                   │ Gb/s    │           19096.81 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.9       │ L1 Cache Bandwidth │ 3880.358726762844 │ Gb/s    │           25006.24 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.10      │ LDS Bandwidth      │                   │ Gb/s    │           54920.88 │
+      |   ╘═════════════╧════════════════════╧═══════════════════╧═════════╧════════════════════╛
+      ├─ 4.2 Roofline AI Plot Points:
+      |   ╒═════════════╤══════════════════════╤═════════╤════════════╕
+      |   │ Metric_ID   │ Metric               │ Value   │ Unit       │
+      |   ╞═════════════╪══════════════════════╪═════════╪════════════╡
+      |   │ 4.2.0       │ AI HBM               │         │ Flops/byte │
+      |   ├─────────────┼──────────────────────┼─────────┼────────────┤
+      |   │ 4.2.1       │ AI L2                │         │ Flops/byte │
+      |   ├─────────────┼──────────────────────┼─────────┼────────────┤
+      |   │ 4.2.2       │ AI L1                │         │ Flops/byte │
+      |   ├─────────────┼──────────────────────┼─────────┼────────────┤
+      |   │ 4.2.3       │ Performance (GFLOPs) │         │ Gflop/s    │
+      |   ╘═════════════╧══════════════════════╧═════════╧════════════╛
+  The per-kernel analysis uses YAML-based metric evaluation for accurate calculations.
+
+  Analyze multiple kernels for comparison:
+
+  .. code-block:: shell-session
+     $ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4
 
 Baseline comparison
   .. code-block:: shell
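
Note on the per-kernel roofline section added to cli.rst in the hunk above: the sketch below shows how arithmetic intensity and the roofline bound relate. It is an illustration only, not the tool's implementation. The function names and the kernel's FLOP and byte counts are hypothetical, the bandwidth unit is assumed to be GB/s, and only the two peak values are copied from the sample table.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of memory traffic; None when no traffic was measured."""
    if bytes_moved <= 0:
        return None
    return flops / bytes_moved


def attainable_gflops(ai, peak_gflops, peak_bandwidth):
    """Classic roofline bound: min(compute peak, AI * memory bandwidth)."""
    return min(peak_gflops, ai * peak_bandwidth)


# Empirical peaks copied from rows 4.1.0 and 4.1.7 of the sample output.
VALU_PEAK = 61286.40  # Gflop/s
HBM_PEAK = 3231.95    # assumed to be GB/s

# Hypothetical kernel: 1e9 FLOPs over 8e9 bytes of HBM traffic.
ai_hbm = arithmetic_intensity(1e9, 8e9)
bound = attainable_gflops(ai_hbm, VALU_PEAK, HBM_PEAK)

# AI above which this kernel would be limited by compute rather than HBM.
ridge = VALU_PEAK / HBM_PEAK
print(f"AI={ai_hbm} bound={bound:.2f} Gflop/s ridge={ridge:.2f} FLOPs/byte")
```

A kernel whose AI falls below the ridge point is bounded by the bandwidth term, which is why the tables report an AI value per memory level (HBM, L2, L1) next to the empirical peaks.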
 
@@ -83,6 +83,7 @@ application's profiling data:
 #. Top Stats (Top Kernel Statistics)
 #. System Info
 #. System Speed-of-Light
+#. Roofline AI Data Metrics
 
 To dive deeper, use the dropdown menus at the top of the screen to isolate
 particular kernels or dispatches. You should see the web page update with
 
@@ -307,7 +307,7 @@ def validate_block(value):
             "\t\t\t  For stochastic sampling, the interval is in cycles.\n"
             "\t\t\t  For host_trap sampling, the interval is in microsecond "
             "(DEFAULT: 1048576)."
-        )
+        ),
     )
 
     profile_group.add_argument(
 
@@ -32,6 +32,6 @@
 HIDDEN_COLUMNS = ["coll_level"]
 HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]
 HIDDEN_COLUMNS_TUI = ["Description", "coll_level"]
-HIDDEN_SECTIONS = [400, 1900, 2000]
+HIDDEN_SECTIONS = [1900, 2000]
 
 TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
@@ -30,8 +30,16 @@
 from collections import OrderedDict
 from pathlib import Path
 
+import pandas as pd
+
 from utils import file_io, parser, schema
-from utils.logger import console_debug, console_error, console_log, demarcate
+from utils.logger import (
+    console_debug,
+    console_error,
+    console_log,
+    console_warning,
+    demarcate,
+)
 from utils.utils import is_workload_empty, merge_counters_spatial_multiplex
 
 
@@ -189,6 +197,21 @@ def initalize_runs(self, normalization_filter=None):
                 else file_io.find_1st_sub_dir(d[0])
             )
             w.sys_info = file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv"))
+
+            if not getattr(self.get_args(), "no_roof", False):
+                try:
+                    roofline_path = sysinfo_path.joinpath("roofline.csv")
+                    roofline_df = pd.read_csv(roofline_path)
+
+                    # use original column names from roofline.csv directly
+                    w.roofline_peaks = roofline_df
+
+                except FileNotFoundError:
+                    console_warning("roofline.csv not found.")
+                    w.roofline_peaks = pd.DataFrame()
+            else:
+                w.roofline_peaks = pd.DataFrame()
+
             arch = w.sys_info.iloc[0]["gpu_arch"]
             mspec = self.get_socs()[arch]._mspec
             if self.__args.specs_correction:
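
Note on the hunk above: the added block loads ``roofline.csv`` when roofline is enabled and falls back to an empty table otherwise. Below is a minimal standalone sketch of the same fallback pattern, assuming nothing about the real file's columns. The function name ``load_roofline_peaks`` is hypothetical, and the stdlib ``csv`` module stands in for pandas only to keep the sketch dependency-free.

```python
import csv
from pathlib import Path


def load_roofline_peaks(sysinfo_dir, no_roof=False):
    """Return roofline.csv rows as dicts, or [] when disabled or missing."""
    if no_roof:
        # Mirrors the `else` branch in the diff: roofline explicitly disabled.
        return []
    roofline_path = Path(sysinfo_dir) / "roofline.csv"
    try:
        with roofline_path.open(newline="") as handle:
            # Keep the original column names from the file, as the diff does.
            return list(csv.DictReader(handle))
    except FileNotFoundError:
        print(f"warning: {roofline_path} not found.")
        return []
```

Callers can then treat "disabled" and "missing" the same way by checking whether the result is empty, which matches how the diff assigns an empty ``DataFrame`` in both cases.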
 
@@ -40,8 +40,9 @@ def pre_processing(self):
         if self.get_args().random_port:
             console_error("--gui flag is required to enable --random-port")
         for d in self.get_args().path:
+            workload = self._runs[d[0]]
             # create 'mega dataframe'
-            self._runs[d[0]].raw_pmc = file_io.create_df_pmc(
+            workload.raw_pmc = file_io.create_df_pmc(
                 d[0],
                 self.get_args().nodes,
                 self.get_args().spatial_multiplexing,
@@ -51,29 +52,27 @@ def pre_processing(self):
             )
 
             if self.get_args().spatial_multiplexing:
-                self._runs[d[0]].raw_pmc = self.spatial_multiplex_merge_counters(
-                    self._runs[d[0]].raw_pmc
+                workload.raw_pmc = self.spatial_multiplex_merge_counters(
+                    workload.raw_pmc
                 )
 
             file_io.create_df_kernel_top_stats(
-                df_in=self._runs[d[0]].raw_pmc,
+                df_in=workload.raw_pmc,
                 raw_data_dir=d[0],
-                filter_gpu_ids=self._runs[d[0]].filter_gpu_ids,
-                filter_dispatch_ids=self._runs[d[0]].filter_dispatch_ids,
-                filter_nodes=self._runs[d[0]].filter_nodes,
+                filter_gpu_ids=workload.filter_gpu_ids,
+                filter_dispatch_ids=workload.filter_dispatch_ids,
+                filter_nodes=workload.filter_nodes,
                 time_unit=self.get_args().time_unit,
                 max_stat_num=self.get_args().max_stat_num,
                 kernel_verbose=self.get_args().kernel_verbose,
             )
 
             # demangle and overwrite original 'Kernel_Name'
-            kernel_name_shortener(
-                self._runs[d[0]].raw_pmc, self.get_args().kernel_verbose
-            )
+            kernel_name_shortener(workload.raw_pmc, self.get_args().kernel_verbose)
 
             # create the loaded table
             parser.load_table_data(
-                workload=self._runs[d[0]],
+                workload=workload,
                 dir=d[0],
                 is_gui=False,
                 args=self.get_args(),
@@ -85,42 +84,41 @@ def run_analysis(self):
         """Run CLI analysis."""
         super().run_analysis()
 
+        workload_path = self.get_args().path[0][0]
+        workload = self._runs[workload_path]
+        gpu_arch = workload.sys_info.iloc[0]["gpu_arch"]
+        arch_config = self._arch_configs[gpu_arch]
+
         if self.get_args().list_stats:
             tty.show_kernel_stats(
                 self.get_args(),
                 self._runs,
-                self._arch_configs[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ],
+                arch_config,
                 self._output,
             )
         else:
             roof_plot = None
             # 1. check if not baseline && compatible soc:
-            if (len(self.get_args().path)) == 1 and self._runs[
-                self.get_args().path[0][0]
-            ].sys_info.iloc[0]["gpu_arch"] in [
-                "gfx90a",
-                "gfx940",
-                "gfx941",
-                "gfx942",
-                "gfx950",
-            ]:
-                # add roofline plot to cli output
-                roof_obj = self.get_socs()[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ].roofline_obj
+            if (len(self.get_args().path)) == 1:
+                if gpu_arch in ["gfx90a", "gfx940", "gfx941", "gfx942", "gfx950"]:
+                    roof_obj = self.get_socs()[gpu_arch].roofline_obj
+
+                    if roof_obj:
+                        # store path in workload for calc_ai_analyze
+                        workload.path = workload_path
 
-                if roof_obj:
-                    # NOTE: using default data type
-                    roof_plot = roof_obj.cli_generate_plot(roof_obj.get_dtype()[0])
+                        # NOTE: using default data type
+                        roof_plot = roof_obj.cli_generate_plot(
+                            dtype=roof_obj.get_dtype()[0],
+                            workload=workload,
+                            config=self._profiling_config,
+                            arch_config=arch_config,
+                        )
 
             tty.show_all(
                 self.get_args(),
                 self._runs,
-                self._arch_configs[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ],
+                arch_config,
                 self._output,
                 self._profiling_config,
                 roof_plot=roof_plot,
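
Note on the hunk above: after the refactor, the roofline plot is generated only for a single-run analysis on a supported architecture with a roofline object available. The sketch below restates that condition as one predicate. The helper name ``wants_roofline_plot`` is hypothetical and not part of the PR; the architecture list is copied from the diff.

```python
# Architectures listed in the diff as roofline-capable.
ROOFLINE_ARCHS = ("gfx90a", "gfx940", "gfx941", "gfx942", "gfx950")


def wants_roofline_plot(paths, gpu_arch, roof_obj):
    """True only for a single run (no baseline), a supported arch, and a roofline object."""
    return len(paths) == 1 and gpu_arch in ROOFLINE_ARCHS and bool(roof_obj)
```

Usage would look like ``wants_roofline_plot(args.path, gpu_arch, roof_obj)``, where the three arguments correspond to ``self.get_args().path``, ``gpu_arch``, and ``roof_obj`` in the hunk.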
 
@@ -48,7 +48,7 @@ def __init__(self, args, supported_archs):
         self.dest_dir = str(Path(args.path[0][0]).absolute().resolve())
         self.arch = None
 
-        self.__hidden_sections = ["Memory Chart", "Roofline"]
+        self.__hidden_sections = ["Memory Chart"]
         self.__hidden_columns = HIDDEN_COLUMNS
         # define different types of bar charts
         self.__barchart_elements = {
@@ -151,7 +151,7 @@ def generate_from_filter(
             # Only display basic metrics if no filters are applied
             if not (disp_filt or kernel_filter or gcd_filter):
                 temp = {}
-                keep = [1, 2, 101, 201, 301, 401]
+                keep = [1, 2, 101, 201, 301, 401, 402]
                 for key in base_data[base_run].dfs:
                     if keep.count(key) != 0:
                         temp[key] = base_data[base_run].dfs[key]
@@ -219,7 +219,6 @@ def generate_from_filter(
                     .lower()
                 )
                 html_section = []
-
                 if panel["title"] not in self.__hidden_sections:
                     # Iterate over each table per section
                     for data_source in panel["data source"]:
 
@@ -289,7 +289,7 @@ def list_sets(self):
         if sets_info:
             first_set = next(iter(sets_info.keys()))
             print(f"  rocprof-compute profile --set {first_set}  # Profile this set")
-        print(f"  rocprof-compute profile --list-sets        # Show this help")
+        print("  rocprof-compute profile --list-sets        # Show this help")
         print()
 
         sys.exit(0)