Commit 0a71cae

[rocprofiler-systems] Add README documentation and standalone build support for examples (#3987)
## Motivation

- The rocprofiler-systems examples ecosystem was tightly coupled with the project itself and was not documented at all.
- This PR adds documentation to each of the examples and makes it possible to build the examples as a standalone project, as well as to build each example separately.

## Technical Details

- Top-level: examples/README.md
  - Categorized table of all 22 examples (GPU Compute, Profiler API, CPU Threading, Distributed, OpenMP, GPU Libraries, HPC, Python)
  - Build instructions for the entire examples suite
  - Profiling modes table (system-level, binary rewrite, runtime instrument, sampling, causal)
  - Common environment variables reference
- Per-example READMEs, each containing:
  - Overview: 3-5 sentences explaining what the example does and why it's useful for profiling
  - Source Files: Description of each source file and its role
  - Prerequisites: Required dependencies (HIP, MPI, Kokkos, etc.)
  - Building: Standalone and suite build commands
  - Running: CLI arguments and usage examples
  - Profiling with rocprofiler-systems: rocprof-sys-run commands and recommended ROCPROFSYS_* environment variables
- This PR also adds `examples/cmake/standalone-helpers.cmake`, which provides CMake functions that enable building examples independently.

## JIRA ID

AIPROFSYST-237

## Test Plan

N/A

## Test Result

N/A

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: David Galiffi <David.Galiffi@amd.com>
1 parent 1fe7989 commit 0a71cae

49 files changed, +2505 -257 lines changed

projects/rocprofiler-systems/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -310,8 +310,8 @@ elseif("$ENV{ROCPROFSYS_CI}")
 endif()

 if(ROCPROFSYS_INSTALL_TESTING)
-    set(ROCPROFSYS_INSTALL_EXAMPLES ON CACHE BOOL "Enable installing examples" FORCE)
     set(ROCPROFSYS_BUILD_TESTING ON CACHE BOOL "Enable building the testing suite" FORCE)
+    set(ROCPROFSYS_INSTALL_EXAMPLES ON CACHE BOOL "Enable installing examples" FORCE)
 endif()

 if(ROCPROFSYS_INSTALL_EXAMPLES)

projects/rocprofiler-systems/examples/CMakeLists.txt

Lines changed: 37 additions & 22 deletions
@@ -31,6 +31,10 @@ if(CMAKE_PROJECT_NAME STREQUAL "rocprofiler-systems")
     rocprofiler_systems_add_option(ROCPROFSYS_INSTALL_EXAMPLES
         "Install rocprofiler-systems examples" OFF
     )
+elseif(CMAKE_PROJECT_NAME STREQUAL PROJECT_NAME)
+    # Standalone build of the full examples suite
+    include(${CMAKE_CURRENT_LIST_DIR}/cmake/standalone-helpers.cmake)
+    option(ROCPROFSYS_INSTALL_EXAMPLES "Install rocprofiler-systems examples" ON)
 else()
     option(ROCPROFSYS_INSTALL_EXAMPLES "Install rocprofiler-systems examples" ON)
 endif()
@@ -72,25 +76,36 @@ set(ROCPROFSYS_EXAMPLE_ROOT_DIR ${CMAKE_CURRENT_LIST_DIR} CACHE INTERNAL "")
 # defines function for creating causal profiling exes
 include(${CMAKE_CURRENT_LIST_DIR}/causal-helpers.cmake)

-add_subdirectory(transpose)
-add_subdirectory(parallel-overhead)
-add_subdirectory(code-coverage)
-add_subdirectory(user-api)
-add_subdirectory(openmp)
-add_subdirectory(mpi)
-add_subdirectory(python)
-add_subdirectory(lulesh)
-add_subdirectory(rccl)
-add_subdirectory(rewrite-caller)
-add_subdirectory(causal)
-add_subdirectory(trace-time-window)
-add_subdirectory(fork)
-add_subdirectory(videodecode)
-add_subdirectory(jpegdecode)
-add_subdirectory(roctx)
-add_subdirectory(thread-limit)
-add_subdirectory(transferBench)
-add_subdirectory(hpc)
-add_subdirectory(shmem)
-add_subdirectory(scratch-memory)
-add_subdirectory(sdma_test)
+# List of all example directories
+set(ROCPROFSYS_EXAMPLE_DIRS
+    transpose
+    parallel-overhead
+    code-coverage
+    user-api
+    openmp
+    mpi
+    python
+    lulesh
+    rccl
+    rewrite-caller
+    causal
+    trace-time-window
+    fork
+    videodecode
+    jpegdecode
+    roctx
+    thread-limit
+    transferBench
+    hpc
+    shmem
+    scratch-memory
+    sdma_test
+)
+
+foreach(_example_dir IN LISTS ROCPROFSYS_EXAMPLE_DIRS)
+    if(NOT "${_example_dir}" IN_LIST ROCPROFSYS_DISABLE_EXAMPLES)
+        add_subdirectory(${_example_dir})
+    else()
+        message(STATUS "[rocprofiler-systems] Skipping example: ${_example_dir}")
+    endif()
+endforeach()
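
The `foreach` loop in the hunk above skips every directory named in `ROCPROFSYS_DISABLE_EXAMPLES` and prints a `Skipping example` status line for it. The following is a minimal sketch of a configure step that exercises this. It assumes the variable is passed on the command line as a semicolon-separated CMake list. The build directory name and the three skipped examples are illustrative choices, not values taken from the commit.

```bash
# Path from the repository root, as used elsewhere in this commit.
src_dir="projects/rocprofiler-systems/examples"
# Illustrative build directory name (an assumption).
build_dir="build-examples"
# Semicolon-separated CMake list; names must match entries in ROCPROFSYS_EXAMPLE_DIRS.
disable_list="mpi;rccl;shmem"

# Configure only when cmake and the examples directory are actually present.
if command -v cmake >/dev/null 2>&1 && [ -f "${src_dir}/CMakeLists.txt" ]; then
    cmake -B "${build_dir}" \
        -DCMAKE_PREFIX_PATH=/opt/rocm \
        -DROCPROFSYS_DISABLE_EXAMPLES="${disable_list}" \
        "${src_dir}"
else
    echo "cmake or ${src_dir} not found; nothing configured"
fi
```

The quotes around the list matter: an unquoted semicolon would end the shell command at the first list entry.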

projects/rocprofiler-systems/examples/README.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
+# rocprofiler-systems Examples
+
+This directory contains example applications demonstrating various profiling scenarios with rocprofiler-systems. Each example targets a specific workload type - GPU compute, CPU threading, distributed computing, or library integration - and shows how to capture performance data using `rocprof-sys-run`.
+
+## Examples by Category
+
+### GPU Compute
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [transpose](transpose/) | Tiled matrix transpose on GPU with multi-threaded stream execution | HIP |
+| [scratch-memory](scratch-memory/) | GPU scratch memory allocation stress test across primary and overflow slots | HIP, HSA |
+| [sdma_test](sdma_test/) | SDMA engine bandwidth benchmark for H2D, D2D, and D2H transfers | HIP |
+| [transferBench](transferBench/) | All-to-all transfer benchmark across CPU, GPU, SDMA, and NIC executors | HIP, HSA |
+
+### Profiler API
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [user-api](user-api/) | User API for named regions, annotations, and selective thread tracing | rocprofiler-systems user library |
+| [roctx](roctx/) | ROCTx range/marker API with thread naming, pause/resume, and device labeling | rocprofiler-sdk-roctx, HIP |
+| [causal](causal/) | Causal profiling with slow/fast parallel workloads and progress point tracking | rocprofiler-systems user library |
+| [rewrite-caller](rewrite-caller/) | Minimal call chain for binary rewrite instrumentation testing | None |
+| [trace-time-window](trace-time-window/) | Mixed CPU-bound and sleep workload for time-windowed trace analysis | None |
+
+### CPU Threading
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [thread-limit](thread-limit/) | Thread scaling stress test with batched Fibonacci workers | pthreads |
+| [parallel-overhead](parallel-overhead/) | Mutex vs. atomic synchronization overhead comparison | pthreads |
+| [code-coverage](code-coverage/) | Dual code-path execution for coverage analysis testing | pthreads |
+| [fork](fork/) | Multi-process forking from worker threads with child process profiling | pthreads |
+
+### Distributed Computing
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [mpi](mpi/) | MPI collective and point-to-point operations with communicator patterns | MPI |
+| [rccl](rccl/) | RCCL collective communication performance tests across GPUs | HIP, RCCL |
+| [shmem](shmem/) | OpenSHMEM hello world and ping-pong latency benchmark | OpenSHMEM (oshcc) |
+
+### OpenMP
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [openmp](openmp/) | NAS Parallel Benchmarks (CG, LU) with OpenMP threading | OpenMP |
+
+### GPU Libraries
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [jpegdecode](jpegdecode/) | Batch JPEG decoding performance benchmark using rocJPEG | HIP, rocJPEG |
+| [videodecode](videodecode/) | Batch video decoding benchmark using ROCDecode with VCN hardware | HIP, ROCDecode, FFmpeg |
+
+### HPC
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [lulesh](lulesh/) | LULESH shock hydrodynamics mini-app with Kokkos parallelism | Kokkos, optional MPI |
+| [hpc](hpc/) | Six HPC training examples covering Jacobi solvers, matrix exponentials, and stream overlap | HIP, rocBLAS, Fortran (varies) |
+
+### Python
+
+| Example | Description | Dependencies |
+|---------|-------------|--------------|
+| [python](python/) | Python profiling with decorators, user regions, and selective tracing | Python 3, optional NumPy |
+
+## Building All Examples
+
+- The examples are built as part of the `rocprofiler-systems` CMake project.
+- They can also be built as **standalone** applications or as part of the **examples suite**.
+- The following commands build the whole **examples suite**:
+
+- From the `examples` directory, run:
+
+```bash
+cmake -B <build_dir> \
+  -DCMAKE_PREFIX_PATH=/opt/rocm \
+  -DCMAKE_INSTALL_PREFIX=./install \
+  .
+
+cmake --build <build_dir> --parallel
+```
+
+- Or from the repository root:
+
+```bash
+cmake -B <build_dir> \
+  -DCMAKE_PREFIX_PATH=/opt/rocm \
+  projects/rocprofiler-systems/examples
+
+cmake --build <build_dir> --parallel
+```
+
+- Individual examples can be built by specifying the target:
+
+```bash
+cmake --build <build_dir> --target <example_name>
+```
+
+GPU examples require ROCm (`hipcc` or `amdclang++`) and detect available architectures automatically. To specify architectures manually:
+
+```bash
+cmake -B <build_dir> -DROCPROFSYS_GFX_TARGETS="gfx90a;gfx942" ...
+```
+
+## Profiling Modes
+
+rocprofiler-systems supports several instrumentation modes:
+
+| Mode | Command | Description |
+|------|---------|-------------|
+| System-level | `rocprof-sys-run -- ./app` | Lightweight tracing via `LD_PRELOAD`, no binary modification |
+| Binary rewrite | `rocprof-sys-instrument -o app.inst -- ./app` then `rocprof-sys-run -- ./app.inst` | Statically rewrite the binary for repeated profiling |
+| Runtime instrument | `rocprof-sys-instrument -- ./app` | Dynamically instrument at launch without modifying the binary |
+| Sampling | `rocprof-sys-sample -- ./app` | Statistical sampling of call stacks at configurable frequency |
+| Causal | `rocprof-sys-causal -- ./app` | Causal profiling to identify optimization opportunities |
+
+### Common Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `ROCPROFSYS_TRACE` | Enable Perfetto trace output | `true` |
+| `ROCPROFSYS_PROFILE` | Enable call-stack profile output | `true` |
+| `ROCPROFSYS_USE_ROCPD` | Generate `rocpd` database output | `false` |
+| `ROCPROFSYS_USE_SAMPLING` | Enable statistical sampling | `false` |
+| `ROCPROFSYS_SAMPLING_FREQ` | Sampling frequency (interrupts/sec) | `50` |
+| `ROCPROFSYS_USE_PROCESS_SAMPLING` | Enable process-level resource sampling | `true` |
+| `ROCPROFSYS_OUTPUT_PATH` | Base directory for output files | `rocprofsys-output` |
+| `ROCPROFSYS_TIME_OUTPUT` | Timestamp output subdirectories | `true` |
+| `ROCPROFSYS_ROCM_DOMAINS` | ROCm API domains to trace | all |
+| `ROCPROFSYS_USE_MPIP` | Enable MPI profiling interposition | `false` |
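
The environment variables in the table above are read by the profiler when the application is launched. The following is a minimal sketch of a system-level run that overrides a few of them. It assumes they are exported in the calling shell before `rocprof-sys-run` starts. The chosen values and the `./app` binary are illustrative placeholders, not settings recommended by the README.

```bash
# Variable names come from the README table; the values are illustrative.
export ROCPROFSYS_TRACE=true
export ROCPROFSYS_USE_SAMPLING=true
export ROCPROFSYS_SAMPLING_FREQ=100
export ROCPROFSYS_OUTPUT_PATH=rocprofsys-output
export ROCPROFSYS_TIME_OUTPUT=false

# Hypothetical target binary; replace with a built example.
app=./app

# Launch only when both the profiler and the target are available.
if command -v rocprof-sys-run >/dev/null 2>&1 && [ -x "${app}" ]; then
    rocprof-sys-run -- "${app}"
else
    echo "rocprof-sys-run or ${app} not found; skipping the profiling run"
fi
```

Exporting the variables keeps the launch command itself identical to the system-level row of the profiling modes table.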

projects/rocprofiler-systems/examples/causal/CMakeLists.txt

Lines changed: 6 additions & 9 deletions
@@ -1,16 +1,13 @@
+# Copyright (c) Advanced Micro Devices, Inc.
+# SPDX-License-Identifier: MIT
+
 cmake_minimum_required(VERSION 3.21 FATAL_ERROR)

 project(rocprofiler-systems-causal-example LANGUAGES CXX)

-if(ROCPROFSYS_DISABLE_EXAMPLES)
-    get_filename_component(_DIR ${CMAKE_CURRENT_LIST_DIR} NAME)
-
-    if(
-        ${PROJECT_NAME} IN_LIST ROCPROFSYS_DISABLE_EXAMPLES
-        OR ${_DIR} IN_LIST ROCPROFSYS_DISABLE_EXAMPLES
-    )
-        return()
-    endif()
+# Support standalone builds
+if(CMAKE_PROJECT_NAME STREQUAL PROJECT_NAME)
+    include(${CMAKE_CURRENT_LIST_DIR}/../cmake/standalone-helpers.cmake OPTIONAL)
 endif()

 set(CMAKE_BUILD_TYPE "Release")

projects/rocprofiler-systems/examples/causal/README.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+# Causal Profiling
+
+## Overview
+
+This example implements a causal profiling workload that runs slow and fast functions in parallel threads to evaluate causal profiling accuracy. Two implementation strategies are provided: an RNG-based timing approach using the Mersenne Twister random number generator, and a CPU-bound timing approach using `CLOCK_THREAD_CPUTIME_ID`. The program measures the actual execution time ratio between threads and compares it against the expected ratio to validate that causal profiling correctly identifies optimization opportunities. Progress points (`CAUSAL_PROGRESS_NAMED`) mark iteration boundaries for the causal analysis.
+
+## Source Files
+
+- `causal.cpp` - Sets up the threading workload with synchronization barriers and measures execution ratios.
+- `impl.cpp` - Implements `rng_slow_func()`/`rng_fast_func()` (RNG-based timing) and `cpu_slow_func()`/`cpu_fast_func()` (CPU clock-based timing), with `CAUSAL_PROGRESS_NAMED` annotations.
+- `causal.hpp` - Declarations for slow/fast function variants.
+- `impl.hpp` - Implementation template declarations and timing utilities.
+
+## Prerequisites
+
+- CMake 3.21+
+- C++17 compiler
+- rocprofiler-systems user library (`rocprofiler-systems::rocprofiler-systems-user-library`)
+
+## Building
+
+**Standalone build:**
+
+```bash
+cmake -B <build_dir> -S <project_root>/examples/causal -DCMAKE_PREFIX_PATH=/opt/rocm
+cmake --build <build_dir>
+```
+
+**As part of the examples suite:**
+
+```bash
+cmake -B <build_dir> -DCMAKE_PREFIX_PATH=/opt/rocm <project_root>/examples/
+cmake --build <build_dir> --target causal-both causal-rng causal-cpu
+```
+
+The build generates multiple variants of each executable:
+
+**Targets:**
+
+| Target | Description |
+|--------|-------------|
+| `causal-both` | Both RNG and CPU workloads |
+| `causal-rng` | RNG-based workload only |
+| `causal-cpu` | CPU-based workload only |
+| `causal-*-rocprofsys` | Linked with rocprofiler-systems user library |
+| `causal-*-coz` | Linked with COZ profiler (if available) |
+
+## Running
+
+```bash
+# Default: 70% work ratio, 50 iterations
+./causal-both
+
+# Custom: 80% ratio, 20 iterations, seed 12345, slow value 1000000000
+./causal-cpu 80 20 12345 1000000000
+```
+
+**Arguments:**
+
+| Position | Description | Default |
+|----------|-------------|---------|
+| 1 | Work fraction (percentage ratio fast/slow) | 70 |
+| 2 | Number of iterations | 50 |
+| 3 | Random seed | random |
+| 4 | Slow value (CPU cycles/work units) | 200000000 |
+| 5 | Sync points or fast value | 1 |
+
+## Profiling with rocprofiler-systems
+
+Causal profiling uses the dedicated `rocprof-sys-causal` command or the `--causal` flag:
+
+```bash
+# Function-level causal profiling
+rocprof-sys-causal -n 2 -w 1 -d 3 -- ./causal-cpu-rocprofsys 70 10 432525 1000000000
+
+# Line-level causal profiling
+rocprof-sys-causal --mode line -- ./causal-cpu-rocprofsys 70 10 432525 1000000000
+```
+
+### Recommended Configuration
+
+| Variable | Value | Purpose |
+|----------|-------|---------|
+| `ROCPROFSYS_CAUSAL_RANDOM_SEED` | `1342342` | Fixed seed for reproducible causal experiments |
+| `ROCPROFSYS_TIME_OUTPUT` | `OFF` | Disable timestamped output directories |
+| `ROCPROFSYS_FILE_OUTPUT` | `ON` | Enable file output |
+
+**Causal CLI flags:**
+
+| Flag | Description |
+|------|-------------|
+| `-n` | Number of causal experiment iterations |
+| `-w` | Number of warmup iterations |
+| `-d` | Virtual speedup delta (percentage steps) |
+| `-b timer` | Use timer-based progress tracking |
+| `-v 3` | Verbosity level |
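
The recommended configuration and the function-level command from this README can be combined into one run. The following is a minimal sketch, assuming the variables are exported in the calling shell and that `causal-cpu-rocprofsys` has been built into the current directory. The guard and the fallback message are additions for illustration only. The profiler invocation is copied unchanged from the README.

```bash
# Recommended configuration from the README table.
export ROCPROFSYS_CAUSAL_RANDOM_SEED=1342342
export ROCPROFSYS_TIME_OUTPUT=OFF
export ROCPROFSYS_FILE_OUTPUT=ON

# Assumed location of the built target (current directory).
target=./causal-cpu-rocprofsys

# Run the function-level experiment only when the tool and target exist.
if command -v rocprof-sys-causal >/dev/null 2>&1 && [ -x "${target}" ]; then
    rocprof-sys-causal -n 2 -w 1 -d 3 -- "${target}" 70 10 432525 1000000000
else
    echo "rocprof-sys-causal or ${target} not found; skipping the causal run"
fi
```

Fixing the seed and disabling timestamped output directories should make repeated runs easier to compare, since results land in the same location each time.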
