Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 69 additions & 5 deletions docs/src/tutorials/profiling.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,11 @@

## rocprof

[rocprofv2](https://github.com/ROCm/rocprofiler?tab=readme-ov-file#rocprofiler-v2)
allows profiling both HSA & HIP API calls (rocprof being deprecated).
[rocprof](https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler)
allows profiling HSA & HIP API calls, kernel launches, and more...
Multiple major versions are available: `rocprof`, `rocprofv2` and `rocprofv3`.

Let's profile simple copying kernel saved in `profile.jl` file:
Let's profile a simple copying kernel saved in a `profile.jl` file:
```julia
using AMDGPU

Expand Down Expand Up @@ -38,13 +39,52 @@ main(2^24)

### Profiling problematic code

As mentioned above, there are different `rocprof` versions and see which one works the best.
On older ROCm versions, the newer `rocprofv2` and `rocprofv3` may not work so well.

!!! note
While AMDGPU.jl uses the HIP API, only `--hsa-trace` seems to capture CPU API calls,
of the lower-level HSA API, while `--hip-trace` has no effect.

This applies to ROCm 6.2.4 at least, and was tested on AMDGPU 2.1.3.
If with other versions HIP API calls can be captured then please amend this documentation.

#### rocprof
```bash
rocprof --hsa-trace --roctx-trace julia ./profile.jl
```

This enables HSA and ROC-TX (see below) tracing.
Memory copies and kernel launches are reported as well.

This will produce an `output.json` file which can be visualized
using [Perfetto UI](https://ui.perfetto.dev/).

#### rocprofv2
```bash
rocprofv2 --plugin perfetto --hsa-trace --roctx-trace --kernel-trace -o prof julia ./profile.jl
```

In principle this should enable various types of tracing,
but note that on ROCm 6.2.4 only kernel launches seem to be reported.

This will produce a `prof_output.pftrace` file which can be visualized
using [Perfetto UI](https://ui.perfetto.dev/).

#### rocprofv3
```bash
ENABLE_JITPROFILING=1 rocprofv2 --plugin perfetto --hip-trace --hsa-trace --kernel-trace -o prof julia ./profile.jl
rocprofv3 --output-format pftrace --hsa-trace --marker-trace --kernel-trace --memory-copy-trace -- julia ./profile.jl
```

This will produce `prof_output.pftrace` file which can be visualized
This will produce a number of `prof_output.pftrace` files which can be visualized
using [Perfetto UI](https://ui.perfetto.dev/).

`rocprofv3` is now recommended by AMD, however on ROCm 6.2.4 nothing seems to be reported.

#### Visualization of the results
Here is an example of visualizing the `profile.jl` script above in Perfetto.
Use `W`/`S` to zoom in/out and `A`/`D` to move left/right in the timeline.

![image](../assets/profile_1.png)

Here we can clearly see that host synchronization after each kernel dispatch
Expand All @@ -69,6 +109,30 @@ wall duration is lower.

![image](../assets/profile_2.png)

### Marking regions
When launching lots of kernels, it can be difficult to understand
the trace in terms of high-level program behavior.
In that case, the ROC-TX API can be used to mark regions that will be visible in the traces.

Here is an example of calling the API directly:
```julia
function rangePush(message)
@ccall "libroctx64".roctxRangePushA(message::Ptr{Cchar})::Cint
end

function rangePop()
@ccall "libroctx64".roctxRangePop()::Cint
end

rangePush("Section name")
# Launch some kernels, call some functions, etc...
rangePop()
```

While [ROCTX.jl](https://github.com/JuliaGPU/ROCTX.jl) aims to offer a Julia wrapper around it,
it does not seem to be working yet. PRs welcome!
(Note: the `ccall`s above do _not_ require ROCTX.jl to be loaded!)

## Debugging

Use `HIP_LAUNCH_BLOCKING=1` to synchronize immediately after launching GPU kernels.
Expand Down