UPSTREAM PR #16827: Massively Improved ROCm/HIP rocWMMA Performance (pp and tg) #12
Mirrored from ggml-org/llama.cpp#16827
In the HIP build docs, `-DGGML_HIP_ROCWMMA_FATTN=ON` is recommended for improved FA performance on RDNA3+/CDNA, and in broad pp512/tg128 performance testing it is usually the best option. However, some users have noticed severe performance degradation, especially with decode (tg), as context gets longer. I noticed this too, and while I was doing some other spelunking, I found what seemed like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and also optimizes the WMMA path for RDNA.
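For context, a typical HIP build with the rocWMMA FlashAttention path enabled might look like the sketch below; the GPU target `gfx1151` (Strix Halo) is an illustrative assumption, so adjust it for your hardware:

```shell
# Sketch of a llama.cpp HIP build with rocWMMA FlashAttention enabled.
# AMDGPU_TARGETS=gfx1151 is an illustrative assumption (Strix Halo);
# set it to your GPU's architecture.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```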
The perf improvements are non-trivial and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:
Llama 3.2 1B Q4_K_M
- Previous rocWMMA vs HIP: prefill (pp) and decode (tg) results
- My rocWMMA vs HIP: prefill (pp) and decode (tg) results
- My rocWMMA vs Previous rocWMMA: prefill (pp) and decode (tg) results

gpt-oss-20b F16/MXFP4
- Previous rocWMMA vs HIP: prefill (pp) and decode (tg) results
- My rocWMMA vs HIP: prefill (pp) and decode (tg) results
- My rocWMMA vs Previous rocWMMA: prefill (pp) and decode (tg) results
I only tested small models while I was developing, but I am running gpt-oss-120b overnight. Since Llama 3.2 1B (dense) and gpt-oss-20b (MoE) show similar gains, I expect something not so different as context grows...
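The pp512/tg128 numbers above come from llama.cpp's `llama-bench` tool; a run of the kind used here might look like the following sketch, where the model path and depth values are illustrative assumptions:

```shell
# Sketch of a llama-bench run measuring prefill (pp512) and decode (tg128)
# at increasing context depths, to expose long-context tg regressions.
# The model path is an illustrative assumption.
./build/bin/llama-bench \
    -m models/llama-3.2-1b-q4_k_m.gguf \
    -p 512 -n 128 \
    -d 0,4096,8192,16384
```

Sweeping the depth (`-d`) is what surfaces the regression: at depth 0 the rocWMMA path looks fine, and it is only at longer contexts that decode throughput falls off.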