
Conversation


@DajanaV DajanaV commented Oct 29, 2025

Mirrored from ggml-org/llama.cpp#16827

In the HIP build docs, -DGGML_HIP_ROCWMMA_FATTN=ON is recommended for improved FA performance on RDNA3+/CDNA, and in broad pp512/tg128 performance testing it is usually the best option. However, some users have noticed severe performance degradation, especially for decode (tg) as context gets longer.

I noticed this too, and while I was doing some other spelunking I found what looked like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and optimizes the WMMA path for RDNA.

  • Dramatically improve long-context WMMA prefill on RDNA3: increased HIP occupancy and reduced LDS footprint via an adaptive KQ stride; pp speedups without touching CUDA or the deprecated Volta WMMA path (see the sketch after this list).
  • Fix the long-context decode regression on rocWMMA builds: decode now uses HIP’s tuned VEC/TILE selection instead of WMMA, aligning performance with the HIP baseline.
  • Remove HIP-side TILE pruning in WMMA builds: matches HIP-only behavior and avoids device traps; the binary growth from keeping all tiles was negligible, ~+4 MiB to the build.
  • Add a decode-time (HIP+rocWMMA only) safety guard: if a predicted TILE split has no config, fall back to VEC. This guard is not present in HIP-only builds, but it seemed like a good idea to avoid crashes on unusual dims.
  • Changes are gated to ROCWMMA/HIP only; no impact on CUDA or the legacy Volta WMMA path.
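To make the occupancy/stride idea concrete, here is a minimal, hypothetical sketch. The identifiers (kq_stride_for_head_dim, flash_attn_ext_f16_wmma_stub) and the stand-in kernel are illustrative only, not the actual ggml-cuda/fattn symbols; the real change lives inside the WMMA FlashAttention kernel and its launch code.

```cpp
// Sketch only: shows the shape of the decision, not the real llama.cpp code.
#include <cstdio>

// On HIP, a 128-wide KQ stride for head sizes <= 128 roughly halves the LDS tile
// per block, which is what lets a second block become resident per SM on RDNA3.
constexpr int kq_stride_for_head_dim(bool is_hip, int head_dim) {
    return (is_hip && head_dim <= 128) ? 128 : 256;
}

#if defined(__HIP_PLATFORM_AMD__) || defined(__CUDACC__)
// The occupancy hint is expressed via __launch_bounds__(threads, min_blocks);
// shown here on a do-nothing stand-in kernel. 32 assumes wave32 on RDNA3; the
// real code would use the backend's warp/wave size constant.
template <int D, int nwarps>
__global__ void __launch_bounds__(nwarps * 32, 2) flash_attn_ext_f16_wmma_stub() {}
#endif

int main() {
    std::printf("HIP,  D=64  -> KQ stride %d\n", kq_stride_for_head_dim(true, 64));
    std::printf("HIP,  D=256 -> KQ stride %d\n", kq_stride_for_head_dim(true, 256));
    std::printf("CUDA, D=64  -> KQ stride %d\n", kq_stride_for_head_dim(false, 64));
    return 0;
}
```

The real kernel picks the stride at compile time per head size; the stand-in above only illustrates the gating (HIP-only, small D) and the min-2-blocks-per-SM launch hint.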

The perf improvements are non-trivial, and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:

Llama 3.2 1B Q4_K_M

Previous rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP t/s | WMMA t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 | 4703.28 | 4884.42 | 3.85% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d1024 | 4076.03 | 4204.81 | 3.16% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d4096 | 2936.89 | 2959.54 | 0.77% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d16384 | 1350.48 | 1265.62 | -6.28% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d65536 | 424.76 | 360.24 | -15.19% |

Decode (tg)

| model | size | params | test | HIP t/s | WMMA t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 | 195.65 | 193.01 | -1.35% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d1024 | 188.79 | 182.6 | -3.28% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d4096 | 173.36 | 143.51 | -17.22% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d16384 | 126.86 | 87.53 | -31.01% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d65536 | 64.62 | 27.35 | -57.68% |

My rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 | 4703.28 | 4970.14 | 5.67% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d1024 | 4076.03 | 4575.18 | 12.25% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d4096 | 2936.89 | 3788.92 | 29.01% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d16384 | 1350.48 | 2064.78 | 52.89% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d65536 | 424.76 | 706.46 | 66.32% |

Decode (tg)

| model | size | params | test | HIP t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 | 195.65 | 195.59 | -0.03% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d1024 | 188.79 | 188.84 | 0.03% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d4096 | 173.36 | 173.28 | -0.05% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d16384 | 126.86 | 127.01 | 0.12% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d65536 | 64.62 | 64.55 | -0.10% |

My rocWMMA vs Previous rocWMMA

Prefill (pp)

| model | size | params | test | default-rocwmma t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 | 4884.42 | 4970.14 | 1.75% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d1024 | 4204.81 | 4575.18 | 8.81% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d4096 | 2959.54 | 3788.92 | 28.02% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d16384 | 1265.62 | 2064.78 | 63.14% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | pp512 @ d65536 | 360.24 | 706.46 | 96.11% |

Decode (tg)

| model | size | params | test | default-rocwmma t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 | 193.01 | 195.59 | 1.34% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d1024 | 182.6 | 188.84 | 3.42% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d4096 | 143.51 | 173.28 | 20.74% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d16384 | 87.53 | 127.01 | 45.11% |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | tg128 @ d65536 | 27.35 | 64.55 | 136.06% |

gpt-oss-20b F16/MXFP4

Previous rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP t/s | WMMA t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 | 1472.01 | 1513.79 | 2.84% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d1024 | 1387.58 | 1417.45 | 2.15% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d4096 | 1175.72 | 1205.37 | 2.52% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d16384 | 713.9 | 669.77 | -6.18% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d65536 | 277.58 | 227.24 | -18.14% |

Decode (tg)

| model | size | params | test | HIP t/s | WMMA t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 | 49.92 | 50.23 | 0.61% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d1024 | 49.27 | 48.65 | -1.26% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d4096 | 48.15 | 45.11 | -6.32% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d16384 | 44.38 | 32.91 | -25.85% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d65536 | 34.76 | 14.63 | -57.92% |

My rocWMMA vs HIP

Prefill (pp)

| model | size | params | test | HIP t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 | 1472.01 | 1495.97 | 1.63% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d1024 | 1387.58 | 1456.15 | 4.94% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d4096 | 1175.72 | 1347.75 | 14.63% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d16384 | 713.9 | 962.98 | 34.89% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d65536 | 277.58 | 426.81 | 53.76% |

Decode (tg)

| model | size | params | test | HIP t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 | 49.92 | 49.9 | -0.04% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d1024 | 49.27 | 49.21 | -0.11% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d4096 | 48.15 | 48.05 | -0.20% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d16384 | 44.38 | 44.34 | -0.11% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d65536 | 34.76 | 34.77 | 0.03% |

My rocWMMA vs Previous rocWMMA

Prefill (pp)

| model | size | params | test | default-rocwmma t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 | 1513.79 | 1495.97 | -1.18% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d1024 | 1417.45 | 1456.15 | 2.73% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d4096 | 1205.37 | 1347.75 | 11.81% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d16384 | 669.77 | 962.98 | 43.78% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | pp512 @ d65536 | 227.24 | 426.81 | 87.83% |

Decode (tg)

| model | size | params | test | default-rocwmma t/s | lhl-tune-tile t/s | Δ% |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 | 50.23 | 49.9 | -0.64% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d1024 | 48.65 | 49.21 | 1.16% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d4096 | 45.11 | 48.05 | 6.53% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d16384 | 32.91 | 44.34 | 34.72% |
| gpt-oss 20B F16 | 13141.28 MiB | 20.91 B | tg128 @ d65536 | 14.63 | 34.77 | 137.71% |

I only tested small models while I was developing, but I am running gpt-oss-120b overnight; since the Llama 3.2 1B dense and gpt-oss-20b MoE models show similar gains, I expect something not so different as context grows.

lhl added 2 commits October 28, 2025 17:33
…idency on HIP via __launch_bounds__ (min 2 blocks/SM)
- Adaptive KQ stride on HIP: 128 for D<=128 to reduce LDS footprint
- Update loops and launch to use the adaptive stride; bump nwarps for small D
- No behavior change on CUDA; improves prefill perf on RDNA3
…E and adding a safe fallback

- Do not select WMMA for decode on HIP; fall through to VEC/TILE
- Remove WMMA TILE pruning on HIP to avoid device traps; keep for CUDA WMMA
- Add decode-time guard: if predicted TILE split has no config, select VEC
- Remove ad-hoc env overrides and debug prints
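As a rough illustration of the decode-side behavior described in these commits, here is a small, self-contained sketch. The enum, the tile_config_exists() helper, and the "n_tokens == 1 means decode" shortcut are all made up for illustration and do not mirror the real ggml-cuda selection code.

```cpp
// Sketch only: the control flow, not the actual llama.cpp kernel dispatcher.
#include <cstdio>

enum class fattn_kernel { VEC, TILE, WMMA };

// Stand-in for "does a compiled TILE configuration exist for this shape?"
// In the real backend this would consult the table of instantiated kernels.
static bool tile_config_exists(int head_dim, int ncols) {
    return head_dim <= 256 && ncols <= 64; // made-up bound for illustration
}

static fattn_kernel choose_kernel_hip_rocwmma(int head_dim, int n_tokens, int ncols) {
    const bool is_decode = n_tokens == 1; // single-token batches are decode

    if (is_decode) {
        // Decode no longer routes to WMMA; it falls through to HIP's tuned
        // VEC/TILE selection, with a safety fallback to VEC when the predicted
        // TILE split has no compiled configuration.
        const fattn_kernel predicted =
            (head_dim >= 128) ? fattn_kernel::TILE : fattn_kernel::VEC;
        if (predicted == fattn_kernel::TILE && !tile_config_exists(head_dim, ncols)) {
            return fattn_kernel::VEC;
        }
        return predicted;
    }

    // Prefill keeps the (now better-tuned) WMMA path on RDNA3+.
    return fattn_kernel::WMMA;
}

int main() {
    const fattn_kernel k = choose_kernel_hip_rocwmma(/*head_dim=*/128, /*n_tokens=*/1, /*ncols=*/8);
    std::printf("decode, D=128 -> %s\n",
                k == fattn_kernel::VEC ? "VEC" : k == fattn_kernel::TILE ? "TILE" : "WMMA");
    return 0;
}
```

The point of the guard is that on HIP+rocWMMA builds a predicted TILE shape with no compiled configuration degrades gracefully to VEC instead of trapping on device.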
@loci-agentic-ai-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: ROCm/HIP rocWMMA Optimization (PR #12)

Key Findings

Performance Impact Analysis

  • Minimal Core Degradation: The worst Response Time degradation (0.066%) occurs in _Vector_impl_data@plt, a PLT stub function unrelated to the actual code changes
  • Non-Critical Function Impact: Performance degradations affect STL template instantiation and dynamic linking overhead, not llama.cpp's core inference functions (llama_encode/decode, attention mechanisms, or model loading)
  • Power Consumption: Negligible change across all binaries (-0.0% for libllama.so), indicating optimization changes don't impact overall energy efficiency

Flame Graph & CFG Analysis

  • PLT Overhead Isolation: Flame graph reveals the degraded function represents only dynamic symbol resolution (7 ns total), not actual computation
  • Identical Assembly Code: CFG comparison shows byte-for-byte identical assembly between versions, confirming performance difference stems from binary layout changes rather than code generation
  • Infrastructure-Level Impact: The 0.066% degradation likely results from symbol table reorganization or GOT section alignment changes

Code Review Critical Insights

  • Significant Performance Gains: The PR delivers substantial improvements for ROCm/HIP users (up to 136% decode speedup at long contexts, 96% prefill improvement)
  • Platform-Specific Optimizations:
    • Adaptive KQ stride reduces memory footprint for small head dimensions (D ≤ 128)
    • Increased GPU occupancy (2x blocks per SM) optimizes RDNA3+ utilization
    • Decode path now uses tuned VEC/TILE kernels instead of underperforming WMMA
  • Architecture Isolation: Changes are properly gated to HIP/ROCm builds, preserving CUDA behavior

Risk Assessment

  • Code Complexity Increase: Added 53 lines of kernel selection prediction logic that duplicates existing selection algorithms
  • Maintenance Risk: Configuration prediction must stay synchronized with actual kernel selection logic
  • Binary Size Growth: ~4 MiB increase for HIP builds due to disabled kernel pruning (acceptable trade-off for stability)

Overall Assessment

Impact Evaluation

The changes represent high-value, low-risk optimization for the llama.cpp codebase:

  • Core Function Preservation: No impact on critical inference paths (attention mechanisms, model loading, tokenization)
  • Targeted Improvements: Addresses specific ROCm/HIP performance bottlenecks without affecting CUDA or CPU backends
  • Measurable Benefits: Delivers substantial performance gains (20-136% in decode scenarios) for affected hardware

Maintainability Considerations

  • Platform Fragmentation: Introduces HIP-specific code paths that require specialized testing and validation
  • Technical Debt: Kernel selection prediction logic creates maintenance overhead and potential synchronization issues
  • Documentation Need: Complex optimization logic requires comprehensive documentation for future maintainers

Future Performance Outlook

  • Scalability: Adaptive stride and occupancy optimizations position the codebase well for future RDNA architectures
  • Extensibility: The gated optimization pattern provides a template for future hardware-specific enhancements
  • Monitoring Requirements: Performance regression testing should include long-context decode scenarios across GPU architectures

Recommendation

Approve with monitoring: The PR delivers significant performance improvements for ROCm users while maintaining stability for other platforms. The minimal core degradation (0.066% PLT overhead) is acceptable given the substantial gains achieved. Implement performance monitoring for decode scenarios and consider refactoring kernel selection logic in future iterations to reduce maintenance complexity.

Priority: The changes address a critical performance regression in ROCm builds and should be merged to restore competitive performance for AMD GPU users in the llama.cpp ecosystem.

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a on October 29, 2025 12:13
@DajanaV DajanaV added the dev-stale label (Stale dev environment — dashboard not accessible) on Oct 30, 2025
@DajanaV DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV DajanaV closed this Oct 30, 2025
@DajanaV DajanaV deleted the upstream-PR16827-branch_lhl-rocm-wmma-tune branch October 30, 2025 15:25