Conversation

@peizhang56 commented Oct 7, 2025

  1. Added a hipBLASLt implementation for batched GEMM to improve benchmark performance.

  2. This acts as an alternative way to fix the core dump that occurs with export ROCBLAS_USE_HIPBLASLT=1.

  3. The feature is disabled by default and only applies to CDNA3.

  4. The feature should be removed once hipBLASLt redesigns grouped GEMM for rocBLAS, which could take some time.

  5. To use the feature (a minimal sketch of the mode dispatch follows this list):

  • export USE_HIPBLASLT_GROUPED_GEMM=1 enables the feature

  • export USE_HIPBLASLT_GROUPED_GEMM=2 runs the offline benchmark

  • export USE_HIPBLASLT_GROUPED_GEMM=3 runs with the best algorithm solution
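
For reference, here is a minimal C++ sketch of how an environment-variable gate like this could be read. Only the USE_HIPBLASLT_GROUPED_GEMM variable and its 1/2/3 values come from the description above; the enum and function names are illustrative assumptions, not the PR's actual identifiers.

```cpp
// Sketch only: not the PR's actual code. Reads the opt-in mode described
// above; anything unset or out of range falls back to "disabled".
#include <cstdlib>

enum hipblaslt_grouped_gemm_mode {
    GROUPED_GEMM_DISABLED      = 0, // default: feature off
    GROUPED_GEMM_ENABLED       = 1, // use the hipBLASLt grouped GEMM path
    GROUPED_GEMM_OFFLINE_BENCH = 2, // run the offline benchmark
    GROUPED_GEMM_BEST_ALGO     = 3, // run with the best known algo solution
};

static hipblaslt_grouped_gemm_mode get_grouped_gemm_mode() {
    const char * env = std::getenv("USE_HIPBLASLT_GROUPED_GEMM");
    if (env == nullptr) {
        return GROUPED_GEMM_DISABLED;
    }
    const int v = std::atoi(env);
    if (v >= 1 && v <= 3) {
        return static_cast<hipblaslt_grouped_gemm_mode>(v);
    }
    return GROUPED_GEMM_DISABLED;
}
```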

@peizhang56 requested a review from slaren as a code owner Oct 7, 2025 07:27
@github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Oct 7, 2025
@peizhang56 (Author)

@slaren This is for AMD GPUs, but the bot labeled it as Nvidia GPU.

@JohannesGaessler (Collaborator)

Does this PR contain code which was copy-pasted from another project?

@IMbackK (Collaborator) commented Oct 7, 2025

Without looking at this in detail, I would lean against doing this.
This is a rather involved (in terms of LOC) workaround for a bug in an external library that is not used by default.

I understand that it's frustrating that llama.cpp uses the kernels provided by Tensile on CDNA3, which vastly underperform relative to what the hardware can do.

Using hipBLASLt (directly, or indirectly via ROCBLAS_USE_HIPBLASLT) is already a workaround for the fact that AMD has thus far failed to unify the Tensile and TensileLite kernel pools, despite these being mostly equivalent, and has also failed to even halfway optimize the kernels provided by Tensile for CDNA3+.
This PR then applies a further workaround to avoid an external bug in the previous workaround.

I think this is a bridge too far, and we should instead use our position to pressure AMD into cleaning up this frankly embarrassing mess: first by fixing the immediate problem with the broken TensileLite solutions on CDNA3+, and then by solving the solution-pool problem, which affects every AMD GPU, as whether TensileLite or Tensile provides better performance is essentially random depending on which GPU you look at, with the difference not being 10 or 20% but often reaching the 1000+% range.

@peizhang56 (Author)

> Does this PR contain code which was copy-pasted from another project?

No, all the code was developed by me, including the idea and naming.

@peizhang56 (Author) commented Oct 8, 2025

> Without looking at this in detail, I would lean against doing this. This is a rather involved (in terms of LOC) workaround for a bug in an external library that is not used by default.
>
> I understand that it's frustrating that llama.cpp uses the kernels provided by Tensile on CDNA3, which vastly underperform relative to what the hardware can do.
>
> Using hipBLASLt (directly, or indirectly via ROCBLAS_USE_HIPBLASLT) is already a workaround for the fact that AMD has thus far failed to unify the Tensile and TensileLite kernel pools, despite these being mostly equivalent, and has also failed to even halfway optimize the kernels provided by Tensile for CDNA3+. This PR then applies a further workaround to avoid an external bug in the previous workaround.
>
> I think this is a bridge too far, and we should instead use our position to pressure AMD into cleaning up this frankly embarrassing mess: first by fixing the immediate problem with the broken TensileLite solutions on CDNA3+, and then by solving the solution-pool problem, which affects every AMD GPU, as whether TensileLite or Tensile provides better performance is essentially random depending on which GPU you look at, with the difference not being 10 or 20% but often reaching the 1000+% range.

Thanks for your review, and I totally understand your concern.
This is an effort to fix the hipBLASLt issue and improve performance to reveal the potential of AMD hardware.
Here are some reasons why we did this:

  1. There is an ongoing redesign of the workflow to fix the hipBLASLt/TensileLite core dump and performance issues once and for all, but it is going to be a big change and we don't have an ETA for it.
  2. The solution is purely an additional option that people may want to use for a fast AMD batched GEMM implementation while we are waiting for the fix.
  3. We understand the concern; therefore, we added control switches to turn it on and off. By default, the feature is off.

Please let me know if this changes your view.
