Conversation

@peizhang56 commented Oct 7, 2025

  1. Added a hipBLASLt implementation for batched GEMM to improve benchmark performance.

  2. This acts as an alternative way to fix the core dump that occurs with export ROCBLAS_USE_HIPBLASLT=1.

  3. The feature is disabled by default and only applies to CDNA3.

  4. The feature should be removed once hipBLASLt redesigns grouped GEMM for rocBLAS, which could take some time.

  5. To use the feature (a minimal sketch of the mode dispatch follows this list):

  • export USE_HIPBLASLT_GROUPED_GEMM=1 enables the feature

  • export USE_HIPBLASLT_GROUPED_GEMM=2 runs the offline benchmark

  • export USE_HIPBLASLT_GROUPED_GEMM=3 runs with the best algorithm solution
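
For reference, here is a minimal C++ sketch of how an environment-variable gate like this could be read. Only the USE_HIPBLASLT_GROUPED_GEMM variable and its 1/2/3 values come from the description above; the enum and function names are illustrative assumptions, not the PR's actual identifiers.

```cpp
// Sketch only: not the PR's actual code. Reads the opt-in mode described
// above; anything unset or out of range falls back to "disabled".
#include <cstdlib>

enum hipblaslt_grouped_gemm_mode {
    GROUPED_GEMM_DISABLED      = 0, // default: feature off
    GROUPED_GEMM_ENABLED       = 1, // use the hipBLASLt grouped GEMM path
    GROUPED_GEMM_OFFLINE_BENCH = 2, // run the offline benchmark
    GROUPED_GEMM_BEST_ALGO     = 3, // run with the best known algo solution
};

static hipblaslt_grouped_gemm_mode get_grouped_gemm_mode() {
    const char * env = std::getenv("USE_HIPBLASLT_GROUPED_GEMM");
    if (env == nullptr) {
        return GROUPED_GEMM_DISABLED;
    }
    const int v = std::atoi(env);
    if (v >= 1 && v <= 3) {
        return static_cast<hipblaslt_grouped_gemm_mode>(v);
    }
    return GROUPED_GEMM_DISABLED;
}
```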

@peizhang56 requested a review from slaren as a code owner Oct 7, 2025 07:27
@github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Oct 7, 2025
@peizhang56 (Author)

@slaren This is for AMD GPUs, but the bot labeled it as Nvidia GPU.

@JohannesGaessler (Collaborator)

Does this PR contain code which was copy-pasted from another project?

@IMbackK (Collaborator) commented Oct 7, 2025

Without looking at this in detail, I would lean against doing this.
This is a rather involved (in terms of LOC) workaround for a bug in an external library that is not used by default.

I understand that it's frustrating that llama.cpp uses the kernels provided by Tensile on CDNA3, which vastly underperform relative to what the hardware can do.

Using hipBLASLt (directly, or indirectly via ROCBLAS_USE_HIPBLASLT) is already a workaround for the fact that AMD has thus far failed to unify the Tensile and TensileLite kernel pools, despite these being mostly equivalent, and has also failed to even halfway optimize the kernels provided by Tensile for CDNA3+.
This PR then applies a further workaround to avoid an external bug in the previous workaround.

I think this is a bridge too far, and we should instead use our position to pressure AMD into cleaning up this frankly embarrassing mess: first by fixing the immediate problem with the broken TensileLite solutions on CDNA3+, and then by solving the solution-pool problem, which affects every AMD GPU, as whether TensileLite or Tensile provides better performance is essentially random depending on which GPU you look at, with the difference not being 10 or 20% but often reaching the 1000+% range.

@peizhang56 (Author)

> Does this PR contain code which was copy-pasted from another project?

No, all the code was developed by me, including the idea and naming.

@peizhang56 (Author) commented Oct 8, 2025

> Without looking at this in detail, I would lean against doing this. This is a rather involved (in terms of LOC) workaround for a bug in an external library that is not used by default.
>
> I understand that it's frustrating that llama.cpp uses the kernels provided by Tensile on CDNA3, which vastly underperform relative to what the hardware can do.
>
> Using hipBLASLt (directly, or indirectly via ROCBLAS_USE_HIPBLASLT) is already a workaround for the fact that AMD has thus far failed to unify the Tensile and TensileLite kernel pools, despite these being mostly equivalent, and has also failed to even halfway optimize the kernels provided by Tensile for CDNA3+. This PR then applies a further workaround to avoid an external bug in the previous workaround.
>
> I think this is a bridge too far, and we should instead use our position to pressure AMD into cleaning up this frankly embarrassing mess: first by fixing the immediate problem with the broken TensileLite solutions on CDNA3+, and then by solving the solution-pool problem, which affects every AMD GPU, as whether TensileLite or Tensile provides better performance is essentially random depending on which GPU you look at, with the difference not being 10 or 20% but often reaching the 1000+% range.

Thanks for your review, and I totally understand your concern.
This is an effort to fix the hipBLASLt issue and improve performance to reveal the potential of AMD hardware.
Here are some reasons why we did this:

  1. There is an ongoing redesign of the workflow to fix the hipBLASLt/TensileLite core dump and performance issues once and for all, but it is going to be a big change and we don't have an ETA for it.
  2. The solution is purely an additional option that people may want to use for a fast AMD batched GEMM implementation while we are waiting for the fix.
  3. We understand the concern; therefore, we added control switches to turn it on and off. By default, the feature is off.

Please let me know if this changes your view.
