Add hipblasLt implementation for batched gemm to improve performance for CDNA3 only #16457
Add support for rocm6.4
@slaren This is for AMD GPUs, but the bot labeled it as Nvidia GPU.
Does this PR contain code that was copy-pasted from another project?
Without looking at this in detail, I would lean against doing this. I understand that it's frustrating that llama.cpp uses the kernels provided by tensile on CDNA3, which vastly underperform relative to what the hardware can do. Using hipblaslt (directly or indirectly via ROCBLAS_USE_HIPBLASLT) is already a workaround for the problem that AMD has thus far failed to unify the tensile and tensilelt kernel pools, despite these being mostly equivalent, and has also failed to even halfway optimize the kernels provided by tensile for CDNA3+. I think this is a bridge too far, and we should instead use our position to pressure AMD into cleaning up this frankly embarrassing mess: first by fixing the immediate problem with the broken tensilelt solutions on CDNA3+, and then by solving the solution pool problem. That problem affects every AMD GPU, since whether tensilelt or tensile provides better performance is essentially random depending on which GPU you look at, with the difference not being 10 or 20% but often reaching into the 1000+% range.
No, all the code was developed by me, including the idea and the naming.
Thanks for your review. I totally understand your concern.
Please let me know if this changes your mind.
Added a hipblasLt implementation for batched gemm to improve benchmark performance.
This acts as an alternative way to fix the coredump that occurs when ROCBLAS_USE_HIPBLASLT=1 is exported.
The feature is disabled by default and only applies to CDNA3.
The feature should be removed once hipblasLt has redesigned the grouped gemm for rocblas, which could take some time.
To use the feature:
Use export USE_HIPBLASLT_GROUPED_GEMM=1 to enable the feature.
Use export USE_HIPBLASLT_GROUPED_GEMM=2 to run the offline bench.
Use export USE_HIPBLASLT_GROUPED_GEMM=3 to run the best algo solution.
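For readers who want to see how the three values could be interpreted, here is a minimal sketch of reading the environment variable into a mode. This is not the code from the PR: the enum, its member names, and both function names are hypothetical, and only the variable name and the values 1, 2 and 3 come from the description above. It assumes that any unset or unrecognized value leaves the feature off, matching the "disabled by default" statement.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical mode names; the PR description only documents the values 1, 2 and 3.
enum class GroupedGemmMode {
    Disabled     = 0,  // variable unset or unrecognized (feature is off by default)
    Enabled      = 1,  // use the hipblasLt batched gemm path
    OfflineBench = 2,  // run the offline bench
    BestAlgo     = 3,  // run the best algo solution
};

// Maps the raw string value to a mode. Anything other than "1", "2" or "3"
// is treated as Disabled (an assumption of this sketch).
inline GroupedGemmMode parse_grouped_gemm_mode(const char * value) {
    if (value == nullptr) {
        return GroupedGemmMode::Disabled;
    }
    const std::string s(value);
    if (s == "1") { return GroupedGemmMode::Enabled; }
    if (s == "2") { return GroupedGemmMode::OfflineBench; }
    if (s == "3") { return GroupedGemmMode::BestAlgo; }
    return GroupedGemmMode::Disabled;
}

// Reads USE_HIPBLASLT_GROUPED_GEMM from the process environment.
inline GroupedGemmMode grouped_gemm_mode_from_env() {
    return parse_grouped_gemm_mode(std::getenv("USE_HIPBLASLT_GROUPED_GEMM"));
}
```

Keeping the string parsing separate from the `std::getenv` call makes the mapping easy to check without touching the process environment.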