
Conversation

@bjj (Contributor) commented Mar 3, 2025

Extends PR #12032 based on comments from @IMbackK to support CDNA, which uses a warp size of 64.

The first commit changes all relevant uses of the WARP_SIZE define to use the current device's warp size. Once I did this properly, the heuristic changes I mentioned in the other PR were no longer necessary.
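
As a rough illustration of the idea (not the actual diff), here is a minimal sketch of querying the device's warp size at runtime instead of assuming the 32-wide NVIDIA default; the helper name `get_device_warp_size` is hypothetical:

```cpp
// Minimal sketch, assuming the standard CUDA/HIP runtime API; the helper name
// get_device_warp_size() is hypothetical and not the actual ggml-cuda code.
#include <cuda_runtime.h>

static int get_device_warp_size(int device) {
    int warp_size = 32;  // NVIDIA default
    // Reports 32 on NVIDIA GPUs and 64 on CDNA when the backend is built with HIP.
    cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, device);
    return warp_size;
}

// Before: dim3 block_dims(WARP_SIZE, nrows);                   // hard-coded 32
// After:  dim3 block_dims(get_device_warp_size(device), nrows);
```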

I added one assert on the total T_BLOCK_X * T_BLOCK_Y block size, based on the AMD docs. I fenced it for AMD only, but someone else may know whether the same limit (basically, don't exceed 4 * warp_size threads per block in total) applies to other architectures.
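
For context, a hedged sketch of what such an assert could look like; the `GGML_USE_HIP` guard, the plain `assert()`, and the helper name are assumptions about the surrounding code, while `T_BLOCK_X`/`T_BLOCK_Y` come from the PR text:

```cpp
// Sketch only, not the exact implementation in the PR.
#include <cassert>

static void check_block_dims(int t_block_x, int t_block_y, int warp_size) {
#ifdef GGML_USE_HIP
    // Per the AMD docs: keep a thread block within 4 wavefronts,
    // i.e. at most 4 * warp_size threads in total.
    assert(t_block_x * t_block_y <= 4 * warp_size);
#else
    (void) t_block_x; (void) t_block_y; (void) warp_size;
#endif
}
```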

The second commit removes a fence that prevented __launch_bounds__ from applying on HIP. This is a significant performance improvement for prompt processing (>15% at larger sizes) but a slight (<5%) penalty for token generation. I don't know enough about the multiple layers of kernel-sizing heuristics to dig into this further right now.
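
To illustrate the shape of that change (kernel name, bound value, and guard macro here are placeholders, not the actual ggml code):

```cpp
// Before, the launch bound was fenced off for HIP builds:
//   #if !defined(GGML_USE_HIP)
//   __launch_bounds__(256, 1)
//   #endif
// After, the annotation applies on both CUDA and HIP:
__global__ void __launch_bounds__(256, 1) example_kernel(float * dst) {
    dst[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;  // placeholder body
}
```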

This passes test-backend-ops on an MI100 (gfx908, CDNA) and a 3090.

@bjj requested a review from JohannesGaessler as a code owner on March 3, 2025 03:51
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 3, 2025
@sorasoras

Would it be faster on RDNA3 with a warp size of 64?

@IMbackK (Collaborator) commented Mar 3, 2025

RDNA3 does not support warp64 in HIP, as some instructions only work in warp32 mode.

@IMbackK closed this Mar 3, 2025