
Conversation

@JohannesGaessler
Collaborator

Fixes #9580.

As of right now the CUDA backend reports that FlashAttention with a head size of 256 is only supported on NVIDIA GPUs that are Volta or newer. However, on AMD GPUs and older NVIDIA GPUs a head size of 256 can also be enabled if the vector kernel is used for large batch sizes. The performance won't be great, but it will still be faster than the CPU. This PR adapts the CUDA code to enable this.
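
A minimal sketch of the idea, using a hypothetical `choose_fattn_kernel_hs256` helper (the real dispatch in the CUDA backend is structured differently and uses other names):

```cpp
// Illustrative sketch only: the actual logic lives in ggml/src/ggml-cuda.
// "cc" is the CUDA compute capability of the device (700 == Volta).
enum fattn_kernel_type {
    FATTN_KERNEL_TENSOR_CORE, // fast path, requires Volta or newer on NVIDIA
    FATTN_KERNEL_VEC,         // slower vector kernel, works on AMD and pre-Volta
};

// Pick a FlashAttention kernel for head size 256.
static fattn_kernel_type choose_fattn_kernel_hs256(int cc, bool is_amd) {
    if (!is_amd && cc >= 700) {
        return FATTN_KERNEL_TENSOR_CORE;
    }
    // AMD and older NVIDIA GPUs: no tensor-core kernel for head size 256,
    // but the vector kernel still works, so the op is reported as supported
    // instead of being offloaded to the CPU.
    return FATTN_KERNEL_VEC;
}
```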

Also, I noticed that the tests were only covering batch sizes < 8, which meant that some CUDA kernels were not being invoked at all. I changed the batch sizes to cover a wider range.
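
As a rough illustration (the batch sizes below are examples only, not the exact values used in the PR, and the real harness in tests/test-backend-ops.cpp builds its cases differently):

```cpp
#include <cstdio>

int main() {
    // Previously only batch sizes < 8 were exercised, so kernels specialized
    // for larger batches were never hit by the tests.
    const int batch_sizes[] = {1, 2, 4, 8, 16, 32, 64, 128};
    for (int nb : batch_sizes) {
        std::printf("run FlashAttention test case with batch size %d\n", nb);
    }
    return 0;
}
```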

@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs labels Sep 21, 2024
@JohannesGaessler JohannesGaessler merged commit a5b57b0 into ggml-org:master Sep 22, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

Development

Successfully merging this pull request may close these issues.

Bug: Gemma2 9B FlashAttention is offloaded to CPU on AMD (HIP)
