
Conversation


neilmehta24 commented on Mar 4, 2025

We are seeing that this change incorrectly disabled flash attention on Turing cards (cc=75) when llama.cpp was compiled only for Volta cards (cc=70). To fix this, check that the binary was compiled for Volta or greater and that the card is Turing or greater. If there is a better way to fix it, please advise.

To reproduce the breakage on the current build, compile with architecture 70 and without architecture 75, and generate with flash attention on a Turing card.
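For context, here is a minimal sketch of the check described above, under the assumption that the backend knows both the highest compute capability it was compiled for and the compute capability of the active device. The constant and function names (CC_VOLTA, CC_TURING, flash_attn_usable) are placeholders for illustration, not the actual ggml-cuda symbols.

```cpp
#include <cstdio>

// Minimal sketch of the proposed condition; the names below are placeholders,
// not the actual ggml-cuda symbols.

// Compute capabilities encoded as major*10 + minor, matching the cc=70/cc=75
// notation used above.
constexpr int CC_VOLTA  = 70;
constexpr int CC_TURING = 75;

// highest_compiled_cc: highest architecture the binary was built for
//                      (70 when compiled with architecture 70 only).
// device_cc:           compute capability of the card at runtime (75 = Turing).
bool flash_attn_usable(int highest_compiled_cc, int device_cc) {
    // Proposed check: compiled for Volta or newer AND running on Turing or
    // newer, rather than requiring kernels compiled for the device's exact
    // architecture.
    return highest_compiled_cc >= CC_VOLTA && device_cc >= CC_TURING;
}

int main() {
    // The broken scenario from the description: built for cc 70 only,
    // running on a Turing (cc 75) card. The proposed check keeps flash
    // attention enabled here.
    std::printf("flash attention usable: %s\n",
                flash_attn_usable(/*highest_compiled_cc=*/70, /*device_cc=*/75)
                    ? "yes" : "no");
    return 0;
}
```

In this formulation, a binary built only for cc 70 still enables flash attention on a cc 75 card, which is the case described in the report.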

github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 4, 2025
JohannesGaessler (Collaborator) commented

Please confirm whether or not #12222 fixes the issue. The fix in this PR is definitely not correct for all scenarios.

neilmehta24 (Author) commented

#12222 fixes the issue too. Will close this PR in favor of your fix. Thanks!

neilmehta24 closed this on Mar 6, 2025
