
Eval bug: b5335 breaks flash attention on 4070 #13430

@steampunque


Name and Version

b5335 server

Operating systems

Linux

GGML backends

CUDA

Hardware

4070

Models

any (tested with Qwen3 8B)

Problem description & steps to reproduce

Gibberish is generated when flash attention (FA) is turned on.
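For context, a repro along these lines should show the behavior (a sketch: the model path, port, and prompt are placeholders; `-fa` / `--flash-attn` is the server's flash-attention toggle):

```sh
# Hypothetical repro sketch -- model path and port are placeholders.
# Start the server with flash attention enabled:
./llama-server -m ./Qwen3-8B-Q4_K_M.gguf -fa --port 8080 &

# Any completion request then returns gibberish; rerun without -fa and output is normal.
curl http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 32}'
```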

The problem goes away after making the following change in the CUDA source file fattn-mma-f16.cuh (line 550 at b5335):

```
//constexpr bool use_cp_async = nstages == 1;
constexpr bool use_cp_async = 0;
```
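For reference, the workaround corresponds to a one-line patch along these lines (the file lives under ggml/src/ggml-cuda/ in the llama.cpp tree; `false` is the idiomatic equivalent of the `0` above). Since `use_cp_async` gates the cp.async asynchronous global-to-shared copy path when `nstages == 1`, forcing it off suggests that path is where the corruption occurs:

```diff
 // ggml/src/ggml-cuda/fattn-mma-f16.cuh, around line 550 at b5335
-constexpr bool use_cp_async = nstages == 1;
+constexpr bool use_cp_async = false;
```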

First Bad Commit

Unknown

Relevant log output

flash attention on:

bash-5.1$ lm Hello
郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦郦^Cbash-5.1$ 

flash attention off:

bash-5.1$ 
bash-5.1$ 
bash-5.1$ 
bash-5.1$ lm Hello
<think>
Okay, the user said "Hello". I need to respond appropriately. Since it's a greeting, I should acknowledge it and offer assistance. Let me keep it friendly and open-ended. Maybe ask how I can help them today. That way, they know I'm here to assist with any questions or tasks they might have. I should make sure the response is welcoming and not too formal. Let me check for any typos or errors. Alright, that should work.
</think>
