### Description
DeepSeek-R1 in f16 produces incorrect results on the flash attention
path, while DeepSeek-R1 in f32 works correctly on it. (Both work
correctly on the non-flash-attention path.)
This PR fixes the incorrect DeepSeek-R1 results on the flash attention
path. The results of qk and softmax(qk) appear to overflow in f16.
After changing the qk and softmax(qk) computation to use f32 and
storing the qk result as type `float`, the DeepSeek-R1 model works
correctly.
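As a minimal illustration of the overflow (Python/numpy, not the actual shader; the sizes and values are hypothetical): a q·k dot product accumulated in f16 can exceed the float16 maximum (~65504) even when every individual product is small.

```python
import numpy as np

head_dim = 256
q = np.full(head_dim, 16.0, dtype=np.float16)
k = np.full(head_dim, 16.0, dtype=np.float16)

# Accumulate the dot product in float16, as the f16 shader path would:
# each term is 16 * 16 = 256, but 256 terms sum to 65536 > 65504 -> inf.
qk_f16 = np.float16(0)
for i in range(head_dim):
    qk_f16 += q[i] * k[i]

# The same dot product accumulated in float32 is exact.
qk_f32 = q.astype(np.float32) @ k.astype(np.float32)

print(np.isinf(qk_f16))  # True: the f16 accumulator overflowed
print(qk_f32)            # 65536.0
```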
This PR includes the following changes:
1. Add head_idx boundary checking. This is not directly related to the
DeepSeek fix, but it is needed because
[ProgramManager::NormalizeDispatchGroupSize](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/webgpu/program_manager.cc#L22-L39)
dispatches more workgroups than requested when the count exceeds
`maxComputeWorkgroupsPerDimension`.
2. Use f32 for the computation of qk and softmax(qk).
3. Store the results of qk and the max/sum as `float` instead of q's
original type.
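Changes 2 and 3 can be sketched as follows (illustrative Python/numpy, not the actual WGSL shader; the function name and shapes are made up): q and k remain f16, but qk and the softmax max/sum are computed and stored in f32.

```python
import numpy as np

def attention_scores_f32(q, k):
    """q, k: float16 arrays. qk and softmax(qk) are computed in float32."""
    # qk is accumulated and stored as float32 ("float"), not q's f16 type.
    qk = q.astype(np.float32) @ k.astype(np.float32).T
    # The max and sum used by softmax are also tracked in float32.
    qk_max = qk.max(axis=-1, keepdims=True)
    e = np.exp(qk - qk_max)
    return e / e.sum(axis=-1, keepdims=True)

# With head_dim = 512 and elements of 16.0, each qk entry is
# 16 * 16 * 512 = 131072 > 65504, so a pure-f16 qk would be inf
# and the softmax would produce NaN. The f32 path stays finite.
q = np.full((2, 512), 16.0, dtype=np.float16)
k = np.full((4, 512), 16.0, dtype=np.float16)
scores = attention_scores_f32(q, k)   # finite; each row sums to 1
```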