
Conversation

@leloykun

Description

This PR implements a minimal backward pass for flash attention.

I got these results on my RTX 2060:

=== profiling manual attention (backward pass) ===
...
Self CPU time total: 11.139ms
Self CUDA time total: 1.721ms
=== profiling minimal flash attention (backward pass) === 
...
Self CPU time total: 31.466ms
Self CUDA time total: 629.000us

A ~2.7x speedup in self CUDA time (1.721 ms vs. 629 µs), though the self CPU time is higher.

Though my GPU can only handle size-16 blocks (vs. size-32 blocks on a T4).

@hypertseng

@leloykun hello Franz! I'm having some trouble with the code and flash attention. First, why does the attention-values sanity check return False when seq_len is lower than 32? This breaks inference, where seq_len is usually 1; I guess the block size may be causing it? Also, how do I choose an appropriate block size? Looking forward to your reply!

@leloykun
Author

leloykun commented Apr 17, 2024

Hi @hypertseng!

I believe it was because we weren't exiting the loops after going past the seq length. The forward pass should be fixed in my repo here: https://github.com/leloykun/flash-hyperbolic-attention-minimal
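For anyone hitting the same issue, here is a minimal sketch of the kind of guard involved, assuming the usual flash-attention-minimal tiling names (N = sequence length, Bc = column block size, Tc = number of column tiles). The real fix lives in the repo's kernels; this toy kernel only illustrates the early exit:

#include <cuda_runtime.h>

// Toy illustration only: with N < Bc * Tc, the tiled loops must stop at N
// instead of reading (and normalizing over) garbage past the sequence end.
__global__ void tiled_guard(const float* K, float* out, int N, int Bc, int Tc) {
    float acc = 0.0f;
    for (int j = 0; j < Tc; j++) {
        if (j * Bc >= N) break;          // whole tile starts past the end
        for (int y = 0; y < Bc; y++) {
            int idx = j * Bc + y;
            if (idx >= N) break;         // partial tail tile
            acc += K[idx];               // stand-in for the real S = QK^T work
        }
    }
    if (threadIdx.x == 0 && blockIdx.x == 0) out[0] = acc;
}

Without those two break statements, a seq_len below the block size makes the kernel fold out-of-range values into the softmax, which is exactly the sanity-check failure described above.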

@hypertseng

@leloykun Recently I found that the flash_attn_bwd implementation in your repo is slower than the manual implementation, and it seems to be mostly because of an implicit call to cudaDeviceSynchronize, which increases the CPU time a lot. Do you have any idea how to solve this problem?
By the way, I found that changing the atomicAdd to a normal add decreases the cudaDeviceSynchronize occupancy, but I don't know why; I'm a beginner at CUDA, haha.
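For reference, here is a hypothetical sketch of the accumulation pattern in question (the names and layout are illustrative, not the repo's exact code): every KV tile contributes to the same rows of dQ, so when different thread blocks handle different KV tiles, the update has to be atomic, and a plain add would be a data race. Atomics also serialize conflicting writes, which can change both the runtime and how the profiler attributes time.

#include <cuda_runtime.h>

// Hypothetical toy kernel: one thread block per KV tile j, one thread per
// query row. Every block adds into the same dQ rows, so the write must be
// atomic; a plain `dQ[...] += ...` here would race across blocks.
__global__ void accumulate_dq(float* dQ, const float* contrib, int N, int d) {
    int row = threadIdx.x;   // query row (toy sizes: N <= blockDim.x)
    int j = blockIdx.x;      // KV tile index
    if (row >= N) return;
    for (int x = 0; x < d; x++) {
        atomicAdd(&dQ[row * d + x], contrib[(j * N + row) * d + x]);
    }
}

Dropping the atomicAdd is only safe when exactly one thread in the whole grid writes each element.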

@FumoTime

FumoTime commented Jul 8, 2024

@hypertseng Most likely the cudaDeviceSynchronize time includes the kernel execution time: cudaDeviceSynchronize blocks the CPU until all queued GPU work finishes, so the profiler attributes the kernels' runtime to that wait. You can use CUDA events to time the kernel instead:

import torch

torch.cuda.reset_peak_memory_stats()
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()                      # marks the start on the CUDA stream
minimal_result = minimal_attn.forward(q, k, v)
end_event.record()                        # marks the end on the same stream
torch.cuda.synchronize()                  # wait until both events have completed

elapsed_time_ms = start_event.elapsed_time(end_event)           # GPU time between events
max_vram_MB = torch.cuda.max_memory_allocated() / (1024 * 1024)
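The elapsed time here is measured between the two events as they execute on the GPU stream, so host-side overhead (including the synchronize wait itself) doesn't inflate the measurement the way it inflates the profiler's self-CPU totals.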
