Sync to upstream main 20260121 #114

Open
LucasWilkinson wants to merge 472 commits into main from sync/upstream-main-20260121

Conversation

@LucasWilkinson (Collaborator)

No description provided.

tridao and others added 30 commits August 11, 2025 23:13
This hasn't been used since 2023-09
…Lab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size). In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), so
`cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong. This fix uses the shape of `q`, whose
leading dimension is always `batch_size`.
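A minimal sketch of the described fix, using NumPy in place of the library's tensor types (the function and variable names are illustrative, not the real interface):

```python
import numpy as np

def expand_cache_seqlens(cache_seqlens, q):
    # If cache_seqlens is a scalar, broadcast it to a vector of shape
    # (batch_size,). batch_size is read from q, which always has layout
    # (batch, seqlen, nheads, headdim); k_cache may instead be
    # (num_blocks, page_size, ...) when block_table / paged KV is used,
    # so its leading dimension is the wrong source for batch_size.
    batch_size = q.shape[0]
    if np.ndim(cache_seqlens) == 0:
        return np.full((batch_size,), cache_seqlens, dtype=np.int32)
    return np.asarray(cache_seqlens, dtype=np.int32)

q = np.zeros((4, 1, 8, 64))
print(expand_cache_seqlens(128, q))  # [128 128 128 128]
```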
Actually doesn't seem to make it faster
* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
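The LPT ordering in the first bullet is the classic longest-processing-time-first scheduling heuristic: visit the longest sequences first so the tail of the persistent-kernel schedule consists of short work items. A minimal sketch (illustrative, not the kernel's actual scheduler):

```python
def lpt_order(seqlens):
    # Longest-Processing-Time-first: return batch indices in order of
    # decreasing sequence length, so the last tiles assigned to SMs are
    # the cheapest and all SMs finish at roughly the same time.
    return sorted(range(len(seqlens)), key=lambda i: seqlens[i], reverse=True)

print(lpt_order([128, 4096, 16, 1024]))  # [1, 3, 0, 2]
```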
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it could deadlock. Initializing the dk and dv semaphores to zeros fixes this issue.
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse
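The optional LSE output mentioned above refers to the log-sum-exp of the attention scores, which stable softmax computes as a byproduct. A hypothetical sketch of such a kwarg (names assumed, not the library's actual API):

```python
import numpy as np

def attention_probs(scores, return_lse=False):
    # Numerically stable softmax over the last axis; optionally also
    # return the log-sum-exp (LSE) used for normalization, which callers
    # need e.g. for split-KV combination or backward passes.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    s = e.sum(axis=-1, keepdims=True)
    probs = e / s
    if return_lse:
        lse = (m + np.log(s)).squeeze(-1)
        return probs, lse
    return probs

probs, lse = attention_probs(np.array([[1.0, 2.0, 3.0]]), return_lse=True)
```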
* [BugFix] fix softcap condition

softcap should only be referenced when it is not None; currently the logic is reversed and results in an error

* [BugFix] fix sm80 cuteDSL error

1. The current condition on softcap is wrong and results in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.
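The reversed-guard bug described in the first bugfix has roughly this shape; a hypothetical minimal reproduction, not the actual kernel code:

```python
import math

def apply_softcap(scores, softcap=None):
    # Correct guard: only reference softcap when it is provided. The bug
    # was the reversed condition (`if softcap is None: ...`), which then
    # tried to use the None value and raised an error.
    if softcap is not None:
        return [softcap * math.tanh(s / softcap) for s in scores]
    return scores

print(apply_softcap([1.0, 100.0], softcap=30.0))
print(apply_softcap([1.0, 100.0]))  # unchanged when softcap is None
```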

* Fix typo of range_constexpr

* Fix seqlen
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
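The round-up division fix can be sketched as follows (the constant 128 is NumThreadsPerWarpGroup from the description; the function name is illustrative):

```python
NUM_THREADS_PER_WARPGROUP = 128

def num_consumer_warpgroups(num_consumers):
    # Ceiling division: yields at least 1 for any num_consumers > 0.
    # The previous truncating division returned 0 when num_consumers
    # (e.g. 32) was below 128, which broke barrier initialization.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

print(num_consumer_warpgroups(32), num_consumer_warpgroups(128), num_consumer_warpgroups(256))
# 1 1 2
```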
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml
drisspg and others added 30 commits January 9, 2026 17:42
* fix seqused in varlen bwd

* enable store zero for zero len seqused q

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing

* remove unnecessary reformatting

* reinstate changes
…ILab#2174)

* update row_max before safe overwrite

* move up row_max_prev
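The row_max bookkeeping above is part of the standard online-softmax recurrence: the previous row max must be saved before the running max is overwritten, because the rescale factor exp(m_prev - m) depends on both. A sketch of the recurrence (illustrative, not the kernel code):

```python
import math

def online_logsumexp(stream):
    # Streaming log-sum-exp: keep a running row max `m` and normalizer
    # `s`; rescale `s` by exp(m_prev - m) whenever a new value raises
    # the max. Reading m_prev *before* overwriting m is exactly the
    # ordering the fix above restores.
    m, s = float("-inf"), 0.0
    for x in stream:
        m_prev = m
        m = max(m, x)
        s = s * math.exp(m_prev - m) + math.exp(x - m)
    return m + math.log(s)

print(round(online_logsumexp([1.0, 2.0, 3.0]), 4))  # 3.4076
```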
…ab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation
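The paged-KV address calculation being sharded across threads above boils down to a block-table lookup per logical KV row; a hypothetical sketch:

```python
def kv_row_address(block_table, page_size, row):
    # Map a logical KV row index to (physical_block, in_page_offset)
    # through the page table, as paged-attention kernels do per thread.
    page, offset = divmod(row, page_size)
    return block_table[page], offset

print(kv_row_address([7, 2, 5], page_size=16, row=35))  # (5, 3)
```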
Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.
[Cute,Fwd,Sm100] Add r2p for local mask
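The XOR-of-two-bitmasks idea: a mask with ones at columns >= col_limit_left, XORed with a mask with ones at columns >= col_limit_right, leaves ones exactly in [col_limit_left, col_limit_right). A sketch with Python integers standing in for per-thread predicate registers:

```python
def dual_bound_mask(col_limit_left, col_limit_right, width=32):
    # Ones at bit positions >= limit: ~((1 << limit) - 1), truncated to width.
    full = (1 << width) - 1
    ge_left = full & ~((1 << col_limit_left) - 1)
    ge_right = full & ~((1 << col_limit_right) - 1)
    # XOR keeps exactly the positions where the two masks differ,
    # i.e. columns in [col_limit_left, col_limit_right).
    return ge_left ^ ge_right

print(bin(dual_bound_mask(2, 5)))  # 0b11100: bits 2, 3, 4 set
```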
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>