Sync to upstream main 20260121 #114

Open
LucasWilkinson wants to merge 472 commits into main from sync/upstream-main-20260121

Conversation

@LucasWilkinson (Collaborator)

No description provided.

tridao and others added 30 commits August 11, 2025 23:13
This hasn't been used since 2023-09
…Lab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size). In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), so
`cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong. This fix uses the shape of `q`, whose
leading dimension is always `batch_size`.
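A minimal sketch of the described fix, using NumPy in place of the library's tensor types (the function and variable names are illustrative, not the real interface):

```python
import numpy as np

def expand_cache_seqlens(cache_seqlens, q):
    # If cache_seqlens is a scalar, broadcast it to a vector of shape
    # (batch_size,). batch_size is read from q, which always has layout
    # (batch, seqlen, nheads, headdim); k_cache may instead be
    # (num_blocks, page_size, ...) when block_table / paged KV is used,
    # so its leading dimension is the wrong source for batch_size.
    batch_size = q.shape[0]
    if np.ndim(cache_seqlens) == 0:
        return np.full((batch_size,), cache_seqlens, dtype=np.int32)
    return np.asarray(cache_seqlens, dtype=np.int32)

q = np.zeros((4, 1, 8, 64))
print(expand_cache_seqlens(128, q))  # [128 128 128 128]
```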
Actually doesn't seem to make it faster
* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
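The LPT ordering in the first bullet is the classic longest-processing-time-first scheduling heuristic: visit the longest sequences first so the tail of the persistent-kernel schedule consists of short work items. A minimal sketch (illustrative, not the kernel's actual scheduler):

```python
def lpt_order(seqlens):
    # Longest-Processing-Time-first: return batch indices in order of
    # decreasing sequence length, so the last tiles assigned to SMs are
    # the cheapest and all SMs finish at roughly the same time.
    return sorted(range(len(seqlens)), key=lambda i: seqlens[i], reverse=True)

print(lpt_order([128, 4096, 16, 1024]))  # [1, 3, 0, 2]
```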
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it could deadlock. Initializing the dk and dv semaphores to zeros fixes this issue.
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse
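The optional LSE output mentioned above refers to the log-sum-exp of the attention scores, which stable softmax computes as a byproduct. A hypothetical sketch of such a kwarg (names assumed, not the library's actual API):

```python
import numpy as np

def attention_probs(scores, return_lse=False):
    # Numerically stable softmax over the last axis; optionally also
    # return the log-sum-exp (LSE) used for normalization, which callers
    # need e.g. for split-KV combination or backward passes.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    s = e.sum(axis=-1, keepdims=True)
    probs = e / s
    if return_lse:
        lse = (m + np.log(s)).squeeze(-1)
        return probs, lse
    return probs

probs, lse = attention_probs(np.array([[1.0, 2.0, 3.0]]), return_lse=True)
```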
* [BugFix] fix softcap condition

softcap should only be referenced when it is not None; currently the logic is reversed and results in an error

* [BugFix] fix sm80 cuteDSL error

1. The current condition on softcap is wrong and results in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.
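The reversed-guard bug described in the first bugfix has roughly this shape; a hypothetical minimal reproduction, not the actual kernel code:

```python
import math

def apply_softcap(scores, softcap=None):
    # Correct guard: only reference softcap when it is provided. The bug
    # was the reversed condition (`if softcap is None: ...`), which then
    # tried to use the None value and raised an error.
    if softcap is not None:
        return [softcap * math.tanh(s / softcap) for s in scores]
    return scores

print(apply_softcap([1.0, 100.0], softcap=30.0))
print(apply_softcap([1.0, 100.0]))  # unchanged when softcap is None
```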

* Fix typo of range_constexpr

* Fix seqlen
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
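The round-up division fix can be sketched as follows (the constant 128 is NumThreadsPerWarpGroup from the description; the function name is illustrative):

```python
NUM_THREADS_PER_WARPGROUP = 128

def num_consumer_warpgroups(num_consumers):
    # Ceiling division: yields at least 1 for any num_consumers > 0.
    # The previous truncating division returned 0 when num_consumers
    # (e.g. 32) was below 128, which broke barrier initialization.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

print(num_consumer_warpgroups(32), num_consumer_warpgroups(128), num_consumer_warpgroups(256))
# 1 1 2
```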
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml
drisspg and others added 30 commits January 9, 2026 17:42
* fix seqused in varlen bwd

* enable store zero for zero len seqused q

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing

* remove unnecessary reformatting

* reinstate changes
…ILab#2174)

* update row_max before safe overwrite

* move up row_max_prev
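The row_max bookkeeping above is part of the standard online-softmax recurrence: the previous row max must be saved before the running max is overwritten, because the rescale factor exp(m_prev - m) depends on both. A sketch of the recurrence (illustrative, not the kernel code):

```python
import math

def online_logsumexp(stream):
    # Streaming log-sum-exp: keep a running row max `m` and normalizer
    # `s`; rescale `s` by exp(m_prev - m) whenever a new value raises
    # the max. Reading m_prev *before* overwriting m is exactly the
    # ordering the fix above restores.
    m, s = float("-inf"), 0.0
    for x in stream:
        m_prev = m
        m = max(m, x)
        s = s * math.exp(m_prev - m) + math.exp(x - m)
    return m + math.log(s)

print(round(online_logsumexp([1.0, 2.0, 3.0]), 4))  # 3.4076
```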
…ab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation
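The paged-KV address calculation being sharded across threads above boils down to a block-table lookup per logical KV row; a hypothetical sketch:

```python
def kv_row_address(block_table, page_size, row):
    # Map a logical KV row index to (physical_block, in_page_offset)
    # through the page table, as paged-attention kernels do per thread.
    page, offset = divmod(row, page_size)
    return block_table[page], offset

print(kv_row_address([7, 2, 5], page_size=16, row=35))  # (5, 3)
```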
Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.
[Cute,Fwd,Sm100] Add r2p for local mask
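The XOR-of-two-bitmasks idea: a mask with ones at columns >= col_limit_left, XORed with a mask with ones at columns >= col_limit_right, leaves ones exactly in [col_limit_left, col_limit_right). A sketch with Python integers standing in for per-thread predicate registers:

```python
def dual_bound_mask(col_limit_left, col_limit_right, width=32):
    # Ones at bit positions >= limit: ~((1 << limit) - 1), truncated to width.
    full = (1 << width) - 1
    ge_left = full & ~((1 << col_limit_left) - 1)
    ge_right = full & ~((1 << col_limit_right) - 1)
    # XOR keeps exactly the positions where the two masks differ,
    # i.e. columns in [col_limit_left, col_limit_right).
    return ge_left ^ ge_right

print(bin(dual_bound_mask(2, 5)))  # 0b11100: bits 2, 3, 4 set
```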
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>