
Commit 6b9a223

MatthewBonanni, reubenconducts, johnnynunez, rajesh-s, and mikaylagawarecki authored
Merge upstream (#119)
* [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL (Dao-AILab#1858) * update num_threads based on num wgs * fix bug when not intra_wg_overlap and not mma_pv_is_rs * make FA3 compatible with CUDA 13 Builds (Dao-AILab#1860) Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0 when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128), leading to a compiler failure during barrier initialization. Changed to round-up division to ensure a minimum value of 1. * [BUILD] SBSA wheels + CUDA 13 Support (Dao-AILab#1865) * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * drop 12.4 * drop 12.4 * fix correct name * fix correct name * fix correct name * fix correct name * cibuildwheel.yml * benchmark: qualify all attention backends by methods list (Dao-AILab#1881) * ABI stable fa3 (Dao-AILab#1791) * squashed * fixes * fixes * Fix narrow * Add TORCH_STABLE_ONLY flag * new_empty + zero_ --> new_zeros * revert flash_api.cpp and add flash_api_stable.cpp * update setup.py * Only pass TORCH_STABLE_ONLY for stable build * Address Jane's comments * > to >= * [NVIDIA] Enable Blackwell Family Specific (Dao-AILab#1882) * fix typo * Update setup.py * Update setup.py * Update setup.py * Update setup.py * fix typo in flops calculation for local attention (Dao-AILab#1883) * flash-attn-cute bwd sm90 (Dao-AILab#1868) * [Cute] Make testing utils standlone for cute (Dao-AILab#1892) * Bump pin for CuTeDSL (Dao-AILab#1891) * Improve causal backward determinism perf with SPT schedule (Dao-AILab#1893) * add spt scheduler for causal bwd determinism * add new torch check for det hdim 256 to stable api * Upgrade to cutlass v4.2.1 (Dao-AILab#1905) * switch to use cutlass.utils.get_smem_capacity_in_bytes instead of deprecated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906) * Add Missing None Gradient in FA3 QKVPacked (Dao-AILab#1908) Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local> * C++11 fix warnings (Dao-AILab#1904) * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). 
* Update flash_api_stable.cpp * upstream cutlass v4.2.1 csrc * [Cute] Write ex2 emulation in a more readable form * [Cute] Simplify utils.py a bit * [Cute] Remove arith & vector import in utils.py * [CuteDSL] Fix test (Dao-AILab#1925) * Refactors to enable FlexAttention (Dao-AILab#1840) * Refactors to enable FlexAttention * Thread throught the buffers to the score_mod * add-test * add fastdivmod * comments * comments * [Cute] Fix softmax for cutlass-dsl==4.2.1 * [Cute] Fix softmax for fwd_sm100 * [Cute,Bwd] Simplify bwd_preprocessing kernel * [Cute,Fwd,Sm90] Simplify by passing around functions * [Cute,Fwd,Sm90] Simplify score mode by passing around partial fn * [Cute] Optionally dump cubin and sass * [Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n * [Cute,Bwd,Sm90] Format file w ruff * [Cute,Bwd,Sm90] Fix bwd dK & dV, more async * [Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum * [Cute,Bwd,Sm90] Use 1 barrier for loading both K & V * [Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead * [Cute,Bwd,Sm90] Use TMA to store dK & dV * [Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration * [Cute,Sm90] Move gemm helper functions to hopper_helpers.py * Swap masking to not use R2P * Pre-indent to make commit diffs readable * Adding varlen support + tests * Remove self refs in softmax for loop (Dao-AILab#1924) Co-authored-by: Tri Dao <tridao@users.noreply.github.com> * [Cute,Bwd,Sm90] Make postprocessing kernel work * [Cute] Run ruff format on bwd files * [CI] Add pre-commit GH action * [Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1 * [Cute,Bwd,Sm90] Make causal work * [Cute,Bwd,Sm90] Implement dQ_swapAB * [Cute,Bwd,Sm90] Implement SdP_swapAB * [AMD] Torch Compile Issues (Dao-AILab#1756) * fix rounding and dropout metdata bug * fix lse shape and bug in interface * return softmax is true * [Cute,Bwd,Sm90] Implement mma_dkv_is_rs * [Cute,Bwd,Sm90] Use block size 80x128 * [CUTE] Enable Pack GQA for score mods (Dao-AILab#1937) * Add precommit list and then uncomment in chunks (Dao-AILab#1941) * create list to work through * include ampere * [ROCm] prepare CK sources for pytorch hipify v2 APIs (Dao-AILab#1944) See pytorch/pytorch#151845. pytorch has removed caffe2, but hipify still contained work-arounds for caffe2 vs torch compatibility. As a result of hipify v2 changes, some torch APIs are changing. 
* [Cute] Add flake8 config file * [Cute,Fwd,Sm90] Load Q & K using the same mbarrier * [Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage * [Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups * [Cute,Bwd,Sm90] Implement masking * [Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in * [Cute,Fwd,Sm100] Clean up * [Cute,Fwd,Sm100] Clean up mask * [Cute] Reformat blackwell_helpers.py, block_info.py * [Cute] Format mma_sm100_desc.py, seqlen_info.py * sm100 bwd add kernel and update postprocess mask and barriers (Dao-AILab#1945) * [Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess * [Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n} * [Cute,Bwd,Sm100] Clean up a bit * add barrier module (Dao-AILab#1946) * [Cute,Bwd,Sm100] Have a separate function to set up the mma * [Cute,Bwd,Sm100] Load LSE with cpasync_bulk * [Cute,Bwd,Sm100] Load dPsum with cpasync_bulk * [Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO * [Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration * [Cute,Bwd,Sm100] Simplify mma by using functools.partial * [Cute,Bwd,Sm100] Don't need q_dk_consumer_state * [Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier * [Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max * [Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem * [Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one * [Cute,Bwd,Sm100] All compute warps wait for lse_barrier * [Cute,Bwd,Sm100] sdQaccum doesn't need swizzle * [Cute,Bwd,Sm100] Try gemm_ptx * [Cute,Bwd,Sm100] Clean up compute fn * [Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1 * [Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables * [Cute,Bwd,Sm100] Hardcode dS_stage = 1 * [Cute,Bwd,Sm100] Add option for delay tma store * Fix hopper cuda 13 build (Dao-AILab#1949) * [CuteDSL] Fix hash function for cute.jit decorator (Dao-AILab#1953) * Block Sparsity and Flex Attention mask mod support (Dao-AILab#1942) * clean up and rebase for PR * add mask mod tests * add benchmarking files * refactor for better style * remove extraneous csrc * type hint buffers * refactor: order of non/overlap and modify blocksparse producer to agree with dense * change variable name back to buffers * remove unnecessary variable in first_half_block * restore erroneous packgqa deletion * add blocksparsity and mask_mod asserts to interface.py * fix rebase issues * Restore submodule and reset pointer to upstream/main * rename cutlass.const_expr to const_expr * support fully masked m blocks (i.e. skipped tiles) * remove outdated commented code * cutlass v4.3.0 (Dao-AILab#1952) * [Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx * [Cute,Bwd,Sm100] More cleanup * [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (Dao-AILab#1961) * clean up and rebase for PR * add mask mod tests * add benchmarking files * refactor for better style * remove extraneous csrc * type hint buffers * refactor: order of non/overlap and modify blocksparse producer to agree with dense * change variable name back to buffers * remove unnecessary variable in first_half_block * restore erroneous packgqa deletion * add blocksparsity and mask_mod asserts to interface.py * fix rebase issues * Restore submodule and reset pointer to upstream/main * rename cutlass.const_expr to const_expr * support fully masked m blocks (i.e. 
skipped tiles) * remove outdated commented code * rename buffers -> aux_tensors, fix score_mod test in sm90 fwd * fix mask mod interface issues and tests * remove newline at end of file * format with ruff * format mask & sm100 with ruff * format more files with ruff * format barrier.py with ruff * Fix FA3 segfault with custom CUDA streams in ABI stable build (Dao-AILab#1957) The ABI stable implementation incorrectly used getCurrentStream().id() which returns a StreamId (int64_t) instead of the actual cudaStream_t pointer. Casting an integer ID to a stream pointer caused segmentation faults when using custom CUDA streams. Fixed by using the proper AOTI C API function aoti_torch_get_current_cuda_stream() which returns the actual CUDA stream pointer. * [Cute,Fwd,Sm100] Fix interface w score mod to get it to run * [Cute,Sm100] In gemm ptx, add to base smem_address instead * [Cute,Bwd,Sm100] Make postprocessing work, add interface * [Cute,Bwd,Sm100] Simplify layouts in compute_loop * [Cute,Bwd,Sm100] Causal mask * [Cute,Bwd,Sm100] Enable bwd tests * [Cute,Bwd] Enable bwd benchmarks * [Cute] Add store_shared_remote_fp32x4 util function * [Cute,Bwd,Sm100] Tune registers * [Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr * [Cute,Bwd,Sm100] Reduce sync * [Cute] Change utils.view_transpose back * [Cute,Bwd,Sm100] Remove delay_tma_store option * [Cute,Bwd,Sm100] Implement cluster Co-authored-by: Ted Zadouri <tz6037@princeton.edu> * [Cute] Copy benchmark util functions to cute directory Easier to benchmark without having to install FA2 * [Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum * [Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS * [Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load * [Cute] Blocks tweaks (Dao-AILab#1964) * [Cute,Bwd,Sm100] Use TS MMA for dK * [Cute,Blocksparse] Group block sparse input torch tensors * [Cute,Bwd,Sm100] Separate mma_S and mma_dP * [Cute,Bwd,Sm100] Try LPTBwdScheduler * [Cute,Bwd,Sm100] Try separating warps loading Q and dO * BlockSparse Tweaks (Dao-AILab#1970) * Tweaks * better errors * Switch to new API * [Cute] Fix main (Dao-AILab#1982) * [Cute,Fwd,Sm100] Implement SplitKV (Dao-AILab#1940) * Implement split KV * Remove modal bench harness * Fixes * [Cute] Extract block-sparse utilities from SM80/90 (Dao-AILab#1984) - Create block_sparse_utils.py with SM80/90 block-sparse logic - Refactor flash_fwd.py to use extracted utilities - Clean up whitespace in block_sparsity.py This extracts the block-sparse consumer loop and related utilities from flash_fwd.py into a reusable module for SM80/90 architectures. 
* Enable python-3.10+ (Dao-AILab#1998) * [Cute, Bwd, Sm100] Add GQA support (Dao-AILab#2004) * add gqa for sm100 bwd * remove mha guard for test * change to cluster size 1 * [Cute,Fwd,Sm100] fix major regression with split kv (Dao-AILab#2006) * [CuTe DSL] Block sparsity computation kernel (Dao-AILab#1983) * begin block sparsity computation kernel * block sparsity computation kernel and benchmark working * loop range_constexpr * add fast kernel * merge fast and regular kernel * use TensorSSA approach to mask mod * update with OOB check * tests and benchmarks for block sparsity working * remove extraneous files * Revert mask.py to previous state - removing unintended changes from block sparsity work * remove flex attn test stub * add sleeps to benchmark * correct block sparsity benchmark to use torch.compile * Restore missing mask definitions and fix benchmark window_size handling * move benchmarks into new directory * compute_block_sparsity docstring * streamline compute block sparsity benchmark script * [NVIDIA] bump github actions (Dao-AILab#1996) * Update GitHub Actions to use checkout@v5 and setup-python@v6; enhance compute capability support * revert changes * revert * Update publish.yml * Update publish.yml * Update publish.yml * Update publish.yml * cuda-toolkit@v0.2.29 * [Cute,Fwd,Sm100] Support paged attention (Dao-AILab#1999) * modal bench and correctness * implement for one thread per row * coalesced(?) gmem loads * use cp async * use 64 threads to load * fill in smem for V * pass tests * fixes * removed extra files * handle V loading for n_block < 0 * Add torch.compile support to flash attention 3 * Don't return mutated variables in mha_bwd * Change fake_check flag to be opt-in; Remove build.sh and remove if-else around `torch.library.custom_op` usage * Remove print statements and update exception message * Fix flash_attn_backward_fake * Add `safe_aot_autograd_check` * Update namespace to flash_attn_3 * Add `flash_attn_forward.register_autograd` * Fix bug in `flash_attn_backward_fake` * Add support and tests for torch.export and aoti_compile_and_package * format code * update flash_api_stable.cpp * Fix flash_api_stable.cpp build * Only run schema_check if dtype is not float8_e4m3fn * Correctly compute kBlockM for sm88/86/80 * Fix bug in boxed_mha_bwd * don't run autograd_check when num_splits > 0 * [Cute] Add block-sparsity support to SM100 (Dao-AILab#1985) - Implement block-sparse attention in flash_fwd_sm100.py - Update interface.py to handle SM100 block size calculations (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows) - Add mask_mod parameter support in mask.py for block-sparse masking - Add SM100 test fixtures and tile size handling in test_mask_mod.py This enables block-sparsity on SM 10.0 architecture, including mask_mod support and proper block size accounting. 
* [Cute,Sm100,Fwd] use correction warps for epi when not using TMA (Dao-AILab#2014) * use correction warps for epi when varlen (non tma O) * properly enable fallback epilogue for varlen q * fix rebase errors * update tests * Raise TypeError if out is specified when compiling _flash_attn_forward * add fastdivmod for oob reads in mask_mods (Dao-AILab#2020) * add fastdivmod for oob reads in mask_mods * Updates for h100 * don't pass mask_fn to softmax_step generically (Dao-AILab#2026) * swap order of decorators (Dao-AILab#2029) * [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions (Dao-AILab#2033) * enable deterministic mode for sm100 bwd and fix race conditions * turn off lpt scheduler for causal * use more regs for reduce when deterministic * make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release * use 100k iterations for default * [NFC] Trivial fix to silence linter (Dao-AILab#1928) Not much to see here, but this causes linter noise * Add LICENSE and AUTHORS to flash_attn/cute (Dao-AILab#2032) * [Cute] Add authors * [Cute,Fwd] enable mask mod without blocksparsity (Dao-AILab#2031) * Bump pin (Dao-AILab#2025) * Bump pin * Swtich to new fastdivmod * cleanup varlen on blackwell * Allow for only cute install * ruff all the smaller files (Dao-AILab#2040) * [Flash] Fix head dim 64 bwd (Dao-AILab#2035) * Add headdim64 tests (Dao-AILab#2041) * [Cute,Bwd,Sm100] Add local for sm100 bwd (Dao-AILab#2046) * add local for sm100 bwd * add deterministic * update tests * ruff files * remove old code * move comment * override window_size = None for causal * revert to fwd test defaults * Add hash attr to shortcut expensive check (Dao-AILab#2048) * [AMD ROCm] Update to latest composable_kernel to improve performance (Dao-AILab#2052) * Update CK and c++ version * update CK * update ck * Update comment to reflect qscale_type in fmha_fwd_traits --------- Co-authored-by: Jeff Huang <chiachi.huang@amd.com> * fixing cute bwd func def (Dao-AILab#2056) * Fix use-after-free in FA3 deterministic mode. The pytorch caching allocator actually saves us here, but if you turn it off, then compute-sanitizer will detect this. 
(Dao-AILab#2063) * [CUTE] Allow grads to be preallocated (Dao-AILab#2065) * [Cute,Fwd] Extend score_mod to variable sequence length (Dao-AILab#2043) * rebase to main * varlen support for score mod * interface change for varlen score mod * implement varlen support for score mod * varlen score mod working; updated tests * modify varlen score mod to use fastdiv_mods updated per sequence * updated test suite * current working state of varlen score mod * refactor varlen score mod tests * fix to transpose * refactor varlen score mod tests; fix bug; clean up varlen score mod application in kernel * refactor test_score_mod.py to use external score mod definition file * update flash_fwd.py for varlen score mod * sm90 varlen score mod working; test revisions * enable packgqa for varlen score mod; set up fastdiv_mod recomputation * update flash_fwd_sm100.py for recomputing fastdiv_mods & format varlen score mod test * Overwrite pack_gqa.py, tile_scheduler.py, and test_flash_attn.py with origin/main versions * rebase to main * fix test rebase artifacts * fix floor_if_packed redundancy * correct sm90 divmods mismatch * revert test_flash_attn to main * add varlen score mod benchmark script * packgqa for varlen (independent of score mod) * rm benchmark from PR * move score mod arg wrapping to utils.py * format with ruff * major refactor: change score_mod signature to accept seqlen_info and update all tests accordingly * reinstate varlen packgqa exclusion checks * move fastdiv_mods recomputation out of apply_score_mod in prep for varlen mask_mod support * remove duplicate fastdiv_mod recomputation * [Fix] fastdiv_mods for paged attn and seqused_* * clean up PR; fix paged_kv varlen for sm90 * update to varlen score mod test script (paged kv) * remove premature seqlen arguments from sm90 apply_mask_mod * [CUTE] Seeing if tvvm reduces cpu overhead (Dao-AILab#2042) * [FIRST] Fix softcap scoremod kwargs typo. (Dao-AILab#2072) * basics working (Dao-AILab#2070) * Blocksparse impl (Dao-AILab#2085) * Fix IMA in fwd on m boundary (Dao-AILab#2091) * Fix IMA in fwd on m boundary * Fix compeltely OOB loads * Update to dsl 3.4.3 (Dao-AILab#2092) * README for AMD ROCm (Dao-AILab#2068) * readme update for rocm Signed-off-by: seungrok.jung <seungrok.jung@amd.com> * readme update for rocm Signed-off-by: seungrok.jung <seungrok.jung@amd.com> --------- Signed-off-by: seungrok.jung <seungrok.jung@amd.com> * fix shuffle sync for pack gqa epilogue (Dao-AILab#2097) * improve paged cpasync * Enable Thor (Dao-AILab#2108) * [Cute] Add quack as dependency * [Cute,Fwd,Sm90] Change PipelineTMAAsync sublass to signal per warp Previous we signal per warp group, but that makes the code more complicated for a tiny bit of perf gain. * Add pack-gqa support for blcoksparse impl w/ braodcasted H dim (Dao-AILab#2098) * [Cute,Fwd] improved block sparsity (Dao-AILab#2100) * improved block sparsity computation * refactor blocksparsity computation for tvm-ffi * refactor mask mod definitions and tests * refactor of block sparsity and mask mod application; eventually allow varlen * remove fastdivmods from compute block sparsity * remove unnecessary imports * revert to 1-phase block sparsity computation * update bwd kernels to use new AttentionMaskCls api * fix linter error * [Cute] Fix minor lint issue in shuffle_sync * Misc tests that should be xfailed for now (Dao-AILab#2127) * Update cutlass to fix undefined symbol: cuDriverGetVersion. 
(Dao-AILab#2142) * [Cute,Fwd,Sm100] Support `q_stage=1` for inference (Dao-AILab#1993) * use q_stage=1 for split kv * determine q_stage via seqlen_q for sm100 * repurpose softmax1 warps for cp.async load * address comments * [Cute] Fix two tests that were failing (Dao-AILab#2149) * [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py The paged KV cache tests (test_score_mod_with_paged_kvcache and test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY to skip tests on SM90 since paged KV cache is only supported on SM100. However, the variable was never defined, causing a NameError. This adds the same definition used in test_mask_mod.py: COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0] * [Cute] Fix missing seqlen_info parameter in mask_mod call The mask_mod call in apply_mask_sm100_transposed was missing the seqlen_info parameter. All mask functions expect the signature: (batch, head, m_idx, n_idx, seqlen_info, aux_tensors) The other two mask_mod calls in the same file correctly pass all 6 arguments, but this one only passed 5, causing: TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors' This fixes test_mask_mod.py::test_mask_mod_ima_partial_block. * cleanup * [Cute, Bwd, Sm100] Add varlen for sm100 bwd (Dao-AILab#2150) * varlen bwd with rounded padded offsets * fix mha * change offset mode to round down multiple * enable varlen bwd tests * enable deterministic mode * fix deadlock and switch mha to no postprocess * reenable tests * fix lint error * use head swizzle/spt for deterministic, update tests * change padding offset based on arch * rebase and update interface, tests * add arch dispatch for padded offset q to postprocess * address comments * remove tile sizes from seqlen info class vars * block-sparse backward SM90 (Dao-AILab#2136) * score-mod backward SM90 (Dao-AILab#2137) * [Cute] Clarify and fix subtle cachekey bug (Dao-AILab#2143) * [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change (Dao-AILab#2146) * [CUTE][SM90]Enable pack-gqa with broadcasted maskmods (Dao-AILab#2145) * [CUTE][SM90] GQA backward non deterministic (Dao-AILab#2158) * [Cute,Bwd,Sm100] fix seqused in varlen bwd (Dao-AILab#2167) * fix seqused in varlen bwd * enable store zero for zero len seqused q * [CUTE] Bump cutedsl to 4.3.5 (Dao-AILab#2170) * [Cute,Flex] Add option to create and cache __cute_hash__ (Dao-AILab#2171) * add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing * remove unnecessary reformatting * reinstate changes * [Cute][Flex] Remove no longer needed contig (Dao-AILab#2172) * [Cute] update row_max before safe overwrite for online_softmax (Dao-AILab#2174) * update row_max before safe overwrite * move up row_max_prev * [Cute][Flex] add back in contig (Dao-AILab#2177) * [Cute][Flex]Add pack-gqa divmod (Dao-AILab#2180) * baseline local flops * [Cute,Fwd,Sm100] distributed offset calculation for paged KV (Dao-AILab#2104) * fully shard paged KV address calculation across threads * use t0 indices for static bound checking * increase tiled copy to full KV row * shrink predicate tensor * clarify paged KV divisibility constraints * increase load register allocation * Add R2P dual bound masking for local attention Add mask_r2p_dual_bound function using XOR of two bitmasks to efficiently mask elements outside [col_limit_left, col_limit_right) range for SM100 local attention. 
* remove benchmark result, undo changes to benchmark * Add R2P dual bound masking for local attention Add mask_r2p_dual_bound function using XOR of two bitmasks to efficiently mask elements outside [col_limit_left, col_limit_right) range for SM100 local attention. * switch from xor to mask_right & ~ mask_left * flip in_bound to out_bound * remove zero logic for right_s and left_s * remove 24 clamp * doc * lint * added back clamp to avoid "OverflowError: Python int too large to convert to C long" * add comment * [Cute][Flex] Fix expanded tensor bug (Dao-AILab#2189) * [Cute, SM90] fix fwd varlen Cute implementation bug for H100 (Dao-AILab#2194) * fix * same fix for bwd and SM80 * reduce chance of build oom (Dao-AILab#2079) * [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases (Dao-AILab#2187) * ci: Use 1 ninja job for cu13 (Dao-AILab#2195) Signed-off-by: oliver könig <okoenig@nvidia.com> * Update README to include 'psutil' package as build requirement (Dao-AILab#2210) Added 'psutil' as a build requirement in the README. * [Flex][SM100] Replay expand fix on sm100 (Dao-AILab#2209) stack-info: PR: Dao-AILab#2209, branch: drisspg/stack/6 * [DSL] Optionally patch cute-dsl to use system's ptxas * [AMD] Triton Backend for ROCm #3 (Dao-AILab#2178) * Fused Bwd (Dao-AILab#137) * Fused with Good perf and stride fixed Fix fused bugs isolate failing case fix bug bring back test cases rm split impl in fused use exp2 is global variable now try oom fix save make fused the default limit to reproduce failure return default to split fix head size bug use exp2 back to true * new grid * BLK_SLICE_FACTOR = 1 * add tflops * new commit * test in parrallel * strides added by jusson * disable alibi * fix bugs again * default to fused * add bwd options for varlen * backend filter * default to jingning and batch 4 * best fwd config * fix TRITON_PRINT_AUTOTUNING flag bug * tune * Tuning fwd prefill * add if else * use flag * Minor mask fix * FLIP GRID * use best config for default * print when autotuning * test bfloat16 * fix k and v stride bugs * skip bfloat16 * test kvpacked * disable internal tests * pick default config based on arch * Add alibi in the new bwd kernel (Dao-AILab#139) * enable alibi for jinging kernel enable alibi for jinging kernel match * save bad configs * fix alibi and causal bug * disable autotune by default * auto tune when benching is good * set best config * remove env var * Update amd_tests.yml * upgrad to triton==3.3.0 * increase shm * use 64 x 64 for now * save * handle 1d alibi * Add fp8 to fused kernel (Dao-AILab#140) * fp8 stuff find test case compute delta fp8 basic fp8 config passing non causal path works * isolate bad case * fix fp8 bug * didnot fix fp8 bug * back to failing test * fp8 tests passing * skip * skip ref tests --------- Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com> * head, seq, batch (Dao-AILab#141) * Fix keys (Dao-AILab#144) * save * rm keys * fix keys * use GHA_RENDER_DEVICES * normal docker * Pad LSE (Dao-AILab#148) * add round multiple * fix fwd * backward fix * use rounded lse flag * passing ROUNDED_LSE * default is new rounded mode * rename to fused_atmoics and fused_no_atomics * add test for torch_compile * add varlen torch compile test * add old one kernel for ref * fix varlen mismatch bug * fix shape issue in varlen but mismatch * sync torch compile kernel launch * simple varlen test * add debug code * rm old * ignore old impls * DEBUG flag works in interface only * ref uses the righ shape for lse * rm oldest bwd kernel * fix 
typo * fix varlen bug * fix bug. Get info from q for now * simple shape and stride checkout * add more tests * test kvcache * kvcache safe * match case * fix segfault due to bad return_softmax * run bench * run seperate for the main functions * just output benchmark * default csv format and time stamp files * non verbsoe bench * Sliding Window Forward (Dao-AILab#151) * Compress SWA work test case set up debug inputs add fwd ref one mask ref fwd first pass save ref doesnot work for bigger seqlens save new version some causal cases failing found bad cases working new attn new atten works new attn_fwd works reorg n_extra_tokens use seqlen_delta_qk ref fwd works add sliding window to bwd ref test kvcache decode ref work with everything except sliding window add debug code for 12 failing sliding window cases for decode attention_decode_forward_ref_impl mostly works except for alibi fix alibi in attention_decode_forward_ref_impl ref works with normal, varlen & kvcache move stuff around figure out masking old attn inner two inner functions remove load_fn do Lk - Lq like ref unify IS_CAUSAL code in epilogue clean up add args rm inference stuff simplify compute_masking simpler compute mask stub out returning front masking variables remove pointer pass compute ptrs inloop compute block min and max window stub inside inner mask loop trying to use attn_fwd_mask causes issues fix compiler bug when front masking gen specifc types add sliding window and debug statements use identity for v add more taste cases add comments save use k_max_token for clarity disable debug configs basic NON-CAUSAL SLIDING WINDOW non causal sliding window works on the all the shapes non sliding window working in fwd clean up fused bwd seperate old fwd_prefill move configs to utils.py * fix bwd ref bug * skip local cases so that fa output * no sliding window causal green * add backward test skip for sliding window * clean reduce in fwd_kvcache. no is_CASUAL branching * add kvcache masking * kvcache working * fix some bugs in test.py * clean up * Fix Device Segfault (Dao-AILab#152) * Compress segfault work fix backward segfault rework offset ignore .profile ignore .analysis save * assert the kernel launch device and tensor devices are the same * fix failing asserts * add asserts to fwd * Fix SDMASK bug * Log triton, torch and fa version * Fix fp8 import issues * fix docs (Dao-AILab#154) * Sliding Window block classification logic (Dao-AILab#155) * add aiter code * remove aiter stuff * sliding window non causal masking works * causal and sliding window block masking * extract common * clean up typo * helper for swa * ignore .amd * fix last block bug * Enable FA V3 (Dao-AILab#157) * Compress PA work narrow pa test ref works on most cases inplace ref with new_kv inplace paged attention add pa ref save pa basic paged works save fix swa + causal in pa. 
Also new_kv only on pa path passing build fa v3 import interface from fa v3 copy fa tests use v3 api clean up rename to match old test support different head sizes remove fp8 basisc passing v3 cases test_flash_attn_varlen_output v3 working isolate bad case for kvcache case passing save use decode is seqused/ cacheseql is given use decode if not varlen basci kvcache v3 working kvcache enable more cases detect kvcache case if seqused_q is non and sequese_k is not None skip failing test find fp8 failing case mha fp8 works fix fp8 MQA/GQA bug clean up more clean up clean up more don't need fp8 dead code remove train code with fp8 stuff fp8 working in kvcache paged + fp8 seems to be working new_kv allowed * clean up * skip hopper race test * clean up more * fix paged + alibi * similar inner paged api * unify _attn_fwd_inner * AITER integration (Dao-AILab#159) * clean up v2 interface * assert fp8 scale shapes * rotary working * move rotary to impl layers * remove einops * enable rotarry in v3 * create interface * fix descale assert * unify bwd * lint from aiter * clean fp8 api * add api change * assert shapes for v2 * remove ref and bench.py * remove metadata class and clean up * bwd_prefill * one bwd.py * rename * lint * add bwd_change (Dao-AILab#156) * Tune FP8 Perf (Dao-AILab#160) * check cu count for gfx942 * create get_cu_count * update repo root * update forward tune * clean up load * use float8_e4m3fnuz * save * show bwd mode * recommend fp8 * use torch.float32 for fp8 kernel * add both best fp16 and fp8 config * tune fp8 backward * descale factors should be b, hk * fp8 bwd working on all primus configs * tune bwd configs * fa v3 tests passing * better warning * clean up bwd launcher * v3 passing * tune more * improve perf * clean up * lint * clean * start tuning gfx950 * tune non causal path * fix bug * save * Skip configs where BLOCK_M2 % BLOCK_N2 != 0 * skip more * stop tuning * fix varlen bug * fix dropout & causal/swa segfault * update the to machine new changes * save * fix more bugs * remove random seed * clean up * update readme * print tensor stats for debug * disable sliding window tests * add rdna configs * fix k partial bug * fix block_size_n bug * fix type check bug --------- Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com> Co-authored-by: Tianxing Wu <tianxing.wu@amd.com> * fix compute_block_sparsity usage in benchmark_mask_mod (Dao-AILab#2221) * Fix shared-memory race (Dao-AILab#2229) * Use TORCH_TARGET_VERSION over TORCH_STABLE_ONLY (Dao-AILab#2155) * short readme for flex flash (Dao-AILab#2231) * [FA3] Mark current main version as v3.0.0 stable (Dao-AILab#2223) A collaboration between Flash-Attention, PyTorch and xFormers is trying to provide pre-built wheels for FA3 across as many platforms/environments as possible (e.g., ARM, Windows, CUDA 13, ...). To simplify the installation workflow, it would help to tag these packages as stable, but the current main version is tagged as beta. FA3 hasn't received substantial updates in a while (the latest was a bugfix almost two months ago), and most new development is happening in FA4. Thus, in this PR, I propose we just claim that the current main version _is_ stable. I have heard concerns that the feature set of FA3 doesn't currently match FA2 (e.g., dropout is missing). 
I think this concern is partly addressed by the fact that the new wheels will have a different name than the FA2 ones (`flash_attn_3` and `flash_attn` respectively), hence the former does _not_ claim to be a replacement for the latter, and the two can coexist (and they provide different modules). * hdim 192 smem fix (Dao-AILab#2235) * Add `FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON` env var support (Dao-AILab#2239) * Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON env var support Allows users to override triton config when not autotuning. * Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON to readme * Rename to FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON * [CUTE]Bump to Cutedsl (Dao-AILab#2216) Co-authored-by: Cursor <cursoragent@cursor.com> * pytest-dist round robin to gpus (Dao-AILab#2241) * [DSL] Replace old fence with cute.arch.fence_view_async_shared() * [DSL]Replace utils.{fma,mul,add}_packed_f32x2 with cute.arch version * [DSL] Remove coord_offset_i64, domain_offset_i64, elem_pointer_i64 Cute-dsl now supports i64 strides by default * [Sm90] Use functions from quack.sm90_utils * [DSL] Use cute.arch.warp_reduction_{max,sum} * [Layout] Use reshape_acc_to_mn and reshape_acc_to_frgA from quack * [Layout] Use quack.layout_utils.mma_partition_C_vec * [DSL] Use cute.math.{exp2,log2,log} * [Layout] Use layout_utils.transpose_view and select from quack * [Bwd,Sm90] Use quack.copy_utils * [Bwd,Sm100] Shorten PipelineTmaUmma create * [Bwd,Sm90] Have score_mod and score_mod_bwd as partial functions * [DSL] warpgroup_reg_alloc -> setmaxregister_increase * Fix Hopper tests (Dao-AILab#2242) --------- Signed-off-by: seungrok.jung <seungrok.jung@amd.com> Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Reuben Stern <107093092+reubenconducts@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> Co-authored-by: Johnny <johnnynuca14@gmail.com> Co-authored-by: Rajesh Shashi Kumar <35628747+rajesh-s@users.noreply.github.com> Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Henry Tsang <henrylhtsang@meta.com> Co-authored-by: Ted Zadouri <tedzadouri@gmail.com> Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com> Co-authored-by: jayhshah <jayhshah@gmail.com> Co-authored-by: brandonsun <brandons@nvidia.com> Co-authored-by: JackCharlesZhang <113156832+JackCharlesZhang@users.noreply.github.com> Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local> Co-authored-by: Tri Dao <tridpq@gmail.com> Co-authored-by: imbr92 <40306754+imbr92@users.noreply.github.com> Co-authored-by: Kevin Tong <kevin@augmentcode.com> Co-authored-by: Tri Dao <tridao@users.noreply.github.com> Co-authored-by: Michael Melesse <micmelesse@gmail.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Kevin Wang <kevmo314@gmail.com> Co-authored-by: Ted Zadouri <tz6037@princeton.edu> Co-authored-by: timmy-feng <70349932+timmy-feng@users.noreply.github.com> Co-authored-by: Guilherme Leobas <guilhermeleobas@gmail.com> Co-authored-by: Anakin(Yancheng) Zheng <103552181+anakinxc@users.noreply.github.com> Co-authored-by: Jean-Luc Duprat <jld@acm.org> Co-authored-by: Markus Hoehnerbach <mhoehnerbach@meta.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: Jeff Huang <chiachi.huang@amd.com> Co-authored-by: liangel-02 <liangel@meta.com> Co-authored-by: skarupke <malteskarupke@fastmail.fm> Co-authored-by: Leo Dong <leodong0315@gmail.com> Co-authored-by: seungrokj 
<144636725+seungrokj@users.noreply.github.com> Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com> Co-authored-by: Kareem <81531392+KareemMusleh@users.noreply.github.com> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Wang Lecheng <wanglecheng@stu.pku.edu.cn> Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com> Co-authored-by: Tianxing Wu <tianxing.wu@amd.com> Co-authored-by: zhuochen <zhuochen@outlook.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Co-authored-by: Luca Wehrstedt <luca.wehrstedt@gmail.com> Co-authored-by: Alex Butler <alexheretic@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
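
One of the merged fixes above (Dao-AILab#1860) describes replacing truncating integer division with round-up division so that `num_consumer_warpgroups_per_cluster` is never zero when `params.num_consumers` is smaller than `NumThreadsPerWarpGroup`. A minimal sketch of that arithmetic, written in Python purely for illustration (the actual fix lives in the C++ kernel setup code):

```python
NUM_THREADS_PER_WARPGROUP = 128

def consumer_warpgroups_per_cluster(num_consumers: int) -> int:
    # Truncating division: 32 // 128 == 0, which broke barrier initialization.
    # Round-up (ceiling) division guarantees a minimum of one warpgroup.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

assert consumer_warpgroups_per_cluster(32) == 1   # previously computed as 0
assert consumer_warpgroups_per_cluster(256) == 2
```
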
1 parent 1c81743 commit 6b9a223


46 files changed: 11,533 additions, 7,182 deletions

.github/workflows/_build.yml

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ jobs:
 # Limit MAX_JOBS otherwise the github runner goes OOM
 # nvcc 11.8 can compile with 2 jobs, but nvcc 12.3 goes OOM
 
-export MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] && echo 1 || echo 2)
+export MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] || [ "$MATRIX_CUDA_VERSION" == "130" ] && echo 1 || echo 2)
 export NVCC_THREADS=2
 export FLASH_ATTENTION_FORCE_BUILD="TRUE"
 export FLASH_ATTENTION_FORCE_CXX11_ABI=${{ inputs.cxx11_abi }}
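
For context, the change above gives CUDA 13.0 builds a single compile job, the same treatment CUDA 12.9 already received to avoid runner OOM. A sketch of the intended selection, in Python for illustration only (the workflow itself uses the shell one-liner shown above):

```python
def max_jobs(matrix_cuda_version: str) -> int:
    # nvcc for CUDA 12.9 and 13.0 uses enough memory per job that the GitHub
    # runner goes OOM with two parallel jobs; older toolkits keep two jobs.
    return 1 if matrix_cuda_version in ("129", "130") else 2

assert max_jobs("130") == 1
assert max_jobs("123") == 2
```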

README.md

Lines changed: 25 additions & 46 deletions
@@ -67,6 +67,7 @@ flash_attn_interface.flash_attn_func()
 - CUDA toolkit or ROCm toolkit
 - PyTorch 2.2 and above.
 - `packaging` Python package (`pip install packaging`)
+- `psutil` Python package (`pip install psutil`)
 - `ninja` Python package (`pip install ninja`) *
 - Linux. Might work for Windows starting v2.3.2 (we've seen a few positive [reports](https://github.com/Dao-AILab/flash-attention/issues/595)) but Windows compilation still requires more testing. If you have ideas on how to set up prebuilt CUDA wheels for Windows, please reach out via Github issue.
 
@@ -128,74 +129,52 @@ FlashAttention-2 ROCm CK backend currently supports:
 3. Both forward's and backward's head dimensions up to 256.
 
 #### Triton Backend
-The Triton implementation of the [Flash Attention v2](https://tridao.me/publications/flash2/flash2.pdf) is currently a work in progress.
+The Triton implementation of [Flash Attention](https://tridao.me/publications/flash2/flash2.pdf) supports AMD's CDNA (MI200, MI300) and RDNA GPUs using fp16, bf16, and fp32 datatypes. It provides forward and backward passes with causal masking, variable sequence lengths, arbitrary Q/KV sequence lengths and head sizes, MQA/GQA, dropout, rotary embeddings, ALiBi, paged attention, and FP8 (via the Flash Attention v3 interface). Sliding window attention is currently a work in progress.
 
-It supports AMD's CDNA (MI200, MI300) and RDNA GPU's using fp16, bf16 and fp32 datatypes.
-
-These features are supported in Fwd and Bwd
-1) Fwd and Bwd with causal masking
-2) Variable sequence lengths
-3) Arbitrary Q and KV sequence lengths
-4) Arbitrary head sizes
-5) Multi and grouped query attention
-6) Dropout
-7) Rotary embeddings
-8) ALiBi
-
-We are working on the following things
-1) Paged Attention
-2) Sliding Window
-3) FP8
-4) Performance Improvements
-
-##### Getting Started
-To get started with the triton backend for AMD, follow the steps below.
-
-First install the torch for ROCm from https://pytorch.org/get-started/locally/ if it is not installed. The torch and triton will be installed.
-
-Then install Flash Attention with the flag `FLASH_ATTENTION_TRITON_AMD_ENABLE` set to `"TRUE"`.
-
-```
+To install, first get PyTorch for ROCm from https://pytorch.org/get-started/locally/, then install Triton and Flash Attention:
+```sh
+pip install triton==3.5.1
 cd flash-attention
 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
 ```
 
-To test that things are working, you can run our tests. These tests take hours so you don't need to run the full thing.
-```
+To run the tests (note: full suite takes hours):
+```sh
 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pytest tests/test_flash_attn_triton_amd.py
 ```
 
-You can use autotune for better performance by using this flag `FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="TRUE"`
-```
-FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="TRUE" python $PATH_TO_CODE
-```
+For better performance, enable autotune with `FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="TRUE"`.
 
-###### Docker
-You can also use the Dockerfile below which does the above steps on top of the latest rocm/pytorch image.
+Alternativly, if _not_ autotuning, `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` may be used to set a single triton config overriding the hardcoded defaults for `attn_fwd`. E.g.
+```sh
+FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
 ```
+
+For a quick start with Docker:
+```dockerfile
 FROM rocm/pytorch:latest
 
 WORKDIR /workspace
 
-# install flash attention
-ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
+# install triton
+RUN pip install triton==3.5.1
 
-RUN git clone https://github.com/ROCm/flash-attention.git &&\
+# build flash attention with triton backend
+RUN git clone https://github.com/Dao-AILab/flash-attention &&\
 cd flash-attention &&\
-python setup.py install
+FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
 
 # set working dir
 WORKDIR /workspace/flash-attention
-```
 
-To build the docker file
-```
-docker build -t fa_triton .
+# set env variable to use triton backend
+ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
 ```
 
-To run the docker image
-```
-docker run -it --network=host --user root --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host --shm-size 16G --device=/dev/kfd --device=/dev/dri fa_triton
+Build and run:
+```sh
+docker build -t flash-attn-triton .
+docker run -it --network=host --user root --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host --shm-size 16G --device=/dev/kfd --device=/dev/dri flash-attn-triton
 ```
 
 ## How to use FlashAttention
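
A minimal usage sketch for the Triton backend described above. Assumptions: the package was built with `FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"` as shown, and the same variable is set before `flash_attn` is imported so the Triton path is selected at runtime:

```python
import os

# Must be set before flash_attn is imported so the Triton (AMD) backend is used.
os.environ["FLASH_ATTENTION_TRITON_AMD_ENABLE"] = "TRUE"

import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim) tensors on a ROCm device (exposed as "cuda")
q, k, v = (torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16) for _ in range(3))
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```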

csrc/cutlass

Submodule cutlass updated 1240 files

flash_attn/cute/README.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# Flash Attention CUTE

## Development Installation

1. Clone the repository (if you haven't already):
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/cute
```

2. Install in editable mode with dev dependencies:
```bash
pip install -e "./cute[dev]"
```

## Running Tests

```bash
pytest tests/cute/
```

## Linting

```bash
ruff check flash_attn/cute/
```
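
After the editable install above, a quick smoke test might look like the following. This is a sketch with assumptions: it presumes the Python entry point is `flash_attn.cute.interface.flash_attn_func` (the `interface.py` referenced throughout this commit) and that it returns the attention output directly; adjust the import and tolerances to the actual API:

```python
import torch
from flash_attn.cute.interface import flash_attn_func  # assumed entry point; see interface.py

q, k, v = (torch.randn(1, 512, 8, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))
out = flash_attn_func(q, k, v, causal=True)

# Compare against the PyTorch reference; SDPA expects (batch, heads, seqlen, headdim).
ref = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
).transpose(1, 2)
torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```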

flash_attn/cute/block_sparse_utils.py

Lines changed: 13 additions & 28 deletions
@@ -14,7 +14,6 @@
 
 # Import data structures from block_sparsity
 from flash_attn.cute.block_sparsity import BlockSparseTensors
-from flash_attn.cute import utils
 from flash_attn.cute import copy_utils
 from flash_attn.cute.named_barrier import NamedBarrierBwd
 
@@ -698,14 +697,14 @@ def handle_block_sparse_empty_tile_correction_sm100(
             row_max_value = sink_val * (LOG2_E / softmax_scale_log2)
             row_sum_value = Float32(1.0)
         else:
-            row_sum_value = row_sum_value + utils.exp2f(
-                sink_val * LOG2_E - row_max_value * softmax_scale_log2
+            row_sum_value = row_sum_value + cute.math.exp2(
+                sink_val * LOG2_E - row_max_value * softmax_scale_log2, fastmath=True
             )
         if tidx < m_block_size:
             scale_row_idx = tidx + stage * m_block_size
             sScale[scale_row_idx] = row_sum_value
             if const_expr(mLSE is not None or learnable_sink is not None):
-                sScale[scale_row_idx + m_block_size * 2] = row_max_value
+                sScale[scale_row_idx + q_stage * m_block_size] = row_max_value
         acc_flag = row_sum_value == Float32(0.0) or row_sum_value != row_sum_value
         stats[stage] = (row_sum_value, row_max_value, acc_flag)
@@ -1123,8 +1122,7 @@ def _load_q_do_block_sm90(
     else:
         pipeline_Q.producer_acquire(producer_state_Q)
     load_Q(m_block, producer_state=producer_state_Q)
-    with cute.arch.elect_one():
-        load_LSE(m_block, producer_state=producer_state_Q)
+    load_LSE(m_block, producer_state=producer_state_Q)
 
     producer_state_dO_cur = (
         producer_state_dO if const_expr(not Q_stage_eq_dO_stage) else producer_state_Q
@@ -1135,8 +1133,7 @@ def _load_q_do_block_sm90(
     else:
         pipeline_dO.producer_acquire(producer_state_dO_cur)
     load_dO(m_block, producer_state=producer_state_dO_cur)
-    with cute.arch.elect_one():
-        load_dPsum(m_block, producer_state=producer_state_dO_cur)
+    load_dPsum(m_block, producer_state=producer_state_dO_cur)
 
     producer_state_Q.advance()
     producer_state_dO.advance()
@@ -1253,10 +1250,10 @@ def consume_block_sparse_mma_bwd_sm90(
     is_causal: cutlass.Constexpr,
     is_local: cutlass.Constexpr,
     thr_mma_SdP,
-    softmax_scale,
-    seqlen,
-    subtile_factor: cutlass.Constexpr,
-    m_block_max: int,
+    score_mod_fn=None,
+    score_mod_bwd_fn=None,
+    subtile_factor: cutlass.Constexpr = 1,
+    m_block_max: int = 0,
     aux_tensors=None,
     fastdiv_mods=(None, None),
 ):
@@ -1318,15 +1315,9 @@ def consume_block_sparse_mma_bwd_sm90(
             consumer_state_Q,
             consumer_state_dO,
             mask_fn=mask_fn_partial,
+            score_mod_fn=score_mod_fn,
+            score_mod_bwd_fn=score_mod_bwd_fn,
             dKV_accumulate=dKV_accumulate,
-            thr_mma_SdP=thr_mma_SdP,
-            batch_idx=batch_idx,
-            head_idx=head_idx,
-            n_block=n_block,
-            softmax_scale=softmax_scale,
-            seqlen=seqlen,
-            aux_tensors=aux_tensors,
-            fastdiv_mods=fastdiv_mods,
         )
         dKV_accumulate = True
 
@@ -1342,15 +1333,9 @@ def consume_block_sparse_mma_bwd_sm90(
             consumer_state_Q,
             consumer_state_dO,
             mask_fn=mask_fn_full,
+            score_mod_fn=score_mod_fn,
+            score_mod_bwd_fn=score_mod_bwd_fn,
             dKV_accumulate=dKV_accumulate,
-            thr_mma_SdP=thr_mma_SdP,
-            batch_idx=batch_idx,
-            head_idx=head_idx,
-            n_block=n_block,
-            softmax_scale=softmax_scale,
-            seqlen=seqlen,
-            aux_tensors=aux_tensors,
-            fastdiv_mods=fastdiv_mods,
         )
        dKV_accumulate = True
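
The last two hunks above replace the raw per-tile arguments (`thr_mma_SdP`, `batch_idx`, `head_idx`, `n_block`, `softmax_scale`, `seqlen`, ...) with prebuilt `score_mod_fn` / `score_mod_bwd_fn` callables. A minimal sketch of that pattern with hypothetical names, showing how a caller can bind the context once with `functools.partial` and hand a single callable down instead of threading every argument through the MMA loop:

```python
import functools

def apply_score_mod(acc, *, batch_idx, head_idx, softmax_scale, seqlen, aux_tensors):
    # Hypothetical score-mod body; everything after `acc` is per-tile context.
    return acc * softmax_scale

# Context known once at the call site (dummy values for illustration).
ctx = dict(batch_idx=0, head_idx=3, softmax_scale=0.125, seqlen=4096, aux_tensors=None)
score_mod_fn = functools.partial(apply_score_mod, **ctx)

# The inner loop now only sees a one-argument callable, mirroring
# consume_block_sparse_mma_bwd_sm90(..., score_mod_fn=score_mod_fn).
print(score_mod_fn(2.0))  # 0.25
```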

flash_attn/cute/cute_dsl_ptxas.py

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
"""
System ptxas replacement for CUTLASS DSL.
Environment variables:
    CUTE_DSL_PTXAS_PATH - Path to ptxas (e.g., /usr/local/cuda/bin/ptxas)
    CUTE_DSL_PTXAS_VERBOSE - Set to 1 for verbose output
"""

import os
import sys
import re
import ctypes
import subprocess
from pathlib import Path

import cutlass


CUTE_DSL_PTXAS_PATH = os.environ.get("CUTE_DSL_PTXAS_PATH", None)
VERBOSE = os.environ.get("CUTE_DSL_PTXAS_VERBOSE", "0") == "1"

_original_load_cuda_library = None
_user_wanted_ptx = False  # True if user originally set CUTE_DSL_KEEP_PTX=1


def _log(msg):
    if VERBOSE:
        print(f"[ptxas] {msg}", file=sys.stderr)


def _get_ptx(compiled_func) -> tuple[str, Path] | None:
    """Find and read PTX file, stripping null bytes."""
    func_name = getattr(compiled_func, "function_name", None)
    if not func_name:
        return None

    dump_dir = os.environ.get("CUTE_DSL_DUMP_DIR", Path.cwd())
    for ptx_path in Path(dump_dir).glob(f"*{func_name}*.ptx"):
        content = ptx_path.read_text().rstrip("\x00")
        if ".entry " in content and content.rstrip().endswith("}"):
            _log(f"Found PTX: {ptx_path}")
            return content, ptx_path
    return None


def _compile_ptx(ptx_path: Path, ptx_content: str) -> bytes:
    """Compile PTX to cubin using system ptxas."""
    # Extract arch from PTX
    match = re.search(r"\.target\s+(sm_\d+[a-z]?)", ptx_content)
    arch = match.group(1) if match else "sm_90a"

    # Write stripped content back if needed
    if ptx_path.read_text() != ptx_content:
        ptx_path.write_text(ptx_content)

    # Compile
    cubin_tmp = ptx_path.with_suffix(".cubin.tmp")
    try:
        assert CUTE_DSL_PTXAS_PATH is not None
        result = subprocess.run(
            [CUTE_DSL_PTXAS_PATH, f"-arch={arch}", "-O3", "-o", str(cubin_tmp), str(ptx_path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"ptxas failed: {result.stderr}")

        cubin_data = cubin_tmp.read_bytes()
        _log(f"Compiled {ptx_path.name} -> {len(cubin_data)} bytes ({arch})")

        # Save cubin if CUTE_DSL_KEEP_CUBIN is set
        if os.environ.get("CUTE_DSL_KEEP_CUBIN", "0") == "1":
            cubin_out = ptx_path.with_suffix(".cubin")
            cubin_out.write_bytes(cubin_data)
            _log(f"Saved: {cubin_out}")

        return cubin_data
    finally:
        cubin_tmp.unlink(missing_ok=True)


def _patched_load_cuda_library(self):
    """Replacement for _load_cuda_library that uses system ptxas."""

    result = _get_ptx(self)
    if not result:
        _log("PTX not found, falling back to embedded ptxas")
        return _original_load_cuda_library(self)

    ptx_content, ptx_path = result

    try:
        cubin = _compile_ptx(ptx_path, ptx_content)
    except Exception as e:
        _log(f"Compilation failed ({e}), falling back to embedded ptxas")
        return _original_load_cuda_library(self)

    # Load cubin
    import cuda.bindings.runtime as cuda_runtime

    err, library = cuda_runtime.cudaLibraryLoadData(cubin, None, None, 0, None, None, 0)
    if err != cuda_runtime.cudaError_t.cudaSuccess:
        _log(f"cudaLibraryLoadData failed ({err}), falling back to embedded ptxas")
        return _original_load_cuda_library(self)

    # Register kernels on all devices
    _, cuda_load_to_device = self._get_cuda_init_and_load()
    lib_ptr = ctypes.c_void_p(int(library))
    dev_id = ctypes.c_int32(0)
    err_val = ctypes.c_int32(0)
    args = (ctypes.c_void_p * 3)(
        ctypes.cast(ctypes.pointer(lib_ptr), ctypes.c_void_p),
        ctypes.cast(ctypes.pointer(dev_id), ctypes.c_void_p),
        ctypes.cast(ctypes.pointer(err_val), ctypes.c_void_p),
    )

    for dev in range(self.num_devices):
        dev_id.value = dev
        cuda_load_to_device(args)
        if err_val.value != 0:
            _log("cuda_load_to_device failed, falling back to embedded ptxas")
            return _original_load_cuda_library(self)

    _log(f"Loaded kernel from {ptx_path.name}")

    # Delete PTX if user didn't originally want it kept
    if not _user_wanted_ptx:
        ptx_path.unlink(missing_ok=True)

    return [cuda_runtime.cudaLibrary_t(lib_ptr.value)]


def patch():
    """Install system ptxas hook. Call before importing cutlass."""
    global _original_load_cuda_library, _user_wanted_ptx

    assert CUTE_DSL_PTXAS_PATH is not None
    if not os.path.isfile(CUTE_DSL_PTXAS_PATH) or not os.access(CUTE_DSL_PTXAS_PATH, os.X_OK):
        raise RuntimeError(f"ptxas not found: {CUTE_DSL_PTXAS_PATH}")

    # Track if user originally wanted PTX kept
    _user_wanted_ptx = os.environ.get("CUTE_DSL_KEEP_PTX", "0") == "1"
    # os.environ['CUTE_DSL_KEEP_PTX'] = '1'
    assert os.environ.get("CUTE_DSL_KEEP_PTX", "0") == "1", (
        "Require CUTE_DSL_KEEP_PTX=1 to use system's ptxas"
    )

    cls = cutlass.cutlass_dsl.cuda_jit_executor.CudaDialectJitCompiledFunction
    _original_load_cuda_library = cls._load_cuda_library
    cls._load_cuda_library = _patched_load_cuda_library
    _log("Patch applied")
    return
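
A usage sketch for the module above. The import path follows the file location shown here; `patch()` itself requires `CUTE_DSL_KEEP_PTX=1` and an executable at `CUTE_DSL_PTXAS_PATH`, and the path and verbose flag are read at module import time, so set the environment variables first:

```python
import os

# Point the DSL at a system ptxas and keep the dumped PTX so the hook can find it.
os.environ["CUTE_DSL_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"
os.environ["CUTE_DSL_KEEP_PTX"] = "1"
os.environ["CUTE_DSL_PTXAS_VERBOSE"] = "1"  # optional: log what the hook does

from flash_attn.cute.cute_dsl_ptxas import patch

patch()  # installs the _load_cuda_library hook; later cute.jit compiles go through the system ptxas
```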
