Sync the vLLM fork of flash-attention (`origin/main`) with the upstream Dao-AILab/flash-attention (`upstream/main`), preserving downstream-specific features and performance optimizations.
- Workspace: `/home/LucasWilkinson/code/flash-attention`
- Downstream (vLLM fork): `origin/main`
- Upstream (Dao-AILab): `upstream/main`
- Last upstream sync base: `d836a6bf09bf3838c6e71c9cf675b3708fea0d71`
- Current sync branch: `sync/upstream-main-20260121`
PR #78 - Attention Sinks Performance Boost:
- Introduces the `n_offset` approach for local attention optimization
- Modifies `block.h` to return `cute::tuple<int, int, int>` (includes `n_offset`)
- Shifts KV pointers directly instead of computing `n_block_min`
PR #70 - Varlen Combine Scheduler:
- Changes to `flash_fwd_combine_kernel.h` and `flash_fwd_combine_launch_template.h`
- Adds `int b` to scheduler structs
Context Parallelism (CP):
- Adds `cp_world_size`, `cp_rank`, and `cp_tot_seqused_k` parameters
- Adds `tot_seqlen_k` to `SeqlenInfo_t`
PyTorch TORCH_LIBRARY bindings:
- `csrc/common/pytorch_shim.h` and `csrc/common/registration.h`
- `mha_fwd_tuple` wrapper for correct return type
vLLM-specific interface:
- `vllm_flash_attn/flash_attn_interface.py`
- `hopper/flash_api_torch_lib.cpp` (now merged into `flash_api.cpp`)
Current branch: `sync/upstream-main-20260121` (tracking `origin/sync/upstream-main-20260121`)

```
997fc13 fix compile error
35756f5 restore changes
2cf8a1f Merge remote-tracking branch 'upstream/main' into sync/upstream-main-20260121
```
- `hopper/block.h` - Reset to origin/main (has `n_offset`)
- `hopper/flash.h` - Modified
- `hopper/flash_api.cpp` - Modified (removed `static_cast`, aligned `attention_chunk` logic)
- `hopper/flash_api_torch_lib.cpp` - DELETED (merged into `flash_api.cpp`)
- `hopper/flash_fwd_combine_kernel.h` - Should match origin/main
- `hopper/flash_fwd_combine_launch_template.h` - Modified
- `hopper/flash_fwd_launch_template.h` - Modified (removed `attention_chunk_divmod`)
- `hopper/mainloop_fwd_sm90_tma_gmma_ws.hpp` - Modified (removed `attention_chunk_divmod`)
- `hopper/mask.h` - Modified (removed `attention_chunk_divmod`)
- `hopper/tile_scheduler.hpp` - Modified
- Upstream uses `attention_chunk_divmod` for the attention chunking feature
- Downstream (PR #78) uses `n_offset` for the local attention performance optimization
- These approaches are mutually exclusive in the current implementation
Resolution: remove `attention_chunk_divmod` from downstream and use the `n_offset` approach from PR #78:
- `block.h::get_n_block_min_max` returns `n_offset` instead of computing `n_block_min` for local attention
- `mask.h` - remove `attention_chunk_divmod` from constructor and masking logic
- `mainloop_fwd_sm90_tma_gmma_ws.hpp` - remove `attention_chunk` from `Params` and `Arguments`
- `flash_fwd_launch_template.h` - remove `attention_chunk` from mainloop_args
- `hopper/mask.h` - Removed `attention_chunk_divmod` member and constructor param
- `hopper/flash_fwd_launch_template.h` - Removed `attention_chunk` from args
- `hopper/mainloop_fwd_sm90_tma_gmma_ws.hpp` - Removed `attention_chunk` from Params/Args
```bash
cd /home/LucasWilkinson/code/vllm
source /mnt/data/engine/lwilkinson/vllm/.venv/bin/activate
export VLLM_FLASH_ATTN_SRC_DIR=/home/LucasWilkinson/code/flash-attention
VLLM_DISABLE_SCCACHE=1 python setup.py build_ext --inplace
```

```bash
cd /home/LucasWilkinson/code/vllm
source /mnt/data/engine/lwilkinson/vllm/.venv/bin/activate
VLLM_DISABLED_BACKENDS=flashinfer chg run -g 1 -- \
  /mnt/data/engine/lwilkinson/vllm/.venv/bin/python -m pytest \
  tests/v1/attention/test_attention_backends.py -v
```

```bash
VLLM_DISABLED_BACKENDS=flashinfer chg run -g 1 -- \
  /mnt/data/engine/lwilkinson/vllm/.venv/bin/python -m pytest \
  tests/v1/attention/test_mla_backends.py -v
```

After removing `attention_chunk_divmod`, rebuild and fix any remaining compile errors.
Files still referencing `attention_chunk` (may need review):
- `hopper/flash_api.cpp` - API parameter (keep, but don't use in divmod)
- `hopper/flash.h` - `Flash_fwd_params.attention_chunk` member
- `hopper/mainloop_bwd_sm80.hpp` - backward pass
- `hopper/mainloop_bwd_sm90_tma_gmma_ws.hpp` - backward pass
- `hopper/flash_bwd_launch_template.h` - backward pass
- `hopper/flash_api_stable.cpp` - stable API
- `hopper/flash_attn_interface.py` - Python interface
- `hopper/test_*.py` - tests
These should match downstream exactly:
- `hopper/block.h`
- `hopper/flash_fwd_combine_kernel.h`
- `hopper/flash_fwd_combine_launch_template.h`
- `csrc/common/pytorch_shim.h`
- `csrc/common/registration.h`
- Function signatures should use `int64_t` (PyTorch requirement)
- No unnecessary `static_cast<int>` for standard parameters
- `attention_chunk` should be passed through but not used in `attention_chunk_divmod`
The key test case that validates the PR #78 fix:

```bash
pytest tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine[google/gemma-3-1b-it]
```

Once validated:

```bash
cd /home/LucasWilkinson/code/flash-attention
git diff upstream/main > /tmp/flash-attn-upstream.patch
```

- PR #78: Attention Sinks Perf Boost
- PR #70: Varlen combine scheduler
- PR #93: Context parallelism (broke local attention, fixed by reverting to n_offset approach)
The `attention_chunk` parameter exists in:
- `Flash_fwd_params.attention_chunk` (in `flash.h`) - KEEP as API field
- `flash_api.cpp` - KEEP as function parameter, assigned to `params.attention_chunk`
- `flash_attn_interface.py` - KEEP in Python interface
But `attention_chunk_divmod` logic is REMOVED from:
- `block.h` - No `attention_chunk_divmod` parameter
- `mask.h` - No `attention_chunk_divmod` member or masking logic
- `mainloop_fwd_sm90_tma_gmma_ws.hpp` - No `attention_chunk` in Params/Arguments
- `flash_fwd_launch_template.h` - No `attention_chunk` in mainloop_args
This means `attention_chunk` is accepted by the API but has no effect in the forward kernel; the `n_offset` approach from PR #78 handles local attention instead.
- FlashInfer tests fail due to a separate environment issue (`PagedParams` missing `k_page_stride`/`v_page_stride`) - use `VLLM_DISABLED_BACKENDS=flashinfer` to skip
- The `attention_chunk` feature from upstream is NOT currently used by vLLM
- Port 3333 is preferred over 8000 for server tests
- Always check for zombie processes before GPU tests
```bash
# Navigate to workspace
cd /home/LucasWilkinson/code/flash-attention

# Check current branch
git branch -v

# See modified files
git status

# Diff with downstream baseline
git diff origin/main -- hopper/

# Diff with upstream
git diff upstream/main -- hopper/
```