
Paged attention changes to THD attention #3

Draft

sudhakarsingh27 wants to merge 259 commits into sudhakarsingh27:te_gemma_generation_support from cyanguwa:paged_attention

Conversation

@sudhakarsingh27
Owner

Description

Checking how difficult it is to merge Paged Attention changes into THD Attention changes
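For context, a minimal sketch of the two ideas being combined (names, shapes, and the page table below are illustrative, not TE's actual API): paged attention stores K/V in fixed-size pages addressed through a per-sequence page table, while THD layout packs variable-length sequences back to back and tracks boundaries with cumulative sequence lengths.

```python
import torch

PAGE_SIZE = 16                       # tokens per physical page
num_pages, h, d = 64, 8, 64
k_cache = torch.zeros(num_pages, PAGE_SIZE, h, d)   # paged K storage
v_cache = torch.zeros(num_pages, PAGE_SIZE, h, d)   # paged V storage

# page_table[b][i] = physical page holding logical page i of sequence b
page_table = [[0, 3],                # seq 0: 18 tokens -> pages 0 and 3
              [1]]                   # seq 1:  7 tokens -> page 1

def write_kv(seq: int, pos: int, k: torch.Tensor, v: torch.Tensor):
    """Append one token's K/V at logical position `pos` of sequence `seq`."""
    page = page_table[seq][pos // PAGE_SIZE]
    slot = pos % PAGE_SIZE
    k_cache[page, slot] = k
    v_cache[page, slot] = v

# THD layout: all sequences packed along one token dimension; cu_seqlens
# marks sequence boundaries, so q for this packed batch is [25, h, d].
seqlens = torch.tensor([18, 7], dtype=torch.int32)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                        torch.cumsum(seqlens, 0).to(torch.int32)])
# cu_seqlens == tensor([ 0, 18, 25], dtype=torch.int32)
```

Merging the two means the THD kernels must gather K/V through the page table instead of assuming contiguous storage, which is what this PR probes.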

timmoon10 and others added 30 commits November 21, 2024 18:15
* Add helper function to convert C++ container to string

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Align RNG tracker with Megatron (see the sketch after this commit list)

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>

* Fix module_params order and warmup bug in cudagraph

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>

* Add fp8_group argument and fix fp8 accuracy issue for cudagraph

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>

* Add TE modules and weights filters to support MoE models

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>

* Revert self.fp8

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* Use hooks to filter module params

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* Filter all TE modules in hooks

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>

* Format code

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update graph.py

Signed-off-by: Xin Yao <yaox12@outlook.com>

* Revert CudaRNGStatesTracker

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* Format Update

Signed-off-by: Yifei Song <yifeis@nvidia.com>

* Revert "Use hooks to filter module params"

This reverts commit 73a22e2.

Signed-off-by: Yifei Song <yifeis@nvidia.com>

* Remove filtering module params

Signed-off-by: Robin Zhang <robinz@nvidia.com>

---------

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Signed-off-by: Yifei Song <yifeis@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
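For readers following along, the Megatron-style RNG tracker referenced above follows a fork-and-restore pattern; a minimal sketch (simplified from Megatron-LM's CudaRNGStatesTracker, not TE's exact code):

```python
import contextlib
import torch

class CudaRNGStatesTracker:
    """Tracks named CUDA RNG states so regions (e.g. tensor-parallel
    dropout) can fork deterministically and restore afterwards."""

    def __init__(self):
        self.states = {}

    def add(self, name: str, seed: int):
        # Save the current state, seed a new named stream, record it.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name: str = "model-parallel-rng"):
        # Swap in the named state, run the region, then save the advanced
        # state back and restore the default stream.
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states[name])
        try:
            yield
        finally:
            self.states[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)
```

Keeping this aligned with Megatron matters for CUDA-graph capture, since replayed graphs must see the same RNG stream layout.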
Moved framework-agnostic THD kernels to common.

---------

Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
* retain_graph=True for grouped gemm

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove an unnecessary retain_graph=True

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make retain_graph in graph capture configurable (see the sketch after this commit list)

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
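A hedged sketch of what making retain_graph configurable buys (the function name here is illustrative): callers that replay the same autograd graph, e.g. during CUDA-graph warmup with grouped GEMM, can keep the graph alive, while the default still frees buffers after one backward pass.

```python
import torch

def warmup_backward(loss: torch.Tensor, retain_graph: bool = False):
    # Expose retain_graph as an argument instead of hard-coding it:
    # retain_graph=True keeps saved activations so backward can run
    # again on the same graph; the default frees them to save memory.
    loss.backward(retain_graph=retain_graph)

x = torch.randn(4, 4, requires_grad=True)
loss = (x * x).sum()
warmup_backward(loss, retain_graph=True)   # graph kept: can backward again
warmup_backward(loss)                      # second pass frees the graph
```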
* Update list of CI users

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update list of CI users

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
…age (NVIDIA#1308)

* draft implementation

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* compile error fix

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix compile error

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove print

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Edit comments

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* edit the bulk-overlap test case

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add version guard

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add runtime version guard

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix the version guard

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
…1347)

Scale sequence length in CP tests to avoid tiny sizes.

Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Debug jobs to deploy nightly docs

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Store module extra state in tensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>
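The idea, roughly (a sketch; `scaling_meta` is a made-up stand-in for the real extra state such as FP8 metadata): serialize the module's extra state into a uint8 tensor via the standard nn.Module get_extra_state/set_extra_state hooks, so it rides through state_dict like any other tensor and survives tensor-only checkpoint formats.

```python
import pickle
import torch
import torch.nn as nn

class ModuleWithExtraState(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical non-tensor state we want checkpointed.
        self.scaling_meta = {"scale": 1.0, "amax_history_len": 1024}

    def get_extra_state(self) -> torch.Tensor:
        # Pack the state into a byte tensor for state_dict().
        data = pickle.dumps(self.scaling_meta)
        return torch.frombuffer(bytearray(data), dtype=torch.uint8)

    def set_extra_state(self, state: torch.Tensor) -> None:
        # Unpack on load_state_dict().
        self.scaling_meta = pickle.loads(state.numpy().tobytes())

m = ModuleWithExtraState()
sd = m.state_dict()                  # contains '_extra_state' as a tensor
m2 = ModuleWithExtraState()
m2.load_state_dict(sd)               # round-trips scaling_meta
```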
* always have padding mask type for both flash and fused attentions

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove a redundant assert

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Debug Mcore integration test

Avoid FP8 on Ampere and older. Generate synthetic data instead of depending on external data.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
* cuDNN normalization integration
* TE Norm refactor
* TE Norm APIs changes.

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
…DIA#1364)

* Bug fix: use a default factory to avoid sharing mutable default values (see the sketch below)
---------

Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
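The bug class being fixed, in miniature (a generic sketch, not the actual TE dataclass):

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # `sizes: list = []` is rejected by dataclasses outright (and in
    # plain classes or function defaults it is silently shared by all
    # users); default_factory builds a fresh list per instance instead.
    sizes: list = field(default_factory=list)

a, b = Config(), Config()
a.sizes.append(128)
assert b.sizes == []    # no cross-instance sharing
```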
* fix ctx.aval_out indexing for workspace
* add cudnn init to prepare phase of norm custom calls
* add thread_local for norm registry instance
---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Add Jeremy to ci users

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* softmax custom calls with correct encapsulates

* rm jax deprecated features

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…VIDIA#1358)

* draft implementation of fsdp2 fp8 all gather

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix the convergence issue

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add warning

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable lint error

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the lint error

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix lint error

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint error

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint error

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* add comments

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* add ref

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* add related tests

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
add max_t for KV

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Add util functions to attn_mask_type

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add util functions to qkv_layout

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix THD cross reference code

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove explicit segment_pad, encoding it into segment_ids (see the sketch after this commit list)

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add jax.jit, replace _token with segment_ids, rename bias shape enum

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add comment for make_mask

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Clean code

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add doc strings for the added functions

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove cache for fa deterministic which causes UT failed

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Rename fixture to avoid conflict

Signed-off-by: Reese Wang <rewang@nvidia.com>

---------

Signed-off-by: Reese Wang <rewang@nvidia.com>
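The segment_pad-into-segment_ids encoding above can be pictured with a small sketch (values illustrative): padding tokens take segment id 0, real tokens take positive ids, and one comparison yields both the segment mask and the padding mask.

```python
import numpy as np

segment_ids_q = np.array([1, 1, 1, 2, 2, 0, 0])    # 0 marks padding
segment_ids_kv = np.array([1, 1, 2, 2, 2, 0, 0])

# Tokens attend only within their own segment...
same_segment = segment_ids_q[:, None] == segment_ids_kv[None, :]
# ...and never to or from padding (id 0).
not_pad = (segment_ids_q[:, None] != 0) & (segment_ids_kv[None, :] != 0)
mask = same_segment & not_pad    # True where attention is allowed
```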
cyanguwa and others added 4 commits March 15, 2025 02:34
* Add options to comm overlap tests

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Fix Typo

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Update tests/pytorch/distributed/run_layer_with_overlap.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
cyanguwa and others added 25 commits March 15, 2025 06:11
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Create pytorch/dot_product_attention module and pytorch/d_p_a/utils.py
Move attention logging into a separate class in pytorch/d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Create FlashAttentionUtils class in pytorch/d_p_a/utils.py for versioning info
Move versioning info out of pytorch/attention.py

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Move AttentionParams and get_attention_backend from attention.py to d_p_a/utils.py
Fix tests and imports for the above refactor change

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move get_qkv_layout(), get_full_mask(), get_alibi(), get_attention_quantizers() to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move tensor packing and unpacking helper functions from pyt/attention.py to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move cumulative seqlens and indices methods from pyt/attention.py to d_p_a/utils.py
Rename cumulative functions from _cu_ to _cumul_ to differentiate them from the CUDA `cu*` call convention (see the sketch after this commit list)
Rename tensor packing methods with a leading underscore to mark them as internal to the file

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary imports in pytorch/attention.py and d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Create d_p_a/inference.py and move InferenceParams from pyt/attention.py to it
Modify tests and other files to import InferenceParams correctly

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

Modify docs api for InferenceParams

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Create d_p_a/rope.py and move RoPE methods from pytorch/attention.py to it

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Code cleanup

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug surfaced by QA testing
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incorrect pack_tensor arg type
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* nit: Resolve lint errors

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove typedef FAUtils for FlashAttentionUtils
Use attn_log instead of att_log

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

Fix lint error

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Revert the function name from get_cumul back to the earlier get_cu

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* nit: Fix typos, explicit imports and remove extra comments

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
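As referenced in the cumulative-seqlens commit above, a minimal sketch of deriving cu_seqlens from a padding mask and packing a padded batch into THD layout (helper names and shapes illustrative):

```python
import torch

mask = torch.tensor([[1, 1, 1, 0],     # seq 0: 3 valid tokens
                     [1, 1, 0, 0]])    # seq 1: 2 valid tokens
seqlens = mask.sum(dim=1)
# Prepend a zero so cu_seqlens[i]:cu_seqlens[i+1] spans sequence i.
cu_seqlens = torch.nn.functional.pad(
    torch.cumsum(seqlens, 0), (1, 0)).to(torch.int32)
# cu_seqlens == tensor([0, 3, 5], dtype=torch.int32)

x = torch.randn(2, 4, 8, 64)           # padded [b, s, h, d]
packed = x[mask.bool()]                # THD-packed [5, 8, 64]
```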
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
…IDIA#1554)

* support tp-comm-overlap in Current Scaling recipe

Signed-off-by: Li Tao <lit@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean

Signed-off-by: Li Tao <lit@nvidia.com>

* fix test recipe argument to generalize to MXFP8

Signed-off-by: Li Tao <lit@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reduce duplicated transpose in certain cases

Signed-off-by: Li Tao <lit@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use per_tensor_scaling() to decide between DS and CS

Signed-off-by: Li Tao <lit@nvidia.com>

* minor fixes

Signed-off-by: Li Tao <lit@nvidia.com>

* change comment description

Signed-off-by: Li Tao <lit@nvidia.com>

* add multi-layer unit test for tp overlap

Signed-off-by: Li Tao <lit@nvidia.com>

* support test cases that run several times

Signed-off-by: Li Tao <lit@nvidia.com>

* avoid saving the ub tensor in prepare_for_saving

Signed-off-by: Li Tao <lit@nvidia.com>

* fix

Signed-off-by: Li Tao <lit@nvidia.com>

* switch to a simple fix

Signed-off-by: Li Tao <lit@nvidia.com>

* formatting

Signed-off-by: Li Tao <lit@nvidia.com>

* simplify test cases; avoid an additional clone()

Signed-off-by: Li Tao <lit@nvidia.com>

* fall back to get_buffer in layernormmlp

Signed-off-by: Li Tao <lit@nvidia.com>

* use 2 layers for the FP8 tp-overlap multi-layer test for better tolerance; limit max GPUs for the test

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Li Tao <lit@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Add issue template

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make GPU info section

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Do not create multiple cublas handles (see the sketch after this commit list)

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for multiple GPUs per thread

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix multithreaded execution

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix from conflict

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
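The per-thread handle caching described above, sketched as a Python analogue (the C++ fix itself manages cublasHandle_t; `object()` stands in for cublasCreate):

```python
import threading

_tls = threading.local()

def get_handle(device: int):
    """Return one cached handle per (thread, device) instead of creating
    a fresh handle on every call."""
    cache = getattr(_tls, "handles", None)
    if cache is None:
        cache = _tls.handles = {}
    if device not in cache:
        cache[device] = object()    # stand-in for cublasCreate()
    return cache[device]

h0 = get_handle(0)
assert get_handle(0) is h0          # reused within the same thread
```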
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* DistOpt support with offloading

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>

* Added distopt support for TE2.0

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>

* Restricted this to MCore DistOpt only

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>

* Added guards

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/module/linear.py

Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>

* Update transformer_engine/pytorch/module/layernorm_linear.py

Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* [QA] Add error handling

- Standardize test failure handling using the unified 'test_fail' and 'error_exit' functions.

Signed-off-by: Linxi Ding <linxid@nvidia.com>

* Update script to use explicit python3, pip3, and python3 -m pytest calls

- Change pip to pip3.
- Change python to python3.
- Change pytest to python3 -m pytest.

Signed-off-by: Linxi Ding <linxid@nvidia.com>

---------

Signed-off-by: Linxi Ding <linxid@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>