Paged attention changes to THD attention #3
Draft
sudhakarsingh27 wants to merge 259 commits into sudhakarsingh27:te_gemma_generation_support from …
Conversation
* Add helper function to convert C++ container to string
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Align RNG tracker with Megatron
* Fix module_params order and warmup bug in cudagraph
* Add fp8_group argument and fix FP8 accuracy issue for cudagraph
* Add TE modules and weights filters to support MoE models
* Revert self.fp8
* Use hooks to filter module params
* Filter all TE modules in hooks
* Format code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Update graph.py
* Revert CudaRNGStatesTracker
* Format update
* Revert "Use hooks to filter module params" (reverts commit 73a22e2)
* Remove filtering of module params
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Signed-off-by: Yifei Song <yifeis@nvidia.com>
Co-authored-by: Yifei Song <yifeis@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Moved framework-agnostic THD kernels to common.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
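For context, a minimal sketch of the THD packed layout these kernels operate on: variable-length sequences are concatenated along the token dimension into a `[total_tokens, heads, dim]` tensor and addressed through cumulative sequence lengths. The helper name `pack_thd` is illustrative, not TE's actual API.

```python
import torch

def pack_thd(seqs):
    # cu_seqlens holds cumulative sequence lengths, so sequence i occupies
    # rows cu_seqlens[i]:cu_seqlens[i+1] of the packed tensor.
    lens = torch.tensor([s.shape[0] for s in seqs])
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                            lens.cumsum(0).to(torch.int32)])
    packed = torch.cat(seqs, dim=0)  # [sum(lens), heads, dim]
    return packed, cu_seqlens

seqs = [torch.randn(n, 8, 64) for n in (5, 3, 7)]
thd, cu_seqlens = pack_thd(seqs)
print(thd.shape, cu_seqlens)  # torch.Size([15, 8, 64]), tensor([0, 5, 8, 15])
```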
* retain_graph=True for grouped GEMM
* Remove an unnecessary retain_graph=True
* Make retain_graph in graph capture configurable
* Fix typo
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
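A minimal sketch of why retain_graph matters during CUDA graph warmup: when a module (e.g. one using grouped GEMM with shared buffers) needs its autograd graph kept alive across backward calls, the warmup backward must pass retain_graph=True, which the commit above makes configurable instead of hard-coded. This is an illustrative pattern under that assumption, not TE's make_graphed_callables implementation.

```python
import torch

def warmup_and_capture(module, sample, num_warmup=3, retain_graph=False):
    # Warmup must run on a side stream before CUDA graph capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(num_warmup):
            out = module(sample)
            # retain_graph is now a knob: some modules need the autograd
            # graph preserved between backward passes, most do not.
            out.sum().backward(retain_graph=retain_graph)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = module(sample)  # captured forward, replayed via g.replay()
    return g, static_out
```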
* Update list of CI users
Signed-off-by: Tim Moon <tmoon@nvidia.com>
…age (NVIDIA#1308)
* Draft implementation
* Fix compile errors
* Remove print
* Edit comments
* Edit the bulk-overlap test case
* Add version guard
* Add runtime version guard
* Fix the version guard
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
…1347) Scale sequence length in CP tests to avoid tiny sizes.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Debug jobs to deploy nightly docs
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Store module extra state in tensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>
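For background, PyTorch modules can carry serializable extra state via get_extra_state/set_extra_state; storing it as a tensor (rather than an arbitrary Python object) keeps it flowing through state_dict machinery like any other tensor. A hedged sketch of the general pattern, with an invented fp8_meta payload, not TE's actual implementation:

```python
import pickle
import torch

class ModuleWithExtraState(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fp8_meta = {"scale": 1.0, "amax_history_len": 1024}  # hypothetical

    def get_extra_state(self) -> torch.Tensor:
        # Serialize metadata into a uint8 tensor so it is stored in the
        # state_dict under '<prefix>._extra_state' as a plain tensor.
        buf = pickle.dumps(self.fp8_meta)
        return torch.frombuffer(bytearray(buf), dtype=torch.uint8)

    def set_extra_state(self, state: torch.Tensor) -> None:
        self.fp8_meta = pickle.loads(state.numpy().tobytes())

m = ModuleWithExtraState()
sd = m.state_dict()
m2 = ModuleWithExtraState()
m2.load_state_dict(sd)  # round-trips the metadata through a tensor
```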
* Always use a padding mask type for both flash and fused attention
* Remove a redundant assert
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
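Roughly what this means at the call site: with variable-length (THD) batches, the mask type must include "padding" so both the flash and fused backends skip padded tokens. A hedged sketch using TE's public DotProductAttention; exact argument placement varies across TE versions, so treat the details as assumptions.

```python
import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64)

# THD batch of two sequences (6 and 8 tokens), packed along dim 0.
q = torch.randn(14, 16, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
cu = torch.tensor([0, 6, 14], dtype=torch.int32, device="cuda")

out = attn(q, k, v,
           qkv_format="thd",
           attn_mask_type="padding_causal",  # padding variant, per the commit
           cu_seqlens_q=cu, cu_seqlens_kv=cu,
           max_seqlen_q=8, max_seqlen_kv=8)
```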
Debug Mcore integration test: avoid FP8 on Ampere and older; generate synthetic data instead of depending on external data.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* cuDNN normalization integration
* TE norm refactor
* TE norm API changes
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
…DIA#1364)
* Bug fix: use a default factory to avoid sharing mutable default values
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
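The pitfall being fixed is the standard Python one: a mutable default is evaluated once and shared across all instances, so one instance's mutation leaks into the others. A generic illustration (the field name is made up, not the TE code in question):

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # Wrong: `sharding_meta: dict = {}` would be shared across instances
    # (dataclasses even reject list/dict/set defaults at class creation).
    # Right: default_factory builds a fresh dict per instance.
    sharding_meta: dict = field(default_factory=dict)

a, b = Config(), Config()
a.sharding_meta["axis"] = 0
assert b.sharding_meta == {}  # no cross-instance leakage
```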
* Fix ctx.aval_out indexing for workspace
* Add cuDNN init to prepare phase of norm custom calls
* Add thread_local for norm registry instance
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Add Jeremy to CI users
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Softmax custom calls with correct encapsulation
* Remove deprecated JAX features
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…VIDIA#1358)
* Draft implementation of FSDP2 FP8 all-gather
* Fix the convergence issue
* Add warning
* Fix lint errors
* Add comments and reference
* Add related tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
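For orientation, FSDP2's per-parameter-sharding entry point is fully_shard; the point of an FP8 all-gather is that the collective that materializes full weights can move already-quantized bytes plus scales instead of full-precision data. A minimal usage sketch assuming a recent PyTorch (the fully_shard import path moved over releases); this is not the TE integration code itself.

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # PyTorch >= 2.6 path

# Assumes torch.distributed is already initialized (e.g. via torchrun).
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024)).cuda()

for layer in model:
    fully_shard(layer, mesh=mesh)  # shard each block's parameters
fully_shard(model, mesh=mesh)      # root wrap

# With an FP8-aware weight representation (as in the commit above), the
# all-gather in each forward can communicate FP8 data + scales, roughly
# halving volume versus bf16 parameters.
```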
Add max_t for KV
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
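In THD layouts the kernels need the maximum per-sequence length as a separate scalar, and tracking it for KV (not just Q) matters once KV lives in a cache that grows each decode step. A hedged sketch of the bookkeeping, with illustrative names (max_t_kv is not necessarily TE's symbol):

```python
import torch

# Cumulative KV sequence lengths for a THD batch of 3 sequences.
cu_seqlens_kv = torch.tensor([0, 17, 21, 48], dtype=torch.int32)

# Per-sequence lengths are adjacent differences; the max bounds kernel
# loop trip counts and workspace sizes on the KV side.
seqlens_kv = cu_seqlens_kv[1:] - cu_seqlens_kv[:-1]   # tensor([17, 4, 27])
max_t_kv = int(seqlens_kv.max())                      # 27

# During decoding each accepted token grows the KV lengths by one, so
# max_t_kv must be refreshed (or upper-bounded by the cache capacity).
```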
* Add util functions to attn_mask_type
* Add util functions to qkv_layout
* Fix THD cross-reference code
* Remove explicit segment_pad, encoding it into segment_ids
* Add jax.jit, replace _token with segment_ids, rename bias shape enum
* Add comment for make_mask
* Clean code and add doc strings for the added functions
* Remove cache for FA deterministic mode, which caused a UT failure
* Rename fixture to avoid conflict
Signed-off-by: Reese Wang <rewang@nvidia.com>
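The interesting change here is folding padding into segment_ids: if padded tokens carry a sentinel id, a single equality test builds the block-diagonal mask and no separate segment_pad array is needed. A hedged jax.numpy illustration, where the sentinel convention (pad_id=0) is an assumption rather than TE/JAX's actual encoding:

```python
import jax.numpy as jnp

def make_attention_mask(segment_ids_q, segment_ids_kv, pad_id=0):
    # True where query token i may attend to key token j: same segment,
    # and neither token is padding (padding encoded directly in the ids).
    same_segment = segment_ids_q[:, :, None] == segment_ids_kv[:, None, :]
    not_pad = ((segment_ids_q != pad_id)[:, :, None]
               & (segment_ids_kv != pad_id)[:, None, :])
    return same_segment & not_pad  # [batch, q_len, kv_len]

ids = jnp.array([[1, 1, 2, 2, 0]])   # two packed segments + one pad token
mask = make_attention_mask(ids, ids)
print(mask.shape)  # (1, 5, 5)
```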
* Add options to comm overlap tests
* Fix typo
* Update tests/pytorch/distributed/run_layer_with_overlap.py
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Create pytorch/dot_product_attention module and pytorch/dot_product_attention/utils.py; move attention logging into a separate class there
* Create FlashAttentionUtils class in dot_product_attention/utils.py for versioning info, moved out of pytorch/attention.py
* Move AttentionParams and get_attention_backend from attention.py to dot_product_attention/utils.py; fix tests and imports for the refactor
* Move get_qkv_layout(), get_full_mask(), get_alibi(), and get_attention_quantizers() to dot_product_attention/utils.py
* Move tensor packing and unpacking helpers from pytorch/attention.py to dot_product_attention/utils.py
* Move cumulative-seqlen and index methods to dot_product_attention/utils.py; rename tensor-packing methods with a leading underscore to mark them file-internal
* Remove unnecessary imports in pytorch/attention.py and dot_product_attention/utils.py
* Create dot_product_attention/inference.py and move InferenceParams into it; update tests, imports, and the docs API accordingly
* Create dot_product_attention/rope.py and move RoPE methods into it
* Code cleanup, lint fixes, and fixes for bugs found in QA testing
* Fix incorrect pack_tensor arg type
* Remove typedef FAUtils for FlashAttentionUtils; use attn_log instead of att_log
* nit: revert the function name from get_cumul back to the earlier get_cu
* nit: fix typos, make imports explicit, and remove extra comments
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
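For downstream users, the practical effect of this refactor is an import-path move. A sketch of the before/after, with the new paths inferred from the commit messages above; verify against the release you use, since re-exports may keep the old paths working.

```python
# Before the refactor, everything lived in one module:
# from transformer_engine.pytorch.attention import InferenceParams

# After, per the commits ("create d_p_a/inference.py and move
# InferenceParams ... to it"); paths inferred, not verified:
from transformer_engine.pytorch.dot_product_attention.inference import InferenceParams
from transformer_engine.pytorch.dot_product_attention.utils import get_attention_backend
```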
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
…IDIA#1554)
* Support tp-comm-overlap in the current-scaling recipe
* Fix test recipe argument to generalize to MXFP8
* Reduce duplicated transposes in certain cases
* Use per_tensor_scaling() to decide between delayed scaling (DS) and current scaling (CS)
* Minor fixes; update comment descriptions
* Add multi-layer unit test for TP overlap; support test cases that run several times
* Avoid saving the UB tensor in prepare_for_saving; switch to a simple fix; fall back to get_buffer in LayerNormMLP
* Simplify test cases; avoid an additional clone()
* Use 2 layers for the FP8 TP-overlap multi-layer test for better tolerance; limit max GPUs for the test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Li Tao <lit@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
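For reference, selecting the current-scaling recipe (as opposed to delayed scaling) in TE's PyTorch API looks roughly like the sketch below. Treat the exact class and argument names as assumptions to check against your TE version, and note that enabling the userbuffers overlap itself requires the usual TP-group/initialize_ub setup not shown here.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

# Per-tensor current scaling: scales come from each tensor's own amax
# computed this step, rather than a delayed amax history.
recipe = Float8CurrentScaling()

layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```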
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Add issue template
* Fixes
* Make GPU info section
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Do not create multiple cuBLAS handles
* Fix for multiple GPUs per thread
* Fix multithreaded execution
* Fix from conflict
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* DistOpt support with offloading
* Added DistOpt support for TE 2.0
* Restricted this to MCore DistOpt only
* Added guards
* Update transformer_engine/pytorch/module/linear.py and transformer_engine/pytorch/module/layernorm_linear.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos02.a51.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* [QA] Add error handling: standardize test-failure handling using the unified 'test_fail' and 'error_exit' functions
* Update scripts to use explicit python3, pip3, and python3 -m pytest calls
Signed-off-by: Linxi Ding <linxid@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Description
Checking how difficult it is to merge the paged-attention changes into the THD-attention changes.
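Since the PR reconciles paged KV caches with THD attention, here is a minimal sketch of the bookkeeping at stake: a page table maps each sequence's logical token positions to fixed-size physical pages, so THD-style cu_seqlens can still describe logical lengths while KV storage is non-contiguous. All names (append_kv, page_table, etc.) are illustrative, not TE's API.

```python
import torch

PAGE_SIZE = 16                      # tokens per KV page
NUM_PAGES, H, D = 64, 8, 64         # pool capacity, heads, head dim

# Physical pools for K and V; pages are allocated on demand.
k_pool = torch.zeros(NUM_PAGES, PAGE_SIZE, H, D)
v_pool = torch.zeros_like(k_pool)
free_pages = list(range(NUM_PAGES))
page_table = {}                     # seq_id -> list of physical page ids
seq_len = {}                        # seq_id -> tokens written so far

def append_kv(seq_id, k_tok, v_tok):
    """Append one token's K/V (each [H, D]) to a sequence's paged cache."""
    n = seq_len.get(seq_id, 0)
    if n % PAGE_SIZE == 0:          # current page full (or first token)
        page_table.setdefault(seq_id, []).append(free_pages.pop())
    page = page_table[seq_id][n // PAGE_SIZE]
    k_pool[page, n % PAGE_SIZE] = k_tok
    v_pool[page, n % PAGE_SIZE] = v_tok
    seq_len[seq_id] = n + 1

# One decode step for two sequences; attention kernels then gather pages
# via the page table while cu_seqlens_kv keeps describing logical lengths.
for sid in (0, 1):
    append_kv(sid, torch.randn(H, D), torch.randn(H, D))
lens = torch.tensor([seq_len[0], seq_len[1]])
cu_seqlens_kv = torch.cat([torch.zeros(1, dtype=torch.long), lens.cumsum(0)])
```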