
Generation support in TE for Gemma model #2

Open

sudhakarsingh27 wants to merge 276 commits into main from te_gemma_generation_support

Conversation

@sudhakarsingh27
Owner

Description

Reviving NVIDIA#829, but without the tutorial code, which for now lives in a separate branch: te_gemma_generation_tutorial.

denera and others added 30 commits May 22, 2024 17:05
…ax.jit (NVIDIA#785)

* fixed static argnums for jax.jit in single gpu encoder test, changed warning filtering for pytest

Signed-off-by: Alp Dener <adener@nvidia.com>
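
For context, a minimal sketch of the kind of fix described above (illustrative function and argument names, not the actual test code): arguments consumed by Python-level control flow must be declared static for `jax.jit`.

```python
from functools import partial

import jax
import jax.numpy as jnp

# A flag used in Python control flow changes the traced graph, so it must
# be a static argument; tracing it as a regular array would fail in the `if`.
@partial(jax.jit, static_argnums=(1,))
def encoder_step(x, use_mask):
    if use_mask:
        x = jnp.where(x > 0, x, 0.0)
    return x * 2.0

out = encoder_step(jnp.ones((4,)), True)  # retraced once per flag value
```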

* propagating the fix to the JAX mnist example

Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed missing space between flags in QA scripts

Signed-off-by: Alp Dener <adener@nvidia.com>

* added TE warnings into the ignore list

Signed-off-by: Alp Dener <adener@nvidia.com>

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Add NVRTC kernels for cast-transpose

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update copyright year

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add noop flag to NVRTC cast-transpose kernel

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Apply suggestions from code review

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
)

* Support noop concat without providing full tensor

Stop storing fused buffers in linear modules.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
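
A minimal sketch of the no-op concat idea (hypothetical helper, dim-0 only; the real TE implementation covers more cases): when the chunks are already adjacent views of a single buffer, return a view of that buffer instead of copying.

```python
import torch

def noop_cat(chunks):
    """Concatenate along dim 0, skipping the copy when the chunks are
    already contiguous, back-to-back views of a single buffer."""
    base = chunks[0]
    ptr = base.data_ptr()
    rows = 0
    for c in chunks:
        if not c.is_contiguous() or c.data_ptr() != ptr:
            return torch.cat(chunks)  # not adjacent in memory: real copy
        ptr += c.numel() * c.element_size()
        rows += c.shape[0]
    # All chunks line up in memory: view the shared buffer, no data movement.
    return base.as_strided((rows,) + base.shape[1:], base.stride())

buf = torch.randn(6, 4)
parts = torch.split(buf, 2)  # views into `buf`
assert noop_cat(parts).data_ptr() == buf.data_ptr()
```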

* Debug noop cat func

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Construct TE modules in tests with correct dtypes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tolerances to numerical tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use plain PyTorch concat when exporting to ONNX

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…#780)

* Allow multi-dims for dgamma and dbeta in LN descriptor.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the jit error in examples/jax

Signed-off-by: Ming Huang <mingh@nvidia.com>

---------

Signed-off-by: Ming Huang <mingh@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Remove unnecessary Pylint overrides

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fixes to lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* combined layernorm_geglu with layernorm_gelu into fused_layernorm

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes to pass all unit tests in test_custom_call_compute.py,
test_layer.py, and test_praxis_layer.py

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* cleaning and formatting

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* renaming based on reviewers' suggestions

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* implemented partial fused layernorm

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* geglu + bias passed tests

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added partial fused calculation for dbias_1

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* clean up

Co-authored-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
Co-authored-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Try using global buffer for cu_seqlens

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Avoid using functools.lru_cache

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
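
A sketch of the global-buffer approach (hypothetical helper; the actual TE cache is keyed and managed differently): a module-level dict can be inspected and cleared explicitly, unlike `functools.lru_cache`, which silently keeps CUDA tensors alive inside an opaque cache.

```python
import torch

# Module-level buffer cache used in place of functools.lru_cache.
_cu_seqlens_buffers = {}

def get_cu_seqlens(batch_size, max_seqlen, device):
    """cu_seqlens for fixed-length (padded) batches: [0, L, 2L, ..., B*L]."""
    key = (batch_size, max_seqlen, str(device))
    if key not in _cu_seqlens_buffers:
        _cu_seqlens_buffers[key] = torch.arange(
            0, (batch_size + 1) * max_seqlen, step=max_seqlen,
            dtype=torch.int32, device=device)
    return _cu_seqlens_buffers[key]
```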

* fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Added HF Nanotron to integrations and updated GTC 24 video to on-demand link

Signed-off-by: Santosh Bhavani <santosh@semantic.md>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Implemented swiglu and silu

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Renamed nvte-*silu to nvte-*swish + generalized GetDBiasDact functions

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* make FusedAttn with CP support bias

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* assert Alibi cannot work with CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* syntax fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix variable name

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix tensor shapes

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* a typo fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix bias indexing for CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add attn bias tests

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* change dbias update location

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix CP test model configs

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* change CP test sequence length

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make AttnFuncWithCP support qkv format of sbhd

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make sure qkv are contiguous for CP in cuDNN fused attn

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* change assert message

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix code format

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Add support for MoE with FP8.

Signed-off-by: Dennis Liu <denliu@nvidia.com>

* Fix unittest.

Signed-off-by: Dennis Liu <denliu@nvidia.com>

* Fix error in linear backward.

Signed-off-by: Dennis Liu <denliu@nvidia.com>

---------

Signed-off-by: Dennis Liu <denliu@nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Add module level filter for deprecation warning in common

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix module

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
remove tp_size/tp_group as amax reduction is handled by fp8_group()

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
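
Usage-wise, the amax reduction group is supplied once at the autocast level rather than per-module. A minimal sketch (`model`, `inp`, and the group setup are placeholders; this is not the PR's code):

```python
import torch.distributed as dist
import transformer_engine.pytorch as te

# The group over which amax/scale statistics are reduced is passed to
# fp8_autocast; attention modules no longer carry their own tp_size/tp_group
# for this purpose.
fp8_group = dist.new_group(ranks=list(range(dist.get_world_size())))
with te.fp8_autocast(enabled=True, fp8_group=fp8_group):
    out = model(inp)
```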
…IDIA#799)

restrict context parallel tests to sm80+ as fused/flash attn backends require sm80+

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Fix linter warnings from unused args

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update .gitignore

Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Added pull request template

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes from the review

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…nite scale (NVIDIA#786)

* Handle the scaling factor when amax is so tiny that it leads to an infinite scale

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
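
The underlying issue, sketched with a simplified update rule (TE's actual recipe logic differs in its details): the scale is derived as fp8_max / amax, so an amax of zero produces an infinite scale unless guarded.

```python
import torch

def compute_fp8_scale(amax, fp8_max, margin=0):
    """Simplified FP8 scale update: scale = fp8_max / amax / 2**margin.
    When amax is zero (e.g. an all-zero tensor) or denormal, the division
    yields inf, which would poison scale_inv and every subsequent cast;
    replace non-finite scales with a finite fallback."""
    scale = fp8_max / amax / (2.0 ** margin)
    return torch.where(torch.isfinite(scale), scale, torch.ones_like(scale))

amax = torch.tensor([0.0, 1e-38, 3.0])
print(compute_fp8_scale(amax, fp8_max=448.0))  # finite everywhere
```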

* revert formatting changes

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* fix comments

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* apply review suggestion

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* revert changes to update_weight_scale_inv

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* Debug test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Jinze Xue <jinzex@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…> 1 on Paxml. (NVIDIA#774)

* Support FP8 Meta Dtype (FM32) and Align FP8 Scale Update with PyTorch.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Modify with the feedback of code review

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Hiding FlaxFloatMeta32 inside fp8.py

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Make functions JAX-traceable objects.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Rebased with main.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Update jax images for github workflow.

Signed-off-by: Ming Huang <mingh@nvidia.com>

---------

Signed-off-by: Ming Huang <mingh@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* initialize tp_group for FP8 DPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix cuDNN version in unit tests for cuDNN v9

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add hook to ignore missing fused_attn._extra_states if training from old checkpoints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
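
A sketch of the checkpoint-compatibility hook (hypothetical class and key names; TE's real modules differ): a load-state-dict pre-hook injects a placeholder for the `_extra_state` entry that old checkpoints lack, so strict loading does not fail.

```python
import torch

class CoreAttentionSketch(torch.nn.Module):
    """Illustrative only: tolerate checkpoints saved before fused attention
    started serializing `_extra_state`."""

    def __init__(self):
        super().__init__()
        # PyTorch's long-standing (private) pre-load hook; a bound method
        # receives the arguments listed below.
        self._register_load_state_dict_pre_hook(self._fill_missing_extra_state)

    def _fill_missing_extra_state(self, state_dict, prefix, local_metadata,
                                  strict, missing_keys, unexpected_keys,
                                  error_msgs):
        key = prefix + "fused_attention._extra_state"
        if key not in state_dict:
            # Old checkpoint: supply an empty placeholder so the key is
            # not reported as missing.
            state_dict[key] = torch.Tensor()
```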

* remove test and redundant implementation from last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove warning message and replace with docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move core_attention.fused_attention._extra_state to core_attention._extra_state

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* simplify post_state_dict_hooks between FU and DPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add temporary test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove previous attempts to move core_attention.fused_attention to core_attention; keep the test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove the test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable the pylint self-arg warning for the hook, since the argument is required by the hook interface

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Add layernorm_fp8_dot unit test

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Update the softmax primitives support conditions

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add tests for the softmax primitives

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Round1 refactor of test_layer

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Split dropout arguments of ref code and add hidden/intermediate dropout elementwise comparison

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add dropout_broadcast_dim and self_attn_mask tests, and clean up some code

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Abstract test layer and fix a rope reference code diff

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add bias tests

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add epsilon and float32 tests

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add relpos_bias and attention dropout tests

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Loosen the atol

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Move common fixtures to conftest.py

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add doc string for test_layer

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add doc string for test_layer

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix conflicts of test_layer

Signed-off-by: Reese Wang <rewang@nvidia.com>

* Avoid leaving bias parameters in the graph when use_bias=False

Signed-off-by: Reese Wang <rewang@nvidia.com>
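
The fix pattern, sketched in Flax (illustrative module, not the actual TE layer): declare the bias parameter only inside the use_bias branch, so a disabled bias never appears in the variable tree.

```python
import flax.linen as nn
import jax.numpy as jnp

class DenseSketch(nn.Module):
    features: int
    use_bias: bool = True

    @nn.compact
    def __call__(self, x):
        kernel = self.param('kernel', nn.initializers.lecun_normal(),
                            (x.shape[-1], self.features))
        y = jnp.dot(x, kernel)
        if self.use_bias:
            # Declared only in this branch: with use_bias=False, no bias
            # leaf is created, so nothing unused is left in the graph.
            bias = self.param('bias', nn.initializers.zeros,
                              (self.features,))
            y = y + bias
        return y
```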

---------

Signed-off-by: Reese Wang <rewang@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* templated primitives and respective C++ functions

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes for LayerNormMLP, tests in test_custom_compute all passed

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added default arg for pybind get_workspace_size funcs

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes for TestTransFormer with non-gated act tests

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* renamed gelu to act

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* improved enum implementation, avoiding magic numbers

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Exposed C++ ActivationEnum to python side

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Changed error messages

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* changed conditional check on input shape for dbias_cast_transpose

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* changed dtype (tol) for bias grad tests

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes so that layer_norm_fp8_mlp can take bias = None

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Set bias = None in flax modules

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Update FP8 recipe test to handle recipe changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Bump FA version to 2.5.8

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fixes for ActLuPrimitive in PAXML

* changed indices for arg_infos in sharding func in dbias_cast_transpose primitive

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 30 commits June 6, 2024 09:20
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>