Commit 7fa0f55
[Pytorch] Support for Swiglu Activation used in GPT OSS (#2161)
* Test working as I think it should work
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
revert accidental change
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Restrict the number of cases for unfused quantization; some fp8->fp8 cases are handled by cuBLAS
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
fix merge conflict
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
bug: missed a } in the code
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Add cuBLASMp-backed GEMM-like API to TE common (#1824)
* Pick up cuBLASMp during build
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Change lib order to fix link error
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Context creation, incomplete...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Test fixture
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* A sanity AgGemm test, failing...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix axes
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Take care of uneven distribution
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use MPI to get position of local matrices
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Refactor
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Refactor & fixes
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Gemm-RS
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Gemm-AR, not working...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fixes
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Setting all-reduce epilogue for gemm-ar
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use supported shapes for GEMM-AR
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak tolerance
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* First shot at fp8
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use TensorHolder in tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More test configs
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Support comm_sm_count
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Parametrize dtypes for A, B and D separately
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak scaling
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Amax ptr
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Flags parity with cublas_gemm, saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Cleanup
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Bias tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix bias test
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Aux, saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* aux_ld
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* A fix
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use test::Tensor
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Set scale inv
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove unsupported test configs
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Replace libcal with NCCL
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Add NVTX markers to API functions
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak GemmAr tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More test config
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix merge fallout
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove MPI dependency, comment API, add algo parameter
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix nvshmem dependency
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix nvshmem build
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Exclude CommGemm tests from L0_cppunittest
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Add cpp_distributed sh file for CI
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Adapt to TensorAllocator
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Skip GemmAr test on unsupported HW
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Oversubscription is needed on some clusters
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix incomplete libcal removal
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Move CI tests to L1
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Rename context to include NVTE prefix
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove leftover code
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* NVTE_WITH_CUBLASMP off by default
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More detailed NVTE_CHECK diag
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Comment API
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Include stdbool header for legacy C compilers
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove now unused argument
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Abstract away cuBLASMp algo behind our own enum
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More detailed shape diag messages
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update transformer_engine/common/include/transformer_engine/comm_gemm.h
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>
* Add license
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
---------
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>
Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086)
* FP8 AllGather in FP8 GroupedGEMM
1. Support current-scaling FP8 quantization with a given amax.
2. Support FP8 all-gather in fwd and BF16 reduce-scatter in bwd.
3. The workflow is amax all-reduce -> FP8 quantize -> FP8 all-gather -> FP8 GroupedGEMM.
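The quantization step in the workflow above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the real kernels cast to hardware FP8, which is emulated here by clamping to the E4M3 dynamic range (±448), and `quantize_with_amax` is an illustrative name, not a TE API. The point is that every rank quantizes its local shard with the globally all-reduced amax, so all shards share one scale and the subsequent FP8 all-gather is numerically consistent.

```python
import numpy as np

E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3

def quantize_with_amax(x, amax):
    """Current-scaling FP8 quantization given a precomputed (e.g. all-reduced) amax."""
    scale = E4M3_MAX / amax  # map |x| <= amax into the FP8 range
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)  # emulate the FP8 cast by clamping
    scale_inv = 1.0 / scale  # stored alongside the data for dequantization
    return x_scaled, scale_inv

# Local shard quantized with the *global* amax (here assumed already all-reduced).
x = np.array([-2.0, 0.5, 3.0])
q, s_inv = quantize_with_amax(x, amax=4.0)
```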
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Slightly refactor
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding documentation for new args.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding unit-tests.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding license.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Move unit-tests to L1.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Move quantizer store/reset into FP8 only.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding all layout support for Blackwell+
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adopt the feedback from code-review.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Fixed the wrong stream used by d2d in groupedGEMM FFI.
Signed-off-by: Ming Huang <mingh@nvidia.com>
---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] Delay MeshResource validation until first usage (#2124)
Delay MeshResource validation until first usage
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] Decouple Recipe and ScalingMode (#1728)
* Decouple recipe and scaling mode
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Expose global QuantizeConfig instance as a getter
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Format and lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Rename UsageType to TensorSource
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128)
* add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118)
* add amax input to DBiasQuantizePrimitive and FFI
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* make sure amax is init with zero
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix sharding rule
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121)
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Temporarily remove comm_gemm tests (#2133)
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[PyTorch] Disable determinism for sm100 (#2130)
* disable determinism for sm100+ and cudnn<9.14
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix remaining CI failures
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert some changes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* revert more changes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove sm100 from determinism table
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[PyTorch] ONNX export of FP8 Current Scaling (#2068)
* Compute amax in normalization forward in current scaling in untuned kernels
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* code drop
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* apply Tim's suggestions
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
---------
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134)
use torch empty for empty shape instead of from_blob
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
build: pull cached wheels (#2127)
* build: pull cached wheels
Signed-off-by: oliver könig <okoenig@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update setup.py
Signed-off-by: oliver könig <okoenig@nvidia.com>
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
feat: Add support for multiple quantization modes in the UB communicators (#2043)
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[Common] Add checks to CUDA kernel launch and CUDA API calls (#2074)
* add checks to cuda kernel launch and cuda API calls
Signed-off-by: Xin Yao <xiny@nvidia.com>
* Remove exceptions from destructors
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix weird dispatch in ln/rmsnorm
Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[PyTorch] Support bf16+fp8 cudagraph (#2098)
* support bf16+fp8 model
Signed-off-by: Robin Zhang <robinz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
Signed-off-by: Robin Zhang <robinz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
Signed-off-by: Robin Zhang <robinz@nvidia.com>
---------
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Dropout with 8-bit RNG (#2014)
* Add dropout kernel with 8-bit RNG
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix license
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Avoid ambiguous types
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Do not enforce dropout prob is representable in 8 bits
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Expand error message
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix small statistical bug from using less-equal instead of less-than
Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
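The off-by-one that the less-than-vs-less-equal fix above addresses can be seen in a small sketch. This is a NumPy emulation with illustrative names, not the actual CUDA kernel: the dropout probability is quantized to 1/256 steps, and with a strict `<` for dropping, exhaustively feeding all 256 byte values drops exactly `threshold` of them; `<=` would drop one extra value and skew the rate by 1/256.

```python
import numpy as np

def dropout_keep_mask(rand_bytes, drop_prob):
    """Keep-mask from 8-bit uniform randoms: drop when byte < threshold."""
    threshold = int(round(drop_prob * 256))  # drop probability in 1/256 steps
    return rand_bytes >= threshold           # strict '<' for dropping, so keep is '>='

# Exhaustive check over all 256 byte values: drop rate is exactly threshold/256.
all_bytes = np.arange(256, dtype=np.uint8)
kept_half = dropout_keep_mask(all_bytes, drop_prob=0.5)
kept_quarter = dropout_keep_mask(all_bytes, drop_prob=0.25)
```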
* Fix linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unnecessary helper function in PyTorch extensions
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Create GPU reload buffers on main stream (#2131)
* Create GPU reload buffers on main stream
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fixed typo
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
* Fixed typo
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
MXFP8 unfused quant support, refined unit test, removed unnecessary quantization code
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
missed a quant code removal
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
minor bug fix
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
minor code cleanup
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
minor cosmetics
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Address review comment
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
minor comment update
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Fix CI failures for UB overlap changes (#2149)
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
minor bug: quantizer should not be None for unfused quantization
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135)
* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix the skip message
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Assert in fused attn bwd pass for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Add check for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add support to get all devs in the process for jax
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Code clean up
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Represent attn bias using enum instead of string
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
fix linting error
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* initial draft of changes to get GPT-OSS-based SwiGLU integrated; gated kernels need to be fixed
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* redundant implementation for the PyTorch-to-TE hook-up; refactoring to be done later
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* all gated kernels modified, pytest working for GPT-OSS SwiGLU
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
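The GPT-OSS gated-activation variant being wired into the gated kernels here differs from plain SwiGLU by clamped inputs and a sigmoid sharpness factor. Below is a minimal NumPy sketch of that variant; the constants alpha=1.702 and limit=7.0 and the exact form (clamp, `x * sigmoid(alpha * x)` gate, `+1` bias on the linear branch) follow the public GPT-OSS reference implementation and should be treated as an assumption about the target activation, not as TE's kernel code.

```python
import numpy as np

def gpt_oss_swiglu(x_glu, x_linear, alpha=1.702, limit=7.0):
    """Clamped SwiGLU as used in GPT-OSS-style MLPs (sketch, not the TE kernel)."""
    x_glu = np.minimum(x_glu, limit)             # clamp gate input from above only
    x_linear = np.clip(x_linear, -limit, limit)  # clamp linear input on both sides
    gate = x_glu / (1.0 + np.exp(-alpha * x_glu))  # x * sigmoid(alpha * x)
    return gate * (x_linear + 1.0)               # '+1' bias on the linear branch

out = gpt_oss_swiglu(np.array([0.0, 100.0]), np.array([1.0, 100.0]))
```

With large inputs both branches saturate at the clamp limit, which is the behavior the unit tests for the gated kernels need to cover.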
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* fix the merge conflict
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Add cuBLASMp-backed GEMM-like API to TE common (#1824)
* Pick up cuBLASMp during build
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Change lib order to fix link error
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Context creation, incomplete...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Test fixure
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* A sanity AgGemm test, failing...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix axes
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Take care of uneven distribution
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use MPI to get position of local matrices
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Refactor
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Refactor & fixes
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Gemm-RS
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Gemm-AR, not working...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fixes
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Setting all-reduce epilogue for gemm-ar
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use supported shapes for GEMM-AR
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak tolerance
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* First shot at fp8
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use TensorHolder in tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More test configs
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Support comm_sm_count
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Parametrize dtypes for A, B and D separately
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak scaling
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Amax ptr
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Flags parity with cublas_gemm, saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Cleanup
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Bias tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix bias test
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Aux, saving...
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* aux_ld
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* A fix
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Use test::Tensor
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Set scale inv
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove unsupported test configs
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Replace libcal with NCCL
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Add NVTX markers to API functions
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Tweak GemmAr tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More test config
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix merge fallout
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove MPI dependency, comment API, add algo parameter
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix nvshmem dependency
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix nvshmem build
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Exclude CommGemm tests from L0_cppunittest
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Add cpp_distributed sh file for CI
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Adapt to TensorAllocator
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Skip GemmAr test on unsupported HW
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Oversubscribe is needed on some clusters
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Fix incomplete libcal removal
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Move CI tests to L1
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Rename context to include NVTE prefix
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove leftover code
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* NVTE_WITH_CUBLASMP off by default
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More detailed NVTE_CHECK diag
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Comment API
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Include stdbool header for legacy C compilers
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove now unused argument
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Abstract away cuBLASMp algo behind our own enum
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* More detailed shape diag messages
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update transformer_engine/common/include/transformer_engine/comm_gemm.h
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>
* Add license
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
---------
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>
Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (#2119)
* add noop to comp amax
Signed-off-by: zhongboz <zhongboz@nvidia.com>
* fix for fp8 blockwise recipe
Signed-off-by: zhongboz <zhongboz@nvidia.com>
* resolve comments
Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[PyTorch] fix cross entropy vanishing gradients (#2139)
* fix cross entropy
Signed-off-by: Casper <casperbh.96@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Casper <casperbh.96@gmail.com>
* fix comments
Signed-off-by: Casper <casperbh.96@gmail.com>
* fix: few more style issues
Signed-off-by: Casper <casperbh.96@gmail.com>
* fix: remove grad_output_stride (unnecessary)
Signed-off-by: Casper <casperbh.96@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: only backward was broken
Signed-off-by: Casper <casperbh.96@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Generalize cross entropy backward kernel to handle reduced and unreduced loss
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Casper <casperbh.96@gmail.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Fix bug when enabling --overlap-grad-reduce in mcore (#2142)
* fix bugs when enabling --overlap-grad-reduce in mcore
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix CI
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* format
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Fix CUDA version in setup.py (#2132)
* Fix CUDA version in setup.py
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Re-enable building comm-gemm tests
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* WAR for nvidia-nvshmem package
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
---------
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] NoScaleTensor wrapper for non-quantized data (#2136)
* Custom call tests passing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add higher-precision VJP tests to test_distributed_layernorm_mlp
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Cast non-quantized kernels to input dtype in VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Rename HighPrecisionTensor to NoScaleTensor
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Use NoScaleTensor in pure JAX impls where it was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[JAX] Fix GroupedScaledTensor creation with keyword arg (#2154)
Fix GroupedScaledTensor creation
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Fixing a few issues with multi-process launching. (#2155)
* Fixing a few issues with multi-process launching.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Update list of authorized CI users (#2152)
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
a bit of cleanup
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* accidentally had removed some activations; fix minor bug in the templated function
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* merge conflict
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086)
* FP8 AllGather in FP8 GroupedGEMM
1. Support current scaling FP8 quantization with a given amax.
2. Support FP8 AG in fwd and BF16 RS in bwd.
3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM.
Signed-off-by: Ming Huang <mingh@nvidia.com>
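The quantization step in the workflow above can be sketched in plain NumPy (an illustrative single-process model; `FP8_E4M3_MAX` and the helper name are assumptions, not TE's API — in the distributed flow the amax comes from an all-reduce before the FP8 all-gather):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3

def quantize_with_amax(x, amax):
    # Current scaling: derive the scale from the provided (all-reduced) amax,
    # rather than from a history of past amaxes (delayed scaling).
    scale = FP8_E4M3_MAX / amax
    # Simulate the FP8 cast by scaling, rounding, and clipping
    # (real kernels cast to an FP8 dtype on-device).
    q = np.clip(np.rint(x * scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, 1.0 / scale  # quantized values and dequantization scale_inv

x = np.array([0.5, -1.25, 2.0])
amax = np.abs(x).max()  # in the distributed flow this comes from an all-reduce
q, scale_inv = quantize_with_amax(x, amax)
dequant = q * scale_inv
```

Because every rank quantizes with the same all-reduced amax, the FP8 shards can be all-gathered and fed directly to the grouped GEMM with a single shared scale_inv.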
* Slightly refactor
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding documents of new args.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding unit-tests.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding license.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Move unit-tests to L1.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Move quantizer store/reset into FP8 only.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adding all layout support for Blackwell+
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Adopt the feedback from code-review.
Signed-off-by: Ming Huang <mingh@nvidia.com>
* Fixed the wrong stream used by d2d in groupedGEMM FFI.
Signed-off-by: Ming Huang <mingh@nvidia.com>
---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
[JAX] Delay MeshResource validation until first usage (#2124)
Delay MeshResource validation until first usage
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
[JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128)
* add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
[JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118)
* add amax input to DBiasQuantizePrimitive and FFI
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* make sure amax is init with zero
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix sharding rule
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121)
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Temporarily remove comm_gemm tests (#2133)
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
[PyTorch] Disable determinism for sm100 (#2130)
* disable determinism for sm100+ and cudnn<9.14
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix remaining CI failures
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert some changes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* revert more changes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove sm100 from determinism table
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
[PyTorch] ONNX export of FP8 Current Scaling (#2068)
* Compute amax in normalization forward in current scaling in untuned kernels
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* code drop
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* apply Tim's suggestions
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
---------
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134)
use torch empty for empty shape instead of from_blob
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
build: pull cached wheels (#2127)
* build: pull cached wheels
Signed-off-by: oliver könig <okoenig@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update setup.py
Signed-off-by: oliver könig <okoenig@nvidia.com>
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
[Common] Add checks to CUDA kernel launch and CUDA API calls (#2074)
* add checks to cuda kernel launch and cuda API calls
Signed-off-by: Xin Yao <xiny@nvidia.com>
* Remove exceptions from destructors
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix weird dispatch in ln/rmsnorm
Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
[PyTorch] Support bf16+fp8 cudagraph (#2098)
* support bf16+fp8 model
Signed-off-by: Robin Zhang <robinz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
Signed-off-by: Robin Zhang <robinz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
Signed-off-by: Robin Zhang <robinz@nvidia.com>
---------
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Dropout with 8-bit RNG (#2014)
* Add dropout kernel with 8-bit RNG
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix license
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Avoid ambiguous types
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Do not enforce dropout prob is representable in 8 bits
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Expand error message
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix small statistical bug from using less-equal instead of less-than
Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
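The less-equal vs less-than fix matters because, with an 8-bit RNG, each comparison threshold is a multiple of 1/256; a small sketch of the off-by-one bias (illustrative only, not the TE kernel):

```python
import numpy as np

# With an 8-bit RNG, a uniform byte r in [0, 256) is compared to a
# threshold t to decide whether to drop. Using r <= t instead of r < t
# drops with probability (t + 1)/256 rather than t/256.
rng = np.random.default_rng(0)
r = rng.integers(0, 256, size=1_000_000)
t = 64  # intended drop probability 64/256 = 0.25
p_lt = (r < t).mean()   # close to 64/256
p_le = (r <= t).mean()  # close to 65/256, i.e. a 1/256 bias
```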
* Fix linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unnecessary helper function in PyTorch extensions
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Create GPU reload buffers on main stream (#2131)
* Create GPU reload buffers on main stream
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fixed typo
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
* Fixed typo
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Fix CI failures for UB overlap changes (#2149)
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
[JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135)
* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix the skip message
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Assert in fused attn bwd pass for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Add check for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add support to get all devs in the process for jax
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Code clean up
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Represent attn bias using enum instead of string
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fused RoPE with combined QKV input. (#2122)
* Fused RoPE with combined QKV input.
Initial commit for Dropout with 8-bit RNG
Fix documentation
Initial commit for Fused QKV RoPE
WIP
Initial tests passing
Enable rotary percent and margin
Enable CP2, start_positions, interleaved
Cleanup test
Revert "Fix documentation"
This reverts commit 53df10044e7769982bd4af2ae2628e6b7717e715.
Revert "Initial commit for Dropout with 8-bit RNG"
This reverts commit 301505e24031cbcd679069e1c2cd4d00eedf2dca.
Cleanup.
Minor cleanup
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* Optimize kernels
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* Misc. Cleanup
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* Optimize kernel performance
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* Move fused_qkv_rope test to test_fused_rope.py
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* apply shared memory optimization to separate fused rope kernels
Signed-off-by: Xin Yao <xiny@nvidia.com>
* fix lint
Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* accidentally removed the copyright
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* fix linting issue
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* minor issue in comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* Commit is for another PR
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* revert changes since this belongs to another PR
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert change back since belongs to another PR
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Changes belong to another PR
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert changes here
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Add bf16/fp32 token-per-expert to the MoE aux loss kernel (#2162)
* add bf16/fp32 token-per-expert to the MoE loss computation in the router fusion
Signed-off-by: tongliu <tongliu@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: tongliu <tongliu@nvidia.com>
Co-authored-by: tongliu <tongliu@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
[JAX] Scale swizzling via JAX transpose op (#2163)
* add swizzle in jax
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* added outer_impl
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* clean up FFI
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Extract cpp distributed tests into a separate project (#2165)
* Extract cpp distributed tests into a separate project
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Remove obsolete exclusion
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Run L1_cpp_distributed tests if at least 4 GPUs
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
---------
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Adds context parallelism utilities: moving cp shards to different ranks and padding sequences to a divisibility factor (#2129)
* test - adds unit tests for the cp utilities, and the utilities themselves
Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com>
* assert line change
Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sudhakar Singh <sudhakars@nvidia.com>
* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* cleanup
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix linting error
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
[PyTorch Debug] Fix issue with negative underflow% stat. (#2107)
* fix underflows log issue
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Address review comments; fix mxfp8 kernel bug: the clamped swiglu parameter was not passed correctly
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
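For reference, a clamped SwiGLU can be sketched as follows (an illustrative formulation; the clamp limits and where they apply are assumptions, not TE's exact GPT OSS kernel):

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def clamped_swiglu(gate, up, limit=7.0):
    # Clamp both branches before gating; "limit" is the extra parameter
    # the kernel must receive (the bug above was it not being passed).
    gate = np.minimum(gate, limit)
    up = np.clip(up, -limit, limit)
    return silu(gate) * up

y = clamped_swiglu(np.array([0.0, 10.0]), np.array([1.0, 1.0]))
```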
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
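For context on the clamped-SwiGLU parameter mentioned in the fix above, here is a minimal sketch of a clamped SwiGLU variant of the kind used in GPT OSS: the gate branch is clamped from above, the linear branch on both sides, before the sigmoid-weighted product. The names, the `limit`/`alpha` defaults, and the `(linear + 1)` form are assumptions for illustration, not the kernel's actual signature:

```python
import numpy as np

def clamped_swiglu(gate: np.ndarray, linear: np.ndarray,
                   limit: float = 7.0, alpha: float = 1.702) -> np.ndarray:
    # Clamp the gate branch from above and the linear branch on both sides,
    # then gate with a scaled sigmoid (swish). Defaults are placeholders.
    gate = np.minimum(gate, limit)
    linear = np.clip(linear, -limit, limit)
    return gate * (1.0 / (1.0 + np.exp(-alpha * gate))) * (linear + 1.0)
```

Because the clamp changes the value actually fed to the activation, a kernel that quantizes the output (e.g. the mxfp8 path) must receive the same clamp parameter as the reference path, which is what the fix addresses.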
Lower precision gated-act to accelerate FP8 current-scaling. (#2153)
* Applying the original precision as N…
1 parent 25252e9 · commit 7fa0f55
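The PR above concerns FP8 current scaling, where the quantization scale is derived from the tensor's own amax rather than a delayed history. A minimal sketch of per-tensor current scaling, with illustrative names (`FP8_E4M3_MAX` is the standard E4M3 maximum; the helper and its simulation-by-clipping are assumptions, not TE's implementation):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def current_scaling_quantize(x: np.ndarray):
    # "Current" scaling: compute this tensor's amax and choose the scale so
    # that amax maps onto the edge of the FP8 representable range, then
    # simulate quantization by scaling and clipping.
    amax = float(np.abs(x).max())
    scale = FP8_E4M3_MAX / max(amax, 1e-12)
    x_fp8 = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_fp8, scale
```

Running the gated activation in lower precision reduces the cost of the amax reduction and the cast, which is the acceleration the PR title refers to.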
File tree (14 files changed, +459 −122 lines):
- tests/pytorch
- transformer_engine
  - common
    - activation
    - include/transformer_engine
    - util
  - pytorch
    - csrc
      - extensions
    - ops/basic
[Diff table: file content not captured; 74 lines added at original lines 1739–1812.]
Lines changed: 4 additions & 6 deletions
[Diff table: file content not captured.]
[Diff table: file content not captured; one-line spans replaced by two-line spans around original lines 26–64.]
[Diff table: file content not captured; same edit pattern as the previous file, around original lines 26–64.]
[Diff table: file content not captured; includes a 19-line addition around original lines 34–52.]
Lines changed: 40 additions & 0 deletions
[Diff table: file content not captured; two 20-line additions around original lines 176–195 and 253–272.]