Commits (26 total; changes shown from 14 commits):
02bc852 (pggPL, May 19, 2025): [Pytorch] NVIDIA-DL-Framework-Inspect support – part 3 – tests (#1612)
74525d1 (ksivaman, May 19, 2025): Fix README render for uploading package to PyPI (#1798)
cea1152 (negvet, May 19, 2025): Enhance recipe compatibility (#1724)
610c393 (pstjohn, May 20, 2025): Use an empty torch tensor to indicate no fp8 information in extra_sta…
c5ea9eb (pggPL, May 20, 2025): [Pytorch] NVIDIA-DL-Framework-Inspect support – part 4 – documentatio…
aafa053 (cyanguwa, May 20, 2025): [PyTorch] Add docstring for CP load balancing (#1802)
90458e7 (ksivaman, May 21, 2025): Add missing docs for C API (#1803)
3a5ca57 (ksivaman, May 22, 2025): Remove `comm_gemm_overlap` doc (#1815)
9b80ea9 (ksivaman, May 22, 2025): Add docs for missing FP8 recipes. (#1816)
7558c44 (ptrendx, May 23, 2025): Fix the failing test cases in the CI (#1806)
d82f67b (ksivaman, May 28, 2025): Fix multi-framework runtime lib loading (#1825)
b1d2539 (alextmagro, Oct 6, 2025): Release v2.4_rocm
0e1c8fe (alextmagro, Oct 7, 2025): readd HIP data generation
758ed7e (alextmagro, Oct 8, 2025): Missing ; in test_common
d1b8dba (VeeraRajasekhar, Oct 31, 2025): [CI] Removed Jax jit workaround, replaced with XLA_FLAGS=--xla_gpu_en…
fa8615d (Micky774, Oct 10, 2025): CI hotfix: IFU test update (#329)
08bf8fc (ipanfilo, Oct 19, 2025): Fix and add MXFP8 GEMM test failures (#326)
c6a2c65 (ipanfilo, Oct 23, 2025): Fix FFI import. Add distributed tests hang workaround (#347)
499d2d8 (ipanfilo, Oct 27, 2025): Make TE ROCm wheels building image directly from manylinix image (#340)
235b9b6 (VeeraRajasekhar, Oct 31, 2025): [CI] Hotfix test_gemm_autotune update (#353)
bcae459 (alextmagro, Oct 31, 2025): MXFP8 test scale off by 1 fix (#338)
34b1a34 (ipanfilo, Nov 7, 2025): CI: allow numpy 2.0 (#366)
736ab30 (ipanfilo, Nov 8, 2025): Relax tolerance to pass 29x29x17389NT GEMM on MI350 (#365)
baed0d1 (ipanfilo, Oct 12, 2025): Bring back aiter solib with aiter update (#327)
cc5b356 (wangye805, Oct 22, 2025): [ROCm] update AITER to support aiter shared lib for multi-gpu (PRs 11…
08344fe (ipanfilo, Nov 12, 2025): Use .info/version for ROCm verison (#368)
README.rst: 17 changes (9 additions & 8 deletions)
@@ -450,7 +450,7 @@ Installation
============

System Requirements
^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^

* **Hardware:** Blackwell, Hopper, Grace Hopper/Blackwell, Ada, Ampere

@@ -468,10 +468,10 @@ System Requirements
* **Notes:** FP8 features require Compute Capability 8.9+ (Ada/Hopper/Blackwell)

Installation Methods
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^

Docker (Recommended)
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^
The quickest way to get started with Transformer Engine is by using Docker images on
`NVIDIA GPU Cloud (NGC) Catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`_.

@@ -496,7 +496,7 @@ Where 25.04 (corresponding to April 2025 release) is the container version.
* NGC PyTorch 23.08+ containers include FlashAttention-2

pip Installation
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^

**Prerequisites for pip installation:**

@@ -534,7 +534,7 @@ Source Installation
`See the installation guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source>`_

Environment Variables
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
These environment variables can be set before installation to customize the build process:

* **CUDA_PATH**: Path to CUDA installation
@@ -545,7 +545,7 @@ These environment variables can be set before installation to customize the build
* **NVTE_BUILD_THREADS_PER_JOB**: Control threads per build job

Compiling with FlashAttention
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Transformer Engine supports both FlashAttention-2 and FlashAttention-3 in PyTorch for improved performance. FlashAttention-3 was added in release v1.11 and is prioritized over FlashAttention-2 when both are present in the environment.

You can verify which FlashAttention version is being used by setting these environment variables:
@@ -557,8 +557,9 @@ You can verify which FlashAttention version is being used by setting these environment variables:
It is a known issue that FlashAttention-2 compilation is resource-intensive and requires a large amount of RAM (see `bug <https://github.com/Dao-AILab/flash-attention/issues/358>`_), which may lead to out of memory errors during the installation of Transformer Engine. Please try setting **MAX_JOBS=1** in the environment to circumvent the issue.

.. troubleshooting-begin-marker-do-not-remove

Troubleshooting
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^

**Common Issues and Solutions:**

@@ -692,7 +693,7 @@ Papers
Videos
======

* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`_
* `Stable and Scalable FP8 Deep Learning Training on Blackwell | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62457/>`__
* `Blackwell Numerics for AI | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72458/>`_
* `Building LLMs: Accelerating Pretraining of Foundational Models With FP8 Precision | GTC 2025 <https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=zoho#/session/1726152813607001vnYK>`_
* `From FP8 LLM Training to Inference: Language AI at Scale | GTC 2025 <https://www.nvidia.com/en-us/on-demand/session/gtc25-s72799/>`_
docs/api/c/cast_transpose_noop.rst: 9 changes (9 additions & 0 deletions)
@@ -0,0 +1,9 @@
..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

cast_transpose_noop.h
=====================

.. doxygenfile:: cast_transpose_noop.h
docs/api/c/cudnn.rst: 9 changes (9 additions & 0 deletions)
@@ -0,0 +1,9 @@
..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

cudnn.h
=======

.. doxygenfile:: cudnn.h
docs/api/c/index.rst: 3 changes (3 additions & 0 deletions)
@@ -14,10 +14,13 @@ directly from C/C++, without Python.

transformer_engine.h <transformer_engine>
activation.h <activation>
cast_transpose_noop.h <cast_transpose_noop>
cast.h <cast>
cudnn.h <cudnn>
fused_attn.h <fused_attn>
fused_rope.h <fused_rope>
gemm.h <gemm>
multi_tensor.h <multi_tensor>
normalization.h <normalization>
padding.h <padding>
permutation.h <permutation>
docs/api/c/multi_tensor.rst: 9 changes (9 additions & 0 deletions)
@@ -0,0 +1,9 @@
..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

multi_tensor.h
==============

.. doxygenfile:: multi_tensor.h
docs/api/common.rst: 4 changes (4 additions & 0 deletions)
@@ -11,3 +11,7 @@ Common API
.. autoapiclass:: transformer_engine.common.recipe.DelayedScaling(margin=0, fp8_format=Format.HYBRID, amax_history_len=1024, amax_compute_algo="max", scaling_factor_compute_algo=None)

.. autoapiclass:: transformer_engine.common.recipe.MXFP8BlockScaling(fp8_format=Format.E4M3)

.. autoapiclass:: transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)

.. autoapiclass:: transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)
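
A recipe object can be passed to ``fp8_autocast`` to select the FP8 scaling strategy. Below is a minimal sketch; the layer sizes and input shapes are placeholders chosen for illustration:

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import Float8CurrentScaling

    # Current scaling: one scaling factor per tensor, computed from the
    # current tensor values rather than from an amax history.
    recipe = Float8CurrentScaling()

    layer = te.Linear(768, 768).cuda()
    inp = torch.randn(32, 768, device="cuda")

    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = layer(inp)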
docs/debug.rst: 14 changes (14 additions & 0 deletions)
@@ -0,0 +1,14 @@
..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

Precision debug tools
=====================

.. toctree::
   :caption: Precision debug tools

   debug/1_getting_started.rst
   debug/2_config_file_structure.rst
   debug/api
   debug/4_distributed.rst
docs/debug/1_getting_started.rst: 241 changes (241 additions & 0 deletions)
@@ -0,0 +1,241 @@
..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

Getting started
===============

.. note::

   Precision debug tools with `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ for Transformer Engine are currently supported only for PyTorch.

Transformer Engine provides a set of precision debug tools which allow you to easily:

- log statistics for each tensor in every matrix multiply (GEMM) operation,
- run selected GEMMs in higher precision,
- run current scaling (one scaling factor per tensor) for particular GEMMs,
- test new precisions and integrate them with FP8 training,
- and many more.

There are four things one needs to do to use the Transformer Engine debug features:

1. Create a configuration YAML file to configure the desired features.
2. Import and initialize the `Nvidia-DL-Framework-Inspect <https://github.com/NVIDIA/nvidia-dlfw-inspect>`_ tool, which is installed as a dependency of Transformer Engine.
3. Optionally pass ``name="..."`` when creating TE layers to make them easier to identify. If a name is not provided, one is inferred automatically.
4. Invoke ``debug_api.step()`` at the end of each forward-backward pass.

To start debugging, one needs to create a configuration YAML file. This file lists the features to be used in particular layers. There are two kinds of features:

- features provided by Transformer Engine, for example DisableFP8GEMM or LogTensorStats; these are listed in the :doc:`debug features API <3_api_features>` section,
- features defined by the user. For details on how to create a custom feature, see the :doc:`calls to Nvidia-DL-Framework-Inspect <3_api_te_calls>` section.

.. figure:: ./img/introduction.svg
   :align: center

   Fig 1: Example of Nvidia-DL-Framework-Inspect affecting a training script with 3 TE Linear layers.
   ``config.yaml`` contains the specification of the features used for each Linear layer. Some feature classes are provided by TE;
   one, ``UserProvidedPrecision``, is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.

Example training script
-----------------------

Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.

.. code-block:: python

    # train.py

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch import TransformerLayer

    hidden_size = 512
    num_attention_heads = 8

    transformer_layer = TransformerLayer(
        hidden_size=hidden_size,
        ffn_hidden_size=hidden_size,
        num_attention_heads=num_attention_heads,
    ).cuda()

    dummy_input = torch.randn(10, 32, hidden_size).cuda()
    dummy_target = torch.randn(10, 32, hidden_size).cuda()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)

    for epoch in range(5):
        transformer_layer.train()
        optimizer.zero_grad()
        with te.fp8_autocast(enabled=True):
            output = transformer_layer(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()

We will demonstrate two debug features on the code above:

1. Disabling FP8 precision for specific GEMM operations, such as the FC1 and FC2 forward propagation GEMMs.
2. Logging tensor statistics for other layers, such as activation statistics within the LayerNormLinear sub-layer of the TransformerLayer.

Config file
-----------

We need to prepare the configuration YAML file, as shown below:

.. code-block:: yaml

    # config.yaml

    fc1_fprop_to_fp8:
      enabled: True
      layers:
        layer_types: [fc1, fc2]  # contains fc1 or fc2 in name
      transformer_engine:
        DisableFP8GEMM:
          enabled: True
          gemms: [fprop]

    log_tensor_stats:
      enabled: True
      layers:
        layer_types: [layernorm_linear]  # contains layernorm_linear in name
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [max, min, mean, std, l1_norm]
          tensors: [activation]
          freq: 1
          start_step: 2
          end_step: 5

Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.
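
If you want to sanity-check that the file parses the way you expect, you can load it with PyYAML; this is a generic sketch, independent of the debug API:

.. code-block:: python

    import yaml

    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    # Show each feature section, whether it is enabled, and which layers it targets.
    for section, spec in cfg.items():
        print(section, "enabled:", spec["enabled"], "layers:", spec["layers"])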

Adjusting Python file
---------------------

.. code-block:: python

    # (...)

    import nvdlfw_inspect.api as debug_api

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        default_logging_enabled=True,
    )

    # initialization of the TransformerLayer with the name
    transformer_layer = TransformerLayer(
        name="transformer_layer",
        # ...
    )

    # (...)
    for epoch in range(5):
        # forward and backward pass
        # ...
        debug_api.step()

In the modified code above, the following changes were made:

1. Added an import for ``nvdlfw_inspect.api``.
2. Initialized Nvidia-DL-Framework-Inspect by calling ``debug_api.initialize()`` with the appropriate configuration: the path to the config file, the feature directories, and the log directory.
3. Added ``debug_api.step()`` after each forward-backward pass.
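
Putting it all together, the complete modified ``train.py`` might look like the sketch below; the ``feature_dirs`` path is an assumption and depends on where Transformer Engine is installed:

.. code-block:: python

    # train.py (modified)

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch import TransformerLayer
    import nvdlfw_inspect.api as debug_api

    # initialize the debug API before any TE layer is created
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        default_logging_enabled=True,
    )

    hidden_size = 512
    num_attention_heads = 8

    transformer_layer = TransformerLayer(
        hidden_size=hidden_size,
        ffn_hidden_size=hidden_size,
        num_attention_heads=num_attention_heads,
        name="transformer_layer",
    ).cuda()

    dummy_input = torch.randn(10, 32, hidden_size).cuda()
    dummy_target = torch.randn(10, 32, hidden_size).cuda()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)

    for epoch in range(5):
        transformer_layer.train()
        optimizer.zero_grad()
        with te.fp8_autocast(enabled=True):
            output = transformer_layer(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()
        debug_api.step()  # advance the debug step counter once per iteration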

Inspecting the logs
-------------------

Let's look at the files containing the logs. Two files will be created:

1. debug logs,
2. statistics logs.

Let's look inside them!

In the main log file, you can find detailed information about the behavior of the transformer layer's GEMMs. You can see that the ``fc1`` and ``fc2`` fprop GEMMs are run in high precision, as intended.

.. code-block:: text

# log/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log

INFO - Default logging to file enabled at ./log
INFO - Reading config from ./config.yaml.
INFO - Loaded configs for dict_keys(['fc1_fprop_to_fp8', 'log_tensor_stats']).
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm fprop - FP8 quantization
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm wgrad - FP8 quantization
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm fprop - FP8 quantization
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm dgrad - FP8 quantization
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm dgrad - FP8 quantization
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm wgrad - FP8 quantization
INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm fprop - FP8 quantization
INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm wgrad - FP8 quantization
INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm fprop - FP8 quantization
INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm dgrad - FP8 quantization
INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm dgrad - FP8 quantization
INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm wgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm wgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm dgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm dgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm wgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm wgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm dgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm dgrad - FP8 quantization
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm wgrad - FP8 quantization
INFO - transformer_layer.self_attention.layernorm_qkv: Feature=LogTensorStats, API=look_at_tensor_before_process: activation
....

The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``) contains statistics for the tensors we requested in ``config.yaml``.

.. code-block:: text

# log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log

INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000002 value=4.3188
INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000002 value=-4.3386
INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000002 value=0.0000
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000002 value=0.9998
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000002 value=130799.6953
INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000003 value=4.3184
INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000003 value=-4.3381
INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000003 value=0.0000
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000003 value=0.9997
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000003 value=130788.1016
INFO - transformer_layer.self_attention.layernorm_qkv_activation_max iteration=000004 value=4.3181
INFO - transformer_layer.self_attention.layernorm_qkv_activation_min iteration=000004 value=-4.3377
INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean iteration=000004 value=0.0000
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000004 value=0.9996
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969
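
The statistics lines have a regular format, so they are easy to post-process. Below is a small sketch, assuming only the log format shown above, that collects each statistic into a series suitable for plotting:

.. code-block:: python

    import re
    from collections import defaultdict

    LINE = re.compile(r"INFO - (\S+) iteration=(\d+) value=(-?[\d.]+)")

    series = defaultdict(list)  # stat name -> list of (iteration, value)
    with open("log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log") as f:
        for line in f:
            m = LINE.search(line)
            if m:
                name, step, value = m.groups()
                series[name].append((int(step), float(value)))

    # e.g. print the max statistic of the QKV activations over iterations
    print(series["transformer_layer.self_attention.layernorm_qkv_activation_max"])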

Logging using TensorBoard
-------------------------

Precision debug tools support logging with `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, pass the ``tb_writer`` argument to ``debug_api.initialize()``. Let's modify the ``train.py`` file.

.. code-block:: python

    # (...)

    from torch.utils.tensorboard import SummaryWriter

    tb_writer = SummaryWriter('./tensorboard_dir/run1')

    # add tb_writer to the Debug API initialization
    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log",
        tb_writer=tb_writer,
    )

    # (...)

Let's run the training and open TensorBoard with ``tensorboard --logdir=./tensorboard_dir/run1``:

.. figure:: ./img/tensorboard.png
   :align: center

   Fig 2: TensorBoard with plotted stats.