Conversation

@jberchtold-nvidia (Contributor) commented Dec 2, 2025

Description

This PR allows us to avoid rematerializing TransformerEngine (TE) quantizations. TE supports fused kernels that compute both the forward and backward layouts in a single kernel; however, these are only useful if the alternate layouts are saved for the backward pass rather than recomputed.

This PR introduces two new remat policies, `minimal_with_quantization` and `minimal_with_context_and_quantization`, which extend the existing policies with support for checkpointing TE quantizations. A rough illustration of the pattern follows below.
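For reference, here is a minimal sketch of the general approach in JAX, not the PR's actual code: quantization outputs are tagged with `jax.ad_checkpoint.checkpoint_name`, and the extended policy adds those names to the set of saved activations so they are not recomputed in the backward pass. The names `te_quantization` and `out_proj` are hypothetical placeholders.

```python
# Minimal sketch (assumed names, not the PR's implementation) of a remat policy
# that saves tagged quantization outputs instead of rematerializing them.
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name

def quantize_and_project(x, w):
    # Tag the (stand-in) quantized activation so a policy can match it by name.
    x_q = checkpoint_name(x, name="te_quantization")
    return jnp.dot(x_q, w)

# Existing "minimal"-style policy: save only a small set of named activations.
minimal = jax.checkpoint_policies.save_only_these_names("out_proj")

# Extended policy: additionally save the quantization outputs so the fused
# forward/backward-layout kernels do not have to re-run in the backward pass.
minimal_with_quantization = jax.checkpoint_policies.save_only_these_names(
    "out_proj", "te_quantization"
)

layer = jax.checkpoint(quantize_and_project, policy=minimal_with_quantization)

def loss(x, w):
    return jnp.sum(layer(x, w))

grads = jax.grad(loss, argnums=1)(jnp.ones((4, 8)), jnp.ones((8, 8)))
```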

Tests

Tested locally with end-to-end workloads and confirmed that the quantization operations were saved rather than rematerialized when these policies are enabled.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@richjames0 (Collaborator) left a comment

lgtm
