@shino16 (Collaborator) commented Sep 25, 2025

This is a workaround for #2527. torch.ops.higher_order.tag_activation_checkpoint does not perform activation checkpointing when run in eager mode, so we convert it back to torch.utils.checkpoint.checkpoint.
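For readers unfamiliar with the mechanics, here is a minimal, illustrative sketch of what such a conversion could look like on a Dynamo-produced FX graph. This is not the PR's actual implementation; the function name and the use_reentrant=False choice are assumptions made for the example.

```python
import torch
from torch.fx import GraphModule


def convert_tag_nodes_sketch(gm: GraphModule) -> None:
    """Illustrative only: rewrite tag_activation_checkpoint call nodes so the
    graph calls torch.utils.checkpoint.checkpoint, which does recompute
    activations when the GraphModule runs eagerly. Mutates gm in place."""
    for node in gm.graph.nodes:
        if (
            node.op == "call_function"
            and node.target is torch.ops.higher_order.tag_activation_checkpoint
        ):
            # The higher-order op's first argument is the checkpointed
            # sub-GraphModule and the rest are its runtime inputs, so the call
            # already lines up with checkpoint(fn, *args).
            node.target = torch.utils.checkpoint.checkpoint
            node.kwargs = {**node.kwargs, "use_reentrant": False}
    gm.recompile()
```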

@shino16 (Collaborator, Author) commented Sep 25, 2025

This fixes #2501.

on main (280c57e)

[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB. GPU 2 has a total capacity of 139.72 GiB of which 3.71 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 127.71 GiB is allocated by PyTorch, and 6.76 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

^ happening on each rank

in this PR (ac5508a)

Model name: Gemma-2-27b
Seq Length: 8192
Micro BS: 1
Global BS: 8
Number of Layers: 46
Number of parameters: 3.55B
Distributed Mode: fsdp
Sharding Mode: zero3
Bucketing: block
Compiler: dynamo_thunder
Low Precision Mode: none
Average iter time: 9991.12 ms
Memory used: 81.76 GB
Tokens/s: 6555.90
Tokens/s/GPU: 819.49
TFLOP/s: 1192.38


# Dynamo uses lazy generation of the underlying Python code, so we need to
# force recompilation of the GraphModule before passing it to Thunder.
recompile_graph(gm)
@shino16 (Collaborator, Author) commented:
Recompiling here was added in commit 0338afe, when we did not yet have the graph-splitting logic. Now we break the graph down in the subsequent code, so the recompile is no longer needed.
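As a side note, here is a toy illustration of what recompilation does for a mutated torch.fx.GraphModule, independent of Thunder's recompile_graph helper: the Python generated for forward goes stale after graph edits until recompile() regenerates it.

```python
import torch
from torch import fx


def f(x):
    return torch.relu(x) + 1


gm = fx.symbolic_trace(f)

# Swap relu for gelu directly in the graph; gm.code / gm.forward still reflect
# the old graph until we regenerate them.
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.relu:
        node.target = torch.nn.functional.gelu

gm.recompile()  # regenerate the Python code from the mutated graph
print(gm.code)
```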

)
example_input_metadatas.append(list(example_input_metadata))
# Replace PyTorch operators within the checkpointed function with the corresponding Thunder operators
checkpoint_converter(split_gm, graph_module)
@shino16 (Collaborator, Author) commented:
torch.utils.checkpoint.checkpoint is Thunder-traceable.
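A minimal sketch of relying on that traceability; whether this exact pattern runs end to end depends on the Thunder version, so treat it as an assumption rather than a guaranteed API contract.

```python
import torch
from torch.utils.checkpoint import checkpoint

import thunder


def fn(x, w):
    def block(t, w):
        return torch.nn.functional.gelu(t @ w)

    # Activations of `block` are recomputed during backward instead of stored.
    return checkpoint(block, x, w, use_reentrant=False)


jfn = thunder.jit(fn)
x = torch.randn(16, 16, requires_grad=True)
w = torch.randn(16, 16, requires_grad=True)
jfn(x, w).sum().backward()
```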

@shino16 force-pushed the inductor-checkpoint branch from 2bdfeae to 49f813b on September 26, 2025 at 11:44
@shino16 marked this pull request as ready for review on September 26, 2025 at 11:46
@KaelanDt (Collaborator) left a comment:
thank you @shino16


initial_mem = torch.cuda.memory_allocated()

x = torch.randn((1024 // 4, 1024, 1024), device="cuda", requires_grad=True)
A reviewer (Collaborator) commented:
It would be better to use a smaller input, as these tests run in parallel.
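(For scale: that tensor is (1024 // 4) × 1024 × 1024 = 268,435,456 float32 elements, i.e. exactly 1 GiB, before counting gradients or saved activations, so shrinking it meaningfully reduces pressure when tests share a GPU.)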

@shino16 (Collaborator, Author) replied:
Ah that's a good point, thank you!

Args:
gm (torch.fx.GraphModule): The GraphModule of the checkpointed function, which is modified in place.
tag_activation_checkpoint only marks nodes for the torch.compile stack but does not perform actual checkpointing in eager mode.
A reviewer (Collaborator) commented:
It would be nice to mention that this function mutates the gm.

@kshitij12345 (Collaborator) left a comment:
LGTM, thanks @shino16.

Let's also wait for a review from @kiya00.

@kiya00 (Collaborator) left a comment:

Thank you @shino16 for the fix. There are two tests, test_checkpoint_converter and test_checkpoint_converter_submodule, that exercise the old converter, but I think we can keep them, as they also seem to validate the functionality of convert_checkpoint_tags.
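For context, here is a hypothetical, self-contained snippet (not one of the repository's tests) showing how such tests can obtain a graph containing the tag node: capture what Dynamo hands to the backend, then assert on the node targets before and after conversion.

```python
import torch
from torch.utils.checkpoint import checkpoint

captured = []


def capture_backend(gm, example_inputs):
    # Stash the FX graph that Dynamo produced and run it unchanged.
    captured.append(gm)
    return gm


def fn(x):
    return checkpoint(lambda t: torch.sin(t) * 2, x, use_reentrant=False)


torch.compile(fn, backend=capture_backend)(torch.randn(4, requires_grad=True))

# Dynamo represents the checkpointed region with the tag HOP; a converter test
# can assert these nodes exist here and are gone after the conversion runs.
assert any(
    n.op == "call_function"
    and n.target is torch.ops.higher_order.tag_activation_checkpoint
    for n in captured[0].graph.nodes
)
```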
