
Conversation

@javak87
Contributor

@javak87 javak87 commented Oct 28, 2025

Description

This is a merge branch that combines the following PRs:
#1151
#1152
#1153
#1155
#1156

Issue Number

Closes #1141

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

Comparing the baseline and current PR performance

When pred_gradient_checkpoint_mode is set to false, the performance and peak GPU memory usage are as follows:

default_config.yml:

../WeatherGenerator-private/hpc/launch-slurm.py --time 60
[screenshot: all_checkpoint_off]

mixed.yml:

../WeatherGenerator-private/hpc/launch-slurm.py --time 60 --config ./config/mixed.yml

[screenshot: all_checkpoint_off_mixed]

@javak87 javak87 marked this pull request as draft October 28, 2025 10:16
@javak87 javak87 changed the title from "Test turning all grad checkpointing off" to "Test turning all grad checkpointings off" Oct 28, 2025
Collaborator

@tjhunter tjhunter left a comment


@javak87 thanks for assembling this PR. I suggest a way to abstract all the checkpoint changes; happy to talk about it if anything is unclear.


# embed provided input data
x = peh(checkpoint(self.embed, x_in.transpose(-2, -1), use_reentrant=False))
if self.embed_gradient_checkpoint_mode:
Collaborator

@tjhunter tjhunter Dec 3, 2025


This is not the right place for branching, because you are then forced to copy the same logic across each branch.

Here is what I would suggest: we write a small conditional checkpoint function in one of the utility files (the signature is based on the PyTorch checkpoint function):

    from typing import Callable, ContextManager, Optional, Tuple

    from torch.utils.checkpoint import _DEFAULT_DETERMINISM_MODE, checkpoint, noop_context_fn


    def cond_checkpoint(
        enable_checkpoint: bool,
        function,
        *args,
        use_reentrant: Optional[bool] = None,
        context_fn: Callable[[], Tuple[ContextManager, ContextManager]] = noop_context_fn,
        determinism_check: str = _DEFAULT_DETERMINISM_MODE,
        debug: bool = False,
        **kwargs,
    ):
        # Checkpoint the call only when requested; otherwise run the function directly.
        if enable_checkpoint:
            return checkpoint(
                function,
                *args,
                use_reentrant=use_reentrant,
                context_fn=context_fn,
                determinism_check=determinism_check,
                debug=debug,
                **kwargs,
            )
        return function(*args, **kwargs)

and then, the only change required is to convert:

x = peh(checkpoint(self.embed, x_in.transpose(-2, -1), use_reentrant=False))

into:

x = peh(cond_checkpoint(self.embed_gradient_checkpoint_mode, self.embed, x_in.transpose(-2, -1), use_reentrant=False))

How does that sound?

Collaborator


This way, we can also build this rule into the linter: do not use checkpoint directly; use cond_checkpoint instead.
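
For illustration, a minimal sketch of what such a lint rule could look like as a standalone check, assuming it would be invoked from ./scripts/actions.sh lint; the src/ search path and the name of the exempt utility module are hypothetical:

    import ast
    import pathlib
    import sys

    # Hypothetical name of the module that is allowed to call checkpoint directly
    # (the one defining cond_checkpoint).
    ALLOWED_FILE = "utils.py"


    def direct_checkpoint_calls(path: pathlib.Path) -> list[int]:
        """Return the line numbers of direct checkpoint(...) calls in a source file."""
        tree = ast.parse(path.read_text(), filename=str(path))
        linenos = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                func = node.func
                name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
                if name == "checkpoint":
                    linenos.append(node.lineno)
        return linenos


    def main() -> int:
        failures = []
        for path in pathlib.Path("src").rglob("*.py"):
            if path.name == ALLOWED_FILE:
                continue
            for lineno in direct_checkpoint_calls(path):
                failures.append(f"{path}:{lineno}: call cond_checkpoint instead of checkpoint")
        print("\n".join(failures))
        return 1 if failures else 0


    if __name__ == "__main__":
        sys.exit(main())

Walking the AST avoids false positives on strings or comments, and exempting the utility module keeps the one legitimate call site out of the report.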

Contributor Author


I'm okay with this helper function. However, I would rather resolve it in __init__ for each checkpoint, because evaluating these conditional helper functions in forward degrades performance.
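
For illustration, a minimal sketch of resolving the conditional once at construction time with functools.partial; EmbedBlock and its attribute names are hypothetical, not the actual WeatherGenerator modules:

    import functools

    import torch
    from torch.utils.checkpoint import checkpoint


    class EmbedBlock(torch.nn.Module):
        """Hypothetical module illustrating construction-time binding of the checkpoint wrapper."""

        def __init__(self, embed: torch.nn.Module, embed_gradient_checkpoint_mode: bool):
            super().__init__()
            self.embed = embed
            # Decide once, in __init__, whether calls go through gradient checkpointing,
            # so forward() contains no per-step branching.
            if embed_gradient_checkpoint_mode:
                self._embed_fn = functools.partial(checkpoint, self.embed, use_reentrant=False)
            else:
                self._embed_fn = self.embed.__call__

        def forward(self, x_in: torch.Tensor) -> torch.Tensor:
            return self._embed_fn(x_in.transpose(-2, -1))

Binding either the checkpointed or the plain callable once in __init__ keeps forward free of per-call conditionals, which is the performance concern raised above.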

cell_lens_c,
use_reentrant=False,
)
if self.cf.ae_adapter_grdient_checkpoint_mode:
Collaborator


always assume the parameter may be missing:

self.cf.get("ae_adapter_grdient_checkpoint_mode", True)

@github-project-automation github-project-automation bot moved this to In Progress in WeatherGen-dev Dec 3, 2025
@javak87
Contributor Author

javak87 commented Jan 8, 2026

Because of multiple conflicts, I opened a new PR.
New PR #1564

@javak87 javak87 closed this Jan 8, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in WeatherGen-dev Jan 8, 2026


Linked issue: Balance Memory and Compute in Gradient Checkpointing (#1141)