
fixing the ESM2 checkpointing issue#842

Merged
polinabinder1 merged 9 commits into main from pbinder/esm2_issue_fix
May 5, 2025
Conversation

Collaborator

@polinabinder1 polinabinder1 commented Apr 23, 2025

Description

This addresses: #757

Type of changes

In the original code the optimizer state was not saved in the checkpoint, but the Megatron strategy expects it when resuming. Saving the optimizer has been added to the checkpoint callback, and the test has been updated accordingly. The checkpointing callback is now registered through the NeMo logger, matching the training path in sub-packages/bionemo-llm/src/bionemo/llm/train.py.
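The failure mode can be illustrated with a minimal, framework-free sketch (all names here are hypothetical; the actual fix lives in the NeMo checkpoint callback): a resume path that requires optimizer state fails when the save path omits it.

```python
# Hypothetical illustration of the bug: a save path that omits optimizer
# state, and a resume path (standing in for the Megatron strategy's
# expectation) that requires it.

def save_checkpoint(model_state, optimizer_state, *, save_optimizer):
    """Build a checkpoint dict; optionally include the optimizer state."""
    ckpt = {"model": dict(model_state)}
    if save_optimizer:
        ckpt["optimizer"] = dict(optimizer_state)
    return ckpt

def resume_from_checkpoint(ckpt):
    """Stand-in for the strategy's resume logic: optimizer is mandatory."""
    if "optimizer" not in ckpt:
        raise KeyError("checkpoint is missing optimizer state")
    return ckpt["model"], ckpt["optimizer"]

model = {"weight": 0.5}
optim = {"step": 100, "lr": 1e-4}

# Before the fix: optimizer omitted, so resuming fails.
broken = save_checkpoint(model, optim, save_optimizer=False)
try:
    resume_from_checkpoint(broken)
except KeyError as err:
    print("resume failed:", err)

# After the fix: optimizer saved, so resuming succeeds.
fixed = save_checkpoint(model, optim, save_optimizer=True)
restored_model, restored_optim = resume_from_checkpoint(fixed)
print(restored_optim["step"])  # 100
```

This is only the shape of the problem; in the real code the optimizer state is produced and consumed by NeMo/Megatron internals rather than plain dicts.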

Signed-off-by: Polina Binder <pbinder@nvidia.com>
Collaborator

@trvachov trvachov left a comment


@jwilber can you review as well please?


codecov-commenter commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.39%. Comparing base (3936231) to head (54e9a8e).
Report is 65 commits behind head on main.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #842      +/-   ##
==========================================
+ Coverage   84.37%   84.39%   +0.02%     
==========================================
  Files         138      138              
  Lines        8690     8690              
==========================================
+ Hits         7332     7334       +2     
+ Misses       1358     1356       -2     
Files with missing lines | Coverage Δ
...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py | 93.79% <ø> (ø)

... and 1 file with indirect coverage changes

@skothenhill-nv
Collaborator

There is similar logic here: https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/train.py#L59

for the pydantic-based workflow. But checkpoints in this code path are passed into setup_nemo_lightning_logger without the additional arguments to the callback, yet we do not seem to hit the same error.

Do you understand what's going on here? What are the tradeoffs of passing the ModelCheckpoint callback via the Trainer vs. setup_nemo_lightning_logger? It makes me uncomfortable that we have hit a point of divergence.
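The two registration paths being contrasted can be sketched with a toy model (all classes and helpers below are hypothetical stand-ins, not the real NeMo/Lightning APIs): whether the checkpoint callback is handed to the trainer directly or attached by the logger-setup helper, the trainer should end up owning one callback, and the divergence risk is that the two paths construct it with different arguments.

```python
# Hypothetical sketch of the two registration paths discussed above:
# (1) pass the checkpoint callback directly to the Trainer, or
# (2) hand it to a logger-setup helper, which attaches it to the trainer.

class CheckpointCallback:
    def __init__(self, save_optimizer=False):
        # Whether optimizer state is included in saved checkpoints.
        self.save_optimizer = save_optimizer

class Trainer:
    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

def setup_logger(trainer, ckpt_callback=None):
    """Stand-in for a helper like setup_nemo_lightning_logger that can
    attach the checkpoint callback on the trainer's behalf."""
    if ckpt_callback is not None and ckpt_callback not in trainer.callbacks:
        trainer.callbacks.append(ckpt_callback)
    return trainer

# Path 1: callback passed via the Trainer.
t1 = Trainer(callbacks=[CheckpointCallback(save_optimizer=True)])

# Path 2: callback passed via the logger setup.
t2 = setup_logger(Trainer(), CheckpointCallback(save_optimizer=True))

# Both paths are equivalent only if the callback is built identically.
assert all(cb.save_optimizer for cb in t1.callbacks + t2.callbacks)
```

The point of the sketch is that the bug is not in which path is used but in whether both paths configure the callback with the same arguments (here, `save_optimizer`).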

Collaborator

@skothenhill-nv skothenhill-nv left a comment


Approved, but please address the divergence with sub-packages/bionemo-llm/src/bionemo/llm/train.py.

Ideally all of our models would use the same workflow. We should understand the difference and whether it is the reason we saw this bug.

Or maybe we should document the differences between the two code paths in the description of this PR?

@polinabinder1 polinabinder1 enabled auto-merge April 25, 2025 20:24
@trvachov
Collaborator

Nit: can you fix the typo in the PR title?

@polinabinder1 polinabinder1 changed the title fixing the ESM2 checkointing issue fixing the ESM2 checkpointing issue Apr 25, 2025
Collaborator

@jwilber jwilber left a comment


Added style comment, otherwise LGTM

@polinabinder1 polinabinder1 added this pull request to the merge queue May 5, 2025
Merged via the queue into main with commit 8a1a701 May 5, 2025
10 checks passed
@polinabinder1 polinabinder1 deleted the pbinder/esm2_issue_fix branch May 5, 2025 19:27
dorotat-nv pushed a commit that referenced this pull request May 6, 2025
trvachov pushed a commit that referenced this pull request May 16, 2025
camirr-nv pushed a commit that referenced this pull request Jun 26, 2025
