
fixing the ESM2 checkpointing issue#842

Merged
polinabinder1 merged 9 commits into main from pbinder/esm2_issue_fix
May 5, 2025
Conversation

Collaborator

@polinabinder1 polinabinder1 commented Apr 23, 2025

Description

This addresses: #757

Type of changes

In the original code the optimizer state was not saved in the checkpoint, but the Megatron strategy expects it when resuming. Saving the optimizer has been added to the checkpoint callback, and the test has been updated accordingly. The checkpointing callback is now registered through the NeMo logger, matching the training path in sub-packages/bionemo-llm/src/bionemo/llm/train.py.
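The failure mode can be illustrated with a minimal, framework-free sketch (all names here are hypothetical; the actual fix lives in the NeMo checkpoint callback): a resume path that requires optimizer state fails when the save path omits it.

```python
# Hypothetical illustration of the bug: a save path that omits optimizer
# state, and a resume path (standing in for the Megatron strategy's
# expectation) that requires it.

def save_checkpoint(model_state, optimizer_state, *, save_optimizer):
    """Build a checkpoint dict; optionally include the optimizer state."""
    ckpt = {"model": dict(model_state)}
    if save_optimizer:
        ckpt["optimizer"] = dict(optimizer_state)
    return ckpt

def resume_from_checkpoint(ckpt):
    """Stand-in for the strategy's resume logic: optimizer is mandatory."""
    if "optimizer" not in ckpt:
        raise KeyError("checkpoint is missing optimizer state")
    return ckpt["model"], ckpt["optimizer"]

model = {"weight": 0.5}
optim = {"step": 100, "lr": 1e-4}

# Before the fix: optimizer omitted, so resuming fails.
broken = save_checkpoint(model, optim, save_optimizer=False)
try:
    resume_from_checkpoint(broken)
except KeyError as err:
    print("resume failed:", err)

# After the fix: optimizer saved, so resuming succeeds.
fixed = save_checkpoint(model, optim, save_optimizer=True)
restored_model, restored_optim = resume_from_checkpoint(fixed)
print(restored_optim["step"])  # 100
```

This is only the shape of the problem; in the real code the optimizer state is produced and consumed by NeMo/Megatron internals rather than plain dicts.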

Signed-off-by: Polina Binder <pbinder@nvidia.com>
Collaborator

@trvachov trvachov left a comment


@jwilber can you review as well please?


codecov-commenter commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.39%. Comparing base (3936231) to head (54e9a8e).
Report is 65 commits behind head on main.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #842      +/-   ##
==========================================
+ Coverage   84.37%   84.39%   +0.02%     
==========================================
  Files         138      138              
  Lines        8690     8690              
==========================================
+ Hits         7332     7334       +2     
+ Misses       1358     1356       -2     
Files with missing lines | Coverage Δ
...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py | 93.79% <ø> (ø)

... and 1 file with indirect coverage changes

@skothenhill-nv
Collaborator

There is similar logic here: https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/train.py#L59

for the pydantic-based workflow. But checkpoints in this code path are passed into setup_nemo_lightning_logger without the additional arguments to the callback, yet we do not seem to hit the same error.

Do you understand what's going on here? What are the tradeoffs of passing the ModelCheckpoint callback via the Trainer vs. setup_nemo_lightning_logger? It makes me uncomfortable that we have hit a point of divergence.
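The two registration paths being contrasted can be sketched with a toy model (all classes and helpers below are hypothetical stand-ins, not the real NeMo/Lightning APIs): whether the checkpoint callback is handed to the trainer directly or attached by the logger-setup helper, the trainer should end up owning one callback, and the divergence risk is that the two paths construct it with different arguments.

```python
# Hypothetical sketch of the two registration paths discussed above:
# (1) pass the checkpoint callback directly to the Trainer, or
# (2) hand it to a logger-setup helper, which attaches it to the trainer.

class CheckpointCallback:
    def __init__(self, save_optimizer=False):
        # Whether optimizer state is included in saved checkpoints.
        self.save_optimizer = save_optimizer

class Trainer:
    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

def setup_logger(trainer, ckpt_callback=None):
    """Stand-in for a helper like setup_nemo_lightning_logger that can
    attach the checkpoint callback on the trainer's behalf."""
    if ckpt_callback is not None and ckpt_callback not in trainer.callbacks:
        trainer.callbacks.append(ckpt_callback)
    return trainer

# Path 1: callback passed via the Trainer.
t1 = Trainer(callbacks=[CheckpointCallback(save_optimizer=True)])

# Path 2: callback passed via the logger setup.
t2 = setup_logger(Trainer(), CheckpointCallback(save_optimizer=True))

# Both paths are equivalent only if the callback is built identically.
assert all(cb.save_optimizer for cb in t1.callbacks + t2.callbacks)
```

The point of the sketch is that the bug is not in which path is used but in whether both paths configure the callback with the same arguments (here, `save_optimizer`).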

Collaborator

@skothenhill-nv skothenhill-nv left a comment


Approved, but please address the divergence with sub-packages/bionemo-llm/src/bionemo/llm/train.py.

Ideally all of our models would use the same workflow. We should understand the difference and whether it is the reason we saw this bug.

Or maybe we should document the differences between the two code paths in the description of this PR?

@polinabinder1 polinabinder1 enabled auto-merge April 25, 2025 20:24
@trvachov
Collaborator

Nit: can you fix the typo in the PR title?

@polinabinder1 polinabinder1 changed the title fixing the ESM2 checkointing issue fixing the ESM2 checkpointing issue Apr 25, 2025
Collaborator

@jwilber jwilber left a comment


Added style comment, otherwise LGTM

@polinabinder1 polinabinder1 added this pull request to the merge queue May 5, 2025
Merged via the queue into main with commit 8a1a701 May 5, 2025
10 checks passed
@polinabinder1 polinabinder1 deleted the pbinder/esm2_issue_fix branch May 5, 2025 19:27
dorotat-nv pushed a commit that referenced this pull request May 6, 2025
trvachov pushed a commit that referenced this pull request May 16, 2025
camirr-nv pushed a commit that referenced this pull request Jun 26, 2025
