Fixing the ESM2 checkpointing issue #842
Conversation
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Codecov Report: All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.

Coverage Diff (main → #842):

              main    #842     +/-
  Coverage   84.37%  84.39%  +0.02%
  Files         138     138
  Lines        8690    8690
  Hits         7332    7334      +2
  Misses       1358    1356      -2
Resolved review thread (outdated) on sub-packages/bionemo-esm2/tests/bionemo/esm2/scripts/test_train_esm2.py
There is similar logic for the pydantic-based workflow here: https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/train.py#L59, but checkpoints in this code path are passed into …. Do you understand what's going on here? What are the tradeoffs when passing the ModelCheckpoint callback via …?
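The tradeoff being asked about can be illustrated with a framework-free toy (the class names `CheckpointCallback`, `Logger`, and `Trainer` here are hypothetical stand-ins, not the real NeMo/Lightning API): a checkpoint callback can be attached either directly to the trainer's callback list or owned by the logger, and the two wiring paths can silently diverge in configuration.

```python
class CheckpointCallback:
    """Toy stand-in for a ModelCheckpoint callback (hypothetical API)."""

    def __init__(self, save_optimizer=False):
        self.save_optimizer = save_optimizer


class Logger:
    """Toy logger that, like a NeMo-style logger, can own the checkpoint callback."""

    def __init__(self, ckpt=None):
        self.ckpt = ckpt


class Trainer:
    def __init__(self, callbacks=None, logger=None):
        self.callbacks = list(callbacks or [])
        # If the logger owns a checkpoint callback, adopt it: one place
        # configures checkpointing for the whole run.
        if logger is not None and logger.ckpt is not None:
            self.callbacks.append(logger.ckpt)


# Path A: callback passed via the logger -> single source of truth.
trainer_a = Trainer(logger=Logger(ckpt=CheckpointCallback(save_optimizer=True)))

# Path B: callback passed directly via `callbacks=` -> easy for this path to
# diverge from the logger-driven one (e.g. forgetting save_optimizer=True).
trainer_b = Trainer(callbacks=[CheckpointCallback()])

assert trainer_a.callbacks[0].save_optimizer
assert not trainer_b.callbacks[0].save_optimizer
```

The sketch only shows why two entry points for the same callback invite configuration drift; the real tradeoffs in NeMo also involve resume logic and logging directories.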
skothenhill-nv
left a comment
Approved, but please address the divergence from sub-packages/bionemo-llm/src/bionemo/llm/train.py.
Ideally all of our models would use the same workflow. We should understand the difference and whether it is the reason we saw this bug.
Or maybe we should document the differences between the two code paths in the description of this PR?
Signed-off-by: Polina Binder <pbinder@nvidia.com>
|
Nit: can you fix the typo in the PR title?
Resolved review thread (outdated) on sub-packages/bionemo-esm2/tests/bionemo/esm2/scripts/test_train_esm2.py
jwilber
left a comment
Added style comment, otherwise LGTM
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Description

This addresses: #757

Type of changes

In the original code the optimizer was not saved in the checkpoint, but it is expected by the Megatron strategy. Saving the optimizer has been added to the checkpoint callback, and the test has been updated. The checkpointing callback is now attached through the NeMo logger, matching the training path in sub-packages/bionemo-llm/src/bionemo/llm/train.py.

Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
Signed-off-by: Ubuntu <camirr@nvidia.com>
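The failure mode this PR fixes can be sketched without any framework code (a plain-Python stand-in, not the actual Megatron or NeMo implementation): if the checkpoint omits optimizer state that the resume path requires, loading fails; including it makes resumption work.

```python
def save_checkpoint(model_state, optimizer_state=None):
    """Build a checkpoint dict; optimizer state is optional (the old behavior)."""
    ckpt = {"model": model_state}
    if optimizer_state is not None:
        ckpt["optimizer"] = optimizer_state
    return ckpt


def resume(ckpt):
    """A Megatron-style strategy expects optimizer state to be present."""
    if "optimizer" not in ckpt:
        raise KeyError("optimizer state missing from checkpoint")
    return ckpt["model"], ckpt["optimizer"]


# Before the fix: optimizer not saved -> resuming training fails.
broken = save_checkpoint({"w": 0.5})
try:
    resume(broken)
except KeyError as err:
    print("resume failed:", err)

# After the fix: the checkpoint callback also saves the optimizer -> resume works.
fixed = save_checkpoint({"w": 0.5}, optimizer_state={"lr": 1e-3, "step": 100})
model, optim = resume(fixed)
print("resumed at step", optim["step"])
```

The real fix wires this requirement into the checkpoint callback so every saved checkpoint carries the state the strategy expects at load time.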