-
Notifications
You must be signed in to change notification settings - Fork 65
fix: directly save final ckpt in save_model_dir #626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: directly save final ckpt in save_model_dir #626
Conversation
|
Thanks for making a pull request! 😃 |
dushyantbehl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than doing this can we pass the save_dir directly with hf_converted_output_dir attached if intermediate else not if this is final from the call to this function?
This would reduce the code and have a clean function.
Yeah, the path-based approach sounds simpler, but I had thought that it could easily lead to inconsistent directory handling. I did consider it, but I think the flag-based version is better since it keeps all folder-structure logic in one place. That way, if we ever change stuff (like renaming |
Can you explain what you mean by |
With inconsistent I had meant passing different paths outside the checkpoint function, the current version contains all the logic within the checkpoint function and simply abstracts the intermediate vs final checkpoint logic within the function |
93bda6b to
a346f92
Compare
Signed-off-by: yashasvi <[email protected]>
a346f92 to
e3f0b8a
Compare
dushyantbehl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1e844d3
into
foundation-model-stack:main

Description of the change
With this PR the the fms_hf_tuning stack will directly save final ckpt in save_model_dir rather than nesting it further within the
/{save_model_dir}/hf_converted_checkpointWhen using
SHARDED_STATE_DICT, each intermediate checkpoint produces distributed shard files that must be converted into Hugging Face–compatible format.which are saved within
but for the final checkpoint nesting it again under
hf_converted_checkpointis unnecessary and hence the change..Related issue number
How to verify the PR
Following is the screenshot of the job with the PR changes.
hf_converted_checkpointsubdirWas the PR tested