Skip to content

Conversation

@YashasviChaurasia
Copy link
Contributor

Description of the change

With this PR the the fms_hf_tuning stack will directly save final ckpt in save_model_dir rather than nesting it further within the /{save_model_dir}/hf_converted_checkpoint

When using SHARDED_STATE_DICT, each intermediate checkpoint produces distributed shard files that must be converted into Hugging Face–compatible format.
which are saved within

{output_dir}/checkpoint-<step>/hf_converted_checkpoint/

but for the final checkpoint nesting it again under hf_converted_checkpoint is unnecessary and hence the change..

Related issue number

How to verify the PR

Following is the screenshot of the job with the PR changes.

  1. The changes ensures to save hf converted intermediate checkpoint within hf_converted_checkpoint subdir
  2. While ensuring the final checkpoint directly has hf compatible format without any sub-dir..
image

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions
Copy link

github-actions bot commented Nov 3, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Nov 3, 2025
@YashasviChaurasia YashasviChaurasia marked this pull request as ready for review November 3, 2025 02:56
Copy link
Collaborator

@dushyantbehl dushyantbehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing this can we pass the save_dir directly with hf_converted_output_dir attached if intermediate else not if this is final from the call to this function?

This would reduce the code and have a clean function.

@YashasviChaurasia
Copy link
Contributor Author

Rather than doing this can we pass the save_dir directly with hf_converted_output_dir attached if intermediate else not if this is final from the call to this function?

This would reduce the code and have a clean function.

Yeah, the path-based approach sounds simpler, but I had thought that it could easily lead to inconsistent directory handling. I did consider it, but I think the flag-based version is better since it keeps all folder-structure logic in one place. That way, if we ever change stuff (like renaming hf_converted_checkpoint), it would be a one-line update.. also I think the callers just express intent (is_intermediate=True/False) instead of repeating path joins..
I see the logic within the checkpiont function might be condition based but it can help with simple to deal with rather than understanding why did the path join happen before the call.. simple a quick look at the checkpoint save function explains alot of things imo.

@dushyantbehl
Copy link
Collaborator

Rather than doing this can we pass the save_dir directly with hf_converted_output_dir attached if intermediate else not if this is final from the call to this function?
This would reduce the code and have a clean function.

Yeah, the path-based approach sounds simpler, but I had thought that it could easily lead to inconsistent directory handling. I did consider it, but I think the flag-based version is better since it keeps all folder-structure logic in one place. That way, if we ever change stuff (like renaming hf_converted_checkpoint), it would be a one-line update.. also I think the callers just express intent (is_intermediate=True/False) instead of repeating path joins.. I see the logic within the checkpiont function might be condition based but it can help with simple to deal with rather than understanding why did the path join happen before the call.. simple a quick look at the checkpoint save function explains alot of things imo.

Can you explain what you mean by but I had thought that it could easily lead to inconsistent directory handling.
We are not defining any user facing API so its behavior is not likely to be affected..further since the complete on_save function is written with assumptions of how the dist checkpoint is saved this is likely not affecting anything and would just make code simpler.

@YashasviChaurasia
Copy link
Contributor Author

Can you explain what you mean by but I had thought that it could easily lead to inconsistent directory handling.

With inconsistent I had meant passing different paths outside the checkpoint function, the current version contains all the logic within the checkpoint function and simply abstracts the intermediate vs final checkpoint logic within the function

@YashasviChaurasia YashasviChaurasia force-pushed the isfinal_ckpt branch 3 times, most recently from 93bda6b to a346f92 Compare November 4, 2025 05:15
@YashasviChaurasia
Copy link
Contributor Author

image

Copy link
Collaborator

@dushyantbehl dushyantbehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dushyantbehl dushyantbehl merged commit 1e844d3 into foundation-model-stack:main Nov 5, 2025
7 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants