
Conversation

@YashasviChaurasia (Contributor) commented on Sep 27, 2025

Description of the change

fms-hf saves both intermediate and final checkpoints, based on settings such as save_strategy and save_model_dir, using HF APIs.
Trainings run on a multi-node setup with mixed precision (the default in granite dot build because of its better performance) save checkpoints in fp32. The resulting checkpoints are large, and intermediate checkpoints are even larger because they also include optimizer state.

This PR adds a script to convert model checkpoints from fp32 to bf16; the script can then be used as and when required to convert checkpoints.

This PR adds a script, checkpoint_utils.py, with utilities for managing model checkpoints, including an optional in-place mode:

  • Default: copy INPUT -> OUTPUT unchanged
  • --convert-model-to-bf16: cast model FP32 tensors to BF16 (optimizer tensors remain FP32)
  • --no-optimizer: drop optimizer artifacts when writing outputs (merged with --drop-files)
  • --drop-files: comma-separated extra file/dir names to drop (works with --no-optimizer and --inplace)
  • --inplace: perform conversion and/or dropping directly in INPUT (destructive)

Defaults
Optimizer-related files dropped by --no-optimizer:
optimizer.pt, optimizer, optimizer_0, optimizer_1
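
For reference, here is a minimal sketch of the flag surface and the optimizer-file defaults described above; the argument handling in the merged script may differ.

```python
# Sketch only: mirrors the flags and defaults documented above, not the merged implementation.
import argparse

# Optimizer artifacts dropped by --no-optimizer (see Defaults above)
DEFAULT_OPTIMIZER_FILES = ["optimizer.pt", "optimizer", "optimizer_0", "optimizer_1"]

def parse_args():
    p = argparse.ArgumentParser(description="Checkpoint utilities")
    p.add_argument("input", help="input checkpoint file or directory")
    p.add_argument("output", nargs="?", help="output path (unused with --inplace)")
    p.add_argument("--convert-model-to-bf16", action="store_true",
                   help="cast FP32 model tensors to BF16; optimizer tensors stay FP32")
    p.add_argument("--no-optimizer", action="store_true",
                   help="drop optimizer artifacts when writing outputs")
    p.add_argument("--drop-files", default="",
                   help="comma-separated extra file/dir names to drop")
    p.add_argument("--inplace", action="store_true",
                   help="convert and/or drop directly in INPUT (destructive)")
    return p.parse_args()

def files_to_drop(args):
    drop = {name for name in args.drop_files.split(",") if name}
    if args.no_optimizer:
        drop.update(DEFAULT_OPTIMIZER_FILES)
    return drop
```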

The script supports converting:

  • .pt / .pth files
  • single .safetensors files
  • Hugging Face checkpoint directories (with sharded .safetensors)
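
As an illustration, the three input types could be dispatched roughly as below; the helper names and exact handling are assumptions, not the merged code.

```python
# Illustrative dispatch over the three supported input types.
import os
import shutil
import torch
from safetensors.torch import load_file, save_file

def cast_fp32_to_bf16(tensors):
    # Cast only FP32 tensors; other dtypes and non-tensor values are left untouched.
    return {
        k: v.to(torch.bfloat16)
        if isinstance(v, torch.Tensor) and v.dtype == torch.float32
        else v
        for k, v in tensors.items()
    }

def convert_path(path, out_path):
    if os.path.isdir(path):
        # Hugging Face checkpoint directory: convert each .safetensors shard,
        # copy configs, tokenizers, and other files unchanged.
        os.makedirs(out_path, exist_ok=True)
        for name in os.listdir(path):
            src, dst = os.path.join(path, name), os.path.join(out_path, name)
            if name.endswith(".safetensors"):
                save_file(cast_fp32_to_bf16(load_file(src)), dst)
            elif os.path.isdir(src):
                shutil.copytree(src, dst)
            else:
                shutil.copy2(src, dst)
    elif path.endswith(".safetensors"):
        save_file(cast_fp32_to_bf16(load_file(path)), out_path)
    elif path.endswith((".pt", ".pth")):
        obj = torch.load(path, map_location="cpu")
        torch.save(cast_fp32_to_bf16(obj) if isinstance(obj, dict) else obj, out_path)
    else:
        raise ValueError(f"Unsupported checkpoint format: {path}")
```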

Features

  • Skips optimizer state tensors during BF16 conversion (keeps them FP32 for stability)
  • Preserves non-tensor objects (e.g., JSON configs) and non-FP32 tensors
  • Copies configs, tokenizers, and other non-model files unchanged
  • Can drop optimizer, scheduler, rng_state, and other artifacts with --drop-files
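
A rough sketch of the per-tensor rule for .pt/.pth payloads follows; the key names treated as optimizer state are assumptions.

```python
# Sketch of the conversion rule: recurse through nested containers, cast FP32 tensors
# to BF16, keep optimizer-state tensors in FP32, and pass non-tensor objects through.
import torch

OPTIMIZER_KEYS = {"optimizer", "optimizer_state_dict", "optim"}  # illustrative key names

def convert_obj(obj, in_optimizer=False):
    if isinstance(obj, torch.Tensor):
        if in_optimizer or obj.dtype != torch.float32:
            return obj                    # optimizer tensors and non-FP32 tensors stay as-is
        return obj.to(torch.bfloat16)     # model FP32 tensors become BF16
    if isinstance(obj, dict):
        return {k: convert_obj(v, in_optimizer or str(k) in OPTIMIZER_KEYS)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(convert_obj(v, in_optimizer) for v in obj)
    return obj                            # JSON-like configs, scalars, strings unchanged
```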

Usage:

This script provides several operations on checkpoints: copying, converting FP32 tensors to BF16, dropping optimizer states, and modifying checkpoints in place.

## Copying checkpoints

# Copies the entire directory as is.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_copy  


## Converting model tensors to BF16

# Converts all safetensor shards in the directory to BF16.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_bf16 --convert-model-to-bf16  


## Dropping optimizer files

# Copies a checkpoint directory but removes optimizer files.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_noopt --no-optimizer  

# Removes optimizer files plus additional specified files.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_noopt --no-optimizer --drop-files scheduler.pt,rng_state_0.pth  


## Inplace modification

# Deletes the specified files directly from the checkpoint directory.  
python checkpoint_utils.py ./checkpoint_dir_ignored --inplace --drop-files scheduler.pt,trainer_state.json  

# Converts all model safetensor shards in the directory to BF16 in place.  
python checkpoint_utils.py ./checkpoint_dir_ignored --inplace --convert-model-to-bf16  

# Converts model tensors in place and drops specified files in the same directory.  
python checkpoint_utils.py ./checkpoint_dir_ignored --inplace --convert-model-to-bf16 --drop-files scheduler.pt  

Related issue number

How to verify the PR

[screenshot] Model checkpoint fine-tuned with mixed precision.
[screenshot] Model checkpoint after conversion using the script.
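
One way to spot-check a converted directory (not part of the PR; the shard filename below is illustrative):

```python
# Verify that floating-point tensors in a converted shard are BF16.
import torch
from safetensors.torch import load_file

tensors = load_file("./checkpoint_bf16/model-00001-of-00002.safetensors")  # illustrative shard name
assert all(t.dtype == torch.bfloat16 for t in tensors.values() if t.is_floating_point())
print("All floating-point tensors are bf16.")
```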

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions commented:

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@ashokponkumar (Collaborator) commented:

Can we make it generic, say checkpoint_utils.py or something, and add features like removing optimizer files, etc.?

@ashokponkumar merged commit d9ee35f into foundation-model-stack:main on Sep 30, 2025
11 checks passed