
Conversation

@YashasviChaurasia (Contributor) commented on Sep 27, 2025

Description of the change

fms-hf saves both intermediate and final checkpoints, based on settings such as save_strategy and save_model_dir, using HF APIs.
Trainings run on a multi-node setup with mixed precision (the default in granite dot build because of its better performance) save checkpoints in fp32. The resulting checkpoints are large, and intermediate checkpoints are even larger because they also include optimizer state.

This PR adds a script to convert model checkpoints from fp32 to bf16; the script can then be used as and when required to convert checkpoints.

This PR adds a script, checkpoint_utils.py, with utilities for managing model checkpoints, including an optional in-place mode:

  • Default: copy INPUT -> OUTPUT unchanged
  • --convert-model-to-bf16: cast model FP32 tensors to BF16 (optimizer tensors remain FP32)
  • --no-optimizer: drop optimizer artifacts when writing outputs (merged with --drop-files)
  • --drop-files: comma-separated extra file/dir names to drop (works with --no-optimizer and --inplace)
  • --inplace: perform conversion and/or dropping directly in INPUT (destructive)

Defaults
Optimizer-related files dropped by --no-optimizer:
optimizer.pt, optimizer, optimizer_0, optimizer_1
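
For reference, here is a minimal sketch of the flag surface and the optimizer-file defaults described above; the argument handling in the merged script may differ.

```python
# Sketch only: mirrors the flags and defaults documented above, not the merged implementation.
import argparse

# Optimizer artifacts dropped by --no-optimizer (see Defaults above)
DEFAULT_OPTIMIZER_FILES = ["optimizer.pt", "optimizer", "optimizer_0", "optimizer_1"]

def parse_args():
    p = argparse.ArgumentParser(description="Checkpoint utilities")
    p.add_argument("input", help="input checkpoint file or directory")
    p.add_argument("output", nargs="?", help="output path (unused with --inplace)")
    p.add_argument("--convert-model-to-bf16", action="store_true",
                   help="cast FP32 model tensors to BF16; optimizer tensors stay FP32")
    p.add_argument("--no-optimizer", action="store_true",
                   help="drop optimizer artifacts when writing outputs")
    p.add_argument("--drop-files", default="",
                   help="comma-separated extra file/dir names to drop")
    p.add_argument("--inplace", action="store_true",
                   help="convert and/or drop directly in INPUT (destructive)")
    return p.parse_args()

def files_to_drop(args):
    drop = {name for name in args.drop_files.split(",") if name}
    if args.no_optimizer:
        drop.update(DEFAULT_OPTIMIZER_FILES)
    return drop
```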

The script supports converting:

  • .pt / .pth files
  • single .safetensors files
  • Hugging Face checkpoint directories (with sharded .safetensors)
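
As an illustration, the three input types could be dispatched roughly as below; the helper names and exact handling are assumptions, not the merged code.

```python
# Illustrative dispatch over the three supported input types.
import os
import shutil
import torch
from safetensors.torch import load_file, save_file

def cast_fp32_to_bf16(tensors):
    # Cast only FP32 tensors; other dtypes and non-tensor values are left untouched.
    return {
        k: v.to(torch.bfloat16)
        if isinstance(v, torch.Tensor) and v.dtype == torch.float32
        else v
        for k, v in tensors.items()
    }

def convert_path(path, out_path):
    if os.path.isdir(path):
        # Hugging Face checkpoint directory: convert each .safetensors shard,
        # copy configs, tokenizers, and other files unchanged.
        os.makedirs(out_path, exist_ok=True)
        for name in os.listdir(path):
            src, dst = os.path.join(path, name), os.path.join(out_path, name)
            if name.endswith(".safetensors"):
                save_file(cast_fp32_to_bf16(load_file(src)), dst)
            elif os.path.isdir(src):
                shutil.copytree(src, dst)
            else:
                shutil.copy2(src, dst)
    elif path.endswith(".safetensors"):
        save_file(cast_fp32_to_bf16(load_file(path)), out_path)
    elif path.endswith((".pt", ".pth")):
        obj = torch.load(path, map_location="cpu")
        torch.save(cast_fp32_to_bf16(obj) if isinstance(obj, dict) else obj, out_path)
    else:
        raise ValueError(f"Unsupported checkpoint format: {path}")
```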

Features

  • Skips optimizer state tensors during BF16 conversion (keeps them FP32 for stability)
  • Preserves non-tensor objects (e.g., JSON configs) and non-FP32 tensors
  • Copies configs, tokenizers, and other non-model files unchanged
  • Can drop optimizer, scheduler, rng_state, and other artifacts with --drop-files
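
A rough sketch of the per-tensor rule for .pt/.pth payloads follows; the key names treated as optimizer state are assumptions.

```python
# Sketch of the conversion rule: recurse through nested containers, cast FP32 tensors
# to BF16, keep optimizer-state tensors in FP32, and pass non-tensor objects through.
import torch

OPTIMIZER_KEYS = {"optimizer", "optimizer_state_dict", "optim"}  # illustrative key names

def convert_obj(obj, in_optimizer=False):
    if isinstance(obj, torch.Tensor):
        if in_optimizer or obj.dtype != torch.float32:
            return obj                    # optimizer tensors and non-FP32 tensors stay as-is
        return obj.to(torch.bfloat16)     # model FP32 tensors become BF16
    if isinstance(obj, dict):
        return {k: convert_obj(v, in_optimizer or str(k) in OPTIMIZER_KEYS)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(convert_obj(v, in_optimizer) for v in obj)
    return obj                            # JSON-like configs, scalars, strings unchanged
```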

Usage:

This script provides several operations on checkpoints: copying, converting FP32 tensors to BF16, dropping optimizer states, and modifying checkpoints in place.

## Copying checkpoints

# Copies the entire directory as is.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_copy  


## Converting model tensors to BF16

# Converts all safetensor shards in the directory to BF16.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_bf16 --convert-model-to-bf16  


## Dropping optimizer files

# Copies a checkpoint directory but removes optimizer files.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_noopt --no-optimizer  

# Removes optimizer files plus additional specified files.  
python checkpoint_utils.py ./checkpoint_dir ./checkpoint_noopt --no-optimizer --drop-files scheduler.pt,rng_state_0.pth  


## Inplace modification

# Deletes the specified files directly from the checkpoint directory.  
python checkpoint_utils.py ./checkpoint_dir_ignored --inplace --drop-files scheduler.pt,trainer_state.json  

# Converts all model safetensor shards in the directory to BF16 in place.  
python checkpoint_utils.py ./checkpoint_dir_ignored --inplace --convert-model-to-bf16  

# Converts model tensors in place and drops specified files in the same directory.  
python checkpoint_utils.py ./checkpoint_dir_ignored --inplace --convert-model-to-bf16 --drop-files scheduler.pt  

Related issue number

How to verify the PR

[screenshot] Model checkpoint fine-tuned with mixed precision.
[screenshot] Model checkpoint after conversion using the script.
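
One way to spot-check a converted directory (not part of the PR; the shard filename below is illustrative):

```python
# Verify that floating-point tensors in a converted shard are BF16.
import torch
from safetensors.torch import load_file

tensors = load_file("./checkpoint_bf16/model-00001-of-00002.safetensors")  # illustrative shard name
assert all(t.dtype == torch.bfloat16 for t in tensors.values() if t.is_floating_point())
print("All floating-point tensors are bf16.")
```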

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions commented:

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@ashokponkumar (Collaborator) commented:

Can we make it generic, say checkpoint_utils.py or something, and add features like removing optimizer files, etc.?

@ashokponkumar merged commit d9ee35f into foundation-model-stack:main on Sep 30, 2025
11 checks passed