feat: add ckpt conversion script fp32-bf16 #614
Merged
Description of the change
fms-hf saves both intermediate and final checkpoints through the HF APIs, according to settings such as save_strategy and save_model_dir.
Checkpoints from multi-node training runs with mixed precision (the default in granite dot build because it performs better) are saved in fp32. This makes the saved checkpoints large, and intermediate checkpoints are larger still because they also carry optimizer state.
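To make the size concern concrete: an fp32 value takes 4 bytes and a bf16 value takes 2, so casting model weights roughly halves their storage at the cost of mantissa precision. The stdlib-only sketch below illustrates the cast on a single value, assuming round-to-nearest-even. The function names are hypothetical and this is not the code of the script in this PR, which presumably relies on the tensor library's own cast.

```python
import struct


def fp32_to_bf16_bits(value: float) -> int:
    """Return the 16-bit bfloat16 pattern for a value, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: set a mantissa bit so truncation cannot turn it into infinity.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add half of the discarded range, plus the lowest kept bit to break ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def checkpoint_gib(num_params: int, bytes_per_param: int) -> float:
    """Rough size of the model weights alone, in GiB (ignores optimizer state)."""
    return num_params * bytes_per_param / 2**30
```

For example, `bf16_bits_to_float(fp32_to_bf16_bits(3.14159))` gives `3.140625`: the exponent range of fp32 is kept, but only 7 explicit mantissa bits survive.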
This PR adds a script, `checkpoint_utils.py`, to manage checkpoints, including converting model checkpoints from fp32 to bf16. The script can be run as and when required.
Utilities for managing model checkpoints with an optional in-place mode.
- `INPUT -> OUTPUT` unchanged
- `--convert-model-to-bf16`: cast model FP32 tensors to BF16 (optimizer tensors remain FP32)
- `--no-optimizer`: drop optimizer artifacts when writing outputs (merged with `--drop-files`)
- `--drop-files`: comma-separated extra file/dir names to drop (works with `--no-optimizer` and `--inplace`)
- `--inplace`: perform conversion and/or dropping directly in INPUT (destructive)

Defaults
- Optimizer-related files dropped by `--no-optimizer`: `optimizer.pt`, `optimizer`, `optimizer_0`, `optimizer_1`

The script supports converting:
- `.pt` / `.pth` files
- `.safetensors` files
- `.safetensors`)

Features
- `--drop-files`

Usage:
This script provides several operations on checkpoints: copying, converting FP32 tensors to BF16, dropping optimizer states, and modifying checkpoints in place.
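The copy-and-drop behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the implementation in `checkpoint_utils.py`: the helper name `copy_checkpoint` and its parameters are hypothetical, and only the default optimizer file names are taken from this description.

```python
import shutil

# Default optimizer artifacts listed in this PR description.
OPTIMIZER_FILES = {"optimizer.pt", "optimizer", "optimizer_0", "optimizer_1"}


def copy_checkpoint(src, dst, no_optimizer=False, drop_files=""):
    """Copy checkpoint directory src to dst, skipping dropped names.

    Hypothetical helper: mirrors the described --no-optimizer and
    --drop-files semantics, where both sets of names are merged.
    """
    # Parse the comma-separated list, ignoring empty entries and whitespace.
    dropped = {name.strip() for name in drop_files.split(",") if name.strip()}
    if no_optimizer:
        dropped |= OPTIMIZER_FILES

    def ignore(_directory, names):
        # Called by copytree for each directory; returns the names to skip.
        return [name for name in names if name in dropped]

    # dst must not exist yet; copytree creates it.
    shutil.copytree(src, dst, ignore=ignore)
    return dropped
```

Writing to a separate output directory keeps the source checkpoint intact, which is the safer default given that the in-place mode is described as destructive.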
Related issue number
How to verify the PR
Was the PR tested