
Conversation

@willmj (Collaborator) commented Mar 14, 2025

Before enabling LoRA tuning for ScatterMoE, the checkpoint utils need to work with a single shard, since an adapter checkpoint will never have a safetensors index file. Additionally, to cover the case of a single-shard base model, I created a load_weight_map function that handles both the index file and a plain model.safetensors.
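
For reference, a minimal sketch of the fallback behaviour described above (illustrative only, not the exact code in checkpoint_utils.py; the filenames and safe_open usage follow the standard safetensors conventions):

    # Illustrative sketch: build a weight-name -> shard-filename map whether or
    # not a safetensors index file exists in the checkpoint directory.
    import json
    import os

    from safetensors import safe_open

    SAFETENSORS_INDEX = "model.safetensors.index.json"
    SAFETENSORS_SINGLE = "model.safetensors"


    def load_weight_map(checkpoint_dir: str) -> dict:
        """Return a mapping of parameter name -> shard filename.

        Sharded checkpoints ship an index JSON containing this map; single-shard
        checkpoints (and adapter checkpoints) only have one safetensors file,
        so every key maps to that single file.
        """
        index_path = os.path.join(checkpoint_dir, SAFETENSORS_INDEX)
        if os.path.isfile(index_path):
            with open(index_path, encoding="utf-8") as f:
                return json.load(f)["weight_map"]

        # Single-shard case: enumerate keys directly from the lone file.
        single_path = os.path.join(checkpoint_dir, SAFETENSORS_SINGLE)
        with safe_open(single_path, framework="pt") as f:
            return {key: SAFETENSORS_SINGLE for key in f.keys()}

The returned map can then be consumed the same way for sharded and single-shard checkpoints.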

Testing:
Tested on tiny-granite-moe with full FT:

{
    "model_name_or_path": "katuni4ka/tiny-random-granite-moe",
    "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
    "output_dir": "/testing/tuning/output/tiny-granite-moe/ft/20250314_1350-tone-FAST",
    "save_model_dir": "/testing/tuning/output/tiny-granite-moe/ft/20250314_1350-tone-FAST/save_model",
    "num_train_epochs": 1.0,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "response_template": "\n### Response:",
    "dataset_text_field": "output",
    "fast_moe": 1
}

Training logs:

$ python accelerate_launch.py
WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/app/fms-acceleration/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py:384: SyntaxWarning: invalid escape sequence '\.'
  _reg = re.compile(f"(.*)\.({_name})\.weight")
/app/fms-acceleration/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py:384: SyntaxWarning: invalid escape sequence '\.'
  _reg = re.compile(f"(.*)\.({_name})\.weight")
/app/fms-hf-tuning/tuning/config/acceleration_configs/acceleration_framework_config.py:297: UserWarning: An experimental acceleration feature is requested by specifying the '--fast_moe' argument. Please note this feature may not support certain edge cases at this juncture. When the feature matures this message will be turned off.
  warnings.warn(
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
WARNING:sft_trainer.py:PAD token set to default, to make it different from eos token
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
You are using a model of type granitemoe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Converting ScatterMoE layers: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:01<00:00,  3.86it/s]
/home/tuning/.local/lib/python3.12/site-packages/transformers/training_args.py:2077: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
/app/fms-hf-tuning/tuning/sft_trainer.py:377: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
/home/tuning/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
  warnings.warn(
  0%|                                                                                                                                               | 0/500 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 499/500 [00:20<00:00, 27.34it/s]You are using a model of type granitemoe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type granitemoe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'loss': 10.8029, 'grad_norm': 0.765625, 'learning_rate': 0.0, 'epoch': 1.0}                                                                                                
{'train_runtime': 26.6897, 'train_samples_per_second': 37.468, 'train_steps_per_second': 18.734, 'train_loss': 10.8028984375, 'epoch': 1.0}                                 
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:26<00:00, 18.71it/s]
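
As an aside, the SyntaxWarning near the top of the log comes from backslashes inside a plain f-string; one plausible fix (my assumption, not part of this PR) is to switch to a raw f-string so the dots stay regex escapes rather than invalid Python string escapes:

    import re

    # Hypothetical illustration: `_name` stands in for the module name used in
    # checkpoint_utils.py. The rf"..." form keeps "\." as a regex escape and
    # avoids the "invalid escape sequence" SyntaxWarning.
    _name = "w1"
    _reg = re.compile(rf"(.*)\.({_name})\.weight")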

@willmj willmj requested a review from fabianlim as a code owner March 14, 2025 18:25
Signed-off-by: Will Johnson <[email protected]>
@willmj willmj marked this pull request as draft March 14, 2025 18:31
Signed-off-by: Will Johnson <[email protected]>
@willmj willmj force-pushed the lora-checkpoint-utils branch from cb7d26e to 74d9fcc on March 14, 2025 18:39
@willmj willmj marked this pull request as ready for review March 17, 2025 14:00
@fabianlim (Contributor) left a comment:

LGTM

@fabianlim fabianlim merged commit ee7d713 into foundation-model-stack:main Mar 21, 2025
7 checks passed
