
Conversation

@willmj (Collaborator) commented Mar 14, 2025

Before enabling LoRA tuning for ScatterMoE, the checkpoint utils need to work with a single shard, since an adapter checkpoint will never have a safetensors index file. Additionally, to cover the case of a single-shard base model, I created a load_weight_map function that handles both the index file and a plain model.safetensors.
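
For reference, a minimal sketch of the fallback behaviour described above (illustrative only, not the exact code in checkpoint_utils.py; the filenames and safe_open usage follow the standard safetensors conventions):

    # Illustrative sketch: build a weight-name -> shard-filename map whether or
    # not a safetensors index file exists in the checkpoint directory.
    import json
    import os

    from safetensors import safe_open

    SAFETENSORS_INDEX = "model.safetensors.index.json"
    SAFETENSORS_SINGLE = "model.safetensors"


    def load_weight_map(checkpoint_dir: str) -> dict:
        """Return a mapping of parameter name -> shard filename.

        Sharded checkpoints ship an index JSON containing this map; single-shard
        checkpoints (and adapter checkpoints) only have one safetensors file,
        so every key maps to that single file.
        """
        index_path = os.path.join(checkpoint_dir, SAFETENSORS_INDEX)
        if os.path.isfile(index_path):
            with open(index_path, encoding="utf-8") as f:
                return json.load(f)["weight_map"]

        # Single-shard case: enumerate keys directly from the lone file.
        single_path = os.path.join(checkpoint_dir, SAFETENSORS_SINGLE)
        with safe_open(single_path, framework="pt") as f:
            return {key: SAFETENSORS_SINGLE for key in f.keys()}

The returned map can then be consumed the same way for sharded and single-shard checkpoints.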

Testing:
Tested on tiny-granite-moe with full FT:

{
    "model_name_or_path": "katuni4ka/tiny-random-granite-moe",
    "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
    "output_dir": "/testing/tuning/output/tiny-granite-moe/ft/20250314_1350-tone-FAST",
    "save_model_dir": "/testing/tuning/output/tiny-granite-moe/ft/20250314_1350-tone-FAST/save_model",
    "num_train_epochs": 1.0,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "response_template": "\n### Response:",
    "dataset_text_field": "output",
    "fast_moe": 1
}

Training logs:

$ python accelerate_launch.py
WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/app/fms-acceleration/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py:384: SyntaxWarning: invalid escape sequence '\.'
  _reg = re.compile(f"(.*)\.({_name})\.weight")
/app/fms-acceleration/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py:384: SyntaxWarning: invalid escape sequence '\.'
  _reg = re.compile(f"(.*)\.({_name})\.weight")
/app/fms-hf-tuning/tuning/config/acceleration_configs/acceleration_framework_config.py:297: UserWarning: An experimental acceleration feature is requested by specifying the '--fast_moe' argument. Please note this feature may not support certain edge cases at this juncture. When the feature matures this message will be turned off.
  warnings.warn(
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
WARNING:sft_trainer.py:PAD token set to default, to make it different from eos token
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
You are using a model of type granitemoe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Converting ScatterMoE layers: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:01<00:00,  3.86it/s]
/home/tuning/.local/lib/python3.12/site-packages/transformers/training_args.py:2077: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
/app/fms-hf-tuning/tuning/sft_trainer.py:377: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
/home/tuning/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
  warnings.warn(
  0%|                                                                                                                                               | 0/500 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 499/500 [00:20<00:00, 27.34it/s]You are using a model of type granitemoe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type granitemoe to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'loss': 10.8029, 'grad_norm': 0.765625, 'learning_rate': 0.0, 'epoch': 1.0}                                                                                                
{'train_runtime': 26.6897, 'train_samples_per_second': 37.468, 'train_steps_per_second': 18.734, 'train_loss': 10.8028984375, 'epoch': 1.0}                                 
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:26<00:00, 18.71it/s]
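
As an aside, the SyntaxWarning near the top of the log comes from backslashes inside a plain f-string; one plausible fix (my assumption, not part of this PR) is to switch to a raw f-string so the dots stay regex escapes rather than invalid Python string escapes:

    import re

    # Hypothetical illustration: `_name` stands in for the module name used in
    # checkpoint_utils.py. The rf"..." form keeps "\." as a regex escape and
    # avoids the "invalid escape sequence" SyntaxWarning.
    _name = "w1"
    _reg = re.compile(rf"(.*)\.({_name})\.weight")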

@willmj willmj requested a review from fabianlim as a code owner March 14, 2025 18:25
Signed-off-by: Will Johnson <[email protected]>
@willmj willmj marked this pull request as draft March 14, 2025 18:31
Signed-off-by: Will Johnson <[email protected]>
@willmj willmj force-pushed the lora-checkpoint-utils branch from cb7d26e to 74d9fcc on March 14, 2025 18:39
@willmj willmj marked this pull request as ready for review March 17, 2025 14:00
@fabianlim (Contributor) left a comment:

LGTM

@fabianlim fabianlim merged commit ee7d713 into foundation-model-stack:main Mar 21, 2025
7 checks passed
