Describe the bug
Model merging and checkpoint saving fail with the latest dev version; the regression is likely caused by this change: f143ad3#r173948970
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]: Traceback (most recent call last):
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/ms-swift/swift/cli/_megatron/sft.py", line 7, in <module>
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     megatron_sft_main()
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/ms-swift/swift/megatron/train/sft.py", line 87, in megatron_sft_main
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     return MegatronSft(args).main()
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:            ^^^^^^^^^^^^^^^^^^^^^^^^
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/ms-swift/swift/llm/base.py", line 49, in main
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     result = self.run()
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:              ^^^^^^^^^^
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/ms-swift/swift/megatron/train/sft.py", line 77, in run
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     self.trainer.train(train_dataset, val_dataset, data_collator)
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/ms-swift/swift/megatron/trainers/base.py", line 1098, in train
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     pretrain(
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/Megatron-LM/megatron/training/training.py", line 726, in pretrain
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     save_checkpoint(
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/ms-swift/swift/megatron/trainers/base.py", line 1038, in save_checkpoint
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     self._origin_save_checkpoint(iteration, model, *_args, **kwargs)
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/root/sky_workdir/Megatron-LM/megatron/training/checkpointing.py", line 538, in save_checkpoint
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     save_sharded_modelopt_state(model, checkpoint_name, (args.ckpt_format, 1))
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:   File "/opt/TensorRT-Model-Optimizer/modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 125, in save_sharded_modelopt_state
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:     if not mto.ModeloptStateManager.is_converted(model[0]):
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]:                                                   ~~~~~^^^
(worker4, rank=4, pid=2864, ip=10.0.4.250) [rank36]: IndexError: list index out of range
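For context, the error is raised by indexing model[0], before ModeloptStateManager ever inspects the model. A minimal sketch of that failure mode (my assumption, based only on the traceback above, is that save_checkpoint receives an empty model list at this point):

from modelopt.torch.opt import ModeloptStateManager

model = []  # assumption: the model list passed into save_checkpoint is empty here
# model[0] raises IndexError before is_converted is even evaluated
ModeloptStateManager.is_converted(model[0])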
My script:
megatron sft \
--model ${PRETRAINED_MODEL_PATH} \
--model_type deepseek_v3_1 \
--load_safetensors true \
--save_safetensors true \
--merge_lora true \
--fp8_recipe blockwise \
--fp8_format e4m3 \
--fp8_param_gather false \
--exp_avg_dtype bf16 \
--exp_avg_sq_dtype bf16 \
--attention_backend flash \
--dataset ${HAI_DATASET} \
--load_from_cache_file true \
--train_type lora \
--lora_rank 32 \
--lora_alpha 64 \
--target_modules all-linear \
--split_dataset_ratio 0.05 \
--use_chat_template false \
--tensor_model_parallel_size 4 \
--expert_tensor_parallel_size 1 \
--expert_model_parallel_size 8 \
--context_parallel_size 1 \
--pipeline_model_parallel_size 8 \
--decoder_last_pipeline_num_layers 12 \
--tp_comm_overlap false \
--moe_permute_fusion true \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_enable_deepep false \
--moe_aux_loss_coeff 1e-3 \
--micro_batch_size 1 \
--global_batch_size 16 \
--packing true \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--optimizer_cpu_offload true \
--use_precision_aware_optimizer true \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 5e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 5e-6 \
--save ${FINAL_OUTPUT_DIR} \
--eval_interval 20 \
--save_interval 1000 \
--max_length 14000 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--report_to wandb
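A possible local workaround, untested and only a sketch: the traceback shows that TensorRT-Model-Optimizer is installed and that Megatron's save_checkpoint calls save_sharded_modelopt_state directly from megatron/training/checkpointing.py, so wrapping that imported symbol to no-op on an empty model list should avoid the crash until a proper fix lands. All names below come from the traceback; the guard itself is my addition.

import megatron.training.checkpointing as mcp  # save_sharded_modelopt_state is bound in this module when modelopt is installed

_orig_save_sharded_modelopt_state = mcp.save_sharded_modelopt_state

def _guarded_save_sharded_modelopt_state(model, checkpoint_name, *args, **kwargs):
    # Skip the modelopt state dump when there is no model to inspect.
    if not model:
        return
    return _orig_save_sharded_modelopt_state(model, checkpoint_name, *args, **kwargs)

mcp.save_sharded_modelopt_state = _guarded_save_sharded_modelopt_state

The patch would have to run in the training process before save_checkpoint is first called, e.g. early in a custom launcher script.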
Your hardware and system info
Additional context