Reminder
- I have read the README and searched the existing issues.
System Info
When sequence parallelism is enabled, the reported training loss becomes significantly larger than in the non-SP setup (e.g., ~5 vs. ~1.x). This is likely because the loss is not normalized over the correct global number of tokens across SP ranks. I am using transformers==4.51.3.
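For reference, here is a minimal sketch (not LLaMA-Factory's actual implementation) of how the logged SFT loss could be normalized over the global number of supervised tokens across SP ranks. The function name sp_reported_loss, the sp_group process-group handle, and the IGNORE_INDEX value of -100 are assumptions for illustration only:

import torch
import torch.distributed as dist

IGNORE_INDEX = -100  # assumed label value masked out of the loss

@torch.no_grad()
def sp_reported_loss(logits, labels, sp_group):
    # Sum of per-token losses over this rank's sequence shard.
    loss_sum = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=IGNORE_INDEX,
        reduction="sum",
    )
    # Number of non-ignored (supervised) tokens on this rank.
    num_tokens = (labels != IGNORE_INDEX).sum().to(loss_sum.dtype)
    # Reduce both across the SP group so every rank sees the global totals.
    dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM, group=sp_group)
    dist.all_reduce(num_tokens, op=dist.ReduceOp.SUM, group=sp_group)
    # Per-token average over the *global* token count; dividing the local sum by
    # only the local count (or averaging per-rank means) inflates the reported loss.
    return loss_sum / torch.clamp(num_tokens, min=1.0)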
Reproduction
### model
model_name_or_path:
### method
stage: sft
do_train: true
finetuning_type: full
# lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: ccaicorpus_hard
template: qwen
cutoff_len: 4000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 20
### output
output_dir:
logging_steps: 100
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
# resume_from_checkpoint: true
# enable_thinking: false
flash_attn: fa2
# neat_packing: true
sequence_parallel_size: 4
report_to: none
Expected behavior
Without SP:
Others
No response