
Loss scale mismatch when enabling sequence parallelism #72

@DSTTSD

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

When sequence parallelism is enabled, the reported training loss becomes significantly larger than in the non-SP setup (e.g., ~5 vs. ~1.x). This is likely because the loss is not normalized over the correct global number of tokens across SP ranks. I am using transformers==4.51.3.
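
If the cause is indeed per-shard normalization, a minimal sketch of the expected behavior would look like the following (hypothetical code, not the repository's actual implementation; `sp_normalized_loss` and `sp_group` are names I made up): sum the token losses and token counts on each rank's shard, all-reduce both across the sequence-parallel group, and divide once.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sp_normalized_loss(logits, labels, sp_group, ignore_index=-100):
    """Hypothetical sketch: normalize the loss over the global token count
    of the sequence-parallel group rather than each rank's local shard."""
    # Sum (not mean) of per-token losses on this rank's sequence shard.
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="sum",
    )
    # Number of supervised (non-ignored) tokens on this rank's shard.
    num_tokens = (labels != ignore_index).sum().to(loss_sum.dtype)

    # Accumulate both quantities over all SP ranks, then divide once,
    # so the reported loss matches what a single-GPU run would report.
    packed = torch.stack([loss_sum, num_tokens])
    dist.all_reduce(packed, op=dist.ReduceOp.SUM, group=sp_group)
    return packed[0] / torch.clamp(packed[1], min=1.0)
```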

Reproduction

```yaml
### model
model_name_or_path: 

### method
stage: sft
do_train: true
finetuning_type: full
# lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: ccaicorpus_hard
template: qwen
cutoff_len: 4000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 20


### output
output_dir: 
logging_steps: 100
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
# resume_from_checkpoint: true
# enable_thinking: false
flash_attn: fa2
# neat_packing: true
sequence_parallel_size: 4

report_to: none
```

Expected behavior

Without SP:

[screenshot: training loss around ~1.x]

With SP=4:

[screenshot: training loss around ~5]
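
The gap between the two runs is roughly the SP degree, which is what you would expect if each rank's mean over its local shard were summed instead of re-averaged globally. A purely illustrative calculation of that hypothesis (not taken from the actual code path):

```python
# Pretend every supervised token has loss 1.2 and the sequence is split
# evenly across 4 SP ranks.
per_token_losses = [1.2] * 1000
sp_size = 4
shard = len(per_token_losses) // sp_size

# Correct: one mean over the global token count.
correct = sum(per_token_losses) / len(per_token_losses)          # 1.2

# Suspected mis-reduction: each rank averages over its own shard,
# and the per-rank values are then summed.
per_rank_means = [
    sum(per_token_losses[r * shard:(r + 1) * shard]) / shard
    for r in range(sp_size)
]
wrong = sum(per_rank_means)                                       # 4.8 ≈ sp_size * 1.2

print(f"correct global mean: {correct:.2f}, mis-reduced value: {wrong:.2f}")
```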

Others

No response
