
Loss scale mismatch when enabling sequence parallelism #72

@DSTTSD

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

When sequence parallelism is enabled, the reported training loss becomes significantly larger than in the non-SP setup (e.g., ~5 vs. ~1.x). This is likely because the loss is not normalized over the correct global number of tokens across SP ranks. I am using transformers==4.51.3.
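
If the cause is indeed per-shard normalization, a minimal sketch of the expected behavior would look like the following (hypothetical code, not the repository's actual implementation; `sp_normalized_loss` and `sp_group` are names I made up): sum the token losses and token counts on each rank's shard, all-reduce both across the sequence-parallel group, and divide once.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sp_normalized_loss(logits, labels, sp_group, ignore_index=-100):
    """Hypothetical sketch: normalize the loss over the global token count
    of the sequence-parallel group rather than each rank's local shard."""
    # Sum (not mean) of per-token losses on this rank's sequence shard.
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="sum",
    )
    # Number of supervised (non-ignored) tokens on this rank's shard.
    num_tokens = (labels != ignore_index).sum().to(loss_sum.dtype)

    # Accumulate both quantities over all SP ranks, then divide once,
    # so the reported loss matches what a single-GPU run would report.
    packed = torch.stack([loss_sum, num_tokens])
    dist.all_reduce(packed, op=dist.ReduceOp.SUM, group=sp_group)
    return packed[0] / torch.clamp(packed[1], min=1.0)
```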

Reproduction

```yaml
### model
model_name_or_path: 

### method
stage: sft
do_train: true
finetuning_type: full
# lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: ccaicorpus_hard
template: qwen
cutoff_len: 4000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 20


### output
output_dir: 
logging_steps: 100
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
# resume_from_checkpoint: true
# enable_thinking: false
flash_attn: fa2
# neat_packing: true
sequence_parallel_size: 4

report_to: none
```

Expected behavior

Without SP:

[screenshot: training loss around ~1.x]

With SP=4:

[screenshot: training loss around ~5]
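
The gap between the two runs is roughly the SP degree, which is what you would expect if each rank's mean over its local shard were summed instead of re-averaged globally. A purely illustrative calculation of that hypothesis (not taken from the actual code path):

```python
# Pretend every supervised token has loss 1.2 and the sequence is split
# evenly across 4 SP ranks.
per_token_losses = [1.2] * 1000
sp_size = 4
shard = len(per_token_losses) // sp_size

# Correct: one mean over the global token count.
correct = sum(per_token_losses) / len(per_token_losses)          # 1.2

# Suspected mis-reduction: each rank averages over its own shard,
# and the per-rank values are then summed.
per_rank_means = [
    sum(per_token_losses[r * shard:(r + 1) * shard]) / shard
    for r in range(sp_size)
]
wrong = sum(per_rank_means)                                       # 4.8 ≈ sp_size * 1.2

print(f"correct global mean: {correct:.2f}, mis-reduced value: {wrong:.2f}")
```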

Others

No response
