
Inconsistent loss when resuming training with a vocab size that is not divisible by world size #1136

@weixuansun


Bug description

When I use a tokenizer whose vocabulary size is not divisible by the parallel (or world) size, the training loss becomes inconsistent after resuming from a checkpoint.
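
For illustration (the helper below is my own sketch, not code from this repo), splitting a vocabulary dimension that is not divisible by the world size necessarily gives some ranks one more embedding row than others:

```python
def shard_sizes(vocab_size: int, world_size: int) -> list[int]:
    """Per-rank row counts when the vocab dimension is split as evenly as possible."""
    base, extra = divmod(vocab_size, world_size)
    # The first `extra` ranks each own one additional row.
    return [base + 1 if rank < extra else base for rank in range(world_size)]

# Hypothetical numbers: a 100255-token vocab on 8 ranks.
print(shard_sizes(100255, 8))  # [12532, 12532, 12532, 12532, 12532, 12532, 12532, 12531]
```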

Versions

Can be reproduced with torch 2.6.

Reproduce:

1. Use any tokenizer whose vocabulary size is not divisible by the parallel (or world) size.
2. Train from step 0 to step 20 (screenshot of the loss attached).
3. Load the step 10 checkpoint and resume training (screenshot of the loss attached).

As the screenshots show, step 11 and the following steps have inconsistent loss compared to the original run.
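
A workaround that I expect sidesteps the issue (the helper below is hypothetical, not an API from this repo) is to pad the vocabulary up to a multiple of the world size before building the embedding, so that every rank's shard has the same shape across save and resume:

```python
import torch.nn as nn

def pad_vocab_to_multiple(vocab_size: int, world_size: int) -> int:
    """Round vocab_size up to the next multiple of world_size."""
    remainder = vocab_size % world_size
    return vocab_size if remainder == 0 else vocab_size + (world_size - remainder)

# Hypothetical numbers: 100257 tokens on 8 ranks are padded to 100264,
# so every rank owns exactly 12533 embedding rows.
padded_vocab = pad_vocab_to_multiple(100257, 8)
embedding = nn.Embedding(padded_vocab, 4096)  # padding rows are never indexed by real token ids
```

I have not verified this against this repo's parallelism setup; it is only meant to show the invariant (equal shard shapes on every rank) that seems to be violated here.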
