Resume with --load restores Megatron ckpt but train.py rollout loop still starts from 0 #1558

I ran with `--load ${OUT_FOLDER}` and confirmed that `slime/backends/megatron_utils/checkpoint.py` reads `latest_checkpointed_iteration.txt` correctly. I also printed `start_rollout_ids` in `slime/ray/placement_group.py`, and it returns the iteration number that follows the latest checkpoint.

However, training always starts with rollout_id = 0. I traced the issue to https://github.com/THUDM/slime/blob/d008e74e12b5c322767c31e0ee22ef3e6382d027/slime/utils/arguments.py#L1560C5-L1563C34
When I remove `args.start_rollout_id = 0`, the run loads the checkpoint and training resumes from the latest checkpoint instead of from 0.
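
To make the behavior I expected concrete, here is a minimal sketch of how I would derive the starting rollout id when resuming. This is not slime's actual code: `read_latest_iteration` and `resolve_start_rollout_id` are hypothetical helpers, and resuming at "latest iteration + 1" is an assumption based on the `start_rollout_ids` value I printed. The only detail taken from Megatron is the tracker filename, which holds either an iteration number or the string `release`.

```python
import os
from typing import Optional

# Tracker file written next to Megatron checkpoints.
TRACKER_FILENAME = "latest_checkpointed_iteration.txt"


def read_latest_iteration(load_dir: str) -> Optional[int]:
    """Return the iteration stored in the tracker file, or None if unavailable."""
    path = os.path.join(load_dir, TRACKER_FILENAME)
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        text = f.read().strip()
    # The tracker may contain "release" instead of a number; treat that as "no iteration".
    return int(text) if text.isdigit() else None


def resolve_start_rollout_id(
    load_dir: Optional[str], explicit_start: Optional[int] = None
) -> int:
    """Hypothetical helper: pick the rollout id to start from.

    An explicitly supplied value wins. Otherwise resume one past the latest
    checkpointed iteration (assumption), and fall back to 0 only when there
    is nothing to resume from.
    """
    if explicit_start is not None:
        return explicit_start
    latest = read_latest_iteration(load_dir) if load_dir else None
    return 0 if latest is None else latest + 1
```

The point of the sketch is that 0 is used only as a fallback. An unconditional `start_rollout_id = 0` placed after this kind of logic would discard the resumed value, which matches what I am seeing.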

Is this modification correct?
