Resume with --load restores Megatron ckpt but train.py rollout loop still starts from 0 #1558

@p1k0pan

Description

I ran with --load ${OUT_FOLDER} and confirmed that slime/backends/megatron_utils/checkpoint.py reads latest_checkpointed_iteration.txt. I also printed start_rollout_ids in slime/ray/placement_group.py, and it returns the ID that follows the latest checkpoint.

However, training always starts with rollout_id = 0. I traced this to https://github.com/THUDM/slime/blob/d008e74e12b5c322767c31e0ee22ef3e6382d027/slime/utils/arguments.py#L1560C5-L1563C34
When I remove args.start_rollout_id = 0 there, training resumes from the latest checkpoint instead of restarting at 0.

Is this modification correct?
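
To make the question concrete, here is a minimal sketch of the behavior I expected. It is not slime's actual code: resolve_start_rollout_id and read_latest_iteration are hypothetical helpers, and resuming at "latest iteration + 1" is an assumption based on what start_rollout_ids printed for me. The only detail taken from the run is the tracker file name, latest_checkpointed_iteration.txt.

```python
import os
import tempfile
from typing import Optional

# Tracker file name seen in the checkpoint folder.
TRACKER_FILENAME = "latest_checkpointed_iteration.txt"


def read_latest_iteration(load_dir: str) -> Optional[int]:
    """Return the iteration stored in the tracker file, or None if unavailable."""
    path = os.path.join(load_dir, TRACKER_FILENAME)
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        text = f.read().strip()
    # A non-numeric tracker value is treated as "no resumable iteration".
    return int(text) if text.isdigit() else None


def resolve_start_rollout_id(load_dir: Optional[str],
                             start_rollout_id: Optional[int] = None) -> int:
    """Hypothetical helper: pick the first rollout_id for the training loop.

    An explicit value wins; otherwise resume right after the latest
    checkpoint (assumption); otherwise start from 0.
    """
    if start_rollout_id is not None:
        return start_rollout_id
    if load_dir is None:
        return 0
    latest = read_latest_iteration(load_dir)
    return 0 if latest is None else latest + 1


# Small demonstration with a fake checkpoint folder.
with tempfile.TemporaryDirectory() as ckpt_dir:
    with open(os.path.join(ckpt_dir, TRACKER_FILENAME), "w") as f:
        f.write("41")
    resumed_start = resolve_start_rollout_id(ckpt_dir)
    explicit_start = resolve_start_rollout_id(ckpt_dir, start_rollout_id=7)

fresh_start = resolve_start_rollout_id(None)

# The rollout loop would then begin at the resolved ID instead of 0.
first_ids = list(range(resumed_start, resumed_start + 3))
print(resumed_start, explicit_start, fresh_start, first_ids)
```

The point of the sketch is the ordering: the default of 0 is applied only when no checkpoint is found, so it never overwrites a value derived from the loaded checkpoint.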
