I ran with --load ${OUT_FOLDER} and confirmed that slime/backends/megatron_utils/checkpoint.py can read latest_checkpointed_iteration.txt. I also printed start_rollout_ids in slime/ray/placement_group.py, and it returns the number that follows the latest checkpoint.
However, training always starts with rollout_id = 0. I traced the issue to https://github.com/THUDM/slime/blob/d008e74e12b5c322767c31e0ee22ef3e6382d027/slime/utils/arguments.py#L1560C5-L1563C34
When I remove args.start_rollout_id = 0, training resumes from the latest checkpoint instead of restarting at 0.
Is this modification correct?
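
For reference, the resume behavior I expected is roughly the following. This is only a minimal sketch, not slime's actual code: resolve_start_rollout_id and its parameters are hypothetical names, and it assumes the tracker file holds a single integer (anything else is treated as "no resumable checkpoint").

```python
from pathlib import Path
from typing import Optional

# File name taken from the checkpoint folder described above.
TRACKER_FILENAME = "latest_checkpointed_iteration.txt"


def resolve_start_rollout_id(load_dir: Optional[str],
                             override: Optional[int] = None) -> int:
    """Hypothetical helper: pick the rollout id that training should start from.

    An explicit override wins. Otherwise resume right after the latest
    checkpoint found in load_dir, and fall back to 0 only when there is
    nothing to resume from.
    """
    if override is not None:
        return override
    if load_dir is None:
        return 0

    tracker = Path(load_dir) / TRACKER_FILENAME
    if not tracker.is_file():
        return 0

    text = tracker.read_text().strip()
    if not text.isdigit():
        # Assumption: non-numeric content means there is no iteration to resume.
        return 0

    # Resume from the rollout that follows the last saved one.
    return int(text) + 1
```

The point is that the default of 0 should apply only when no value was derived from the checkpoint, rather than overwriting it unconditionally.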