local_rollout_batch_size #13

@YuyangZhu7

Description

Setup: 2 A100 GPUs total, with
`per_device_train_batch_size=16` and
`local_rollout_batch_size=10`.

Is it necessary for `local_rollout_batch_size` multiplied by the number of GPUs minus one (i.e., the GPUs used for the Actor) to exactly equal the number of tasks?

If I use 8 GPUs and set `local_rollout_batch_size` to any value other than 2, I encounter this error:

```
local_token_obs["input_ids"][:, :processed_obs["input_ids"].shape[1]] = processed_obs["input_ids"]
RuntimeError: The expanded size of the tensor (4) must match the existing size (2) at non-singleton dimension 0. Target sizes: [4, 34]. Tensor sizes: [2, 34]
Actor ObjectRef(2751d69548dba9565d1370028d345bffd68a1c6b0100000001000000) died
```
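The traceback itself is a plain slice-assignment shape mismatch: the destination slice has batch dimension 4, while the source tensor only has 2 rows. A minimal sketch with NumPy reproduces the same rule (the batch sizes 4 and 2 and width 34 come from the traceback above; the buffer width 64 is an arbitrary assumption, and NumPy raises `ValueError` where PyTorch raises `RuntimeError`):

```python
import numpy as np

# Assumed padded observation buffer: batch dim 4, as in "Target sizes: [4, 34]".
local_token_obs = np.zeros((4, 64), dtype=np.int64)
# Assumed rollout output: only 2 rows, as in "Tensor sizes: [2, 34]".
processed_obs = np.zeros((2, 34), dtype=np.int64)

try:
    # Same assignment pattern as the failing line in the traceback.
    local_token_obs[:, :processed_obs.shape[1]] = processed_obs
except ValueError as e:
    # The batch dimensions (4 vs 2) do not match, so the assignment fails.
    print("shape mismatch:", e)
```

This suggests the rollout produced fewer samples than the buffer was sized for, which is consistent with a divisibility constraint between `local_rollout_batch_size`, the number of Actor GPUs, and the number of tasks.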
