Total: 2 A100 GPUs
per_device_train_batch_size=16
local_rollout_batch_size=10
Is it necessary that local_rollout_batch_size multiplied by the number of GPUs minus 1 (the GPUs used for the Actor) exactly equals the number of tasks?
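If that is indeed the rule, it amounts to a simple divisibility check. The sketch below is purely illustrative: the function name, `num_tasks`, and the assumption that exactly one GPU is reserved away from the Actor are mine, not names or guarantees from the codebase.

```python
# Hypothetical sketch of the constraint asked about above:
# local_rollout_batch_size * (num_gpus - 1) == num_tasks,
# assuming one GPU is reserved for something other than Actor workers.
def check_rollout_config(local_rollout_batch_size: int,
                         num_gpus: int,
                         num_tasks: int) -> bool:
    actor_gpus = num_gpus - 1  # assumption: one non-Actor GPU
    return local_rollout_batch_size * actor_gpus == num_tasks

# With 2 GPUs and local_rollout_batch_size=10, the rule holds for 10 tasks:
print(check_rollout_config(10, 2, 10))  # True
# With 8 GPUs, local_rollout_batch_size=2 would require 14 tasks:
print(check_rollout_config(2, 8, 14))  # True
```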
With 8 GPUs, if I set local_rollout_batch_size to anything other than 2, I encounter this error:
```
local_token_obs["input_ids"][:, :processed_obs["input_ids"].shape[1]] = processed_obs["input_ids"]
RuntimeError: The expanded size of the tensor (4) must match the existing size (2) at non-singleton dimension 0. Target sizes: [4, 34]. Tensor sizes: [2, 34]
Actor ObjectRef(2751d69548dba9565d1370028d345bffd68a1c6b0100000001000000) died
```
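The RuntimeError is a plain batch-dimension mismatch: the left-hand slice of `local_token_obs["input_ids"]` expects 4 rows on this worker, while `processed_obs["input_ids"]` only carries 2. A minimal standalone reproduction of the failing assignment pattern, using NumPy in place of PyTorch (the shapes are taken from the traceback; NumPy raises a ValueError where PyTorch raises the RuntimeError shown above):

```python
import numpy as np

# Shapes from the traceback: the per-worker buffer expects 4 rows,
# but the processed observations only contain 2 rows of 34 tokens.
local_input_ids = np.zeros((4, 50), dtype=np.int64)      # stand-in buffer
processed_input_ids = np.ones((2, 34), dtype=np.int64)   # stand-in rollout output

try:
    # Same assignment pattern as the failing line in the issue:
    local_input_ids[:, :processed_input_ids.shape[1]] = processed_input_ids
except ValueError as e:
    # NumPy raises ValueError here; PyTorch raises the equivalent RuntimeError.
    print("shape mismatch:", e)
```

This is consistent with the batch sizes not dividing evenly across the Actor GPUs, which is what the question above is getting at.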