2 changes: 1 addition & 1 deletion apps/grpo/qwen3_8b.yaml

@@ -42,7 +42,7 @@ policy:

 # Trainer configuration
 trainer:
-  use_dcp: true
+  use_dcp: false
   use_vllm_builtin_load: true
   model:
     name: qwen3
3 changes: 2 additions & 1 deletion src/forge/actors/trainer.py

@@ -403,7 +403,8 @@ async def push_weights(self, policy_version: int) -> None:
         else:
             for name, param in hf_state_dict.items():
                 key = get_param_key(policy_version, name)
-                await ts.put(key, param)
+                # RDMA is still broken on GPU, so we need to copy to CPU
+                await ts.put(key, param.detach().cpu())
Contributor:
Is this a bad thing? I thought we were writing it to CPU memory anyway on the trainer put side.

Contributor (Author):
Theoretically we don't need this extra copy if RDMA on GPU is working. Yes, we are writing to CPU memory, but currently the path is local GPU -> local CPU -> remote CPU.
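The write path described above amounts to one extra staging copy. A minimal sketch (the helper name `stage_for_put` is hypothetical; the actual change simply inlines `param.detach().cpu()` inside `push_weights`):

```python
import torch

def stage_for_put(param: torch.Tensor) -> torch.Tensor:
    # local gpu -> local cpu: detach from autograd and copy to host memory,
    # so the subsequent store put (local cpu -> remote cpu) never has to
    # register GPU memory for RDMA.
    return param.detach().cpu()
```

With working GPU RDMA the transport could read device memory directly and this hop would disappear.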

Contributor @vidhyav, Oct 7, 2025:

1. Why is the GPU RDMA not working?

2. What is the perf penalty with the GPU-CPU copy?

3. Also, what's the corresponding access on the read side?

Contributor (Author):

> 1. Why is the GPU RDMA not working?

Memory registration error. This could be a build issue, but I doubt we have enough time to debug. CPU should work now.

> 2. What is the perf penalty with the GPU-CPU copy?

Not sure, need profiling. I'd guess anywhere between 30% and 100% increased latency.
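The staging copy itself is easy to profile in isolation. A minimal timing sketch (the helper name `d2h_copy_ms` is hypothetical; on a CUDA tensor this measures the actual device-to-host transfer, while a CPU tensor only exercises the code path):

```python
import time
import torch

def d2h_copy_ms(param: torch.Tensor, iters: int = 10) -> float:
    # Times the detach().cpu() staging copy added in push_weights,
    # averaged over `iters` repetitions, in milliseconds.
    if param.is_cuda:
        torch.cuda.synchronize()  # drain pending kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        param.detach().cpu()
    if param.is_cuda:
        torch.cuda.synchronize()  # ensure the copies have completed
    return (time.perf_counter() - start) * 1000 / iters
```

Running this per-parameter over `hf_state_dict` would replace the 30%-100% guess with real numbers.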

> 3. Also, what's the corresponding access on the read side?

On the read side, it's basically remote CPU -> local CPU -> GPU (the vLLM worker).
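The final hop of that read path can be sketched as follows (hypothetical helper; the store read that performs remote CPU -> local CPU is assumed to have already landed the tensor in host memory):

```python
import torch

def to_worker_device(param: torch.Tensor) -> torch.Tensor:
    # local cpu -> gpu: the last hop on the vLLM worker after the store
    # read has placed the tensor in local host memory.
    if not torch.cuda.is_available():
        return param  # CPU-only fallback so the sketch runs anywhere
    # Pinning host memory lets the H2D copy overlap with compute.
    return param.pin_memory().to("cuda", non_blocking=True)
```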

         t.step("ts_save")
         t.stop()
         end_time = time.perf_counter()