Fixes for multi-GPU #390

catid · 2025-10-18T20:20:48Z

I had GPT-5-codex fix a bunch of multi-GPU bugs for me. I haven't reviewed what it did, but it fixed lots of console errors and a multi-GPU hang that locked up my runs.

time torchrun --standalone --nnodes=1 --nproc-per-node=2 -m pufferlib.pufferl train puffer_breakout

Tested on 3 machines:

Benchmark results (SPS):

(1) 5090x2: 6M
(2) Pro6000x2 (600W): 4M
(3) Pro6000x4 (300W): 9M

Interesting that the 5090 is faster than the much more expensive card. Could be the host CPU, which is a gamer 9950X3D vs 64-core Threadrippers.
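To make the comparison between cards easier to read, here is a rough per-GPU normalization of the numbers above. This is only a sketch: it assumes the quoted SPS figures are aggregate totals across all GPUs on each machine (the post does not say), and the figures themselves are rounded. The names `results` and `per_gpu_sps` are made up for illustration.

```python
# Aggregate SPS figures quoted in the post, paired with each machine's GPU count.
# Assumption: each figure is the total across all GPUs, not a per-GPU number.
results = {
    "5090x2": (6_000_000, 2),
    "Pro6000x2 (600W)": (4_000_000, 2),
    "Pro6000x4 (300W)": (9_000_000, 4),
}


def per_gpu_sps(total_sps, num_gpus):
    """Divide aggregate steps per second evenly across GPUs."""
    return total_sps / num_gpus


per_gpu = {name: per_gpu_sps(sps, n) for name, (sps, n) in results.items()}

for name, value in per_gpu.items():
    print(f"{name}: {value / 1e6:.2f}M SPS per GPU")
```

Under that assumption the 5090 box comes out around 3M SPS per GPU, versus roughly 2M and 2.25M for the two Pro6000 machines, which is consistent with the observation that the cheaper card is ahead here.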

Seems to be working. Hope this helps your dev

jsuarez5341 · 2025-10-21T18:37:05Z

Your LLMs made a mess of the main train file. Can you please submit a repro of the hang that this was supposed to fix?

catid added 5 commits October 18, 2025 19:14

fixes (66b67a3)

fix sync (ac477fa)

fixes (f93374c)

fixes (c99a022)

fix (6c11511)
