
Conversation

@catid commented Oct 18, 2025

I had GPT-5-codex fix a bunch of bugs in multi-GPU for me. I haven't reviewed what it did, but it fixed lots of console errors and a multi-GPU hang that locked up my runs.

time torchrun --standalone --nnodes=1 --nproc-per-node=2 -m pufferlib.pufferl train puffer_breakout

Tested on 3 machines. Benchmark results (SPS):

(1) 5090 x2: 6M
(2) Pro 6000 x2 (600 W): 4M
(3) Pro 6000 x4 (300 W): 9M

Interestingly, the 5090 is faster than the much more expensive card. It could be the host CPU: a gamer 9950X3D vs. 64-core Threadrippers.

Seems to be working. Hope this helps your dev.

@jsuarez5341 (Contributor) commented

Your LLMs made a mess of the main train file. Can you please submit a repro of the hang that this was supposed to fix?
