-
What if you interrupt? What's the traceback (i.e., where does it hang)?
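If you'd rather not kill the run, py-spy can dump the Python stack of a hung rank in place (py-spy isn't part of this repo's requirements, and the PID below is a placeholder):

```bash
# py-spy is a third-party sampling profiler; `dump` prints the current Python stack
# of a running process, which shows where a hung rank is blocked without stopping it.
uv pip install py-spy
py-spy dump --pid <PID-of-a-training-worker>
```

Running it against each rank should show which call every process is sitting in.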
-
The vLLM version is the one indicated in the README; I followed all the steps: uv pip install vllm==0.7.2. The weird thing is that on wandb it really always stops at Step 8, but watching the console logs I saw the training got at least to Step 15 before it also got stuck there. See the traceback below from when I interrupt:

^CW0316 21:27:30.473000 33919 torch/distributed/elastic/agent/server/api.py:704] Received 2 death signal, shutting down workers
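As an aside, one thing I can try to narrow down where it blocks is the standard PyTorch/NCCL debug logging (these env vars are generic PyTorch/NCCL settings, not something from this repo's docs):

```bash
# Set before relaunching the same training command, so each rank reports which
# collective it is waiting on when the hang occurs.
export NCCL_DEBUG=INFO                 # per-rank NCCL logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # log torch.distributed collective mismatches
```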
-
I actually managed to capture the error when the process crashes; it seems to be some sort of timeout. See below:
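If it really is the distributed collective timing out, one thing I could try, assuming GRPOConfig passes through the standard transformers TrainingArguments fields, is raising ddp_timeout (the script and recipe paths below are placeholders):

```bash
# Sketch only: `ddp_timeout` is a standard TrainingArguments field (seconds, default 1800)
# that sets the torch.distributed process-group timeout. "..." stands for whatever
# accelerate flags are already in the launch command; the recipe path is a placeholder.
accelerate launch ... src/open_r1/grpo.py \
    --config <your-grpo-recipe>.yaml \
    --ddp_timeout 7200
```

At worst this should turn a silent hang into a clearer error later on.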
-
Any idea what could be causing this?
-
Hey everyone,
first of all, thanks a lot to all the contributors of this great repo. I'm trying to run it on a DGX H100 cluster using the GRPO example from the README (the OpenR1-Math-200k dataset). I'm running it with num_processes 4, but training always gets stuck at Step 8 (I'm tracking it with wandb). All runs stop at Step 8, even with smaller batch sizes (I tried reducing them). The run does not show any error messages; the training process just hangs at Step 8. Do you have any suggestions about what I could be doing wrong, or how I could investigate and fix this?
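For reference, the command is roughly the README's GRPO example adjusted to 4 processes; the config and recipe paths below are placeholders rather than my exact files:

```bash
# Approximate reconstruction of the README's GRPO invocation with 4 processes;
# the accelerate config and recipe paths are placeholders, not copied from my run.
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file <accelerate-config>.yaml \
    --num_processes 4 \
    src/open_r1/grpo.py --config <grpo-recipe>.yaml
```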