-
What if you interrupt? What's the traceback (i.e., where does it hang)?
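If you'd rather not kill the run, py-spy can dump the Python stack of a hung rank in place (py-spy isn't part of this repo's requirements, and the PID below is a placeholder):

```bash
# py-spy is a third-party sampling profiler; `dump` prints the current Python stack
# of a running process, which shows where a hung rank is blocked without stopping it.
uv pip install py-spy
py-spy dump --pid <PID-of-a-training-worker>
```

Running it against each rank should show which call every process is sitting in.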
-
The vLLM version is the one indicated in the README; I followed all the steps: uv pip install vllm==0.7.2. The weird thing is that on wandb it really always stops at Step 8, but watching the console logs I saw the training got at least to Step 15 before it also got stuck there. See the traceback below from when I interrupt:

^CW0316 21:27:30.473000 33919 torch/distributed/elastic/agent/server/api.py:704] Received 2 death signal, shutting down workers
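As an aside, one thing I can try to narrow down where it blocks is the standard PyTorch/NCCL debug logging (these env vars are generic PyTorch/NCCL settings, not something from this repo's docs):

```bash
# Set before relaunching the same training command, so each rank reports which
# collective it is waiting on when the hang occurs.
export NCCL_DEBUG=INFO                 # per-rank NCCL logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # log torch.distributed collective mismatches
```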
-
I actually managed to capture the error when the process crashes; it seems to be some sort of timeout. See below:
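If it really is the distributed collective timing out, one thing I could try, assuming GRPOConfig passes through the standard transformers TrainingArguments fields, is raising ddp_timeout (the script and recipe paths below are placeholders):

```bash
# Sketch only: `ddp_timeout` is a standard TrainingArguments field (seconds, default 1800)
# that sets the torch.distributed process-group timeout. "..." stands for whatever
# accelerate flags are already in the launch command; the recipe path is a placeholder.
accelerate launch ... src/open_r1/grpo.py \
    --config <your-grpo-recipe>.yaml \
    --ddp_timeout 7200
```

At worst this should turn a silent hang into a clearer error later on.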
-
Any idea what could be causing this?
-
Hey everyone,
first of all, thanks a lot to all the contributors of this great repo. I'm trying to run it on a DGX H100 cluster using the GRPO example from the README (the OpenR1-Math-200k dataset). I'm running it with num_processes 4, but training always gets stuck at Step 8 (I'm tracking it with wandb). All runs stop at Step 8, even with smaller batch sizes (I tried reducing them). The run does not show any error messages; the training process just hangs at Step 8. Do you have any suggestions about what I could be doing wrong, or how I could investigate and fix this?
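For reference, the command is roughly the README's GRPO example adjusted to 4 processes; the config and recipe paths below are placeholders rather than my exact files:

```bash
# Approximate reconstruction of the README's GRPO invocation with 4 processes;
# the accelerate config and recipe paths are placeholders, not copied from my run.
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file <accelerate-config>.yaml \
    --num_processes 4 \
    src/open_r1/grpo.py --config <grpo-recipe>.yaml
```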