NCCL Timeout Error during train_val_test_data_provider
I encountered an NCCL timeout error during the `train_val_test_data_provider` step while using 8 GPUs (2, 2, 2).
Why am I experiencing this issue? Perhaps the timeout is caused by a communication problem?
It also seems that both too-large and too-small datasets can trigger NCCL timeouts.
The code got stuck at:

```python
train_data_iterator, valid_data_iterator, test_data_iterator \
    = build_train_valid_test_data_iterators(
        train_valid_test_dataset_provider)
```
Errors:

```
[E ProcessGroupNCCL.cpp:737] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:737] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801313 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
Fatal Python error: Aborted
```
Thanks!