Replies: 4 comments 7 replies
-
I encountered the exact same issue: the pmapped step function works fine on a single GPU, but when scaling to multiple GPUs the step function may hang randomly with 0% GPU usage. There is high variance in the frequency, but I would say it happens once every 100k to 2M training steps (i.e. pmapped function calls). It is not a RAM or device memory issue. No error is thrown; training is just "paused" and you have to terminate it manually. The hang does not seem to depend on the data loading pipeline: when it happens, the batch is already loaded and available in device memory. The issue is also not hardware dependent, as it happens on both A100 and P5000 GPUs. I encountered it in both old (1.71) and recent (3.10) JAX versions, and with both haiku- and flax-based training frameworks. Finally, early tests seem to indicate that the issue does not depend on the cuDNN version (8.1.2 or 8.4).
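For concreteness, here is a stripped-down sketch of the kind of pmapped step function I am describing (toy model, hypothetical names, not my actual training code):

```python
import functools

import jax
import jax.numpy as jnp

# Toy pmapped SGD step: gradients are averaged across GPUs with
# lax.pmean; the hang always happens inside a call like this one.
@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, batch):
    def loss_fn(p):
        preds = batch["x"] @ p["w"] + p["b"]
        return jnp.mean((preds - batch["y"]) ** 2)

    loss, grads = jax.value_and_grad(loss_fn)(params)
    grads = jax.lax.pmean(grads, axis_name="devices")
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return new_params, jax.lax.pmean(loss, axis_name="devices")
```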
-
Hello, I think I have essentially the same issue. I am currently working with the Google Scenic project (https://github.com/google-research/scenic/tree/main/scenic/projects/vivit), which is based on JAX. One of the problems I encountered is that during the evaluation step or train step the process just hangs randomly (it can happen after 1 hour of training or after 30 hours). The hang happens somewhere inside the pmapped eval/train function, and only when more than one GPU is used. During the hang the GPU utilization (I am currently using A100, V100 or P5000) is zero, but the CPU utilization is close to 100% on all cores, and there is no error or exception. If the eval/train step function is vmapped, only jitted, or pmapped on a single GPU, the step works perfectly.

I have tried different experiments to find the cause, isolating the code from any dataset and from all libraries apart from jax, flax, ml_collections, scipy and tf, but nothing has helped. There is an obvious connection between the error and the multi-GPU configuration. My environment is based on the latest versions of jax and jaxlib: jax 0.3.13. I have also previously tried different versions of jax and jaxlib (both "jax[cuda11_cudnn82]" and "jax[cuda11_cudnn805]"). My CUDA version is 11.4 and cuDNN is 8.3.

The problem is currently quite critical for our Vision Transformer use case, and I am looking for help or hints to solve it. I attach to this post a Python script that is isolated from data and uses only basic packages such as jax, flax and tf. It is written for 4 GPUs; feel free to edit it at the end depending on your number of GPUs.
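To make the comparison of wrappings concrete, this is roughly the pattern (toy step function and shapes for illustration only, not the attached script):

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the eval/train step; only meant to show which
# wrappings behave correctly and which one hangs.
def step_fn(params, x):
    return jnp.mean(jnp.tanh(x @ params))

params = jnp.ones((16, 4))
x = jnp.ones((jax.local_device_count(), 8, 16))  # leading axis = number of GPUs

jit_step = jax.jit(step_fn)
vmap_step = jax.jit(jax.vmap(step_fn, in_axes=(None, 0)))
pmap_step = jax.pmap(step_fn, in_axes=(None, 0))

print(jit_step(params, x))   # fine
print(vmap_step(params, x))  # fine
print(pmap_step(params, x))  # fine on 1 GPU, hangs sporadically on several
```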
-
I created an issue for this since this has grown beyond simply being a discussion (#10969).
-
We're seeing the same issue on almost exactly the same setup: 8x A100 with 0% GPU / 100% CPU utilization, and the same JAX/Flax versions.
-
Hi,
We are facing a problem where training and validation code based on jax/flax hangs randomly on a multi-GPU host.
Using a single GPU works correctly, but once we add multi-GPU support it hangs in an unpredictable way.
GPU usage is at 0% for all GPUs while the CPU is busy.
What could be the problem?
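For reference, our multi-GPU path follows the usual jax/flax pmap pattern, roughly like this simplified sketch (placeholder shapes and a toy model, not our actual code):

```python
import functools

import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

# Placeholder parameters; the real state is a flax train state.
params = {"w": jnp.zeros((32, 1))}
params = jax.device_put_replicated(params, jax.local_devices())

@functools.partial(jax.pmap, axis_name="batch")
def train_step(p, x, y):
    grads = jax.grad(lambda q: jnp.mean((x @ q["w"] - y) ** 2))(p)
    grads = jax.lax.pmean(grads, axis_name="batch")
    return jax.tree_util.tree_map(lambda a, g: a - 0.01 * g, p, grads)

for _ in range(100):
    # Each batch gets a leading device axis before the pmapped call;
    # the hang occurs unpredictably inside this call.
    x = jnp.ones((n_dev, 8, 32))
    y = jnp.ones((n_dev, 8, 1))
    params = train_step(params, x, y)
```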