Deadlocks in a multi-threaded environment #76

@jvstme

Context

This might be a known issue, as it seems to be mentioned in the readme:

Keep in mind, that this may lead to some problems or infinite locks, even if timeouts have been added.

We found a workaround for our project, but I am opening this ticket anyway, either to help resolve the issue or to help other users work around it.

Steps to reproduce

Set your token and run the script.

from functools import partial
from threading import Thread

from nebius.sdk import SDK


TOKEN = ...
sdk = SDK(credentials=TOKEN)


def test(i):
    sdk.whoami(timeout=5).wait()
    print(f"Thread {i} done")


threads = [Thread(target=partial(test, i)) for i in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

Expected behavior

Each thread prints a message and the script exits.

Thread 2 done
Thread 1 done
Thread 0 done
Thread 7 done
Thread 6 done
Thread 4 done
Thread 8 done
Thread 9 done
Thread 3 done
Thread 5 done

Actual behavior

Some threads print a message, then the script hangs and never exits.

Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 4 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 6 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 3 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 1 done
Thread 7 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 9 done
Thread 8 done
Thread 0 done

You may need to run the script a few times to reproduce.
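
Because the hang is intermittent, it can help to join the threads against a deadline so that a stuck run reports itself instead of blocking forever. The following is only a sketch: `join_with_deadline` is a hypothetical helper, not part of the SDK, and the "stuck" thread in the demo is simulated with an `Event` that is never set rather than a real SDK call. The threads are daemon threads so the interpreter can still exit while one of them is blocked.

```python
import faulthandler
import sys
import threading
import time


def join_with_deadline(threads, timeout):
    """Join all threads within one shared deadline; return those still alive."""
    deadline = time.monotonic() + timeout
    for thread in threads:
        remaining = max(0.0, deadline - time.monotonic())
        thread.join(remaining)
    return [thread for thread in threads if thread.is_alive()]


def demo():
    never_set = threading.Event()  # stands in for a call that never returns
    threads = [
        threading.Thread(target=lambda: None, name="finishes", daemon=True),
        threading.Thread(target=never_set.wait, name="stuck", daemon=True),
    ]
    for thread in threads:
        thread.start()

    stuck = join_with_deadline(threads, timeout=1.0)
    if stuck:
        print("Still running:", [thread.name for thread in stuck])
        # Dump the stack of every thread to see where each one is blocked.
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    demo()
```

In the reproduction script, this would replace the final `thread.join()` loop, with the threads created as `daemon=True`.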

The error messages are likely caused by grpc/grpc#25364 and are not related to the deadlocks. We've seen deadlocks without these error messages too.

Workaround

Rewriting a big chunk of our project on the asynchronous stack is not feasible for us. Instead, we solved the problem by running an event loop dedicated to the SDK in a separate thread and submitting async SDK calls to that loop from our sync code running in other threads.

import asyncio
from functools import partial
from threading import Thread

from nebius.sdk import SDK

TOKEN = ...
sdk = SDK(credentials=TOKEN)
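# Event loop dedicated to the SDK, running forever in a daemon thread.
loop = asyncio.new_event_loop()
Thread(target=loop.run_forever, daemon=True).start()


# run_coroutine_threadsafe() requires a coroutine object,
# so wrap the awaitable returned by the SDK in one.
async def coroutine(awaitable):
    return await awaitable


def test(i):
    # Submit the call to the dedicated loop and block this thread until it completes.
    asyncio.run_coroutine_threadsafe(coroutine(sdk.whoami(timeout=5)), loop).result()
    print(f"Thread {i} done")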


threads = [Thread(target=partial(test, i)) for i in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

I'm not sure what the root cause of the deadlocks is, but the SDK could perhaps use an approach similar to this workaround internally to provide a thread-safe synchronous API.
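
To make that suggestion more concrete, here is a rough sketch of how the workaround could be packaged as a reusable helper. It is not how the SDK works today: `LoopThread`, `run_sync`, and `fake_call` are hypothetical names, and `fake_call` is a stand-in coroutine used in place of a real SDK request so the example runs without credentials.

```python
import asyncio
import threading


class LoopThread:
    """Owns one event loop that runs in a dedicated daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_sync(self, awaitable, timeout=None):
        """Await `awaitable` on the loop thread; block the caller for the result."""

        async def _wrap():
            # run_coroutine_threadsafe() needs a coroutine object.
            return await awaitable

        future = asyncio.run_coroutine_threadsafe(_wrap(), self._loop)
        return future.result(timeout)

    def close(self):
        """Stop the loop, wait for its thread to finish, and release the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_call(i):
    # Hypothetical stand-in for an async SDK request.
    await asyncio.sleep(0.01)
    return i * 2


if __name__ == "__main__":
    runner = LoopThread()
    results = {}

    def worker(i):
        results[i] = runner.run_sync(fake_call(i), timeout=5)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(sorted(results.items()))
    runner.close()
```

With the real SDK, the call in `worker` would presumably become `runner.run_sync(sdk.whoami(timeout=5))`, mirroring the workaround above. Note that the `timeout` passed to `future.result()` only stops the caller from waiting; it does not cancel the call still running on the loop.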
