Deadlocks in a multi-threaded environment

### Context

This might be a known issue, as it seems to be mentioned in the readme:
> Keep in mind, that this may lead to some problems or infinite locks, even if timeouts have been added.

We found a workaround for our project, but I'm opening this ticket anyway in case it helps resolve the issue or helps other users work around it.

### Steps to reproduce

Set your token and run the script.

```python
from functools import partial
from threading import Thread

from nebius.sdk import SDK


TOKEN = ...
sdk = SDK(credentials=TOKEN)


def test(i):
    sdk.whoami(timeout=5).wait()
    print(f"Thread {i} done")


threads = [Thread(target=partial(test, i)) for i in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

### Expected behavior

Each thread prints a message, and the script exits.

```
Thread 2 done
Thread 1 done
Thread 0 done
Thread 7 done
Thread 6 done
Thread 4 done
Thread 8 done
Thread 9 done
Thread 3 done
Thread 5 done
```

### Actual behavior

Some threads print a message, then the script hangs and never exits.

```
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 4 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 6 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 3 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 1 done
Thread 7 done
Exception in callback PollerCompletionQueue._handle_events()()
handle: <Handle PollerCompletionQueue._handle_events()()>
Traceback (most recent call last):
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi", line 147, in grpc._cython.cygrpc.PollerCompletionQueue._handle_events
BlockingIOError: [Errno 11] Resource temporarily unavailable
Thread 9 done
Thread 8 done
Thread 0 done
```

You may need to run the script a few times to reproduce.

The error messages are likely caused by https://github.com/grpc/grpc/issues/25364 and are probably unrelated to the deadlocks: we've seen deadlocks without these error messages too.
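Since the hang is intermittent, it can help to run the threads under a deadline and dump every thread's stack when some of them fail to finish. Below is a minimal stdlib-only sketch of that idea. `run_with_watchdog` and its parameters are hypothetical names introduced here, not part of the SDK, and the stub target stands in for the `test` function from the script above.

```python
import faulthandler
import sys
import threading
import time


def run_with_watchdog(target, count=10, timeout=30.0, dump=True):
    """Run target(i) in `count` threads; return indices still alive after `timeout` seconds."""
    # Daemon threads, so a stuck thread cannot keep the interpreter alive.
    threads = [
        threading.Thread(target=target, args=(i,), daemon=True) for i in range(count)
    ]
    for thread in threads:
        thread.start()

    # One shared deadline for all joins rather than `timeout` per thread.
    deadline = time.monotonic() + timeout
    for thread in threads:
        thread.join(max(0.0, deadline - time.monotonic()))

    stuck = [i for i, thread in enumerate(threads) if thread.is_alive()]
    if stuck and dump:
        # Print the Python stack of every thread to show where each one is blocked.
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
    return stuck


# Stub target used only to keep the sketch self-contained.
stuck = run_with_watchdog(lambda i: time.sleep(0.01), count=10, timeout=5.0)
print("stuck threads:", stuck)
```

With the reproduction script, passing `test` as the target should make a hung run exit with a list of stuck thread indices and a stack dump instead of blocking forever.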

### Workaround

Rewriting a large part of our project on the asynchronous stack isn't feasible for us, so we worked around the problem by running an event loop dedicated to the SDK in a separate thread and submitting async SDK calls to that loop from our sync code running in other threads.

```python
import asyncio
from functools import partial
from threading import Thread

from nebius.sdk import SDK

TOKEN = ...
sdk = SDK(credentials=TOKEN)
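# Run one event loop, dedicated to the SDK, in a background daemon thread.
loop = asyncio.new_event_loop()
Thread(target=loop.run_forever, daemon=True).start()


async def coroutine(awaitable):
    # run_coroutine_threadsafe() only accepts coroutine objects,
    # so wrap the SDK's awaitable in one.
    return await awaitable


def test(i):
    # Submit the call to the SDK loop and block this thread until it completes.
    asyncio.run_coroutine_threadsafe(coroutine(sdk.whoami(timeout=5)), loop).result()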
    print(f"Thread {i} done")


threads = [Thread(target=partial(test, i)) for i in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Not sure what the root cause of the deadlocks is, but maybe an approach similar to this workaround could be used by the SDK internally to provide a thread-safe synchronous API.
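To make that suggestion a bit more concrete, here is a rough sketch of how the workaround could be packaged as a reusable bridge. It uses only the standard library. `LoopThread`, `run_sync` and `fake_whoami` are hypothetical names made up for this illustration, and the stub coroutine stands in for a real SDK call, so this has not been verified against the SDK itself.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class LoopThread:
    """Owns one asyncio event loop running in a background daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_sync(self, awaitable, timeout=None):
        """Await `awaitable` on the owned loop and block the caller for its result."""

        async def runner():
            # Wrapping lets any awaitable through, not only coroutine objects.
            return await awaitable

        future = asyncio.run_coroutine_threadsafe(runner(), self._loop)
        return future.result(timeout)

    def close(self):
        # Stop the loop from its own thread, then wait for that thread to exit.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_whoami(i):
    # Stand-in for an async SDK call; replace with the real awaitable.
    await asyncio.sleep(0.01)
    return f"user-{i}"


bridge = LoopThread()
with ThreadPoolExecutor(max_workers=10) as pool:
    # Ten worker threads issue blocking calls that all execute on the single loop.
    results = list(pool.map(lambda i: bridge.run_sync(fake_whoami(i)), range(10)))
bridge.close()
print(results)
```

In our setup the equivalent call would be `bridge.run_sync(sdk.whoami(timeout=5))`. All SDK I/O then happens on a single loop, so worker threads never drive the gRPC machinery themselves.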
