PYTHON-5021 - Fix usages of getaddrinfo to be non-blocking #2059


Merged Jan 17, 2025 (11 commits)

Conversation

NoahStapp (Contributor):

No description provided.

NoahStapp (Contributor Author):

Typing failures fixed by #2060.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 14, 2025 21:46
    hostname, None, 0, 0, socket.IPPROTO_TCP, socket.AI_CANONNAME
)[0]
if not _IS_SYNC:
    loop = asyncio.get_event_loop()

Member:

Should this be get_running_loop?

Member:

Yes: "Because this function has rather complex behavior (especially when custom event loop policies are in use), using the get_running_loop() function is preferred to get_event_loop() in coroutines and callbacks."

https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.get_event_loop
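The difference can be sketched as follows (a minimal illustration, not code from this PR; the function name `resolve` is hypothetical):

```python
import asyncio

async def resolve(host: str, port: int) -> list:
    # Inside a coroutine, get_running_loop() is preferred: it returns the
    # loop that is actually running, and raises RuntimeError if no loop is
    # running, instead of implicitly creating one like get_event_loop() may.
    loop = asyncio.get_running_loop()
    return await loop.getaddrinfo(host, port)

# Outside a running loop, get_running_loop() fails fast:
try:
    asyncio.get_running_loop()
except RuntimeError as exc:
    print(f"no running loop: {exc}")
```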

Contributor Author:

Good find. We use asyncio.get_event_loop() elsewhere in the code; I'll open a separate ticket to change those uses as well.

Contributor Author:

Addressed for the rest of the codebase in #2063.

if not _IS_SYNC:
    loop = asyncio.get_event_loop()
    af, socktype, proto, canonname, sockaddr = (
        await loop.getaddrinfo(

Member:

FYI:

Note: Both getaddrinfo and getnameinfo internally utilize their synchronous versions through the loop's default thread pool executor. When this executor is saturated, these methods may experience delays, which higher-level networking libraries may report as increased timeouts. To mitigate this, consider using a custom executor for other user tasks, or setting a default executor with a larger number of workers.

https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.getaddrinfo

Which means our users will eventually run into this issue: python/cpython#112169

This is still better than blocking the loop, of course, but I wonder whether we need to warn about this potential problem or test it explicitly.
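One mitigation the linked docs suggest is installing a larger default executor. A sketch of that idea (the worker count and host/port are illustrative, not values from this PR):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def main() -> list:
    loop = asyncio.get_running_loop()
    # The stock default executor caps at min(32, os.cpu_count() + 4) threads.
    # When user tasks saturate it, loop.getaddrinfo() queues behind them.
    # Installing a bigger default executor (64 here is arbitrary) raises
    # that ceiling for everything sharing the loop:
    loop.set_default_executor(ThreadPoolExecutor(max_workers=64))
    return await loop.getaddrinfo("localhost", 27017)

infos = asyncio.run(main())
```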

Contributor Author:

Would explicitly testing it provide an actionable solution? We could increase the default number of workers to help mitigate this, but warning users might only confuse them since this is an internal API.

Member:

Should we instead use run_in_executor and have our own executor? We use run_in_executor in _configured_socket as well.
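A dedicated-executor variant of that suggestion might look roughly like this sketch (the pool name, sizing, and function name are hypothetical, not pymongo's actual implementation):

```python
import asyncio
import socket
from concurrent.futures import ThreadPoolExecutor

# A dedicated pool so driver DNS lookups never compete with tasks the
# application schedules on the loop's default executor.
_DNS_EXECUTOR = ThreadPoolExecutor(max_workers=8, thread_name_prefix="dns")

async def getaddrinfo_dedicated(host: str, port: int) -> list:
    loop = asyncio.get_running_loop()
    # Passing an explicit executor to run_in_executor bypasses the
    # default pool entirely.
    return await loop.run_in_executor(
        _DNS_EXECUTOR, lambda: socket.getaddrinfo(host, port)
    )
```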

Contributor Author:

Our current uses of run_in_executor also utilize the default thread pool executor. We could configure the executor to have a higher default number of workers, but we'd still hit the same issue depending on the system's resource limits.

blink1073 (Member), Jan 15, 2025:

My point is that users' code will default to the default executor, so we'd be contending for its resources. We'd essentially be taking Guido's advice and applying it to the library, so it doesn't interfere with users who rely on the default executor.

Contributor Author:

Oh, you're saying we use our own ThreadPoolExecutor instance to avoid competing with the default executor? Does that make a difference when the underlying OS threads are still shared between the executors?

Member:

It does if the OS thread limit is much higher than the default thread executor's thread limit.

Contributor Author:

Good point, I like that idea then.

Member:

I would also suggest making a utility function for getaddrinfo to avoid repeating this ugly block everywhere. ;)

async def getaddrinfo(host, port, **kwargs):
    if not _IS_SYNC:
        loop = asyncio.get_running_loop()
        return await loop.getaddrinfo(host, port, **kwargs)  # type: ignore[assignment]

Member:

Shouldn't we be using run_in_executor here instead as well?

Contributor Author:

Good catch sorry, juggling too many changes at once 😅

@NoahStapp NoahStapp requested a review from blink1073 January 15, 2025 17:20

from concurrent.futures import ThreadPoolExecutor

_PYMONGO_EXECUTOR = ThreadPoolExecutor(thread_name_prefix="PYMONGO_EXECUTOR-")

Member:

I'm not sure I like this approach because we now have a thread pool that hangs around forever even after all clients have been closed.

My other comment was more around adding guidance for potential errors, not changing our implementation. Something like: "If your app runs into error XXX, it may mean your app's default loop executor is under-provisioned. Consider increasing the size of this thread pool or ..."

Contributor Author:

It could be difficult to distinguish when this issue occurs; having a separate thread pool for our internal use will help mitigate how common it is. Keeping a reference to an extra thread pool instance for the lifetime of the application shouldn't be expensive.

ShaneHarvey (Member), Jan 15, 2025:

Personally, I prefer we go with the loop.getaddrinfo approach because it avoids the complexity of managing our own thread pool. It's not really kosher to leave a thread pool open even if the threads are "idle". The limitation in loop.getaddrinfo is also an implementation detail that could be fixed at any point (even in a Python bugfix release).

I expect it will be clear when this issue occurs because a timeout error caused by threadpool starvation looks different than a real DNS timeout error. It should be simple to add an example to our docs by:

  1. saturating the executor with long-running tasks
  2. then attempting to run a client command
  3. recording the error
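Those three steps could be reproduced with a sketch like the following (the one-thread pool is deliberately tiny to force starvation quickly, and the plain getaddrinfo call stands in for a real client command):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

async def demo() -> float:
    loop = asyncio.get_running_loop()
    # 1. Saturate the executor: a one-thread pool plus one long task
    #    leaves no thread free for DNS work.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=1))
    blocker = loop.run_in_executor(None, time.sleep, 0.5)
    # 2. Attempt a lookup (a stand-in for a client command); it queues
    #    behind the blocker instead of failing with a DNS error.
    start = loop.time()
    await loop.getaddrinfo("localhost", 27017)
    elapsed = loop.time() - start
    await blocker
    # 3. Record the delay: the time was spent waiting for a free thread,
    #    not resolving the name.
    return elapsed

delay = asyncio.run(demo())
```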

Contributor Author:

I don't see much complexity in managing our own thread pool, but I totally understand the desire to not have an extra pool lying around. I'll revert back to using loop.getaddrinfo() once I have a good example for our docs.

NoahStapp (Contributor Author), Jan 15, 2025:

After investigating, I believe the docs are slightly misleading: what actually happens when the executor pool is fully saturated is that any loop.getaddrinfo() call blocks until a thread is freed up. There's no timeout mechanism inherent to the executor pool. We could add our own timeout to every loop.run_in_executor() call to prevent users from accidentally blocking the driver forever if they saturate the default executor permanently, but then we would cause timeouts whenever a response is merely slow.

If we don't add any timeouts to those calls, users will experience slowdowns whenever they perform a driver operation while the default executor pool is fully saturated. That's preferable to spurious timeouts in my opinion, especially when the user's own code is what determines the frequency of the timeouts.
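For illustration, the timeout variant being rejected here would look roughly like this (a sketch only; the function name and default timeout are hypothetical, and the thread decided against this precisely because a slow but healthy lookup would also trip the timeout):

```python
import asyncio
import socket

async def getaddrinfo_with_timeout(host: str, port: int, timeout: float = 10.0) -> list:
    loop = asyncio.get_running_loop()
    # wait_for bounds the total wait: time spent queued for a free executor
    # thread *plus* the lookup itself. Note that cancelling the future does
    # not stop the underlying thread once the lookup has started.
    return await asyncio.wait_for(
        loop.run_in_executor(None, lambda: socket.getaddrinfo(host, port)),
        timeout=timeout,
    )
```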

Member:

Thanks for investigating, I agree with that. The cpython ticket references anyio, so that could explain the difference.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 15, 2025 21:54
@@ -68,6 +70,24 @@ def inner(*args: Any, **kwargs: Any) -> Any:
return cast(F, inner)


def getaddrinfo(

Member:

nit: let's rename this _getaddrinfo.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 16, 2025 13:32

blink1073 (Member) left a comment:

LGTM!

@NoahStapp NoahStapp merged commit e4d8449 into mongodb:master Jan 17, 2025
46 of 48 checks passed