PYTHON-5021 - Fix usages of getaddrinfo to be non-blocking #2059


Merged Jan 17, 2025 (11 commits)

Conversation

NoahStapp (Contributor):

No description provided.

NoahStapp (Contributor Author):

Typing failures fixed by #2060.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 14, 2025 21:46
    hostname, None, 0, 0, socket.IPPROTO_TCP, socket.AI_CANONNAME
)[0]
if not _IS_SYNC:
    loop = asyncio.get_event_loop()

Member:

Should this be get_running_loop?

Member:

Yes: "Because this function has rather complex behavior (especially when custom event loop policies are in use), using the get_running_loop() function is preferred to get_event_loop() in coroutines and callbacks."

https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.get_event_loop
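The difference can be sketched as follows (a minimal illustration, not code from this PR; the function name `resolve` is hypothetical):

```python
import asyncio

async def resolve(host: str, port: int) -> list:
    # Inside a coroutine, get_running_loop() is preferred: it returns the
    # loop that is actually running, and raises RuntimeError if no loop is
    # running, instead of implicitly creating one like get_event_loop() may.
    loop = asyncio.get_running_loop()
    return await loop.getaddrinfo(host, port)

# Outside a running loop, get_running_loop() fails fast:
try:
    asyncio.get_running_loop()
except RuntimeError as exc:
    print(f"no running loop: {exc}")
```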

Contributor Author:

Good find. We use asyncio.get_event_loop() elsewhere in the code; I'll open a separate ticket to change those uses as well.

Contributor Author:

Addressed for the rest of the codebase in #2063.

if not _IS_SYNC:
    loop = asyncio.get_event_loop()
    af, socktype, proto, canonname, sockaddr = (
        await loop.getaddrinfo(

Member:

FYI:

Note: Both getaddrinfo and getnameinfo internally utilize their synchronous versions through the loop's default thread pool executor. When this executor is saturated, these methods may experience delays, which higher-level networking libraries may report as increased timeouts. To mitigate this, consider using a custom executor for other user tasks, or setting a default executor with a larger number of workers.

https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.getaddrinfo

Which means our users will eventually run into this issue: python/cpython#112169

This is still better than blocking the loop, of course, but I wonder whether we need to warn about this potential problem or test it explicitly.
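One mitigation the linked docs suggest is installing a larger default executor. A sketch of that idea (the worker count and host/port are illustrative, not values from this PR):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def main() -> list:
    loop = asyncio.get_running_loop()
    # The stock default executor caps at min(32, os.cpu_count() + 4) threads.
    # When user tasks saturate it, loop.getaddrinfo() queues behind them.
    # Installing a bigger default executor (64 here is arbitrary) raises
    # that ceiling for everything sharing the loop:
    loop.set_default_executor(ThreadPoolExecutor(max_workers=64))
    return await loop.getaddrinfo("localhost", 27017)

infos = asyncio.run(main())
```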

Contributor Author:

Would explicitly testing it provide an actionable solution? We could increase the default number of workers to help mitigate this, but warning users might only confuse them since this is an internal API.

Member:

Should we instead use run_in_executor and have our own executor? We use run_in_executor in _configured_socket as well.
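A dedicated-executor variant of that suggestion might look roughly like this sketch (the pool name, sizing, and function name are hypothetical, not pymongo's actual implementation):

```python
import asyncio
import socket
from concurrent.futures import ThreadPoolExecutor

# A dedicated pool so driver DNS lookups never compete with tasks the
# application schedules on the loop's default executor.
_DNS_EXECUTOR = ThreadPoolExecutor(max_workers=8, thread_name_prefix="dns")

async def getaddrinfo_dedicated(host: str, port: int) -> list:
    loop = asyncio.get_running_loop()
    # Passing an explicit executor to run_in_executor bypasses the
    # default pool entirely.
    return await loop.run_in_executor(
        _DNS_EXECUTOR, lambda: socket.getaddrinfo(host, port)
    )
```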

Contributor Author:

Our current uses of run_in_executor also utilize the default thread pool executor. We could configure the executor to have a higher default number of workers, but we'd still hit the same issue depending on the system's resource limits.

blink1073 (Member), Jan 15, 2025:

My point is that users' code will default to the default executor, so we'd be contending for its resources. We'd essentially be taking Guido's advice and applying it to the library, so it doesn't interfere with users who rely on the default executor.

Contributor Author:

Oh, you're saying we use our own ThreadPoolExecutor instance to avoid competing with the default executor? Does that make a difference when the underlying OS threads are still shared between the executors?

Member:

It does if the OS thread limit is much higher than the default thread executor's thread limit.

Contributor Author:

Good point, I like that idea then.

Member:

I would also suggest making a utility function for getaddrinfo to avoid repeating this ugly block everywhere. ;)

async def getaddrinfo(host, port, **kwargs):
    if not _IS_SYNC:
        loop = asyncio.get_running_loop()
        return await loop.getaddrinfo(host, port, **kwargs)  # type: ignore[assignment]

Member:

Shouldn't we be using run_in_executor here instead as well?

Contributor Author:

Good catch sorry, juggling too many changes at once 😅

@NoahStapp NoahStapp requested a review from blink1073 January 15, 2025 17:20

from concurrent.futures import ThreadPoolExecutor

_PYMONGO_EXECUTOR = ThreadPoolExecutor(thread_name_prefix="PYMONGO_EXECUTOR-")

Member:

I'm not sure I like this approach because we now have a thread pool that hangs around forever even after all clients have been closed.

My other comment was more around adding guidance for potential errors, not changing our implementation. Something like: "If your app runs into error XXX, it may mean your app's default loop executor is under-provisioned. Consider increasing the size of this thread pool or ..."

Contributor Author:

It could be difficult to distinguish when this issue occurs; having a separate thread pool for our internal use will help mitigate how common it is. Keeping a reference to an extra thread pool instance for the lifetime of the application shouldn't be expensive.

ShaneHarvey (Member), Jan 15, 2025:

Personally, I prefer we go with the loop.getaddrinfo approach because it avoids the complexity of managing our own thread pool. It's not really kosher to leave a thread pool open even if the threads are "idle". The limitation in loop.getaddrinfo is also an implementation detail that could be fixed at any point (even in a Python bugfix release).

I expect it will be clear when this issue occurs because a timeout error caused by threadpool starvation looks different than a real DNS timeout error. It should be simple to add an example to our docs by:

  1. saturating the executor with long-running tasks
  2. then attempting to run a client command
  3. recording the error
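Those three steps could be reproduced with a sketch like the following (the one-thread pool is deliberately tiny to force starvation quickly, and the plain getaddrinfo call stands in for a real client command):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

async def demo() -> float:
    loop = asyncio.get_running_loop()
    # 1. Saturate the executor: a one-thread pool plus one long task
    #    leaves no thread free for DNS work.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=1))
    blocker = loop.run_in_executor(None, time.sleep, 0.5)
    # 2. Attempt a lookup (a stand-in for a client command); it queues
    #    behind the blocker instead of failing with a DNS error.
    start = loop.time()
    await loop.getaddrinfo("localhost", 27017)
    elapsed = loop.time() - start
    await blocker
    # 3. Record the delay: the time was spent waiting for a free thread,
    #    not resolving the name.
    return elapsed

delay = asyncio.run(demo())
```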

Contributor Author:

I don't see much complexity in managing our own thread pool, but I totally understand the desire to not have an extra pool lying around. I'll revert back to using loop.getaddrinfo() once I have a good example for our docs.

NoahStapp (Contributor Author), Jan 15, 2025:

After investigating, I believe the docs are slightly misleading: what actually happens when the executor pool is fully saturated is that any loop.getaddrinfo() call blocks until a thread is freed up. There's no timeout mechanism inherent to the executor pool. We could add our own timeout to every loop.run_in_executor() call to prevent users from accidentally blocking the driver forever if they saturate the default executor permanently, but then we would cause timeouts whenever a response is merely slow.

If we don't add any timeouts to those calls, users will experience slowdowns whenever they perform a driver operation while the default executor pool is fully saturated. That's preferable to spurious timeouts in my opinion, especially when the user's own code is what determines the frequency of the timeouts.
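For illustration, the timeout variant being rejected here would look roughly like this (a sketch only; the function name and default timeout are hypothetical, and the thread decided against this precisely because a slow but healthy lookup would also trip the timeout):

```python
import asyncio
import socket

async def getaddrinfo_with_timeout(host: str, port: int, timeout: float = 10.0) -> list:
    loop = asyncio.get_running_loop()
    # wait_for bounds the total wait: time spent queued for a free executor
    # thread *plus* the lookup itself. Note that cancelling the future does
    # not stop the underlying thread once the lookup has started.
    return await asyncio.wait_for(
        loop.run_in_executor(None, lambda: socket.getaddrinfo(host, port)),
        timeout=timeout,
    )
```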

Member:

Thanks for investigating, I agree with that. The cpython ticket references anyio, so that could explain the difference.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 15, 2025 21:54
@@ -68,6 +70,24 @@ def inner(*args: Any, **kwargs: Any) -> Any:
return cast(F, inner)


def getaddrinfo(

Member:

nit: let's rename this _getaddrinfo.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 16, 2025 13:32

blink1073 (Member) left a comment:

LGTM!

@NoahStapp NoahStapp merged commit e4d8449 into mongodb:master Jan 17, 2025
46 of 48 checks passed