@julianz-
Contributor

@julianz- julianz- commented Oct 5, 2025

Fixed race conditions to make socket I/O more resilient during connection teardown.

  1. BufferedWriter's write(): Added error handling to ignore common socket errors (e.g., ECONNRESET, EPIPE, ENOTCONN, EBADF) that occur when the underlying connection has been unexpectedly closed by the client or OS. This prevents a crash when attempting to write to a defunct socket.
  2. BufferedWriter's close(): Made idempotent, allowing safe repeated calls without raising exceptions.
  3. Windows: Added explicit handling, because Windows environments have been seen to raise the Windows-specific WSAENOTSOCK error.

Includes new unit tests to cover the idempotency and graceful handling of already closed underlying buffers.
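
For readers unfamiliar with the idea, here is a minimal sketch of what an idempotent close() can look like. The `IdempotentWriter` and `OnceOnlyStream` classes are hypothetical stand-ins for illustration only; they are not cheroot's BufferedWriter or its tests.

```python
class OnceOnlyStream:
    """Hypothetical underlying stream that raises if closed twice."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        if self.close_calls > 1:
            raise ValueError('stream already closed')


class IdempotentWriter:
    """Hypothetical wrapper whose close() is safe to call repeatedly."""

    def __init__(self, raw):
        self._raw = raw
        self._closed = False

    def close(self):
        """Close the underlying stream once; later calls are no-ops."""
        if self._closed:
            return
        # Flip the flag first so a failing close is not retried forever.
        self._closed = True
        self._raw.close()


raw = OnceOnlyStream()
writer = IdempotentWriter(raw)
writer.close()
writer.close()  # no-op: the underlying stream is not touched again
```

The guard flag means the underlying stream sees exactly one close() call no matter how many times the wrapper is closed.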



@julianz- julianz- closed this Oct 5, 2025
@julianz- julianz- reopened this Oct 5, 2025
@julianz- julianz- changed the title Fix race condition and improve robustness during socket I/O [DRAFT] Fix race condition and improve robustness during socket I/O Oct 5, 2025
@julianz- julianz- force-pushed the fix-socket-teardown branch 2 times, most recently from 887399d to 4f1662e Compare October 5, 2025 19:04
@codecov

codecov bot commented Oct 5, 2025

Codecov Report

❌ Patch coverage is 76.08696% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.19%. Comparing base (4a8dc43) to head (f7d8469).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #779      +/-   ##
==========================================
- Coverage   79.25%   79.19%   -0.06%     
==========================================
  Files          29       29              
  Lines        4203     4292      +89     
  Branches      539      544       +5     
==========================================
+ Hits         3331     3399      +68     
- Misses        728      747      +19     
- Partials      144      146       +2     

@julianz- julianz- force-pushed the fix-socket-teardown branch from 4f1662e to 4833dac Compare October 5, 2025 19:09
@julianz- julianz- marked this pull request as draft October 5, 2025 19:09
@julianz- julianz- changed the title [DRAFT] Fix race condition and improve robustness during socket I/O Fix race condition and improve robustness during socket I/O Oct 5, 2025
@julianz- julianz- changed the title Fix race condition and improve robustness during socket I/O Fix race conditions and improve robustness during socket I/O Oct 5, 2025
@julianz- julianz- marked this pull request as ready for review October 5, 2025 19:26
@julianz- julianz- force-pushed the fix-socket-teardown branch 2 times, most recently from 048d898 to f0471ca Compare October 7, 2025 03:27
@julianz- julianz- force-pushed the fix-socket-teardown branch 4 times, most recently from 1af9bf3 to 5f1d7f7 Compare October 21, 2025 23:38
@webknjaz webknjaz moved this to 🧐 @webknjaz's review queue 📋 in 📅 Procrastinating in public Oct 22, 2025
except OSError as sock_err:
    error_code = sock_err.errno
    if error_code in _errors.acceptable_sock_shutdown_error_codes:
        # The socket is gone, so just ignore this error.
Member

Given the refactoring of the exception hierarchy that is under way, this would also be replaced.

I wonder, however, if it's right to ignore the problem here. I think that if the connection is broken, such an exception should bubble up to the layer where the rest of the data-writing context lives. This would probably be the caller. And not just write() but whatever attempts writing. That layer would know how it needs to handle connection errors.

If we just suppress them, then write() would pretend to have written something into the socket successfully.
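
To make the layering concern concrete, here is a minimal sketch. `BrokenSocketFile`, `write_all`, and `respond` are hypothetical names invented for this illustration and do not exist in cheroot. The low-level write does not suppress anything, so the caller that owns the request context can tell that nothing was sent.

```python
import errno


class BrokenSocketFile:
    """Hypothetical socket file object whose peer has gone away."""

    def write(self, data):
        # Constructing OSError with an errno selects the matching
        # subclass, so this raises BrokenPipeError.
        raise OSError(errno.EPIPE, 'Broken pipe')


def write_all(wfile, data):
    """Low-level write: no suppression, errors propagate to the caller."""
    wfile.write(data)
    return len(data)


def respond(wfile, body):
    """Higher layer that owns the request context and decides what to do."""
    try:
        write_all(wfile, body)
    except ConnectionError as conn_err:
        return f'aborted: {type(conn_err).__name__}'
    return 'sent'


print(respond(BrokenSocketFile(), b'hello'))  # aborted: BrokenPipeError
```

Had `write_all` swallowed the error, `respond` would have returned 'sent' for a response that never left the process.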

Contributor Author

Good point. Maybe just remove this branch then and let the errors rise.

self.conn.wfile.write(chunk)
data = chunk

with contextlib.suppress(ConnectionError):
Member

Looking at the layering, I think that something up the stack should process this exception, not a low-ish level write method. What calls it?

Member

Looking deeper, I see that _process_connections_until_interrupted() in the thread pool is already set up to process connection errors. So this can just be dropped.

If a network error occurs at any time while processing a single request, there's nothing to do but log it. And such things should happen at the top level. Let the connection errors bubble up all the way to that layer and don't suppress them.

The only thing we need to make sure of is that any network errors that present themselves in other ways are turned into one of the specific connection error exceptions.
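
A rough sketch of that arrangement, under stated assumptions: `normalize_network_error` and `serve_one` are hypothetical helpers made up for this example, not functions from cheroot's thread pool. It relies on one documented Python behavior: `OSError(errno.ECONNRESET, ...)` already produces a `ConnectionResetError`, while `OSError(errno.EBADF, ...)` stays a plain `OSError` and therefore needs an explicit conversion before a top-level `except ConnectionError` can see it.

```python
import errno
import logging

logger = logging.getLogger('connection-sketch')


def normalize_network_error(handler):
    """Run handler, re-raising EBADF as ConnectionError (hypothetical)."""
    try:
        return handler()
    except OSError as os_err:
        if os_err.errno == errno.EBADF:
            raise ConnectionError(os_err.errno, os_err.strerror) from os_err
        raise


def serve_one(handler):
    """Top-level step: the only place that catches and logs network errors."""
    try:
        normalize_network_error(handler)
    except ConnectionError as conn_err:
        logger.info('Dropping connection: %r', conn_err)
        return False
    return True


def reset_by_peer():
    raise OSError(errno.ECONNRESET, 'Connection reset by peer')


def bad_descriptor():
    raise OSError(errno.EBADF, 'Bad file descriptor')


print(serve_one(reset_by_peer), serve_one(bad_descriptor), serve_one(lambda: None))
```

Both failure modes end up in the same top-level handler, and nothing in between has to decide whether a broken connection is acceptable.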

chunk_size_hex = hex(len(chunk))[2:].encode('ascii')
buf = [chunk_size_hex, CRLF, chunk, CRLF]
self.conn.wfile.write(EMPTY.join(buf))
data = EMPTY.join(buf)
Member

Suggested change
- data = EMPTY.join(buf)
+ self.conn.wfile.write(EMPTY.join(buf))

Member

@webknjaz webknjaz Oct 22, 2025

I think I saw a few places in this module where socket.error / OSError is being suppressed. I haven't checked them all but it seems they need to be unshielded too so that the connection error bubbles to the top properly.

"""Handle SysCallError during close/shutdown."""
try:
    # Call the proxied method (e.g., self._ssl_conn.close())
    return getattr(self._ssl_conn, method_name)()
Member

Let's just move calling self._ssl_conn.close() / self._ssl_conn.shutdown() into the methods above. And put the below exception handling/conversion into a decorator. It's quite easy to do with @contextlib.contextmanager.
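
As background for this suggestion: objects produced by a `@contextlib.contextmanager` function can also be applied as decorators, and a fresh context manager is created for each call of the decorated function. Below is a minimal self-contained sketch. `FakeSysCallError`, `morph_to_connection_error`, and `Conn` are stand-ins invented for the example (the real exception would be pyOpenSSL's `SSL.SysCallError`), and the error map is shortened.

```python
import contextlib
import errno


class FakeSysCallError(Exception):
    """Stand-in for SSL.SysCallError; args are (errno, message)."""


@contextlib.contextmanager
def morph_to_connection_error(action):
    """Translate FakeSysCallError into a ConnectionError subclass."""
    try:
        yield
    except FakeSysCallError as err:
        code = err.args[0] if err.args else None
        error_map = {
            errno.ECONNRESET: ConnectionResetError,
            errno.EPIPE: BrokenPipeError,
        }
        # Unknown codes fall back to the generic ConnectionError.
        conn_err_cls = error_map.get(code, ConnectionError)
        raise conn_err_cls(code, f'Failed to {action}') from err


class Conn:
    """Hypothetical connection whose close() always fails with EPIPE."""

    @morph_to_connection_error('close')
    def close(self):
        raise FakeSysCallError(errno.EPIPE, 'EPIPE')


try:
    Conn().close()
except BrokenPipeError as conn_err:
    print(type(conn_err).__name__, conn_err.errno == errno.EPIPE)
```

The decorated method keeps its own body free of error translation, which is the separation the comment above asks for.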

)
conn_err_cls = connection_error_map.get(
    error_code,
    errors.CherootConnectionError,
Member

Missed this place.

Comment on lines 274 to 320
def close(self):
    """Close the connection, translating OpenSSL errors for shutdown."""
    self._lock.acquire()
    try:
        return self._safe_close_call('close')
    finally:
        self._lock.release()

def shutdown(self):
    """Shutdown the connection, translating OpenSSL errors."""
    self._lock.acquire()
    try:
        return self._safe_close_call('shutdown')
    finally:
        self._lock.release()

def _safe_close_call(self, method_name):
    """Handle SysCallError during close/shutdown."""
    try:
        # Call the proxied method (e.g., self._ssl_conn.close())
        return getattr(self._ssl_conn, method_name)()
    except SSL.SysCallError as ssl_syscall_err:
        connection_error_map = {
            errno.EBADF: ConnectionError,  # socket is gone?
            errno.ECONNABORTED: ConnectionAbortedError,
            errno.ECONNREFUSED: ConnectionRefusedError,
            errno.ECONNRESET: ConnectionResetError,
            errno.ENOTCONN: ConnectionError,
            errno.EPIPE: BrokenPipeError,
            errno.ESHUTDOWN: BrokenPipeError,
        }
        error_code = (
            ssl_syscall_err.args[0] if ssl_syscall_err.args else None
        )
        error_msg = (
            os.strerror(error_code)
            if error_code is not None
            else repr(ssl_syscall_err)
        )
        conn_err_cls = connection_error_map.get(
            error_code,
            errors.CherootConnectionError,
        )
        raise conn_err_cls(
            error_code,
            f'Failed to {method_name!s} the PyOpenSSL connection: {error_msg!s}',
        ) from ssl_syscall_err
Member

Something like this should do:

Suggested change
@_morph_syscall_to_connection_error('close')
def close(self):
    """Close the connection, translating OpenSSL errors for shutdown."""
    with self._lock:
        return self._ssl_conn.close()

@_morph_syscall_to_connection_error('shutdown')
def shutdown(self):
    """Shutdown the connection, translating OpenSSL errors."""
    with self._lock:
        return self._ssl_conn.shutdown()

@contextlib.contextmanager
def _morph_syscall_to_connection_error(method_name, /):
    """Handle :exc:`SSL.SysCallError` in a wrapped method."""
    try:
        yield
    except SSL.SysCallError as ssl_syscall_err:
        connection_error_map = {
            errno.EBADF: ConnectionError,  # socket is gone?
            errno.ECONNABORTED: ConnectionAbortedError,
            errno.ECONNREFUSED: ConnectionRefusedError,
            errno.ECONNRESET: ConnectionResetError,
            errno.ENOTCONN: ConnectionError,
            errno.EPIPE: BrokenPipeError,
            errno.ESHUTDOWN: BrokenPipeError,
        }
        error_code = (
            ssl_syscall_err.args[0] if ssl_syscall_err.args else None
        )
        error_msg = (
            os.strerror(error_code)
            if error_code is not None
            else repr(ssl_syscall_err)
        )
        conn_err_cls = connection_error_map.get(
            error_code,
            ConnectionError,
        )
        raise conn_err_cls(
            error_code,
            f'Failed to {method_name!s} the PyOpenSSL connection: {error_msg!s}',
        ) from ssl_syscall_err

Comment on lines 37 to 48
except OSError as sock_err:
    error_code = sock_err.errno
    if error_code in _errors.acceptable_sock_shutdown_error_codes:
        # The socket is gone, so just ignore this error.
        return
    raise
else:
    # The 'try' block completed without an exception
    if n == 0:
        # This could happen with non-blocking write
        # when nothing was written
        break
Member

I'm now questioning the addition of this whole thing. Or at least the new except block. Yet to hear your thoughts about the else portion.

Contributor Author

I've removed the OSError exception handling, as I agree with your earlier point that we should just let those errors rise. Regarding the else block: I've changed it to a simple if statement that checks for n == 0 to prevent an infinite loop.
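
For illustration, a write loop with such a guard might look roughly like the sketch below. `write_all` and `StallingStream` are hypothetical names for this example, not the code in this PR. The guard uses `if not n`, which is an assumption on my part: it covers both 0 and the `None` that raw non-blocking streams may return.

```python
class StallingStream:
    """Hypothetical stream that accepts a few bytes, then stops accepting."""

    def __init__(self, budget):
        self.budget = budget
        self.received = bytearray()

    def write(self, chunk):
        # Accept at most 3 bytes per call, and nothing once the budget is spent.
        accepted = bytes(chunk[:min(3, self.budget)])
        self.budget -= len(accepted)
        self.received += accepted
        return len(accepted)


def write_all(raw, data):
    """Write data until done; stop if the stream accepts nothing."""
    view = memoryview(data)
    written = 0
    while written < len(view):
        n = raw.write(view[written:])
        if not n:
            # Nothing was written: without this check the loop would spin.
            break
        written += n
    return written


stream = StallingStream(budget=5)
print(write_all(stream, b'abcdefgh'), bytes(stream.received))  # 5 b'abcde'
```

Any exception raised by `raw.write()` still propagates unchanged; the guard only handles the case where the call succeeds but makes no progress.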

@webknjaz
Member

@julianz- could you scan the modules more broadly for places where we could stop suppressing connection errors in the low-level layers? I'm not looking into the tests until we figure out the architectural overview of the whole thing. But I feel like we're getting closer. I think we might need to split this PR into two at some point; we'll see.

@julianz-
Contributor Author

Thank you @webknjaz for all your great suggestions and points. I will take a look at everything. The layering is actually more complicated than I had realized.

@julianz- julianz- force-pushed the fix-socket-teardown branch 2 times, most recently from d9df8d9 to b26544f Compare October 22, 2025 04:01
Fixes to make socket I/O more resilient during connection teardown.

1. BufferedWriter's write(): Added error handling to ignore common
   socket errors (e.g., ECONNRESET, EPIPE, ENOTCONN, EBADF) that occur
   when the underlying connection has been unexpectedly closed by the
   client or OS. This prevents a crash when attempting to write to a
   defunct socket.
2. BufferedWriter's close(): Made idempotent, allowing safe repeated
   calls without raising exceptions.
3. Windows: Added explicit handling, because Windows environments have
   been seen to raise the Windows-specific WSAENOTSOCK error.

Includes new unit tests to cover the idempotency and graceful handling
of already closed underlying buffers.
@julianz- julianz- force-pushed the fix-socket-teardown branch from b26544f to f7d8469 Compare October 22, 2025 04:34