Replies: 2 comments
-
Anti-Pattern: Structured Error Data Buried in Exception Message StringsSummaryThis document catalogs a design problem in py-libp2p where structured exception data (error codes, close reasons, status codes, enum values) is extracted from library exceptions, formatted into human-readable message strings, and then — downstream — recovered via fragile string matching on those messages. This defeats the purpose of Python's exception type hierarchy and creates silent breakage whenever anyone changes a log message or an upstream library changes its error wording. The core anti-pattern: The Pythonic alternative: Current StatusThe WebSocket → Yamux path — the highest-severity instance — was fixed in commit A string-matching fallback remains in yamux for transports that haven't migrated yet (e.g. TCP Table of Contents
1. The WebSocket → Yamux Path (FIXED)This was the highest-severity instance of the anti-pattern. It has been fixed as of commit What was wrong
What was fixedNew exception class — class ConnectionClosedError(IOException):
"""Raised when a connection is closed by the peer."""
def __init__(
self,
message: str,
close_code: int | None = None,
close_reason: str = "",
transport: str = "",
) -> None:
super().__init__(message)
self.close_code = close_code
self.close_reason = close_reason
self.transport = transportWebSocket layer — return ConnectionClosedError(
f"WebSocket connection closed by peer during "
f"{operation} operation: code={close_code}, "
f"reason={close_reason}.",
close_code=close_code,
close_reason=close_reason,
transport="websocket",
)Yamux — except ConnectionClosedError as e:
# Typed exception — no string matching needed
logger.debug(
f"Stream {self.stream_id}: Window update failed due to "
f"connection closure (data was already read): {e}"
)
return
except (RawConnError, IOException) as e:
# Fallback for transports that don't yet raise ConnectionClosedError
error_str = str(e).lower()
if any(
keyword in error_str
for keyword in ["connection closed", "closed by peer", "connection is closed"]
):
logger.debug(...)
return
logger.warning(...)
raiseTests — # BEFORE (fragile)
error_msg = str(exc_info.value)
assert "WebSocket" in error_msg
assert "connection closed" in error_msg.lower()
assert "1000" in error_msg or "code" in error_msg.lower()
# AFTER (robust)
exc = exc_info.value
assert isinstance(exc, ConnectionClosedError)
assert exc.close_code == 1000
assert exc.transport == "websocket"What still uses string matching (fallback)The This fallback can be removed once TCP and other transports also raise What still uses string matching (WebSocket internal)
2. Remaining String-Matching Sites (UNFIXED)These are every place in non-test, non-example production code where exception behavior is still determined by parsing the message string. 2.1 Yamux: Fallback for Non-WebSocket Transports
Keywords matched: 2.2 WebSocket: Closure Detection (internal)
Keywords matched: 2.3 TLS: SSL Error Classification
Keywords matched: Note: 2.4 TLS: Certificate Extension Check
Keywords matched: 2.5 QUIC Error Context Manager: Exception Type Classification
Keywords matched: This is particularly notable because it checks the exception type name as a string ( 2.6 QUIC Stream: Flow Control Detection
Keywords matched: 2.7 QUIC Stream: Shutdown Race Detection
Keywords matched: 2.8 QUIC Stream: Reset Error Classification
Keywords matched: 2.9 UPnP: Quirky Library Workaround
Keywords matched: This is a documented upstream library bug workaround — arguably the only place where string matching on an exception message is defensible, since the upstream bug is the root cause. 2.10 Bootstrap: Connection Error Classification
Keywords matched: 3. Remaining Data-Burial Sites (UNFIXED)These are places where structured exception data is extracted and then formatted into a plain string, losing the structured form. Section 3.1 (WebSocket close code/reason) has been fixed — the others remain.
|
| File | Lines | Data Lost |
|---|---|---|
libp2p/transport/websocket/connection.py |
326 | Original exception type and attributes |
libp2p/transport/websocket/connection.py |
427 | Original exception type and attributes |
libp2p/transport/websocket/connection.py |
440 | Original exception type and attributes |
libp2p/transport/websocket/connection.py |
476 | Original exception type and attributes |
Pattern: raise IOException(f"Read failed: {str(e)}") or raise IOException(f"Write failed: {str(e)}"). The original exception's type and any attributes are lost (some use from e, some don't). These are non-closure errors that don't go through _handle_connection_closed_exception.
3.3 QUIC Error Context: aioquic Attributes → String
| File | Lines | Data Lost |
|---|---|---|
libp2p/transport/quic/exceptions.py |
367-388 | error_code, stream_id from aioquic exceptions |
Despite QUICConnectionClosedError and other QUIC exceptions accepting error_code parameters, the QUICErrorContext.__exit__() method doesn't extract them from the original exception. It just formats {exc_val} into the message.
3.4 Circuit Relay: Status Code → ConnectionError
| File | Lines | Data Lost |
|---|---|---|
libp2p/relay/circuit_v2/protocol.py |
771-779 | status_code: StatusCode enum |
libp2p/relay/circuit_v2/transport.py |
419-424 | status_code: StatusCode enum |
libp2p/relay/circuit_v2/transport.py |
577-582 | status_code: StatusCode enum |
libp2p/relay/circuit_v2/transport.py |
844-845 | status_code: StatusCode enum |
Pattern: raise ConnectionError(f"Relay connection failed: {status_msg}"). The status_code enum is available but not stored on the exception.
3.5 TCP Transport: OS Error Details → OpenConnectionError
| File | Lines | Data Lost |
|---|---|---|
libp2p/transport/tcp/tcp.py |
172-184 | OSError.errno, OSError.strerror |
Pattern: raise OpenConnectionError(f"Failed to open TCP stream to {maddr}: {error}") from error. While from error preserves the chain, OpenConnectionError doesn't carry errno.
3.6 Network Stream: QUIC Stream Attributes → StreamError
| File | Lines | Data Lost |
|---|---|---|
libp2p/network/stream/net_stream.py |
236 | QUICStreamError.error_code, .stream_id |
libp2p/network/stream/net_stream.py |
291 | Same |
Pattern: raise StreamError(f"Read operation failed: {error}") from error. The QUIC-specific structured attributes are lost.
3.7 Swarm: Various → SwarmException
| File | Lines | Data Lost |
|---|---|---|
libp2p/network/swarm.py |
249 | PeerStoreError details |
libp2p/network/swarm.py |
430 | Mux upgrade error details |
libp2p/network/swarm.py |
563 | Stream creation error details |
libp2p/network/swarm.py |
911 | Resource manager exception attributes |
Pattern: raise SwarmException(f"Failed to create stream to peer {peer_id}") from e.
4. Contrast: Where It's Done Right
The codebase has examples of proper exception design that prove the anti-pattern is avoidable.
4.1 ConnectionClosedError (NEW — the fix from this PR)
libp2p/io/exceptions.py — carries close_code, close_reason, and transport:
class ConnectionClosedError(IOException):
def __init__(
self,
message: str,
close_code: int | None = None,
close_reason: str = "",
transport: str = "",
) -> None:
super().__init__(message)
self.close_code = close_code
self.close_reason = close_reason
self.transport = transportYamux catches it by type: except ConnectionClosedError. No string matching needed. Being a subclass of IOException, it's backward compatible with all existing except IOException handlers.
4.2 QUIC Exception Hierarchy
libp2p/transport/quic/exceptions.py defines a rich hierarchy where exceptions carry structured data:
class QUICError(Exception):
def __init__(self, message: str, error_code: int | None = None):
super().__init__(message)
self.error_code = error_code
class QUICStreamError(QUICError):
def __init__(self, message: str, stream_id: str | None = None, error_code: int | None = None):
super().__init__(message, error_code)
self.stream_id = stream_idIrony: The QUIC exceptions are well-designed, but the code that raises them (QUICErrorContext.__exit__()) doesn't extract the structured attributes from the original exceptions — it just formats str(exc_val) into the message. The structured fields on the QUIC exceptions go unused.
4.3 IncompleteReadError
libp2p/io/exceptions.py — carries both expected_bytes and received_bytes:
class IncompleteReadError(IOException):
def __init__(self, message: str, expected_bytes: int = 0, received_bytes: int = 0):
super().__init__(message)
self.expected_bytes = expected_bytes
self.received_bytes = received_bytes
@property
def is_clean_close(self) -> bool:
return self.received_bytes == 0This allows downstream code to do if e.is_clean_close: instead of if "0 bytes" in str(e).
4.4 RendezvousError
libp2p/discovery/rendezvous/errors.py — carries a status enum:
class RendezvousError(Exception):
def __init__(self, status: Message.ResponseStatus.ValueType, message: str = ""):
self.status = status
self.message = message
super().__init__(f"Rendezvous error {status}: {message}")Subclasses like InvalidNamespaceError, InvalidTTLError, etc. set the status automatically. Downstream code catches specific types instead of parsing strings.
4.5 ResourceLimitExceeded
libp2p/rcmgr/exceptions.py — carries scope_name, resource_type, requested, available:
class ResourceLimitExceeded(ResourceManagerException):
def __init__(self, message, scope_name=None, resource_type=None, requested=None, available=None):
super().__init__(message)
self.scope_name = scope_name
self.resource_type = resource_type
self.requested = requested
self.available = available5. Test Code Still Infected by the Anti-Pattern
Fixed tests (PR #1213)
The following test files were updated to assert on exception type and structured attributes instead of string matching:
tests/core/transport/websocket/test_websocket_closure_handling.py— now usespytest.raises(ConnectionClosedError)and asserts.close_code,.close_reason,.transporttests/core/stream_muxer/yamux/test_yamux_window_update_error_handling.py— now usesConnectionClosedError(...)in mocks and tests both the typed path and the string-matching fallback
Remaining tests with string-matching assertions
These tests still assert on exception message content and will break if messages are reworded:
| File | Lines | What's Being Asserted via String |
|---|---|---|
tests/core/transport/websocket/test_websocket_connection_closure.py |
94-98 | "websocket", "connection closed", "1000", "code" |
tests/core/transport/websocket/test_websocket_connection_closure.py |
127-128 | "WebSocket", "connection" |
tests/core/transport/websocket/test_websocket_connection_closure.py |
144-149 | "websocket", "connection", "closed", "peer" |
tests/core/transport/websocket/test_websocket_connection_closure.py |
169-172 | "WebSocket", "connection closed", "1000" |
tests/core/transport/websocket/test_websocket_connection_closure.py |
190-193 | "1001", "Going away" |
tests/core/transport/websocket/test_websocket_yamux_nim_compat.py |
97-101 | "websocket", "connection closed", "1000" |
tests/core/transport/websocket/test_websocket_yamux_nim_compat.py |
129-130 | "WebSocket", "connection" |
tests/core/transport/websocket/test_websocket_yamux_nim_compat.py |
149-152 | "websocket", "connection", "closed", "peer" |
tests/core/transport/websocket/test_websocket_yamux_nim_compat.py |
174-175 | "connection closed", "1000" |
tests/core/transport/websocket/test_websocket_yamux_nim_compat.py |
195-196 | "1001", "Going away" |
tests/core/transport/test_tcp.py |
224-225 | "Failed to open TCP stream", "Failed to dial" |
Note: test_websocket_connection_closure.py and test_websocket_yamux_nim_compat.py still pass because ConnectionClosedError is a subclass of IOException, so their pytest.raises(IOException) assertions still work. However, they continue to assert on message strings rather than structured attributes. These can be migrated to use ConnectionClosedError in a follow-up.
6. Proposed Fixes for Remaining Modules
Fix 1: ConnectionClosedError subclass of IOException (DONE)
ConnectionClosedError subclass of IOExceptionImplemented in commit 635ce90f. See Section 1.
Fix 2: Make TCP transport raise ConnectionClosedError
This would allow removing the string-matching fallback in yamux entirely:
# libp2p/transport/tcp/tcp.py or libp2p/network/connection/raw_connection.py
# When a TCP connection is detected as closed:
raise ConnectionClosedError(
"TCP connection closed by peer",
transport="tcp",
)Then yamux simplifies to:
except ConnectionClosedError:
logger.debug(...)
return
except (RawConnError, IOException):
logger.warning(...)
raiseFix 3: Populate QUIC exception attributes in QUICErrorContext
The QUIC exception classes already have the right attributes — they're just never populated.
# libp2p/transport/quic/exceptions.py — QUICErrorContext.__exit__()
if "ConnectionClosed" in str(exc_type):
error_code = getattr(exc_val, "error_code", None)
raise QUICConnectionClosedError(
f"Connection closed during {self.operation}: {exc_val}",
error_code=error_code, # ← actually extract and pass it
) from exc_valFix 4: Import aioquic exceptions directly in QUICErrorContext
Instead of "ConnectionClosed" in str(exc_type), use isinstance():
from aioquic.quic.connection import QuicConnectionError # or whatever the actual class is
if isinstance(exc_val, QuicConnectionError):
...Fix 5: Use ssl.SSLError attributes instead of string matching
ssl.SSLError has .reason and .library attributes:
# BEFORE
if "alert" in error_str.lower():
...
if "EOF" in error_str:
...
# AFTER
if e.reason and "ALERT" in e.reason:
...
if e.reason == "EOF_OCCURRED" or e.errno == ssl.SSL_ERROR_EOF:
...Fix 6: RelayConnectionError for Circuit Relay
class RelayConnectionError(ConnectionError):
def __init__(self, message: str, status_code: StatusCode, status_msg: str = ""):
super().__init__(message)
self.status_code = status_code
self.status_msg = status_msgFix 7: Use isinstance for trio_websocket.ConnectionClosed
Replace _is_connection_closed_exception() string matching with:
try:
from trio_websocket import ConnectionClosed
except ImportError:
ConnectionClosed = None
def _is_connection_closed_exception(self, e: Exception) -> bool:
if ConnectionClosed is not None and isinstance(e, ConnectionClosed):
return True
# Fallback for other exception types
return "closed" in str(e).lower()7. Risk Assessment
Severity by Instance
| Location | Status | Severity | Reason |
|---|---|---|---|
| WebSocket → Yamux typed exception | FIXED | Now caught by type via ConnectionClosedError. |
|
| Yamux string-matching fallback (TCP) | Unfixed | Medium | Temporary; removable once TCP raises ConnectionClosedError. Only affects TCP transport, which is less prone to the immediate-close race that triggered the original issue. |
WebSocket _is_connection_closed_exception |
Unfixed | Medium | Still does string matching on trio_websocket exceptions. Could use isinstance. |
| QUIC Error Context type-name matching | Unfixed | High | If aioquic renames its exception class, all QUIC error mapping silently breaks. |
QUIC stream "flow control" matching |
Unfixed | Medium | Could misclassify errors if aioquic changes its message. |
QUIC stream "after fin" / "unknown peer-initiated stream" matching |
Unfixed | Medium | Shutdown race handling depends on aioquic's internal error messages. |
TLS "alert" / "EOF" matching |
Unfixed | Medium | ssl.SSLError has proper attributes; string matching is unnecessary. |
| TLS certificate extension string match | Unfixed | Low | Specific to a known edge case (autotls certs). |
Bootstrap "no addresses" match |
Unfixed | Low | Only affects logging verbosity, not control flow. |
UPnP "Success" match |
Unfixed | Low | Documented upstream bug workaround; no better alternative exists. |
| Circuit Relay status code loss | Unfixed | Low | No downstream code currently recovers the status code. |
| Remaining test string assertions | Unfixed | Medium | Tests will break on innocent message refactors. |
Why This Matters
-
Silent breakage: The string-matching code doesn't fail loudly — it takes the wrong branch. In the yamux case (now fixed), a connection closure error would have been re-raised instead of suppressed, potentially cascading into stream failures.
-
Upstream fragility: The matched strings come from
trio_websocket,aioquic, andssl— libraries py-libp2p doesn't control. A minor patch release could change error messages. -
Refactoring hazard: Anyone improving an error message must now grep the entire codebase for string-matching sites. This is a maintenance trap.
-
Testing burden: Tests that assert on string content are brittle and produce false failures on innocent changes.
Beta Was this translation helpful? Give feedback.
-
|
@acul71 I am working on this. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Problem
Several places in py-libp2p extract structured error data (close codes, status codes, error codes) from library exceptions, format it into a human-readable message string, and then—downstream—recover it via fragile string matching on
str(e).lower(). This defeats Python's exception type hierarchy and creates silent breakage whenever error messages change.The core anti-pattern:
Partial Fix Already Landed
PR #1213 (commit
635ce90f) fixed the highest-severity instance—the WebSocket → Yamux path—by introducingConnectionClosedError(IOException)with structuredclose_code,close_reason, andtransportattributes. Yamux now catches it by type (except ConnectionClosedError) instead of parsing the message string.A string-matching fallback remains for transports that haven't migrated yet (e.g. TCP
RawConnError).Remaining Instances
The pattern still exists in 8+ production code sites across these modules:
QUICErrorContext"ConnectionClosed" in str(exc_type)) instead ofisinstance()"flow control","after fin","unknown peer-initiated stream""alert","EOF"onssl.SSLError(which has.reason/.errnoattributes)"expected certificate to contain the key extension"onValueError"no addresses established a successful connection"onSwarmExceptionstr(e) == "Success"(upstream library bug workaround—acceptable)status_codeenum extracted but not stored on raisedConnectionErrorRawConnError(temporary until TCP migrates)Additionally, 11+ test files assert on message strings instead of exception types/attributes.
Proposed Approach
Follow the same pattern established by
ConnectionClosedError:ConnectionClosedErroron connection closure → removes yamux fallbackisinstance()instead of type-name strings; populateerror_codeon QUIC exception classes (they already accept it but it's never passed)ssl.SSLError.reason/.errnoinstead of string matchingRelayConnectionErrorwithstatus_codeattributeisinstance(e, trio_websocket.ConnectionClosed)instead of string matching in_is_connection_closed_exception()See the full analysis with file paths, line numbers, and code samples in downloads/Exception_error_strings_to_codes.md.
Discussion Questions
IOExceptiondirectly?ConnectionClosedErrorbe the shared type across all transports, or should each transport have its own subclass (e.g.QUICConnectionClosedErroralready exists)?str(e) == "Success"case is an upstream bug—should we just document it and move on?Beta Was this translation helpful? Give feedback.
All reactions