WebSocket Transport Interoperability Failures with Other libp2p Implementations #1082

@acul71

Summary

Python libp2p's WebSocket transport fails interoperability tests against several libp2p implementations (Nim v1.14, JVM v1.2, rust-v0.53, Chromium-Rust v0.53) when a stream multiplexer (yamux or mplex) runs over WebSocket. The peer closes the connection while Python is still reading stream data, and the resulting write and read failures cascade. The same tests pass against rust-v0.54 and later, which use proactive closure detection.

Affected Tests: 11 WebSocket tests fail across multiple implementations
Status: ✅ Works with rust-v0.54, rust-v0.55, rust-v0.56 (these versions have proactive closure detection)

Problem Statement

When Python's WebSocket transport is used with yamux or mplex stream multiplexers, connections fail during active stream operations. The failure occurs when:

  1. A ping stream is created successfully
  2. Data is sent over the stream
  3. Python attempts to read the response
  4. During the read operation, yamux tries to send a window update
  5. The WebSocket connection has already been closed by the peer
  6. The write operation fails, followed by read failures
  7. An IncompleteReadError is raised with a generic error message

Error Message:

Connection closed: expected 2 bytes but received 0 (connection may have been reset by peer)

Location: libp2p/io/utils.py:read_exactly() raises IncompleteReadError when WebSocket connection closes during yamux stream operations.
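
The error above can be reproduced in isolation. The following is a minimal sketch, not the library code: `ClosedReader`, the local `IncompleteReadError`, and the bounded `retry_count` default of 100 are assumptions made for illustration, and `asyncio` stands in for whatever async runtime the real transport uses. It shows how a reader that returns `b""` after the peer has closed ends up producing the generic "expected 2 bytes but received 0" message.

```python
import asyncio


class IncompleteReadError(Exception):
    """Local stand-in for libp2p's IncompleteReadError (assumption)."""


class ClosedReader:
    """Hypothetical reader mimicking a closed WebSocket: every read returns b""."""

    def __init__(self) -> None:
        self.calls = 0

    async def read(self, n: int) -> bytes:
        self.calls += 1
        return b""


async def read_exactly(reader, n: int, retry_count: int = 100) -> bytes:
    """Read exactly n bytes, giving up after retry_count extra attempts."""
    buffer = bytearray(await reader.read(n))
    for _ in range(retry_count):
        if len(buffer) >= n:
            return bytes(buffer)
        # An empty read is indistinguishable from "no data yet", so we retry.
        buffer.extend(await reader.read(n - len(buffer)))
    if len(buffer) >= n:
        return bytes(buffer)
    raise IncompleteReadError(
        f"Connection closed: expected {n} bytes but received {len(buffer)} "
        "(connection may have been reset by peer)"
    )


if __name__ == "__main__":
    try:
        asyncio.run(read_exactly(ClosedReader(), 2))
    except IncompleteReadError as exc:
        print(exc)
```

The reader is polled on every retry even though the connection is already gone, and the final message carries no hint that a WebSocket close frame was the cause.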

Affected Implementations

| Implementation | Failed Tests | Transports | Muxers | Security |
|---|---|---|---|---|
| Rust v0.53 | 4 | ws | yamux, mplex | noise |
| JVM v1.2 | 4 | ws | yamux, mplex | noise |
| Nim v1.14 | 4 | ws | yamux, mplex | noise |
| Chromium-Rust v0.53 | 1 | ws | mplex | noise |

Total: 11 WebSocket test failures

Note: All tests pass with rust-v0.54, rust-v0.55, and rust-v0.56, which implement proactive closure detection (see Rust PR #4568).

Root Cause Analysis

The Core Problem: Reactive vs Proactive Closure Detection

Python's Approach (Reactive):

  1. WebSocket close frame arrives → Connection closed by peer
  2. Python unaware → No proactive detection
  3. Yamux tries to read → read_exactly(12) called
  4. WebSocket read fails → ConnectionClosed exception
  5. Exception caught → Returns b"" (empty bytes)
  6. Yamux continues → Tries to send window update
  7. WebSocket write fails → "Connection closed" exception
  8. Error propagated → IncompleteReadError raised

Rust's Approach (Proactive):

  1. WebSocket close frame arrives → Incoming::Closed emitted
  2. Stream ends → BytesConnection::poll_next returns None
  3. EOF detected → RwStreamSink::poll_read returns Ok(0)
  4. Yamux detects closure → poll_next_inbound returns None → ConnectionError::Closed
  5. Connection state updated → No further writes attempted
  6. Window updates skipped → Connection already known to be closed
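
For comparison, the proactive flow can be approximated in Python. This is a rough sketch under stated assumptions, not a port of the Rust code: `FakeWebSocket` and `ProactiveMuxer` are hypothetical names, the close frame is modelled as `receive()` returning `None`, and `asyncio` is used only to keep the example self-contained. The point is that the receive loop records closure as soon as it sees EOF, so a later window update is skipped instead of failing on the wire.

```python
import asyncio


class FakeWebSocket:
    """Hypothetical transport: yields queued messages, then None for a close frame."""

    def __init__(self, messages) -> None:
        self._messages = list(messages)
        self.sent = []

    async def receive(self):
        await asyncio.sleep(0)  # yield to the event loop like a real socket would
        return self._messages.pop(0) if self._messages else None

    async def send(self, data: bytes) -> None:
        self.sent.append(data)


class ProactiveMuxer:
    """Records closure when the receive loop sees EOF, then refuses to write."""

    def __init__(self, ws: FakeWebSocket) -> None:
        self._ws = ws
        self.closed = False
        self.received = []

    async def receive_loop(self) -> None:
        while True:
            message = await self._ws.receive()
            if message is None:  # close frame: treat as EOF, not as an error
                self.closed = True
                return
            self.received.append(message)

    async def send_window_update(self, frame: bytes) -> bool:
        if self.closed:  # connection already known to be closed: skip the write
            return False
        await self._ws.send(frame)
        return True


async def demo():
    ws = FakeWebSocket([b"ping-response"])
    muxer = ProactiveMuxer(ws)
    await muxer.receive_loop()
    sent = await muxer.send_window_update(b"window-update")
    return muxer.received, sent, ws.sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the ping response is still delivered, and no write is attempted after the close frame has been observed.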

Why TCP Works But WebSocket Doesn't

TCP:

  • Connection stays open until explicitly closed
  • Both sides can read/write independently
  • FIN on a stream doesn't close the TCP connection
  • Yamux can continue operating normally

WebSocket:

  • Message-based protocol with explicit close frames
  • When peer sends close frame, connection is closed
  • Python doesn't detect closure until attempting read/write
  • No way to "keep connection open for new streams" if it's already closed

Technical Details

Failure Sequence (from test logs):

1. Python creates ping stream successfully
2. Python sends ping data
3. Python tries to read ping response (stream.read(PING_LENGTH))
4. During the read, yamux tries to send window update (after reading data)
5. WebSocket write fails: "connection closed" - Peer has already closed the WebSocket
6. WebSocket read fails: "expected 2 bytes but received 0"

Key Code Locations:

  1. libp2p/transport/websocket/connection.py (lines 341-353):

    • Returns b"" when connection closes instead of raising an exception
    • This causes read_exactly() to keep retrying on empty reads until it gives up with IncompleteReadError
    • Connection closure is detected reactively during operations
  2. libp2p/io/utils.py (lines 27-30):

    • Generic error message doesn't provide context about transport type
    • No indication that this is a WebSocket-specific issue
  3. libp2p/stream_muxer/yamux/yamux.py (lines 648-656):

    • Generic error logging doesn't include transport context
    • Makes debugging difficult

Proposed Solution

1. Implement Proactive Closure Detection

Goal: Detect WebSocket connection closure through stream polling before attempting operations, not during operations.

Changes Needed:

  • Monitor WebSocket connection state during stream polling
  • Treat EOF as Ok(0) (normal condition), not an exception
  • Track connection state and skip operations on closed connections
  • Detect closure during stream polling, not during read/write operations

2. Graceful EOF Handling

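Current Issue: When the WebSocket connection closes, connection.py:read() returns b"". read_exactly() cannot tell this apart from "no data yet", so it keeps retrying on empty reads before finally raising a generic IncompleteReadError.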

Fix: Raise IOException with clear message when connection closes, instead of returning empty bytes.

File: libp2p/transport/websocket/connection.py

Before (lines 341-353):

except Exception as e:
    if ("CloseReason" in error_str or "ConnectionClosed" in error_type):
        self._closed = True
        return b""  # ❌ read_exactly() sees an empty read and keeps retrying

After:

except Exception as e:
    if ("CloseReason" in error_str or "ConnectionClosed" in error_type):
        self._closed = True
        # Surface the close code and reason so callers can see why the read failed
        close_code = getattr(e, 'code', None)
        close_reason = getattr(e, 'reason', None) or "Connection closed by peer"
        raise IOException(
            "WebSocket connection closed by peer during read operation: "
            f"code={close_code}, reason={close_reason}. "
            "This may indicate the peer closed the connection, a network issue, "
            "or a protocol error during yamux stream operation."
        ) from e
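
To make the intended behaviour easier to check in isolation, here is a small standalone sketch of the same translation step. It is synchronous for brevity, and `FakeConnectionClosed`, `to_io_exception`, `read_or_raise`, and the local `IOException` are hypothetical stand-ins rather than names from the codebase or from any WebSocket library. It illustrates the `getattr` fallbacks when the underlying exception carries no reason, and keeps the original exception attached as the cause.

```python
class IOException(Exception):
    """Local stand-in for libp2p's IOException (assumption)."""


class FakeConnectionClosed(Exception):
    """Hypothetical close exception carrying a WebSocket close code and reason."""

    def __init__(self, code=None, reason=None) -> None:
        super().__init__(f"closed: code={code}, reason={reason}")
        self.code = code
        self.reason = reason


def to_io_exception(exc: Exception) -> IOException:
    """Build a descriptive IOException from whatever close details are present."""
    close_code = getattr(exc, "code", None)
    close_reason = getattr(exc, "reason", None) or "Connection closed by peer"
    return IOException(
        "WebSocket connection closed by peer during read operation: "
        f"code={close_code}, reason={close_reason}."
    )


def read_or_raise(raw_read):
    """Call raw_read(); translate a close exception instead of returning b""."""
    try:
        return raw_read()
    except FakeConnectionClosed as exc:
        # Chain the original exception so the traceback keeps the real cause.
        raise to_io_exception(exc) from exc


def closed_read():
    """Simulate a read that hits a normal (1000) close without a reason string."""
    raise FakeConnectionClosed(code=1000)


if __name__ == "__main__":
    try:
        read_or_raise(closed_read)
    except IOException as err:
        print(err)
```

Because the caller now gets an exception rather than empty bytes, a retry loop higher up has nothing to retry on and fails on the first attempt.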

3. Enhanced Error Messages

File: libp2p/io/utils.py

Add transport context to IncompleteReadError messages:

raise IncompleteReadError(
    f"Connection closed during read operation: expected {n} bytes but received {len(buffer)} bytes. "
    f"This may indicate the peer closed the connection prematurely, a network issue, "
    f"or a transport-specific problem (e.g., WebSocket message boundary handling)."
    f"{context_info}"
)
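
The snippet above does not say how `context_info` is built. One possible shape, offered purely as an assumption, is a helper that formats the transport name and the connection's age into a suffix. `build_context_info` and its parameters are hypothetical names, not part of the existing code.

```python
import time
from typing import Optional


def build_context_info(
    transport: Optional[str],
    started_at: Optional[float],
    now: Optional[float] = None,
) -> str:
    """Return a suffix such as ' (transport: websocket, duration: 0.12s)'.

    started_at and now are monotonic timestamps in seconds; now defaults to
    the current monotonic clock. Missing fields are left out of the suffix.
    """
    parts = []
    if transport:
        parts.append(f"transport: {transport}")
    if started_at is not None:
        current = time.monotonic() if now is None else now
        parts.append(f"duration: {current - started_at:.2f}s")
    return f" ({', '.join(parts)})" if parts else ""


if __name__ == "__main__":
    print(repr(build_context_info("websocket", 10.0, now=10.12)))
```

Returning an empty string when nothing is known keeps the message unchanged for transports that do not supply any context.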

4. Connection State Tracking

Add connection state monitoring to detect closure earlier:

  • Check connection state before operations
  • Handle WebSocket close frames proactively
  • Detect connection closure from read operations immediately
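
A minimal sketch of what such state tracking could look like follows. The names `ConnectionState`, `TrackedConnection`, and `ConnectionClosedError` are illustrative assumptions, and the class only records writes in a list rather than touching a real socket. The idea is that the close-frame handler updates the state immediately, and every operation checks that state before doing any I/O.

```python
import enum


class ConnectionState(enum.Enum):
    OPEN = "open"
    CLOSING = "closing"  # local side has started the close handshake
    CLOSED = "closed"    # close frame received or handshake finished


class ConnectionClosedError(Exception):
    """Hypothetical error raised when an operation is attempted after closure."""


class TrackedConnection:
    def __init__(self) -> None:
        self.state = ConnectionState.OPEN
        self.close_code = None
        self.outbox = []  # stands in for bytes written to the socket

    def begin_close(self) -> None:
        """Mark that the local side initiated a close."""
        if self.state is ConnectionState.OPEN:
            self.state = ConnectionState.CLOSING

    def on_close_frame(self, code: int) -> None:
        """Record a close frame from the peer as soon as it is seen."""
        self.state = ConnectionState.CLOSED
        self.close_code = code

    def ensure_open(self) -> None:
        """Fail fast, before any I/O, if the connection is not open."""
        if self.state is not ConnectionState.OPEN:
            raise ConnectionClosedError(
                f"connection is {self.state.value} (close code: {self.close_code})"
            )

    def write(self, data: bytes) -> None:
        self.ensure_open()
        self.outbox.append(data)
```

A muxer built on top of this could call `ensure_open()` (or check `state`) before sending a window update and skip the write when the peer has already closed.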

Code Changes Summary

| File | Lines | Change Type | Description |
|---|---|---|---|
| libp2p/io/utils.py | 27-30 | Error message enhancement | Add transport context to IncompleteReadError |
| libp2p/transport/websocket/connection.py | 341-353 | Bug fix | Raise IOException instead of returning b"" on close |
| libp2p/transport/websocket/connection.py | 230-245 | Bug fix | Better close detection in read(n=None) path |
| libp2p/stream_muxer/yamux/yamux.py | 648-656 | Error logging enhancement | Add transport context to error logs |

Expected Improvements

Before Fix:

Connection closed: expected 2 bytes but received 0 (connection may have been reset by peer)

After Fix:

WebSocket connection closed by peer during read operation: code=1000, reason=Connection closed by peer. This may indicate the peer closed the connection, a network issue, or a protocol error during yamux stream operation. (transport: websocket, duration: 0.12s)

Benefits:

  1. Faster failure detection: Connection closure detected immediately (raises exception)
  2. No wasted retries: read_exactly() no longer spins on empty reads from a closed connection
  3. Better debugging: Error messages include transport type and connection state
  4. Clear indication: WebSocket-specific issues are clearly identified

Testing

Unit Tests

Create test file: tests/core/transport/websocket/test_websocket_yamux_nim_compat.py

Test scenarios:

  • Connection closes during yamux header read
  • Connection closes during yamux data read
  • WebSocket message boundaries during yamux operations
  • Error message clarity and context
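
As a starting point for the header-read and message-boundary scenarios, a test could feed a 12-byte yamux header through a reader that delivers it in WebSocket-message-sized pieces. The sketch below is self-contained and makes several assumptions: `ChunkedReader` and `read_header` are hypothetical helpers, `read_exactly` here is a simplified stand-in that treats an empty read as EOF (unlike the current library function), and `asyncio` replaces the project's real test runner. The header layout (version, type, flags, stream ID, length, big-endian) follows the yamux specification.

```python
import asyncio
import struct

# yamux header: version (1), type (1), flags (2), stream id (4), length (4)
YAMUX_HEADER = struct.Struct(">BBHII")


class ChunkedReader:
    """Hypothetical reader: each queued chunk models one WebSocket message."""

    def __init__(self, chunks) -> None:
        self._chunks = list(chunks)
        self._pending = b""

    async def read(self, n: int) -> bytes:
        if not self._pending and self._chunks:
            self._pending = self._chunks.pop(0)
        data, self._pending = self._pending[:n], self._pending[n:]
        return data  # b"" once every message has been consumed (peer closed)


async def read_exactly(reader, n: int) -> bytes:
    """Simplified stand-in: accumulate n bytes, treating an empty read as EOF."""
    buffer = bytearray()
    while len(buffer) < n:
        chunk = await reader.read(n - len(buffer))
        if not chunk:
            raise EOFError(f"expected {n} bytes but received {len(buffer)}")
        buffer.extend(chunk)
    return bytes(buffer)


async def read_header(reader):
    """Read and unpack one yamux header, regardless of message boundaries."""
    return YAMUX_HEADER.unpack(await read_exactly(reader, YAMUX_HEADER.size))


def test_header_split_across_messages():
    header = YAMUX_HEADER.pack(0, 1, 0, 1, 256)  # type 1 = window update
    reader = ChunkedReader([header[:5], header[5:]])
    assert asyncio.run(read_header(reader)) == (0, 1, 0, 1, 256)


def test_close_during_header_read():
    header = YAMUX_HEADER.pack(0, 1, 0, 1, 256)
    reader = ChunkedReader([header[:5]])  # peer closes after 5 of 12 bytes
    try:
        asyncio.run(read_header(reader))
    except EOFError as exc:
        assert "expected 12 bytes but received 5" in str(exc)
    else:
        raise AssertionError("expected EOFError")
```

The real tests would swap these stand-ins for the actual WebSocket connection and yamux classes, but the two cases map directly onto the first and third scenarios in the list.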

Interoperability Tests

Run the failing tests:

cd transport-interop
SAVE_LOGS=all WORKER_COUNT=1 npm test -- --name-filter="python-v0.4 x nim-v1.14 (ws, noise, yamux)"

References

Detailed Analysis Documents

All detailed analysis documents are available in downloads/WebSocketIssue/:

  1. PYTHON_WEBSOCKET_ROOT_CAUSE_ANALYSIS.md - Comprehensive root cause analysis
  2. RUST_WEBSOCKET_LIFECYCLE_ANALYSIS.md - Deep dive into how rust-libp2p handles WebSocket lifecycle (critical for understanding the solution)
  3. PYTHON_WEBSOCKET_FIX_PROPOSAL.md - Detailed fix proposal with code changes
  4. CODE_INTEGRATION_SUMMARY.md - Quick reference for code changes
  5. MAINTAINER_REPORT.md - Overview of all test failures
  6. README.md - Guide to the documentation package

Related Issues and PRs

Python Implementation Details

  • Commit SHA: 926fae4476965aa5f194ce46c128e3b3b7d5a55b
  • Test Implementation: transport-interop/impl/python/v0.4/ping_test.py
  • Version: v0.4

Implementation Priority

  1. High Priority: Implement proactive closure detection in libp2p/transport/websocket/connection.py
  2. High Priority: Update libp2p/io/utils.py to handle EOF gracefully
  3. Medium Priority: Improve error messages with transport context
  4. Medium Priority: Add connection state tracking
  5. Low Priority: Add test cases for WebSocket closure scenarios

Conclusion

The root cause is Python's reactive closure detection vs Rust's proactive closure detection. Python only discovers connection closure when attempting read/write operations, while Rust detects closure through stream polling before attempting operations. This causes Python to attempt window updates on already-closed connections, triggering error cascades.

The fix requires implementing proactive connection state tracking and graceful EOF handling, similar to how rust-v0.54+ handles WebSocket connections. This will allow Python to detect connection closure early and avoid problematic operations on closed connections.
