Technical Discussion: CI Windows Ring Topology Test Failure Analysis #1166
## Implementation Notes: Fixing CI/CD Test Failures with Explicit Synchronization

### Background

While working on the test suite for py-libp2p, we encountered intermittent CI/CD failures in the distributed pubsub tests.

### Problem Analysis

The failures stemmed from a race condition between message propagation and test cleanup in distributed pubsub tests. Here's what we found:

#### Environment Differences

Tests pass consistently on fast local machines but fail intermittently on slower, more variable CI runners.
#### Distributed Systems Fundamentals

The core issue relates to fundamental properties of distributed systems: message delivery is asynchronous with no fixed upper bound, and there is no global snapshot of state, so a test must either wait conservatively or explicitly observe convergence.
### The Fix

We added explicit synchronization barriers using a `wait_for_convergence` helper.

**Before (race condition present):**

```python
async def action_func(dummy_nodes):
    await dummy_nodes[0].publish_set_crypto("aspyn", 20)
    await dummy_nodes[0].publish_send_crypto("aspyn", "alex", 5)
    # Cleanup starts immediately - messages may still be propagating
```

**After (synchronized):**

```python
async def action_func(dummy_nodes):
    await dummy_nodes[0].publish_set_crypto("aspyn", 20)
    await wait_for_convergence(
        dummy_nodes, lambda n: n.get_balance("aspyn") == 20, timeout=10.0
    )
    await dummy_nodes[0].publish_send_crypto("aspyn", "alex", 5)
    await wait_for_convergence(
        dummy_nodes,
        lambda n: n.get_balance("aspyn") == 15 and n.get_balance("alex") == 5,
        timeout=10.0,
    )
```
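The snippet above calls `wait_for_convergence` without showing it. A minimal polling sketch of what such a helper could look like, assuming trio; this is illustrative, not the actual helper from the test suite:

```python
import trio

async def wait_for_convergence(nodes, predicate, timeout=10.0, poll_interval=0.05):
    """Poll until `predicate(node)` is true for every node, or time out."""
    with trio.move_on_after(timeout) as scope:
        while True:
            if all(predicate(node) for node in nodes):
                return
            await trio.sleep(poll_interval)
    if scope.cancelled_caught:
        raise TimeoutError(f"nodes did not converge within {timeout}s")
```

Polling trades a little CPU for determinism: the test proceeds the moment all nodes agree, and fails loudly with a real timeout instead of asserting against half-propagated state.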
### Scalability Testing

We tested the solution across different network sizes to understand timeout requirements.
The scaling limitations are expected given FloodSub's architecture: every node forwards each message to all of its peers, so traffic and propagation delay grow with network size. The sketch below gives a rough sense of how the convergence floor grows with ring size.
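A back-of-the-envelope sketch, assuming a bidirectional ring and an arbitrary ~50 ms per hop (an illustrative figure, not a measurement):

```python
def worst_case_hops(n_nodes: int) -> int:
    # In a bidirectional ring the farthest node is floor(n/2) hops away.
    return n_nodes // 2


def min_convergence_time(n_nodes: int, per_hop_latency_s: float) -> float:
    # Lower bound: the farthest node cannot see the message any sooner.
    return worst_case_hops(n_nodes) * per_hop_latency_s


# With the assumed 50 ms per hop, 7 nodes need at least ~0.15 s, while
# 50 nodes already need ~1.25 s -- past a fixed 1.0 s settle time.
for n in (5, 7, 20, 50):
    print(f"{n} nodes -> >= {min_convergence_time(n, 0.05):.2f} s")
```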
Note: These numbers are hardware-dependent. The test suite uses 7 nodes maximum, which provides sufficient time for convergence even on slower CI runners.

### Why This Isn't an Architectural Issue

It's worth clarifying that this is not a bug in the codebase: the pubsub implementation behaves correctly, and the issue was simply missing synchronization barriers between operations in the test code.

### FloodSub vs Gossipsub Context

The test suite uses FloodSub for simplicity in small network scenarios. For context: FloodSub forwards every message to every peer, which is robust and easy to reason about at small scale, while Gossipsub bounds per-node load with a partial mesh plus gossip and is the usual choice for larger networks.
The choice of FloodSub for tests with 7 nodes or fewer is appropriate and intentional.

### Conclusion

The CI failures were caused by race conditions from insufficient synchronization between distributed operations, not architectural bugs. Adding explicit synchronization barriers resolved them.
Recently I've been looking into the intermittent Windows CI failure in `test_set_then_send_from_diff_nodes_five_nodes_ring_topography`.

## Problem Statement

The test fails with an `ExceptionGroup` that masks the root issue.

**Symptoms:**

- Failures are intermittent and show up on Windows runners only.
- The reported error is an `ExceptionGroup` rather than the underlying cause (e.g., a missing-state `KeyError`).
## Architecture Context

### Ring Topology

Five nodes are wired in a ring, so each node is directly connected to only its two neighbours, and a message reaches the far side of the ring only after multiple hops.

### Test Scenario

One node publishes a `set_crypto` balance update; a different node then publishes a `send_crypto` transfer, and every node is expected to converge on the same balances.
## Root Cause Analysis

### 1. Message Propagation Timing

In a five-node ring a published message needs multiple hops to reach every node (the farthest nodes are two hops away), so the fixed `trio.sleep(1.0)` settle time is borderline: fine on a fast machine, easily exceeded on a loaded CI runner.

#### Timeline Example
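A hedged reconstruction of the timeline for the 5-node ring, assuming an arbitrary ~300 ms per hop on a loaded runner (illustrative numbers, not measurements):

```python
# node0 publishes at t=0; neighbours are node1/node4, far side is node2/node3.
#
# t=0.00s  node0: publish set_crypto("aspyn", 20)
# t=0.30s  node1, node4: receive (1 hop)
# t=0.60s  node2, node3: receive (2 hops) -- already 60% of the budget
# t=1.00s  trio.sleep(1.0) expires; the test asserts on all nodes
#
# Any extra scheduling or I/O delay on the second hop pushes node2/node3
# past the deadline, and the assertion fails only on slow (Windows) runners.
```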
### 2. Windows Async I/O & Scheduling

Windows runners add to the delay: the default timer resolution is coarser (~15.6 ms), and socket/event-loop behaviour differs from Linux, so each hop can take noticeably longer under load.
### 3. Exception Propagation Chain

When a node's handler fails, the error surfaces from the trio nursery as an `ExceptionGroup` rather than the underlying cause (e.g., a `KeyError` for missing account state).
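A minimal reproduction of the masking effect, assuming Python 3.11+ and a recent trio where nurseries raise `ExceptionGroup`:

```python
import trio

async def faulty_handler():
    # Stands in for a pubsub handler hitting missing account state.
    raise KeyError("aspyn")

async def main():
    try:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(faulty_handler)
            nursery.start_soon(trio.sleep_forever)
    except* KeyError as group:
        # except* unwraps the ExceptionGroup the nursery raises; without
        # it, CI logs show only the group, hiding the KeyError inside.
        print("root cause:", group.exceptions)

trio.run(main)
```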
### 4. Lack of State Synchronization
The current test assumes all state will have propagated after a fixed N seconds, so it fails whenever the environment makes propagation slower than the assumed time.
## Solution Approaches

### 1. Platform-Specific Timing (Quick Fix)

Keep time-based waits but lengthen them on Windows (see the sketch after the pros/cons).
- **Pros:** Fastest to implement.
- **Cons:** Fragile, still time-based, may not generalize.
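A sketch of the quick fix; the multiplier is a placeholder chosen for illustration, not a measured value:

```python
import sys
import trio

# Widen the settle time on Windows only; 3x would need tuning
# against real CI runs.
SETTLE_SECONDS = 3.0 if sys.platform == "win32" else 1.0

async def settle():
    await trio.sleep(SETTLE_SECONDS)
```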
### 2. Convergence Detection (State-Based; Preferred Medium-Term)

Wait on observed node state instead of wall-clock time, so the test proceeds as soon as all nodes agree (see the sketch after the pros/cons).
- **Pros:** Platform-agnostic, robust, eliminates guesswork.
- **Cons:** More implementation effort, requires tracking state.
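Beyond the polling helper sketched in the first comment, an event-driven variant is possible: each node exposes an awaitable that fires when its local state matches an expected snapshot. Everything here is hypothetical API for illustration:

```python
import trio

class StateWatcher:
    """Fires an event when the node's state reaches an expected snapshot."""

    def __init__(self):
        self._event = trio.Event()
        self._expected = None

    def expect(self, snapshot):
        # Arm the watcher with the state the test is waiting for.
        self._expected = snapshot
        self._event = trio.Event()

    def on_state_change(self, snapshot):
        # Called by the node whenever a pubsub message mutates state.
        if snapshot == self._expected:
            self._event.set()

    async def wait(self, timeout=10.0):
        with trio.fail_after(timeout):  # raises trio.TooSlowError on timeout
            await self._event.wait()
```

This removes the polling interval entirely: the test wakes exactly when the state transition happens.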
### 3. Message Sequencing (Long-Term Improvement)

Stamp every published message with a per-sender sequence number (see the sketch after the pros/cons).
- **Pros:** Allows deduplication, causal ordering, and traceability.
- **Cons:** Protocol change, more code.
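A sketch of what sequencing could look like; all names are illustrative, not part of the existing protocol:

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    sender: str
    seqno: int
    payload: bytes

class Sequencer:
    """Stamps outgoing messages and deduplicates incoming ones."""

    def __init__(self, local_id: str):
        self._local_id = local_id
        self._counter = itertools.count(1)
        self._highest_seen: dict[str, int] = {}

    def wrap(self, payload: bytes) -> Envelope:
        return Envelope(self._local_id, next(self._counter), payload)

    def accept(self, env: Envelope) -> bool:
        """True for fresh messages; False for duplicates or replays."""
        last = self._highest_seen.get(env.sender, 0)
        if env.seqno <= last:
            return False
        self._highest_seen[env.sender] = env.seqno
        return True
```

A gap between `last + 1` and `env.seqno` would also reveal lost messages, which is the traceability the pros mention.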
### 4. Enhanced Error Handling

Contain per-message failures in handlers instead of letting them crash the whole service (see the sketch after the pros/cons).
- **Pros:** Reduces full-service crashes, improves diagnostics.
- **Cons:** May hide bugs if overused.
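A sketch of containing per-message failures, so a single malformed or early message (e.g. the `KeyError` above) becomes a log line rather than a nursery-wide crash; `node.handle` is a hypothetical entry point:

```python
import logging

logger = logging.getLogger(__name__)

async def on_message(node, msg):
    try:
        await node.handle(msg)
    except KeyError as exc:
        # Missing account state: log and drop rather than crash the service.
        logger.warning("dropping message %r: missing account state %s", msg, exc)
```

The "may hide bugs" con is real, which is why the except clause should stay narrow (`KeyError` here) rather than a blanket `except Exception`.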
## Evaluation Summary

| Approach | Effort | Robustness |
| --- | --- | --- |
| 1. Platform-specific timing | Low (fastest to implement) | Low (fragile, still time-based) |
| 2. Convergence detection | Medium (requires tracking state) | High (platform-agnostic) |
| 3. Message sequencing | High (protocol change) | High (dedup, ordering, traceability) |
| 4. Enhanced error handling | Low | Improves diagnostics; may hide bugs if overused |
## Conclusion

The root cause of the Windows CI issues is unreliable time-based state synchronization, aggravated by ring-topology propagation delays and Windows async/networking limitations.
Path forward:

1. Short term: apply platform-specific timing as a quick fix to stabilize CI.
2. Medium term: move the tests to convergence detection.
3. Long term: add message sequencing and enhanced error handling.
Adopting state-based synchronization aligns with distributed systems best practices and is key to reliable, platform-agnostic tests.