Technical Discussion: CI Windows Ring Topology Test Failure Analysis #1166
## Implementation Notes: Fixing CI/CD Test Failures with Explicit Synchronization

### Background

While working on the test suite for py-libp2p, we encountered intermittent CI/CD failures in the distributed pubsub tests.

### Problem Analysis

The failures stemmed from a race condition between message propagation and test cleanup in distributed pubsub tests. Here's what we found:

#### Environment Differences

Tests pass consistently on fast local machines but fail intermittently on slower, more variable CI runners.
#### Distributed Systems Fundamentals

The core issue relates to fundamental properties of distributed systems: message delivery is asynchronous with no fixed upper bound, and there is no global snapshot of state, so a test must either wait conservatively or explicitly observe convergence.
### The Fix

We added explicit synchronization barriers using a `wait_for_convergence` helper.

**Before (race condition present):**

```python
async def action_func(dummy_nodes):
    await dummy_nodes[0].publish_set_crypto("aspyn", 20)
    await dummy_nodes[0].publish_send_crypto("aspyn", "alex", 5)
    # Cleanup starts immediately - messages may still be propagating
```

**After (synchronized):**

```python
async def action_func(dummy_nodes):
    await dummy_nodes[0].publish_set_crypto("aspyn", 20)
    await wait_for_convergence(
        dummy_nodes, lambda n: n.get_balance("aspyn") == 20, timeout=10.0
    )
    await dummy_nodes[0].publish_send_crypto("aspyn", "alex", 5)
    await wait_for_convergence(
        dummy_nodes,
        lambda n: n.get_balance("aspyn") == 15 and n.get_balance("alex") == 5,
        timeout=10.0,
    )
```
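The snippet above calls `wait_for_convergence` without showing it. A minimal polling sketch of what such a helper could look like, assuming trio; this is illustrative, not the actual helper from the test suite:

```python
import trio

async def wait_for_convergence(nodes, predicate, timeout=10.0, poll_interval=0.05):
    """Poll until `predicate(node)` is true for every node, or time out."""
    with trio.move_on_after(timeout) as scope:
        while True:
            if all(predicate(node) for node in nodes):
                return
            await trio.sleep(poll_interval)
    if scope.cancelled_caught:
        raise TimeoutError(f"nodes did not converge within {timeout}s")
```

Polling trades a little CPU for determinism: the test proceeds the moment all nodes agree, and fails loudly with a real timeout instead of asserting against half-propagated state.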
### Scalability Testing

We tested the solution across different network sizes to understand timeout requirements.
The scaling limitations are expected given FloodSub's architecture: every node forwards each message to all of its peers, so traffic and propagation delay grow with network size. The sketch below gives a rough sense of how the convergence floor grows with ring size.
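A back-of-the-envelope sketch, assuming a bidirectional ring and an arbitrary ~50 ms per hop (an illustrative figure, not a measurement):

```python
def worst_case_hops(n_nodes: int) -> int:
    # In a bidirectional ring the farthest node is floor(n/2) hops away.
    return n_nodes // 2


def min_convergence_time(n_nodes: int, per_hop_latency_s: float) -> float:
    # Lower bound: the farthest node cannot see the message any sooner.
    return worst_case_hops(n_nodes) * per_hop_latency_s


# With the assumed 50 ms per hop, 7 nodes need at least ~0.15 s, while
# 50 nodes already need ~1.25 s -- past a fixed 1.0 s settle time.
for n in (5, 7, 20, 50):
    print(f"{n} nodes -> >= {min_convergence_time(n, 0.05):.2f} s")
```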
Note: These numbers are hardware-dependent. The test suite uses 7 nodes maximum, which provides sufficient time for convergence even on slower CI runners.

### Why This Isn't an Architectural Issue

It's worth clarifying that this is not a bug in the codebase: the pubsub implementation behaves correctly, and the issue was simply missing synchronization barriers between operations in the test code.

### FloodSub vs Gossipsub Context

The test suite uses FloodSub for simplicity in small network scenarios. For context: FloodSub forwards every message to every peer, which is robust and easy to reason about at small scale, while Gossipsub bounds per-node load with a partial mesh plus gossip and is the usual choice for larger networks.
The choice of FloodSub for tests with 7 nodes or fewer is appropriate and intentional.

### Conclusion

The CI failures were caused by race conditions from insufficient synchronization between distributed operations, not architectural bugs. Adding explicit synchronization barriers resolved them.
Recently I've been looking into the intermittent Windows CI failure in `test_set_then_send_from_diff_nodes_five_nodes_ring_topography`.

## Problem Statement

The test fails with an `ExceptionGroup` that masks the root issue.

**Symptoms:**

- Failures are intermittent and show up on Windows runners only.
- The reported error is an `ExceptionGroup` rather than the underlying cause (e.g., a missing-state `KeyError`).
## Architecture Context

### Ring Topology

Five nodes are wired in a ring, so each node is directly connected to only its two neighbours, and a message reaches the far side of the ring only after multiple hops.

### Test Scenario

One node publishes a `set_crypto` balance update; a different node then publishes a `send_crypto` transfer, and every node is expected to converge on the same balances.
## Root Cause Analysis

### 1. Message Propagation Timing

In a five-node ring a published message needs multiple hops to reach every node (the farthest nodes are two hops away), so the fixed `trio.sleep(1.0)` settle time is borderline: fine on a fast machine, easily exceeded on a loaded CI runner.

#### Timeline Example
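A hedged reconstruction of the timeline for the 5-node ring, assuming an arbitrary ~300 ms per hop on a loaded runner (illustrative numbers, not measurements):

```python
# node0 publishes at t=0; neighbours are node1/node4, far side is node2/node3.
#
# t=0.00s  node0: publish set_crypto("aspyn", 20)
# t=0.30s  node1, node4: receive (1 hop)
# t=0.60s  node2, node3: receive (2 hops) -- already 60% of the budget
# t=1.00s  trio.sleep(1.0) expires; the test asserts on all nodes
#
# Any extra scheduling or I/O delay on the second hop pushes node2/node3
# past the deadline, and the assertion fails only on slow (Windows) runners.
```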
### 2. Windows Async I/O & Scheduling

Windows runners add to the delay: the default timer resolution is coarser (~15.6 ms), and socket/event-loop behaviour differs from Linux, so each hop can take noticeably longer under load.
### 3. Exception Propagation Chain

When a node's handler fails, the error surfaces from the trio nursery as an `ExceptionGroup` rather than the underlying cause (e.g., a `KeyError` for missing account state).
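A minimal reproduction of the masking effect, assuming Python 3.11+ and a recent trio where nurseries raise `ExceptionGroup`:

```python
import trio

async def faulty_handler():
    # Stands in for a pubsub handler hitting missing account state.
    raise KeyError("aspyn")

async def main():
    try:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(faulty_handler)
            nursery.start_soon(trio.sleep_forever)
    except* KeyError as group:
        # except* unwraps the ExceptionGroup the nursery raises; without
        # it, CI logs show only the group, hiding the KeyError inside.
        print("root cause:", group.exceptions)

trio.run(main)
```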
### 4. Lack of State Synchronization
The current test assumes all state will have propagated after a fixed N seconds, so it fails whenever the environment makes propagation slower than the assumed time.
## Solution Approaches

### 1. Platform-Specific Timing (Quick Fix)

Keep time-based waits but lengthen them on Windows (see the sketch after the pros/cons).
- **Pros:** Fastest to implement.
- **Cons:** Fragile, still time-based, may not generalize.
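A sketch of the quick fix; the multiplier is a placeholder chosen for illustration, not a measured value:

```python
import sys
import trio

# Widen the settle time on Windows only; 3x would need tuning
# against real CI runs.
SETTLE_SECONDS = 3.0 if sys.platform == "win32" else 1.0

async def settle():
    await trio.sleep(SETTLE_SECONDS)
```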
### 2. Convergence Detection (State-Based; Preferred Medium-Term)

Wait on observed node state instead of wall-clock time, so the test proceeds as soon as all nodes agree (see the sketch after the pros/cons).
- **Pros:** Platform-agnostic, robust, eliminates guesswork.
- **Cons:** More implementation effort, requires tracking state.
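Beyond the polling helper sketched in the first comment, an event-driven variant is possible: each node exposes an awaitable that fires when its local state matches an expected snapshot. Everything here is hypothetical API for illustration:

```python
import trio

class StateWatcher:
    """Fires an event when the node's state reaches an expected snapshot."""

    def __init__(self):
        self._event = trio.Event()
        self._expected = None

    def expect(self, snapshot):
        # Arm the watcher with the state the test is waiting for.
        self._expected = snapshot
        self._event = trio.Event()

    def on_state_change(self, snapshot):
        # Called by the node whenever a pubsub message mutates state.
        if snapshot == self._expected:
            self._event.set()

    async def wait(self, timeout=10.0):
        with trio.fail_after(timeout):  # raises trio.TooSlowError on timeout
            await self._event.wait()
```

This removes the polling interval entirely: the test wakes exactly when the state transition happens.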
### 3. Message Sequencing (Long-Term Improvement)

Stamp every published message with a per-sender sequence number (see the sketch after the pros/cons).
- **Pros:** Allows deduplication, causal ordering, and traceability.
- **Cons:** Protocol change, more code.
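A sketch of what sequencing could look like; all names are illustrative, not part of the existing protocol:

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    sender: str
    seqno: int
    payload: bytes

class Sequencer:
    """Stamps outgoing messages and deduplicates incoming ones."""

    def __init__(self, local_id: str):
        self._local_id = local_id
        self._counter = itertools.count(1)
        self._highest_seen: dict[str, int] = {}

    def wrap(self, payload: bytes) -> Envelope:
        return Envelope(self._local_id, next(self._counter), payload)

    def accept(self, env: Envelope) -> bool:
        """True for fresh messages; False for duplicates or replays."""
        last = self._highest_seen.get(env.sender, 0)
        if env.seqno <= last:
            return False
        self._highest_seen[env.sender] = env.seqno
        return True
```

A gap between `last + 1` and `env.seqno` would also reveal lost messages, which is the traceability the pros mention.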
### 4. Enhanced Error Handling

Contain per-message failures in handlers instead of letting them crash the whole service (see the sketch after the pros/cons).
- **Pros:** Reduces full-service crashes, improves diagnostics.
- **Cons:** May hide bugs if overused.
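A sketch of containing per-message failures, so a single malformed or early message (e.g. the `KeyError` above) becomes a log line rather than a nursery-wide crash; `node.handle` is a hypothetical entry point:

```python
import logging

logger = logging.getLogger(__name__)

async def on_message(node, msg):
    try:
        await node.handle(msg)
    except KeyError as exc:
        # Missing account state: log and drop rather than crash the service.
        logger.warning("dropping message %r: missing account state %s", msg, exc)
```

The "may hide bugs" con is real, which is why the except clause should stay narrow (`KeyError` here) rather than a blanket `except Exception`.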
## Evaluation Summary

| Approach | Effort | Robustness |
| --- | --- | --- |
| 1. Platform-specific timing | Low (fastest to implement) | Low (fragile, still time-based) |
| 2. Convergence detection | Medium (requires tracking state) | High (platform-agnostic) |
| 3. Message sequencing | High (protocol change) | High (dedup, ordering, traceability) |
| 4. Enhanced error handling | Low | Improves diagnostics; may hide bugs if overused |
## Conclusion

The root cause of the Windows CI issues is unreliable time-based state synchronization, aggravated by ring-topology propagation delays and Windows async/networking limitations.
Path forward:

1. Short term: apply platform-specific timing as a quick fix to stabilize CI.
2. Medium term: move the tests to convergence detection.
3. Long term: add message sequencing and enhanced error handling.
Adopting state-based synchronization aligns with distributed systems best practices and is key to reliable, platform-agnostic tests.