Improving Relay Selection Strategy Beyond Round-Robin #1043
Replies: 6 comments 1 reply
-
Relay Selection Strategy Review & FeedbackThis review evaluates the proposed relay selection strategy improvements for the py-libp2p CircuitV2 relay system. The current PR #972 implements a round-robin load balancing approach, while Discussion #1043 proposes advancing to a performance-based selection strategy. This document provides technical analysis, identifies strengths and concerns, and offers recommendations for the path forward. Key Findings:
BackgroundOriginal Issue (#735)The original relay selection logic returned the first available relay, leading to:
Current Implementation (PR #972)Status: Open, 6 commits, +798/-4 lines
Implementation Details: # From libp2p/relay/circuit_v2/transport.py
self.relay_counter = 0 # for round robin load balancing
self._relay_counter_lock = trio.Lock() # Thread safety
async with self._relay_counter_lock:
index = self.relay_counter % len(candidate_list)
relay_peer_id = candidate_list[index]
self.relay_counter += 1Proposed Enhancement (Discussion #1043)Proposes moving beyond round-robin to performance-aware relay selection with three strategic options. Discussion Post Analysis (#1043)Strengths1. Well-Structured Problem Definition ✅
2. Thoughtful Strategy Options 📊The three proposed approaches are well-researched and industry-standard: Option 1: Least Response Time
Option 2: Least Connections
Option 3: Hybrid Approach ⭐ (Recommended in discussion)
3. Backward Compatibility Consideration ✅The discussion explicitly mentions maintaining fallback behavior when metrics are unavailable—critical for production stability. 4. Clear Communication 📢
Areas for Discussion1. Implementation Complexity vs. Immediate Value ⚖️Question: Is the performance-based approach necessary now, or should it be a future enhancement? Considerations:
Analysis: The current round-robin approach represents a 80/20 solution—it solves 80% of the problem with 20% of the complexity. The performance-based approach targets the remaining 20% but requires 80% more engineering effort. 2. Missing: Metrics Collection Strategy 🔍The discussion proposes what to measure but not how:
Recommendation: A metrics collection design document should precede implementation. 3. Missing: Performance Benchmarks 📈The discussion lacks:
Recommendation: Establish benchmarks to validate that the additional complexity yields measurable improvements. 4. Risk: Over-Engineering
|
| Aspect | Round-Robin (PR #972) | Performance-Based (Discussion #1043) |
|---|---|---|
| Complexity | Low - Simple modulo arithmetic | High - Metrics collection, scoring, weighting |
| Implementation Time | ✅ Already complete | Estimated 2-4 weeks |
| Maintenance Overhead | Low - Minimal state | Medium-High - Metrics storage, expiry, edge cases |
| Performance Benefit | Evenly distributes load | Optimizes for quality AND distribution |
| User Experience | Good - No overloaded relays | Better - Routes to best-performing relays |
| Failure Handling | Neutral - Doesn't avoid bad relays | Self-healing - Automatically avoids slow relays |
| Resource Overhead | Minimal - Just a counter | Notable - Timing, storage, computation |
| Risk | Low - Battle-tested algorithm | Medium - New system, potential bugs |
| Backward Compatibility | ✅ No breaking changes | ✅ Can maintain fallback (if designed well) |
Recommendations
Phase 1: Merge Current PR #972
-
Rationale:
- Solves the immediate problem (issue Sophisticated Relay Selection in circuitv2 transport #735)
- Well-tested, low-risk implementation
- Significant improvement over "first available" strategy
- Provides foundation for future enhancements
-
Action Items:
- ✅ Address the minor performance concern (consider caching relay categorization)
- ✅ Ensure newsfragment mentions "round-robin" explicitly (already done)
- Merge to production
Impact: Immediate value delivery with minimal risk.
Phase 2: Research & Design
-
Rationale: Performance-based selection requires careful design
-
Action Items:
- Establish performance benchmarks with current round-robin implementation
- Design metrics collection architecture:
- What metrics to collect (latency, connection success rate, active circuits)
- How to store metrics (in-memory with TTL? time-series DB?)
- How to handle cold start and missing data
- Define success criteria (e.g., "30% reduction in average connection time")
- Create detailed design document for maintainer review
- Prototype in a feature branch
Impact: De-risks the implementation, ensures buy-in from maintainers.
Phase 3: Implement Performance-Based Selection
-
Rationale: Build on proven foundation with clear design
-
Implementation Strategy:
- Start with Option 2: Least Connections (simplest performance-based approach)
- Validate improvement with benchmarks
- If successful, iterate to Option 3: Hybrid Approach
-
Feature Flag:
class RelayConfig: selection_strategy: Literal["round_robin", "least_connections", "hybrid"] = "round_robin"
- Allows gradual rollout
- Easy rollback if issues arise
- A/B testing capability
Impact: Incremental improvement with controlled risk.
Recommendation 2: Start with Least Connections
Rationale:
- Easier to implement than hybrid approach (no complex weighting)
- Provides measurable benefit over round-robin
- Stepping stone to hybrid approach
- Lower risk than jumping directly to hybrid
Implementation Sketch:
async def _select_relay_least_connections(self, peer_info: PeerInfo) -> ID | None:
"""Select relay with fewest active circuits."""
relays = self.discovery.get_relays()
if not relays:
return None
# Prioritize relays with reservations
relays_with_reservations = [
r for r in relays
if self.discovery.get_relay_info(r) and
self.discovery.get_relay_info(r).has_reservation
]
candidate_list = relays_with_reservations if relays_with_reservations else relays
# Select relay with minimum active connections
# Tie-breaking with round-robin for equal connection counts
relay_connections = {
relay_id: self._get_active_circuit_count(relay_id)
for relay_id in candidate_list
}
min_connections = min(relay_connections.values())
min_relays = [r for r, c in relay_connections.items() if c == min_connections]
# Round-robin among relays with minimum connections
async with self._relay_counter_lock:
index = self.relay_counter % len(min_relays)
selected = min_relays[index]
self.relay_counter += 1
return selectedRecommendation 3: Metrics to Track
If pursuing performance-based selection, track these metrics per relay:
@dataclass
class RelayPerformanceMetrics:
relay_id: ID
# Connection metrics
total_connection_attempts: int
successful_connections: int
failed_connections: int
# Timing metrics (exponential moving average)
avg_connection_time_ms: float # EMA of connection establishment time
# Load metrics
active_circuits: int
# Reliability metrics
last_successful_connection: float # timestamp
last_failed_connection: float # timestamp
# Computed properties
@property
def success_rate(self) -> float:
if self.total_connection_attempts == 0:
return 0.0
return self.successful_connections / self.total_connection_attempts
@property
def is_healthy(self) -> bool:
"""Consider relay healthy if success rate > 80% and recent success."""
return (
self.success_rate > 0.8 and
time.time() - self.last_successful_connection < 300 # 5 minutes
)Metrics Retention:
- In-memory storage with TTL (e.g., 1 hour)
- Exponential moving average for timing metrics (smooth out spikes)
- Periodic cleanup of stale metrics
Recommendation 4: Acceptance Criteria
Before implementing performance-based selection, define clear success criteria:
| Metric | Current (Round-Robin) | Target (Performance-Based) |
|---|---|---|
| Average connection establishment time | Baseline TBD | 20-30% reduction |
| Failed connection rate | Baseline TBD | 20% reduction |
| Load distribution variance | Low (round-robin is uniform) | Medium (optimizes for quality) |
| 99th percentile connection time | Baseline TBD | 30% reduction |
| Self-healing (avoiding failed relays) | None | Automatic within 2 attempts |
How to measure: Implement telemetry in Phase 1 (round-robin) to establish baselines.
Risk Analysis
Risks of Current PR #972
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Counter overflow after 2^64 selections | Very Low | Low (wraps around) | No action needed (Python int is arbitrary precision) |
| Lock contention under high concurrency | Low | Low (lock held briefly) | Monitor in production; optimize if needed |
| Relay list changes during selection | Low | Low (handled gracefully) | Existing retry logic handles this |
Overall Risk: LOW ✅ - Safe to merge
Risks of Performance-Based Selection
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Metrics collection overhead | Medium | Medium | Use sampling, EMA smoothing |
| Cold start problem (new relay, no metrics) | High | Medium | Default to round-robin for unmeasured relays |
| Metrics storage memory overhead | Medium | Low | Implement TTL, periodic cleanup |
| Scoring algorithm tuning difficulty | High | Medium | Feature flag, A/B testing |
| Avoiding relays that are temporarily slow | Medium | Low | Time-based metrics expiry |
| Complexity increases debugging difficulty | High | Medium | Comprehensive logging, observability |
Overall Risk: MEDIUM
Comparison with Industry Standards
The proposed strategies align well with industry practices:
Round-Robin (Current PR)
- Used by: Nginx, HAProxy (default), AWS ELB (basic)
- Pros: Simple, predictable, widely understood
- Cons: Doesn't adapt to performance differences
Least Connections
- Used by: Nginx, HAProxy, AWS ALB
- Pros: Better than round-robin for long-lived connections
- Cons: Requires connection tracking
Weighted Response Time / Hybrid
- Used by: AWS ALB (with target health), Envoy, Istio
- Pros: Production-grade, self-optimizing
- Cons: Complex, requires tuning
libp2p Context:
- go-libp2p uses a similar relay selection approach but with additional DHT-based relay discovery
- rust-libp2p implements reservation prioritization similar to this PR
- py-libp2p would be on par with other implementations with the current PR, and ahead with performance-based selection
Alternative Approaches (Not in Discussion)
For completeness, here are alternative strategies not mentioned in the discussion:
1. Weighted Random Selection
- Assign weights to relays based on reservation status
- Randomly select with probability proportional to weight
- Pro: Simple, no state needed
- Con: Less predictable than round-robin
2. Power of Two Choices
- Randomly pick 2 relays, select the one with fewer connections
- Pro: Simple, good load distribution in large pools
- Con: Requires randomness, less predictable
3. Consistent Hashing
- Hash the target peer ID to select relay
- Pro: Same target always uses same relay (caching benefits)
- Con: Uneven distribution if target distribution is skewed
Recommendation: Stick with the proposed approaches; these alternatives don't offer significant advantages for this use case.
Detailed Feedback on Discussion Post #1043
What's Excellent ✅
- Contextualizes the change: Links to existing PR, acknowledges current progress
- Problem-first thinking: Clearly articulates why round-robin isn't sufficient
- Multiple options: Presents trade-offs, not just one solution
- Forward-thinking: Considers self-healing, adaptability
- Collaborative tone: CCs maintainers, invites feedback
- Structured format: Easy to read and reference
Suggestions for Improvement 📝
-
Add "Why Now?" Section
- Is this blocking any use case?
- Are users experiencing pain with round-robin?
- What's the opportunity cost of implementing this vs. other features?
-
Include Implementation Estimate
- Development time
- Testing time
- Migration/rollout plan
-
Add Rollback Strategy
- How to detect if performance-based selection is performing worse?
- How to quickly roll back to round-robin if needed?
-
Reference Benchmarks or Studies
- Any data showing performance-based selection benefits in similar systems?
- Links to research or case studies?
-
Address Operational Concerns
- How will this be monitored in production?
- What new metrics/alerts are needed?
- Debugging complexity increase?
-
Consider Simplification
- Could "Least Connections" alone solve 90% of the problem?
- Is the hybrid approach necessary for v1?
Conclusion & Final Recommendations
For PR #972 ✅
Recommendation: APPROVE & MERGE (with minor optional improvements)
Strengths:
- ✅ Solves the stated problem (issue Sophisticated Relay Selection in circuitv2 transport #735)
- ✅ Clean, maintainable code
- ✅ Thread-safe implementation
- ✅ Excellent test coverage
- ✅ Low risk, high value
Optional improvements (non-blocking):
- Consider caching relay categorization to reduce
get_relay_info()calls - Add stress test for concurrent selections (nice-to-have)
Next Steps:
- Address any maintainer feedback
- Merge to main
- Monitor in production
For Discussion #1043 📊
Recommendation: PROCEED WITH PHASED APPROACH
Phase 1: Merge PR #972 ← Do this first
- Delivers immediate value
- Establishes foundation
Phase 2: Research & Design ← Do this next
- Establish benchmarks with current implementation
- Design metrics collection architecture
- Get maintainer buy-in on approach
Phase 3: Implement Least Connections ← Then do this
- Start with simpler performance-based approach
- Validate improvements with benchmarks
Summary Recommendation Matrix
| Recommended Approach | Rationale |
|---|---|
| Merge PR #972 immediately | Solves problem, low risk, ready now |
| Implement telemetry & benchmarks | Establishes data for future decisions |
| Consider Least Connections if data supports | Incremental improvement |
| Consider Hybrid if scale demands | Production-grade for large scale |
References
- PR Improve relay selection to prioritize active reservations #972: Improve relay selection to prioritize active reservations #972
- Discussion Improving Relay Selection Strategy Beyond Round-Robin #1043: Improving Relay Selection Strategy Beyond Round-Robin #1043
- Issue Sophisticated Relay Selection in circuitv2 transport #735: (Referenced in PR)
- libp2p Circuit Relay v2 Spec: https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md
Beta Was this translation helpful? Give feedback.
-
|
@yashksaini-coder : Great, appreciate the feedback. Very comprehensive indeed. Would recommend you to discuss with @parth-soni07 and arrive at a good conclusion on the PR, together. |
Beta Was this translation helpful? Give feedback.
-
|
Thank you @seetadev & @yashksaini-coder for your valuable feedbacks, here is how I have planned throughout the week to work on the PR for relay selection optimisation. Metrics Implementation Plan: Connection Latency & Active CircuitsOverviewThis section outlines the implementation plan for tracking two new metrics in py-libp2p:
Table of ContentsImplementation PhasesThe implementation is divided into 4 phases, each with specific code changes that can be tracked and tested independently. Phase 1: Prometheus Metrics FoundationGoal: Set up the Prometheus metrics infrastructure for both metrics. Files to Modify:
Tasks:
Completion Criteria:
Phase 2: Connection Latency TrackingGoal: Implement connection latency tracking in Swarm. Files to Modify:
Tasks:
Testing:
Completion Criteria:
Phase 3: Active Circuits TrackingGoal: Implement active circuits tracking in CircuitV2Protocol. Files to Modify:
Tasks:
Testing:
Completion Criteria:
Phase 4: Integration & TestingGoal: Complete integration, add comprehensive tests, and verify end-to-end functionality. Files to Create/Modify:
Tasks:
Testing:
Completion Criteria:
Detailed Implementation StepsThe following sections provide detailed code examples for each component. 1. Connection Latency TrackingOverviewTrack the time taken to establish connections, including both successful and failed attempts. This helps monitor network performance and identify connection issues. Metric Details
Implementation LocationFile: Changes Required1.1 Add Helper Methodsdef _get_transport_type(self) -> str:
"""Get transport type name for metrics."""
if isinstance(self.transport, QUICTransport):
return "quic"
elif hasattr(self.transport, '__class__'):
class_name = self.transport.__class__.__name__.lower()
if "circuit" in class_name or "relay" in class_name:
return "circuit_v2"
elif "websocket" in class_name:
return "websocket"
elif "tcp" in class_name:
return "tcp"
return "unknown"
def _record_connection_latency(
self,
peer_id: str,
transport: str,
success: bool,
latency_ms: float,
error: str | None = None,
) -> None:
"""Record connection latency metric."""
# Access PrometheusExporter through ResourceManager if available
if self._resource_manager is not None and self._resource_manager.prometheus_exporter is not None:
labels = {
"transport": transport,
"success": "true" if success else "false",
}
if error:
# Extract error type from error string for additional context
# (Note: Prometheus labels should be low cardinality, so we keep it simple)
if "timeout" in error.lower():
labels["error_type"] = "timeout"
elif "refused" in error.lower():
labels["error_type"] = "connection_refused"
elif "unreachable" in error.lower():
labels["error_type"] = "unreachable"
self._resource_manager.prometheus_exporter.record_connection_latency(
transport=transport,
success=success,
latency_ms=latency_ms,
)1.2 Modify
|
Beta Was this translation helpful? Give feedback.
-
|
Hi @parth-soni07, thanks for putting together this detailed implementation plan. I can see you've thought through the metrics tracking carefully. Here are my thoughts: Overall approach: My main concern:
The Prometheus export can come later as observability, but it shouldn't block the core relay selection improvements. Specific feedback:
What I'd like to see next:
Then we can discuss how this integrates with the existing Does this make sense? |
Beta Was this translation helpful? Give feedback.
-
|
Thanks @paschal533, I have curated the plan and have implemented what exactly you wanted to see in the PR #972 Relay Selection ImplementationImplementation PlanBased on earlier feedback : Focus on in-memory metrics and basic selection logic first. Prometheus integration can come later. GoalCreate a Requirements
ChecklistCore Implementation
Testing
Enhancements (Optional - Can be done later)
Status SummaryCore Implementation: ✅ Complete (15/15) Implementation SummaryRelayPerformanceTracker ClassLocation: Key Features:
Methods Implemented:
Scoring Formula:
Integration:
|
Beta Was this translation helpful? Give feedback.
-
|
In response to the feedback shared by @paschal533 on the changes made at #972 (comment) , I have made the following changes: 1. EMA Initialization - ✅ FixedIssue: Using Solution Implemented:
Changes:
Result: Legitimate 0ms latency is now properly tracked and won't be confused with "no data yet". 2. Circuit Closure Tracking - ✅ FixedIssue: Solution Implemented:
Changes:
Result: Circuit closures are now automatically tracked for all connections created through Note: This solution is non-intrusive - it doesn't modify core 3. Export Metrics Method - ✅ Working as IntendedStatus: The Current Implementation:
No changes needed - this is working as designed per the original feedback to defer Prometheus integration. SummaryBoth critical issues (EMA initialization and circuit closure tracking) have been fixed with comprehensive test coverage. All tests pass (31 total: 28 performance tracker tests + 3 new wrapper tests), and all lint/type checks pass. |
Beta Was this translation helpful? Give feedback.

Uh oh!
There was an error while loading. Please reload this page.
-
🧩 Context
⚙️ Problem Summary
Current behavior
Relays are selected using a simple round-robin rotation.
This spreads traffic evenly but does not account for:
Impact
Round-robin can unintentionally degrade performance because all relays are treated equally even when some are significantly slower or unstable.
This affects:
🔍 Analysis
A selection strategy that is aware of relay performance metrics will naturally self-optimize:
This is a common improvement seen across distributed systems, load balancers, and peer-to-peer networks.
🚀 Proposed Solutions
Below are three strategies worth evaluating:
Option 1: Least Response Time
Track the time required to successfully open a connection through each relay.
Option 2: Least Connections
Maintain a count of active circuits or sessions per relay.
Option 3: Hybrid Approach
Combine both:
Weight relays based on:
This gives a balanced, adaptive, self-healing selection algorithm.
🧠 Recommended Approach
**🔗 Related References **
CCing @paschal533 @seetadev @pacrob @acul71 for feedback & ideas 💡
Beta Was this translation helpful? Give feedback.
All reactions