Improving Relay Selection Strategy Beyond Round-Robin #1043

parth-soni07 · 2025-11-14T12:45:40Z

🧩 Context

Recent work on relay selection improvements has introduced a round-robin strategy to replace the earlier “first available relay” method. This change represents a meaningful step toward more predictable and evenly distributed relay usage.
Given that this component is already undergoing revision, it may be an opportune moment to consider a more advanced approach. Specifically, a performance-aware relay selection mechanism could offer significant benefits in terms of user experience, connection reliability, and system adaptability under varying network conditions.
Such a strategy would enable the system to better account for differences in relay performance, helping it dynamically adjust to factors such as latency, load, and relay availability.

⚙️ Problem Summary

Current behavior

Relays are selected using a simple round-robin rotation.
This spreads traffic evenly but does not account for:
- Network latency
- Relay capacity
- Overloaded or unreachable relays
- Geographic distance

Impact

Round-robin can unintentionally degrade performance because all relays are treated equally even when some are significantly slower or unstable.

This affects:

Connection setup time
Circuit reliability
User experience during congestion
Automatic adaptation when a relay becomes unhealthy

🔍 Analysis

A selection strategy that is aware of relay performance metrics will naturally self-optimize:

Fast relays get selected more often.
Slow or struggling relays get deprioritized without manual intervention.
Reserved relays can still be favored when their performance is comparable.

This is a common improvement seen across distributed systems, load balancers, and peer-to-peer networks.

🚀 Proposed Solutions

Below are three strategies worth evaluating:

Option 1: Least Response Time

Track the time required to successfully open a connection through each relay.

Quick feedback loop
Directly reflects network quality
Ideal for latency-sensitive traffic

Option 2: Least Connections

Maintain a count of active circuits or sessions per relay.

Helps with load distribution
Prevents overloading any relay
Similar to classic load-balancer strategies

Option 3: Hybrid Approach

Combine both:

Weight relays based on:
- Response time
- Active connection count
- Reservation priority (reserved relays get bonus weight)

This gives a balanced, adaptive, self-healing selection algorithm.

🧠 Recommended Approach

Evaluate implementing a hybrid weighted selection that considers performance metrics.
Maintain backward compatibility with simple fallback behavior if metrics are unavailable.
Keep reserved relays prioritized, unless their performance is significantly worse.

🔗 Related References

Current PR: Improve relay selection to prioritize active reservations (#972)

CCing @paschal533 @seetadev @pacrob @acul71 for feedback & ideas 💡

yashksaini-coder · 2025-11-15T06:20:16Z

Relay Selection Strategy Review & Feedback

This review evaluates the proposed relay selection strategy improvements for the py-libp2p CircuitV2 relay system. The current PR #972 implements a round-robin load balancing approach, while Discussion #1043 proposes advancing to a performance-based selection strategy. This document provides technical analysis, identifies strengths and concerns, and offers recommendations for the path forward.

Key Findings:

✅ Current round-robin implementation is solid and addresses the immediate issue
⚠️ Performance-based selection offers significant advantages but adds complexity
📊 Recommend phased approach: merge current work, then iterate on performance-based selection

Background

Original Issue (#735)

The original relay selection logic returned the first available relay, leading to:

Uneven load distribution across relays
Potential overloading of specific relays
No consideration for relay reservations

Current Implementation (PR #972)

Status: Open, 6 commits, +798/-4 lines
Key Changes:

Implemented round-robin selection with relay counter
Added concurrency safety using trio.Lock() so that concurrent tasks cannot interleave counter updates
Prioritizes relays with active reservations
Comprehensive test coverage (3 new test cases)

Implementation Details:

# From libp2p/relay/circuit_v2/transport.py
self.relay_counter = 0  # for round robin load balancing
self._relay_counter_lock = trio.Lock()  # Thread safety

async with self._relay_counter_lock:
    index = self.relay_counter % len(candidate_list)
    relay_peer_id = candidate_list[index]
    self.relay_counter += 1

Proposed Enhancement (Discussion #1043)

Proposes moving beyond round-robin to performance-aware relay selection with three strategic options.

Discussion Post Analysis (#1043)

Strengths

1. Well-Structured Problem Definition ✅

Clear articulation of current limitations
Identifies specific gaps (latency, capacity, geographic distance)
Quantifiable impact areas (connection time, reliability, UX)

2. Thoughtful Strategy Options 📊

The three proposed approaches are well-researched and industry-standard:

Option 1: Least Response Time

Pros: Direct reflection of network quality, quick feedback loop, latency-sensitive
Cons: Requires timing infrastructure, vulnerable to temporary spikes
Use Case: Best for real-time applications

Option 2: Least Connections

Pros: Simple to implement, classic load balancer strategy, prevents overloading
Cons: Doesn't account for connection quality or relay capacity variance
Use Case: Best for basic load distribution

Option 3: Hybrid Approach ⭐ (Recommended in discussion)

Pros: Balanced, adaptive, self-healing, maintains reservation priority
Cons: Most complex, requires tuning weights
Use Case: Production-grade distributed systems

3. Backward Compatibility Consideration ✅

The discussion explicitly mentions maintaining fallback behavior when metrics are unavailable—critical for production stability.

4. Clear Communication 📢

Uses visual markers (emojis) for section organization
References related PR for context
Tags relevant maintainers for feedback

Areas for Discussion

1. Implementation Complexity vs. Immediate Value ⚖️

Question: Is the performance-based approach necessary now, or should it be a future enhancement?

Considerations:

Round-robin already solves the immediate load distribution problem
Performance-based selection adds significant complexity:
- Metrics collection infrastructure
- Storage and expiry of performance data
- Weighted scoring algorithms
- Edge case handling (all relays performing poorly, no metrics available)

Analysis: The current round-robin approach is an 80/20 solution: it solves roughly 80% of the problem with 20% of the effort. The performance-based approach targets the remaining 20% of the problem but accounts for the other 80% of the engineering effort.

2. Missing: Metrics Collection Strategy 🔍

The discussion proposes what to measure but not how:

How will response times be tracked? (per connection attempt? moving average?)
Where will metrics be stored? (in-memory? persistent?)
What's the metrics retention/expiry policy?
How to handle cold start (new relay with no metrics)?
What happens when all relays have poor metrics?

Recommendation: A metrics collection design document should precede implementation.

3. Missing: Performance Benchmarks 📈

The discussion lacks:

Current baseline performance metrics
Expected performance improvements with each strategy
Resource overhead of metrics collection

Recommendation: Establish benchmarks to validate that the additional complexity yields measurable improvements.

4. Risk: Over-Engineering ⚠️

The hybrid approach, while sophisticated, may be over-engineering for the current use case:

Is the user base large enough to benefit from this optimization?
Are there real-world scenarios where round-robin fails significantly?
What's the cost/benefit ratio of implementation vs. maintenance?

PR #972 Technical Review

Code Quality Assessment

✅ Strengths:

Thread-Safety Implementation

async with self._relay_counter_lock:
    index = self.relay_counter % len(candidate_list)
    relay_peer_id = candidate_list[index]
    self.relay_counter += 1

Properly addresses concurrent access concerns
Uses appropriate trio.Lock() for async context

Clear Logic Flow

# Prioritize relays with active reservations
relays_with_reservations = []
other_relays = []

for relay_id in relays:
    relay_info = self.discovery.get_relay_info(relay_id)
    if relay_info and relay_info.has_reservation:
        relays_with_reservations.append(relay_id)
    else:
        other_relays.append(relay_id)

candidate_list = (
    relays_with_reservations if relays_with_reservations else other_relays
)

Readable separation of concerns
Reservation prioritization is explicit

Comprehensive Test Coverage
- test_circuit_v2_transport_relay_selection_round_robin_no_reservation(): Tests basic round-robin cycling
- test_circuit_v2_transport_relay_selection_prioritizes_reservations(): Validates reservation priority
- test_circuit_v2_transport_relay_selection_multiple_reservations(): Tests round-robin among reserved relays

⚠️ Potential Improvements:

Performance Concern: Repeated get_relay_info() Calls

```
for relay_id in relays:
    relay_info = self.discovery.get_relay_info(relay_id)  # Called in loop
```

- Each _select_relay() call iterates through all relays
- With many relays (10+), this could become expensive
- Suggestion: Consider caching the categorization of relays or maintaining a separate list of relays with reservations

Counter Increment Placement

The counter increments inside the lock, which is correct. However, consider whether you want to increment on:
- Every selection attempt (current)
- Only successful returns (alternative)

The current implementation is fine for uniform distribution.

Newsfragment Clarity (already addressed in commit afc287e)

The newsfragment should explicitly mention "round-robin" for changelog clarity. ✅ This was already fixed.
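
The caching suggestion above could look roughly like this: keep a set of reserved relays that is updated when reservations change, so selection does not need a get_relay_info() lookup per relay. `ReservationIndex` and its methods are hypothetical names for illustration, with plain strings standing in for peer IDs.

```python
class ReservationIndex:
    """Keeps the set of relays with an active reservation up to date."""

    def __init__(self) -> None:
        self._all: list[str] = []        # discovery order is preserved
        self._reserved: set[str] = set()

    def add_relay(self, relay_id: str, has_reservation: bool = False) -> None:
        if relay_id not in self._all:
            self._all.append(relay_id)
        self.set_reservation(relay_id, has_reservation)

    def set_reservation(self, relay_id: str, has_reservation: bool) -> None:
        # Would be called when a reservation is made, refreshed, or expires.
        if has_reservation:
            self._reserved.add(relay_id)
        else:
            self._reserved.discard(relay_id)

    def remove_relay(self, relay_id: str) -> None:
        if relay_id in self._all:
            self._all.remove(relay_id)
        self._reserved.discard(relay_id)

    def candidates(self) -> list[str]:
        """Reserved relays if any exist, otherwise every known relay."""
        reserved = [r for r in self._all if r in self._reserved]
        return reserved or list(self._all)


index = ReservationIndex()
index.add_relay("relay1")
index.add_relay("relay2")
```

The trade-off is that the index must be told about every reservation change, which moves the cost from selection time to update time.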

Test Coverage Analysis

The test suite is excellent and demonstrates thorough thinking:

# Test 1: Basic round-robin (no reservations)
selected1 -> relay1
selected2 -> relay2  
selected3 -> relay3
selected4 -> relay1  # Cycles back

# Test 2: Prioritizes reservations
relay2.has_reservation = True
selected1 -> relay2  # Always picks relay2
selected2 -> relay2

# Test 3: Round-robin among reserved relays
relay1.has_reservation = True
relay2.has_reservation = True
selected1 -> relay1  # Round-robin only among reserved
selected2 -> relay2
selected3 -> relay1

Edge cases covered:

No reservations: Pure round-robin
Single reservation: Always selected
Multiple reservations: Round-robin among reserved only

Missing test cases (nice-to-have, not blockers):

Concurrent relay selection from multiple coroutines (stress test for lock)
Relay list changes during selection (add/remove relay)
Counter growth over a very large number of selections (Python ints are arbitrary precision, so there is no overflow to test, only unbounded growth)
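
A concurrent-selection stress test might be shaped like the sketch below. To keep it dependency-free it uses asyncio from the standard library as a stand-in for trio, and `RoundRobinSelector` is a simplified model of the PR's logic rather than the real transport class.

```python
import asyncio
from collections import Counter


class RoundRobinSelector:
    def __init__(self, relays: list[str]) -> None:
        self.relays = relays
        self.relay_counter = 0
        self._lock = asyncio.Lock()  # stand-in for trio.Lock()

    async def select(self) -> str:
        async with self._lock:
            index = self.relay_counter % len(self.relays)
            self.relay_counter += 1
        await asyncio.sleep(0)  # yield so that tasks interleave
        return self.relays[index]


async def stress(n_tasks: int) -> Counter:
    selector = RoundRobinSelector(["relay1", "relay2", "relay3"])
    picks = await asyncio.gather(*(selector.select() for _ in range(n_tasks)))
    return Counter(picks)


counts = asyncio.run(stress(300))
```

With 300 selections over 3 relays, an even split of 100 each indicates that no counter value was skipped or reused. A trio version would follow the same shape using a nursery.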

Comparative Analysis: Round-Robin vs. Performance-Based

| Aspect | Round-Robin (PR #972) | Performance-Based (Discussion #1043) |
| --- | --- | --- |
| Complexity | Low - Simple modulo arithmetic | High - Metrics collection, scoring, weighting |
| Implementation Time | ✅ Already complete | Estimated 2-4 weeks |
| Maintenance Overhead | Low - Minimal state | Medium-High - Metrics storage, expiry, edge cases |
| Performance Benefit | Evenly distributes load | Optimizes for quality AND distribution |
| User Experience | Good - No overloaded relays | Better - Routes to best-performing relays |
| Failure Handling | Neutral - Doesn't avoid bad relays | Self-healing - Automatically avoids slow relays |
| Resource Overhead | Minimal - Just a counter | Notable - Timing, storage, computation |
| Risk | Low - Battle-tested algorithm | Medium - New system, potential bugs |
| Backward Compatibility | ✅ No breaking changes | ✅ Can maintain fallback (if designed well) |

Recommendations

Phase 1: Merge Current PR #972

Rationale:
- Solves the immediate problem (issue #735, "Sophisticated Relay Selection in circuitv2 transport")
- Well-tested, low-risk implementation
- Significant improvement over "first available" strategy
- Provides foundation for future enhancements
Action Items:
1. ✅ Address the minor performance concern (consider caching relay categorization)
2. ✅ Ensure newsfragment mentions "round-robin" explicitly (already done)
3. Merge to production

Impact: Immediate value delivery with minimal risk.

Phase 2: Research & Design

Rationale: Performance-based selection requires careful design
Action Items:
1. Establish performance benchmarks with current round-robin implementation
2. Design metrics collection architecture:
  - What metrics to collect (latency, connection success rate, active circuits)
  - How to store metrics (in-memory with TTL? time-series DB?)
  - How to handle cold start and missing data
3. Define success criteria (e.g., "30% reduction in average connection time")
4. Create detailed design document for maintainer review
5. Prototype in a feature branch

Impact: De-risks the implementation, ensures buy-in from maintainers.

Phase 3: Implement Performance-Based Selection

Rationale: Build on proven foundation with clear design
Implementation Strategy:
- Start with Option 2: Least Connections (simplest performance-based approach)
- Validate improvement with benchmarks
- If successful, iterate to Option 3: Hybrid Approach

Feature Flag:

class RelayConfig:
    selection_strategy: Literal["round_robin", "least_connections", "hybrid"] = "round_robin"

Allows gradual rollout
Easy rollback if issues arise
A/B testing capability
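
As an illustration of how such a flag could drive selection, the sketch below maps each strategy name to a method and falls back to round-robin for anything not yet implemented. `RelaySelector` and the dict-based load input are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Strategy = Literal["round_robin", "least_connections", "hybrid"]


@dataclass
class RelayConfig:
    selection_strategy: Strategy = "round_robin"


class RelaySelector:
    def __init__(self, config: RelayConfig) -> None:
        self.config = config
        self._counter = 0
        # Map each flag value to its implementation.
        self._strategies: dict[str, Callable[[dict[str, int]], str]] = {
            "round_robin": self._round_robin,
            "least_connections": self._least_connections,
        }

    def select(self, circuits_per_relay: dict[str, int]) -> str:
        # Unknown or not-yet-implemented strategies fall back to round-robin.
        strategy = self._strategies.get(
            self.config.selection_strategy, self._round_robin
        )
        return strategy(circuits_per_relay)

    def _round_robin(self, circuits_per_relay: dict[str, int]) -> str:
        relays = list(circuits_per_relay)
        choice = relays[self._counter % len(relays)]
        self._counter += 1
        return choice

    def _least_connections(self, circuits_per_relay: dict[str, int]) -> str:
        return min(circuits_per_relay, key=circuits_per_relay.__getitem__)


load = {"relay1": 5, "relay2": 1, "relay3": 3}
```

Rolling back then only requires changing the config value, since the round-robin path stays in place.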

Impact: Incremental improvement with controlled risk.

Recommendation 2: Start with Least Connections

Rationale:

Easier to implement than hybrid approach (no complex weighting)
Provides measurable benefit over round-robin
Stepping stone to hybrid approach
Lower risk than jumping directly to hybrid

Implementation Sketch:

async def _select_relay_least_connections(self, peer_info: PeerInfo) -> ID | None:
    """Select relay with fewest active circuits."""
    relays = self.discovery.get_relays()
    if not relays:
        return None
    
    # Prioritize relays with reservations (look up each relay's info once)
    relays_with_reservations = []
    for r in relays:
        info = self.discovery.get_relay_info(r)
        if info and info.has_reservation:
            relays_with_reservations.append(r)
    
    candidate_list = relays_with_reservations if relays_with_reservations else relays
    
    # Select relay with minimum active connections
    # Tie-breaking with round-robin for equal connection counts
    relay_connections = {
        relay_id: self._get_active_circuit_count(relay_id)
        for relay_id in candidate_list
    }
    
    min_connections = min(relay_connections.values())
    min_relays = [r for r, c in relay_connections.items() if c == min_connections]
    
    # Round-robin among relays with minimum connections
    async with self._relay_counter_lock:
        index = self.relay_counter % len(min_relays)
        selected = min_relays[index]
        self.relay_counter += 1
    
    return selected
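
The sketch above relies on `_get_active_circuit_count()`, which is not defined anywhere yet. One possible shape for it is a per-relay counter updated when circuits open and close. `CircuitCounter` is a hypothetical helper, and it only sees circuits opened by this node, not the relay's total load.

```python
from collections import Counter


class CircuitCounter:
    """Counts circuits this node currently has open through each relay."""

    def __init__(self) -> None:
        self._active: Counter[str] = Counter()

    def circuit_opened(self, relay_id: str) -> None:
        self._active[relay_id] += 1

    def circuit_closed(self, relay_id: str) -> None:
        # Guard against double-close so the count never goes negative.
        if self._active[relay_id] > 0:
            self._active[relay_id] -= 1

    def get_active_circuit_count(self, relay_id: str) -> int:
        return self._active[relay_id]  # unseen relays count as 0


counter = CircuitCounter()
counter.circuit_opened("relay-a")
counter.circuit_opened("relay-a")
counter.circuit_opened("relay-b")
counter.circuit_closed("relay-a")
```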

Recommendation 3: Metrics to Track

If pursuing performance-based selection, track these metrics per relay:

import time
from dataclasses import dataclass

@dataclass
class RelayPerformanceMetrics:
    relay_id: ID
    
    # Connection metrics
    total_connection_attempts: int
    successful_connections: int
    failed_connections: int
    
    # Timing metrics (exponential moving average)
    avg_connection_time_ms: float  # EMA of connection establishment time
    
    # Load metrics
    active_circuits: int
    
    # Reliability metrics
    last_successful_connection: float  # timestamp
    last_failed_connection: float  # timestamp
    
    # Computed properties
    @property
    def success_rate(self) -> float:
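        if self.total_connection_attempts == 0:
            # No data yet. Returning 0.0 also makes is_healthy False, so a
            # brand-new relay needs separate cold-start handling.
            return 0.0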
        return self.successful_connections / self.total_connection_attempts
    
    @property
    def is_healthy(self) -> bool:
        """Consider relay healthy if success rate > 80% and recent success."""
        return (
            self.success_rate > 0.8 and
            time.time() - self.last_successful_connection < 300  # 5 minutes
        )

Metrics Retention:

In-memory storage with TTL (e.g., 1 hour)
Exponential moving average for timing metrics (smooth out spikes)
Periodic cleanup of stale metrics

Recommendation 4: Acceptance Criteria

Before implementing performance-based selection, define clear success criteria:

| Metric | Current (Round-Robin) | Target (Performance-Based) |
| --- | --- | --- |
| Average connection establishment time | Baseline TBD | 20-30% reduction |
| Failed connection rate | Baseline TBD | 20% reduction |
| Load distribution variance | Low (round-robin is uniform) | Medium (optimizes for quality) |
| 99th percentile connection time | Baseline TBD | 30% reduction |
| Self-healing (avoiding failed relays) | None | Automatic within 2 attempts |

How to measure: Implement telemetry in Phase 1 (round-robin) to establish baselines.
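
Once telemetry exists, the baseline numbers in the table could be computed with something as small as the sketch below. The function names are made up for the example, and the input is assumed to be a list of recorded connection times in milliseconds.

```python
import statistics


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Mean and 99th percentile of recorded connection times."""
    # quantiles(n=100) returns 99 cut points; the last one is the p99.
    p99 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[-1]
    return {"mean_ms": statistics.fmean(latencies_ms), "p99_ms": p99}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline (0.25 means 25%)."""
    return (baseline - candidate) / baseline


baseline = summarize([float(x) for x in range(1, 101)])
```

Comparing `relative_reduction(baseline["p99_ms"], new["p99_ms"])` against the 30% target then gives a direct pass/fail check for that row.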

Risk Analysis

Risks of Current PR #972

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Counter growing without bound | Very Low | None (Python ints are arbitrary precision, so there is no overflow or wraparound) | No action needed |
| Lock contention under high concurrency | Low | Low (lock held briefly) | Monitor in production; optimize if needed |
| Relay list changes during selection | Low | Low (handled gracefully) | Existing retry logic handles this |

Overall Risk: LOW ✅ - Safe to merge

Risks of Performance-Based Selection

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Metrics collection overhead | Medium | Medium | Use sampling, EMA smoothing |
| Cold start problem (new relay, no metrics) | High | Medium | Default to round-robin for unmeasured relays |
| Metrics storage memory overhead | Medium | Low | Implement TTL, periodic cleanup |
| Scoring algorithm tuning difficulty | High | Medium | Feature flag, A/B testing |
| Avoiding relays that are temporarily slow | Medium | Low | Time-based metrics expiry |
| Complexity increases debugging difficulty | High | Medium | Comprehensive logging, observability |
Complexity increases debugging difficulty	High	Medium	Comprehensive logging, observability

Overall Risk: MEDIUM ⚠️ - Requires careful implementation and monitoring

Comparison with Industry Standards

The proposed strategies align well with industry practices:

Round-Robin (Current PR)

Used by: Nginx, HAProxy (default), AWS ELB (basic)
Pros: Simple, predictable, widely understood
Cons: Doesn't adapt to performance differences

Least Connections

Used by: Nginx, HAProxy, AWS ALB (as "least outstanding requests")
Pros: Better than round-robin for long-lived connections
Cons: Requires connection tracking

Weighted Response Time / Hybrid

Used by: AWS ALB (with target health), Envoy, Istio
Pros: Production-grade, self-optimizing
Cons: Complex, requires tuning

libp2p Context:

go-libp2p uses a similar relay selection approach but with additional DHT-based relay discovery
rust-libp2p implements reservation prioritization similar to this PR
py-libp2p would be on par with other implementations with the current PR, and ahead with performance-based selection

Alternative Approaches (Not in Discussion)

For completeness, here are alternative strategies not mentioned in the discussion:

1. Weighted Random Selection

Assign weights to relays based on reservation status
Randomly select with probability proportional to weight
Pro: Simple, no state needed
Con: Less predictable than round-robin

2. Power of Two Choices

Randomly pick 2 relays, select the one with fewer connections
Pro: Simple, good load distribution in large pools
Con: Requires randomness, less predictable

3. Consistent Hashing

Hash the target peer ID to select relay
Pro: Same target always uses same relay (caching benefits)
Con: Uneven distribution if target distribution is skewed
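
For reference, the "power of two choices" alternative fits in a few lines. This is a generic sketch of the technique with a dict of circuit counts as input, not something taken from the PR or from other libp2p implementations.

```python
import random


def power_of_two_choices(
    circuits_per_relay: dict[str, int], rng: random.Random
) -> str:
    """Sample two distinct relays and keep the less loaded one."""
    relays = list(circuits_per_relay)
    if len(relays) == 1:
        return relays[0]
    first, second = rng.sample(relays, 2)
    if circuits_per_relay[first] <= circuits_per_relay[second]:
        return first
    return second


rng = random.Random(0)  # seeded only to make the demo repeatable
load = {"relay1": 0, "relay2": 0, "relay3": 0, "relay4": 0}
for _ in range(1000):
    load[power_of_two_choices(load, rng)] += 1
```

The appeal is that each selection inspects only two relays, which matters more as the relay pool grows.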

Recommendation: Stick with the proposed approaches; these alternatives don't offer significant advantages for this use case.

Detailed Feedback on Discussion Post #1043

What's Excellent ✅

Contextualizes the change: Links to existing PR, acknowledges current progress
Problem-first thinking: Clearly articulates why round-robin isn't sufficient
Multiple options: Presents trade-offs, not just one solution
Forward-thinking: Considers self-healing, adaptability
Collaborative tone: CCs maintainers, invites feedback
Structured format: Easy to read and reference

Suggestions for Improvement 📝

Add "Why Now?" Section
- Is this blocking any use case?
- Are users experiencing pain with round-robin?
- What's the opportunity cost of implementing this vs. other features?
Include Implementation Estimate
- Development time
- Testing time
- Migration/rollout plan
Add Rollback Strategy
- How to detect if performance-based selection is performing worse?
- How to quickly roll back to round-robin if needed?
Reference Benchmarks or Studies
- Any data showing performance-based selection benefits in similar systems?
- Links to research or case studies?
Address Operational Concerns
- How will this be monitored in production?
- What new metrics/alerts are needed?
- Debugging complexity increase?
Consider Simplification
- Could "Least Connections" alone solve 90% of the problem?
- Is the hybrid approach necessary for v1?

Conclusion & Final Recommendations

For PR #972 ✅

Recommendation: APPROVE & MERGE (with minor optional improvements)

Strengths:

✅ Solves the stated problem (issue #735, "Sophisticated Relay Selection in circuitv2 transport")
✅ Clean, maintainable code
✅ Thread-safe implementation
✅ Excellent test coverage
✅ Low risk, high value

Optional improvements (non-blocking):

Consider caching relay categorization to reduce get_relay_info() calls
Add stress test for concurrent selections (nice-to-have)

Next Steps:

Address any maintainer feedback
Merge to main
Monitor in production

For Discussion #1043 📊

Recommendation: PROCEED WITH PHASED APPROACH

Phase 1: Merge PR #972 ← Do this first

Delivers immediate value
Establishes foundation

Phase 2: Research & Design ← Do this next

Establish benchmarks with current implementation
Design metrics collection architecture
Get maintainer buy-in on approach

Phase 3: Implement Least Connections ← Then do this

Start with simpler performance-based approach
Validate improvements with benchmarks

Summary Recommendation Matrix

| Recommended Approach | Rationale |
| --- | --- |
| Merge PR #972 immediately | Solves problem, low risk, ready now |
| Implement telemetry & benchmarks | Establishes data for future decisions |
| Consider Least Connections if data supports | Incremental improvement |
| Consider Hybrid if scale demands | Production-grade for large scale |

References

PR #972: Improve relay selection to prioritize active reservations
Discussion #1043: Improving Relay Selection Strategy Beyond Round-Robin
Issue #735: Sophisticated Relay Selection in circuitv2 transport (referenced in the PR)
libp2p Circuit Relay v2 Spec: https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md

@parth-soni07 @seetadev

seetadev · 2025-11-16T08:16:44Z

Maintainer

@yashksaini-coder : Great, appreciate the feedback. Very comprehensive indeed.

I'd recommend discussing this with @parth-soni07 and arriving at a conclusion on the PR together.

1 reply

yashksaini-coder Nov 16, 2025

Understood, sir. As of now the PR looks good to us. I have already reviewed it; we are also waiting for Pacrob, since he requested changes.

parth-soni07 · 2025-11-23T16:55:38Z

Author

Thank you @seetadev & @yashksaini-coder for your valuable feedback. Here is how I plan to work on the relay selection optimisation PR over the coming week.

Metrics Implementation Plan: Connection Latency & Active Circuits

Overview

This section outlines the implementation plan for tracking two new metrics in py-libp2p:

Connection Latency (connection_latency_ms) - Measures time taken during initial connection attempts
Active Circuits (active_circuits) - Tracks the number of currently active circuit relay connections

Implementation Phases

The implementation is divided into 4 phases, each with specific code changes that can be tracked and tested independently.

Phase 1: Prometheus Metrics Foundation

Goal: Set up the Prometheus metrics infrastructure for both metrics.

Files to Modify:

libp2p/rcmgr/prometheus_exporter.py
libp2p/rcmgr/monitoring.py

Tasks:

✅ Add connection_latency Histogram metric to PrometheusExporter._init_metrics()
✅ Add active_circuits Gauge metric to PrometheusExporter._init_metrics()
✅ Add record_connection_latency() method to PrometheusExporter
✅ Add set_active_circuits() method to PrometheusExporter
✅ Update Monitor._export_to_prometheus() to map new metrics

Completion Criteria:

PrometheusExporter has both new metrics defined
Both recording methods work correctly
Monitor can export metrics to Prometheus

Phase 2: Connection Latency Tracking

Goal: Implement connection latency tracking in Swarm.

Files to Modify:

libp2p/network/swarm.py

Tasks:

✅ Add _get_transport_type() helper method
✅ Add _record_connection_latency() method
✅ Modify dial_peer() to track start time and record metrics
✅ Handle both success and failure cases

Testing:

Unit tests for helper methods
Integration tests for latency tracking in connection attempts

Completion Criteria:

Connection latency is tracked for all connection attempts
Metrics are recorded to PrometheusExporter
Both successful and failed connections are tracked

Phase 3: Active Circuits Tracking

Goal: Implement active circuits tracking in CircuitV2Protocol.

Files to Modify:

libp2p/relay/circuit_v2/protocol.py
libp2p/relay/circuit_v2/transport.py (or where protocol is instantiated)

Tasks:

✅ Add monitor parameter to CircuitV2Protocol.__init__()
✅ Add set_monitor() method to CircuitV2Protocol
✅ Add _update_active_circuits_metric() method
✅ Update _handle_connect() to call metric update when circuits are added
✅ Update _handle_connect() to call metric update when circuits fail
✅ Update _relay_data() cleanup to call metric update when circuits close
✅ Update run() cleanup to call metric update
✅ Wire monitor to CircuitV2Protocol during initialization

Testing:

Unit tests for metric updates
Integration tests for circuit lifecycle tracking

Completion Criteria:

Active circuits count is tracked accurately
Metrics update when circuits are created/destroyed
Monitor receives all circuit lifecycle events

Phase 4: Integration & Testing

Goal: Complete integration, add comprehensive tests, and verify end-to-end functionality.

Files to Create/Modify:

tests/core/network/test_swarm_metrics.py (new)
tests/core/relay/test_circuit_v2_metrics.py (new)
Update existing test files as needed

Tasks:

✅ Create test file for connection latency metrics
✅ Create test file for active circuits metrics
✅ Add integration tests for both metrics
✅ Verify Prometheus export works end-to-end
✅ Test with actual Prometheus server (optional)
✅ Update documentation/examples if needed

Testing:

Comprehensive unit tests
Integration tests
End-to-end tests with Prometheus

Completion Criteria:

All tests pass
Metrics are correctly exported to Prometheus
Documentation is complete
Code follows codebase patterns and style

Detailed Implementation Steps

The following sections provide detailed code examples for each component.

1. Connection Latency Tracking

Overview

Track the time taken to establish connections, including both successful and failed attempts. This helps monitor network performance and identify connection issues.

Metric Details

Name: libp2p_connection_latency_ms
Type: HISTOGRAM (to capture distribution of latencies)
Unit: Milliseconds
Labels:
- transport: Transport type (tcp, quic, circuit_v2, websocket, etc.)
- success: Whether connection succeeded (true/false)
- peer_id: Truncated peer ID (first 16 chars for privacy)
- error_type: Type of error (only for failed connections)
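
For readers less familiar with histogram metrics, the sketch below shows how cumulative buckets accumulate observations, in the style Prometheus uses. It is a dependency-free model for illustration only; `LatencyHistogram` and the bucket bounds are assumptions, and the real exporter would rely on its Prometheus client instead. Note that every distinct label combination gets its own set of buckets, which is worth keeping in mind for a per-peer label.

```python
import bisect


class LatencyHistogram:
    """Cumulative-bucket histogram in the style Prometheus uses."""

    def __init__(self, upper_bounds_ms: list[float]) -> None:
        self.bounds = sorted(upper_bounds_ms)
        # One count per finite bound plus a final +Inf bucket.
        self._counts = [0] * (len(self.bounds) + 1)
        self.sum_ms = 0.0

    def observe(self, latency_ms: float) -> None:
        # bisect_left finds the first bound >= latency_ms ("le" semantics).
        self._counts[bisect.bisect_left(self.bounds, latency_ms)] += 1
        self.sum_ms += latency_ms

    def cumulative(self) -> dict[str, int]:
        """Bucket label -> number of observations at or below that bound."""
        result: dict[str, int] = {}
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self._counts):
            running += count
            result["+Inf" if bound == float("inf") else str(bound)] = running
        return result


hist = LatencyHistogram([50.0, 100.0, 500.0])  # assumed bucket bounds
for sample in (12.0, 50.0, 75.0, 320.0, 2000.0):
    hist.observe(sample)
```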

Implementation Location

File: libp2p/network/swarm.py

Changes Required

1.1 Add Helper Methods

def _get_transport_type(self) -> str:
    """Get transport type name for metrics."""
    if isinstance(self.transport, QUICTransport):
        return "quic"
    # Fall back to matching on the class name for the remaining transports.
    class_name = type(self.transport).__name__.lower()
    if "circuit" in class_name or "relay" in class_name:
        return "circuit_v2"
    if "websocket" in class_name:
        return "websocket"
    if "tcp" in class_name:
        return "tcp"
    return "unknown"

def _record_connection_latency(
    self,
    peer_id: str,
    transport: str,
    success: bool,
    latency_ms: float,
    error: str | None = None,
) -> None:
    """Record connection latency metric."""
    # Access PrometheusExporter through ResourceManager if available
    if self._resource_manager is None:
        return
    exporter = self._resource_manager.prometheus_exporter
    if exporter is None:
        return

    # Reduce the error string to a small set of values
    # (Prometheus labels should be low cardinality, so we keep it simple)
    error_type: str | None = None
    if error:
        lowered = error.lower()
        if "timeout" in lowered:
            error_type = "timeout"
        elif "refused" in lowered:
            error_type = "connection_refused"
        elif "unreachable" in lowered:
            error_type = "unreachable"

    # error_type is passed through so the label is not silently dropped.
    # This assumes record_connection_latency() (planned in Phase 1) accepts
    # it as an optional keyword argument. peer_id is currently not attached
    # as a label.
    exporter.record_connection_latency(
        transport=transport,
        success=success,
        latency_ms=latency_ms,
        error_type=error_type,
    )

1.2 Modify `dial_peer()` Method

async def dial_peer(self, peer_id: ID) -> list[INetConn]:
    """
    Try to create connections to peer_id with enhanced retry logic.
    
    Now tracks connection latency metrics.
    """
    # Check if we already have connections
    existing_connections = self.get_connections(peer_id)
    if existing_connections:
        logger.debug("Reusing existing connections to peer %s", peer_id)
        return existing_connections

    logger.debug("attempting to dial peer %s", peer_id)
    
    # Track connection latency
    connection_start_time = trio.current_time()  # Use trio.current_time() for async timing
    transport_type = self._get_transport_type()

    try:
        # Get peer info from peer store
        addrs = self.peerstore.addrs(peer_id)
    except PeerStoreError as error:
        raise SwarmException(f"No known addresses to peer {peer_id}") from error

    if not addrs:
        raise SwarmException(f"No known addresses to peer {peer_id}")

    connections = []
    exceptions: list[SwarmException] = []

    # Enhanced: Try all known addresses with retry logic
    for multiaddr in addrs:
        try:
            connection = await self._dial_with_retry(multiaddr, peer_id)
            connections.append(connection)

            # Limit number of connections per peer
            if len(connections) >= self.connection_config.max_connections_per_peer:
                break

        except SwarmException as e:
            exceptions.append(e)
            logger.debug(
                "encountered swarm exception when trying to connect to %s, "
                "trying next address...",
                multiaddr,
                exc_info=e,
            )

    if not connections:
        # Tried all addresses, raising exception.
        latency_ms = (trio.current_time() - connection_start_time) * 1000
        self._record_connection_latency(
            peer_id=str(peer_id),
            transport=transport_type,
            success=False,
            latency_ms=latency_ms,
            error=str(MultiError(exceptions))
        )
        raise SwarmException(
            f"unable to connect to {peer_id}, no addresses established a "
            "successful connection (with exceptions)"
        ) from MultiError(exceptions)

    # Record successful connection latency
    latency_ms = (trio.current_time() - connection_start_time) * 1000
    self._record_connection_latency(
        peer_id=str(peer_id),
        transport=transport_type,
        success=True,
        latency_ms=latency_ms
    )

    return connections

2. Active Circuits Tracking

Overview

Track the number of currently active circuit relay connections. This helps monitor relay usage and capacity.

Metric Details

Name: libp2p_active_circuits
Type: GAUGE (current value that can go up or down)
Unit: Count
Labels: None (global count)

Implementation Location

File: libp2p/relay/circuit_v2/protocol.py

Changes Required

2.1 Add Monitor Reference and Helper Method

class CircuitV2Protocol(Service):
    def __init__(
        self,
        host: IHost,
        limits: RelayLimits | None = None,
        allow_hop: bool = False,
        read_timeout: int = DEFAULT_PROTOCOL_READ_TIMEOUT,
        write_timeout: int = DEFAULT_PROTOCOL_WRITE_TIMEOUT,
        close_timeout: int = DEFAULT_PROTOCOL_CLOSE_TIMEOUT,
        monitor: Monitor | None = None,  # Add monitor parameter
    ) -> None:
        # ... existing initialization code ...
        self.resource_manager = RelayResourceManager(self.limits)
        self._active_relays: dict[ID, tuple[INetStream, INetStream | None]] = {}
        self._monitor = monitor  # Store monitor reference
        self.event_started = trio.Event()
    
    def set_monitor(self, monitor: Monitor) -> None:
        """Set monitor for metrics tracking."""
        self._monitor = monitor
    
    def _update_active_circuits_metric(self) -> None:
        """Update the active circuits gauge metric."""
        if self._monitor is None:
            return
        
        from libp2p.rcmgr.monitoring import Metric, MetricType
        
        active_count = len(self._active_relays)
        metric = Metric(
            name="libp2p_active_circuits",
            value=active_count,
            metric_type=MetricType.GAUGE,
            labels={},
            help_text="Number of currently active circuit relay connections",
        )
        self._monitor.record_metric(metric)

2.2 Update `_handle_connect()` Method

async def _handle_connect(self, stream: INetStream, msg: HopMessage) -> None:
    """Handle a connect request."""
    # ... existing code ...
    
    try:
        # ... existing connection logic ...
        
        # Update active relays
        self._active_relays[peer_id] = (stream, dst_stream)
        self._update_active_circuits_metric()  # Add this line
        
        # ... rest of existing code ...
        
    except (trio.TooSlowError, ConnectionError) as e:
        # ... existing error handling ...
        if peer_id in self._active_relays:
            del self._active_relays[peer_id]
            self._update_active_circuits_metric()  # Add this line
        # ... rest of error handling ...
    except Exception as e:
        # ... existing error handling ...
        if peer_id in self._active_relays:
            del self._active_relays[peer_id]
            self._update_active_circuits_metric()  # Add this line
        # ... rest of error handling ...

2.3 Update `_relay_data()` Method

async def _relay_data(
    self,
    src_stream: INetStream,
    dst_stream: INetStream,
    peer_id: ID,
) -> None:
    """Relay data between source and destination streams."""
    try:
        # ... existing relay logic ...
    finally:
        # Clean up when relay ends
        # Note: peer_id might be source or destination, need to find and remove
        # This depends on your cleanup logic
        if peer_id in self._active_relays:
            del self._active_relays[peer_id]
            self._update_active_circuits_metric()  # Add this line

2.4 Update Cleanup in `run()` Method

async def run(self, *, task_status: Any = trio.TASK_STATUS_IGNORED) -> None:
    """Run the protocol service."""
    try:
        # ... existing code ...
    finally:
        # Clean up any active relay connections
        for src_stream, dst_stream in self._active_relays.values():
            await self._close_stream(src_stream)
            await self._close_stream(dst_stream)
        self._active_relays.clear()
        self._update_active_circuits_metric()  # Add this line (should be 0)
        # ... rest of cleanup ...

3. Prometheus Integration

3.1 Add Prometheus Metrics to PrometheusExporter

File: libp2p/rcmgr/prometheus_exporter.py

The codebase uses PrometheusExporter for Prometheus metrics. We need to add our new metrics there following the existing pattern.

3.1.1 Add Metric Definitions in `_init_metrics()`

def _init_metrics(self) -> None:
    """Initialize Prometheus metrics compatible with go-libp2p format."""
    # ... existing metrics ...
    
    # Connection latency histogram - NEW
    # Buckets: 10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
    latency_buckets = [10, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
    self.connection_latency = Histogram(
        "libp2p_connection_latency_ms",
        "Connection establishment latency in milliseconds",
        ["transport", "success"],
        buckets=latency_buckets,
        registry=self.registry,
    )
    
    # Active circuits gauge - NEW
    self.active_circuits = Gauge(
        "libp2p_active_circuits",
        "Number of currently active circuit relay connections",
        [],
        registry=self.registry,
    )
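To make the bucket boundaries above concrete, here is a small stdlib-only sketch of how observations would fall into them. It does not use prometheus_client; `bucket_counts` is a hypothetical helper that only mimics Prometheus's cumulative `le` semantics, where an observation is counted in every bucket whose upper bound is greater than or equal to the observed value, plus `+Inf`.

```python
from bisect import bisect_left

# Same upper bounds (in milliseconds) as latency_buckets in the plan above.
LATENCY_BUCKETS_MS = [10, 50, 100, 250, 500, 1000, 2500, 5000, 10000]


def bucket_counts(observations_ms: list[float]) -> dict[str, int]:
    """Return cumulative counts per upper bound (hypothetical helper)."""
    # per_bucket[i] counts observations whose smallest fitting bound is
    # LATENCY_BUCKETS_MS[i]; the final slot is the +Inf overflow bucket.
    per_bucket = [0] * (len(LATENCY_BUCKETS_MS) + 1)
    for value in observations_ms:
        # bisect_left keeps a value equal to a bound inside that bound's
        # bucket, matching the inclusive "le" (less-or-equal) rule.
        per_bucket[bisect_left(LATENCY_BUCKETS_MS, value)] += 1

    cumulative: dict[str, int] = {}
    running = 0
    for bound, count in zip(LATENCY_BUCKETS_MS, per_bucket):
        running += count
        cumulative[str(bound)] = running
    cumulative["+Inf"] = running + per_bucket[-1]
    return cumulative


# An 8 ms dial, two mid-range dials, and one that exceeds every bound.
counts = bucket_counts([8.0, 120.0, 250.0, 12000.0])
```

With these bounds, a dial slower than 10 s is visible only in `+Inf`, so the top bucket may need revisiting if relay dials routinely time out later than that.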

3.1.2 Add Recording Methods

def record_connection_latency(
    self,
    transport: str,
    success: bool,
    latency_ms: float,
) -> None:
    """
    Record connection latency metric.

    Args:
        transport: Transport type (tcp, quic, circuit_v2, etc.)
        success: Whether connection succeeded
        latency_ms: Latency in milliseconds
    """
    with self._lock:
        self.connection_latency.labels(
            transport=transport, success="true" if success else "false"
        ).observe(latency_ms)

def set_active_circuits(self, count: int) -> None:
    """
    Set the active circuits count.

    Args:
        count: Number of active circuits
    """
    with self._lock:
        self.active_circuits.set(count)

3.2 Update Monitor to Export to Prometheus

File: libp2p/rcmgr/monitoring.py

Update _export_to_prometheus() to handle the new metrics:

def _export_to_prometheus(self, metric: Metric) -> None:
    """Export metric to Prometheus if exporter is available."""
    if not self.prometheus_exporter:
        return

    try:
        # ... existing mappings ...
        
        # NEW: Map connection latency metrics
        elif metric.name == "libp2p_connection_latency_ms":
            transport = metric.labels.get("transport", "unknown")
            success = metric.labels.get("success", "false").lower() == "true"
            self.prometheus_exporter.record_connection_latency(
                transport=transport,
                success=success,
                latency_ms=metric.value
            )
        
        # NEW: Map active circuits metrics
        elif metric.name == "libp2p_active_circuits":
            self.prometheus_exporter.set_active_circuits(int(metric.value))
            
    except Exception as e:
        # Don't let Prometheus export failures affect main monitoring
        print(f"Warning: Failed to export metric to Prometheus: {e}")

3.3 Wire Monitor to CircuitV2Protocol

File: libp2p/relay/circuit_v2/transport.py, or wherever CircuitV2Protocol is instantiated

Since CircuitV2Protocol needs access to metrics, there are two options:

Option 1: Pass Monitor directly when creating CircuitV2Protocol

# When creating CircuitV2Protocol
monitor = Monitor(enable_prometheus=True)  # Created separately
protocol = CircuitV2Protocol(host, limits, allow_hop=True, monitor=monitor)

Option 2: Access through host's network ResourceManager (if available)

# In transport initialization or where protocol is created
def __init__(
    self,
    host: IHost,
    protocol: CircuitV2Protocol,
    config: RelayConfig,
) -> None:
    # ... existing initialization ...
    
    # Try to get Monitor through host's network ResourceManager
    # Note: This requires Monitor to be accessible, which may need to be
    # passed separately or stored in a shared location
    if hasattr(host, 'get_network'):
        network = host.get_network()
        if hasattr(network, '_resource_manager') and network._resource_manager is not None:
            rcmgr = network._resource_manager
            # If Monitor is stored somewhere accessible, wire it here
            # For now, we'll need to pass Monitor separately
            pass

4. Testing

4.1 Test Connection Latency Tracking

File: tests/core/network/test_swarm_metrics.py (new file)

import trio
import pytest
from libp2p.network.swarm import Swarm
from libp2p.rcmgr.monitoring import Monitor
from libp2p.rcmgr.manager import ResourceManager

@pytest.mark.trio
async def test_connection_latency_metric():
    """Test that connection latency is recorded."""
    monitor = Monitor()
    rcmgr = ResourceManager(monitor=monitor)
    
    # Create swarm with resource manager
    swarm = Swarm(...)
    swarm.set_resource_manager(rcmgr)
    
    # Attempt connection (mock or real)
    try:
        await swarm.dial_peer(peer_id)
    except Exception:
        pass  # Expected to fail in test environment
    
    # Check metrics
    metrics = monitor.metrics_buffer
    latency_metrics = [m for m in metrics if m.name == "libp2p_connection_latency_ms"]
    
    assert len(latency_metrics) > 0
    assert latency_metrics[0].value > 0
    assert "transport" in latency_metrics[0].labels
    assert "success" in latency_metrics[0].labels

4.2 Test Active Circuits Tracking

File: tests/core/relay/test_circuit_v2_metrics.py (new file)

import pytest
from libp2p.relay.circuit_v2.protocol import CircuitV2Protocol
from libp2p.rcmgr.monitoring import Monitor

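def test_active_circuits_metric(host, peer_id, stream, dst_stream):
    """Test that active circuits count is tracked."""
    # host, peer_id, stream and dst_stream are assumed to be pytest fixtures
    # (or mocks) provided by the test module; this plan does not define them.
    monitor = Monitor()
    protocol = CircuitV2Protocol(host, monitor=monitor)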
    
    # Initially should be 0
    protocol._update_active_circuits_metric()
    metrics = monitor.metrics_buffer
    circuit_metrics = [m for m in metrics if m.name == "libp2p_active_circuits"]
    assert len(circuit_metrics) > 0
    assert circuit_metrics[-1].value == 0
    
    # Simulate adding a circuit
    protocol._active_relays[peer_id] = (stream, dst_stream)
    protocol._update_active_circuits_metric()
    
    # Check updated metric
    circuit_metrics = [m for m in monitor.metrics_buffer if m.name == "libp2p_active_circuits"]
    assert circuit_metrics[-1].value == 1

Phase-by-Phase Checklist

Phase 1: Prometheus Metrics Foundation

Add connection_latency Histogram to PrometheusExporter._init_metrics()
Add active_circuits Gauge to PrometheusExporter._init_metrics()
Add record_connection_latency() method to PrometheusExporter
Add set_active_circuits() method to PrometheusExporter
Update Monitor._export_to_prometheus() to map new metrics
Add unit tests for PrometheusExporter methods

Phase 2: Connection Latency Tracking

Add _get_transport_type() method to Swarm
Add _record_connection_latency() method to Swarm
Modify dial_peer() to track start time and record metrics
Handle both success and failure cases
Add unit tests for helper methods
Add integration tests for latency tracking

Phase 3: Active Circuits Tracking

Phase 4: Integration & Testing

Create tests/core/network/test_swarm_metrics.py
Create tests/core/relay/test_circuit_v2_metrics.py
Add comprehensive integration tests
Update documentation
Code review and style verification

Notes

Privacy: Peer IDs are truncated to 16 characters in metrics to protect privacy
Performance: Metric recording is non-blocking and should have minimal performance impact
Optional: Monitor can be None - metrics will simply not be recorded if monitor is unavailable
Time Tracking: Use trio.current_time() for async timing in trio contexts (like Swarm), matching the pattern in bootstrap.py
Defensive Checks: Always check if self._resource_manager is not None: before accessing, following the codebase pattern

CCing @paschal533 for any feedback on the implementation plan.

paschal533 · 2025-11-24T11:37:17Z

paschal533
Nov 24, 2025

Hi @parth-soni07, thanks for putting together this detailed implementation plan. I can see you've thought through the metrics tracking carefully. Here are my thoughts:

Overall approach:
The phased implementation strategy is solid... breaking it down into Prometheus setup, connection latency, active circuits, and testing makes sense. However, I'm concerned we might be overengineering this compared to what we actually need for relay selection.

My main concern:
The implementation plan focuses heavily on Prometheus metrics infrastructure, but for relay selection purposes, we don't necessarily need full Prometheus integration right away. What we really need is:

A way to track connection latencies per relay
A count of active circuits per relay
This data needs to be available in-memory for the selection algorithm to use

The Prometheus export can come later as observability, but it shouldn't block the core relay selection improvements.

Specific feedback:

Connection latency tracking: You're tracking this at the Swarm level, but we need it tracked per relay for selection purposes. Consider adding a RelayPerformanceTracker class that maintains metrics per relay ID, with something like exponential moving average for latency. This would live closer to the relay selection logic.
Active circuits: Good that you're tracking this in CircuitV2Protocol, but again, we need this broken down per relay, not just as a global count. The selection algorithm needs to know "Relay A has 5 active circuits, Relay B has 2" to make decisions.
Architecture suggestion: Consider creating a dedicated RelayMetrics or RelayPerformanceTracker class that:
- Stores per-relay latency (EMA)
- Stores per-relay active circuit counts
- Provides methods like get_best_relay() or get_relay_score()
- Can optionally export to Prometheus
Testing strategy: Your test plan is comprehensive for Prometheus integration, but I'd like to see more tests for the actual relay selection behavior... things like "given relays with different latencies, does it pick the fastest one?" or "does it avoid relays with too many active circuits?"
Simplification opportunity: Can we start with just in-memory metrics and basic selection logic? Get that working and tested first, then add Prometheus as a Phase 2 enhancement? This would let us iterate faster on the actual selection algorithm.

What I'd like to see next:
Could you sketch out what the RelayPerformanceTracker class would look like? Something with methods like:

record_connection_attempt(relay_id, latency_ms, success)
record_circuit_opened(relay_id) / record_circuit_closed(relay_id)
get_relay_score(relay_id) - returns a score based on latency and load
select_best_relay(available_relays) - implements the selection strategy
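A minimal in-memory sketch of those four methods could look like the following. Only the class and method names come from the list above; the scoring formula, the default values for `ema_alpha`, `circuit_penalty_ms` and `unknown_relay_score`, and the use of plain strings as relay IDs are assumptions made for illustration. Lower scores are treated as better.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RelayStats:
    latency_ema_ms: float = -1.0  # negative means "no latency sample yet"
    active_circuits: int = 0
    successes: int = 0
    failures: int = 0


class RelayPerformanceTracker:
    """Per-relay metrics kept in memory; a lower score is a better relay."""

    def __init__(
        self,
        ema_alpha: float = 0.3,  # assumed smoothing factor
        circuit_penalty_ms: float = 25.0,  # assumed cost per active circuit
        unknown_relay_score: float = 500.0,  # assumed score without data
    ) -> None:
        self.ema_alpha = ema_alpha
        self.circuit_penalty_ms = circuit_penalty_ms
        self.unknown_relay_score = unknown_relay_score
        self._stats: dict[str, RelayStats] = {}

    def _get(self, relay_id: str) -> RelayStats:
        return self._stats.setdefault(relay_id, RelayStats())

    def record_connection_attempt(
        self, relay_id: str, latency_ms: float, success: bool
    ) -> None:
        stats = self._get(relay_id)
        if not success:
            stats.failures += 1
            return
        stats.successes += 1
        if stats.latency_ema_ms < 0.0:
            stats.latency_ema_ms = latency_ms  # first sample seeds the EMA
        else:
            stats.latency_ema_ms = (
                self.ema_alpha * latency_ms
                + (1.0 - self.ema_alpha) * stats.latency_ema_ms
            )

    def record_circuit_opened(self, relay_id: str) -> None:
        self._get(relay_id).active_circuits += 1

    def record_circuit_closed(self, relay_id: str) -> None:
        stats = self._get(relay_id)
        stats.active_circuits = max(0, stats.active_circuits - 1)

    def get_relay_score(self, relay_id: str) -> float:
        stats = self._stats.get(relay_id)
        if stats is None or stats.latency_ema_ms < 0.0:
            return self.unknown_relay_score
        # At least one success exists here, so the division is safe.
        failure_rate = stats.failures / (stats.successes + stats.failures)
        load_ms = stats.active_circuits * self.circuit_penalty_ms
        return (stats.latency_ema_ms + load_ms) * (1.0 + failure_rate)

    def select_best_relay(self, available_relays: list[str]) -> str | None:
        if not available_relays:
            return None
        return min(available_relays, key=self.get_relay_score)


tracker = RelayPerformanceTracker()
tracker.record_connection_attempt("relay-a", 40.0, True)
tracker.record_connection_attempt("relay-b", 120.0, True)
best_by_latency = tracker.select_best_relay(["relay-a", "relay-b"])

# Five open circuits add 125 ms of penalty to relay-a (40 + 125 > 120).
for _ in range(5):
    tracker.record_circuit_opened("relay-a")
best_under_load = tracker.select_best_relay(["relay-a", "relay-b"])
```

In this sketch the faster relay wins until its load penalty outweighs its latency advantage, which is the kind of behavior the selection tests mentioned above would need to pin down.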

Then we can discuss how this integrates with the existing _select_relay() method and where it fits in the architecture.

Does this make sense?

parth-soni07 · 2025-12-01T08:01:14Z

parth-soni07
Dec 1, 2025
Author

parth-soni07 · 2025-12-02T07:48:43Z

parth-soni07
Dec 2, 2025
Author

In response to the feedback shared by @paschal533 on the changes made at #972 (comment), I have made the following changes:

1. EMA Initialization - ✅ Fixed

Issue: Using 0.0 as a sentinel value would incorrectly treat legitimate 0ms latency (e.g., localhost) as "no data yet".

Solution Implemented:

Changed default value from 0.0 to -1.0 as a sentinel value
Updated first measurement check from if stats.latency_ema_ms == 0.0: to if stats.latency_ema_ms < 0.0:
Added handling in get_relay_score() to return unknown_relay_score when latency_ema_ms < 0.0 (no latency data yet)

Changes:

RelayStats.latency_ema_ms now defaults to -1.0 with comment: # Exponential moving average latency (-1.0 = no data yet)
Added test test_record_connection_attempt_zero_latency() to verify 0ms latency is handled correctly
Added test test_get_relay_score_no_latency_data() to verify sentinel value behavior

Result: Legitimate 0ms latency is now properly tracked and won't be confused with "no data yet".
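The difference between the two checks can be shown with a short standalone sketch. The function names and the `EMA_ALPHA` value are assumptions for illustration, not the code from the PR; only the `== 0.0` and `< 0.0` comparisons are taken from the description above.

```python
EMA_ALPHA = 0.3  # assumed smoothing factor; the real value is not stated here


def update_ema_zero_sentinel(ema_ms: float, sample_ms: float) -> float:
    """Old behaviour: 0.0 doubles as "no data yet"."""
    if ema_ms == 0.0:
        return sample_ms  # also fires after a genuine 0 ms measurement
    return EMA_ALPHA * sample_ms + (1 - EMA_ALPHA) * ema_ms


def update_ema_negative_sentinel(ema_ms: float, sample_ms: float) -> float:
    """Fixed behaviour: only a negative value means "no data yet"."""
    if ema_ms < 0.0:
        return sample_ms
    return EMA_ALPHA * sample_ms + (1 - EMA_ALPHA) * ema_ms


# A genuine 0 ms localhost dial followed by a slow 100 ms dial.
samples_ms = [0.0, 100.0]

old_ema = 0.0   # old default
new_ema = -1.0  # new default
for sample in samples_ms:
    old_ema = update_ema_zero_sentinel(old_ema, sample)
    new_ema = update_ema_negative_sentinel(new_ema, sample)
```

With the old check the 0 ms history is discarded and the average jumps straight to 100 ms. With the new check the second sample is blended in, giving 0.3 × 100 + 0.7 × 0 = 30 ms under the assumed alpha.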

2. Circuit Closure Tracking - ✅ Fixed

Issue: record_circuit_closed() existed but was never called, causing circuit counts to grow unbounded and eventually making all relays appear overloaded.

Solution Implemented:
Created a TrackedRawConnection wrapper class that automatically tracks circuit closure:

Wrapper class: TrackedRawConnection wraps RawConnection and intercepts close() calls
Automatic tracking: When close() is called, it automatically calls record_circuit_closed() before delegating to the wrapped connection
Double-close protection: Uses a _closed flag to ensure the circuit count is decremented only once, even if close() is called multiple times
Transparent delegation: All other methods/attributes delegate to the wrapped connection via __getattr__

Changes:

Added TrackedRawConnection class in libp2p/relay/circuit_v2/transport.py
Updated dial_peer_info() to return TrackedRawConnection instead of RawConnection
Changed return type to IRawConnection for type compatibility
Added 3 comprehensive unit tests covering:
- Normal closure tracking
- Double-close protection
- Method delegation

Result: Circuit closures are now automatically tracked for all connections created through dial_peer_info(), preventing unbounded growth of circuit counts. The wrapper handles all closure paths (normal, error, timeout) since it intercepts the close() method.

Note: This solution is non-intrusive: it doesn't modify the core RawConnection class and is transparent to callers (it returns the IRawConnection interface).
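The wrapper pattern described above can be sketched roughly as follows. This is not the code from the PR: `FakeRawConnection` and `FakeTracker` are hypothetical stand-ins, the constructor arguments are assumed, and asyncio is used only to drive the demo with the standard library (the real code runs under trio). Whether the real wrapper forwards a second close() to the wrapped connection is not stated in the thread; this sketch forwards it every time.

```python
import asyncio


class FakeRawConnection:
    """Hypothetical stand-in for RawConnection, used only in this demo."""

    def __init__(self) -> None:
        self.close_calls = 0
        self.is_initiator = True

    async def close(self) -> None:
        self.close_calls += 1


class FakeTracker:
    """Hypothetical stand-in for the performance tracker."""

    def __init__(self) -> None:
        self.closed: list[str] = []

    def record_circuit_closed(self, relay_id: str) -> None:
        self.closed.append(relay_id)


class TrackedRawConnection:
    """Wraps a connection and reports its closure to the tracker once."""

    def __init__(self, wrapped, relay_id: str, tracker) -> None:
        self._wrapped = wrapped
        self._relay_id = relay_id
        self._tracker = tracker
        self._closed = False

    async def close(self) -> None:
        # Record the closure only on the first call, then delegate.
        if not self._closed:
            self._closed = True
            self._tracker.record_circuit_closed(self._relay_id)
        await self._wrapped.close()

    def __getattr__(self, name: str):
        # Called only when normal lookup fails, so every attribute not
        # defined on the wrapper is forwarded to the wrapped connection.
        if name == "_wrapped":
            raise AttributeError(name)  # avoid recursion before __init__
        return getattr(self._wrapped, name)


async def demo():
    raw = FakeRawConnection()
    tracker = FakeTracker()
    conn = TrackedRawConnection(raw, "relay-a", tracker)
    await conn.close()
    await conn.close()  # second close must not decrement again
    return tracker.closed, raw.close_calls, conn.is_initiator


closed_ids, close_calls, delegated_attr = asyncio.run(demo())
```

One consequence of intercepting only close() is that a connection dropped without close() ever being called would not be counted, so the claim that all closure paths are covered depends on every path ending in close().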

3. Export Metrics Method - ✅ Working as Intended

Status: The export_metrics() method is intentionally not wired up to anything currently. It's designed for future Prometheus integration.

Current Implementation:

Returns a dictionary with relay statistics (latency, success rates, active circuits, etc.)
Ready to be called when Prometheus integration is added in a future phase
Not currently used by any monitoring system

No changes needed - this is working as designed per the original feedback to defer Prometheus integration.

Summary

Both critical issues (EMA initialization and circuit closure tracking) have been fixed with comprehensive test coverage. All tests pass (31 total: 28 performance tracker tests + 3 new wrapper tests), and all lint/type checks pass.
