[RFC]: Fault-Tolerant EP phase 2 (recovery) #27908

@tzulingk

Description

Summary

Phase 2 enables automatic recovery from persistent rank failures by integrating Phase 0-1's health monitoring with vLLM's existing scale_elastic_ep() API. When failures persist beyond a threshold, the system automatically triggers scale-down (remove failed ranks) followed by scale-up (restore capacity) to achieve self-healing without manual intervention.

Motivation

Problem: Phase 0-1 Leaves System in Degraded State

Phase 0-1 (#27774) successfully provides graceful degradation:

  • Detects failures via per-expert latency monitoring within a bounded number of forward passes (see the sketch after this list)
  • Applies a weight penalty to unhealthy experts during EPLB rebalancing
  • Continues serving with reduced capacity
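
As a rough illustration of the Phase 0-1 detection idea (the real implementation lives in #27774; the function and tensor names below are hypothetical), an expert is flagged unhealthy when its observed latency exceeds health_latency_threshold times the median latency:

import torch

def compute_health_mask(
    expert_latency: torch.Tensor,            # [num_ranks, num_local_experts]
    health_latency_threshold: float = 3.0,   # 3x median = unhealthy (see EPLBConfig below)
) -> torch.Tensor:
    """Return a boolean mask where True means the expert is considered healthy."""
    median_latency = expert_latency.median()
    # Experts slower than threshold * median are marked unhealthy.
    return expert_latency <= health_latency_threshold * median_latency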

However, critical limitations remain in production:

  1. Incomplete traffic blocking. The weight penalty is probabilistic rather than absolute, so failed experts continue to receive some traffic.
  2. No capacity restoration. Failed ranks remain in the cluster indefinitely, and redundant experts must keep compensating for them.
  3. Manual intervention is required to restore full capacity.

Proposed Change.

Solution: Automatic Recovery via Elastic Scaling

Phase 2 adds three critical capabilities:

  • Hard-blocking: Immediately stop ALL traffic to failed ranks (not just a weight penalty)
  • Automatic replacement: Trigger elastic scale-down + scale-up to restore capacity
  • Self-healing: System recovers without manual intervention

Track the failed state of each rank

@dataclass
class EPLBConfig:
    # ... existing EPLB fields ...
    
    # Phase 0-1 (from #27774)
    health_check_enabled: bool = True
    health_latency_threshold: float = 3.0  # 3x median = unhealthy
    health_penalty_factor: float = 10.0    # Weight penalty for unhealthy experts
    
    # Phase 2 (NEW) - Automatic Recovery
    recovery_enabled: bool = True
    
    # Trigger recovery if failure persists for this many forward passes
    failure_persistence_threshold: int = 1000
    
    # Per-rank threshold: trigger recovery if this fraction of experts failed
    rank_failure_ratio_threshold: float = 0.5

class EplbState:
    def __init__(self, ...):
        # ... existing state ...
        
        # Phase 2: Recovery state
        self.rank_failure_duration: torch.Tensor = torch.zeros(...)

        # Worker sets this when recovery needed, coordinator reads it
        self.pending_recovery_request: Optional[RecoveryRequest] = None

    def _update_failure_duration(self) -> None:
        """
        Track consecutive unhealthy passes per rank.
        
        Increment for unhealthy ranks, reset for recovered ranks.
        A rank is unhealthy if ALL its experts are unhealthy.
        """

Trigger recovery for serious degradation

# EplbState
def _should_trigger_recovery(self) -> bool:
    """
    Check if any rank meets EITHER condition for recovery:
    1. Failure ratio exceeds threshold (e.g., >50% experts failed)
    2. Failure persisted for threshold number of passes
    
    Returns:
        True if recovery should be triggered for at least one rank
    """

def step(self, model: MixtureOfExperts, ...) -> None:
    """EPLB step with recovery trigger."""
    
    # Phase 0-1: Health monitoring
    if self.eplb_config.health_check_enabled:
        self._update_health_mask(model)
        self._update_failure_duration()
    
    # Phase 2: Recovery trigger (NEW)
    if self.eplb_config.recovery_enabled and self._should_trigger_recovery():
        failed_ranks = self._hard_block_and_prepare_recovery()
        return  # Skip normal rebalancing
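
A possible body for _should_trigger_recovery, reusing the illustrative self.expert_health mask from the sketch above:

    def _should_trigger_recovery(self) -> bool:
        cfg = self.eplb_config
        # Condition 1: the fraction of failed experts on a rank exceeds the threshold.
        failed_ratio = (~self.expert_health).float().mean(dim=-1)  # [num_ranks]
        ratio_exceeded = failed_ratio > cfg.rank_failure_ratio_threshold
        # Condition 2: the rank has been fully unhealthy for too many forward passes.
        persistence_exceeded = (
            self.rank_failure_duration >= cfg.failure_persistence_threshold
        )
        return bool((ratio_exceeded | persistence_exceeded).any())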

Recovery

class AsyncLLM:
    def __init__(self, ...):
        # ... existing init ...
        self.last_recovery_check = time.time()
        self.recovery_check_interval = recovery_check_interval
    
    def _should_check_recovery(self) -> bool:
        """Rate limit recovery checks to every recovery_check_interval."""
        now = time.time()
        if now - self.last_recovery_check > self.recovery_check_interval:
            self.last_recovery_check = now
            return True
        return False
    
    async def _check_and_execute_recovery(self) -> None:
        """
        Poll workers for recovery triggers and execute if found.
        """
        # Poll all workers via collective_rpc
        recovery_triggers = await self.collective_rpc(
            method="get_recovery_trigger",  # ← Calls GPUWorker.get_recovery_trigger()
            timeout=0.5
        )
                
        # Execute recovery if any worker flagged failed ranks
        all_failed_ranks: set[int] = set()
        for failed_ranks in recovery_triggers:
            if failed_ranks is not None:
                all_failed_ranks.update(failed_ranks)
        if all_failed_ranks:
            await self._execute_recovery(sorted(all_failed_ranks))
    
    async def _execute_recovery(self, failed_ranks: list[int]) -> None:
        """
        Execute recovery by calling scale_elastic_ep twice.
        
        Step 1: Scale down (remove failed ranks)
        Step 2: Scale up (restore capacity)
        """

Q&A

Q: Why is downtime unavoidable for now?
A: Downtime is required due to PyTorch/NCCL architectural constraints: vLLM must destroy and recreate communication groups during scaling because PyTorch provides no API to dynamically update group membership. During cleanup_dist_env_and_memory(), all NCCL/NVSHMEM groups are destroyed, making inter-rank communication impossible. Also, DeepEP's NVSHMEM barriers (e.g., nvshmemx_barrier_all_block()) require ALL ranks to participate, so a dead rank in the group would cause hangs. Therefore, traffic must be drained before scaling begins via wait_for_requests_to_drain().

Q: Why don't we need hard-blocking during recovery?
A: Since drain is mandatory, explicit hard-blocking is unnecessary: Phase 0-1's weight penalty already reduces traffic to failed ranks during the degraded period. When recovery triggers, wait_for_requests_to_drain() naturally finishes any remaining in-flight requests on all ranks (including minimal traffic still reaching failed ranks). By the time group destruction begins, no requests are active. Thus, Phase 2 only needs to: (1) detect persistent failures, (2) trigger scale_elastic_ep() twice, (3) let the existing drain mechanism handle the rest. Hard-blocking would add complexity without benefit since the drain already guarantees zero active requests during the scaling operation.
Total downtime: drain + 2× scaling + CUDA graph recapture

Feedback Period.

No response

CC List.

@ruisearch42 @pavanimajety @GuanLuo @benchislett @xinli-sw

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
