Summary
Phase 2 enables automatic recovery from persistent rank failures by integrating Phase 0-1's health monitoring with vLLM's existing scale_elastic_ep() API. When failures persist beyond a threshold, the system automatically triggers scale-down (remove failed ranks) followed by scale-up (restore capacity) to achieve self-healing without manual intervention.
Motivation
Problem: Phase 0-1 Leaves System in Degraded State
Phase 0-1 (#27774) successfully provides graceful degradation:
- Detects failures via per-expert latency monitoring over a window of forward passes
- Applies weight penalty to unhealthy experts during EPLB rebalancing
- Continues serving with reduced capacity
However, critical limitations remain in production:
- Incomplete traffic blocking. The weight penalty is probabilistic rather than absolute, so failed experts continue to receive traffic.
- No capacity restoration. Failed ranks remain in the cluster indefinitely, with redundant experts compensating for their lost capacity.
- Manual intervention is required to restore full capacity.
Proposed Change.
Solution: Automatic Recovery via Elastic Scaling
Phase 2 adds three critical capabilities:
- Hard-blocking: Immediately stop ALL traffic to failed ranks (not just penalty)
- Automatic replacement: Trigger elastic scale-down + scale-up to restore capacity
- Self-healing: System recovers without manual intervention
Track the failed state of each rank
from dataclasses import dataclass

@dataclass
class EPLBConfig:
    # ... existing EPLB fields ...

    # Phase 0-1 (from #27774)
    health_check_enabled: bool = True
    health_latency_threshold: float = 3.0  # 3x median latency => unhealthy
    health_penalty_factor: float = 10.0    # Weight penalty for unhealthy experts

    # Phase 2 (NEW) - Automatic Recovery
    recovery_enabled: bool = True
    # Trigger recovery if a failure persists for this many forward passes
    failure_persistence_threshold: int = 1000
    # Per-rank threshold: trigger recovery if this fraction of experts has failed
    rank_failure_ratio_threshold: float = 0.5
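For illustration, a deployment that wants faster recovery could tighten the persistence threshold when constructing the config. The values below are hypothetical and assume the existing EPLB fields all have defaults:

    # Illustrative only: Phase 2 knobs on top of the Phase 0-1 health fields.
    eplb_config = EPLBConfig(
        health_check_enabled=True,
        health_latency_threshold=3.0,
        health_penalty_factor=10.0,
        recovery_enabled=True,
        failure_persistence_threshold=500,  # trigger recovery sooner than the default 1000
        rank_failure_ratio_threshold=0.5,
    )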
class EplbState:
    def __init__(self, ...):
        # ... existing state ...

        # Phase 2: Recovery state
        # Consecutive unhealthy forward passes per rank
        self.rank_failure_duration: torch.Tensor = torch.zeros(...)
        # Worker sets this when recovery is needed; the coordinator reads it
        self.pending_recovery_request: Optional[RecoveryRequest] = None

    def _update_failure_duration(self) -> None:
        """
        Track consecutive unhealthy passes per rank.
        Increment the counter for unhealthy ranks and reset it for recovered ranks.
        A rank is unhealthy if ALL of its experts are unhealthy.
        """
Trigger recovery for serious degradation
# EplbState
def _should_trigger_recovery(self) -> bool:
    """
    Check whether any rank meets either condition for recovery:
    1. Failure ratio exceeds the threshold (e.g., >50% of experts failed)
    2. Failure has persisted for the threshold number of passes

    Returns:
        True if recovery should be triggered for at least one rank.
    """

def step(self, model: MixtureOfExperts, ...) -> None:
    """EPLB step with recovery trigger."""
    # Phase 0-1: Health monitoring
    if self.eplb_config.health_check_enabled:
        self._update_health_mask(model)
        self._update_failure_duration()

    # Phase 2: Recovery trigger (NEW)
    if self.eplb_config.recovery_enabled and self._should_trigger_recovery():
        failed_ranks = self._hard_block_and_prepare_recovery()
        return  # Skip normal rebalancing
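The two trigger conditions could be evaluated per rank roughly as follows. This is a sketch reusing the assumed `expert_health_mask` tensor from above; the real Phase 0-1 state layout may differ.

    def _should_trigger_recovery(self) -> bool:
        # Sketch only. Condition 1: the fraction of failed experts on a rank
        # exceeds rank_failure_ratio_threshold.
        unhealthy_frac = 1.0 - self.expert_health_mask.float().mean(dim=-1)
        ratio_exceeded = unhealthy_frac > self.eplb_config.rank_failure_ratio_threshold
        # Condition 2: the failure has persisted for at least
        # failure_persistence_threshold forward passes.
        persisted = (self.rank_failure_duration
                     >= self.eplb_config.failure_persistence_threshold)
        # Either condition on any rank triggers recovery.
        return bool((ratio_exceeded | persisted).any())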
Recovery
class AsyncLLM:
    def __init__(self, ...):
        # ... existing init ...
        self.last_recovery_check = time.time()
        self.recovery_check_interval = recovery_check_interval

    def _should_check_recovery(self) -> bool:
        """Rate-limit recovery checks to once every recovery_check_interval seconds."""
        now = time.time()
        if now - self.last_recovery_check > self.recovery_check_interval:
            self.last_recovery_check = now
            return True
        return False

    async def _check_and_execute_recovery(self) -> None:
        """Poll workers for recovery triggers and execute recovery if any are found."""
        # Poll all workers via collective_rpc
        recovery_triggers = await self.collective_rpc(
            method="get_recovery_trigger",  # ← calls GPUWorker.get_recovery_trigger()
            timeout=0.5,
        )

        # Collect failed ranks flagged by any worker
        all_failed_ranks: set[int] = set()
        for failed_ranks in recovery_triggers:
            if failed_ranks is not None:
                all_failed_ranks.update(failed_ranks)

        if all_failed_ranks:
            await self._execute_recovery(sorted(all_failed_ranks))

    async def _execute_recovery(self, failed_ranks: list[int]) -> None:
        """
        Execute recovery by calling scale_elastic_ep twice:
        Step 1: Scale down (remove failed ranks)
        Step 2: Scale up (restore capacity)
        """
Q&A
Q: Why is downtime unavoidable for now?
A: Downtime is required due to PyTorch/NCCL architectural constraints: vLLM must destroy and recreate communication groups during scaling because PyTorch provides no API to dynamically update group membership. During cleanup_dist_env_and_memory(), all NCCL/NVSHMEM groups are destroyed, making inter-rank communication impossible. Also, DeepEP's NVSHMEM barriers (e.g., nvshmemx_barrier_all_block()) require ALL ranks to participate, so a dead rank in the group would cause hangs. Therefore, traffic must be drained before scaling begins via wait_for_requests_to_drain().
Q: Why don't we need hard-blocking during recovery?
A: Since drain is mandatory, explicit hard-blocking is unnecessary: Phase 0-1's weight penalty already reduces traffic to failed ranks during the degraded period. When recovery triggers, wait_for_requests_to_drain() naturally finishes any remaining in-flight requests on all ranks (including minimal traffic still reaching failed ranks). By the time group destruction begins, no requests are active. Thus, Phase 2 only needs to: (1) detect persistent failures, (2) trigger scale_elastic_ep() twice, (3) let the existing drain mechanism handle the rest. Hard-blocking would add complexity without benefit since the drain already guarantees zero active requests during the scaling operation.
Total downtime: drain + 2× scaling + CUDA graph recapture
Feedback Period.
No response
CC List.
@ruisearch42 @pavanimajety @GuanLuo @benchislett @xinli-sw
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.