
Feature Request: Traffic Router Plugin hook to delay ReplicaSet scale-down until external drain completes #4597

@billyshambrook

Description

Summary

Add a new optional method to the Traffic Router Plugin interface that allows plugins to signal whether a ReplicaSet can be safely scaled down. This would enable traffic routers that manage external systems with their own drain/shutdown semantics to prevent premature pod termination.

Use Cases

We're using Argo Rollouts with a custom traffic router plugin for Temporal Worker Versioning. Temporal manages traffic routing to workers externally based on deployment versions, and has its own "drain" lifecycle: when a version is superseded, existing workflows must complete before its workers can be safely terminated.

We've encountered two scenarios where pods are terminated before Temporal reports the version as drained:

Scenario 1: Full Promote

  1. Version A is stable (100% traffic)
  2. Version B is deployed as canary, progresses through steps
  3. User clicks "Promote Full"
  4. Argo shifts 100% traffic to B, B becomes new stable
  5. Argo starts scaleDownDelaySeconds timer for A's ReplicaSet
  6. Timer expires → A's pods are deleted
  7. Problem: Temporal workflows are still running on A's workers

The scaleDownDelaySeconds timer runs independently and is not gated by traffic router plugin responses. Even if our plugin's VerifyWeight is waiting for drain to complete, the scale-down proceeds when the timer expires.

Scenario 2: Rainbow Deployment Abort

  1. Version A is stable (100% traffic)
  2. Version B is deployed as canary, reaches 25%
  3. Version C is deployed before B completes
  4. Argo correctly starts draining B's traffic and marks B's ReplicaSet for scale-down
  5. Problem: Argo scales down B's ReplicaSet immediately, killing pods while Temporal workflows are still running on those workers

In both scenarios, workflows can run for hours, so time-based solutions don't work for our case.

What We've Tried

| Approach | Why It Doesn't Work |
| --- | --- |
| `scaleDownDelaySeconds` | Delays scale-down but doesn't wait for actual drain completion. The timer is fixed and not gated by any external condition. |
| `terminationGracePeriodSeconds` + `preStop` hook | We use KEDA to scale workers based on queue depth. Pods in the `Terminating` state can't be "un-terminated" if KEDA needs to scale up. |
| Traffic router plugin `VerifyWeight` | Only called for traffic operations, not before ReplicaSet scale-down. The `scaleDownDelaySeconds` timer runs independently. |
| Traffic router plugin `UpdateHash` | Returning an error here blocks the new rollout from proceeding, creating a deadlock: old pods stay, but the new rollout can't progress. |
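For context, the timer from the first row is set on the rollout's canary strategy; a minimal fragment (the 600-second value is illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      # Fixed delay before the superseded ReplicaSet is scaled down once
      # it loses traffic; not gated by any external drain condition.
      scaleDownDelaySeconds: 600
```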

Proposed Solution

Add an optional method to the Traffic Router Plugin interface:

```go
// CanScaleDown is called before scaling down a ReplicaSet.
// Plugins can return false to delay scale-down until an external condition is met.
// This is useful for traffic routers that manage external systems with drain semantics.
//
// Parameters:
//   - rollout: The rollout being processed
//   - replicaSetHash: The hash of the ReplicaSet being considered for scale-down
//
// Returns:
//   - canScaleDown: true if the ReplicaSet can be safely scaled down
//   - message: optional message explaining why scale-down is delayed (for status/events)
//   - error: if an error occurred checking the condition
type CanScaleDown func(
    rollout *v1alpha1.Rollout,
    replicaSetHash string,
) (canScaleDown bool, message string, err error)
```

The Argo Rollouts controller would call this method before scaling down any ReplicaSet managed by a traffic router plugin. If canScaleDown returns false, the controller would:

  1. Skip scaling down that ReplicaSet for this reconciliation cycle
  2. Optionally surface the message in rollout status/events
  3. Retry on the next reconciliation

This check would occur after scaleDownDelaySeconds expires but before actually scaling down the ReplicaSet, giving plugins the final say on whether scale-down is safe.
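A minimal sketch of what the controller-side gate could look like. Everything here is hypothetical (`scaleDownDecision`, `shouldScaleDown`, and the plugin callback are illustrative names, not existing Argo Rollouts code); a real integration would go through the existing RPC plugin machinery and the controller's reconcile loop:

```go
package main

import "fmt"

// scaleDownDecision is a hypothetical result of a CanScaleDown call.
type scaleDownDecision struct {
	allowed bool
	message string
}

// canScaleDownFunc mirrors the proposed optional plugin method. Plugins
// that do not implement it would be treated as always allowing scale-down.
type canScaleDownFunc func(replicaSetHash string) (scaleDownDecision, error)

// shouldScaleDown runs after scaleDownDelaySeconds has expired: the timer
// alone no longer triggers scale-down; the plugin gets the final say.
func shouldScaleDown(hash string, delayExpired bool, plugin canScaleDownFunc) (bool, string) {
	if !delayExpired {
		return false, "scaleDownDelaySeconds has not expired"
	}
	if plugin == nil {
		return true, "" // no plugin gate configured
	}
	dec, err := plugin(hash)
	if err != nil {
		// Be conservative on error: keep the ReplicaSet and retry
		// on the next reconciliation cycle.
		return false, fmt.Sprintf("plugin error: %v", err)
	}
	return dec.allowed, dec.message
}

func main() {
	drained := false
	plugin := func(hash string) (scaleDownDecision, error) {
		if drained {
			return scaleDownDecision{allowed: true}, nil
		}
		return scaleDownDecision{message: "Temporal drain in progress"}, nil
	}

	ok, msg := shouldScaleDown("6f8c9b7", true, plugin)
	fmt.Println(ok, msg) // blocked while the external drain is in progress

	drained = true
	ok, msg = shouldScaleDown("6f8c9b7", true, plugin)
	fmt.Println(ok, msg) // allowed once the plugin reports drained
}
```

The key property is that a `false` return is not an error path: it simply defers the scale-down to a later reconciliation, so the rollout itself keeps progressing.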

Alternative: Annotation-based delay

A simpler alternative would be an annotation that plugins can set on ReplicaSets to prevent scale-down:

metadata:
  annotations:
    rollouts.argoproj.io/scale-down-blocked: "true"
    rollouts.argoproj.io/scale-down-blocked-reason: "Temporal drain in progress"

The controller would skip scaling down ReplicaSets with this annotation. Traffic router plugins would be responsible for adding/removing it.
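Assuming the annotation keys above, the controller-side check reduces to a small helper (a sketch; a real implementation would read these from the ReplicaSet's `ObjectMeta`):

```go
package main

import "fmt"

// Hypothetical annotation keys from the proposal above.
const (
	scaleDownBlockedKey       = "rollouts.argoproj.io/scale-down-blocked"
	scaleDownBlockedReasonKey = "rollouts.argoproj.io/scale-down-blocked-reason"
)

// scaleDownBlocked reports whether a ReplicaSet's annotations block
// scale-down, and the plugin-supplied reason if so.
func scaleDownBlocked(annotations map[string]string) (bool, string) {
	if annotations[scaleDownBlockedKey] != "true" {
		return false, ""
	}
	return true, annotations[scaleDownBlockedReasonKey]
}

func main() {
	ann := map[string]string{
		scaleDownBlockedKey:       "true",
		scaleDownBlockedReasonKey: "Temporal drain in progress",
	}
	blocked, reason := scaleDownBlocked(ann)
	fmt.Println(blocked, reason) // true Temporal drain in progress
}
```

The trade-off versus the interface method is that the plugin must actively patch the annotation off when draining completes, whereas the `CanScaleDown` call is pulled by the controller on each reconciliation.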

Scenarios Addressed

This hook would cover all ReplicaSet scale-down scenarios:

| Scenario | Current Behavior | With Hook |
| --- | --- | --- |
| Full Promote | `scaleDownDelaySeconds` timer runs independently of traffic router drain status | Hook blocks scale-down until drain completes |
| Rainbow Abort | Old canary scaled down immediately when new canary starts | Hook blocks until old canary is drained |
| Rollback | Canary scaled down immediately | Hook can verify the canary is drained |
| Normal Completion | Works with `VerifyWeight` at 0% | Hook provides additional safety |

Impact

This would enable Argo Rollouts to integrate with external systems that have their own lifecycle management, such as:

  • Temporal worker versioning
  • Systems with long-running connections/sessions
  • Message queue consumers that need to finish processing
  • Any traffic router where "drained" is determined by an external system rather than time

I'm happy to look at contributing this if it seems like a viable solution. Thanks!


Message from the maintainers:

Need this enhancement? Give it a 👍. We prioritize the issues with the most 👍.
