
Feature Request: Traffic Router Plugin hook to delay ReplicaSet scale-down until external drain completes #4597

@billyshambrook

Description

Summary

Add a new optional method to the Traffic Router Plugin interface that allows plugins to signal whether a ReplicaSet can be safely scaled down. This would enable traffic routers that manage external systems with their own drain/shutdown semantics to prevent premature pod termination.

Use Cases

We're using Argo Rollouts with a custom traffic router plugin for Temporal Worker Versioning. Temporal manages traffic routing to workers externally based on deployment versions, and has its own "drain" lifecycle: when a version is superseded, existing workflows must complete before its workers can be safely terminated.

We've encountered two scenarios where pods are terminated before Temporal reports the version as drained:

Scenario 1: Full Promote

  1. Version A is stable (100% traffic)
  2. Version B is deployed as canary, progresses through steps
  3. User clicks "Promote Full"
  4. Argo shifts 100% traffic to B, B becomes new stable
  5. Argo starts scaleDownDelaySeconds timer for A's ReplicaSet
  6. Timer expires → A's pods are deleted
  7. Problem: Temporal workflows are still running on A's workers

The scaleDownDelaySeconds timer runs independently and is not gated by traffic router plugin responses. Even if our plugin's VerifyWeight is waiting for drain to complete, the scale-down proceeds when the timer expires.

Scenario 2: Rainbow Deployment Abort

  1. Version A is stable (100% traffic)
  2. Version B is deployed as canary, reaches 25%
  3. Version C is deployed before B completes
  4. Argo correctly starts draining B's traffic and marks B's ReplicaSet for scale-down
  5. Problem: Argo scales down B's ReplicaSet immediately, killing pods while Temporal workflows are still running on those workers

In both scenarios, workflows can run for hours, so time-based solutions don't work for our case.

What We've Tried

| Approach | Why It Doesn't Work |
| --- | --- |
| `scaleDownDelaySeconds` | Delays scale-down but doesn't wait for actual drain completion. The timer is fixed and not gated by any external condition. |
| `terminationGracePeriodSeconds` + `preStop` hook | We use KEDA to scale workers based on queue depth. Pods in the `Terminating` state can't be "un-terminated" if KEDA needs to scale up. |
| Traffic router plugin `VerifyWeight` | Only called for traffic operations, not before ReplicaSet scale-down. The `scaleDownDelaySeconds` timer runs independently. |
| Traffic router plugin `UpdateHash` | Returning an error here blocks the new rollout from proceeding, creating a deadlock: old pods stay, but the new rollout can't progress. |
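For context, the timer from the first row is set on the rollout's canary strategy; a minimal fragment (the 600-second value is illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      # Fixed delay before the superseded ReplicaSet is scaled down once
      # it loses traffic; not gated by any external drain condition.
      scaleDownDelaySeconds: 600
```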

Proposed Solution

Add an optional method to the Traffic Router Plugin interface:

```go
// CanScaleDown is called before scaling down a ReplicaSet.
// Plugins can return false to delay scale-down until an external condition is met.
// This is useful for traffic routers that manage external systems with drain semantics.
//
// Parameters:
//   - rollout: The rollout being processed
//   - replicaSetHash: The hash of the ReplicaSet being considered for scale-down
//
// Returns:
//   - canScaleDown: true if the ReplicaSet can be safely scaled down
//   - message: optional message explaining why scale-down is delayed (for status/events)
//   - error: if an error occurred checking the condition
type CanScaleDown func(
    rollout *v1alpha1.Rollout,
    replicaSetHash string,
) (canScaleDown bool, message string, err error)
```

The Argo Rollouts controller would call this method before scaling down any ReplicaSet managed by a traffic router plugin. If canScaleDown returns false, the controller would:

  1. Skip scaling down that ReplicaSet for this reconciliation cycle
  2. Optionally surface the message in rollout status/events
  3. Retry on the next reconciliation

This check would occur after scaleDownDelaySeconds expires but before actually scaling down the ReplicaSet, giving plugins the final say on whether scale-down is safe.
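A minimal sketch of what the controller-side gate could look like. Everything here is hypothetical (`scaleDownDecision`, `shouldScaleDown`, and the plugin callback are illustrative names, not existing Argo Rollouts code); a real integration would go through the existing RPC plugin machinery and the controller's reconcile loop:

```go
package main

import "fmt"

// scaleDownDecision is a hypothetical result of a CanScaleDown call.
type scaleDownDecision struct {
	allowed bool
	message string
}

// canScaleDownFunc mirrors the proposed optional plugin method. Plugins
// that do not implement it would be treated as always allowing scale-down.
type canScaleDownFunc func(replicaSetHash string) (scaleDownDecision, error)

// shouldScaleDown runs after scaleDownDelaySeconds has expired: the timer
// alone no longer triggers scale-down; the plugin gets the final say.
func shouldScaleDown(hash string, delayExpired bool, plugin canScaleDownFunc) (bool, string) {
	if !delayExpired {
		return false, "scaleDownDelaySeconds has not expired"
	}
	if plugin == nil {
		return true, "" // no plugin gate configured
	}
	dec, err := plugin(hash)
	if err != nil {
		// Be conservative on error: keep the ReplicaSet and retry
		// on the next reconciliation cycle.
		return false, fmt.Sprintf("plugin error: %v", err)
	}
	return dec.allowed, dec.message
}

func main() {
	drained := false
	plugin := func(hash string) (scaleDownDecision, error) {
		if drained {
			return scaleDownDecision{allowed: true}, nil
		}
		return scaleDownDecision{message: "Temporal drain in progress"}, nil
	}

	ok, msg := shouldScaleDown("6f8c9b7", true, plugin)
	fmt.Println(ok, msg) // blocked while the external drain is in progress

	drained = true
	ok, msg = shouldScaleDown("6f8c9b7", true, plugin)
	fmt.Println(ok, msg) // allowed once the plugin reports drained
}
```

The key property is that a `false` return is not an error path: it simply defers the scale-down to a later reconciliation, so the rollout itself keeps progressing.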

Alternative: Annotation-based delay

A simpler alternative would be an annotation that plugins can set on ReplicaSets to prevent scale-down:

metadata:
  annotations:
    rollouts.argoproj.io/scale-down-blocked: "true"
    rollouts.argoproj.io/scale-down-blocked-reason: "Temporal drain in progress"

The controller would skip scaling down ReplicaSets with this annotation. Traffic router plugins would be responsible for adding/removing it.
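Assuming the annotation keys above, the controller-side check reduces to a small helper (a sketch; a real implementation would read these from the ReplicaSet's `ObjectMeta`):

```go
package main

import "fmt"

// Hypothetical annotation keys from the proposal above.
const (
	scaleDownBlockedKey       = "rollouts.argoproj.io/scale-down-blocked"
	scaleDownBlockedReasonKey = "rollouts.argoproj.io/scale-down-blocked-reason"
)

// scaleDownBlocked reports whether a ReplicaSet's annotations block
// scale-down, and the plugin-supplied reason if so.
func scaleDownBlocked(annotations map[string]string) (bool, string) {
	if annotations[scaleDownBlockedKey] != "true" {
		return false, ""
	}
	return true, annotations[scaleDownBlockedReasonKey]
}

func main() {
	ann := map[string]string{
		scaleDownBlockedKey:       "true",
		scaleDownBlockedReasonKey: "Temporal drain in progress",
	}
	blocked, reason := scaleDownBlocked(ann)
	fmt.Println(blocked, reason) // true Temporal drain in progress
}
```

The trade-off versus the interface method is that the plugin must actively patch the annotation off when draining completes, whereas the `CanScaleDown` call is pulled by the controller on each reconciliation.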

Scenarios Addressed

This hook would cover all ReplicaSet scale-down scenarios:

| Scenario | Current Behavior | With Hook |
| --- | --- | --- |
| Full Promote | `scaleDownDelaySeconds` timer runs independently of traffic router drain status | Hook blocks scale-down until drain completes |
| Rainbow Abort | Old canary scaled down immediately when new canary starts | Hook blocks until old canary is drained |
| Rollback | Canary scaled down immediately | Hook can verify the canary is drained |
| Normal Completion | Works with `VerifyWeight` at 0% | Hook provides additional safety |

Impact

This would enable Argo Rollouts to integrate with external systems that have their own lifecycle management, such as:

  • Temporal worker versioning
  • Systems with long-running connections/sessions
  • Message queue consumers that need to finish processing
  • Any traffic router where "drained" is determined by an external system rather than time

I'm happy to look at contributing this if it seems like a viable solution. Thanks!


Message from the maintainers:

Need this enhancement? Give it a 👍. We prioritize the issues with the most 👍.
