Skip to content

Conversation

@jlegrone
Copy link
Collaborator

@jlegrone jlegrone commented Jun 27, 2025

What was changed

  • Refactors progressive rollout logic to resume from the step that is equal to or less than the current ramp value
  • Requires that the wait time for each step be >= the time since the last ramp value change
  • Adds spec validation to require:
    • at least one step in progressive rollouts
    • minimum step pause duration of 30s
    • always increasing ramp values between steps
  • Implements a sink for writing controller logs to test output
  • Refactors test helpers out of public package

Why?

In addition to addressing the issue described in #25, this change makes the controller's behavior when taking over the rollout after a user manually edits the ramp value much safer.

For example, this is what would happen if a user manually edits the ramp value in the middle of a rollout, pauses for a while, and then allows the controller to resume the rollout:

sequenceDiagram
    participant Controller as Worker Controller
    participant Temporal as Temporal Server
    actor CLI as User

    Note over Controller, Temporal: Progressive Rollout: Steps [1%, 5%, 25%, 50%]

    Controller->>Temporal: Start rollout - Set ramp to 1%
    Controller->>Controller: Wait for pause duration (e.g. 1m)

    Note over Controller: Next reconcile loop
    Controller->>Temporal: Check ramp status
    Temporal-->>Controller: Current ramp: 1%, healthy
    Controller->>Temporal: Advance to next step - Set ramp to 5%
    Controller->>Controller: Wait for pause duration (e.g. 5m)

    Note over Controller: Next reconcile loop
    Controller->>Temporal: Check ramp status
    Temporal-->>Controller: Current ramp: 5%, healthy
    Controller->>Temporal: Advance to next step - Set ramp to 25%

    par Controller waits
        Controller->>Controller: Wait for pause duration (e.g. 10m)
    and User intervenes
        Note over CLI: User notices issues<br/>and wants to reduce ramp
        CLI->>Temporal: Manual intervention<br/>Set ramp to 10% via CLI
    end

    Note over Controller: Next reconcile loop
    Controller->>Temporal: Check ramp status
    Temporal-->>Controller: Current ramp: 10%<br/>(manually modified)
    
    par Controller waits
        Note over Controller: Switch to manual rollout mode until<br/>Deployment released by user.
    and User intervenes
        CLI->>Temporal: Release Deployment ramp<br/>(Controller allowed to take over ramping again)
    end

    Note over Controller: Next reconcile loop
    Controller->>Temporal: Check ramp status
    Temporal-->>Controller: Current ramp: 10%, healthy
    Controller->>Temporal: Resume rollout - Set ramp to 5%<br>(this is the highest ramp value <= current)

    Note over Controller: Continue normal progression
    Controller->>Temporal: Advance to 25%
    Controller->>Temporal: Advance to 50%
    Controller->>Temporal: Set as current (100%)
Loading

Checklist

  1. Closes Guarantee that the ramps specified in progressive rollout spend at least the requested time in each step #25

  2. How was this tested: Unit tests

  3. Any docs updates needed?

Not immediately, but we should probably document the requirement for increasing ramp values rather than relying solely on the validator.

@jlegrone jlegrone force-pushed the jlegrone/min-ramp-time branch from 9b75d52 to 8dec4c5 Compare June 28, 2025 01:34
@jlegrone jlegrone requested a review from Copilot June 28, 2025 13:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the progressive rollout logic for TemporalWorkerDeployment to ensure that ramp values only advance safely after an adequate pause duration, even when a user manually intervenes, while also updating related tests and validation rules.

  • Refactored progressive rollout logic and introduced the RampLastModifiedAt field to track ramp changes.
  • Updated test helpers, planner logic, webhook validations, and CRD/schema definitions to support the new behavior.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/testlogr/testlogr.go Added a logger sink for writing logs to test output.
internal/testhelpers/make.go Refactored test helper functions and moved them into a dedicated package.
internal/temporal/worker_deployment.go Updated deployment state to include RampLastModifiedAt.
internal/planner/planner.go Updated progressive rollout logic to use RampLastModifiedAt and handle step transitions.
internal/planner/planner_test.go Added new test cases for progressive rollout edge cases and over-time transitions.
internal/k8s/deployments_test.go Refactored test usage to align with the new test helper functions.
internal/controller/state_mapper.go Mapped RampLastModifiedAt from the temporal state to deployment status.
helm/temporal-worker-controller/templates/crds/temporal.io_temporalworkerdeployments.yaml Added CRD schema for RampLastModifiedAt.
api/v1alpha1/worker_types.go Introduced RampLastModifiedAt in the deployment status.
api/v1alpha1/temporalworker_webhook_test.go Added webhook tests for validating progressive rollout specs with ramp and pause constraints.
api/v1alpha1/temporalworker_webhook.go Updated webhook validation logic to enforce rollout step requirements and ramp progression.
Comments suppressed due to low confidence (2)

internal/planner/planner.go:368

  • Consider clarifying via an inline comment that when the target ramp percentage does not match the current step's defined ramp, the ramp is reset immediately. This explicitly documents that the logic intentionally overrides the previous ramp regardless of pause duration.
	if targetRampPercentage == nil || *targetRampPercentage != currentStep.RampPercentage {

internal/testhelpers/make.go:76

  • [nitpick] Consider renaming the parameter 'twd' to a more generic identifier (such as 'obj') to better reflect that this helper function is generic and may operate on various types.
func ModifyObj[T any](twd T, callback func(obj T) T) T {

Copy link
Collaborator

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Collaborator

@robholland robholland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few non-blocking comments about validation.

@jlegrone jlegrone merged commit ae0d5c9 into main Jul 1, 2025
4 checks passed
@jlegrone jlegrone deleted the jlegrone/min-ramp-time branch July 1, 2025 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Guarantee that the ramps specified in progressive rollout spend at least the requested time in each step

4 participants