diff --git a/docs/README.md b/docs/README.md index 8d34d855..c6fc6883 100644 --- a/docs/README.md +++ b/docs/README.md @@ -11,6 +11,9 @@ This documentation structure is designed to support various types of technical d ## Index +### [Migration Guide](migration-guide.md) +Comprehensive guide for migrating from existing unversioned worker deployment systems to the Temporal Worker Controller. Includes step-by-step instructions, configuration mapping, and common patterns. + ### [Limits](limits.md) Technical constraints and limitations of the Temporal Worker Controller system, including maximum field lengths and other operational boundaries. diff --git a/docs/concepts.md b/docs/concepts.md new file mode 100644 index 00000000..c793cb5e --- /dev/null +++ b/docs/concepts.md @@ -0,0 +1,143 @@ +# Temporal Worker Controller Concepts + +This document defines key concepts and terminology used throughout the Temporal Worker Controller documentation. + +## Core Terminology + +### Temporal Worker Deployment +A logical grouping in Temporal that represents a collection of workers that are deployed together and should be versioned together. Examples include "payment-processor", "notification-sender", or "data-pipeline-worker". This is a concept within Temporal itself, not specific to Kubernetes. See https://docs.temporal.io/production-deployment/worker-deployments/worker-versioning for more details. + +**Key characteristics:** +- Identified by a unique worker deployment name (e.g., "payment-processor/staging") +- Can have multiple concurrent worker versions running simultaneously +- Versions of a Worker Deployment are identified by Build IDs (e.g., "v1.5.1", "v1.5.2") +- Temporal routes workflow executions to appropriate worker versions based on the `RoutingConfig` of the Worker Deployment that the versions are in. + +### `TemporalWorkerDeployment` CRD +The Kubernetes Custom Resource Definition that manages one Temporal Worker Deployment. 
This is the primary resource you interact with when using the Temporal Worker Controller. + +**Key characteristics:** +- One `TemporalWorkerDeployment` Custom Resource per Temporal Worker Deployment +- Manages the lifecycle of all versions for that worker deployment +- Defines rollout strategies, resource requirements, and connection details +- Controller creates and manages multiple Kubernetes `Deployment` resources based on this spec + +### Kubernetes `Deployment` +The actual Kubernetes `Deployment` resources that run worker pods. The controller creates these automatically; you don't manage them directly. + +**Key characteristics:** +- Multiple Kubernetes `Deployment` resources per `TemporalWorkerDeployment` Custom Resource (one per version) +- Named with the pattern: `{worker-deployment-name}-{build-id}` (e.g., `staging/payment-processor-v1.5.1`) +- Each runs a specific version of your worker code + +### Key Relationship +**One `TemporalWorkerDeployment` Custom Resource → Multiple Kubernetes `Deployment` resources (managed by controller)** + +Make changes to the spec of your `TemporalWorkerDeployment` Custom Resource, and the controller handles all the underlying Kubernetes `Deployment` resources for different versions. + +## Version States + +Worker deployment versions progress through various states during their lifecycle: + +### NotRegistered +The version has been specified in the `TemporalWorkerDeployment` custom resource but hasn't been registered with Temporal yet. This typically happens when: +- The worker pods are still starting up +- There are connectivity issues to Temporal +- The worker code has errors preventing registration + +### Inactive +The version is registered with Temporal but isn't automatically receiving any new workflow executions through the Worker Deployment's `RoutingConfig`. This is the initial state for new versions before they are promoted via Versioning API calls. Inactive versions can receive workflow executions via `VersioningOverride` only. 
+ +### Ramping +The version is receiving a percentage of new workflow executions. If managed by a Progressive rollout, the percentage gradually increases according to the configured rollout steps. If the rollout is Manual, the user is responsible for setting the ramp percentage and the ramping version. + +### Current +The current version receives all new workflow executions except those routed to the Ramping version. This is the "stable" version that handles the majority of traffic: all new workflows not being ramped to a newer version, plus all existing AutoUpgrade workflows running on the task queues in this Worker Deployment. + +### Draining +The version is no longer receiving new workflow executions but may still be processing existing workflows. + +### Drained +All Pinned workflows on this version have completed. The version is ready for cleanup according to the sunset configuration. + +## Rollout Strategies + +### Manual Strategy +Requires explicit human intervention to promote versions. New versions remain in the `Inactive` state until manually promoted. + +**Use cases:** +- Advanced deployment scenarios that are not supported by the other strategies (e.g., the user wants to run custom testing and validation before changing how workflow traffic is routed) + +### AllAtOnce Strategy +Immediately routes 100% of new workflow executions to the target version once it's healthy and registered. + +**Use cases:** +- Non-production environments +- Low-risk deployments +- When you want immediate cutover without gradual rollout + +### Progressive Strategy +Gradually increases the percentage of new workflow executions routed to the new version according to configured steps. 
+ +**Use cases:** +- Production deployments where you want to validate new versions gradually +- When you want automated rollouts with built-in safety checks +- Deployments that benefit from canary analysis + +## Configuration Concepts + +### Worker Options +Configuration that tells the controller how to connect to the same Temporal cluster and namespace that the worker is connected to: +- **connection**: Reference to a `TemporalConnection` custom resource +- **temporalNamespace**: The Temporal namespace to connect to +- **deploymentName**: The logical deployment name in Temporal (auto-generated if not specified) + +### Rollout Configuration +Defines how new versions are promoted: +- **strategy**: Manual, AllAtOnce, or Progressive +- **steps**: For Progressive strategy, defines ramp percentages and pause durations +- **gate**: Optional workflow that must succeed on all task queues in the target Worker Deployment Version before promotion continues + +### Sunset Configuration +Defines how Drained versions are cleaned up: +- **scaledownDelay**: How long to wait after a version has been Drained before scaling pods to zero +- **deleteDelay**: How long to wait after a version has been Drained before deleting the Kubernetes `Deployment` + +### Template +The pod template used for the target version of this worker deployment. Similar to the pod template used in a standard Kubernetes `Deployment`, but managed by the controller. + +## Environment Variables + +The controller automatically sets these environment variables for all worker pods: + +### TEMPORAL_ADDRESS +The host and port of the Temporal server, derived from the `TemporalConnection` custom resource. +The worker must connect to this Temporal endpoint. Because the value is user-provided rather than controller-generated, the worker does not need to read this variable if it already knows the endpoint another way. 
+ +### TEMPORAL_NAMESPACE +The Temporal namespace the worker should connect to, from `spec.workerOptions.temporalNamespace`. +The worker must connect to this Temporal namespace. Because the value is user-provided rather than controller-generated, the worker does not need to read this variable if it already knows the namespace another way. + +### TEMPORAL_DEPLOYMENT_NAME +The worker deployment name in Temporal, auto-generated from the `TemporalWorkerDeployment` name and Kubernetes namespace. +The worker *must* use this to configure its `worker.DeploymentOptions`. + +### TEMPORAL_WORKER_BUILD_ID +The build ID for this specific version, derived from the container image tag and hash of the target pod template. +The worker *must* use this to configure its `worker.DeploymentOptions`. + +## Resource Management Concepts + +### Rainbow Deployments +The pattern of running multiple versions of the same service simultaneously. Running multiple versions of your workers simultaneously is essential for supporting Pinned workflows in Temporal, as Pinned workflows must continue executing on the worker version they started on. + +### Version Lifecycle Management +The automated process of: +1. Registering new versions with Temporal +2. Gradually routing traffic to new versions +3. Cleaning up resources for drained versions + +### Controller-Managed Resources +Resources that are created, updated, and deleted automatically by the controller: +- `TemporalWorkerDeployment` custom resources, to update their status +- Kubernetes `Deployment` resources for each version +- Labels and annotations for tracking and management diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 00000000..2e4cfef4 --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,305 @@ +# Configuration Reference + +This document provides comprehensive configuration options for the Temporal Worker Controller. + +## Table of Contents + +1. 
[Rollout Strategies](#rollout-strategies) +2. [Sunset Configuration](#sunset-configuration) +3. [Worker Options](#worker-options) +4. [Gate Configuration](#gate-configuration) +5. [Advanced Configuration](#advanced-configuration) + +## Rollout Strategies + +See the [Concepts](concepts.md) document for detailed explanations of rollout strategies. Here are the basic configuration patterns: + +### Manual Strategy (Advanced Use Cases) + +```yaml +rollout: + strategy: Manual +# Requires manual intervention to promote versions +# Only recommended for special cases requiring full manual control +``` + +Use Manual strategy when you need complete control over version promotions, such as: +- Complex validation processes that require human approval +- Coordinated deployments across multiple services +- Special compliance or regulatory requirements + +### AllAtOnce Strategy + +```yaml +rollout: + strategy: AllAtOnce +# Immediately routes 100% traffic to new version when healthy +``` + +Use AllAtOnce strategy for: +- Low-risk environments (development, staging) +- Services where fast deployment is more important than gradual rollout +- Background processing workers with minimal user impact + +### Progressive Strategy (Recommended) + +```yaml +rollout: + strategy: Progressive + steps: + # Conservative initial migration settings + - rampPercentage: 1 + pauseDuration: 10m + - rampPercentage: 5 + pauseDuration: 15m + - rampPercentage: 25 + pauseDuration: 20m + # Can be optimized to faster ramps after validation: + # - rampPercentage: 10 + # pauseDuration: 5m + # - rampPercentage: 50 + # pauseDuration: 10m + gate: + workflowType: "HealthCheck" # Optional validation workflow +``` + +Progressive strategy is recommended for most production deployments because it: +- Minimizes risk by gradually increasing traffic to new versions +- Provides automatic pause points for validation +- Allows for quick rollback if issues are detected +- Can be tuned for different risk tolerances + +#### 
Progressive Rollout Examples + +**Conservative Production Rollout:** +```yaml +rollout: + strategy: Progressive + steps: + - rampPercentage: 1 + pauseDuration: 15m + - rampPercentage: 5 + pauseDuration: 30m + - rampPercentage: 25 + pauseDuration: 45m + - rampPercentage: 75 + pauseDuration: 30m + gate: + workflowType: "ProductionHealthCheck" +``` + +**Faster Development Environment:** +```yaml +rollout: + strategy: Progressive + steps: + - rampPercentage: 25 + pauseDuration: 2m + - rampPercentage: 75 + pauseDuration: 3m +``` + +**Canary-Style Rollout:** +```yaml +rollout: + strategy: Progressive + steps: + - rampPercentage: 1 + pauseDuration: 30m # Long canary period + - rampPercentage: 100 + pauseDuration: 0s # Full rollout after canary validation +``` + +## Sunset Configuration + +Controls how old versions are scaled down and cleaned up after they're no longer receiving new traffic: + +```yaml +sunset: + scaledownDelay: 1h # Wait 1 hour after draining before scaling to 0 + deleteDelay: 24h # Wait 24 hours after draining before deleting +``` + +### Sunset Configuration Examples + +**Conservative Cleanup (Recommended for Production):** +```yaml +sunset: + scaledownDelay: 2h # Allow time for workflows to complete + deleteDelay: 48h # Keep resources for debugging/rollback +``` + +**Aggressive Cleanup (Development/Staging):** +```yaml +sunset: + scaledownDelay: 15m # Quick scaledown + deleteDelay: 2h # Minimal retention +``` + +**Long-Running Workflow Environment:** +```yaml +sunset: + scaledownDelay: 24h # Long-running workflows need time + deleteDelay: 168h # 1 week retention for analysis +``` + +## Worker Options + +Configure how workers connect to Temporal: + +```yaml +workerOptions: + connection: production-temporal # Reference to TemporalConnection + temporalNamespace: production # Temporal namespace + taskQueues: # Optional: explicit task queue list + - order-processing + - payment-processing +``` + +### Connection Configuration + +Reference a 
`TemporalConnection` resource that defines server details: + +```yaml +apiVersion: temporal.io/v1alpha1 +kind: TemporalConnection +metadata: + name: production-temporal +spec: + hostPort: "production.abc123.tmprl.cloud:7233" + mutualTLSSecret: temporal-cloud-mtls # Optional: for mTLS +``` + +## Gate Configuration + +Optional validation workflow that must succeed before proceeding with rollout: + +```yaml +rollout: + strategy: Progressive + steps: + - rampPercentage: 10 + pauseDuration: 5m + gate: + workflowType: "HealthCheck" + input: | + { + "version": "{{.Version}}", + "environment": "production" + } + timeout: 300s +``` + +### Gate Workflow Examples + +**Simple Health Check:** +```yaml +gate: + workflowType: "HealthCheck" + timeout: 60s +``` + +**Complex Validation with Input:** +```yaml +gate: + workflowType: "ValidationWorkflow" + input: | + { + "deploymentName": "{{.DeploymentName}}", + "buildId": "{{.BuildId}}", + "rampPercentage": {{.RampPercentage}}, + "environment": "{{.Environment}}" + } + timeout: 600s +``` + +## Advanced Configuration + +### Environment-Specific Configurations + +**Production Configuration:** +```yaml +apiVersion: temporal.io/v1alpha1 +kind: TemporalWorkerDeployment +metadata: + name: order-processor + namespace: production +spec: + replicas: 5 + workerOptions: + connection: production-temporal + temporalNamespace: production + rollout: + strategy: Progressive + steps: + - rampPercentage: 1 + pauseDuration: 15m + - rampPercentage: 10 + pauseDuration: 30m + - rampPercentage: 50 + pauseDuration: 45m + gate: + workflowType: "ProductionHealthCheck" + timeout: 300s + sunset: + scaledownDelay: 2h + deleteDelay: 48h +``` + +**Staging Configuration:** +```yaml +apiVersion: temporal.io/v1alpha1 +kind: TemporalWorkerDeployment +metadata: + name: order-processor + namespace: staging +spec: + replicas: 2 + workerOptions: + connection: staging-temporal + temporalNamespace: staging + rollout: + strategy: Progressive + steps: + - rampPercentage: 25 + 
pauseDuration: 5m + - rampPercentage: 100 + pauseDuration: 0s + sunset: + scaledownDelay: 30m + deleteDelay: 4h +``` + +### Multiple Task Queues + +Configure workers that handle multiple task queues: + +```yaml +workerOptions: + connection: production-temporal + temporalNamespace: production + taskQueues: + - order-processing + - payment-processing + - notification-sending +``` + +## Configuration Validation + +The controller validates configuration and will report errors in the resource status: + +```bash +# Check for configuration errors +kubectl describe temporalworkerdeployment my-worker + +# Look for validation errors in status +kubectl get temporalworkerdeployment my-worker -o yaml +``` + +Common validation errors: +- Invalid ramp percentages (must be 1-100) +- Invalid duration formats (use Go duration format: "5m", "1h", "30s") +- Missing required fields (connection, temporalNamespace) +- Invalid strategy combinations + +For more examples and patterns, see the [Migration Guide](migration-guide.md) and [Concepts](concepts.md) documentation. diff --git a/docs/migration-guide.md b/docs/migration-guide.md new file mode 100644 index 00000000..b4563ce8 --- /dev/null +++ b/docs/migration-guide.md @@ -0,0 +1,677 @@ +# Migrating from Unversioned to Versioned Workflows with Temporal Worker Controller + +This guide helps teams migrate from unversioned Temporal workflows to versioned workflows using the Temporal Worker Controller. It assumes you are currently running workers without Temporal's Worker Versioning feature and want to adopt versioned worker deployments for safer, more controlled rollouts. + +## Important Note + +This guide uses specific terminology that is defined in the [Concepts](concepts.md) document. Please review the concepts document first to understand key terms like **Temporal Worker Deployment**, **`TemporalWorkerDeployment` CRD**, and **Kubernetes `Deployment`**, as well as the relationship between them. + +## Table of Contents + +1. 
[Why Migrate to Versioned Workflows](#why-migrate-to-versioned-workflows) +2. [Prerequisites](#prerequisites) +3. [Understanding the Differences](#understanding-the-differences) +4. [Migration Strategy](#migration-strategy) +5. [Step-by-Step Migration](#step-by-step-migration) +6. [Configuration Reference](#configuration-reference) +7. [Testing and Validation](#testing-and-validation) +8. [Common Migration Patterns](#common-migration-patterns) +9. [Troubleshooting](#troubleshooting) + +For detailed configuration options, see the [Configuration Reference](configuration.md) document. + +## Why Migrate to Versioned Workflows + +If you're currently running unversioned Temporal workflows, you may be experiencing challenges with deployments. Versioned workflows with the Temporal Worker Controller can solve these problems. For details on the benefits of Worker Versioning, see the [Temporal documentation](https://docs.temporal.io/production-deployment/worker-deployments/worker-versioning). + +## Prerequisites + +Before starting the migration, ensure you have: + +- ✅ **Unversioned Temporal workers**: Currently running workers without Worker Versioning +- ✅ **Kubernetes cluster**: Running Kubernetes 1.19+ with CustomResourceDefinition support +- ✅ **Basic worker configuration**: Workers connect to Temporal with namespace and task queue configuration +- ✅ **Administrative access**: Ability to install Custom Resource Definitions and controllers +- ✅ **Deployment pipeline**: Existing CI/CD system that can be updated to use new deployment method + +### Current Environment Variables + +Your workers are likely configured with basic environment variables like: + +```bash +TEMPORAL_ADDRESS=your-temporal-namespace.tmprl.cloud:7233 +TEMPORAL_NAMESPACE=your-temporal-namespace +# No TEMPORAL_DEPLOYMENT_NAME or TEMPORAL_WORKER_BUILD_ID yet +``` + +The controller will automatically add the versioning-related environment variables during migration. 
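Since the versioning-related variables only exist once the controller manages the pod, a worker can check for them at startup and fail fast when they are missing. The sketch below is one possible pre-flight check, not part of the controller or the Temporal SDK: the `workerEnv` type and `loadWorkerEnv` function are hypothetical names, and the lookup function is injected so the example runs without a real environment. Only the four variable names come from this guide.

```go
package main

import "fmt"

// workerEnv is a hypothetical holder for the variables described above.
type workerEnv struct {
	Address        string
	Namespace      string
	DeploymentName string
	BuildID        string
}

// loadWorkerEnv reads the four variables through getenv (pass os.Getenv in a
// real worker) and returns an error if either versioning variable is unset.
func loadWorkerEnv(getenv func(string) string) (workerEnv, error) {
	env := workerEnv{
		Address:        getenv("TEMPORAL_ADDRESS"),
		Namespace:      getenv("TEMPORAL_NAMESPACE"),
		DeploymentName: getenv("TEMPORAL_DEPLOYMENT_NAME"),
		BuildID:        getenv("TEMPORAL_WORKER_BUILD_ID"),
	}
	if env.DeploymentName == "" || env.BuildID == "" {
		return workerEnv{}, fmt.Errorf("TEMPORAL_DEPLOYMENT_NAME and TEMPORAL_WORKER_BUILD_ID must be set; is this pod managed by the controller?")
	}
	return env, nil
}

func main() {
	// Stand-in for the pod environment so the sketch runs anywhere.
	fake := map[string]string{
		"TEMPORAL_ADDRESS":         "your-temporal-namespace.tmprl.cloud:7233",
		"TEMPORAL_NAMESPACE":       "your-temporal-namespace",
		"TEMPORAL_DEPLOYMENT_NAME": "example-deployment",
		"TEMPORAL_WORKER_BUILD_ID": "example-build",
	}
	env, err := loadWorkerEnv(func(k string) string { return fake[k] })
	fmt.Println(env.DeploymentName, env.BuildID, err)
}
```

Running the same check against an unversioned pod's environment returns an error, which makes a misconfigured rollout visible in the worker logs rather than as silently unversioned polling.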
+ +## Understanding the Differences + +### Before: Unversioned Workers + +```yaml +# Single deployment, all workers run the same code +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-worker +spec: + replicas: 3 + template: + spec: + containers: + - name: worker + image: my-worker:v1.2.3 + env: + - name: TEMPORAL_ADDRESS + value: "production.tmprl.cloud:7233" + - name: TEMPORAL_NAMESPACE + value: "production" + - name: TEMPORAL_TLS_CLIENT_CERT_PATH + value: "/path/to/temporal.cert" + - name: TEMPORAL_TLS_CLIENT_KEY_PATH + value: "/path/to/temporal.key" + # No versioning environment variables +``` + +**Current Deployment Process:** +1. Build new worker image +2. Update Deployment with new image +3. Kubernetes rolls out new pods, terminating old ones +4. **Risk**: Running workflows may fail if code changes break compatibility + +### After: Versioned Workers with Controller + +```yaml +# Single Custom Resource manages multiple versions of a worker deployment automatically +apiVersion: temporal.io/v1alpha1 +kind: TemporalWorkerDeployment +metadata: + name: my-worker +spec: + replicas: 3 + workerOptions: + connection: production-temporal + temporalNamespace: production + rollout: + strategy: Progressive # Gradual rollout of new versions + steps: + - rampPercentage: 10 + pauseDuration: 5m + - rampPercentage: 50 + pauseDuration: 10m + sunset: + scaledownDelay: 1h + deleteDelay: 24h + template: + spec: # Any changes to this spec will trigger the controller to deploy a new version. + containers: + - name: worker + image: my-worker:v1.2.4 # This is the most common value to change, as you roll out a new worker image. + # Note: Controller automatically adds versioning environment variables: + # TEMPORAL_ADDRESS, TEMPORAL_NAMESPACE, TEMPORAL_DEPLOYMENT_NAME, TEMPORAL_WORKER_BUILD_ID +``` + +**New Deployment Process:** +1. Build new worker image +2. Update `TemporalWorkerDeployment` custom resource with new image +3. 
Controller creates new Kubernetes `Deployment` for the new version +4. Controller gradually routes new workflows and existing AutoUpgrade workflows to new version +5. Old version continues handling existing Pinned workflows until they complete +6. **Safety**: No disruption to running workflows, automated rollout control + +**Key Benefits:** +- ✅ **Zero-disruption deployments** - Running workflows continue on original version +- ✅ **Automated version management** - Controller handles registration and routing +- ✅ **Progressive rollouts** - Gradual traffic shifting with automatic pause points +- ✅ **Easy rollbacks** - Instantly route new workflows back to previous version +- ✅ **Workflow continuity** - Deterministic execution preserved across deployments + +## Migration Strategy + +### Safe Migration Approach + +The migration from unversioned to versioned workflows requires careful planning to avoid disrupting running workflows. The key is to transition gradually while maintaining workflow continuity. + +**Key Principles:** +- **Start with Progressive strategy with conservative settings** - Experience the controller's main value while maintaining safety +- **Migrate one worker deployment at a time** - Reduces risk and allows learning +- **Test thoroughly in non-production** - Validate the approach before production migration +- **Preserve running workflows** - Ensure in-flight workflows complete successfully +- **Use very conservative ramp percentages initially** - Start with 1-5% ramps to minimize risk + +### Migration Phases + +#### Phase 1: Preparation +1. **Install the controller** in non-production environments +2. **Update worker code** to support versioning +3. **Test migration process** with non-critical workers +4. **Prepare CI/CD pipeline changes** for new deployment method + +#### Phase 2: Initial Migration +1. **Choose lowest-risk worker** to migrate first +2. 
**Create `TemporalWorkerDeployment` custom resource** with a Progressive strategy (conservative intervals recommended) +3. **Validate controller management** works correctly +4. **Update deployment pipeline** for this worker + +#### Phase 3: Gradual Rollout +1. **Migrate remaining workers** one at a time +2. **Monitor and tune** rollout configurations +3. **Train team** on new deployment process + +### Recommended Migration Order + +1. **Background/batch processing workers** - Lower risk, easier to validate +2. **Internal service workers** - Limited external impact +3. **Customer-facing workers** - Highest risk, migrate last with most care + +## Step-by-Step Migration + +### Step 1: Install the Temporal Worker Controller + +```bash +# Install the controller using Helm +helm install -n temporal-system --create-namespace \ + temporal-worker-controller \ + oci://docker.io/temporalio/temporal-worker-controller + +``` + +### Step 2: Create TemporalConnection Resources + +Define connection parameters to your Temporal server(s): + +```yaml +apiVersion: temporal.io/v1alpha1 +kind: TemporalConnection +metadata: + name: production-temporal + namespace: default +spec: + hostPort: "production.abc123.tmprl.cloud:7233" + mutualTLSSecret: temporal-cloud-mtls # If using mTLS +--- +apiVersion: temporal.io/v1alpha1 +kind: TemporalConnection +metadata: + name: staging-temporal + namespace: default +spec: + hostPort: "staging.abc123.tmprl.cloud:7233" + mutualTLSSecret: temporal-cloud-mtls +``` + +### Step 3: Prepare Your Worker Code + +Update your worker initialization code to properly handle versioning: + +**Before (Unversioned):** +```go +// Worker connects without versioning +worker := worker.New(client, "my-task-queue", worker.Options{}) +``` + +**After (Versioned):** +```go +// Worker must use the build ID/deployment name from environment +// These are set on the deployment by the controller +buildID := os.Getenv("TEMPORAL_WORKER_BUILD_ID") +deploymentName := 
os.Getenv("TEMPORAL_DEPLOYMENT_NAME") +if buildID == "" || deploymentName == "" { + log.Fatal("TEMPORAL_WORKER_BUILD_ID and TEMPORAL_DEPLOYMENT_NAME must be set") // requires the standard "log" package +} +workerOptions := worker.Options{} +workerOptions.DeploymentOptions = worker.DeploymentOptions{ + UseVersioning: true, + Version: worker.WorkerDeploymentVersion{ + DeploymentName: deploymentName, + BuildId: buildID, + }, +} +worker := worker.New(client, "my-task-queue", workerOptions) +``` + +### Step 4: Create Your First TemporalWorkerDeployment + +Start with your lowest-risk worker. Make a copy of your existing unversioned Deployment and convert it to a `TemporalWorkerDeployment` custom resource: + +**Existing Unversioned Deployment:** +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: payment-processor +spec: + replicas: 3 + template: + spec: + containers: + - name: worker + image: payment-processor:v1.5.2 + env: + - name: TEMPORAL_HOST_PORT + value: "production.tmprl.cloud:7233" + - name: TEMPORAL_NAMESPACE + value: "production" +``` + +**New TemporalWorkerDeployment custom resource (IMPORTANT: Use Progressive strategy with conservative settings initially):** +```yaml +apiVersion: temporal.io/v1alpha1 +kind: TemporalWorkerDeployment +metadata: + name: payment-processor + labels: + app: payment-processor +spec: + replicas: 3 + workerOptions: + connection: production-temporal + temporalNamespace: production + # Start with Progressive strategy using conservative ramp percentages + rollout: + strategy: Progressive + steps: + - rampPercentage: 1 + pauseDuration: 10m + - rampPercentage: 5 + pauseDuration: 15m + - rampPercentage: 25 + pauseDuration: 20m + sunset: + scaledownDelay: 30m + deleteDelay: 2h + template: + spec: + containers: + - name: worker + image: payment-processor:v1.5.2 # Same image as current deployment to ensure no breaking changes + resources: + requests: + memory: "512Mi" + cpu: "250m" + # Note: Controller automatically adds 
TEMPORAL_* env vars +``` + +### Step 5: Deploy the TemporalWorkerDeployment + +1. 
**Create the `TemporalWorkerDeployment` custom resource:** + ```bash + kubectl apply -f payment-processor-versioned.yaml + ``` + +2. **Wait for the version to be registered:** + ```bash + # Monitor until status shows the version is registered and current + kubectl get temporalworkerdeployment payment-processor -w + ``` + + You should see the controller create a new Kubernetes `Deployment` resource (e.g., `payment-processor-v1.5.2`) for this version. + +3. **Verify the versioned deployment is working:** + ```bash + # Check that controller-managed deployment exists + kubectl get deployments -l temporal.io/managed-by=temporal-worker-controller + + # Check pods are running + kubectl get pods -l temporal.io/worker-deployment=payment-processor + + # Check worker logs to verify versioning is active + kubectl logs -l temporal.io/worker-deployment=payment-processor + ``` + + You should see logs indicating the worker has registered with a Build ID. + +4. **Verify in Temporal UI:** + - Check the Workers page in Temporal UI + - You should see your worker deployment with version information + - New workflows should be routed to the versioned worker + +### Step 6: Transition from Unversioned to Versioned + +Now you need to carefully transition from your old unversioned deployment to the new versioned one: + +1. **Ensure both deployments are running:** + ```bash + # Check your original unversioned deployment + kubectl get deployment payment-processor + + # Check the new controller-managed versioned deployment + kubectl get deployments -l temporal.io/managed-by=temporal-worker-controller + ``` + +2. **Monitor workflow routing:** + - In Temporal UI, check that new workflows are being routed to the versioned worker + - Existing workflows should continue on the unversioned worker until they complete + +3. 
**Wait for existing workflows to complete:**
+   ```bash
+   # Monitor running workflows in Temporal UI
+   # Or use Temporal CLI to check workflow status
+   temporal workflow list --namespace production
+   ```
+
+4. **Scale down the original unversioned deployment:**
+   ```bash
+   # Scale down the original deployment
+   kubectl scale deployment payment-processor --replicas=0
+   ```
+
+5. **Clean up the original deployment:**
+   ```bash
+   # Only after confirming all workflows are handled by versioned workers
+   kubectl delete deployment payment-processor
+   ```
+
+### Step 7: Optimize Rollout Settings
+
+Once the initial migration is complete and validated, you can optimize rollout settings for faster deployments:
+
+```bash
+# Update the TemporalWorkerDeployment custom resource to use faster Progressive rollout
+kubectl patch temporalworkerdeployment payment-processor --type='merge' -p='{
+  "spec": {
+    "rollout": {
+      "strategy": "Progressive",
+      "steps": [
+        {"rampPercentage": 10, "pauseDuration": "5m"},
+        {"rampPercentage": 50, "pauseDuration": "10m"}
+      ]
+    }
+  }
+}'
+```
+
+### Step 8: Update Your CI/CD Pipeline
+
+Modify your deployment pipeline to work with the new versioned approach:
+
+**Before (Unversioned):**
+```bash
+# Old pipeline updated Deployment directly
+kubectl set image deployment/payment-processor worker=payment-processor:v1.6.0
+```
+
+**After (Versioned):**
+```bash
+# New pipeline updates TemporalWorkerDeployment custom resource
+kubectl patch temporalworkerdeployment payment-processor --type='merge' -p='{"spec":{"template":{"spec":{"containers":[{"name":"worker","image":"payment-processor:v1.6.0"}]}}}}'
+```
+
+**What happens next:**
+1. Controller detects the image change
+2. Creates a new Kubernetes `Deployment` (e.g., `payment-processor-v1.6.0`)
+3. Registers the new version with Temporal
+4. Gradually routes traffic according to the Progressive strategy
+5. Scales down and cleans up the old version once the new version is fully deployed
+
+### Step 9: Test Your First Versioned Deployment
+
+Deploy a new version to validate the entire flow:
+
+1. **Make a small, safe change** to your worker code
+2. **Build and push** a new container image
+3. **Update the TemporalWorkerDeployment:**
+   ```bash
+   kubectl patch temporalworkerdeployment payment-processor --type='merge' -p='{"spec":{"template":{"spec":{"containers":[{"name":"worker","image":"payment-processor:v1.6.0"}]}}}}'
+   ```
+4. **Monitor the rollout:**
+   ```bash
+   # Watch the deployment progress
+   kubectl get temporalworkerdeployment payment-processor -w
+
+   # Check that new deployment is created
+   kubectl get deployments -l temporal.io/managed-by=temporal-worker-controller
+   ```
+5. **Verify in Temporal UI** that traffic is gradually shifting to the new version
+
+## Configuration Reference
+
+For comprehensive configuration options including rollout strategies, sunset configuration, worker options, and advanced settings, see the [Configuration Reference](configuration.md) document.
+
+Key configuration patterns for migration:
+
+- **Progressive Strategy (Recommended)**: Start with conservative ramp percentages (1%, 5%, 25%) for initial migrations
+- **AllAtOnce Strategy**: For development/staging environments where speed is preferred over gradual rollout
+- **Manual Strategy**: Only for advanced use cases requiring full manual control
+- **Sunset Configuration**: Configure delays for scaling down and deleting old versions
+
+See [Configuration Reference](configuration.md) for detailed examples and advanced configuration options.
+
+## Testing and Validation
+
+### Pre-Migration Testing
+
+1. **Test in non-production environment:**
+   ```bash
+   # Create test TemporalWorkerDeployment
+   kubectl apply -f test-worker.yaml
+
+   # Verify worker registration
+   kubectl logs -l temporal.io/worker-deployment=test-worker
+   ```
+
+2. **Validate environment variables:**
+   ```bash
+   kubectl exec -it deployment/test-worker-v1 -- env | grep TEMPORAL
+   ```
+
+### Post-Migration Validation
+
+1. **Check version status:**
+   ```bash
+   kubectl get temporalworkerdeployment -o wide
+   ```
+
+2. **Monitor version transitions:**
+   ```bash
+   kubectl get events --field-selector involvedObject.kind=TemporalWorkerDeployment
+   ```
+
+3. **Validate workflow routing:**
+   - Start test workflows
+   - Verify they're routed to correct versions
+   - Check Temporal UI for version distribution
+
+## Common Migration Patterns
+
+Here are common scenarios when migrating from unversioned to versioned workers:
+
+### Pattern 1: Microservices Architecture
+
+If you have multiple services with their own workers, and each service is versioned and patched separately, migrate each service independently:
+
+```yaml
+# Payment service worker
+apiVersion: temporal.io/v1alpha1
+kind: TemporalWorkerDeployment
+metadata:
+  name: payment-processor
+  namespace: payments
+spec:
+  workerOptions:
+    connection: production-temporal
+    temporalNamespace: payments
+  rollout:
+    strategy: Progressive
+    steps:
+      - rampPercentage: 5
+        pauseDuration: 10m
+      - rampPercentage: 25
+        pauseDuration: 15m
+  # ... rest of config
+---
+# Notification service worker (separate service, different risk profile)
+apiVersion: temporal.io/v1alpha1
+kind: TemporalWorkerDeployment
+metadata:
+  name: notification-sender
+  namespace: notifications
+spec:
+  workerOptions:
+    connection: production-temporal
+    temporalNamespace: notifications
+  rollout:
+    strategy: AllAtOnce  # Lower risk, faster rollouts desired
+  # ... rest of config
+```
+
+### Pattern 2: Environment-Specific Strategies
+
+Use different rollout strategies based on environment risk:
+
+```yaml
+# Production - Conservative rollout
+apiVersion: temporal.io/v1alpha1
+kind: TemporalWorkerDeployment
+metadata:
+  name: order-processor
+  namespace: production
+spec:
+  workerOptions:
+    connection: production-temporal
+    temporalNamespace: production
+  rollout:
+    strategy: Progressive
+    steps:
+      - rampPercentage: 10
+        pauseDuration: 15m
+      - rampPercentage: 50
+        pauseDuration: 30m
+    gate:
+      workflowType: "HealthCheck"  # Validate new version before proceeding
+---
+# Staging - Fast rollout for testing
+apiVersion: temporal.io/v1alpha1
+kind: TemporalWorkerDeployment
+metadata:
+  name: order-processor
+  namespace: staging
+spec:
+  workerOptions:
+    connection: staging-temporal
+    temporalNamespace: staging
+  rollout:
+    strategy: AllAtOnce  # Faster rollout
+```
+
+### Pattern 3: Gradual Team Migration
+
+Migrate teams/services based on their readiness and risk tolerance:
+
+**Phase 1: Low-Risk Services**
+- Background processing workers
+- Internal tooling workflows
+- Non-customer-facing operations
+
+**Phase 2: Medium-Risk Services**
+- Internal API workflows
+- Data processing pipelines
+- Administrative workflows
+
+**Phase 3: High-Risk Services**
+- Customer-facing workflows
+- Payment processing
+- Critical business operations
+
+## Troubleshooting
+
+### Common Issues
+
+**1. Workers Not Registering with Temporal**
+
+*Symptoms:*
+```
+status:
+  targetVersion:
+    status: NotRegistered
+```
+
+*Solutions:*
+- Check worker logs for connection/initialization errors
+- Verify TemporalConnection configuration
+- Ensure TLS secrets are properly configured
+- Verify network connectivity to Temporal server
+- Check that worker code properly handles `TEMPORAL_DEPLOYMENT_NAME` and `TEMPORAL_WORKER_BUILD_ID` environment variables
+
+**2. New Workflows Still Going to Unversioned Workers**
+
+*Symptoms:*
+- Temporal UI shows workflows executing on unversioned workers
+- Versioned workers appear idle
+
+*Solutions:*
+- Verify versioned workers are properly registered in Temporal UI
+- Check that new workflows are starting on the correct task queue
+- Ensure unversioned workers are scaled down gradually, not immediately
+- Verify Temporal routing rules are working correctly
+
+**3. Existing Workflows Failing During Migration**
+
+*Symptoms:*
+- Running workflows encounter errors during migration
+- Workflow history shows non-determinism errors
+
+*Solutions:*
+- Ensure unversioned workers remain running until workflows complete
+- Don't force-terminate unversioned workers with running workflows
+- Check that worker code changes are backward compatible
+- Monitor workflow completion before scaling down old workers
+
+**4. Version Stuck in Ramping State**
+
+*Symptoms:*
+```
+status:
+  targetVersion:
+    status: Ramping
+    rampPercentage: 10
+```
+
+*Solutions:*
+- Check if gate workflow is configured and completing successfully
+- Verify progressive rollout steps are reasonable
+- Check controller logs for errors
+- Ensure new version is healthy and processing workflows correctly
+
+### Debugging Commands
+
+```bash
+# Check controller logs
+kubectl logs -n temporal-worker-controller-system deployment/controller-manager
+
+# Check worker status
+kubectl describe temporalworkerdeployment my-worker
+
+# Check managed deployments
+kubectl get deployments -l temporal.io/managed-by=temporal-worker-controller
+
+# Check worker logs
+kubectl logs -l temporal.io/worker-deployment=my-worker
+
+# Check events emitted for TemporalWorkerDeployment resources
+kubectl get events --field-selector involvedObject.kind=TemporalWorkerDeployment
+```
+
+### Getting Help
+
+1. **Check controller logs** for error messages
+2. **Review TemporalWorkerDeployment status** for detailed state information
+3. **Verify Temporal server connectivity** from worker pods
+4. **File issues** at the project repository with logs and configuration
+
+## Migration Summary
+
+🎯 **Key principles for unversioned to versioned migration**:
+- **Start with the Progressive strategy using conservative ramp percentages** to experience the controller's value while maintaining safety
+- **Run both unversioned and versioned workers** during the transition period
+- **Wait for existing workflows to complete** before scaling down unversioned workers
+- **Begin with very conservative ramp percentages (1-5%)** and optimize after validating the migration process
+- **Migrate one service at a time** to reduce risk and enable learning
+
+See the [Concepts](concepts.md) document for detailed explanations of the resource relationships and terminology.
+
+This approach ensures a safe transition from unversioned to versioned workers without disrupting running workflows or introducing deployment risks.
+
+The Temporal Worker Controller should significantly improve your deployment safety and reduce the risk of workflow disruptions while providing automated rollout capabilities that weren't possible with unversioned workers.
\ No newline at end of file
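
The image update in Steps 8 and 9 sends a JSON merge patch containing a `containers` list. A merge patch replaces a list wholesale, so any other fields on that container (for example `env` or `resources`) would need to be included in the patch as well. One possible alternative for a CI/CD pipeline is a JSON Patch that touches only the image path. The following is a minimal sketch under stated assumptions, not part of the controller: the function names `build_image_json_patch` and `update_worker_image` are hypothetical, the worker container is assumed to sit at a known index in the pod template, and the image reference is assumed to contain no characters that need JSON escaping.

```bash
#!/usr/bin/env bash
# Hypothetical helper functions; not shipped with the Temporal Worker Controller.

# Print a JSON Patch (RFC 6902) document that replaces only the image of the
# container at the given index in spec.template.spec.containers.
# Assumption: the image string needs no JSON escaping.
build_image_json_patch() {
  local index="$1" image="$2"
  printf '[{"op":"replace","path":"/spec/template/spec/containers/%s/image","value":"%s"}]' \
    "$index" "$image"
}

# Apply the patch to a TemporalWorkerDeployment custom resource.
# Usage: update_worker_image <resource-name> <container-index> <image>
update_worker_image() {
  local name="$1" index="$2" image="$3"
  kubectl patch temporalworkerdeployment "$name" --type=json \
    -p "$(build_image_json_patch "$index" "$image")"
}
```

A pipeline step might then call `update_worker_image payment-processor 0 payment-processor:v1.6.0`, assuming the worker is the first container in the template. Applying the full manifest with `kubectl apply -f` is another option when the manifest lives in version control.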