|
| 1 | +--- |
| 2 | +title: Safe deployments of Versioned workflows |
| 3 | +authors: arzonus |
| 4 | +date: 2025-07-10T16:00 |
| 5 | +tags: |
| 6 | + - deep-dive |
| 7 | + - announcement |
| 8 | +--- |
| 9 | + |
| 10 | +At Uber, we manage billions of workflows with lifetimes ranging from seconds to years. Over the course of their lifetime, workflow code logic often requires changes. To prevent non-deterministic errors that changes may cause, Cadence offers [a Versioning feature](https://cadenceworkflow.io/docs/go-client/workflow-versioning). However, the feature's usage is limited because changes are only backward-compatible, but not forward-compatible. This makes potential rollbacks or workflow execution rescheduling unsafe. |
| 11 | + |
| 12 | +To address these issues, we have made recent enhancements to [the Versioning API](https://cadenceworkflow.io/docs/go-client/workflow-versioning), enabling the safe deployment of versioned workflows by separating code changes from the activation of new logic. |
| 13 | + |
| 14 | +## What is a Versioned Workflow? |
| 15 | + |
| 16 | +Cadence reconstructs a workflow's execution history by replaying past events against your workflow code, expecting the exact same outcome every time. If your workflow code changes in an incompatible way, this replaying process can lead to non-deterministic errors. |
| 17 | + |
| 18 | +A versioned workflow uses the [Versioning feature](https://cadenceworkflow.io/docs/go-client/workflow-versioning) to avoid these errors, allowing developers to safely update workflow code without breaking existing executions. The key is the `workflow.GetVersion` function (available in [Go](https://cadenceworkflow.io/docs/go-client/workflow-versioning) and [Java](https://cadenceworkflow.io/docs/java-client/versioning)). By calling `workflow.GetVersion`, you mark the points in your code where changes occur. The version chosen on the first execution is recorded in the workflow history, so every later call, including replays, returns the same version number. |
| 19 | + |
| 20 | +For example, the following code runs the original activity for executions recorded with the default version and the new activity for all others: |
| 21 | + |
| 22 | +```go |
| 23 | +v := workflow.GetVersion(ctx, "change-id", workflow.DefaultVersion, 1) |
| 24 | +if v == workflow.DefaultVersion { |
| 25 | +    err = workflow.ExecuteActivity(ctx, ActivityA, data).Get(ctx, &result1) |
| 26 | +} else { |
| 27 | +    err = workflow.ExecuteActivity(ctx, ActivityC, data).Get(ctx, &result1) |
| 28 | +} |
| 29 | +``` |
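The behaviour described above can be approximated without the Cadence SDK. The following is a minimal sketch, not the SDK's implementation: `History`, `ErrNonDeterministic`, and this standalone `GetVersion` are hypothetical names, and the `replaying` flag is an assumption used to model a history that predates the change. `DefaultVersion` is set to -1 to mirror `workflow.DefaultVersion`.

```go
package main

import (
	"errors"
	"fmt"
)

// DefaultVersion mirrors workflow.DefaultVersion: the version of code that
// predates any GetVersion call for a given change ID.
const DefaultVersion = -1

// History is a toy stand-in for the version markers kept in a workflow's
// event history, keyed by change ID. It is not a Cadence type.
type History map[string]int

// ErrNonDeterministic is a hypothetical error for this sketch only.
var ErrNonDeterministic = errors.New("version outside the supported range")

// GetVersion models the documented behaviour: on first execution it records
// maxSupported; afterwards it returns whatever was recorded. A replayed
// history without a marker is treated as DefaultVersion.
func GetVersion(h History, changeID string, minSupported, maxSupported int, replaying bool) (int, error) {
	v, recorded := h[changeID]
	if !recorded {
		if !replaying {
			h[changeID] = maxSupported
			return maxSupported, nil
		}
		v = DefaultVersion // the execution started before the change existed
	}
	if v < minSupported || v > maxSupported {
		return 0, fmt.Errorf("%w: change %q has version %d, code supports [%d, %d]",
			ErrNonDeterministic, changeID, v, minSupported, maxSupported)
	}
	return v, nil
}

func main() {
	h := History{}
	first, _ := GetVersion(h, "change-id", DefaultVersion, 1, false)
	replayed, _ := GetVersion(h, "change-id", DefaultVersion, 1, true)
	fmt.Println(first, replayed) // both calls see the same recorded version

	// Code that supports only DefaultVersion cannot replay this history.
	_, err := GetVersion(h, "change-id", DefaultVersion, DefaultVersion, true)
	fmt.Println(err)
}
```

The last call illustrates why the supported range matters: once a version is recorded, any code that replays the execution must still accept it.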
| 30 | + |
| 31 | +### Deployment flow |
| 32 | + |
| 33 | +Let’s consider an example deployment of a change from workflow code **v0.1**, where only `FooActivity` is supported. |
| 34 | + |
| 35 | +```go |
| 36 | +// Git tag: v0.1 |
| 37 | +func MyWorkflow(ctx workflow.Context) error { |
| 38 | +    return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil) |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +to workflow code **v0.2**, which introduces a new `BarActivity` and utilizes the Versioning feature: |
| 43 | + |
| 44 | +```go |
| 45 | +// Git tag: v0.2 |
| 46 | +func MyWorkflow(ctx workflow.Context) error { |
| 47 | +    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1) |
| 48 | +    if version == workflow.DefaultVersion { |
| 49 | +        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil) |
| 50 | +    } |
| 51 | +    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil) |
| 52 | +} |
| 53 | +``` |
| 54 | + |
| 55 | +Before the rollout, only instances of workflow code **v0.1** existed. |
| 56 | + |
| 57 | + |
| 58 | + |
| 59 | +Rollouts are typically performed gradually, with new workers replacing previous worker instances one at a time. This means that multiple workers with workflow code **v0.1** and **v0.2** can exist simultaneously. When a worker is replaced, a running workflow execution is rescheduled to another worker. Thanks to the Versioning feature, a worker with workflow code **v0.2** can support a workflow execution started by a worker with workflow code **v0.1**. |
| 60 | + |
| 61 | + |
| 62 | +During rollouts, the service should continue to serve production traffic, allowing new workflows to be initiated. If a new worker processes a "Start Workflow Execution" request, it will execute a workflow based on the new version. However, if an old worker handles the request, it will start a workflow based on the old version. |
| 63 | + |
| 64 | + |
| 65 | + |
| 66 | +If a rollout is completed successfully, both the new and old workflows will continue to execute simultaneously. |
| 67 | + |
| 68 | + |
| 69 | +## Versioned Workflow Rescheduling Problem |
| 70 | + |
| 71 | +Workflows typically execute on the same worker on which they started. However, various factors can necessitate rescheduling to a different worker: |
| 72 | + |
| 73 | +* **Worker Shutdown**: Occurs when a worker is shut down due to reasons such as rollouts, rollbacks, restarts, or instance crashes. |
| 74 | +* **Worker Unavailability**: Occurs when a worker is running but loses connection to the server, becoming unavailable. |
| 75 | +* **High Traffic Load**: Occurs when a worker's sticky cache is fully utilized, preventing further workflow execution and causing the server to reschedule the workflow to another worker. |
| 76 | + |
| 77 | +During a rollout or rollback, rescheduling becomes unsafe for workflow executions that started with the new version, and rollbacks are the riskiest case: |
| 78 | + |
| 79 | + |
| 80 | +* If an old workflow is rescheduled to either an old or a new worker, it generally processes correctly. |
| 81 | +* If a new workflow is rescheduled to an old worker, it will be blocked or even fail (depending on `NonDeterministicWorkflowPolicy`). |
| 82 | + |
| 83 | +### Why did it happen? |
| 84 | + |
| 85 | +The old worker doesn't support the new version and cannot replay its history correctly, which leads to a non-deterministic error. The Versioning API allowed customers to make only backward-compatible changes to workflow code definitions; however, these changes were not forward-compatible. |
| 86 | + |
| 87 | +At the same time, there were no workarounds allowing customers to make these changes forward-compatible, so they couldn't separate code changes from the activation of the new version. |
| 88 | + |
| 89 | +### What impact did we have at Uber? |
| 90 | + |
| 91 | +To eliminate the negative impact of a rollback, a Cadence customer needed to identify all problematic workflows, terminate those that did not fail automatically, and restart them. The exact effort depended on the workflow code and the nature of the change. These steps resulted in a significant on-call burden, leading to possible SLO violations and incidents. |
| 92 | + |
| 93 | +Based on customer impact, we introduced changes in the Versioning API, enabling customers to separate code changes from the activation of the new version. |
| 94 | + |
| 95 | +## ExecuteWithVersion and ExecuteWithMinVersion |
| 96 | + |
| 97 | +The recent release of the Go SDK (Java support is coming soon) has extended the `GetVersion` function and introduced two new options: |
| 98 | + |
| 99 | +```go |
| 100 | +// When it's executed for the first time, it returns 2, instead of 10 |
| 101 | +version := workflow.GetVersion(ctx, "changeId", 1, 10, workflow.ExecuteWithVersion(2)) |
| 102 | + |
| 103 | +// When it's executed for the first time, it returns 1, instead of 10 |
| 104 | +version := workflow.GetVersion(ctx, "changeId", 1, 10, workflow.ExecuteWithMinVersion()) |
| 105 | +``` |
| 106 | + |
| 107 | +These two new options enable customers to choose which version should be returned when `GetVersion` is executed for the first time, instead of the maximum supported version. |
| 108 | + |
| 109 | +* `ExecuteWithVersion` returns the specified version. |
| 110 | +* `ExecuteWithMinVersion` returns the minimum supported version. |
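To make the first-execution semantics concrete, here is a small self-contained model written with Go functional options. It is a sketch under stated assumptions rather than the SDK code: `History`, `getVersionConfig`, and the standalone functions below are hypothetical stand-ins that only borrow the option names from the text, and range validation is left out for brevity.

```go
package main

import "fmt"

// History is a toy stand-in for recorded version markers, keyed by change ID.
type History map[string]int

// getVersionConfig holds the optional first-execution override (hypothetical).
type getVersionConfig struct {
	explicit *int // set by ExecuteWithVersion
	useMin   bool // set by ExecuteWithMinVersion
}

// GetVersionOption mirrors the shape of an option passed to GetVersion.
type GetVersionOption func(*getVersionConfig)

// ExecuteWithVersion asks the first execution to record v.
func ExecuteWithVersion(v int) GetVersionOption {
	return func(c *getVersionConfig) { c.explicit = &v }
}

// ExecuteWithMinVersion asks the first execution to record minSupported.
func ExecuteWithMinVersion() GetVersionOption {
	return func(c *getVersionConfig) { c.useMin = true }
}

// GetVersion returns the recorded version if one exists; otherwise it picks
// the version for the first execution according to the options.
func GetVersion(h History, changeID string, minSupported, maxSupported int, opts ...GetVersionOption) int {
	if v, ok := h[changeID]; ok {
		return v // already recorded: options are ignored
	}
	cfg := getVersionConfig{}
	for _, opt := range opts {
		opt(&cfg)
	}
	v := maxSupported // behaviour without options
	switch {
	case cfg.explicit != nil:
		v = *cfg.explicit
	case cfg.useMin:
		v = minSupported
	}
	h[changeID] = v
	return v
}

func main() {
	fmt.Println(GetVersion(History{}, "changeId", 1, 10))                          // 10
	fmt.Println(GetVersion(History{}, "changeId", 1, 10, ExecuteWithVersion(2)))   // 2
	fmt.Println(GetVersion(History{}, "changeId", 1, 10, ExecuteWithMinVersion())) // 1
}
```

In this model the options only influence the branch taken when nothing has been recorded yet, which is the property the deployment steps below rely on.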
| 111 | + |
| 112 | +Let’s extend the example above and consider the deployment of versioned workflows with new functions: |
| 113 | + |
| 114 | +### Deployment of Versioned workflows |
| 115 | + |
| 116 | +#### Step 0 |
| 117 | + |
| 118 | +The initial version remains **v0.1** |
| 119 | + |
| 120 | +```go |
| 121 | +// Git tag: v0.1 |
| 122 | +// MyWorkflow supports: workflow.DefaultVersion |
| 123 | +func MyWorkflow(ctx workflow.Context) error { |
| 124 | +    return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil) |
| 125 | +} |
| 126 | +``` |
| 127 | + |
| 128 | +When a `StartWorkflowExecution` request is processed, a new workflow execution will have a `DefaultVersion` of the upcoming change ID. |
| 129 | + |
| 130 | + |
| 131 | + |
| 132 | +#### Step 1 |
| 133 | + |
| 134 | +`GetVersion` is introduced as in the previous flow, but now with the `workflow.ExecuteWithVersion` option. |
| 135 | + |
| 136 | +```go |
| 137 | +// Git tag: v0.2 |
| 138 | +// MyWorkflow supports: workflow.DefaultVersion and 1 |
| 139 | +func MyWorkflow(ctx workflow.Context) error { |
| 140 | +    // When GetVersion is executed for the first time, workflow.DefaultVersion will be returned |
| 141 | +    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1, workflow.ExecuteWithVersion(workflow.DefaultVersion)) |
| 142 | + |
| 143 | +    if version == workflow.DefaultVersion { |
| 144 | +        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil) |
| 145 | +    } |
| 146 | +    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil) |
| 147 | +} |
| 148 | +``` |
| 149 | + |
| 150 | +Worker **v0.2** contains the new workflow code definition that supports the new logic. However, when a StartWorkflowExecution request is processed, a new workflow execution will still have the default version of the “MyChange” change ID. |
| 151 | + |
| 152 | + |
| 153 | + |
| 154 | +This change enables customers to easily roll back to worker **v0.1** without encountering any non-deterministic errors. |
| 155 | + |
| 156 | +#### Step 2 |
| 157 | + |
| 158 | +Once all **v0.1** workers are replaced with **v0.2** workers, we can deploy a new worker that begins workflow executions with the new version. |
| 159 | + |
| 160 | +```go |
| 161 | +// Git tag: v0.3 |
| 162 | +// MyWorkflow supports: workflow.DefaultVersion and 1 |
| 163 | +func MyWorkflow(ctx workflow.Context) error { |
| 164 | +    // When GetVersion is executed for the first time, Version #1 will be returned |
| 165 | +    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1) |
| 166 | + |
| 167 | +    if version == workflow.DefaultVersion { |
| 168 | +        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil) |
| 169 | +    } |
| 170 | +    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil) |
| 171 | +} |
| 172 | +``` |
| 173 | + |
| 174 | +Worker **v0.3** contains the new workflow code definition that supports the new logic while still supporting the previous logic. Therefore, when a StartWorkflowExecution request is processed, a new workflow execution will have Version \#1 of the “MyChange” change ID. |
| 175 | + |
| 176 | + |
| 177 | + |
| 178 | +This change enables customers to easily roll back to worker **v0.2** without any non-deterministic errors, as both worker versions support "DefaultVersion" and "Version \#1" of the “MyChange” change ID. |
| 179 | + |
| 180 | +#### Step 3 |
| 181 | + |
| 182 | +Once all workers **v0.3** replace the old worker **v0.2** and all workflows with the DefaultVersion of “MyChange” are **finished**, we can deploy a new worker that starts workflow executions with the new version and doesn’t support the previous logic. |
| 183 | + |
| 184 | +```go |
| 185 | +// Git tag: v0.4 |
| 186 | +// MyWorkflow supports: 1 |
| 187 | +func MyWorkflow(ctx workflow.Context) error { |
| 188 | +    // When GetVersion is executed for the first time, Version #1 will be returned |
| 189 | +    _ = workflow.GetVersion(ctx, "MyChange", 1, 1) |
| 190 | +    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil) |
| 191 | +} |
| 192 | +``` |
| 193 | + |
| 194 | +Worker **v0.4** contains the new workflow code definition that supports the new logic but does not support the previous logic. Therefore, when a StartWorkflowExecution request is processed, a new workflow execution will have Version \#1 of the “MyChange” change ID. |
| 195 | + |
| 196 | + |
| 197 | + |
| 198 | +This change finalizes the safe rollout of the new versioned workflow. At each step, both versions of workers are fully compatible with one another, making rollouts and rollbacks safe. |
| 199 | + |
| 200 | +#### Differences with the previous deployment flow |
| 201 | + |
| 202 | +The previous deployment flow for versioned workflows included only Steps 0, 2, and 3\. Therefore, a direct upgrade from Step 0 to Step 2 (skipping Step 1\) was not safe due to the inability to perform a safe rollback. The new functions enabled customers to have Step 1, thereby making the deployment process safe. |
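The compatibility argument behind the four steps can be checked mechanically. The sketch below is illustrative only: the `Worker` type and `SafeTogether` function are hypothetical, and worker **v0.1**, which has no `GetVersion` call, is modelled as supporting only the default version. Two worker builds are treated as safe to run side by side when each can replay the version that the other assigns to newly started executions.

```go
package main

import "fmt"

// DefaultVersion mirrors workflow.DefaultVersion.
const DefaultVersion = -1

// Worker is a hypothetical summary of one deployed build for "MyChange".
type Worker struct {
	Tag      string
	Min, Max int // supported version range
	Start    int // version recorded for newly started executions
}

func (w Worker) supports(v int) bool { return v >= w.Min && v <= w.Max }

// SafeTogether reports whether executions started by either worker can be
// replayed by the other, so rollout and rollback between them are both safe.
func SafeTogether(a, b Worker) bool {
	return a.supports(b.Start) && b.supports(a.Start)
}

func main() {
	steps := []Worker{
		{Tag: "v0.1", Min: DefaultVersion, Max: DefaultVersion, Start: DefaultVersion},
		{Tag: "v0.2", Min: DefaultVersion, Max: 1, Start: DefaultVersion},
		{Tag: "v0.3", Min: DefaultVersion, Max: 1, Start: 1},
		{Tag: "v0.4", Min: 1, Max: 1, Start: 1},
	}

	// Every adjacent pair in the new flow is mutually compatible.
	for i := 0; i+1 < len(steps); i++ {
		a, b := steps[i], steps[i+1]
		fmt.Printf("%s <-> %s safe: %v\n", a.Tag, b.Tag, SafeTogether(a, b))
	}

	// The previous flow jumped from v0.1 straight to v0.3.
	fmt.Printf("v0.1 <-> v0.3 safe: %v\n", SafeTogether(steps[0], steps[2]))
}
```

The model covers only newly started executions. The move to **v0.4** additionally requires that all executions recorded with the default version have finished, as described in Step 3.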
| 203 | + |
| 204 | +### Deployment with Dynamic Configuration |
| 205 | + |
| 206 | +Using the new options adds an extra step, which can lead to increased deployment time. If your service has a dynamic configuration, you can integrate it with the new functions to eliminate this problem and maintain the same number of deployments. |
| 207 | + |
| 208 | +The solution combines the code changes from [Step 1](https://docs.google.com/document/d/1mNR3BPqX94dY4-Hd_vBGe2es4XpIbylrE0TCH8xl7Fc/edit?tab=t.0#bookmark=id.rnkz0efm7hc1) and [Step 2](https://docs.google.com/document/d/1mNR3BPqX94dY4-Hd_vBGe2es4XpIbylrE0TCH8xl7Fc/edit?tab=t.0#bookmark=id.txibbvf2qjv) into a single code change. [Step 2](https://docs.google.com/document/d/1mNR3BPqX94dY4-Hd_vBGe2es4XpIbylrE0TCH8xl7Fc/edit?tab=t.0#bookmark=id.txibbvf2qjv) becomes a change in Dynamic Configuration rather than a new version deployment. Therefore, in this case, the deployment will include the following steps: |
| 209 | + |
| 210 | +#### Step 1 |
| 211 | + |
| 212 | +This Step 1 is similar to the original [Step 1](https://docs.google.com/document/d/1mNR3BPqX94dY4-Hd_vBGe2es4XpIbylrE0TCH8xl7Fc/edit?tab=t.0#bookmark=id.rnkz0efm7hc1) but introduces the retrieval of a value from the Dynamic Configuration. To keep the change both forward- and backward-compatible, the Dynamic Configuration value must be set to the minimum supported version, `workflow.DefaultVersion`, at this step. |
| 213 | + |
| 214 | +```go |
| 215 | + |
| 216 | +var ( |
| 217 | +    // Get dynamic config client during app initialization |
| 218 | +    dcClient, _ = dynamicConfig.NewClient() |
| 219 | +    region      = getRegion() |
| 220 | +) |
| 221 | + |
| 222 | +// Git tag: v0.2 |
| 223 | +// MyWorkflow supports: workflow.DefaultVersion and 1 |
| 224 | +func MyWorkflow(ctx workflow.Context) error { |
| 225 | +    // The call of the dynamic configuration client is safe for the workflow execution |
| 226 | +    // because it's not saved in the history events. |
| 227 | +    changeIDExecVersion, err := dcClient.GetIntValue("changeId", map[string]interface{}{ |
| 228 | +        "region": region, |
| 229 | +    }) |
| 230 | +    if err != nil { |
| 231 | +        // Error handling must be deterministic |
| 232 | +    } |
| 233 | + |
| 234 | +    // When GetVersion is executed for the first time, changeIDExecVersion will be returned |
| 235 | +    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1, workflow.ExecuteWithVersion(changeIDExecVersion)) |
| 236 | + |
| 237 | +    if version == workflow.DefaultVersion { |
| 238 | +        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil) |
| 239 | +    } |
| 240 | + |
| 241 | +    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil) |
| 242 | +} |
| 243 | + |
| 244 | + |
| 245 | +``` |
| 246 | + |
| 247 | +#### Step 2 |
| 248 | + |
| 249 | +Once all v0.1 workers are replaced by v0.2 workers, you need to change the Dynamic Configuration value to the next version (in this case, 1). This will activate the new logic for new workflow executions. To roll back from Step 2 to Step 1, simply revert the dynamic configuration value. |
| 250 | + |
| 251 | +#### Safety |
| 252 | + |
| 253 | +The new options do not alter the logic of replaying, so dynamically changing the value will not cause non-deterministic errors during the replay of old or new workflow executions; therefore, it is safe to change the value. Only the minimum and maximum supported versions are used during replay, indicating which versions the code supports. |
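The safety claim can be illustrated with a toy replay. In this sketch, which assumes nothing about the SDK internals, `History`, `getVersion`, and `runWorkflow` are hypothetical names, and the `configured` argument plays the role of the value read from Dynamic Configuration. The supported range is passed only to mirror the real call shape and is not validated here.

```go
package main

import "fmt"

// DefaultVersion mirrors workflow.DefaultVersion.
const DefaultVersion = -1

// History is a toy stand-in for recorded version markers, keyed by change ID.
type History map[string]int

// getVersion uses execVersion only when nothing has been recorded yet.
func getVersion(h History, changeID string, minSupported, maxSupported, execVersion int) int {
	if v, ok := h[changeID]; ok {
		return v // replay: the configured value plays no role
	}
	h[changeID] = execVersion
	return execVersion
}

// runWorkflow returns the name of the activity the workflow would schedule.
func runWorkflow(h History, configured int) string {
	if getVersion(h, "MyChange", DefaultVersion, 1, configured) == DefaultVersion {
		return "FooActivity"
	}
	return "BarActivity"
}

func main() {
	old := History{}
	fmt.Println(runWorkflow(old, DefaultVersion)) // started before the flip
	fmt.Println(runWorkflow(old, 1))              // replayed after the flip

	fresh := History{}
	fmt.Println(runWorkflow(fresh, 1))              // started after the flip
	fmt.Println(runWorkflow(fresh, DefaultVersion)) // replayed after a revert
}
```

Each execution keeps scheduling the activity it started with, regardless of later changes to the configured value.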
| 254 | + |
| 255 | +## Conclusion |
| 256 | + |
| 257 | +The new options introduced into `GetVersion` address gaps in the Versioning logic that previously led to failed workflow executions. This enhancement improves the safety of deploying versioned workflows, allowing for the separation of code changes from the activation of new logic, making the deployment process more predictable. This extension of `GetVersion` is a significant improvement that opens the way for future optimizations. |