142 changes: 142 additions & 0 deletions 20250404-RS-workflow-rerun-from-activity.md
@@ -0,0 +1,142 @@
# Workflow: Rerun from Activity

* Author(s): @joshvanl, @whitwaldo

## Overview

This proposal details the ability to rerun a workflow from a previous point in its history.
A workflow in a terminal state can be rerun from a failed activity, before the failed activity, or at any activity in the history of a successful or failed workflow.

## Background

It is often desirable to rerun business logic implemented inside a workflow.
This could be because an activity in the workflow failed due to a transient error, an external dependency changed or a resource is now available to an activity, or it's just desirable for some subset of the last set of Activities to be rerun.
Dapr should provide the functionality to rerun a workflow from any point in its history.

@JoshVanL @WhitWaldo

Maybe I'm missing something important here, but is the implementation looking to delete the event source history in order to rewind back to a previous state? As though a bunch of operations never actually happened?

Modifying the event source history destroys the provenance of the Workflow, which would be a significant detractor to Workflows. Some regulated industries favour (and are attracted to) technologies which have guarantees around provenance / immutability.

@olitomlinson, Apr 8, 2025

This could be because an activity in the workflow failed due to a transient error, an external dependency changed or a resource is now available to an activity, or it's just desirable for some subset of the last set of Activities to be rerun.

This to me sounds like the Business Process was never fully encoded into the Workflow. For example, maybe there was a Compensation Activity that should have run at the point of the transient error, or the Workflow should have gone into a deliberate Waiting For External Event phase while waiting for a "resource to become available".

As I put on the other proposal, IMO we should be investing in ways to allow users to safely patch/version a Workflow as their Business Process evolves. This has been asked for multiple times on Discord over the years.

Contributor Author

Given the feedback, I think we are all aligned that newInstanceID should be a required field, so that the workflow history is never deleted and is instead cloned.

This to me sounds like the Business Process was never fully encoded into the Workflow.

There will always be a use case for wanting to rerun a workflow from a failed/terminal state. A prime example of this is an activity calling an external service but the IP is currently being rate limited. Yes, it can always be the case that users can implement their own retry logic or create sub workflow etc. but users do expect this feature to be available as a primitive.

Imagine being in a situation whereby a workflow which takes 4 hours to transcode some file fails at the very last step because some s3 upload transiently timed out or the permissions were slightly off. Yes, the user could have "known better" or manually do the rest of the tasks, but it would also be useful for the user to be able to rerun the workflow from this failed state.

I don't disagree that versioning of Workflows is also a useful feature, but this is not what this proposal is going after and is not currently the priority of the team, even if I too would like this to be worked on.

At the same time, maybe this is just a docs problem. We already have the means to re-run workflows from a failed state: put it in a child workflow and when it fails, re-run it.

There will always be a use case for wanting to rerun a workflow from a failed/terminal state.

A user can do this today by taking the input of the Workflow (which is serialized in the state of the workflow under "properties"."dapr.workflow.input") and starting the workflow again via the existing API to start a workflow.

A prime example of this is an activity calling an external service but the IP is currently being rate limited. Yes, it can always be the case that users can implement their own retry logic or create sub workflow etc. but users do expect this feature to be available as a primitive.

Implementing their own retry logic and creating sub workflows is exactly the right way to go about this. This is what Developers must be educated to do in order to build resilient systems.

Imagine being in a situation whereby a workflow which takes 4 hours to transcode some file fails at the very last step because some s3 upload transiently timed out or the permissions were slightly off. Yes, the user could have "known better" or manually do the rest of the tasks, but it would also be useful for the user to be able to rerun the workflow from this failed state.

I get the problem, truly I do. I spent the best part of 4 years writing and maintaining systems which used Azure Durable Functions, which for all intents and purposes is the same as Dapr Workflows.

I accept that re-running the Workflow from a given point in time would be very helpful for these kinds of scenarios, but what about operations that run after these Activities which have had their control-flow impacted? For example, firing Activities to other systems, updating other external Records of truth, or waiting on External Events that may never come and therefore time out and produce undesirable (potentially damaging!) side-effects in other systems / ledgers.

While some thoughtful users will no doubt take the time to review and understand the implications of rewinding a Workflow to a given point, it will create chaos for others. Which leaves me thinking how practical is this, and are we knowingly handing over a "footgun" to adopters.

At this point, for me, this lies in the "just because we can, doesn't mean we should" camp.

but users do expect this feature to be available as a primitive.

Given all the similarities often raised between Dapr Workflows and Temporal, I took a look through their docs this morning to see if they have anything similar and the only thing I've found is a "Reset" which functions pretty much exactly as my proposal in that it copies the event history up to the reset point (or checkpoint). While they allow the same workflow ID between runs, this is because they instead differentiate by "Workflow Execution" which differ by a unique "Run ID".

They do differ from my proposal in that they allow only one running workflow execution per workflow ID.

Along the lines of how Temporal does this, @JoshVanL , do you have any thoughts about adding another nullable property to the workflow object itself that indicates the workflow ID that the workflow was created from (null if it wasn't created from one at all) and then adding that metadata to the event history API you've proposed? At least that way, there's some way to correlate where any given workflow came from, especially as event streaming isn't in the current works.


## Related Items

https://github.com/dapr/proposals/pull/79

## Design

The following proto RPC and messages will be added to durabletask, exposed via each SDK.
This API implements rerunning a workflow from a specific Activity in the history of the workflow.
The Activity to be rerun is chosen via its associated event ID.

All `Activities` are assigned an event ID in the durabletask history.
While `Timers` and `RaiseEvents` tasks are also assigned an event ID in the durabletask history, the workflow cannot be rerun from these events using this API.
Not only would supporting rerunning the workflow from these two event types require a significant code refactor, but in practice users are only interested in rerunning a workflow from a specific _activity_, not a "control event".

It must be the case that the workflow is in a _terminal_ state before the rerun can be executed.

Is there a cancelled terminal state or only succeeded and failed? Just thinking that there may be various use cases where you want to edit an in-flight workflow and effectively cancel and restart from activity X but not have all the intermediary workflow executions classified as failed. Thinking about how this is represented in other types of workflow engines like GitHub Actions. Maybe we can achieve this without an additional terminal state?

@WhitWaldo, Apr 6, 2025

There are several states that could be considered terminal for our purposes here. I read this proposal specifically as looking to avoid having someone change behavior while an activity is running (e.g. the workflow needs to be in such a state that an activity isn't going to be imminently invoked).

To that end, I'd say any of the following states would be fine then:

  • Failed
  • Completed
  • Canceled
  • Terminated
  • Suspended

A workflow reaches a terminal state because it has completed successfully, failed at some activity, or been forcibly terminated.
Rerunning a workflow which is currently in progress does not make any practical or academic sense.
Attempting to do so will return an error to the client.

When rerunning a workflow, the workflow will be started from the event ID of an Activity in the history.
The workflow actor will delete all history events after the event ID of the Activity chosen.

Does this mean we'd lose the history of the original execution - is possible to retain/archive that?

Ah think you answered this in a later point

@WhitWaldo, Apr 6, 2025

This is where I prefer the checkpoint -> clone event state because this approach makes the workflows themselves mutable and subject to change by post-terminal operations. Rather than re-run a workflow and modify the events that potentially have run (even if the larger workflow didn't), it feels icky to modify the workflow data itself instead of clone it out to a distinctly separate workflow subject to its own state.

Further, I guess the ickiness factor compounds for me when I think of the perils of mutability. Because someone can re-run the workflow using the existing activities on it (instead of having a clone of the workflow history with only the inputs and outputs of those activities), it feels like this opens the door to a great deal of unnecessary history-destroying logic that's antithetical to the concept of event sourcing in the first place (where you've got an immutable append-only log).

Further, if you've got mutable workflows, it seems like you rather force usage of Cassie's event stream proposal to understand what has run and when since you can no longer trust any values you might pull for a given workflow from your history endpoint.

I think we could just change this from optional to required

The client can optionally give a new instance ID to use when rerunning the workflow.

then it won't be possible to mutate the workflow history post-execution

I disagree - this proposal does not call for cloning the workflow state as mine does, but explicitly mutates the workflow history. Because there's no cloning, I don't know how your new workflow would have access to the event history of another workflow.

In my own, because you're using a cloned history from some intermediate step in another workflow, you're never mutating anything.

Contributor Author

I’ll be updating the proposal to clone the state.

The client can give an optional _new_ input to the Activity from which the workflow will be rerun.
If defined, the activity will be started with the new input data.
If no input is given, the activity will be started with the same input as the original workflow.

The client can optionally give a new instance ID to use when rerunning the workflow.
This is useful for when the client wishes to preserve the history of the source workflow that is being rerun.
By default, `RerunWorkflowFromActivity` will use the same instance ID as the source workflow, and therefore delete all history events after the event ID of the Activity chosen.

If the targeted `eventID` does not exist, or is not an Activity event, the API will return an error to the client.
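
The rules above (terminal state required, target must be an existing Activity event, optional input override) can be sketched as follows. This is a minimal illustration, not the durabletask implementation: `Status`, `EventKind`, `HistoryEvent`, `rerunHistory` and the error values are all hypothetical names, and the sketch copies the history rather than truncating it in place, following the direction agreed in the review discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the workflow runtime status.
type Status int

const (
	Running Status = iota
	Completed
	Failed
	Terminated
)

// EventKind is a hypothetical, reduced set of history event types.
type EventKind int

const (
	ActivityScheduled EventKind = iota
	TimerCreated
)

// HistoryEvent is a simplified history entry, not the durabletask message.
type HistoryEvent struct {
	EventID int32
	Kind    EventKind
	Input   string
}

var (
	ErrNotTerminal   = errors.New("workflow is not in a terminal state")
	ErrEventNotFound = errors.New("event ID not found in history")
	ErrNotActivity   = errors.New("event ID does not refer to an activity")
)

// rerunHistory validates a rerun request and returns a copy of the history up
// to and including the targeted activity. The source slice is never modified.
func rerunHistory(status Status, history []HistoryEvent, eventID int32, newInput *string) ([]HistoryEvent, error) {
	if status == Running {
		return nil, ErrNotTerminal
	}
	for i, ev := range history {
		if ev.EventID != eventID {
			continue
		}
		if ev.Kind != ActivityScheduled {
			return nil, ErrNotActivity
		}
		cloned := append([]HistoryEvent(nil), history[:i+1]...)
		if newInput != nil {
			cloned[i].Input = *newInput // override only when the client supplies input
		}
		return cloned, nil
	}
	return nil, ErrEventNotFound
}

func main() {
	history := []HistoryEvent{
		{EventID: 0, Kind: ActivityScheduled, Input: "a"},
		{EventID: 1, Kind: TimerCreated},
		{EventID: 2, Kind: ActivityScheduled, Input: "b"},
	}
	input := "b-retry"
	cloned, err := rerunHistory(Failed, history, 2, &input)
	fmt.Println(len(cloned), cloned[2].Input, err)
}
```

Which statuses count as terminal is itself an assumption here (anything other than `Running`); the review thread discusses whether states such as suspended should also qualify.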

```proto
service TaskHubSidecarService {
  // Rerun a Workflow from a specific event ID from an activity.
  rpc RerunWorkflowFromActivity(RerunWorkflowFromActivityRequest) returns (RerunWorkflowFromActivityResponse);
}

// RerunWorkflowFromActivityRequest is used to rerun a workflow instance from a
// specific event ID.
message RerunWorkflowFromActivityRequest {
@WhitWaldo, Apr 6, 2025

There isn't an existing mechanism to drill into and correlate event IDs with activity names in workflows, so this feels like we're potentially exposing an implementation detail that the developer shouldn't be privy to. Especially as an activity can be used multiple times in a workflow, this feels like a real accident waiting to happen of someone picking the wrong activity to resume from.

Take instead my own checkpoint-based proposal in which it's clear where the checkpoints take place between sets of one or more activities because they're inlined with the rest of the workflow logic (though I like your assertion of them below as a no-op). They wouldn't be so much event insertions like your own proposal so much as places where the event log is cloned for potential future invocation.

I think this proposed re-run from activity API is desired over the checkpointing API proposal, to me not requiring the users to update their workflow code to have access to this feature is a big plus. Also, notice the new GetInstanceHistory API added to solve the discoverability problem

@WhitWaldo, Apr 8, 2025

I would propose that most people would never actually use this functionality because it can likely be addressed by simply refactoring whatever workflow appears to require it.

For example, take a typical purchase workflow. An activity processes payment information and checks inventory to find that they don't actually have any of that product and it's on backorder. They don't want to compensate by refunding the transaction and instead would like to substitute in an equivalent product at the same price to complete the order. Either the activity or checkpoint API can be used here to start the workflow again following the payment processing and insert the new product identifier to finish the procurement process.

This workflow could just as easily be refactored so the inventory check is offloaded to a child workflow. When it returns false, put compensation logic in the parent workflow to look for a substitution (or await an external event from a manager manually picking the substitute) and pass the new product ID into that child workflow.

While replaying the workflow could certainly address the issue raised, this trivial refactoring addresses the same scenario without actually requiring new development time and features. If you've got some complex scenario that cannot be addressed using workflows and child workflows and specifically requires after-the-fact data manipulation, I'd suggest that's far more an edge case than something that'll be commonly used. When asked in Discord, I'd suggest refactoring far more readily than what's proposed by either of these replay APIs.

I disagree that the activity API is a clear winner:

My checkpoint API provides a clear point in time to re-run an event history clone from ensuring that history is never lost (unless explicitly purged or TTL elapses, of course).

The activity API instead requires the user to make sense of a potentially non-linear event history and tease out which activity they want to supersede, even though it's very possible that some events have already run. That in turn requires developers to make activities idempotent, which is potentially a high bar not presently required.

Contributor Author

Requiring workflows to be updated with pre-known checkpoints means that we lose the functionality for users to be able to re-run a workflow from an activity during unforeseen circumstances. I have commented more on this below.

  // instanceID is the orchestration instance ID to rerun.
  string instanceID = 1;

  // eventID is the event ID to start the new workflow instance from.
  int32 eventID = 2;

  // newInstanceID is the optional new instance ID to use for the new workflow
  // instance.
  optional string newInstanceID = 3;

  // input can optionally be given to give the new instance a different input
  // to the next Activity event.
  google.protobuf.StringValue input = 4;

missing optional tag?

}

// RerunWorkflowFromActivityResponse is the response to executing
// RerunWorkflowFromActivity.
message RerunWorkflowFromActivityResponse {
  string instanceId = 1;
}
```

### Getting Instance History

As a complement to the `RerunWorkflowFromActivity` API, a new API is added to get the history of run activities for a workflow instance.
Note that the API returns _all_ history events for the workflow instance, including control events which do _not_ contain an event ID.
This API is intended to be used for discovering the event ID of the activity to rerun from.
The actor backend will get the instance history from the state store and return it to the client, using a new workflow Actor invoke method.

```proto
service TaskHubSidecarService {
  // GetInstanceHistory retrieves the history of a workflow instance.
  rpc GetInstanceHistory(GetInstanceHistoryRequest) returns (GetInstanceHistoryResponse);
}

// GetInstanceHistoryRequest is used to get the history of a workflow instance.
message GetInstanceHistoryRequest {
  // instanceID is the orchestration instance ID to get the history for.
  string instanceID = 1;
}

// GetInstanceHistoryResponse is the response to executing
// GetInstanceHistory.
message GetInstanceHistoryResponse {
  repeated HistoryEvent events = 1;
}
```
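
As a rough illustration of the discovery step, a client could scan the returned events for scheduled activities with a given name and collect their event IDs. The `HistoryEvent` struct below is a hypothetical, flattened view used only for this sketch (the real durabletask `HistoryEvent` message is shaped differently), and `activityEventIDs` is an illustrative helper, not part of any SDK.

```go
package main

import "fmt"

// HistoryEvent is a hypothetical, simplified view of a history entry.
type HistoryEvent struct {
	EventID      int32  // negative models control events without an event ID
	ActivityName string // empty unless the event scheduled an activity
}

// activityEventIDs returns the event IDs of every scheduled activity with the
// given name, in history order. An activity may appear more than once.
func activityEventIDs(events []HistoryEvent, name string) []int32 {
	var ids []int32
	for _, ev := range events {
		if ev.EventID >= 0 && ev.ActivityName == name {
			ids = append(ids, ev.EventID)
		}
	}
	return ids
}

func main() {
	history := []HistoryEvent{
		{EventID: -1},                           // control event, no event ID
		{EventID: 0, ActivityName: "Transcode"}, // first activity
		{EventID: 1, ActivityName: "Upload"},
		{EventID: -1},
		{EventID: 2, ActivityName: "Upload"}, // same activity scheduled again
	}
	fmt.Println(activityEventIDs(history, "Upload")) // prints [1 2]
}
```

Returning every match rather than the first makes the ambiguity raised in review visible: when an activity is used several times in one workflow, the caller still has to decide which occurrence to rerun from.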

### Concurrent Activities

Workflow activities are often run concurrently, e.g. in fan-out patterns.
This means the resulting order of execution in the workflow history can be non-deterministic.
The durabletask history is currently a linear sequence of events.
As a result, rerunning a workflow from a specific Activity which is a member of a fan-out pattern may also rerun peer fan-out activities, depending on the order of termination of Activities in the fan-out group.

I've been thinking if it's at all possible for the SDKs to actually force users to use an explicit Task.WhenAll etc. API instead of relying on the language primitives. That way the SDK can actually include metadata to tell the workflow engine that things are executed concurrently. This would be super useful for visualization. Or if there's some other magic way to do this then we could avoid this scenario?

Actually we discussed looking at the history and if you have multiple TaskScheduled with Completion then you can infer they are concurrent?

I'd be opposed to forcing users to use any particular library-specific logic as the shape, style and effect would differ by language. It's clear enough to say that the activity boundary is at the await in .NET whether it's a one-off activity or part of a Task.WhenAll.

Seems like you could fairly easily achieve your visualization idea using Cassie's event stream proposal - if you've got a bunch of activities that just kicked off and are running on a given workflow and you haven't yet received a completion for any, they were concurrently started.

Ideally we wouldn't need to enforce explicit methods - you can probably build a "good enough" model based on the linear history of start/complete events - it might just be susceptible to all sorts of edge cases that expose the differences between the workflow structure and the execution flow...
for instance, is it theoretically possible that a subset of concurrently scheduled activities complete before some of the activities in the set have even been started? Or is this not possible?

Not even theoretically - just because all the activities are started at the same time, there's absolutely no guarantee that they'd each run for the same length of time.

This may or may not be desirable to the user, but is otherwise a limitation of the API.
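
Before choosing an event ID inside a fan-out, a client could estimate which peers might be affected by inferring concurrency from the linear history, along the lines suggested in the review discussion: two activities are treated as concurrent when one was scheduled before the other completed. The sketch below assumes a simplified history of scheduled/completed pairs; `Event`, `TaskID` and `concurrentPeers` are hypothetical names, and the result is only a heuristic, not a statement of what the runtime will rerun.

```go
package main

import "fmt"

// Event is a hypothetical, simplified history entry for an activity.
type Event struct {
	Scheduled bool  // true: activity scheduled; false: activity completed
	TaskID    int32 // event ID of the scheduled activity this entry refers to
}

// span records the history positions between which an activity was in flight.
type span struct{ start, end int }

// concurrentPeers returns the IDs of activities whose in-flight span overlaps
// that of target, in the order they were scheduled.
func concurrentPeers(history []Event, target int32) []int32 {
	spans := map[int32]*span{}
	var order []int32
	for i, ev := range history {
		if ev.Scheduled {
			// Treat the activity as in flight until a completion is seen.
			spans[ev.TaskID] = &span{start: i, end: len(history)}
			order = append(order, ev.TaskID)
		} else if s, ok := spans[ev.TaskID]; ok {
			s.end = i
		}
	}
	t, ok := spans[target]
	if !ok {
		return nil
	}
	var peers []int32
	for _, id := range order {
		s := spans[id]
		if id != target && s.start < t.end && t.start < s.end {
			peers = append(peers, id)
		}
	}
	return peers
}

func main() {
	// Activity 0 runs alone, then 1, 2 and 3 fan out, then 4 runs alone.
	history := []Event{
		{true, 0}, {false, 0},
		{true, 1}, {true, 2}, {true, 3},
		{false, 3}, {false, 1}, {false, 2},
		{true, 4}, {false, 4},
	}
	fmt.Println(concurrentPeers(history, 2)) // prints [1 3]
}
```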

Users who expect to rerun a workflow regularly from known points should be advised to make use of "checkpoint" activities.
These "checkpoint" activities should be a no-op, returning an output that is the same as the input.
They serve as well-known activity event ID markers at the points where the user knows it will be desirable to rerun the workflow.

```go
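// checkpoint is a no-op activity: it decodes its input and returns it unchanged.
func checkpoint(ctx task.ActivityContext) (any, error) {
	var input int
	if err := ctx.GetInput(&input); err != nil {
		return nil, err
	}
	return input, nil
}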
```

### SDK Changes

This API will be exposed on all durabletask SDKs.
The semantics are generally dependent on the flavour of each SDK language, however:
- The `instanceID` is a required string to target for the rerun.
- The `eventID` is a required int32 to target the Activity for the rerun.
- An optional `newInstanceID` string which, if defined, preserves the existing workflow history and reruns the workflow from the event ID of the Activity under the new instance ID.
- An optional `input` which, if defined, will be used as the input to the targeted Activity when rerunning the workflow.
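
In Go, one way an SDK might surface these parameters is with required positional arguments plus functional options for the optional ones. This is only a sketch of the shape: `RerunRequest`, `RerunOption`, `WithNewInstanceID`, `WithInput` and `NewRerunRequest` are hypothetical names that mirror the proto fields above, not an existing SDK API.

```go
package main

import (
	"errors"
	"fmt"
)

// RerunRequest mirrors the fields of the proposed
// RerunWorkflowFromActivityRequest message (hypothetical Go shape).
type RerunRequest struct {
	InstanceID    string
	EventID       int32
	NewInstanceID *string // nil when not provided
	Input         *string // nil means "reuse the original activity input"
}

// RerunOption configures an optional field on the request.
type RerunOption func(*RerunRequest)

// WithNewInstanceID reruns under a new instance ID, preserving the source history.
func WithNewInstanceID(id string) RerunOption {
	return func(r *RerunRequest) { r.NewInstanceID = &id }
}

// WithInput overrides the input given to the targeted activity.
func WithInput(input string) RerunOption {
	return func(r *RerunRequest) { r.Input = &input }
}

// NewRerunRequest builds a request from the required fields and any options.
func NewRerunRequest(instanceID string, eventID int32, opts ...RerunOption) (*RerunRequest, error) {
	if instanceID == "" {
		return nil, errors.New("instanceID is required")
	}
	req := &RerunRequest{InstanceID: instanceID, EventID: eventID}
	for _, opt := range opts {
		opt(req)
	}
	return req, nil
}

func main() {
	req, err := NewRerunRequest("order-42", 7, WithNewInstanceID("order-42-rerun"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.InstanceID, req.EventID, *req.NewInstanceID, req.Input == nil)
}
```

If `newInstanceID` becomes required, as the review discussion suggests, it would move from an option to a positional argument in a design like this.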

## Completion Checklist

* Implement proto & API changes to durabletask.
* Update dapr workflow runtime to support the new APIs.
* Update SDKs to support the new APIs.