Conversation

@JoshVanL JoshVanL commented Apr 6, 2025

No description provided.

While `Timers` and `RaiseEvents` tasks are also assigned an event ID in the durabletask history, the workflow cannot be rerun from these events using this API.
Supporting reruns from these two event types would require a significant code refactor, and in practice users are only interested in rerunning a workflow from a specific _activity_, not a "control event".

The workflow must be in a _terminal_ state before the rerun can be executed.

Is there a cancelled terminal state, or only succeeded and failed? Just thinking that there may be various use cases where you want to edit an in-flight workflow and effectively cancel and restart from activity X, but not have all the intermediary workflow executions classified as failed. Thinking about how this is represented in other types of workflow engines like GitHub Actions. Maybe we can achieve this without an additional terminal state??

@WhitWaldo WhitWaldo commented Apr 6, 2025

There are several states that could be considered terminal for our purposes here. I read this proposal specifically as looking to avoid having someone change behavior while an activity is running (e.g. the workflow needs to be in such a state that an activity isn't going to be imminently invoked).

To that end, I'd say any of the following states would be fine then:

  • Failed
  • Completed
  • Canceled
  • Terminated
  • Suspended

Attempting to rerun a workflow that is not in a terminal state will return an error to the client.

When rerunning a workflow, the workflow will be started from the event ID of an Activity in the history.
The workflow actor will delete all history events after the event ID of the Activity chosen.
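
For illustration, a minimal Go sketch of that fork, under the proposal's final shape (per the later comments, the history is cloned to a new instance rather than deleted in place). All names and types here are illustrative stand-ins, not the actual dapr/durabletask-go implementation:

```go
package workflow

import (
	"errors"
	"fmt"
)

// HistoryEvent and RuntimeStatus are illustrative stand-ins for the
// durabletask history types.
type HistoryEvent struct {
	EventID    int32
	IsActivity bool
}

type RuntimeStatus int

const (
	StatusRunning RuntimeStatus = iota
	StatusCompleted
	StatusFailed
	StatusTerminated
)

// forkHistory returns a copy of the history prefix preceding the target
// activity event; the rerun instance replays this prefix and schedules the
// target activity afresh. Exactly where the cut lands relative to the
// target's own events is an implementation detail.
func forkHistory(status RuntimeStatus, history []HistoryEvent, targetEventID int32) ([]HistoryEvent, error) {
	if status == StatusRunning {
		return nil, errors.New("workflow must be in a terminal state before rerunning")
	}
	for i, ev := range history {
		if ev.EventID != targetEventID {
			continue
		}
		if !ev.IsActivity {
			return nil, fmt.Errorf("event %d is not an activity: timers and raised events are not rerun targets", targetEventID)
		}
		// Copy the prefix so the source instance's history is never mutated.
		fork := make([]HistoryEvent, i)
		copy(fork, history[:i])
		return fork, nil
	}
	return nil, fmt.Errorf("event ID %d not found in history", targetEventID)
}
```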

Does this mean we'd lose the history of the original execution - is possible to retain/archive that?

Ah think you answered this in a later point

@WhitWaldo WhitWaldo commented Apr 6, 2025

This is where I prefer the checkpoint -> clone event state approach, because this one makes the workflows themselves mutable and subject to change by post-terminal operations. Rather than re-run a workflow and modify events that have potentially already run (even if the larger workflow didn't), it feels icky to modify the workflow data itself instead of cloning it out to a distinctly separate workflow subject to its own state.

Further, I guess the ickiness factor compounds for me when I think of the perils of mutability. Because someone can re-run the workflow using the existing activities on it (instead of having a clone of the workflow history with only the inputs and outputs of those activities), it feels like this opens the door to a great deal of unnecessary history-destroying logic that's antithetical to the concept of event sourcing in the first place (where you've got an immutable append-only log).

Further, if you've got mutable workflows, it seems like you rather force usage of Cassie's event stream proposal to understand what has run and when since you can no longer trust any values you might pull for a given workflow from your history endpoint.

I think we could just change this from optional to required

The client can optionally give a new instance ID to use when rerunning the workflow.

then it won't be possible to mutate the workflow history post-execution

I disagree - this proposal does not call for cloning the workflow state as mine does, but explicitly mutates the workflow history. Because there's no cloning, I don't know how your new workflow would have access to the event history of another workflow.

In my own, because you're using a cloned history from some intermediate step in another workflow, you're never mutating anything.

@JoshVanL (Contributor, Author)

I’ll be updating the proposal to clone the state.

It is often the case that workflow activities are run concurrently, i.e. in fan-out patterns.
This means the order of events in the resulting workflow history can be non-deterministic.
The durabletask history is currently a linear sequence of events.
Rerunning a workflow from a specific Activity that is a member of a fan-out pattern may therefore also rerun peer fan-out activities, depending on the order in which the Activities in the fan-out group completed.

I've been thinking if it's at all possible for the SDKs to actually force users to use an explicit Task.WhenAll etc. API instead of relying on the language primitives. That way the SDK can actually include metadata to tell the workflow engine that things are executed concurrently. This would be super useful for visualization. Or if there's some other magic way to do this then we could avoid this scenario?

Actually, we discussed looking at the history: if you have multiple TaskScheduled events before their completions, then you can infer they are concurrent?
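
For illustration, a rough Go sketch of that inference over a linear history, assuming hypothetical `TaskScheduled`/`TaskCompleted` entries paired by task ID (a heuristic sketch only, not the real durabletask event types):

```go
package history

// Event is an illustrative flattened history entry; Kind is either
// "TaskScheduled" or "TaskCompleted", and TaskID links the pair.
type Event struct {
	Kind   string
	TaskID int32
}

// concurrentGroups scans the linear history: whenever a task completes
// while other tasks are still open, the open set was running concurrently.
// This is a heuristic over the linear log, not ground truth about the
// workflow's structure.
func concurrentGroups(events []Event) [][]int32 {
	var groups [][]int32
	var open []int32 // task IDs scheduled but not yet completed
	for _, ev := range events {
		switch ev.Kind {
		case "TaskScheduled":
			open = append(open, ev.TaskID)
		case "TaskCompleted":
			if len(open) > 1 {
				group := make([]int32, len(open))
				copy(group, open)
				groups = append(groups, group)
			}
			for i, id := range open { // drop the completed task
				if id == ev.TaskID {
					open = append(open[:i], open[i+1:]...)
					break
				}
			}
		}
	}
	return groups
}
```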

I'd be opposed to forcing users to use any particular library-specific logic as the shape, style and effect would differ by language. It's clear enough to say that the activity boundary is at the await in .NET whether it's a one-off activity or part of a Task.WhenAll.

Seems like you could fairly easily achieve your visualization idea using Cassie's event stream proposal - if you've got a bunch of activities that just kicked off and are running on a given workflow and you haven't yet received a completion for any, they were concurrently started.

Ideally we wouldn't need to enforce explicit methods - you can probably build a "good enough" model based on the linear history of start/complete events - it might just be susceptible to all sorts of edge cases that expose the differences between the workflow structure and the execution flow...
for instance, is it theoretically possible that a subset of concurrently scheduled activities complete before some of the activities in the set have even been started? Or is this not possible?

Not even theoretically - just because all the activities are started at the same time, there's absolutely no guarantee that they'd each run for the same length of time.


// RerunWorkflowFromActivityRequest is used to rerun a workflow instance from a
// specific event ID.
message RerunWorkflowFromActivityRequest {
@WhitWaldo WhitWaldo commented Apr 6, 2025

There isn't an existing mechanism to drill into and correlate event IDs with activity names in workflows, so this feels like we're potentially exposing an implementation detail that the developer shouldn't be privy to. Especially as an activity can be used multiple times in a workflow, this feels like a real accident waiting to happen, with someone picking the wrong activity to resume from.

Take instead my own checkpoint-based proposal, in which it's clear where the checkpoints take place between sets of one or more activities because they're inlined with the rest of the workflow logic (though I like your assertion of them below as a no-op). They wouldn't be so much event insertions, as in your proposal, as places where the event log is cloned for potential future invocation.

I think this proposed rerun-from-activity API is desired over the checkpointing API proposal; to me, not requiring users to update their workflow code to have access to this feature is a big plus. Also, note the new GetInstanceHistory API added to solve the discoverability problem.
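
For context, the discoverability flow might look something like this (a hypothetical shape only, not the actual GetInstanceHistory definition):

```go
package history

// HistoryEntry is a hypothetical row surfaced by GetInstanceHistory,
// letting a caller correlate event IDs with activity names before
// choosing a rerun target.
type HistoryEntry struct {
	EventID      int32
	ActivityName string
	Status       string // e.g. "COMPLETED", "FAILED"
}

// pickRerunTarget returns the event ID of the first failed activity,
// a plausible default target for a rerun.
func pickRerunTarget(entries []HistoryEntry) (int32, bool) {
	for _, e := range entries {
		if e.Status == "FAILED" {
			return e.EventID, true
		}
	}
	return 0, false
}
```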

@WhitWaldo WhitWaldo commented Apr 8, 2025

I would propose that most people would never actually use this functionality because it can likely be addressed by simply refactoring whatever workflow appears to require it.

For example, take a typical purchase workflow. An activity processes payment information and checks inventory to find that they don't actually have any of that product and it's on backorder. They don't want to compensate by refunding the transaction and instead would like to substitute in an equivalent product at the same price to complete the order. Either the activity or checkpoint API can be used here to start the workflow again following the payment processing and insert the new product identifier to finish the procurement process.

This workflow could just as easily be refactored so the inventory check is offloaded to a child workflow. When it returns false, put compensation logic in the parent workflow to look for a substitution (or await an external event from a manager manually picking the substitute) and pass the new product ID into that child workflow.

While replaying the workflow could certainly address the issue raised, this trivial refactoring addresses the same scenario without actually requiring new development time and features. If you've got some complex scenario that cannot be addressed using workflows and child workflows and specifically requires after-the-fact data manipulation, I'd suggest that's far more an edge case than something that'll be commonly used. When asked in Discord, I'd suggest refactoring far more readily than what's proposed by either of these replay APIs.

I disagree that the activity API is a clear winner:

My checkpoint API provides a clear point in time to re-run an event history clone from ensuring that history is never lost (unless explicitly purged or TTL elapses, of course).

The activity API instead requires the user to make sense of a potentially non-linear event history and tease out which activity they want to supersede, even though it's very possible that some events have already run. That in turn requires activities to be idempotent, which is potentially a high bar not presently required.

@JoshVanL (Contributor, Author)

Requiring workflows to be updated with pre-known checkpoints means that we lose the ability for users to rerun a workflow from an activity under unforeseen circumstances. I have commented more on this below.

@WhitWaldo WhitWaldo commented Apr 6, 2025

I think my general disagreement with this proposal is that you're taking what's generally an immutable append-only event source log and suddenly opening it up to all manner of changes. This wrecks the history of past runs (e.g. we only suspect that Workflow 123 didn't run successfully because it was re-run starting from Activity C, but for all we know, it worked fine) and exposes implementation details I don't think the developer needs to have access to (itself potentially limiting future workflow improvements via inability to change internal details).

I agree with the limitation of the checkpointing method in that there's no great way of introducing a checkpoint in the middle of a fan-out operation, but I think that's a limitation of having a code-based workflow instead of a configuration-based one. I can think of a few ways to overcome this (e.g. introduce a mechanism to have the SDK tag activities run within an execution span or something, then insert the checkpoint after successful activities but before failed activities), but that also feels like a lot of effort for something that wouldn't be widely used.

Your approach almost requires the use of Cassie's event sourcing proposal to keep an external history of what has happened in workflows, as their individual histories here can no longer be trusted: they're potentially all overwritable. I would instead refer back to my own proposal of introducing append-only and otherwise immutable clones of the event source that are run as distinctly separate workflow IDs and can similarly introduce different inputs. A key difference is that I can access workflow 123 and always get the run for workflow 123, in a way that doesn't require persisting and parsing a mutating history.


// input can optionally be given to give the new instance a different input to
// the next Activity event.
google.protobuf.StringValue input = 4;

missing optional tag?


It is often the case that it's desirable to re-run business logic implemented inside a workflow.
This could be because an activity in the workflow failed due to a transient error, an external dependency changed, a resource is now available to an activity, or it's simply desirable for some subset of the last set of Activities to be rerun.
Dapr should provide the functionality to rerun a workflow from any point in its history.

@JoshVanL @WhitWaldo

Maybe I'm missing something important here, but is the implementation looking to delete the event source history in order to rewind back to a previous state? As though a bunch of operations never actually happened?

Modifying the event source history destroys the provenance of the Workflow, which would be a significant detractor to Workflows. Some regulated industries favour (and are attracted to) technologies which have guarantees around provenance / immutability.

@olitomlinson olitomlinson commented Apr 8, 2025

This could be because an activity in the workflow failed due to a transient error, an external dependency changed or a resource is now available to an activity, or it's just desirable for some subset of the last set of Activities to be rerun.

This to me sounds like the Business Process was never fully encoded into the Workflow. For example, maybe there was a Compensation Activity that should have run at the point of the transient error, or the Workflow should have gone into a deliberate Waiting For External Event phase while waiting for a "resource to become available".

As I put on the other proposal, IMO we should be investing in ways to allow users to safely patch/version a Workflow as their Business Process evolves. This has been asked for multiple times on Discord over the years.

@JoshVanL (Contributor, Author)

Given the feedback, I think we are all aligned that the newInstanceID should be a required field, where the workflow history is never deleted and is instead cloned.

This to me sounds like the Business Process was never fully encoded into the Workflow.

There will always be a use case for wanting to rerun a workflow from a failed/terminal state. A prime example of this is an activity calling an external service but the IP is currently being rate limited. Yes, it can always be the case that users can implement their own retry logic or create sub workflow etc. but users do expect this feature to be available as a primitive.

Imagine being in a situation where a workflow which takes 4 hours to transcode some file fails at the very last step because some S3 upload transiently timed out or the permissions were slightly off. Yes, the user could have "known better" or manually done the rest of the tasks, but it would also be useful for the user to be able to rerun the workflow from this failed state.

I don't disagree that versioning of Workflows is also a useful feature, but this is not what this proposal is going after and is not currently the priority of the team at this time, even though I too would like to see this worked on.

At the same time, maybe this is just a docs problem. We already have the means to re-run workflows from a failed state: put it in a child workflow and when it fails, re-run it.

There will always be a use case for wanting to rerun a workflow from a failed/terminal state.

A user can do this today by taking the input of the Workflow (which is serialized in the state of the workflow under "properties"."dapr.workflow.input") and starting the workflow again via the existing API to start a workflow.
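
As a sketch of that manual restart (the endpoint paths and response shape here are assumptions based on the Dapr workflow HTTP API and the `dapr.workflow.input` property named above; verify against the docs for your runtime version):

```go
package example

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// restartWithOriginalInput reads a terminal instance's input from its
// metadata and starts a brand-new instance with the same input. Endpoint
// shapes follow the Dapr workflow HTTP API ("dapr" being the built-in
// workflow component name); treat them as assumptions, not a contract.
func restartWithOriginalInput(daprPort, instanceID, workflowName string) error {
	base := fmt.Sprintf("http://localhost:%s/v1.0/workflows/dapr", daprPort)

	// 1. Fetch the terminal instance's metadata, which carries the input.
	resp, err := http.Get(fmt.Sprintf("%s/%s", base, instanceID))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	var meta struct {
		Properties map[string]string `json:"properties"`
	}
	if err := json.Unmarshal(body, &meta); err != nil {
		return err
	}
	input := meta.Properties["dapr.workflow.input"]

	// 2. Start a fresh instance of the same workflow with that input.
	res, err := http.Post(fmt.Sprintf("%s/%s/start", base, workflowName), "application/json", bytes.NewBufferString(input))
	if err != nil {
		return err
	}
	return res.Body.Close()
}
```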

A prime example of this is an activity calling an external service but the IP is currently being rate limited. Yes, it can always be the case that users can implement their own retry logic or create sub workflow etc. but users do expect this feature to be available as a primitive.

Implementing their own retry logic and creating sub workflows is exactly the right way to go about this. This is what Developers must be educated to do in order to build resilient systems.

Imagine being in a situation where a workflow which takes 4 hours to transcode some file fails at the very last step because some S3 upload transiently timed out or the permissions were slightly off. Yes, the user could have "known better" or manually done the rest of the tasks, but it would also be useful for the user to be able to rerun the workflow from this failed state.

I get the problem, truly I do. I spent the best part of 4 years writing and maintaining systems which used Azure Durable Functions, which for all intents and purposes is the same as Dapr Workflows.

I accept that re-running the Workflow from a given point in time would be very helpful for these kinds of scenarios, but what about operations that run after these Activities which have had their control-flow impacted? Such as the firing of Activities against other systems / updating other external Records of truth / waiting on External Events that may never come and therefore time out and produce undesirable (potentially damaging!) side-effects in other systems / ledgers?

While some thoughtful users will no doubt take the time to review and understand the implications of rewinding a Workflow to a given point, it will create chaos for others. Which leaves me thinking how practical is this, and are we knowingly handing over a "footgun" to adopters.

At this point, for me, this lies in the "just because we can, doesn't mean we should" camp.

but users do expect this feature to be available as a primitive.

Given all the similarities often raised between Dapr Workflows and Temporal, I took a look through their docs this morning to see if they have anything similar and the only thing I've found is a "Reset" which functions pretty much exactly as my proposal in that it copies the event history up to the reset point (or checkpoint). While they allow the same workflow ID between runs, this is because they instead differentiate by "Workflow Execution" which differ by a unique "Run ID".

They do differ from my proposal in that they allow only one running workflow execution per workflow ID.

Along the lines of how Temporal does this, @JoshVanL, do you have any thoughts about adding another nullable property to the workflow object itself that indicates the workflow ID that the workflow was created from (null if it wasn't created from one at all), and then adding that metadata to the event history API you've proposed? At least that way there's some way to correlate where any given workflow came from, especially as event streaming isn't currently in the works.
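
Sketching that suggestion with a hypothetical metadata shape (field names are illustrative):

```go
package metadata

// WorkflowMetadata sketches the suggested lineage field: CreatedFrom is
// nil for workflows started normally, and names the source instance when
// the workflow was produced by a rerun/clone. Field names are illustrative.
type WorkflowMetadata struct {
	InstanceID  string  `json:"instanceID"`
	CreatedFrom *string `json:"createdFrom,omitempty"`
}
```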

@jjcollinge

One thing that may be worth a footnote is how we handle workflow changes during rewinds, e.g. if you change the workflow code to add new activities after the point you are rewinding to, then it should be compatible; if you change previous or concurrent activities up to the rewind point, then it is not compatible. Can we make this an explicit failure?

@WhitWaldo

One thing that may be worth a footnote is how we handle workflow changes during rewinds, e.g. if you change the workflow code to add new activities after the point you are rewinding to, then it should be compatible; if you change previous or concurrent activities up to the rewind point, then it is not compatible. Can we make this an explicit failure?

At a minimum, this should be called something other than a "rewind" as there's no saga-like unwinding of already-completed activities happening.

The proposal suggests that the workflow must be in a terminal state, meaning there cannot be concurrent activities.

@jjcollinge

Thanks @WhitWaldo, yes you're right regarding concurrent activities, good call out 👍

@JoshVanL JoshVanL (Contributor, Author) commented Apr 9, 2025

Changing the workflow code is not supported, and is not in scope for this proposal.

@WhitWaldo WhitWaldo commented Apr 9, 2025

Changing the workflow code is not supported, and is not in scope for this proposal.

Isn't the whole point that you're changing the inputs to any given activity and bypassing the values supplied by the rest of the workflow? Even if not a logical change, that's still a workflow modification.

@JoshVanL (Contributor, Author)

@WhitWaldo indeed this is modifying the Workflow; however, it is strictly not changing the Workflow code logic itself, which I think is an important distinction.

Note, the proposal has been updated to no longer allow purging of the referenced Workflow ID, and instead always clones the state to a new instance ID.

@WhitWaldo WhitWaldo commented Apr 10, 2025

Well, I'll just leave it here that I just don't see the demand for this particular feature over other workflow opportunities. While I can understand an interest in seeing a workflow in a visualization tool and wanting to restart it, that's a wholly different thing than restarting it from some intermediate point and changing the values sent to it. Being external to the workflow logic, that pretty much breaks the deterministic nature of the workflow, even as a clone, because you're introducing values that it wasn't originally designed to field - if it had been, you wouldn't need to be re-running it.

I'd much rather see the ability to clone and version workflows and pass new values into the tops of those workflows instead of a partial replay like this. I've seen several requests for versioning on Discord but haven't seen anything for the likes of this, so going on the demand I've observed, I'd vote for versioning of the two (especially as some hypothetical version could achieve the same goal by removing previously executed activities and subbing in the desired state to some intermediate point to start a new derivative workflow; no re-run necessary).

But, I've said my piece on it and I think this horse is sufficiently beaten.

@WhitWaldo WhitWaldo commented Apr 11, 2025

I've had two additional thoughts on this:

Event-picking is only useful for one-off scenarios

Especially with regard to fan-out operations, there's absolutely no guarantee as to the order in which activities will complete. The proposed API for retrieving activities (whether system- or user-based) may expose the elements of the event source log, but this log is subject to being significantly different each time a workflow runs.

If the idea is that you're using some tool external to Workflows to access this information (e.g. via the HTTP API) and have someone manually walk through the history, pick and choose an event, specify a new payload for it and kick off a replacement workflow, that might be viable (the concern below notwithstanding), though it really begs the question of whether that should be a part of Dapr workflows logic itself or just exposed as control data on top of workflows (e.g. provided to monitoring tools for purposes of super-orchestrating these external to your service itself).

But I've been approaching this as someone that's trying to build compensation logic into their service itself so that when a workflow fails, they have a means of addressing the failure in real-time by providing some other value. The checkpoints could similarly be surfaced to a monitoring service and separately invoked with their different inputs but because they can only be injected between sets of activity runs (and shouldn't be included in a Task.WhenAll), it doesn't matter that the event ordering may be non-deterministic and change every time as it's not susceptible to this issue.

Put another way, I think you and I have been approaching how this feature is used by the downstream user differently. Yours is more useful as a management and monitoring retry solution; mine was designed from a code-first perspective, running as part of the user's service. If the larger goal here is to make this external management more flexible (less pure observing, more actionable), I think that event source API is useful at a minimum, even though I think the system would be richer if it introduced a cloned and versioned change with the new input instead of cherry-picking it.

Does event picking preclude versioning altogether?

I think our point of contention is that your proposal seeks to usurp the business logic of a workflow and change it by providing an input that (maybe) couldn't possibly have been a part of the original workflow itself. My other caution here then is that I think this might preclude the ability to ever introduce a versioning capability to workflows alongside this.

Here's my reasoning. You start with a run of WorkflowA and it runs to some terminal state. Via this API, you restart it from EventC with a new value (concerns about the inconsistent ordering of activities aside) and this is saved as WorkflowB. Should you come along and want to change the business logic of WorkflowA to introduce some additional activity via some hypothetical future versioning capability, the logic of it now differs from WorkflowB (associated with A because of the relatedTo property). But now it'll be even more confusing to understand how B ever could have run from A, because we've replaced WorkflowA with WorkflowA2 (updated logic). Following this hypothetical versioning, it shouldn't ever be possible to re-run WorkflowA as it's been fully replaced (even if just by having introduced a migration between the two).

So if you come back later and say hey, I want to re-run WorkflowB because it should have all the changes introduced to WorkflowA as it's the shared workflow logic, which workflow is it going to run? The original WorkflowA (even though it's been replaced) or WorkflowA2 (the result of the new version) even though fundamental changes may have been made to the logic?

Again, I think most of what we're trying to achieve here can be better done through versioning with clear append-only metadata of what led to what paired with eager introduction of child workflows to use what we already have to scope out smaller chunks of potential failure. I would urge caution in introducing this replay capability without putting serious thought into how it might prevent future development for other highly requested functionality on this front going forward.

@WhitWaldo WhitWaldo mentioned this pull request Apr 25, 2025
JoshVanL added a commit to JoshVanL/dapr that referenced this pull request May 6, 2025
Change extends the functionality with an RPC to enable rerunning a
workflow from an existing _terminal_ workflow instance. See:
dapr/proposals#80

Rerunning workflows has the following restrictions:
- The source workflow instance _must_ be in a terminal state, i.e. is
  not running or suspended. It has to have been run before.
- The target event ID must be an activity. The event ID must exist.
- The target event ID input data can be overwritten.
- The new instance ID can be supplied, or will be randomly generated as
  a UUID.
- The source and target instance ID cannot be the same. The original
  history will not be overwritten/deleted/truncated.

The new workflow cannot have any timers or raise events which are active
at the time of starting. There is no technical reason why we cannot
achieve this, but it is a limitation of the existing implementation. A
future change should see a refactor to enable this; likely moving timers
and raise events to a new actor type and instantiation.

To rerun a workflow, the source instance ID workflow actor will fork its
history state to the point of the target event ID, optionally update its
input, then send the new state to the target instance actor ID, which will
then write that state and run the target activity, as well as any
activities which are in progress at that point in time. Care is taken to
ensure all currently active activities are executed if the target event
ID is in a fan-out scenario.

```
rerunWorkflow -> source Actor ID [Fork State] -{state}-> target Actor ID ->>> call activities
```

The workflow and activities actor code has been refactored to multiple
files for readability and maintainability.

Signed-off-by: joshvanl <[email protected]>
JoshVanL added three more commits to JoshVanL/dapr referencing this pull request May 6, 2025, each with the same commit message as above.
@cicoyle cicoyle moved this from Backlog to In progress in v1.16 Release Tracking Board May 6, 2025
JoshVanL added commits to JoshVanL/dapr that referenced this pull request May 22 and May 24, 2025, again with the same commit message as above.
@WhitWaldo

Back on May 1, Yaron, Cassie, Josh and I met to discuss this and to have a more real-time discussion of how my proposal at #82 would pair with it. It was agreed that both projects had merit and were important to different parts of the community. Josh already had a head start implementing this proposal, so it was agreed that we'd start here and dig more into my own proposal later on (perhaps as part of a subsequent release, given how busy the current one already is) to see if there was a better runtime-based way to implement it (as mine is, by design, very dependent on implementation in the SDKs themselves).

Suffice it to say that while I would personally have prioritized different work first, the effort here is still worth doing in furtherance of making Dapr a richer framework for building distributed systems and bringing the workflow building block closer to parity with other similar workflow-based systems, so I give my non-binding +1 to the effort.

@cicoyle cicoyle (Contributor) left a comment

+1 binding

We did meet and come to an alignment on this 👍🏻 thx Whit!

@WhitWaldo WhitWaldo commented Jun 3, 2025

I would like to propose that we call this something different than "Re-run", as that's very synonymous with "Retry", which means something very different.

Rather, might we consider calling this a "Fork and Run", repurposing some Git terminology that some people are familiar with, to suggest that there's a clear separation between one run's state and another (independent of retries, of course).

After chatting about it on the call, we've concluded that from a naming perspective, it'd be nice to invoke this on the SDKs on the workflow client as ScheduleCloneWorkflow.

Reason being:

  • The cloning and running doesn't happen until this is called on the workflow client, thus "Clone" instead of "Cloned"
  • It matches the nomenclature of the existing ScheduleNewWorkflow (or ScheduleNewWorkflowAsync in the .NET SDK) by replacing "New" with "Clone"
  • The use of "Clone" makes clear that it came from something (e.g. New didn't exist previously, Clone did) and suggests that while there's a starting point, it might also differ in some way (which it does if you're changing the inputs).

In the documentation then, instead of calling this "Re-run", we'd refer to them as Cloned Workflows. It'd be nice from a consistency standpoint to rename this in the protos and runtime as well, but that's not client facing, so it's not a high priority.
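
For illustration, the proposed client surface might look something like this in Go (all names here are hypothetical; this is not a final SDK API):

```go
package workflow

import "context"

// WorkflowClient stands in for the SDK's workflow client; every name here
// is hypothetical and only illustrates the proposed shape.
type WorkflowClient struct{}

// CloneOptions carries the rerun parameters described in this proposal.
type CloneOptions struct {
	SourceInstanceID string  // terminal instance to clone from
	TargetEventID    int32   // activity event ID to rerun from
	NewInstanceID    string  // the clone's ID; the source history is never mutated
	NewInput         *string // optional replacement input for the target activity
}

// ScheduleCloneWorkflow forks the source instance's history at the target
// event and schedules the clone, mirroring ScheduleNewWorkflow's naming.
func (c *WorkflowClient) ScheduleCloneWorkflow(ctx context.Context, opts CloneOptions) (string, error) {
	// A real SDK would forward this to the runtime's rerun RPC.
	return opts.NewInstanceID, nil
}
```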

@nelson-parente (Contributor)

Why not call this "Reset," like Temporal does? https://docs.temporal.io/workflow-execution/event#reset

Using the same terminology as Temporal could help with Dapr Workflows adoption by lowering the entry barrier.

@WhitWaldo

Why not call this "Reset," like Temporal does? https://docs.temporal.io/workflow-execution/event#reset

Using the same terminology as Temporal could help with Dapr Workflows adoption by lowering the entry barrier.

Reset implies a clean start. Rather, a "re-run" is running from a middle point in the execution with all the event log baggage that's run up to that point and just changing the input to an activity (and cascading that change to every activity invocation after it).

@yaron2 yaron2 (Member) commented Jun 5, 2025

+1 binding

dapr-bot added a commit to dapr/dapr that referenced this pull request Jun 5, 2025
* Workflow: Rerun workflow instance from Event ID

(Same commit message as above.)

Signed-off-by: joshvanl <[email protected]>

* Fix actors timers callback timeout check

Signed-off-by: joshvanl <[email protected]>

* Updates dapr/durabletask-go to v0.7.0

Signed-off-by: joshvanl <[email protected]>

* Update dapr/durabletask-go to v0.7.1

Signed-off-by: joshvanl <[email protected]>

* Adds review comments

Signed-off-by: joshvanl <[email protected]>

* Fix parallel order execution timing

Signed-off-by: joshvanl <[email protected]>

* Remove parallels

Signed-off-by: joshvanl <[email protected]>

* Fix content length test

Signed-off-by: joshvanl <[email protected]>

---------

Signed-off-by: joshvanl <[email protected]>
Co-authored-by: Dapr Bot <[email protected]>
@JoshVanL (Contributor, Author)

+1 binding

@yaron2 yaron2 merged commit 52a6edf into dapr:main Jun 24, 2025
1 check passed
@github-project-automation github-project-automation bot moved this from In progress to Done in v1.16 Release Tracking Board Jun 24, 2025