Question: How best to handle "dead-lettering" failed orchestrations? #1766

ma499 · 2021-01-28T11:46:42Z

ma499
Jan 28, 2021

Is your feature request related to a problem? Please describe.
We are starting to use Durable Functions in a message-driven system where our error handling and monitoring strategies rely on dead-lettering of exceptions that require human intervention. When using Durable Functions we are unable to dead-letter the trigger message, because it was already completed when the orchestrator was started. We recognise this is by design.

Describe the solution you'd like
An easy way (i.e. a method to invoke) for a developer to somehow get the message into the DLQ when encountering an error that cannot otherwise be handled.

Describe alternatives you've considered
We are investigating some techniques for doing this but would like to either extend this project, or build an add-on library, to abstract this problem for developers. We are happy to contribute a PR if we get some guidance and reassurance that it's likely to be accepted.

Ideas we're looking at:

- Save a copy of the original message in orchestrator state. When required, post a copy of the message directly to the DLQ of the trigger subscription/queue.
- Save a copy of the original message in orchestrator state. When required, schedule a copy of the original message with a custom header to indicate it failed. When the starter function runs again, detect that it's a failed re-submission and dead-letter it.
- Save a copy of the original message in orchestrator state. When required, schedule a copy of the original message and update the custom state of the orchestrator to indicate it failed. When the starter function runs again, detect that it's a failed re-submission and dead-letter it.

We shall spike the above techniques but appreciate any feedback in the meantime.

Additional context
This was previously raised as bug #560. The discussion includes a workaround which is similar to one of the ideas above.
Scheduling a copy of the message and then dead-lettering it is one of the official workarounds proposed on Microsoft Support Q&A for this topic.

Answered by olitomlinson


olitomlinson · 2021-01-28T20:59:27Z

olitomlinson
Jan 28, 2021

@ma499

I too have a general failure strategy that relies on dead letter alerts.

However, I mimic the same strategy in Durable Functions by subscribing to the Failed Orchestration lifecycle events and then publishing a custom event to Application Insights. An Azure Monitor alert checks for the presence of these events and triggers alerts in the same manner as a Service Bus DL.

If a failure does occur and you want to try again, the recently introduced orchestration restart API (#1545) can be used to restart the orchestration from the start as though it had never been run before. That is exactly what you would be doing if you re-queued the message intent back into Service Bus after a DL or Failed Orchestration.

IMO, failure alerting in Durable Functions could be better integrated into Azure Monitor as a first-class feature.

But that aside, the combination above should work without having to keep messages locked and renewed in Service Bus while waiting for an orchestration to complete, or manually enqueuing failed messages back to Service Bus.
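
A minimal sketch of the alerting step described above, not the exact setup used here. It assumes each lifecycle event carries `runtimeStatus`, `instanceId`, `functionName` and `reason` under a `data` key, which is how the Durable Functions Event Grid schema is documented as far as I know; check the field names against your own events. `to_alert_event` and the event name `OrchestrationFailed` are hypothetical, and the returned dict is simply what you would pass to whichever Application Insights client you use.

```python
from typing import Any, Dict, Optional

# Hypothetical custom event name that an Azure Monitor alert rule would watch for.
ALERT_EVENT_NAME = "OrchestrationFailed"


def to_alert_event(lifecycle_event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a lifecycle event to a custom telemetry event, or None if it is not a failure."""
    data = lifecycle_event.get("data", {})
    if data.get("runtimeStatus") != "Failed":
        return None  # Started/Completed/Terminated events do not raise an alert.
    return {
        "name": ALERT_EVENT_NAME,
        "properties": {
            "instanceId": data.get("instanceId"),
            "functionName": data.get("functionName"),
            "reason": data.get("reason", ""),
        },
    }


# Illustrative payload only; the values are made up.
sample = {
    "subject": "durable/orchestrator/Failed",
    "eventType": "orchestratorEvent",
    "data": {
        "hubName": "DurableFunctionsHub",
        "functionName": "ProcessOrder",
        "instanceId": "abc123",
        "reason": "Unhandled exception in activity",
        "runtimeStatus": "Failed",
    },
}
alert = to_alert_event(sample)
```

Keeping `instanceId` in the event properties means whoever handles the alert has what they need to restart that instance later.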


ConnorMcMahon · 2021-02-09T20:10:20Z

ConnorMcMahon
Feb 9, 2021

I think that @olitomlinson hits the nail on the head with this issue. In general, the restart API should mitigate a lot of the logistical pain of restarting an orchestration, as you no longer need to maintain your original message. At that point, you just need some way of knowing when your orchestration fails.

We currently provide two main ways to identify failed orchestrations:

- Our orchestration management APIs (HTTP, or the DurableClient objects in .NET/JavaScript/Python).
- Orchestration lifecycle events via Event Grid.
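
As a rough illustration of the first option, the sketch below builds a query URL for the HTTP management API that lists failed instances. The `/runtime/webhooks/durabletask/instances` path and the `runtimeStatus`, `top` and `code` parameters are taken from the HTTP API reference as I understand it; the helper name `failed_instances_url`, the host name and the key are placeholders, so verify the details against the reference for your extension version.

```python
from urllib.parse import urlencode


def failed_instances_url(base_url: str, system_key: str, top: int = 100) -> str:
    """Build a GET URL that asks the management API for failed orchestration instances."""
    query = urlencode({"runtimeStatus": "Failed", "top": top, "code": system_key})
    return f"{base_url.rstrip('/')}/runtime/webhooks/durabletask/instances?{query}"


# Placeholder host and key, for illustration only.
url = failed_instances_url("https://myapp.azurewebsites.net/", "KEY", top=10)
```

A scheduled job could issue this request and raise an alert whenever the result is non-empty.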

There are also some very helpful open source tools written by third parties, like Durable Functions Monitor.

@olitomlinson, I remember us having a discussion regarding better Azure Monitor integration a long time ago, but I can't seem to find what issue we had that discussion under. I think having an issue for that enhancement would be super helpful, as generally we decide on what new features to work on based on user engagement with the top level tickets.


olitomlinson · 2021-02-10T00:56:12Z

olitomlinson
Feb 10, 2021

@ConnorMcMahon

Not quite the same, but this one comes to mind #1527 (comment)

I think we need an Issue in general for first-class Azure Monitor integration with DF and then we can build a list of various telemetry series to emit.


ghost · 2021-02-14T06:00:05Z

ghost
Feb 14, 2021

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.


ma499 · 2021-02-14T16:34:46Z

ma499
Feb 14, 2021
Author

Thank you for the suggestions @ConnorMcMahon @olitomlinson. We have taken your points on board and are spiking the suggestions above to explore the possible solutions.

3 replies

boylec May 9, 2022

Has this ever gotten any traction?

davidmrdavid May 11, 2022
Collaborator

@boylec: could you please clarify what you mean?

If you mean whether we've made progress in supporting this as a built-in feature in the framework, then the answer is not much. Internally and externally, this is one of our most frequently requested features, but we have not had the bandwidth to prioritize it relative to other urgent tasks. It's still very much on our radar though.

If you meant something else, please let us know and I can try to answer that as well.

boylec May 12, 2022

Yes, I was wondering if any work had been done to support something akin to this as a built-in feature.

DLQ-like qualities:

- The ability to easily track orchestrations that have failed, with a guarantee that those attempts don't get "lost". I suppose this can already be done just by querying the underlying storage medium.
- The ability to easily bulk "retry" those failed orchestrations, which would basically mean:
  - Reset the activity(ies) which caused the orchestrations to fail.
  - Rerun the underlying orchestrations.

Sounds like the answer is that the first bullet can be trivially done already and the second bullet cannot.

Is my description above in the ballpark of what is on the roadmap?

Until MSFT implements such a feature, would this be as simple as writing a script that accomplishes the second bullet against the stored/failed orchestrations?
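
To make the question concrete, here is a sketch of what such a script might look like. It is an illustration only, not something confirmed by the maintainers. `retry_failed` is a hypothetical helper, and the `restart` callable stands in for whatever mechanism you use, for example a call to the restart API mentioned earlier in this thread. Note that, as described above, a restart reruns the orchestration from the start; it does not reset only the failed activity.

```python
from typing import Callable, Dict, Iterable, List


def retry_failed(
    failed_instance_ids: Iterable[str],
    restart: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Try to restart every failed instance and report which attempts succeeded."""
    outcome: Dict[str, List[str]] = {"restarted": [], "errors": []}
    for instance_id in failed_instance_ids:
        try:
            restart(instance_id)  # Placeholder for the real restart call.
            outcome["restarted"].append(instance_id)
        except Exception:
            # Keep going so one bad instance does not block the rest of the batch.
            outcome["errors"].append(instance_id)
    return outcome


# Stand-in restart function for demonstration; it fails for one instance.
def fake_restart(instance_id: str) -> None:
    if instance_id == "b":
        raise RuntimeError("restart rejected")


result = retry_failed(["a", "b", "c"], fake_restart)
```

Instances left in `errors` stay in the Failed state, so they would be picked up again the next time the script runs.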
