Understanding CallActivityWithRetryAsync behavior #1773

olitomlinson · 2019-12-17T15:37:51Z

olitomlinson
Dec 17, 2019

I've come to realise that in my use-case RetryTimeout is undesirable, so I wish to remove it but I'm concerned that this will corrupt the event source of any in-flight orchestrations.

For example,

GIVEN I have some orchestration code that is utilising CallActivityWithRetryAsync with RetryTimeout option set to 10 minutes.

WHEN at the 5th minute after the activity fails,
AND I have deployed a new version of the Function App whereby I have removed the RetryTimeout configuration.

...will this throw an Exception, terminating the orchestration?

The problem with RetryTimeout is that when my Activity Function is running slow (for whatever reason) and the first activity attempt fails, the clock starts ticking down. As the app is running slow, I may never make it to the second attempt before the RetryTimeout kicks in and terminally fails the orchestration, despite having done none of my retries.
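To make the scenario concrete, here is a minimal sketch of a retry loop with a timeout measured from the start of the first attempt. It is a simplified model, not the Durable Task Framework implementation: `RetryPolicy`, `attempts_made`, and `attempt_latency` are hypothetical names, and the model assumes every attempt fails after the same latency.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    max_attempts: int       # total attempts allowed, including the first
    first_interval: float   # seconds to wait before the first retry
    retry_timeout: float    # seconds, measured from the start of attempt 1
    backoff: float = 1.0    # multiplier applied to the wait after each retry


def attempts_made(policy: RetryPolicy, attempt_latency: float) -> int:
    """Count attempts when every attempt fails after `attempt_latency` seconds.

    `attempt_latency` stands in for everything that delays a result reaching
    the orchestrator: activity run time plus any queueing delay on the host.
    """
    clock = 0.0                      # virtual time; attempt 1 starts at 0
    expiration = policy.retry_timeout
    made = 0
    for attempt in range(policy.max_attempts):
        clock += attempt_latency     # the attempt runs and fails
        made += 1
        if clock >= expiration:      # timeout window has closed: stop retrying
            break
        clock += policy.first_interval * policy.backoff ** attempt
    return made


policy = RetryPolicy(max_attempts=5, first_interval=30, retry_timeout=600)
healthy = attempts_made(policy, attempt_latency=5)     # fast host
degraded = attempts_made(policy, attempt_latency=400)  # slow host
print(healthy, degraded)
```

With these numbers the fast host uses all 5 attempts, while the slow host gets only 2 before the 10-minute window closes, even though the policy is identical.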

This problem has occurred a few times when the app is under load, and then I've hit some scaling bug (I've raised a few over the last year).

I guess I could ask the same question about changing any of the retry options: is changing any option likely to generate a corrupt state for an orchestration that is currently retrying a failed activity?

Thanks.

cgillum · 2019-12-20T20:03:24Z

cgillum
Dec 20, 2019
Maintainer

MaxRetryInterval simply controls the maximum amount of time you're willing to wait in between retries. You can see where it is used in this code. Perhaps you meant the RetryTimeout configuration?

To answer your question about whether or not it is safe to change this value for existing orchestration instances, I believe it is not safe. Indeed, retries are implemented using durable timers so subsequent replays won't match the execution history. I haven't tested this so I'm not aware of whether this will actually cause problems at runtime, but it's safest to assume you will need to reserve changes like this for new versions of your orchestrations.
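The replay concern can be illustrated with a toy model. This is only a sketch under stated assumptions: `orchestrate`, `replay`, and `ReplayMismatch` are hypothetical names, the rule that every recorded action must be re-emitted identically is an assumption of the model, and (as noted above) whether the real runtime fails in this situation is untested.

```python
from typing import List, Tuple

Action = Tuple[str, object]   # e.g. ("task", "DoWork") or ("timer", 30)


class ReplayMismatch(Exception):
    """Raised when replayed code diverges from the recorded history."""


def orchestrate(retry_interval: int, attempts: int) -> List[Action]:
    """Actions a retrying orchestration would emit if every attempt fails."""
    actions: List[Action] = []
    for attempt in range(attempts):
        actions.append(("task", "DoWork"))
        if attempt < attempts - 1:
            # A timer separates one failed attempt from the next.
            actions.append(("timer", retry_interval))
    return actions


def replay(history: List[Action], actions: List[Action]) -> None:
    """Check that the new code re-emits every action already in the history."""
    for index, recorded in enumerate(history):
        if index >= len(actions) or actions[index] != recorded:
            raise ReplayMismatch(f"history[{index}] = {recorded!r} not reproduced")


# History written by the old deployment: one failed attempt and a 30 s timer.
history = orchestrate(retry_interval=30, attempts=3)[:2]

replay(history, orchestrate(retry_interval=30, attempts=3))  # same code: passes

try:
    # Toy assumption: a changed retry option changes the timer being created.
    replay(history, orchestrate(retry_interval=60, attempts=3))
except ReplayMismatch as error:
    print("mismatch:", error)
```

In this model the second replay fails because the redeployed code creates a different timer than the one recorded for the in-flight instance.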

olitomlinson · 2019-12-23T10:26:46Z

olitomlinson
Dec 23, 2019
Author

@cgillum Sorry Chris, I did mean RetryTimeout. I'll modify my original post.

Thank you for your response.

Do you understand my concern with how the RetryTimeout can be dangerous in certain scenarios?

ConnorMcMahon · 2020-02-19T19:21:40Z

ConnorMcMahon
Feb 19, 2020

@olitomlinson,

I think I understand your concern about how RetryTimeout behaves when your application is slow to process requests. In general, I would say the field is working as intended, as its purpose is to set a hard time limit on the operation, even if we don't hit the specified number of retries.

I am confused by the behavior of you seeing literally no retries. Given the code found in RetryInterceptor, I would expect that at least one retry would be successfully executed, as after the first execution the value of CurrentUtcDateTime should always be less than the retry expiration.
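The expiration check described above can be sketched as a single comparison. This is a hedged illustration, not the RetryInterceptor source: `retry_still_allowed` is a hypothetical helper, and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def retry_still_allowed(first_attempt: datetime, now: datetime,
                        retry_timeout: timedelta) -> bool:
    """True while the orchestrator's current time is before the expiration."""
    return now < first_attempt + retry_timeout


start = datetime(2019, 12, 17, 12, 0, 0)
timeout = timedelta(minutes=10)

# First failure observed 2 minutes in: a retry can be scheduled.
print(retry_still_allowed(start, start + timedelta(minutes=2), timeout))   # True
# First failure observed 11 minutes in: no retry at all.
print(retry_still_allowed(start, start + timedelta(minutes=11), timeout))  # False
```

Under this model, seeing literally zero retries would require the first failure to reach the orchestrator only after the whole timeout had already elapsed.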

olitomlinson · 2020-02-20T15:39:24Z

olitomlinson
Feb 20, 2020
Author

@ConnorMcMahon Yes, RetryTimeout is doing exactly what it should: preventing retries after a given period of time.

I can't remember the exact conditions around the incident as it was a few months ago, and I may have even got the conditions wrong in my initial report.

But I guess I would like to stress that I ended up with a bunch of Orchestrations in a failed state that never got a fair shot to reach their 'retryCount' limit because the underlying hosts were not stable and not processing the control queues in a timely manner.

I believe my retryTimeout was set to 10 minutes, which in a normal scenario is a fair time to abandon the orchestration providing the activities are not producing the expected outcome after many actual retry attempts. But in my case, they didn't get their fair shot at retrying.

What would I like to see happen instead?

It's an odd one, but I don't think I have an answer to this.

But this prompted me to rip out the retryTimeout property from future activity calls, as I realised it was more important to ensure the retries actually happen than to enforce the retryTimeout.

It was definitely an education on my part, but it came as a painful lesson, as it caused a live incident in my software that I didn't anticipate.

This is why I asked Chris about the feasibility of modifying my Orchestration code to remove the retryTimeout for in-flight orchestrations. If this had been possible, I could have paused my Function App during the live incident, removed the retryTimeout, and then been confident upon resuming the Function App that the activities would reach their full retryCount (eventually, given the host issue).

But this was not the case, so I couldn't really self-serve my way out of the problem. I just had to let it run its course and fail, which was frustrating. (In fact, if my memory serves me correctly, I had to delete the entire TaskHub, accept the data loss, and start again.)
