Understanding CallActivityWithRetryAsync behavior #1773
-
I've come to realise that in my use-case

For example, GIVEN I have some orchestration code that is utilising

WHEN at the 5th minute after the activity fails, ...will this throw an Exception, terminating the orchestration?

The problem with

This problem has occurred a few times when the app is under load, and then I've hit some scaling bug (I've raised a few over the last year).

I guess I could ask the same question when changing any of the retry options? Is changing any options likely to generate a corrupt state for an orchestration that is currently retrying an instance of a failed activity?

Thanks.
Replies: 4 comments
-
To answer your question about whether or not it is safe to change this value for existing orchestration instances, I believe it is not safe. Indeed, retries are implemented using durable timers, so subsequent replays won't match the execution history. I haven't tested this, so I'm not aware of whether this will actually cause problems at runtime, but it's safest to assume you will need to reserve changes like this for new versions of your orchestrations.
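
To illustrate why changed retry options might not line up with an existing history, here is a toy model in Python. It is not the Durable Task Framework's replay logic: `planned_actions`, `first_mismatch`, and the action tuples are hypothetical names invented for this sketch. It assumes every attempt fails and that each retry is preceded by a timer whose fire time depends on the retry interval. Whether the real runtime rejects such a mismatch is, as noted above, untested.

```python
def planned_actions(first_retry_interval, max_attempts):
    """Actions a toy orchestrator schedules when every attempt fails.

    Times are seconds since the first attempt. Hypothetical model only.
    """
    actions = []
    now = 0
    for attempt in range(max_attempts):
        actions.append(("schedule_activity", attempt))
        if attempt < max_attempts - 1:
            # A timer separates this failed attempt from the next one.
            now += first_retry_interval
            actions.append(("create_timer", now))
    return actions


def first_mismatch(history, replayed):
    """Index of the first recorded action that replay does not reproduce, else None."""
    for index, (recorded, regenerated) in enumerate(zip(history, replayed)):
        if recorded != regenerated:
            return index
    return None


# History written while the orchestration ran with a 60-second interval;
# it is still mid-retry, so only the first two actions were recorded.
history = planned_actions(first_retry_interval=60, max_attempts=3)[:2]

# Replay after redeploying with a 30-second interval.
replayed = planned_actions(first_retry_interval=30, max_attempts=3)

print(first_mismatch(history, replayed))  # index of the timer action
```

In this model the first attempt still matches, but the recorded timer (due at 60 seconds) differs from the regenerated one (due at 30 seconds), so the mismatch is reported at index 1. Replaying with the original options reports no mismatch.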
-
@cgillum Sorry Chris, I did mean

Thank you for your response. Do you understand my concern with how the
-
@olitomlinson, I think I understand your concern about how RetryTimeout behaves when your application is slow to process requests. In general, I would say the field is working as intended, as its purpose is to set a hard time limit on the operation, even if we don't hit the specified number of retries.

I am confused by your seeing literally no retries. Given the code found in RetryInterceptor, I would expect at least one retry to be executed successfully, as after the first execution the value of CurrentUtcDateTime should always be less than the retry expiration.
-
@ConnorMcMahon Yes the behavior of the

I can't remember the exact conditions around the incident as it was a few months ago; I may have even got the conditions wrong in my initial report. But I would like to stress that I ended up with a bunch of orchestrations in a failed state that never got a fair shot at reaching their 'retryCount' limit, because the underlying hosts were not stable and were not processing the control queues in a timely manner. I believe my

What would I like to see happen instead? It's an odd one, but I don't think I have an answer to this. But this prompted me to rip out the

It was definitely an education for me, but it came as a painful lesson, as it caused a live incident in my software that I didn't anticipate. This is why I asked Chris about the feasibility of modifying my orchestration code to remove the

But this was not the case, so I couldn't really self-serve my way out of the problem. I just had to let it run its course and fail, which was frustrating. (In fact, if my memory serves me correctly, I had to delete the entire TaskHub, accept the data loss, and start again.)
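
The interaction between the retry timeout and slow timer processing discussed in this thread can be sketched with a small simulation. This is a simplified model loosely based on the description above, not the code in RetryInterceptor: `RetryPolicy`, `attempts_made`, and `timer_lag` are hypothetical names, and it assumes every attempt fails instantly, a fixed retry interval, and a deadline measured from the first attempt.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    first_retry_interval: float  # seconds between attempts
    max_attempts: int            # total attempts, including the first
    retry_timeout: float         # seconds allowed since the first attempt


def attempts_made(policy, timer_lag):
    """Count attempts when every attempt fails and each retry timer
    fires timer_lag seconds late (for example, a backed-up control queue)."""
    now = 0.0
    deadline = now + policy.retry_timeout
    attempts = 0
    for _ in range(policy.max_attempts):
        attempts += 1  # the activity runs and fails
        if now >= deadline:
            break  # timeout reached: stop even though attempts remain
        now += policy.first_retry_interval + timer_lag
    return attempts


policy = RetryPolicy(first_retry_interval=60, max_attempts=5, retry_timeout=300)
print(attempts_made(policy, timer_lag=0))    # healthy host
print(attempts_made(policy, timer_lag=240))  # timers delayed by four minutes
```

Under these assumptions a healthy host makes all 5 attempts, while a four-minute timer lag leaves only 2: the first attempt and a single retry. That matches both points raised here: at least one retry is expected, yet an unstable host can exhaust the timeout long before the attempt limit is reached.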