@natemort natemort commented Sep 5, 2025

Add RPC retries via an interceptor, with exponential backoff matching both the Go and Java clients. The approach differs from both, however, in two main ways:

  1. Adding retries via an interceptor makes them implicit, while both the Go and Java clients require them to be explicit.
  2. The requests to retry are selected by gRPC error code, rather than by explicitly listing non-retryable errors and retrying everything else by default. This seems like the more sustainable approach, since nearly every error type is non-retryable. With a deny-list, a newly introduced error type would require a client update to mark it non-retryable before it could safely be used. With the approach here, any error the Python client doesn't recognize gets mapped to plain CadenceError, which is not retried, so new errors can safely be added.
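
The allow-list idea in point 2, together with a capped exponential backoff, can be sketched roughly as below. This is an illustration only, not the code in this PR: the `Code` enum is a hypothetical stand-in for gRPC status codes, and the membership of `RETRYABLE_CODES` and the `initial`, `factor`, and `cap` defaults are assumptions picked for the example.

```python
from enum import Enum


class Code(Enum):
    """Hypothetical stand-in for gRPC status codes (not grpc.StatusCode)."""
    NOT_FOUND = "not_found"
    INVALID_ARGUMENT = "invalid_argument"
    UNAVAILABLE = "unavailable"
    RESOURCE_EXHAUSTED = "resource_exhausted"
    UNKNOWN = "unknown"


# Allow-list: only codes named here are retried. The membership is an
# assumption for illustration, not the list used by the PR.
RETRYABLE_CODES = frozenset({Code.UNAVAILABLE, Code.RESOURCE_EXHAUSTED})


def is_retryable_code(code: Code) -> bool:
    # Anything not explicitly allow-listed, including codes the server
    # starts returning later, is treated as non-retryable.
    return code in RETRYABLE_CODES


def backoff_delay(attempt: int, initial: float = 0.5, factor: float = 2.0, cap: float = 6.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based), capped at `cap`."""
    return min(cap, initial * factor ** (attempt - 1))
```

The point of the allow-list is that an unknown code falls through to "do not retry" without any client change.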

What changed?

  • Add RPC Retries

Why?

  • Align with Go and Java clients

How did you test it?

  • Unit tests against a fake service

Potential risks

Release notes

Documentation Changes


def is_retryable(err: CadenceError, call_details: ClientCallDetails) -> bool:
    # Handle requests to the passive side, matching the Go and Java Clients
    if call_details.method == b'/uber.cadence.api.v1.WorkflowAPI/GetWorkflowExecutionHistory' and isinstance(err, EntityNotExistsError):
Member

tbh not sure I understand what would trigger this besides replication lag races?

Member Author

Yeah that's exactly what it's for. If you Start a Workflow and want to poll until it completes you have to do this.
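
For illustration, the race can be simulated with a fake passive cluster that has not yet seen the workflow for its first few reads. Everything here is hypothetical (`LaggingCluster`, `poll_history`, and the local `EntityNotExistsError` are stand-ins, not the client's real classes), and the backoff sleep is left out to keep the sketch short.

```python
class EntityNotExistsError(Exception):
    """Hypothetical stand-in for the client's EntityNotExistsError."""


class LaggingCluster:
    """Fake passive cluster that learns about a workflow only after `lag_calls` reads."""

    def __init__(self, lag_calls: int) -> None:
        self.remaining_lag = lag_calls

    def get_history(self, workflow_id: str) -> list:
        if self.remaining_lag > 0:
            # The start event has not been replicated yet.
            self.remaining_lag -= 1
            raise EntityNotExistsError(workflow_id)
        return ["WorkflowExecutionStarted"]


def poll_history(cluster: LaggingCluster, workflow_id: str, max_attempts: int = 5) -> list:
    for attempt in range(1, max_attempts + 1):
        try:
            return cluster.get_history(workflow_id)
        except EntityNotExistsError:
            if attempt == max_attempts:
                raise
            # A real client would sleep with backoff here before retrying.
    raise AssertionError("unreachable")
```

Without the retry, the first read after starting the workflow would fail even though the workflow exists on the active side.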

except CadenceError as e:
    err = e
    attempts = 1
    while is_retryable(err, client_call_details):
Member

nit: simplify using do while loop by using break/continue

Member Author

Done, good call
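
One way the single-loop shape might look is sketched below. It is not the code that was merged: `call_with_retries`, its parameters, and the local `CadenceError` are hypothetical, and the doubling delay is an assumption. The first attempt and the retries share one `while True` loop, so there is no separate try/except before the loop.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class CadenceError(Exception):
    """Hypothetical stand-in for the client's base error type."""


async def call_with_retries(
    call: Callable[[], Awaitable[T]],
    is_retryable: Callable[[CadenceError], bool],
    max_attempts: int = 3,
    initial_delay: float = 0.0,
) -> T:
    attempts = 0
    while True:
        attempts += 1
        try:
            # Success leaves the loop immediately.
            return await call()
        except CadenceError as err:
            # Give up on non-retryable errors or when attempts run out.
            if attempts >= max_attempts or not is_retryable(err):
                raise
        # Exponential backoff, then fall through to the next iteration.
        await asyncio.sleep(initial_delay * 2 ** (attempts - 1))
```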


def is_retryable(err: CadenceError, call_details: ClientCallDetails) -> bool:
    # Handle requests to the passive side, matching the Go and Java Clients
    if call_details.method == b'/uber.cadence.api.v1.WorkflowAPI/GetWorkflowExecutionHistory' and isinstance(err, EntityNotExistsError):
Member

@shijiesheng shijiesheng Sep 8, 2025

I tried to use DESCRIPTOR but that's too complicated. Maybe at least add a variable to replace the string here?

Member Author

Done
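
A rough sketch of what that suggestion amounts to, pulling the method path into a named constant. The constant name `GET_WORKFLOW_HISTORY`, the `CallDetails` dataclass, and the local error classes are hypothetical; only the method path comes from the hunk above, and the remaining retry rules are not visible in this excerpt, so the fallback here is a placeholder.

```python
from dataclasses import dataclass

# Named constant instead of an inline literal in the comparison.
GET_WORKFLOW_HISTORY = b"/uber.cadence.api.v1.WorkflowAPI/GetWorkflowExecutionHistory"


class CadenceError(Exception):
    """Hypothetical stand-in for the client's base error type."""


class EntityNotExistsError(CadenceError):
    """Hypothetical stand-in for the client's EntityNotExistsError."""


@dataclass
class CallDetails:
    """Hypothetical stand-in exposing only the `method` field used here."""
    method: bytes


def is_retryable(err: CadenceError, call_details: CallDetails) -> bool:
    # Reads against the passive side may briefly miss a new workflow.
    if call_details.method == GET_WORKFLOW_HISTORY and isinstance(err, EntityNotExistsError):
        return True
    # Placeholder: the PR's other rules are not shown in this excerpt.
    return False
```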

@natemort natemort merged commit a5a257c into cadence-workflow:main Sep 9, 2025
4 checks passed

3 participants