Explore timing-logs or retry-request middlewares #2961

@tcompa

Description

While debugging the latest internal error (cc @mfranzon), we found that the debugging process is very manual and would benefit from a more systematic approach. Luckily the current internal-error rate is very low, but it is still very time-consuming when one happens.

The current flow looks something like:

  • Go through backend logs to find the context (e.g. what was the history of requests in the last minute before the failed one?).
  • Take note of the timestamps of the relevant responses.
  • Take note of any background tasks taking place at the same time.
  • Go through client or Apache logs to find the timestamps of the corresponding requests.
  • Build a timeline and try to reconstruct the conditions leading to the error.

A helpful approach would be to address this via middlewares, and we have at least two (possibly coexisting) options.

For observability and debugging, we can add something like https://github.com/steinnes/timing-asgi (or a simpler, custom version of it - e.g. based on this one). This would make backend logs much more self-contained when debugging internal errors, and client/Apache logs would only be needed for a more fine-grained investigation into what led to the failure.
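For reference, a minimal custom version could look something like the sketch below (the class name `TimingLogMiddleware` and the log format are placeholders, not a proposal for the actual implementation):

```python
import logging
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

logger = logging.getLogger(__name__)


class TimingLogMiddleware(BaseHTTPMiddleware):
    """Log method, path, status code and elapsed time for each request."""

    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "%s %s -> %d (%.1f ms)",
            request.method,
            request.url.path,
            response.status_code,
            elapsed_ms,
        )
        return response
```

Registered with `app.add_middleware(TimingLogMiddleware)`, this would make each backend log line carry the request/response timeline that we currently reconstruct by hand.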

For robustness, we can add a middleware that retries a failed request under a certain set of conditions, e.g.:

  • The request is a GET/QUERY one, which is meant to be read-only.
  • The error falls in a class of selected errors. The example we have in mind is https://docs.sqlalchemy.org/en/20/orm/exceptions.html#sqlalchemy.orm.exc.StaleDataError, where a certain ORM session has become stale due to database activity performed via another session. In this case, it is safe to retry the full endpoint.
  • We maintain a high threshold for including new errors, because otherwise we risk hiding an actual bug by just retrying.

If these conditions are satisfied, we can retry the endpoint up to max_retries times (with max_retries a very small number, e.g. max_retries=2). If a known error is raised (e.g. a StaleDataError), we log extensive information about it and try again, until max_retries is reached (see the sketch below).
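A rough sketch of what such a retry middleware could look like, as a pure ASGI middleware (the names `RetryMiddleware`, `RETRIABLE_ERRORS` and `MAX_RETRIES` are hypothetical, and the response-buffering / receive-reuse details would need care in a real implementation):

```python
import logging

from sqlalchemy.orm.exc import StaleDataError

logger = logging.getLogger(__name__)

RETRIABLE_ERRORS = (StaleDataError,)
MAX_RETRIES = 2


class RetryMiddleware:
    """Retry read-only requests when a known transient error is raised."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Only retry read-only requests; pass everything else through.
        if scope["type"] != "http" or scope["method"] not in ("GET", "QUERY"):
            await self.app(scope, receive, send)
            return

        for attempt in range(MAX_RETRIES + 1):
            # Buffer outgoing messages so that a failed attempt does not
            # leave a partially-sent response behind.
            messages = []

            async def buffered_send(message):
                messages.append(message)

            try:
                await self.app(scope, receive, buffered_send)
            except RETRIABLE_ERRORS as exc:
                if attempt < MAX_RETRIES:
                    logger.warning(
                        "Retriable error on %s %s (attempt %d/%d): %r",
                        scope["method"],
                        scope["path"],
                        attempt + 1,
                        MAX_RETRIES,
                        exc,
                    )
                    continue
                raise
            # Success: replay the buffered messages to the real send.
            for message in messages:
                await send(message)
            return
```

Since user middlewares added via `app.add_middleware` sit outside the router's exception handling, an unhandled StaleDataError should still propagate up to this middleware, but this would need to be verified against our actual middleware stack.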

Alternative option: we could map a list of known errors onto 503 responses, and implement the retry logic on the client side.
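A sketch of this alternative, using a FastAPI exception handler (the handler name, message and Retry-After value are illustrative only):

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.orm.exc import StaleDataError

app = FastAPI()


@app.exception_handler(StaleDataError)
async def stale_data_to_503(request: Request, exc: StaleDataError):
    # Signal a transient condition; clients may retry after the given delay.
    return JSONResponse(
        status_code=503,
        content={"detail": "Transient database error, please retry"},
        headers={"Retry-After": "1"},
    )
```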
