Explore timing-logs or retry-request middlewares #2961

@tcompa

Description

While debugging the latest internal error (cc @mfranzon), we found that the debugging process is very manual and would benefit from a more systematic approach. Luckily the current internal-error rate is very low, but it is still very time-consuming when one happens.

The current flow looks something like:

  • Go through backend logs to find the context (e.g. what was the history of requests in the last minute before the failed one?).
  • Take note of the timestamps of the relevant responses.
  • Take note of any background tasks taking place at the same time.
  • Go through client or Apache logs to find the timestamps of the corresponding requests.
  • Build a timeline and try to reconstruct the conditions leading to the error.

A helpful approach would be to address this via middlewares, and we have at least two (possibly coexisting) options.

For observability and debugging, we can add something like https://github.com/steinnes/timing-asgi (or a simpler, custom version of it - e.g. based on this one). This would make backend logs much more self-contained when debugging internal errors, and client/Apache logs would only be needed for a more fine-grained investigation into what led to the failure.
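For reference, a minimal custom version could look something like the sketch below (the class name `TimingLogMiddleware` and the log format are placeholders, not a proposal for the actual implementation):

```python
import logging
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

logger = logging.getLogger(__name__)


class TimingLogMiddleware(BaseHTTPMiddleware):
    """Log method, path, status code and elapsed time for each request."""

    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "%s %s -> %d (%.1f ms)",
            request.method,
            request.url.path,
            response.status_code,
            elapsed_ms,
        )
        return response
```

Registered with `app.add_middleware(TimingLogMiddleware)`, this would make each backend log line carry the request/response timeline that we currently reconstruct by hand.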

For robustness, we can add a middleware that retries a failed request under a certain set of conditions, e.g.:

  • The request is a GET/QUERY one, which is meant to be read-only.
  • The error falls in a class of selected errors. The example we have in mind is https://docs.sqlalchemy.org/en/20/orm/exceptions.html#sqlalchemy.orm.exc.StaleDataError, where a certain ORM session has become stale due to database activity performed via another session. In this case, it is safe to retry the full endpoint.
  • We maintain a high threshold for including new errors, because otherwise we risk hiding an actual bug by just retrying.

If these conditions are satisfied, we can retry the endpoint up to max_retries times (with max_retries a very small number, e.g. max_retries=2). If a known error is raised (e.g. a StaleDataError), we log extensive information about it and try again, until max_retries is reached (see the sketch below).
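A rough sketch of what such a retry middleware could look like, as a pure ASGI middleware (the names `RetryMiddleware`, `RETRIABLE_ERRORS` and `MAX_RETRIES` are hypothetical, and the response-buffering / receive-reuse details would need care in a real implementation):

```python
import logging

from sqlalchemy.orm.exc import StaleDataError

logger = logging.getLogger(__name__)

RETRIABLE_ERRORS = (StaleDataError,)
MAX_RETRIES = 2


class RetryMiddleware:
    """Retry read-only requests when a known transient error is raised."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Only retry read-only requests; pass everything else through.
        if scope["type"] != "http" or scope["method"] not in ("GET", "QUERY"):
            await self.app(scope, receive, send)
            return

        for attempt in range(MAX_RETRIES + 1):
            # Buffer outgoing messages so that a failed attempt does not
            # leave a partially-sent response behind.
            messages = []

            async def buffered_send(message):
                messages.append(message)

            try:
                await self.app(scope, receive, buffered_send)
            except RETRIABLE_ERRORS as exc:
                if attempt < MAX_RETRIES:
                    logger.warning(
                        "Retriable error on %s %s (attempt %d/%d): %r",
                        scope["method"],
                        scope["path"],
                        attempt + 1,
                        MAX_RETRIES,
                        exc,
                    )
                    continue
                raise
            # Success: replay the buffered messages to the real send.
            for message in messages:
                await send(message)
            return
```

Since user middlewares added via `app.add_middleware` sit outside the router's exception handling, an unhandled StaleDataError should still propagate up to this middleware, but this would need to be verified against our actual middleware stack.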

Alternative option: we could map a list of known errors onto 503 responses, and implement the retry logic on the client side.
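A sketch of this alternative, using a FastAPI exception handler (the handler name, message and Retry-After value are illustrative only):

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.orm.exc import StaleDataError

app = FastAPI()


@app.exception_handler(StaleDataError)
async def stale_data_to_503(request: Request, exc: StaleDataError):
    # Signal a transient condition; clients may retry after the given delay.
    return JSONResponse(
        status_code=503,
        content={"detail": "Transient database error, please retry"},
        headers={"Retry-After": "1"},
    )
```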
