Support/Mitigate RabbitMQ consumer_timeout behavior #827

@davidt99

RabbitMQ 3.8.15 introduced a consumer timeout (delivery acknowledgement timeout). If a message is not acknowledged within the timeout period (30 minutes by default), RabbitMQ closes the consumer's channel with a PRECONDITION_FAILED error and returns the message to the queue. The timeout can be configured globally or per queue, or disabled (though disabling it is not recommended).

This behavior affects Dramatiq in several ways:

1. Reprocessing loops

If the TimeLimit middleware is not configured, or if its limit is higher than the RabbitMQ consumer timeout, a message that hits the timeout will be reprocessed indefinitely because Dramatiq loses the ability to ACK it once the connection drops.
A bug causing a worker to hang will eventually propagate to all workers as they all pick up the "poison" message, attempt to process it, and time out.

We should improve documentation regarding this. Additionally, I suggest adding a warning log when booting a RabbitMQ broker without the TimeLimit middleware. Adding TimeLimit by default seems incorrect as we cannot programmatically fetch the server's timeout value.
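
The suggested boot-time warning could look roughly like the sketch below. It is not Dramatiq's implementation: `check_time_limit` is a hypothetical helper, the 30-minute value is only the server default assumed here (the client cannot read the real one), and middleware is matched by class name and a `time_limit` attribute in milliseconds so the snippet runs without Dramatiq installed. Per-actor `time_limit` overrides are not considered.

```python
import logging

logger = logging.getLogger("consumer_timeout_check")

# Assumption: RabbitMQ's default consumer timeout (30 minutes), in milliseconds.
DEFAULT_CONSUMER_TIMEOUT_MS = 30 * 60 * 1000


def check_time_limit(middleware, consumer_timeout_ms=DEFAULT_CONSUMER_TIMEOUT_MS):
    """Return True if a TimeLimit-like middleware is set below the consumer timeout.

    Logs a warning and returns False otherwise.
    """
    for m in middleware:
        # Duck-typed match (hypothetical): class name plus a `time_limit` attribute.
        if type(m).__name__ == "TimeLimit":
            limit = getattr(m, "time_limit", None)
            if limit is not None and limit < consumer_timeout_ms:
                return True
            logger.warning(
                "TimeLimit (%s ms) is not below the RabbitMQ consumer timeout "
                "(%d ms); hung messages may be redelivered indefinitely.",
                limit,
                consumer_timeout_ms,
            )
            return False
    logger.warning(
        "No TimeLimit middleware configured; messages that exceed the RabbitMQ "
        "consumer timeout (%d ms) may be redelivered indefinitely.",
        consumer_timeout_ms,
    )
    return False
```

With a real broker this would presumably be called as `check_time_limit(broker.middleware)` once the broker has been constructed.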

2. Delays exceeding the timeout

If a message has a delay (or backoff) longer than the consumer timeout, the delay queue (DQ) consumer thread will hit the timeout, and every message that thread is currently holding is returned to the queue. This doesn't stop task processing entirely, but it creates an unhealthy "churn" cycle.

I see two changes we probably need to make here:

  1. Update DEFAULT_MAX_BACKOFF to 30 minutes when using the RabbitMQ broker to align with default server settings.
  2. Modify the DQ consumer thread to re-enqueue messages at specific intervals (e.g., every 15 minutes) to avoid hitting the timeout. This is also a prerequisite for supporting Quorum Queues properly in the future.
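
To make the second change concrete, here is a minimal sketch of the interval logic, independent of Dramatiq's actual consumer code. `plan_hold` and the constant are hypothetical names; the 15-minute interval is the example value from the list above.

```python
# Assumption: re-enqueue every 15 minutes, well below a 30-minute consumer timeout.
REQUEUE_INTERVAL_MS = 15 * 60 * 1000


def plan_hold(eta_ms, now_ms, requeue_interval_ms=REQUEUE_INTERVAL_MS):
    """Decide how long the DQ consumer may hold a delayed message.

    Returns (hold_ms, deliver):
      - deliver=True:  hold for hold_ms, then move the message to the work queue.
      - deliver=False: hold for hold_ms, then ack it and re-enqueue it on the
        delay queue so the broker-side acknowledgement clock starts over.
    """
    remaining = max(eta_ms - now_ms, 0)
    if remaining <= requeue_interval_ms:
        return remaining, True
    return requeue_interval_ms, False


# Example: a 40-minute delay is split into hops of 15, 15 and 10 minutes.
now, eta, hops = 0, 40 * 60 * 1000, []
while True:
    hold, deliver = plan_hold(eta, now)
    hops.append(hold)
    now += hold
    if deliver:
        break
```

No single hop holds an unacknowledged message for longer than the interval, so the consumer timeout is never reached as long as the interval stays below it.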

3. Visibility and logging

As discussed in #825, current logging can be misleading. In my experience, hunting down the specific message that exceeded 30 minutes was difficult because the exception is raised (and the connection closed) before the task actually finishes. We should improve the clarity of the logs to help users identify when a RabbitMQ-enforced timeout has occurred.
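
One possible shape for clearer logging is sketched below. It assumes the broker reports the timeout as a channel close with AMQP reply code 406 (PRECONDITION_FAILED) and a reply text mentioning that the delivery acknowledgement timed out; the exact wording should be verified against your RabbitMQ version. Both function names are hypothetical and nothing here is existing Dramatiq API.

```python
PRECONDITION_FAILED = 406  # AMQP 0-9-1 reply code


def is_consumer_timeout(reply_code, reply_text):
    """Heuristic: does this channel close look like a consumer timeout?

    Assumption: the reply text contains "delivery acknowledgement" and
    "timed out". Treat a match as a hint, not as proof.
    """
    text = (reply_text or "").lower()
    return (
        reply_code == PRECONDITION_FAILED
        and "delivery acknowledgement" in text
        and "timed out" in text
    )


def describe_channel_close(reply_code, reply_text, unacked_message_ids):
    """Build a log line that names the messages still unacknowledged."""
    if is_consumer_timeout(reply_code, reply_text):
        ids = ", ".join(unacked_message_ids) or "<none tracked>"
        return (
            "RabbitMQ consumer timeout exceeded; unacknowledged message(s) "
            f"on this channel: {ids}. They will be redelivered."
        )
    return f"Channel closed by broker ({reply_code}): {reply_text}"
```

Logging the IDs of the messages that were still unacknowledged when the close arrived is what would have made the offending task easy to find.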
