Problem Summary
There is a critical reliability issue in Celery/Kombu when using the SQS transport.
If any message reaches the SQS queue that is not a valid Celery-formatted JSON payload (for example: AWS Console test message, a misconfigured service, a typo, or any non-Celery producer), Kombu attempts to execute:
```python
payload = loads(bytes_to_str(body))
```

This immediately triggers a `JSONDecodeError`, which propagates up the stack and ultimately causes:

```
Unrecoverable error: JSONDecodeError(...)
```

→ the Celery worker process crashes.
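The failure is easy to reproduce outside Kombu: any non-JSON body, such as an AWS Console test message, makes the parse step raise. A minimal sketch using the standard library `json` module (Kombu's `loads` behaves the same way for a plain-text body):

```python
import json

# A typical raw SQS body sent from the AWS Console, not by Celery:
raw_body = "Hello from the AWS Console"

try:
    payload = json.loads(raw_body)  # mirrors loads(bytes_to_str(body))
except json.JSONDecodeError as exc:
    # This is the exception that propagates up and kills the worker.
    print(f"JSONDecodeError: {exc}")
```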
Why This Is Dangerous
This behavior introduces a serious stability and security risk:
- There is no supported extension point to intercept or validate the raw SQS message before JSON parsing.
- `on_decode_error` is never called, since the failure happens earlier, in `_message_to_python`.
- Even returning `None` from `_message_to_python` causes additional crashes (`'NoneType' object is not subscriptable`), because higher layers assume the structure of a fully valid Celery message.
- A single malformed SQS message can kill all Celery consumers on that queue, causing a full outage.
- In real AWS environments, it is common to have:
- accidentally published messages,
- dev/test messages,
- IAM misconfigurations,
- manual AWS Console messages,
- or third-party producers.
A production task processing system should not be this fragile.
What Is Needed
Celery/Kombu should provide a first-class, documented mechanism that allows developers to safely handle invalid raw messages, including:
- A pre-decode validation hook (e.g. `on_raw_message`).
- A hook specifically for JSON decode failures (e.g. `on_raw_message_error`).
- An option to discard or log invalid messages without killing the entire worker.
- The ability to mark such messages as handled (delete from SQS) or route them to a DLQ programmatically.
- Avoid classifying malformed messages as "Unrecoverable" errors.
This would make Celery significantly more robust in distributed and cloud environments.
Why Existing Workarounds Are Not Enough
- DLQ does not help because the worker crashes before the message can be retried and moved.
- External sanitizer consumers defeat the purpose of Celery managing the queue and add operational overhead.
- Monkey-patching Kombu internals is brittle and breaks easily with updates.
- There is no clean and stable mechanism today to prevent Celery from crashing on invalid messages.
Proposed Solution
Introduce an official API layer between SQS receive and Celery decode, such as:
- `before_message_decode(raw_message)`
- `on_invalid_raw_message(exception, raw_message)`
- Configurable behavior, such as:

  ```python
  DROP_INVALID_RAW_MESSAGES = True
  LOG_INVALID_MESSAGES = True
  ```
This would let Celery users:
- gracefully discard malformed messages,
- log or notify the event,
- keep the worker alive,
- maintain reliability in real-world environments.
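The intended semantics can be sketched in plain Python. Everything here is hypothetical — `on_invalid_raw_message`, `decode_raw_message`, and the `"drop"` action do not exist in Celery/Kombu today; the sketch only illustrates the behavior the proposed hooks would enable:

```python
import json
import logging

logger = logging.getLogger("celery.sqs")

def on_invalid_raw_message(exc, raw_message):
    # Hypothetical hook: log the event and signal that the message is
    # handled, so the transport deletes it from SQS instead of crashing.
    logger.warning("Dropping invalid SQS message: %s", exc)
    return "drop"

def decode_raw_message(raw_body):
    """Sketch of the decode path with the proposed hook in place."""
    try:
        return json.loads(raw_body)
    except json.JSONDecodeError as exc:
        action = on_invalid_raw_message(exc, raw_body)
        if action == "drop":
            return None  # caller deletes the message and moves on
        raise  # preserve today's behavior if the hook declines
```

With such a path in place, a valid Celery payload decodes normally, while a malformed body is logged and dropped — and the worker stays alive.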
Conclusion
A single malformed SQS message should not be able to bring down a Celery worker permanently.
Given that modern cloud systems frequently deal with mixed producers or accidental messages, this lack of raw-message error handling represents a real operational vulnerability.
Providing a safe and supported mechanism to intercept and handle invalid raw messages would dramatically increase Celery’s stability and adoption in cloud-native deployments.
Thank you for considering this improvement.
Related issues: