Problem Summary
There is a critical reliability issue in Celery/Kombu when using the SQS transport.
If any message reaches the SQS queue that is not a valid Celery-formatted JSON payload (for example: AWS Console test message, a misconfigured service, a typo, or any non-Celery producer), Kombu attempts to execute:
```python
payload = loads(bytes_to_str(body))
```

This immediately triggers a `JSONDecodeError`, which propagates up the stack and ultimately causes:

```
Unrecoverable error: JSONDecodeError(...)
```

→ the Celery worker process crashes.
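The failure is easy to reproduce outside Kombu: any non-JSON body, such as an AWS Console test message, makes the parse step raise. A minimal sketch using the standard library `json` module (Kombu's `loads` behaves the same way for a plain-text body):

```python
import json

# A typical raw SQS body sent from the AWS Console, not by Celery:
raw_body = "Hello from the AWS Console"

try:
    payload = json.loads(raw_body)  # mirrors loads(bytes_to_str(body))
except json.JSONDecodeError as exc:
    # This is the exception that propagates up and kills the worker.
    print(f"JSONDecodeError: {exc}")
```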
Why This Is Dangerous
This behavior introduces a serious stability and security risk:
- There is no supported extension point to intercept or validate the raw SQS message before JSON parsing.
- `on_decode_error` is never called, since the failure happens earlier, in `_message_to_python`.
- Even returning `None` from `_message_to_python` causes additional crashes (`'NoneType' object is not subscriptable`), because higher layers assume the structure of a fully valid Celery message.
- A single malformed SQS message can kill all Celery consumers on that queue, causing a full outage.
- In real AWS environments, it is common to have:
- accidentally published messages,
- dev/test messages,
- IAM misconfigurations,
- manual AWS Console messages,
- or third-party producers.
A production task processing system should not be this fragile.
What Is Needed
Celery/Kombu should provide a first-class, documented mechanism that allows developers to safely handle invalid raw messages, including:
- A pre-decode validation hook (e.g. `on_raw_message`).
- A hook specifically for JSON decode failures (e.g. `on_raw_message_error`).
- An option to discard or log invalid messages without killing the entire worker.
- The ability to mark such messages as handled (delete from SQS) or route them to a DLQ programmatically.
- Avoid classifying malformed messages as "Unrecoverable" errors.
This would make Celery significantly more robust in distributed and cloud environments.
Why Existing Workarounds Are Not Enough
- DLQ does not help because the worker crashes before the message can be retried and moved.
- External sanitizer consumers defeat the purpose of Celery managing the queue and add operational overhead.
- Monkey-patching Kombu internals is brittle and breaks easily with updates.
- There is no clean and stable mechanism today to prevent Celery from crashing on invalid messages.
Proposed Solution
Introduce an official API layer between SQS receive and Celery decode, such as:
- `before_message_decode(raw_message)`
- `on_invalid_raw_message(exception, raw_message)`
- Configurable behavior, such as:

  ```python
  DROP_INVALID_RAW_MESSAGES = True
  LOG_INVALID_MESSAGES = True
  ```
This would let Celery users:
- gracefully discard malformed messages,
- log or notify the event,
- keep the worker alive,
- maintain reliability in real-world environments.
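The intended semantics can be sketched in plain Python. Everything here is hypothetical — `on_invalid_raw_message`, `decode_raw_message`, and the `"drop"` action do not exist in Celery/Kombu today; the sketch only illustrates the behavior the proposed hooks would enable:

```python
import json
import logging

logger = logging.getLogger("celery.sqs")

def on_invalid_raw_message(exc, raw_message):
    # Hypothetical hook: log the event and signal that the message is
    # handled, so the transport deletes it from SQS instead of crashing.
    logger.warning("Dropping invalid SQS message: %s", exc)
    return "drop"

def decode_raw_message(raw_body):
    """Sketch of the decode path with the proposed hook in place."""
    try:
        return json.loads(raw_body)
    except json.JSONDecodeError as exc:
        action = on_invalid_raw_message(exc, raw_body)
        if action == "drop":
            return None  # caller deletes the message and moves on
        raise  # preserve today's behavior if the hook declines
```

With such a path in place, a valid Celery payload decodes normally, while a malformed body is logged and dropped — and the worker stays alive.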
Conclusion
A single malformed SQS message should not be able to bring down a Celery worker permanently.
Given that modern cloud systems frequently deal with mixed producers or accidental messages, this lack of raw-message error handling represents a real operational vulnerability.
Providing a safe and supported mechanism to intercept and handle invalid raw messages would dramatically increase Celery’s stability and adoption in cloud-native deployments.
Thank you for considering this improvement.
Related issues: