| 1 | +# Kafka test stability |
| 2 | + |
| 3 | +## The problem |
| 4 | + |
| 5 | +When running tests in parallel (e.g., with `pytest-xdist`), you might encounter sporadic crashes with messages like: |
| 6 | + |
| 7 | +``` |
| 8 | +Fatal Python error: Aborted |
| 9 | +``` |
| 10 | + |
| 11 | +The stack trace typically points to `confluent_kafka` operations, often during producer initialization in fixtures or test setup. This isn't a bug in the application code - it's a known race condition in the underlying `librdkafka` C library. |
| 12 | + |
| 13 | +## Why it happens |
| 14 | + |
| 15 | +The `confluent-kafka-python` library is a thin wrapper around `librdkafka`, a high-performance C library. When multiple Python processes or threads try to create Kafka `Producer` instances simultaneously, they can trigger a race condition in `librdkafka`'s internal initialization routines. |
| 16 | + |
| 17 | +This manifests as: |
| 18 | + |
| 19 | +- Random `SIGABRT` signals during test runs |
| 20 | +- Crashes in `rd_kafka_broker_destroy_final` or similar internal functions |
| 21 | +- Flaky CI failures that pass on retry |
| 22 | + |
| 23 | +The issue is particularly common in CI environments where tests run in parallel across multiple workers. |
| 24 | + |
| 25 | +## The fix |
| 26 | + |
| 27 | +The solution is to serialize `Producer` initialization using a global threading lock. This prevents multiple threads from entering `librdkafka`'s initialization code simultaneously. |
| 28 | + |
| 29 | +In `app/events/core/producer.py`: |
| 30 | + |
| 31 | +```python |
| 32 | +import threading |
| 33 | + |
| 34 | +# Global lock to serialize Producer initialization (workaround for librdkafka race condition) |
| 35 | +# See: https://github.com/confluentinc/confluent-kafka-python/issues/1797 |
| 36 | +_producer_init_lock = threading.Lock() |
| 37 | + |
| 38 | +class UnifiedProducer: |
| 39 | +    async def start(self) -> None: |
| 40 | +        # ... config setup ... |
| 41 | + |
| 42 | +        # Serialize Producer initialization to prevent librdkafka race condition |
| 43 | +        with _producer_init_lock: |
| 44 | +            self._producer = Producer(producer_config) |
| 45 | + |
| 46 | +        # ... rest of startup ... |
| 47 | +``` |
| 48 | + |
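| 49 | +The lock is process-global, so all `UnifiedProducer` instances in the same process serialize their initialization. It does not coordinate separate processes: each `pytest-xdist` worker is its own process with its own copy of the lock. The overhead is negligible in production (producers are typically created once at startup), and the lock removes the race between threads within a test process. |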
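
As a rough illustration of why the lock helps, the sketch below swaps the real client for a hypothetical `FakeProducer` (not part of `confluent_kafka`) that counts how many constructors overlap. `create_producer`, `run_concurrently`, and the 10 ms sleep are also assumptions made for the demo; only the module-level `threading.Lock` mirrors the pattern above.

```python
import threading
import time

# Same idea as _producer_init_lock: one module-level lock shared by all callers.
_init_lock = threading.Lock()


class FakeProducer:
    """Hypothetical stand-in for confluent_kafka.Producer (not the real client)."""

    _active = 0       # constructors currently running
    max_active = 0    # highest overlap observed so far
    _counter_lock = threading.Lock()

    def __init__(self) -> None:
        cls = FakeProducer
        with cls._counter_lock:
            cls._active += 1
            cls.max_active = max(cls.max_active, cls._active)
        time.sleep(0.01)  # pretend native initialization takes a moment
        with cls._counter_lock:
            cls._active -= 1


def create_producer() -> FakeProducer:
    # Only one thread at a time may run the constructor.
    with _init_lock:
        return FakeProducer()


def run_concurrently(n_threads: int = 8) -> int:
    """Start n_threads constructions at once and return the peak overlap."""
    FakeProducer.max_active = 0
    threads = [threading.Thread(target=create_producer) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return FakeProducer.max_active
```

With the lock in place, `run_concurrently()` reports a peak overlap of 1. Removing the `with _init_lock:` line typically lets the value rise above 1, which is the situation the workaround is meant to avoid.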
| 50 | + |
| 51 | +## Related issues |
| 52 | + |
| 53 | +These GitHub issues document the underlying problem: |
| 54 | + |
| 55 | +| Issue | Description | |
| 56 | +|-------|-------------| |
| 57 | +| [confluent-kafka-python#1797](https://github.com/confluentinc/confluent-kafka-python/issues/1797) | Segfaults in multithreaded/asyncio pytest environments | |
| 58 | +| [confluent-kafka-python#1761](https://github.com/confluentinc/confluent-kafka-python/issues/1761) | Segfault on garbage collection in multithreaded context | |
| 59 | +| [librdkafka#3608](https://github.com/confluentinc/librdkafka/issues/3608) | Crash in `rd_kafka_broker_destroy_final` | |
| 60 | + |
| 61 | +## Alternative approaches |
| 62 | + |
| 63 | +If you still encounter issues: |
| 64 | + |
| 65 | +1. **Reduce parallelism** - Run Kafka-dependent tests with fewer workers: `pytest -n 2` instead of `-n auto` |
| 66 | + |
| 67 | +2. **Isolate Kafka tests** - Mark Kafka tests and run them separately: |
| 68 | +   ```python |
| 69 | +   @pytest.mark.kafka |
| 70 | +   def test_producer_sends_message(): |
| 71 | +       ... |
| 72 | +   ``` |
| 73 | +   ```bash |
| 74 | +   pytest -m "not kafka" -n auto  # parallel |
| 75 | +   pytest -m kafka -n 1           # sequential |
| 76 | +   ``` |
| 77 | + |
| 78 | +3. **Use fixtures carefully** - Ensure producer fixtures are properly scoped and cleaned up: |
| 79 | +   ```python |
| 80 | +   @pytest.fixture(scope="function") |
| 81 | +   async def producer(): |
| 82 | +       p = UnifiedProducer(config, schema_registry) |
| 83 | +       await p.start() |
| 84 | +       yield p |
| 85 | +       await p.stop()  # Always clean up |
| 86 | +   ``` |
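
If the `kafka` marker from option 2 is not registered, pytest emits an unknown-mark warning, and it rejects the marker outright when `--strict-markers` is set. One way to register it, sketched here for a `conftest.py`, uses pytest's `pytest_configure` hook together with `config.addinivalue_line`. The marker description text is an assumption, not something taken from the project.

```python
# conftest.py (sketch): register the custom "kafka" marker so that
# `pytest -m kafka` and `pytest -m "not kafka"` work without warnings.


def pytest_configure(config):
    # "markers" is pytest's ini option for declaring custom markers;
    # the description after the colon is illustrative only.
    config.addinivalue_line(
        "markers",
        "kafka: tests that create a Kafka producer or consumer",
    )
```

The same registration can instead live in the `markers` setting of the project's pytest configuration file.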