Summary
When a consumer is destroyed (rd_kafka_destroy()) and the group coordinator is unavailable, rd_kafka_cgrp_leave() calls rd_kafka_cgrp_handle_LeaveGroup() with a NULL broker pointer (rkb = rkcg->rkcg_coord). The error path then dereferences this NULL pointer:
`rd_kafka_dbg(rkb->rkb_rk, CGRP, "LEAVEGROUP", ...); // CRASH: rkb is NULL`
This crash requires the coordinator to become unavailable at the exact moment the consumer is shutting down. Typical triggers:
- Rolling broker upgrades where the coordinator broker restarts
- Coordinator failover during consumer shutdown
- Network partition isolating the coordinator
Identification
- We observed intermittent SIGSEGV crashes in production during consumer shutdown
- We captured a core dump and analyzed it with gdb:
#0 rd_kafka_cgrp_handle_LeaveGroup (rk=0x..., rkb=0x0, err=RD_KAFKA_RESP_ERR__WAIT_COORD, ...)
at rdkafka_cgrp.c:984
#1 rd_kafka_cgrp_leave (rkcg=0x...) at rdkafka_cgrp.c:1158
#2 rd_kafka_cgrp_terminate (rkcg=0x...) at rdkafka_cgrp.c:...
#3 rd_kafka_destroy_internal (rk=0x...) at rdkafka.c:...
- The above backtrace shows `rkb=0x0` (NULL) in `rd_kafka_cgrp_handle_LeaveGroup()`
- We traced the call site in `rd_kafka_cgrp_leave()` (line 1158):

      } else
              rd_kafka_cgrp_handle_LeaveGroup(rkcg->rkcg_rk,
                                              rkcg->rkcg_coord, // <-- rkcg_coord is NULL here
                                              RD_KAFKA_RESP_ERR__WAIT_COORD,
                                              NULL, NULL, rkcg);

- This `else` branch is taken when no coordinator is available (`rkcg->rkcg_coord == NULL`)
- The function then attempts to log using `rkb->rkb_rk` at line 984, causing a NULL dereference
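
One possible direction for a fix, sketched here as a standalone model and not as a patch against librdkafka: frame #0 of the backtrace shows that the handler already receives the client handle `rk` as its first argument, so the error path could log through `rk` and never touch `rkb` when it is NULL. The types `client_t` and `broker_t`, the function `handle_leave_group()`, the `log_count` field, and the `ERR_WAIT_COORD` value are hypothetical stand-ins chosen for illustration; they are not librdkafka definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the library's internal types. Only the
 * fields needed to model the fault are included. */
typedef struct client_s {
    const char *name;
    int log_count; /* number of messages logged, for illustration only */
} client_t;

typedef struct broker_s {
    client_t *rk; /* back-pointer to the owning client */
} broker_t;

/* Hypothetical error code standing in for "coordinator not available". */
enum { ERR_WAIT_COORD = 1 };

/* Models a handler that receives both the client (rk) and the broker (rkb).
 * rkb may be NULL when no coordinator is known, so the error path logs
 * through rk and only tests rkb for NULL instead of dereferencing it. */
static int handle_leave_group(client_t *rk, broker_t *rkb, int err) {
    assert(rk != NULL); /* the client handle is always expected to be valid */

    if (err) {
        /* The crashing pattern would read rkb->rk here. */
        rk->log_count++;
        fprintf(stderr, "[%s] LeaveGroup failed: err=%d (coordinator %s)\n",
                rk->name, err, rkb ? "known" : "unavailable");
        return -1;
    }
    return 0;
}
```

Calling `handle_leave_group(&rk, NULL, ERR_WAIT_COORD)` in this model logs the failure and returns `-1` without dereferencing the NULL broker. An alternative with the same effect would be for the caller to skip the handler entirely when no coordinator is set; which of the two suits the real code base is for the maintainers to decide.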