admin: fix use-after-free in coord_request error path by piochelepiotr · Pull Request #5397 · confluentinc/librdkafka

Piotr WOLSKI (piochelepiotr) · 2026-04-08T18:36:57Z

Description

Fix a use-after-free bug in rd_kafka_admin_coord_request() that causes a
process abort with:

rd_kafka_enq_once_del_source_return: Assertion `eonce->refcnt > 0' failed

This affects all coordinator-targeted Admin API operations:
DescribeConsumerGroups, DeleteConsumerGroupOffsets,
ListConsumerGroupOffsets, and similar.

Root cause

When rd_kafka_admin_coord_request()'s inner request() call fails
(e.g., API not supported by broker, connection dropped), the error path
called rd_kafka_admin_common_worker_destroy() which freed the eonce
object. However, the caller (rd_kafka_coord_req_fsm) still holds a
reference to the eonce and passes it to rd_kafka_coord_req_fail(),
which enqueues a dummy error response with the now-freed eonce as opaque.

When rd_kafka_admin_coord_response_parse() later processes that response
and calls rd_kafka_enq_once_del_source_return(), it accesses freed memory,
triggering the assertion failure.

Fix

Remove the premature worker_destroy() call from the error path of
rd_kafka_admin_coord_request(). The error is returned to the caller,
which calls coord_req_fail() → coord_response_parse(). That function
already handles the error correctly: it calls del_source_return() to
retrieve the rko, sees the error, and calls worker_destroy() itself.

This matches the pattern already used by
rd_kafka_txn_send_TxnOffsetCommitRequest(), which has explicit comments
on its error paths documenting the same constraint:

/* Do not free the rko, it is passed as the reply_opaque
 * on the reply queue by coord_req_fsm() when we return
 * an error here. */

How to reproduce

The bug triggers under two conditions:

1. API version mismatch: broker doesn't support the requested API
   (e.g., OffsetDelete on broker < 2.4)
2. Connection disruption: broker restarts or connection drops during
   an Admin API fanout operation

Higher request frequency increases the likelihood of hitting the race.

Testing

This is a race condition on the error path of coordinator-targeted admin
requests. It requires either an API version mismatch or a connection failure
during the request send, making it difficult to reproduce deterministically
in a unit test. Happy to add a test if maintainers can suggest an approach
for reliably triggering the send failure.

Fixes #4605
Fixes #3663

confluent-cla-assistant · 2026-04-08T18:37:09Z

🎉 All Contributor License Agreements have been signed. Ready to merge.
✅ piochelepiotr
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

When rd_kafka_admin_coord_request()'s request() call fails, the error path called rd_kafka_admin_common_worker_destroy() which freed the eonce object. However, the caller (rd_kafka_coord_req_fsm) still holds a reference to the eonce and passes it to rd_kafka_coord_req_fail(), which enqueues a dummy error response carrying the (now-freed) eonce as opaque.

When rd_kafka_admin_coord_response_parse() later processes that response and calls rd_kafka_enq_once_del_source_return(), it accesses freed memory, triggering an assertion failure and abort:

rd_kafka_enq_once_del_source_return: Assertion `eonce->refcnt > 0'

Fix by not calling worker_destroy() in the error path of rd_kafka_admin_coord_request(). Instead, let the error propagate through the normal coord_req_fail -> coord_response_parse path, which already handles cleanup correctly. This matches the pattern used by rd_kafka_txn_send_TxnOffsetCommitRequest(), which has explicit comments documenting the same constraint.

This affects all coordinator-targeted Admin API operations: DescribeConsumerGroups, DeleteConsumerGroupOffsets, ListConsumerGroupOffsets, and similar.

Fixes confluentinc#4605
Fixes confluentinc#3663

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…or path

Apply upstream fix (confluentinc/librdkafka#5397) for a use-after-free bug in rd_kafka_admin_coord_request() that causes process abort with assertion failure on eonce->refcnt. Affects DescribeConsumerGroups, DeleteConsumerGroupOffsets, ListConsumerGroupOffsets and similar coordinator-targeted Admin API operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Piotr WOLSKI (piochelepiotr) force-pushed the fix/admin-coord-request-eonce-use-after-free branch from 6f5f7a3 to c6b26c7 · April 8, 2026 18:45

Piotr WOLSKI (piochelepiotr) marked this pull request as ready for review April 9, 2026 02:08

Piotr WOLSKI (piochelepiotr) requested a review from a team as a code owner April 9, 2026 02:08

Piotr WOLSKI (piochelepiotr) mentioned this pull request Apr 9, 2026

[builders] Patch librdkafka use-after-free in admin coord_request error path DataDog/integrations-core#23240

Piotr WOLSKI (piochelepiotr) wants to merge 1 commit into confluentinc:master from piochelepiotr:fix/admin-coord-request-eonce-use-after-free
