admin: fix use-after-free in coord_request error path #5397

Open
Piotr WOLSKI (piochelepiotr) wants to merge 1 commit into confluentinc:master from piochelepiotr:fix/admin-coord-request-eonce-use-after-free


@piochelepiotr Piotr WOLSKI (piochelepiotr) commented Apr 8, 2026

Description

Fix a use-after-free bug in rd_kafka_admin_coord_request() that causes a
process abort with:

rd_kafka_enq_once_del_source_return: Assertion `eonce->refcnt > 0' failed

This affects all coordinator-targeted Admin API operations:
DescribeConsumerGroups, DeleteConsumerGroupOffsets,
ListConsumerGroupOffsets, and similar.

Root cause

When the inner request() call made by rd_kafka_admin_coord_request() fails
(e.g., API not supported by broker, connection dropped), the error path
calls rd_kafka_admin_common_worker_destroy(), which frees the eonce
object. However, the caller (rd_kafka_coord_req_fsm) still holds a
reference to the eonce and passes it to rd_kafka_coord_req_fail(),
which enqueues a dummy error response with the now-freed eonce as opaque.

When rd_kafka_admin_coord_response_parse() later processes that response
and calls rd_kafka_enq_once_del_source_return(), it accesses freed memory,
triggering the assertion failure.

Fix

Remove the premature worker_destroy() call from the error path of
rd_kafka_admin_coord_request(). The error is returned to the caller,
which calls coord_req_fail() -> coord_response_parse(). That function
already handles the error correctly: it calls del_source_return() to
retrieve the rko, sees the error, and calls worker_destroy() itself.

This matches the pattern already used by
rd_kafka_txn_send_TxnOffsetCommitRequest(), which has explicit comments
on its error paths documenting the same constraint:

/* Do not free the rko, it is passed as the reply_opaque
 * on the reply queue by coord_req_fsm() when we return
 * an error here. */
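
To make the ownership rule concrete, here is a minimal, self-contained model of the fixed flow. It is a sketch only and not librdkafka code: every `sketch_*` name, the struct, and its fields are hypothetical stand-ins for the eonce/rko pair, the send callback, the FSM, and the reply handler. Cleanup is modelled by a counter rather than a real free() so the result can be checked.

```c
#include <assert.h>

/* Hypothetical stand-in for the eonce/rko pair (not a librdkafka type). */
typedef struct sketch_op {
    int refcnt;        /* sources that may still trigger the op */
    int destroy_count; /* how many times cleanup ran (should end at 1) */
} sketch_op_t;

/* Models the send callback: on failure it only reports the error.
 * It must NOT destroy the op, because the caller still passes the
 * same pointer on as the reply opaque. */
static int sketch_send_request(sketch_op_t *op, int simulate_failure) {
    (void)op;
    return simulate_failure ? -1 : 0;
}

/* Models the reply handler: the single place that drops the source
 * reference and, on error, performs the cleanup. */
static void sketch_handle_reply(sketch_op_t *op, int err) {
    assert(op->refcnt > 0); /* mirrors the assertion from the report */
    op->refcnt--;
    if (err)
        op->destroy_count++; /* stand-in for worker_destroy() */
}

/* Models the FSM: if the send fails, it delivers an error reply that
 * carries the same op pointer as opaque. */
static int sketch_coord_req(sketch_op_t *op, int simulate_failure) {
    int err = sketch_send_request(op, simulate_failure);
    if (err)
        sketch_handle_reply(op, err);
    return err;
}
```

Calling `sketch_coord_req(&op, 1)` on an op with `refcnt == 1` leaves `refcnt == 0` and `destroy_count == 1`. If the send callback also cleaned up the op, as the pre-fix error path did, the reply handler would be operating on an object that had already been destroyed.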

How to reproduce

The bug triggers under either of two conditions:

  1. API version mismatch: broker doesn't support the requested API
    (e.g., OffsetDelete on broker < 2.4)
  2. Connection disruption: broker restarts or connection drops during
    an Admin API fanout operation

Higher request frequency increases the likelihood of hitting the race.

Testing

This is a race condition on the error path of coordinator-targeted admin
requests. It requires either an API version mismatch or a connection failure
during the request send, making it difficult to reproduce deterministically
in a unit test. Happy to add a test if maintainers can suggest an approach
for reliably triggering the send failure.

Fixes #4605
Fixes #3663

@confluent-cla-assistant

🎉 All Contributor License Agreements have been signed. Ready to merge.
✅ piochelepiotr
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

When rd_kafka_admin_coord_request()'s request() call fails, the error
path called rd_kafka_admin_common_worker_destroy() which freed the
eonce object. However, the caller (rd_kafka_coord_req_fsm) still holds
a reference to the eonce and passes it to rd_kafka_coord_req_fail(),
which enqueues a dummy error response carrying the (now-freed) eonce
as opaque. When rd_kafka_admin_coord_response_parse() later processes
that response and calls rd_kafka_enq_once_del_source_return(), it
accesses freed memory, triggering an assertion failure and abort:

  rd_kafka_enq_once_del_source_return: Assertion `eonce->refcnt > 0'

Fix by not calling worker_destroy() in the error path of
rd_kafka_admin_coord_request(). Instead, let the error propagate
through the normal coord_req_fail -> coord_response_parse path,
which already handles cleanup correctly. This matches the pattern
used by rd_kafka_txn_send_TxnOffsetCommitRequest(), which has
explicit comments documenting the same constraint.

This affects all coordinator-targeted Admin API operations:
DescribeConsumerGroups, DeleteConsumerGroupOffsets,
ListConsumerGroupOffsets, and similar.

Fixes confluentinc#4605
Fixes confluentinc#3663

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@piochelepiotr Piotr WOLSKI (piochelepiotr) force-pushed the fix/admin-coord-request-eonce-use-after-free branch from 6f5f7a3 to c6b26c7 Compare April 8, 2026 18:45
@piochelepiotr Piotr WOLSKI (piochelepiotr) marked this pull request as ready for review April 9, 2026 02:08
@piochelepiotr Piotr WOLSKI (piochelepiotr) requested a review from a team as a code owner April 9, 2026 02:08
Piotr WOLSKI (piochelepiotr) added a commit to DataDog/integrations-core that referenced this pull request Apr 9, 2026
…or path

Apply upstream fix (confluentinc/librdkafka#5397) for a use-after-free
bug in rd_kafka_admin_coord_request() that causes process abort with
assertion failure on eonce->refcnt. Affects DescribeConsumerGroups,
DeleteConsumerGroupOffsets, ListConsumerGroupOffsets and similar
coordinator-targeted Admin API operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>