Consumer enters rebalance loop when connect function is triggered during a scheduled heartbeat #1279

@ajwootto

Bug Report

Environment

  • Node version: 8
  • Kafka-node version: 4.1.3
  • Kafka version: 1.10

This is a bit of an edge case but we've run into it pretty consistently with our setup. The logical steps are as follows:

Given two consumers that have successfully connected to a broker and started heartbeats:

1. next heartbeat is currently scheduled
2. connect is called outside the heartbeat loop (e.g. because a socket closed)
3. next heartbeat happens with rebalance error because of current reconnect
4. another reconnect is scheduled due to heartbeat error
5. first connect finishes
6. heartbeat interval is cleared and restarted
7. next heartbeat succeeds on the latest generation id
8. scheduled reconnect occurs from the previous heartbeat failure (outside the context of the current heartbeat loop, i.e. from the old generation id)
GOTO 3.

The problem seems to be kicked off by connect() being called by some mechanism other than a heartbeat failure (in this case a socket close event, which triggers a reconnect). Since this path does not cancel the heartbeat interval, the scheduled heartbeat can fire while the connection (rebalance) is still in progress. When that happens, the heartbeat receives error code 27 (REBALANCE_IN_PROGRESS) and triggers a rebalance, scheduling another connect() for 1 second in the future. Assuming the first connect() call finishes in time, it starts a new heartbeat loop but does not clear the reconnect that is already scheduled. One second later that reconnect runs while the latest heartbeat loop is still scheduled, so the next heartbeat receives error code 27 again, triggers another reconnect, and so on.
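For illustration, here is a rough sketch of one way the loop could be broken: every connect() stops the heartbeat and cancels any pending reconnect, and at most one reconnect is ever queued. This is not kafka-node's code. `GroupMember`, `joinGroup`, `sendHeartbeat` and the 3000 ms / 1000 ms timings are all hypothetical names and values chosen for the example.

```javascript
// Hypothetical sketch: GroupMember, joinGroup and sendHeartbeat are
// illustrative names, not kafka-node's actual API.
class GroupMember {
  constructor({ joinGroup, sendHeartbeat }) {
    this.joinGroup = joinGroup;         // assumed: async () => generationId
    this.sendHeartbeat = sendHeartbeat; // assumed: async (generationId) => void, rejects on error
    this.generationId = null;
    this.connecting = false;
    this.heartbeatTimer = null;
    this.reconnectTimer = null;
  }

  startHeartbeat(intervalMs = 3000) {
    this.stopHeartbeat();
    this.heartbeatTimer = setInterval(() => {
      const sentFor = this.generationId;
      this.sendHeartbeat(sentFor).catch(() => {
        // Ignore failures that belong to a generation we have already left.
        if (sentFor === this.generationId) this.scheduleReconnect(1000);
      });
    }, intervalMs);
  }

  stopHeartbeat() {
    clearInterval(this.heartbeatTimer);
    this.heartbeatTimer = null;
  }

  cancelReconnect() {
    clearTimeout(this.reconnectTimer);
    this.reconnectTimer = null;
  }

  // Single entry point for deferred reconnects, so at most one is pending.
  scheduleReconnect(delayMs) {
    if (this.connecting || this.reconnectTimer !== null) return;
    this.reconnectTimer = setTimeout(() => {
      this.reconnectTimer = null;
      this.connect().catch(() => this.scheduleReconnect(delayMs));
    }, delayMs);
  }

  // Safe to call from any trigger (socket close, heartbeat error, timer).
  async connect() {
    if (this.connecting) return;
    this.connecting = true;
    this.stopHeartbeat();   // no heartbeat fires during the rebalance
    this.cancelReconnect(); // drop reconnects queued for the old generation
    try {
      this.generationId = await this.joinGroup();
    } finally {
      this.connecting = false;
    }
    this.startHeartbeat();
  }
}
```

With this shape, step 2 in the list above would stop the heartbeat before rejoining, so step 3 never fires, and a reconnect left over from an old generation is cancelled rather than run after the new heartbeat loop has started.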

To simulate this problem, I added some code to the consumerGroup that calls connect() a few times one second apart. This is enough to throw it into a loop when running with two consumers against my local Kafka.

taplytics@8fd6b92

Just set process.env.FAKE_CONNECT=1 for one consumer and not the other.
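
The actual reproduction code is in the commit linked above. As a rough idea of its shape only, a helper along these lines would call connect() a few times one second apart. `scheduleFakeConnects` is a hypothetical name, and the default of three calls is an assumption standing in for "a few times".

```javascript
// Hypothetical helper; the real reproduction lives in the linked commit.
// `group` is assumed to be any object that exposes connect().
function scheduleFakeConnects(group, times = 3, intervalMs = 1000) {
  const timers = [];
  for (let i = 1; i <= times; i++) {
    // Fire connect() outside the heartbeat loop, one interval apart.
    timers.push(setTimeout(() => group.connect(), i * intervalMs));
  }
  return timers; // returned so callers can clearTimeout() them
}
```

Gated on the environment variable, it might be invoked as `if (process.env.FAKE_CONNECT) scheduleFakeConnects(consumerGroup);`, where `consumerGroup` stands for the consumer instance under test.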
