Datadog kafka_consumer check very slow for large number of consumers when collecting highwater marks #20715

@arshdeeptinna

Hi,
We have been running an older version of kafka_consumer (2.16.4) because we ran into issues with metrics when we tried to upgrade before. We finally tried again and rolled out 6.5.2, which contains the improvements made to address the issue mentioned in #19564. However, the check still takes much longer than on the previous version, where it ran in 5-10 seconds.

sudo datadog-agent status Collector | grep -A 8 kafka
    kafka_consumer (6.5.2)
    ----------------------
      Instance ID: kafka_consumer:6806d01930984041 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 74,352, Total: 74,352
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 33m40.531s
      Last Execution Date : 2025-06-26 14:09:52 UTC (1750946992000)
      Last Successful Execution Date : 2025-06-26 14:09:52 UTC (1750946992000)

There are possible improvements that can be made to reduce this time. The two main ones are:

  1. In get_highwater_offsets we fetch the list of topic partitions for each consumer group. However, the highwater mark is not a per-consumer-group setting, so fetching the topic and partition info once should be enough: https://github.com/DataDog/integrations-core/blob/master/kafka_consumer/datadog_checks/kafka_consumer/kafka_consumer.py#L348
  2. We can also cache the result of the list_topics method and reuse it throughout the check run instead of calling it every time we need this topic partition info.
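
To illustrate both points, here is a minimal sketch of the idea, not the integration's actual code. `TopicMetadataCache`, `partitions_for_highwater`, and the `fetch_topics` callable are hypothetical names; `fetch_topics` stands in for whatever wraps the client's list_topics call and is assumed to return a mapping of topic name to partition ids. Consumer offsets are assumed to be keyed by (group, topic, partition).

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical shapes, chosen for illustration only.
TopicPartition = Tuple[str, int]
ConsumerOffsets = Dict[Tuple[str, str, int], int]  # (group, topic, partition) -> offset


class TopicMetadataCache:
    """Caches the result of an expensive metadata lookup for one check run."""

    def __init__(self, fetch_topics: Callable[[], Dict[str, List[int]]]) -> None:
        # fetch_topics is an assumed wrapper around the client's list_topics call.
        self._fetch_topics = fetch_topics
        self._topics: Optional[Dict[str, List[int]]] = None

    def topics(self) -> Dict[str, List[int]]:
        # Only the first call per run reaches the cluster.
        if self._topics is None:
            self._topics = self._fetch_topics()
        return self._topics

    def reset(self) -> None:
        # Call at the start of each check run so metadata does not go stale.
        self._topics = None


def partitions_for_highwater(
    cache: TopicMetadataCache, consumer_offsets: ConsumerOffsets
) -> Set[TopicPartition]:
    """Collect the topic partitions to query once, independent of consumer group."""
    # Highwater marks belong to topic partitions, so the group is dropped here.
    wanted_topics = {topic for (_group, topic, _partition) in consumer_offsets}
    all_topics = cache.topics()
    return {
        (topic, partition)
        for topic in wanted_topics
        for partition in all_topics.get(topic, [])
    }
```

With this shape, the number of metadata lookups per run no longer grows with the number of consumer groups; the returned set would then be passed to whatever code queries the highwater offsets.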

I made these changes and tried them locally, and they brought the time back down to an acceptable range.

sudo datadog-agent status Collector | grep -A 8 kafka
    kafka_consumer (6.5.2)
    ----------------------
      Instance ID: kafka_consumer:6c446270ca0a8da1 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 76,129, Total: 76,129
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 22.72s
      Last Execution Date : 2025-06-27 20:59:48 UTC (1751057988000)
      Last Successful Execution Date : 2025-06-27 20:59:48 UTC (1751057988000)
