Significant performance degradation in Kafka JMX metrics on upgrade #7944
7 comments · 17 replies
-
@scholzj apologies, you are right that a legend would have helped. Unfortunately I can't easily add one retrospectively, though I could rerun the test if really needed. To clarify, these are the scrape times in seconds, with each line representing one Kafka broker. No other services (e.g. ZooKeeper) are included. The low values on the left are the brokers running operator 0.27.1 / Kafka 3.0.0; the high values on the right are 0.32.0 / Kafka 3.3.1. Building with an older JMX Prometheus Exporter sounds like a good test. I will look into that today. Many thanks.
-
@scholzj thank you very much for your help last week. I didn't go back to an older version of the JMX Prometheus Exporter in the end, but the pointer that the metrics were coming from that exporter was very useful. I did some reading of the exporter docs and GitHub issues, and found many reports of people adding …
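For illustration only (not necessarily the setting those reports describe), the JMX Prometheus Exporter configuration does expose per-rule tuning such as a cache flag; a minimal hypothetical sketch in the style of the Strimzi Kafka metrics rules:

```yaml
# Hypothetical jmx_exporter configuration snippet, in the style of the Strimzi
# Kafka metrics examples. The rule pattern is illustrative; `cache: true`
# caches bean-name-to-rule matching between scrapes.
lowercaseOutputName: true
rules:
  - pattern: "kafka.server<type=(.+), name=(.+)><>Value"
    name: "kafka_server_$1_$2"
    cache: true
```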
-
Just noting that we experienced the same issue when upgrading from Strimzi 0.29.0 / Kafka 3.0.1 to Strimzi 0.33.2 / Kafka 3.3.2 (via Kafka 3.2.0). The difference is not as significant, but this is on a 3-broker cluster with almost no traffic on it. We also see liveness/readiness probe timeouts.
I'll also try …
-
We see the same performance problem after upgrading Strimzi. I can confirm the jmx_exporter is the problem. I have built my own image where I downgrade the Prometheus agent to 0.16.1, and it works with the same scrape time as before. My Dockerfile for the custom image:
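A minimal sketch of such a Dockerfile (not the poster's exact file; the base image tag and the agent jar living under /opt/kafka/libs are assumptions based on this thread):

```dockerfile
# Minimal sketch: rebuild the Strimzi Kafka image with the older
# JMX Prometheus Exporter agent 0.16.1.
# ASSUMPTIONS: base image tag and the agent jar location in /opt/kafka/libs.
FROM quay.io/strimzi/kafka:0.33.2-kafka-3.3.2

USER root
# Fetch the 0.16.1 agent from Maven Central.
ADD https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.16.1/jmx_prometheus_javaagent-0.16.1.jar /tmp/jmx_prometheus_javaagent-0.16.1.jar
# Overwrite the bundled agent jar(s) in place so any startup script that
# refers to the original filename keeps working.
RUN for f in /opt/kafka/libs/jmx_prometheus_javaagent*.jar; do \
      cp /tmp/jmx_prometheus_javaagent-0.16.1.jar "$f"; \
    done && \
    chmod 644 /opt/kafka/libs/jmx_prometheus_javaagent*.jar && \
    rm /tmp/jmx_prometheus_javaagent-0.16.1.jar
USER 1001
```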
-
Hi @scholzj, does it make sense to keep two versions of the Prometheus agent in a single Docker image (the old one as a stable fallback and the new one) and switch between them via a feature flag?
-
Sure, I understand your concern. It seems like there might be a performance issue after upgrading your Kafka clusters. Have you checked if there are any specific configurations or settings that might have changed between the versions? Also, have you tried monitoring the system resources during these scrapes to see if there's any bottleneck causing the timeouts?
-
Please use this only for bug reports. For questions or when you need help, you can use the GitHub Discussions, our #strimzi Slack channel or our user mailing list.
Describe the bug
We have recently updated two of our clusters from operator 0.27.1 to 0.32.0, with a Kafka version upgrade from 3.0.0 to 3.3.1. Following the update we have seen significant performance degradation in the scraping of JMX metrics. Previously, scrapes would average around 10 seconds. Following the upgrade all scrapes are slower, with many timing out after 60 seconds.
To Reproduce
Steps to reproduce the behavior: run a Kafka cluster with JMX Prometheus Exporter metrics enabled on operator 0.27.1 / Kafka 3.0.0, upgrade to 0.32.0 / Kafka 3.3.1, and compare scrape_duration_seconds for the broker pods before and after.
Expected behavior
I would expect to see little difference in performance between versions, and certainly not a degradation to the point where scrapes are timing out.
Environment (please complete the following information):
YAML files and logs
YAML Attached
output.yaml.zip