Replies: 2 comments 1 reply
-
Can you please share the full YAMLs (both for the Kafka custom resources and for the ConfigMap with the metrics configuration)? I tried to reproduce it a few times, but it always seems to work as expected for me without any issues. As for the log from your AKS environment, it suggests that the reconciliation of the Kafka clusters got stuck for some reason:
That would be some kind of bug, but it starts before the log begins, so it is hard to say what the cause is. In this case, the operator simply did not see the change because it was stuck, so it basically ignored it, and the restart helped it recover. This is essentially a bug, but hard to track down without detailed logs :-/. Note that it would ignore any changes you made - not just the metrics. The GKE log with the infinite rolling restart is more interesting. If you can reproduce it, can you capture the Pod YAMLs or the StrimziPodSet resources between the different restarts (either ...
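A minimal sketch of how those could be captured between restarts, assuming the cluster is named kafka-cluster in the kafka namespace (both taken from the operator log line quoted in the report; the label selector is an assumption):

```sh
# Dump the ZooKeeper Pods and the StrimziPodSet after each roll; repeat between
# restarts so the differences can be compared. Names and namespace are illustrative.
kubectl get pods -n kafka -l strimzi.io/name=kafka-cluster-zookeeper -o yaml > zk-pods-$(date +%s).yaml
kubectl get strimzipodset -n kafka kafka-cluster-zookeeper -o yaml > zk-podset-$(date +%s).yaml
```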
-
Hi @scholzj I've attached the YAML, but it's a pretty orthodox one.
Are you suggesting changing the log level of the operator?
I thought so, and will try to get that info in the next release, but unfortunately can't guarantee it.
-
Describe the bug
I've added a change enabling ZooKeeper metrics to existing clusters' Kafka CRD manifests and applied it to several of our clusters, including AKS, GKE, and OpenShift.
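The exact manifest isn't reproduced here; a minimal sketch of this kind of change, assuming metrics are enabled through a metricsConfig block of type jmxPrometheusExporter referencing a ConfigMap (the ConfigMap name and key are illustrative; the cluster name and namespace are taken from the operator log line below):

```yaml
# Illustrative fragment only, not the full Kafka custom resource.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-cluster
  namespace: kafka
spec:
  zookeeper:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics                 # assumed ConfigMap name
          key: zookeeper-metrics-config.yml   # assumed key
---
# The referenced ConfigMap with the JMX Prometheus Exporter configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka
data:
  zookeeper-metrics-config.yml: |
    lowercaseOutputName: true
    rules: []
```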
Then I observed two different kinds of failure behavior, as well as cases where the change was applied successfully.
In both failure cases, the change was present when inspecting the Kafka custom resource.
In the first failure case, a restart of the ZooKeeper pods was triggered, but the env var ZOOKEEPER_METRICS_ENABLED was not set to true, and thus metrics were not enabled. The ZooKeeper pods seemed to be rolling-restarted infinitely, but ZOOKEEPER_METRICS_ENABLED was still not set to true.
In the other failure case, not even a ZooKeeper pod restart was triggered. I observed the log line "Kafka kafka-cluster in namespace kafka was MODIFIED" in the operator log, but nothing happened.
In both failure cases, after I killed the operator pod and let it restart, the ZooKeeper pods were restarted with ZOOKEEPER_METRICS_ENABLED set to true.
I asked for advice on this issue in the Strimzi Slack channel and was advised to report it here.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
ZooKeeper is restarted with the env var ZOOKEEPER_METRICS_ENABLED set to true, so that we can see the metrics.
Environment (please complete the following information):
YAML files and logs
logs.zip
I've added Strimzi operator log files from two different clusters.
The first one, strimzi-aks.log, is from an AKS cluster where no ZooKeeper restart happened.
The second one, strimzi-gke.log, is from a GKE cluster where a ZooKeeper restart did happen, but the env var ZOOKEEPER_METRICS_ENABLED was still not set to true.
For strimzi-aks.log, I applied the change at around 2022-12-15 02:58:20 and there are no log entries about ZooKeeper restarts.
For strimzi-gke.log, I applied the change at around 2022-12-13 00:57:09 and there are log entries about ZooKeeper restarts.
It is unlikely that the difference between AKS and GKE caused the different behaviors, because I also saw the failure with ZooKeeper restarts on another AKS cluster.
In addition, I saw some successful cases on GKE and OpenShift clusters.