[Bug]: Ask grafana questions for strimzi kafka exporter #8231

peter-hst · 2023-03-14T06:34:11Z

Bug Description

The Kafka cluster is deployed in the kafka namespace, and Prometheus and Grafana are deployed in the monitoring namespace (using kube-prometheus). I followed the official documentation, and everything is configured and running well. I imported five Grafana dashboards (strimzi-cruise-control.json, strimzi-kafka-exporter.json, strimzi-kafka.json, strimzi-zookeeper.json and strimzi-operators.json). The problem is that the strimzi-kafka-exporter.json dashboard displays incorrect data, while the other four dashboards work fine.

Is there any solution? Thank you very much!
Due to policy requirements, Kafka and the monitoring stack must be deployed in two separate namespaces. (When everything is deployed in the single kafka namespace, the dashboards work fine.)
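
One way to narrow this down is to check which labels the Kafka Exporter series actually carry in Prometheus. The sketch below is only an illustration under assumptions that are not stated in the post: that the kube-prometheus Service is reachable through a port-forward on localhost:9090, and that the dashboard filters on labels such as namespace and strimzi_io_cluster (worth confirming against the dashboard JSON). The helper names are hypothetical.

```shell
# Assumption: Prometheus is reachable here, for example after
#   kubectl -n monitoring port-forward svc/prometheus-k8s 9090
PROM_URL="${PROM_URL:-http://localhost:9090}"

# PromQL that groups one Kafka Exporter metric by the labels
# that the PodMonitor relabelings are expected to add.
exporter_label_query() {
  printf '%s' 'count by (namespace, strimzi_io_cluster) (kafka_topic_partition_current_offset)'
}

# Send the query to the Prometheus HTTP API and print the raw JSON response.
check_exporter_labels() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$(exporter_label_query)"
}
```

Calling check_exporter_labels with the port-forward running returns the series grouped by those two labels. An empty result would suggest the exporter metrics are missing or labelled differently from what the dashboard expects.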

Here is the Kafka topic list:

Steps to reproduce

git clone https://github.com/prometheus-operator/kube-prometheus.git

sed -i 's/24h/15d/' kube-prometheus/manifests/setup/0thanosrulerCustomResourceDefinition.yaml
sed -i 's/replicas: 2$/replicas: 1/g' kube-prometheus/manifests/prometheus-prometheus.yaml kube-prometheus/manifests/prometheusAdapter-deployment.yaml

# install the kube-prometheus operator's CRDs
kubectl apply -f kube-prometheus/manifests/setup/

sed -i 's/namespace: myproject/namespace: monitoring/g' strimzi-0.32.0/examples/metrics/prometheus-install/prometheus.yaml && sed -i 's/prometheus-server/prometheus-k8s/' strimzi-0.32.0/examples/metrics/prometheus-install/prometheus.yaml

sed -i 's/myproject/kafka/g' strimzi-0.32.0/examples/metrics/prometheus-install/strimzi-pod-monitor.yaml

We merged strimzi-0.32.0\examples\metrics\prometheus-install\prometheus.yaml into the kube-prometheus\manifests\prometheus-prometheus.yaml configuration, together with the ClusterRole and ClusterRoleBinding.

kubectl apply -f .\strimzi-0.32.0\examples\metrics\prometheus-additional-properties\ -n monitoring && kubectl apply -f .\strimzi-0.32.0\examples\metrics\prometheus-alertmanager-config\ -n monitoring

kubectl apply -f .\strimzi-0.32.0\examples\metrics\prometheus-install\strimzi-pod-monitor.yaml -n monitoring
kubectl apply -f .\strimzi-0.32.0\examples\metrics\prometheus-install\alert-manager.yaml -n monitoring
kubectl apply -f .\strimzi-0.32.0\examples\metrics\prometheus-install\prometheus-rules.yaml -n monitoring

# install the kube-prometheus instance
kubectl create -f kube-prometheus/manifests
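
Because the sed commands above edit the manifests in place, it can help to try the same expressions on a scratch file first. This is a minimal sketch: the scratch file contents are made up for illustration and are not taken from the real kube-prometheus manifests.

```shell
# Scratch file standing in for a manifest; the contents are illustrative only.
scratch="$(mktemp)"
printf 'spec:\n  replicas: 2\n  retention: 24h\n' > "$scratch"

# Same sed expressions as in the steps above, applied to the scratch file.
sed -i 's/replicas: 2$/replicas: 1/g' "$scratch"
sed -i 's/24h/15d/' "$scratch"

# Print the result so the change can be reviewed before touching real files.
cat "$scratch"
```

The `$` anchor in the first expression keeps it from rewriting lines where `replicas: 2` is followed by more text.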

Expected behavior

No response

Strimzi version

0.32.0

Kubernetes version

Kubernetes 1.23.13

Installation method

YAML files

Infrastructure

Kubernetes Native

Configuration files and logs

Here is the Kafka cluster instance YAML file:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: qa-broker # kafka cluster instance name
  namespace: kafka
spec:
  kafka:
    rack:
      topologyKey: "app"
    template:
      pod:
        metadata:
          labels:
            app: kafka
        # affinity:
        # nodeAffinity:
        #    requiredDuringSchedulingIgnoredDuringExecution:
        #       nodeSelectorTerms:
        #      - matchExpressions:
        #        - key: app
        #          operator: In
        #          values:
        #          - kafka
        tolerations:
          - key: app
            operator: "Equal"
            value: kafka
    version: 3.3.1
    replicas: 3
    resources:
      requests:
        memory: 2Gi # recommend 4G in prod, prod work node config: 4 core, 8G RAM
        cpu: 500m # recommend 500m in prod, prod work node config: 4 core, 8G RAM
      limits:
        memory: 2Gi
        cpu: 1000m # recommend 2600m in prod, prod work node config: 4 core, 8G RAM
    jvmOptions:
      -Xms: 1536m
      -Xmx: 1536m
      gcLoggingEnabled: false
    jmxOptions:
      authentication:
        type: "password"
    readinessProbe:
      initialDelaySeconds: 25
      timeoutSeconds: 5
    livenessProbe:
      initialDelaySeconds: 25
      timeoutSeconds: 5
    listeners:
      - name: plain
        type: internal
        port: 9092
        tls: false
      - name: external
        port: 9094
        type: ingress
        tls: true
        authentication:
          type: scram-sha-512        
        configuration:
          bootstrap:
            host: kafka-bootstrap-qa.test.only.com
            annotations:
              external-dns.alpha.kubernetes.io/hostname: kafka-bootstrap-qa.test.only.com.
              external-dns.alpha.kubernetes.io/ttl: "60"
          brokers:
          - broker: 0
            host: kafka-broker-qa-0.test.only.com
            annotations:
              external-dns.alpha.kubernetes.io/hostname: kafka-broker-qa-0.test.only.com.
              external-dns.alpha.kubernetes.io/ttl: "60"
          - broker: 1
            host: kafka-broker-qa-1.test.only.com
            annotations:
              external-dns.alpha.kubernetes.io/hostname: kafka-broker-qa-1.test.only.com.
              external-dns.alpha.kubernetes.io/ttl: "60"
          - broker: 2
            host: kafka-broker-qa-2.test.only.com
            annotations:
              external-dns.alpha.kubernetes.io/hostname: kafka-broker-qa-2.test.only.com.
              external-dns.alpha.kubernetes.io/ttl: "60"
    authorization:
      type: simple   
      superUsers:
        - admin  
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 1
      default.replication.factor: 3
      min.insync.replicas: 2
      log.retention.hours: 168
      offsets.retention.minutes: 43800
      num.partitions: 6
      auto.create.topics.enable: false
      unclean.leader.election.enable: false
      auto.leader.rebalance.enable: false
      inter.broker.protocol.version: "3.3"
    storage:
      type: jbod
      volumes:
      - id: 0
        type: persistent-claim
        size: 16Gi
        class: default
        deleteClaim: false
      - id: 1
        type: persistent-claim
        size: 16Gi
        class: default
        deleteClaim: false        
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
  zookeeper:
    template:
      pod:
        topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  app: kafka
              maxSkew: 1
              topologyKey: app
              whenUnsatisfiable: ScheduleAnyway      
        metadata:
          labels:
            app: kafka
       # affinity:
       #   nodeAffinity:
       #     requiredDuringSchedulingIgnoredDuringExecution:
       #       nodeSelectorTerms:
       #       - matchExpressions:
       #         - key: app
       #           operator: In
       #           values:
       #           - kafka
        tolerations:
          - key: app
            operator: "Equal"
            value: kafka
    replicas: 3
    storage:
      type: persistent-claim
      class: default
      size: 10Gi
      deleteClaim: false
    jmxOptions:
      authentication:
        type: "password"
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafkaExporter:
    topicRegex: ".*"
    groupRegex: ".*"
  cruiseControl:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: cruise-control-metrics
          key: metrics-config.yml
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: cruise-control-metrics
  labels:
    app: strimzi
data:
  metrics-config.yml: |
    # See https://github.com/prometheus/jmx_exporter for more info about JMX Prometheus Exporter metrics
    lowercaseOutputName: true
    rules:
    - pattern: kafka.cruisecontrol<name=(.+)><>(\w+)
      name: kafka_cruisecontrol_$1_$2
      type: GAUGE
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kafka-metrics
  labels:
    app: strimzi
data:
  kafka-metrics-config.yml: |
    # See https://github.com/prometheus/jmx_exporter for more info about JMX Prometheus Exporter metrics
    lowercaseOutputName: true
    rules:
    # Special cases and very specific rules
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
       clientId: "$3"
       topic: "$4"
       partition: "$5"
    - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
      name: kafka_server_$1_$2
      type: GAUGE
      labels:
       clientId: "$3"
       broker: "$4:$5"
    - pattern: kafka.server<type=(.+), cipher=(.+), protocol=(.+), listener=(.+), networkProcessor=(.+)><>connections
      name: kafka_server_$1_connections_tls_info
      type: GAUGE
      labels:
        cipher: "$2"
        protocol: "$3"
        listener: "$4"
        networkProcessor: "$5"
    - pattern: kafka.server<type=(.+), clientSoftwareName=(.+), clientSoftwareVersion=(.+), listener=(.+), networkProcessor=(.+)><>connections
      name: kafka_server_$1_connections_software
      type: GAUGE
      labels:
        clientSoftwareName: "$2"
        clientSoftwareVersion: "$3"
        listener: "$4"
        networkProcessor: "$5"
    - pattern: "kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+):"
      name: kafka_server_$1_$4
      type: GAUGE
      labels:
       listener: "$2"
       networkProcessor: "$3"
    - pattern: kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+)
      name: kafka_server_$1_$4
      type: GAUGE
      labels:
       listener: "$2"
       networkProcessor: "$3"
    # Some percent metrics use MeanRate attribute
    # Ex) kafka.server<type=(KafkaRequestHandlerPool), name=(RequestHandlerAvgIdlePercent)><>MeanRate
    - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>MeanRate
      name: kafka_$1_$2_$3_percent
      type: GAUGE
    # Generic gauges for percents
    - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>Value
      name: kafka_$1_$2_$3_percent
      type: GAUGE
    - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*, (.+)=(.+)><>Value
      name: kafka_$1_$2_$3_percent
      type: GAUGE
      labels:
        "$4": "$5"
    # Generic per-second counters with 0-2 key/value pairs
    - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+), (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_total
      type: COUNTER
      labels:
        "$4": "$5"
        "$6": "$7"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_total
      type: COUNTER
      labels:
        "$4": "$5"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*><>Count
      name: kafka_$1_$2_$3_total
      type: COUNTER
    # Generic gauges with 0-2 key/value pairs
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Value
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
        "$6": "$7"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Value
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value
      name: kafka_$1_$2_$3
      type: GAUGE
    # Emulate Prometheus 'Summary' metrics for the exported 'Histogram's.
    # Note that these are missing the '_sum' metric!
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_count
      type: COUNTER
      labels:
        "$4": "$5"
        "$6": "$7"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*), (.+)=(.+)><>(\d+)thPercentile
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
        "$6": "$7"
        quantile: "0.$8"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Count
      name: kafka_$1_$2_$3_count
      type: COUNTER
      labels:
        "$4": "$5"
    - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*)><>(\d+)thPercentile
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        "$4": "$5"
        quantile: "0.$6"
    - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Count
      name: kafka_$1_$2_$3_count
      type: COUNTER
    - pattern: kafka.(\w+)<type=(.+), name=(.+)><>(\d+)thPercentile
      name: kafka_$1_$2_$3
      type: GAUGE
      labels:
        quantile: "0.$4"
  zookeeper-metrics-config.yml: |
    # See https://github.com/prometheus/jmx_exporter for more info about JMX Prometheus Exporter metrics
    lowercaseOutputName: true
    rules:
    # replicated Zookeeper
    - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+)><>(\\w+)"
      name: "zookeeper_$2"
      type: GAUGE
    - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+)><>(\\w+)"
      name: "zookeeper_$3"
      type: GAUGE
      labels:
        replicaId: "$2"
    - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+)><>(Packets\\w+)"
      name: "zookeeper_$4"
      type: COUNTER
      labels:
        replicaId: "$2"
        memberType: "$3"
    - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+)><>(\\w+)"
      name: "zookeeper_$4"
      type: GAUGE
      labels:
        replicaId: "$2"
        memberType: "$3"
    - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+), name3=(\\w+)><>(\\w+)"
      name: "zookeeper_$4_$5"
      type: GAUGE
      labels:
        replicaId: "$2"
        memberType: "$3"
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: qa-rebalance
  labels:
    strimzi.io/cluster: qa-broker # kafka cluster instance name, must be same
spec:
  goals:
    - CpuCapacityGoal
    - NetworkInboundCapacityGoal
    - DiskCapacityGoal
    - RackAwareGoal
    - MinTopicLeadersPerBrokerGoal
    - NetworkOutboundCapacityGoal
    - ReplicaCapacityGoal

The merged kube-prometheus YAML files:

#  the kube-prometheus/manifests/prometheus-clusterRole.yaml file
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.42.0
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs:
  - /metrics
  verbs:
  - get

# kube-prometheus/manifests/prometheus-prometheus.yaml file
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.42.0
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
    - namespace: monitoring
      name: alertmanager
      port: alertmanager
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.42.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: k8s
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.42.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 1024Mi
  retention: 15d
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      role: alert-rules
      app: strimzi
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.42.0
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml

# prometheus-serviceAccount.yaml file
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.42.0
    app: strimzi
  name: prometheus-k8s
  namespace: monitoring

# prometheusAdapter-apiService.yaml file
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.10.0
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  #insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter
    namespace: monitoring
  version: v1beta1
  versionPriority: 100

# prometheusAdapter-clusterRoleBinding.yaml file
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.10.0
    app: strimzi
  name: prometheus-adapter
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-adapter
subjects:
- kind: ServiceAccount
  name: prometheus-adapter
  namespace: monitoring

Additional context

No response

scholzj · 2023-03-14T09:05:09Z

Maintainer

TBH, I'm not sure I follow what the actual issue is. Having Prometheus and Kafka in different namespaces works completely fine without any issues.


peter-hst Mar 14, 2023
Author

Here is the target info from Prometheus. All targets are in the UP state.

peter-hst Mar 14, 2023
Author

# strimzi-0.32.0/examples/metrics/prometheus-install/strimzi-pod-monitor.yaml file
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: cluster-operator-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      strimzi.io/kind: cluster-operator
  namespaceSelector:
    matchNames:
      - kafka
  podMetricsEndpoints:
  - path: /metrics
    port: http
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: entity-operator-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: entity-operator
  namespaceSelector:
    matchNames:
      - kafka
  podMetricsEndpoints:
  - path: /metrics
    port: healthcheck
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: bridge-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      strimzi.io/kind: KafkaBridge
  namespaceSelector:
    matchNames:
      - kafka
  podMetricsEndpoints:
  - path: /metrics
    port: rest-api
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-resources-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchExpressions:
      - key: "strimzi.io/kind"
        operator: In
        values: ["Kafka", "KafkaConnect", "KafkaMirrorMaker", "KafkaMirrorMaker2"]
  namespaceSelector:
    matchNames:
      - kafka
  podMetricsEndpoints:
  - path: /metrics
    port: tcp-prometheus
    relabelings:
    - separator: ;
      regex: __meta_kubernetes_pod_label_(strimzi_io_.+)
      replacement: $1
      action: labelmap
    - sourceLabels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      targetLabel: namespace
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      targetLabel: kubernetes_pod_name
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      separator: ;
      regex: (.*)
      targetLabel: node_name
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_host_ip]
      separator: ;
      regex: (.*)
      targetLabel: node_ip
      replacement: $1
      action: replace
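
The first relabeling in kafka-resources-metrics uses the labelmap action, which copies every discovered pod label whose name matches the regex to a new label named after the captured group. The sketch below imitates only the name rewrite with sed; the helper name is hypothetical, and the regex is anchored explicitly because Prometheus anchors relabel regexes.

```shell
# Emulate the name rewrite of the labelmap rule:
# print the captured group for a matching label name, nothing otherwise.
labelmap_name() {
  printf '%s\n' "$1" | sed -nE 's/^__meta_kubernetes_pod_label_(strimzi_io_.+)$/\1/p'
}

labelmap_name '__meta_kubernetes_pod_label_strimzi_io_cluster'   # strimzi_io_cluster
labelmap_name '__meta_kubernetes_pod_label_app'                  # prints nothing
```

So a pod label such as strimzi.io/cluster should end up on the scraped series as strimzi_io_cluster, while labels outside the strimzi.io prefix are not copied by this rule.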

# strimzi-0.32.0/examples/metrics/prometheus-install/prometheus-rules.yaml file
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
    app: strimzi
  name: prometheus-k8s-rules
spec:
  groups:
  - name: kafka
    rules:
    - alert: KafkaRunningOutOfSpace
      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka is running out of free disk space'
        description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
    - alert: UnderReplicatedPartitions
      expr: kafka_server_replicamanager_underreplicatedpartitions > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka under replicated partitions'
        description: 'There are {{ $value }} under replicated partitions on {{ $labels.kubernetes_pod_name }}'
    - alert: AbnormalControllerState
      expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka abnormal controller state'
        description: 'There are {{ $value }} active controllers in the cluster'
    - alert: OfflinePartitions
      expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka offline partitions'
        description: 'One or more partitions have no leader'
    - alert: UnderMinIsrPartitionCount
      expr: kafka_server_replicamanager_underminisrpartitioncount > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka under min ISR partitions'
        description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.kubernetes_pod_name }}'
    - alert: OfflineLogDirectoryCount
      expr: kafka_log_logmanager_offlinelogdirectorycount > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka offline log directories'
        description: 'There are {{ $value }} offline log directories on {{ $labels.kubernetes_pod_name }}'
    - alert: ScrapeProblem
      expr: up{kubernetes_namespace!~"openshift-.+",kubernetes_pod_name=~".+-kafka-[0-9]+"} == 0
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'Prometheus unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }}'
        description: 'Prometheus was unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }} for more than 3 minutes'
    - alert: ClusterOperatorContainerDown
      expr: count((container_last_seen{container="strimzi-cluster-operator"} > (time() - 90))) < 1 or absent(container_last_seen{container="strimzi-cluster-operator"})
      for: 1m
      labels:
        severity: major
      annotations:
        summary: 'Cluster Operator down'
        description: 'The Cluster Operator has been down for longer than 90 seconds'
    - alert: KafkaBrokerContainersDown
      expr: absent(container_last_seen{container="kafka",pod=~".+-kafka-[0-9]+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'All `kafka` containers down or in CrashLookBackOff status'
        description: 'All `kafka` containers have been down or in CrashLookBackOff status for 3 minutes'
    - alert: KafkaContainerRestartedInTheLast5Minutes
      expr: count(count_over_time(container_last_seen{container="kafka"}[5m])) > 2 * count(container_last_seen{container="kafka",pod=~".+-kafka-[0-9]+"})
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'One or more Kafka containers restarted too often'
        description: 'One or more Kafka containers were restarted too often within the last 5 minutes'
  - name: zookeeper
    rules:
    - alert: AvgRequestLatency
      expr: zookeeper_avgrequestlatency > 10
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Zookeeper average request latency'
        description: 'The average request latency is {{ $value }} on {{ $labels.kubernetes_pod_name }}'
    - alert: OutstandingRequests
      expr: zookeeper_outstandingrequests > 10
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Zookeeper outstanding requests'
        description: 'There are {{ $value }} outstanding requests on {{ $labels.kubernetes_pod_name }}'
    - alert: ZookeeperRunningOutOfSpace
      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Zookeeper is running out of free disk space'
        description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'
    - alert: ZookeeperContainerRestartedInTheLast5Minutes
      expr: count(count_over_time(container_last_seen{container="zookeeper"}[5m])) > 2 * count(container_last_seen{container="zookeeper",pod=~".+-zookeeper-[0-9]+"})
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'One or more Zookeeper containers were restarted too often'
        description: 'One or more Zookeeper containers were restarted too often within the last 5 minutes. This alert can be ignored when the Zookeeper cluster is scaling up'
    - alert: ZookeeperContainersDown
      expr: absent(container_last_seen{container="zookeeper",pod=~".+-zookeeper-[0-9]+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'All `zookeeper` containers in the Zookeeper pods down or in CrashLookBackOff status'
        description: 'All `zookeeper` containers in the Zookeeper pods have been down or in CrashLookBackOff status for 3 minutes'
  - name: entityOperator
    rules:
    - alert: TopicOperatorContainerDown
      expr: absent(container_last_seen{container="topic-operator",pod=~".+-entity-operator-.+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'Container topic-operator in Entity Operator pod down or in CrashLookBackOff status'
        description: 'Container topic-operator in Entity Operator pod has been or in CrashLookBackOff status for 3 minutes'
    - alert: UserOperatorContainerDown
      expr: absent(container_last_seen{container="user-operator",pod=~".+-entity-operator-.+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'Container user-operator in Entity Operator pod down or in CrashLookBackOff status'
        description: 'Container user-operator in Entity Operator pod have been down or in CrashLookBackOff status for 3 minutes'
    - alert: EntityOperatorTlsSidecarContainerDown
      expr: absent(container_last_seen{container="tls-sidecar",pod=~".+-entity-operator-.+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'Container tls-sidecar Entity Operator pod down or in CrashLookBackOff status'
        description: 'Container tls-sidecar in Entity Operator pod have been down or in CrashLookBackOff status for 3 minutes'
  - name: connect
    rules:
    - alert: ConnectContainersDown
      expr: absent(container_last_seen{container=~".+-connect",pod=~".+-connect-.+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'All Kafka Connect containers down or in CrashLookBackOff status'
        description: 'All Kafka Connect containers have been down or in CrashLookBackOff status for 3 minutes'
  - name: bridge
    rules:
    - alert: BridgeContainersDown
      expr: absent(container_last_seen{container=~".+-bridge",pod=~".+-bridge-.+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'All Kafka Bridge containers down or in CrashLookBackOff status'
        description: 'All Kafka Bridge containers have been down or in CrashLookBackOff status for 3 minutes'
    - alert: AvgProducerLatency
      expr: strimzi_bridge_kafka_producer_request_latency_avg > 10
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge average consumer fetch latency'
        description: 'The average fetch latency is {{ $value }} on {{ $labels.clientId }}'
    - alert: AvgConsumerFetchLatency
      expr: strimzi_bridge_kafka_consumer_fetch_latency_avg > 500
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge consumer average fetch latency'
        description: 'The average consumer commit latency is {{ $value }} on {{ $labels.clientId }}'
    - alert: AvgConsumerCommitLatency
      expr: strimzi_bridge_kafka_consumer_commit_latency_avg > 200
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge consumer average commit latency'
        description: 'The average consumer commit latency is {{ $value }} on {{ $labels.clientId }}'
    - alert: Http4xxErrorRate
      expr: strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge", path !="/favicon.ico"} > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge returns code 4xx too often'
        description: 'Kafka Bridge returns code 4xx too much ({{ $value }}) for the path {{ $labels.path }}'
    - alert: Http5xxErrorRate
      expr: strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"} > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: 'Kafka Bridge returns code 5xx too often'
        description: 'Kafka Bridge returns code 5xx too much ({{ $value }}) for the path {{ $labels.path }}'
  - name: mirrorMaker
    rules:
    - alert: MirrorMakerContainerDown
      expr: absent(container_last_seen{container=~".+-mirror-maker",pod=~".+-mirror-maker-.+"})
      for: 3m
      labels:
        severity: major
      annotations:
        summary: 'All Kafka Mirror Maker containers down or in CrashLookBackOff status'
        description: 'All Kafka Mirror Maker containers have been down or in CrashLookBackOff status for 3 minutes'
  - name: kafkaExporter
    rules:
    - alert: UnderReplicatedPartition
      expr: kafka_topic_partition_under_replicated_partition > 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Topic has under-replicated partitions'
        description: 'Topic  {{ $labels.topic }} has {{ $value }} under-replicated partition {{ $labels.partition }}'
    - alert: TooLargeConsumerGroupLag
      expr: kafka_consumergroup_lag > 1000
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Consumer group lag is too big'
        description: 'Consumer group {{ $labels.consumergroup}} lag is too big ({{ $value }}) on topic {{ $labels.topic }}/partition {{ $labels.partition }}'
    - alert: NoMessageForTooLong
      expr: changes(kafka_topic_partition_current_offset{topic!="__consumer_offsets"}[10m]) == 0
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'No message for 10 minutes'
        description: 'There is no messages in topic {{ $labels.topic}}/partition {{ $labels.partition }} for 10 minutes'
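
The KafkaRunningOutOfSpace rule above selects volumes by PVC name, so it only fires for names that match its regex. The sketch below checks a few names with grep. The PVC names are hypothetical and assume the data-<volume id>-<cluster>-kafka-<broker> naming for JBOD volumes of the qa-broker cluster, which should be confirmed with kubectl get pvc -n kafka.

```shell
# Regex copied from the KafkaRunningOutOfSpace rule.
pvc_regex='data(-[0-9]+)?-(.+)-kafka-[0-9]+'

# Prometheus fully anchors =~ matchers, so grep -x matches the whole name.
matches_kafka_pvc() {
  printf '%s\n' "$1" | grep -Eqx "$pvc_regex"
}

# Hypothetical PVC names used only for illustration.
for pvc in data-0-qa-broker-kafka-0 data-1-qa-broker-kafka-2 data-qa-broker-zookeeper-0; do
  if matches_kafka_pvc "$pvc"; then echo "$pvc: match"; else echo "$pvc: no match"; fi
done
```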

# strimzi-0.32.0/examples/metrics/prometheus-additional-properties/prometheus-additional.yaml file
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
type: Opaque
stringData:
  prometheus-additional.yaml: |
    - job_name: kubernetes-cadvisor
      honor_labels: true
      scrape_interval: 10s
      scrape_timeout: 10s
      metrics_path: /metrics/cadvisor
      scheme: https
      kubernetes_sd_configs:
      - role: node
        namespaces:
          names: []
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      relabel_configs:
      - separator: ;
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $1
        action: labelmap
      - separator: ;
        regex: (.*)
        target_label: __address__
        replacement: kubernetes.default.svc:443
        action: replace
      - source_labels: [__meta_kubernetes_node_name]
        separator: ;
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        action: replace
      - source_labels: [__meta_kubernetes_node_name]
        separator: ;
        regex: (.*)
        target_label: node_name
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        separator: ;
        regex: (.*)
        target_label: node_ip
        replacement: $1
        action: replace
      metric_relabel_configs:
      - source_labels: [container, __name__]
        separator: ;
        regex: POD;container_(network).*
        target_label: container
        replacement: $1
        action: replace
      - source_labels: [container]
        separator: ;
        regex: POD
        replacement: $1
        action: drop
      - source_labels: [container]
        separator: ;
        regex: ^$
        replacement: $1
        action: drop
      - source_labels: [__name__]
        separator: ;
        regex: container_(network_tcp_usage_total|tasks_state|memory_failures_total|network_udp_usage_total)
        replacement: $1
        action: drop

    - job_name: kubernetes-nodes-kubelet
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https
      kubernetes_sd_configs:
      - role: node
        namespaces:
          names: []
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

peter-hst Mar 14, 2023
Author

Could it be a problem with the filter conditions of the PromQL statements defined in the strimzi-kafka-exporter.json file?

scholzj Mar 14, 2023
Maintainer

I have no idea what you mean by that. But I think it works fine for everyone else - or at least it works for me and I'm not aware of anyone commenting about something being wrong.

peter-hst Mar 17, 2023
Author

> I have no idea what you mean by that. But I think it works fine for everyone else - or at least it works for me and I'm not aware of anyone commenting about something being wrong.

Hello Bro, this issue has been resolved and our Strimzi-kafka cluster is now deployed to production.

The fix had nothing to do with the Strimzi configuration. We modified the datasource settings in the kube-prometheus/manifests/grafana-dashboardDatasources.yaml file, and after that the dashboard worked fine.

# kube-prometheus/manifests/grafana-dashboardDatasources.yaml file
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 8.5.5
  name: grafana-datasources
  namespace: monitoring
stringData:
  datasources.yaml: |-
    {
        "apiVersion": 1,
        "datasources": [
            {
                "access": "proxy",
                "name": "prometheus",
                "orgId": 1,
                "type": "prometheus",
                "url": "http://prometheus-operated:9090",
                "version": 1,
                "jsonData": { "httpMethod": "POST" },
                "isDefault": true
            }
        ]
    }
type: Opaque
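
The post does not say which datasource fields were changed, so the following is only a minimal Python sketch for sanity-checking the JSON embedded under `datasources.yaml` before applying the Secret. The function name `check_datasources` and the two checks it performs (exactly one default datasource, an http(s) URL on every Prometheus datasource) are assumptions of this sketch, not Grafana's own validation rules.

```python
import json

# The JSON document embedded under stringData["datasources.yaml"] in the Secret above.
DATASOURCES_JSON = """
{
    "apiVersion": 1,
    "datasources": [
        {
            "access": "proxy",
            "name": "prometheus",
            "orgId": 1,
            "type": "prometheus",
            "url": "http://prometheus-operated:9090",
            "version": 1,
            "jsonData": { "httpMethod": "POST" },
            "isDefault": true
        }
    ]
}
"""


def check_datasources(raw):
    """Return a list of problems found in a datasource provisioning document."""
    problems = []
    doc = json.loads(raw)  # raises ValueError if the JSON is malformed
    sources = doc.get("datasources", [])
    # Assumption of this sketch: dashboards rely on exactly one default datasource.
    defaults = [s.get("name") for s in sources if s.get("isDefault")]
    if len(defaults) != 1:
        problems.append(f"expected exactly one default datasource, found {len(defaults)}")
    # Assumption of this sketch: every Prometheus datasource needs an http(s) URL.
    for s in sources:
        url = s.get("url", "")
        if s.get("type") == "prometheus" and not url.startswith(("http://", "https://")):
            problems.append(f"datasource {s.get('name')!r} has no http(s) URL")
    return problems


print(check_datasources(DATASOURCES_JSON))  # []
```

Note that the short host name `prometheus-operated` in the URL only resolves from inside the namespace where the Prometheus pods run, which matches this setup because Grafana and Prometheus are both deployed in `monitoring`.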
