I am trying to diagnose why some of my consumers are hanging and not processing messages, as shown in the plot below:
I think this might be related to the error that shows up when I describe my Kafka deployment (output of `kubectl describe kafka ...`):
When the lag happens in the consumer group, I am also getting these broker logs (the error starts at ~2022-06-10 02:02:52):
Brokers 2 and 3 show similar logs.
Kafka manifest:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: "icon-kafka-mainnet-v2-a"
  namespace: "icon-kafka-mainnet-v2"
  annotations:
    "consul.hashicorp.com/connect-inject": "true"
    "consul.hashicorp.com/connect-service-port": "9092"
spec:
  cruiseControl:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: cruise-control-metrics
          key: metrics-config.yml
  kafka:
    template:
      pod:
        priorityClassName: "system-cluster-critical"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: vke.vultr.com/node-pool
                      operator: In
                      values:
                        - "kafka-mainnet-v2"
                    - key: pv.capacity
                      operator: NotIn
                      values:
                        - full
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - "icon-kafka-mainnet-v2-a-kafka"
                topologyKey: "kubernetes.io/hostname"
    resources:
      requests:
        cpu: 100m
        memory: 5Gi
      limits:
        cpu: 2.5
        memory: 6.5Gi
    jvmOptions:
      "-XX":
        "MetaspaceSize": "96m"
        "UseG1GC": true
        "MaxGCPauseMillis": "20"
        "InitiatingHeapOccupancyPercent": "35"
        "G1HeapRegionSize": "16M"
        "MinMetaspaceFreeRatio": "50"
        "MaxMetaspaceFreeRatio": "80"
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: "icon-kafka-mainnet-v2-a-kafka-metrics"
          key: kafka-metrics-config.yml
    version: 3.1.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      replica.selector.class: org.apache.kafka.common.replica.RackAwareReplicaSelector
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      log.message.format.version: "2.8"
      inter.broker.protocol.version: "2.8"
      auto.create.topics.enable: "false"
      delete.topic.enabled: "true"
      message.max.bytes: "67109632"
      replica.fetch.max.bytes: "67109632"
      replica.fetch.response.max.bytes: "67109632"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: "200Gi"
          deleteClaim: true
          class: "vultr-block-storage-wait-retain"
        - id: 1
          type: persistent-claim
          size: "200Gi"
          deleteClaim: true
          class: "vultr-block-storage-wait-retain"
  zookeeper:
    template:
      pod:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: vke.vultr.com/node-pool
                      operator: In
                      values:
                        - "kafka-mainnet-v2"
                    - key: pv.capacity
                      operator: NotIn
                      values:
                        - full
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - "icon-kafka-mainnet-v2-a-zookeeper"
                topologyKey: "kubernetes.io/hostname"
    resources:
      requests:
        cpu: 100m
        memory: 250Mi
      limits:
        cpu: 500m
        memory: 1Gi
    replicas: 3
    storage:
      type: persistent-claim
      size: "10Gi"
      deleteClaim: true
      class: "vultr-block-storage-wait-retain"
  kafkaExporter:
    topicRegex: ".*"
    groupRegex: ".*"
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

And finally, the topic manifest:
```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: "blocks-mainnet-v2"
  namespace: "icon-kafka-mainnet-v2"
  labels:
    strimzi.io/cluster: "icon-kafka-mainnet-v2-a"
spec:
  replicas: 3
  partitions: 24
  config:
    retention.ms: "-1"
    retention.bytes: "-1"
    segment.bytes: 1073741824
    cleanup.policy: compact
    min.cleanable.dirty.ratio: 0.5
    max.message.bytes: 67109632 # 1048588 (default, ~1 MB) * 64
```

Any input is much appreciated. Happy to follow up with any other information.
-
Kafka does not handle running out of disk space well, so if that happened, the brokers will get into error states. The easiest way to get out of it is either to increase the storage (which, however, usually cannot be decreased again later) or to delete some of the older segments (assuming the "No space left on device" error really comes from Kafka).
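For the "increase the storage" route, with Strimzi that means raising the `size` of the JBOD volumes in the Kafka custom resource; the operator then grows the underlying PVCs, provided the storage class supports volume expansion. A minimal sketch against the manifest above (the `300Gi` value is only an example, and it assumes the `vultr-block-storage-wait-retain` class allows expansion):
```yaml
# Minimal sketch: only the storage section of spec.kafka changes.
# Assumes the StorageClass has allowVolumeExpansion: true; 300Gi is an
# arbitrary example target, not a recommendation.
spec:
  kafka:
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: "300Gi"   # was 200Gi
          deleteClaim: true
          class: "vultr-block-storage-wait-retain"
        - id: 1
          type: persistent-claim
          size: "300Gi"   # was 200Gi
          deleteClaim: true
          class: "vultr-block-storage-wait-retain"
```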
-
@scholzj - Thank you very much for the answer. I ended up redeploying the cluster, which made the "No space left on device" error go away, but I am still seeing the same behavior with the consumers hanging. I am, however, getting a warning in the Kafka resource:
Could this be related, or is there anything else I should be looking for in my config? Thank you for your time.
-
Just following up on this: the problem of some consumer lag has not completely gone away, but it is now nowhere near as bad as it was before. I am not sure which of the configuration settings you recommended, @scholzj, had the most impact, but one of them from kafka-persistent.yaml made a difference. Thanks again for your help.
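For reference, the broker `config` block in Strimzi's kafka-persistent.yaml example includes a couple of defaults that the manifest posted above did not set. Which of the recommended settings actually made the difference is not known here, so treat this only as a sketch of the kind of delta involved:
```yaml
# Sketch of the broker config block from the Strimzi kafka-persistent.yaml
# example; the two settings marked below were not present in the manifest
# posted above. This is not a claim about which setting reduced the lag.
config:
  offsets.topic.replication.factor: 3
  transaction.state.log.replication.factor: 3
  transaction.state.log.min.isr: 2
  default.replication.factor: 3   # not set in the original manifest
  min.insync.replicas: 2          # not set in the original manifest
```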