@@ -65,12 +65,23 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           severity: 'warning',
         },
         annotations: {
-          summary: 'Kafka ISR expansion rate is expanding.',
-          description: 'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} In-Sync Replica (ISR) is expanding by {{ $value }} per second. If a broker goes down, ISR for some of the partitions shrink. When that broker is up again, ISRs are expanded once the replicas are fully caught up. Other than that, the expected value for ISR expansion rate is 0. If ISR is expanding and shrinking frequently, adjust Allowed replica lag.'
-                       % [
-                         instanceLabel,
-                         groupLabel,
-                       ],
+          summary: 'Kafka ISR expansion detected.',
+          description: |||
+            Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has In-Sync Replica (ISR) expanding at {{ printf "%%.2f" $value }} per second.
+
+            ISR expansion typically occurs when a broker recovers and its replicas catch up to the leader. The expected steady-state value for ISR expansion rate is 0.
+
+            Frequent ISR expansion and shrinkage indicates instability and may suggest:
+            - Brokers frequently going offline/online
+            - Network connectivity issues
+            - Replica lag configuration too tight (adjust replica.lag.max.messages or replica.socket.timeout.ms)
+            - Insufficient broker resources causing replicas to fall behind
+
+            If this alert fires frequently without corresponding broker outages, investigate broker health and adjust replica lag settings.
+          ||| % [
+            instanceLabel,
+            groupLabel,
+          ],
         },
       },
       {
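The new descriptions all use the same jsonnet pattern: a ||| text block formatted with the % operator, where a doubled %% escapes any percent sign that must survive into the Go template used by Prometheus annotations. Below is a minimal sketch of how one of these blocks renders, assuming illustrative label names ('instance' and 'kafka_cluster') that are not taken from this diff.

// Sketch only; the instanceLabel and groupLabel values are assumptions for illustration.
local instanceLabel = 'instance';
local groupLabel = 'kafka_cluster';

{
  description: |||
    Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has In-Sync Replica (ISR) expanding at {{ printf "%%.2f" $value }} per second.
  ||| % [instanceLabel, groupLabel],
}

// After jsonnet's % substitution the annotation value becomes:
//   Kafka broker {{ $labels.instance }} in cluster {{ $labels.kafka_cluster }} has
//   In-Sync Replica (ISR) expanding at {{ printf "%.2f" $value }} per second.
// The doubled %% is required so jsonnet emits a literal % and the Go template
// still sees printf "%.2f".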
@@ -87,12 +98,27 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           severity: 'warning',
         },
         annotations: {
-          summary: 'Kafka ISR expansion rate is shrinking.',
-          description: 'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} In-Sync Replica (ISR) is shrinking by {{ $value }} per second. If a broker goes down, ISR for some of the partitions shrink. When that broker is up again, ISRs are expanded once the replicas are fully caught up. Other than that, the expected value for ISR shrink rate is 0. If ISR is expanding and shrinking frequently, adjust Allowed replica lag.'
-                       % [
-                         instanceLabel,
-                         groupLabel,
-                       ],
+          summary: 'Kafka ISR shrinkage detected.',
+          description: |||
+            Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has In-Sync Replica (ISR) shrinking at {{ printf "%%.2f" $value }} per second.
+
+            ISR shrinkage occurs when a replica falls too far behind the leader and is removed from the ISR set. This reduces fault tolerance as fewer replicas are in-sync.
+            The expected steady-state value for ISR shrink rate is 0.
+
+            Common causes include:
+            - Broker failures or restarts
+            - Network latency or connectivity issues
+            - Replica lag exceeding replica.lag.max.messages threshold
+            - Replica not contacting leader within replica.socket.timeout.ms
+            - Insufficient broker resources (CPU, disk I/O, memory)
+            - High producer throughput overwhelming broker capacity
+
+            If ISR is shrinking without corresponding expansion shortly after, investigate broker health, network connectivity, and resource utilization.
+            Consider adjusting replica.lag.max.messages or replica.socket.timeout.ms if shrinkage is frequent but brokers are healthy.
+          ||| % [
+            instanceLabel,
+            groupLabel,
+          ],
         },
       },
       {
@@ -109,9 +135,28 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           severity: 'critical',
         },
         annotations: {
-          summary: 'Kafka has offline partitons.',
-          description: 'Kafka cluster {{ $labels.%s }} has {{ $value }} offline partitions. After successful leader election, if the leader for partition dies, then the partition moves to the OfflinePartition state. Offline partitions are not available for reading and writing. Restart the brokers, if needed, and check the logs for errors.'
+            Offline partitions have no active leader, making them completely unavailable for both reads and writes. This directly impacts application functionality.
+
+            Common causes include:
+            - All replicas for the partition are down
+            - No in-sync replicas available for leader election
+            - Insufficient replica count for the replication factor
+
+            Immediate actions:
+            1. Check broker status - identify which brokers are down
+            2. Review broker logs for errors and exceptions
+            3. Restart failed brokers if needed
+            4. Verify ZooKeeper connectivity
+            5. Check for disk space or I/O issues on broker hosts
+            6. Monitor ISR status to ensure replicas are catching up
+
+            Until resolved, affected topics cannot serve traffic for these partitions.
+          ||| % groupLabel,
         },
       },
       {
@@ -128,12 +173,35 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           severity: 'critical',
         },
         annotations: {
-          summary: 'Kafka has under replicated partitons.',
-          description: 'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has {{ $value }} under replicated partitons'
-                       % [
-                         instanceLabel,
-                         groupLabel,
-                       ],
+          summary: 'Kafka has under-replicated partitions.',
+          description: |||
+            Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} under-replicated partitions.
+
+            Under-replicated partitions have fewer in-sync replicas (ISR) than the configured replication factor, reducing fault tolerance and risking data loss.
+
+            Impact:
+            - Reduced data durability (fewer backup copies)
+            - Increased risk of data loss if additional brokers fail
+            - Lower fault tolerance for partition availability
+
+            Common causes:
+            - Broker failures or network connectivity issues
+            - Brokers unable to keep up with replication (resource constraints)
+            - High producer throughput overwhelming replica capacity
+            - Disk I/O saturation on replica brokers
+            - Network partition between brokers
+
+            Actions:
+            1. Identify which brokers are lagging (check ISR for affected partitions)
+            2. Review broker resource utilization (CPU, memory, disk I/O)
+            3. Check network connectivity between brokers
+            4. Verify broker logs for replication errors
+            5. Consider adding broker capacity if consistently under-replicated
+            6. Reassign partitions if specific brokers are problematic
+          ||| % [
+            instanceLabel,
+            groupLabel,
+          ],
         },
       },
       {
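For context on how these multi-line annotations end up in a Prometheus rule file, here is a small self-contained sketch that serializes one rule to YAML with std.manifestYamlDoc. The alert name, expression, and label names below are placeholders for illustration and are not taken from this library; only the description formatting mirrors the diff.

// Placeholder names and expression; evaluate with 'jsonnet -S' to print YAML.
local instanceLabel = 'instance';
local groupLabel = 'kafka_cluster';

local exampleRule = {
  alert: 'KafkaUnderReplicatedPartitionsExample',
  expr: 'example_under_replicated_partitions > 0',
  'for': '5m',
  labels: { severity: 'critical' },
  annotations: {
    summary: 'Kafka has under-replicated partitions.',
    description: |||
      Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} under-replicated partitions.
    ||| % [instanceLabel, groupLabel],
  },
};

// The ||| description becomes an ordinary multi-line string value in the output.
std.manifestYamlDoc({ groups: [{ name: 'kafka-example', rules: [exampleRule] }] })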
@@ -225,8 +293,27 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
         },
         annotations: {
           summary: 'Kafka has no active controller.',
-          description: 'Kafka cluster {{ $labels.%s }} has {{ $value }} broker(s) reporting as the active controller in the last 5 minute interval. During steady state there should be only one active controller per cluster.'
-                       % groupLabel,
+          description: |||
+            Kafka cluster {{ $labels.%s }} has {{ $value }} broker(s) reporting as the active controller. Expected exactly 1 active controller.
+
+            CRITICAL impact:
+            The Kafka controller is responsible for cluster-wide administrative operations including partition leader election, broker failure detection, topic creation/deletion, and partition reassignment. Without an active controller (value=0) or with multiple controllers (value>1), the cluster cannot perform these critical operations, potentially causing:
+            - Inability to elect new partition leaders when brokers fail
+            - Topic creation/deletion operations hang indefinitely
+            - Partition reassignments cannot be executed
+            - Cluster metadata inconsistencies
+            - Split-brain scenarios if multiple controllers exist
+
+            Common causes:
+            - Zookeeper connectivity issues or Zookeeper cluster instability
+            - Network partitions between brokers and Zookeeper
+            - Controller broker crash or unclean shutdown
+            - Long garbage collection pauses on controller broker
+            - Controller election conflicts during network splits
+
+            This is a critical cluster-wide issue requiring immediate attention to restore normal operations.
+          ||| % groupLabel,
         },
       },
       {
@@ -239,8 +326,40 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
         },
         annotations: {
           summary: 'Kafka has unclean leader elections.',
-          description: 'Kafka cluster {{ $labels.%s }} has {{ $value }} unclean partition leader elections reported in the last 5 minute interval. When unclean leader election is held among out-of-sync replicas, there is a possibility of data loss if any messages were not synced prior to the loss of the former leader. So if the number of unclean elections is greater than 0, investigate broker logs to determine why leaders were re-elected, and look for WARN or ERROR messages. Consider setting the broker configuration parameter unclean.leader.election.enable to false so that a replica outside of the set of in-sync replicas is never elected leader.'
-                       % groupLabel,
+          description: |||
+            Kafka cluster {{ $labels.%s }} has {{ $value }} unclean partition leader elections reported in the last 5 minutes.
+
+            CRITICAL Impact - DATA LOSS RISK:
+            Unclean leader election occurs when no in-sync replica (ISR) is available to become the leader, forcing Kafka to elect an out-of-sync replica. This WILL result in data loss for any messages that were committed to the previous leader but not replicated to the new leader. This compromises data durability guarantees and can cause:
+            - Permanent loss of committed messages
+            - Consumer offset inconsistencies
+            - Duplicate message processing
+            - Data inconsistencies between producers and consumers
+            - Violation of at-least-once or exactly-once semantics
+
+            Common causes:
+            - All ISR replicas failed simultaneously (broker crashes, hardware failures)
+            - Network partitions isolating all ISR members
+            - Extended broker downtime exceeding replica lag tolerance
+            Kafka cluster {{ $labels.%s }} has zero brokers reporting metrics.
+
+            No brokers are online or reporting metrics, indicating complete cluster failure. This results in:
+            - Total unavailability of all topics and partitions
+            - All produce and consume operations failing
+            - Complete loss of cluster functionality
+            - Potential data loss if unclean shutdown occurred
+            - Application downtime for all services depending on Kafka
+
+          ||| % groupLabel,
         },
       },
       {
@@ -263,13 +392,24 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           severity: 'critical',
         },
         annotations: {
-          summary: 'Kafka Zookeeper sync disconected.',
-          description:
-            'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has disconected from Zookeeper.'
-            % [
-              instanceLabel,
-              groupLabel,
-            ],
+          summary: 'Kafka Zookeeper sync disconnected.',
+          description: |||
+            Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has lost connection to Zookeeper.
+
+            Zookeeper connectivity is essential for Kafka broker operation. A disconnected broker cannot:
+            - Participate in controller elections
+            - Register or maintain its broker metadata
+            - Receive cluster state updates
+            - Serve as partition leader (will be removed from ISR)
+            - Handle leadership changes or partition reassignments
+
+            This will cause the broker to become isolated from the cluster, leading to under-replicated partitions and potential service degradation for any topics hosted on this broker.
+
+            Prolonged Zookeeper disconnection will result in the broker being ejected from the cluster and leadership reassignments.
+          ||| % [
+            instanceLabel,
+            groupLabel,
+          ],