
Commit 54f3e88

Expanded alerts descriptions with runbooks
1 parent 6e52c8a commit 54f3e88

1 file changed: +173 -33 lines changed


kafka-observ-lib/alerts.libsonnet

Lines changed: 173 additions & 33 deletions
@@ -65,12 +65,23 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
             severity: 'warning',
           },
           annotations: {
-            summary: 'Kafka ISR expansion rate is expanding.',
-            description: 'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} In-Sync Replica (ISR) is expanding by {{ $value }} per second. If a broker goes down, ISR for some of the partitions shrink. When that broker is up again, ISRs are expanded once the replicas are fully caught up. Other than that, the expected value for ISR expansion rate is 0. If ISR is expanding and shrinking frequently, adjust Allowed replica lag.'
-            % [
-              instanceLabel,
-              groupLabel,
-            ],
+            summary: 'Kafka ISR expansion detected.',
+            description: |||
+              Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has In-Sync Replica (ISR) expanding at {{ printf "%%.2f" $value }} per second.
+
+              ISR expansion typically occurs when a broker recovers and its replicas catch up to the leader. The expected steady-state value for ISR expansion rate is 0.
+
+              Frequent ISR expansion and shrinkage indicates instability and may suggest:
+              - Brokers frequently going offline/online
+              - Network connectivity issues
+              - Replica lag configuration too tight (adjust replica.lag.max.messages or replica.socket.timeout.ms)
+              - Insufficient broker resources causing replicas to fall behind
+
+              If this alert fires frequently without corresponding broker outages, investigate broker health and adjust replica lag settings.
+            ||| % [
+              instanceLabel,
+              groupLabel,
+            ],
           },
         },
         {
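
A note on the templating above: the new descriptions are jsonnet text blocks (|||) combined with the % formatting operator, so each %s is filled from [instanceLabel, groupLabel] when the library is evaluated, while %% escapes to a literal %, leaving {{ printf "%.2f" $value }} intact for Prometheus' Go templating to evaluate at alert time. A minimal sketch of that interaction, with illustrative label values that are not taken from this library:

// Minimal sketch of the |||-plus-% pattern used in the hunk above.
// The two label values below are assumptions for illustration only.
local instanceLabel = 'instance';
local groupLabel = 'kafka_cluster';

{
  description: |||
    Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has ISR expanding at {{ printf "%%.2f" $value }} per second.
  ||| % [instanceLabel, groupLabel],
}
// Evaluates to:
//   Kafka broker {{ $labels.instance }} in cluster {{ $labels.kafka_cluster }} has ISR expanding at {{ printf "%.2f" $value }} per second.

If the %% were written as a single %, the jsonnet format operator would try to consume a third value and fail, so the doubled percent sign in the diff is deliberate.
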
@@ -87,12 +98,27 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
             severity: 'warning',
           },
           annotations: {
-            summary: 'Kafka ISR expansion rate is shrinking.',
-            description: 'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} In-Sync Replica (ISR) is shrinking by {{ $value }} per second. If a broker goes down, ISR for some of the partitions shrink. When that broker is up again, ISRs are expanded once the replicas are fully caught up. Other than that, the expected value for ISR shrink rate is 0. If ISR is expanding and shrinking frequently, adjust Allowed replica lag.'
-            % [
-              instanceLabel,
-              groupLabel,
-            ],
+            summary: 'Kafka ISR shrinkage detected.',
+            description: |||
+              Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has In-Sync Replica (ISR) shrinking at {{ printf "%%.2f" $value }} per second.
+
+              ISR shrinkage occurs when a replica falls too far behind the leader and is removed from the ISR set. This reduces fault tolerance as fewer replicas are in-sync.
+              The expected steady-state value for ISR shrink rate is 0.
+
+              Common causes include:
+              - Broker failures or restarts
+              - Network latency or connectivity issues
+              - Replica lag exceeding replica.lag.max.messages threshold
+              - Replica not contacting leader within replica.socket.timeout.ms
+              - Insufficient broker resources (CPU, disk I/O, memory)
+              - High producer throughput overwhelming broker capacity
+
+              If ISR is shrinking without corresponding expansion shortly after, investigate broker health, network connectivity, and resource utilization.
+              Consider adjusting replica.lag.max.messages or replica.socket.timeout.ms if shrinkage is frequent but brokers are healthy.
+            ||| % [
+              instanceLabel,
+              groupLabel,
+            ],
           },
         },
         {
@@ -109,9 +135,28 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
             severity: 'critical',
           },
           annotations: {
-            summary: 'Kafka has offline partitons.',
-            description: 'Kafka cluster {{ $labels.%s }} has {{ $value }} offline partitions. After successful leader election, if the leader for partition dies, then the partition moves to the OfflinePartition state. Offline partitions are not available for reading and writing. Restart the brokers, if needed, and check the logs for errors.'
-            % groupLabel,
+            summary: 'Kafka has offline partitions.',
+            description: |||
+              Kafka cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} offline partitions.
+
+              Offline partitions have no active leader, making them completely unavailable for both reads and writes. This directly impacts application functionality.
+
+              Common causes include:
+              - All replicas for the partition are down
+              - No in-sync replicas available for leader election
+              - Cluster controller issues preventing leader election
+              - Insufficient replica count for the replication factor
+
+              Immediate actions:
+              1. Check broker status - identify which brokers are down
+              2. Review broker logs for errors and exceptions
+              3. Restart failed brokers if needed
+              4. Verify ZooKeeper connectivity
+              5. Check for disk space or I/O issues on broker hosts
+              6. Monitor ISR status to ensure replicas are catching up
+
+              Until resolved, affected topics cannot serve traffic for these partitions.
+            ||| % groupLabel,
           },
         },
         {
@@ -128,12 +173,35 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
             severity: 'critical',
           },
           annotations: {
-            summary: 'Kafka has under replicated partitons.',
-            description: 'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has {{ $value }} under replicated partitons'
-            % [
-              instanceLabel,
-              groupLabel,
-            ],
+            summary: 'Kafka has under-replicated partitions.',
+            description: |||
+              Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} under-replicated partitions.
+
+              Under-replicated partitions have fewer in-sync replicas (ISR) than the configured replication factor, reducing fault tolerance and risking data loss.
+
+              Impact:
+              - Reduced data durability (fewer backup copies)
+              - Increased risk of data loss if additional brokers fail
+              - Lower fault tolerance for partition availability
+
+              Common causes:
+              - Broker failures or network connectivity issues
+              - Brokers unable to keep up with replication (resource constraints)
+              - High producer throughput overwhelming replica capacity
+              - Disk I/O saturation on replica brokers
+              - Network partition between brokers
+
+              Actions:
+              1. Identify which brokers are lagging (check ISR for affected partitions)
+              2. Review broker resource utilization (CPU, memory, disk I/O)
+              3. Check network connectivity between brokers
+              4. Verify broker logs for replication errors
+              5. Consider adding broker capacity if consistently under-replicated
+              6. Reassign partitions if specific brokers are problematic
+            ||| % [
+              instanceLabel,
+              groupLabel,
+            ],
           },
         },
         {
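
The %s placeholders exist because instanceLabel and groupLabel are jsonnet variables rather than fixed label names, so the same runbook text can adapt to whatever label scheme a deployment uses. A hedged sketch of that idea; the helper function below is illustrative only and is not this library's actual API:

// Illustrative only: a function-shaped sketch of label-parameterised annotations.
local annotationsFor(instanceLabel, groupLabel) = {
  summary: 'Kafka has under-replicated partitions.',
  description: |||
    Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} under-replicated partitions.
  ||| % [instanceLabel, groupLabel],
};

// Usage with two possible label schemes (both hypothetical):
{
  default: annotationsFor('instance', 'kafka_cluster'),
  k8s: annotationsFor('pod', 'cluster'),
}
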
@@ -225,8 +293,27 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           },
           annotations: {
             summary: 'Kafka has no active controller.',
-            description: 'Kafka cluster {{ $labels.%s }} has {{ $value }} broker(s) reporting as the active controller in the last 5 minute interval. During steady state there should be only one active controller per cluster.'
-            % groupLabel,
+            description: |||
+              Kafka cluster {{ $labels.%s }} has {{ $value }} broker(s) reporting as the active controller. Expected exactly 1 active controller.
+
+              CRITICAL impact:
+              The Kafka controller is responsible for cluster-wide administrative operations including partition leader election, broker failure detection, topic creation/deletion, and partition reassignment. Without an active controller (value=0) or with multiple controllers (value>1), the cluster cannot perform these critical operations, potentially causing:
+              - Inability to elect new partition leaders when brokers fail
+              - Topic creation/deletion operations hang indefinitely
+              - Partition reassignments cannot be executed
+              - Cluster metadata inconsistencies
+              - Split-brain scenarios if multiple controllers exist
+
+              Common causes:
+              - Zookeeper connectivity issues or Zookeeper cluster instability
+              - Network partitions between brokers and Zookeeper
+              - Controller broker crash or unclean shutdown
+              - Long garbage collection pauses on controller broker
+              - Zookeeper session timeout (zookeeper.session.timeout.ms exceeded)
+              - Controller election conflicts during network splits
+
+              This is a critical cluster-wide issue requiring immediate attention to restore normal operations.
+            ||| % groupLabel,
           },
         },
         {
@@ -239,8 +326,40 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           },
           annotations: {
             summary: 'Kafka has unclean leader elections.',
-            description: 'Kafka cluster {{ $labels.%s }} has {{ $value }} unclean partition leader elections reported in the last 5 minute interval. When unclean leader election is held among out-of-sync replicas, there is a possibility of data loss if any messages were not synced prior to the loss of the former leader. So if the number of unclean elections is greater than 0, investigate broker logs to determine why leaders were re-elected, and look for WARN or ERROR messages. Consider setting the broker configuration parameter unclean.leader.election.enable to false so that a replica outside of the set of in-sync replicas is never elected leader.'
-            % groupLabel,
+            description: |||
+              Kafka cluster {{ $labels.%s }} has {{ $value }} unclean partition leader elections reported in the last 5 minutes.
+
+              CRITICAL Impact - DATA LOSS RISK:
+              Unclean leader election occurs when no in-sync replica (ISR) is available to become the leader, forcing Kafka to elect an out-of-sync replica. This WILL result in data loss for any messages that were committed to the previous leader but not replicated to the new leader. This compromises data durability guarantees and can cause:
+              - Permanent loss of committed messages
+              - Consumer offset inconsistencies
+              - Duplicate message processing
+              - Data inconsistencies between producers and consumers
+              - Violation of at-least-once or exactly-once semantics
+
+              Common causes:
+              - All ISR replicas failed simultaneously (broker crashes, hardware failures)
+              - Network partitions isolating all ISR members
+              - Extended broker downtime exceeding replica lag tolerance
+              - Insufficient replication factor (RF < 3) combined with broker failures
+              - min.insync.replicas set too low relative to replication factor
+              - Disk failures on multiple replicas simultaneously
+              - Aggressive unclean.leader.election.enable=true configuration
+
+              Immediate actions:
+              1. Review broker logs to identify which partitions had unclean elections
+              2. Investigate root cause of ISR replica failures (check broker health, hardware, network)
+              3. Assess data loss impact by comparing producer and consumer offsets for affected partitions
+              4. Alert application teams to potential data loss in affected partitions
+              5. Bring failed ISR replicas back online as quickly as possible
+              6. Consider resetting consumer offsets if data loss is unacceptable
+              7. Review and increase replication factor for critical topics (minimum RF=3)
+              8. Set unclean.leader.election.enable=false to prevent future unclean elections (availability vs. durability trade-off)
+              9. Increase min.insync.replicas to strengthen durability guarantees
+              10. Implement better monitoring for ISR shrinkage to detect issues before unclean elections occur
+
+              This indicates a serious reliability event that requires immediate investigation and remediation to prevent future data loss.
+            ||| % groupLabel,
           },
         },
         {
@@ -252,7 +371,17 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           },
           annotations: {
             summary: 'Kafka has no brokers online.',
-            description: 'Kafka cluster {{ $labels.%s }} broker count is 0.' % groupLabel,
+            description: |||
+              Kafka cluster {{ $labels.%s }} has zero brokers reporting metrics.
+
+              No brokers are online or reporting metrics, indicating complete cluster failure. This results in:
+              - Total unavailability of all topics and partitions
+              - All produce and consume operations failing
+              - Complete loss of cluster functionality
+              - Potential data loss if unclean shutdown occurred
+              - Application downtime for all services depending on Kafka
+
+            ||| % groupLabel,
           },
         },
         {
@@ -263,13 +392,24 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
             severity: 'critical',
           },
           annotations: {
-            summary: 'Kafka Zookeeper sync disconected.',
-            description:
-              'Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has disconected from Zookeeper.'
-              % [
-                instanceLabel,
-                groupLabel,
-              ],
+            summary: 'Kafka Zookeeper sync disconnected.',
+            description: |||
+              Kafka broker {{ $labels.%s }} in cluster {{ $labels.%s }} has lost connection to Zookeeper.
+
+              Zookeeper connectivity is essential for Kafka broker operation. A disconnected broker cannot:
+              - Participate in controller elections
+              - Register or maintain its broker metadata
+              - Receive cluster state updates
+              - Serve as partition leader (will be removed from ISR)
+              - Handle leadership changes or partition reassignments
+
+              This will cause the broker to become isolated from the cluster, leading to under-replicated partitions and potential service degradation for any topics hosted on this broker.
+
+              Prolonged Zookeeper disconnection will result in the broker being ejected from the cluster and leadership reassignments.
+            ||| % [
+              instanceLabel,
+              groupLabel,
+            ],
           },
         },
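
Since the commit embeds runbook guidance directly in the description annotations, a related optional pattern is to also attach a runbook_url annotation that alerting UIs and receivers can link to. A sketch of that pattern, not part of this commit; the URL is a placeholder:

// Optional companion pattern (not in this commit): attach an external runbook link.
// The URL below is a placeholder, not a real runbook location.
{
  annotations+: {
    runbook_url: 'https://example.com/runbooks/kafka/zookeeper-sync-disconnected',
  },
}
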
