
Commit 6e52c8a

Add new alerts
KafkaUnderMinISRPartitionCount (critical)
KafkaPreferredReplicaImbalance (warning)
1 parent b22b200

1 file changed: +83 -0

kafka-observ-lib/alerts.libsonnet

@@ -104,6 +104,7 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           this.signals.brokerReplicaManager.offlinePartitions.asRuleExpression(),
         ],
         'for': '5m',
+        keep_firing_for: '5m',
         labels: {
           severity: 'critical',
         },
@@ -122,6 +123,7 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           this.signals.brokerReplicaManager.underReplicatedPartitions.asRuleExpression(),
         ],
         'for': '5m',
+        keep_firing_for: '5m',
         labels: {
           severity: 'critical',
         },
@@ -134,6 +136,86 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
           ],
         },
       },
+      {
+        alert: 'KafkaUnderMinISRPartitionCount',
+        expr: |||
+          sum by (%s) (%s) > 0
+        ||| % [
+          std.join(',', this.config.groupLabels),
+          this.signals.brokerReplicaManager.underMinISRPartitions.asRuleExpression(),
+        ],
+        'for': '2m',
+        keep_firing_for: '5m',
+        labels: {
+          severity: 'critical',
+        },
+        annotations: {
+          summary: 'Kafka partitions below minimum ISR - writes unavailable.',
+          description: |||
+            Kafka cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} partitions with fewer in-sync replicas than min.insync.replicas configuration.
+
+            CRITICAL IMPACT: These partitions are UNAVAILABLE FOR WRITES when producers use acks=all, directly impacting application availability.
+
+            This configuration prevents data loss by refusing writes when not enough replicas are in-sync, but at the cost of availability.
+
+            Common causes:
+            - Broker failures reducing available replicas below threshold
+            - Network issues preventing replicas from staying in-sync
+            - Brokers overwhelmed and unable to keep up with replication
+            - Recent partition reassignment or broker maintenance
+
+            Immediate actions:
+            1. Identify affected partitions and their current ISR status
+            2. Check broker health and availability
+            3. Review network connectivity between brokers
+            4. Investigate broker resource utilization (CPU, disk I/O, memory)
+            5. Restart failed brokers or resolve broker issues
+            6. Monitor ISR recovery as brokers catch up
+
+            Producers will receive NOT_ENOUGH_REPLICAS errors until ISR count recovers above min.insync.replicas threshold.
+          ||| % groupLabel,
+        },
+      },
+      {
+        alert: 'KafkaPreferredReplicaImbalance',
+        expr: |||
+          sum by (%s) (%s) > 0
+        ||| % [
+          std.join(',', this.config.groupLabels),
+          this.signals.brokerReplicaManager.preferredReplicaImbalance.asRuleExpression(),
+        ],
+        'for': '30m',
+        keep_firing_for: '5m',
+        labels: {
+          severity: 'warning',
+        },
+        annotations: {
+          summary: 'Kafka has preferred replica imbalance.',
+          description: |||
+            Kafka cluster {{ $labels.%s }} has {{ $value }} partitions where the leader is not the preferred replica.
+
+            Impact:
+            Uneven load distribution across brokers can result in some brokers handling significantly more client requests (produce/consume) than others, leading to hotspots, degraded performance, and potential resource exhaustion on overloaded brokers. This prevents optimal cluster utilization and can impact latency and throughput.
+
+            Common causes:
+            - Broker restarts or failures causing leadership to shift to non-preferred replicas
+            - Manual partition reassignments or replica movements
+            - Recent broker additions to the cluster
+            - Failed automatic preferred replica election
+            - Auto leader rebalancing disabled (auto.leader.rebalance.enable=false)
+
+            Actions:
+            1. Verify auto.leader.rebalance.enable is set to true in broker configuration
+            2. Check leader.imbalance.check.interval.seconds (default 300s) configuration
+            3. Manually trigger preferred replica election using kafka-preferred-replica-election tool
+            4. Monitor broker resource utilization (CPU, network) for imbalance
+            5. Review broker logs for leadership election errors
+            6. Verify all brokers are healthy and reachable
+
+            If the imbalance persists for extended periods, consider running manual preferred replica election to redistribute leadership and restore balanced load across the cluster.
+          ||| % groupLabel,
+        },
+      },
       {
         alert: 'KafkaNoActiveController',
         expr: 'sum by(' + std.join(',', this.config.groupLabels) + ') (' + this.signals.cluster.activeControllers.asRuleExpression() + ') != 1',
@@ -151,6 +233,7 @@ local xtd = import 'github.com/jsonnet-libs/xtd/main.libsonnet';
         alert: 'KafkaUncleanLeaderElection',
         expr: '(%s) != 0' % this.signals.brokerReplicaManager.uncleanLeaderElection.asRuleExpression(),
         'for': '5m',
+        keep_firing_for: '5m',
         labels: {
           severity: 'critical',
         },
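Both new alerts build their PromQL through the same Jsonnet template, `sum by (%s) (%s) > 0`, filling in the group labels and the signal's rule expression via the `%` formatting operator. The descriptions pass through the same operator, which is why the first one writes `%%.0f`: the doubled `%` escapes to a literal `%` so the Go-template `printf` survives Jsonnet formatting. A minimal standalone sketch of that expansion (the `groupLabels` value and the metric name are assumed examples, not the library's actual signal output):

    // Standalone sketch of the templating used above; jsonnet can evaluate
    // this file directly. The label and metric names are assumptions.
    local groupLabels = ['kafka_cluster'];
    local ruleExpr = 'kafka_controller_partition_under_min_isr_count';  // assumed metric

    {
      expr: |||
        sum by (%s) (%s) > 0
      ||| % [std.join(',', groupLabels), ruleExpr],
      // '%%' renders as a literal '%', preserving the Go-template printf:
      description: 'Kafka cluster {{ $labels.%s }} has {{ printf "%%.0f" $value }} partitions below min ISR.' % groupLabels[0],
    }

Evaluating this yields the expr `sum by (kafka_cluster) (kafka_controller_partition_under_min_isr_count) > 0`, and a description whose `{{ printf "%.0f" $value }}` is left intact for Prometheus's alert templating to render at firing time.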

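The commit also threads `keep_firing_for: '5m'` through the existing critical alerts (offline partitions, under-replicated partitions, unclean leader election). The field maps straight through to the Prometheus rule field of the same name, introduced in Prometheus 2.42: once firing, an alert is held in the firing state for the given duration after its expression stops returning results. A hedged sketch of roughly what one rendered entry looks like (the flattened expression is an assumed example, not the library's exact output):

    // Approximate shape of one rendered rule, for illustration only.
    {
      alert: 'KafkaUnderMinISRPartitionCount',
      expr: 'sum by (kafka_cluster) (kafka_controller_partition_under_min_isr_count) > 0',  // assumed
      'for': '2m',             // condition must hold for 2m before firing
      keep_firing_for: '5m',   // stays firing for 5m after the condition clears
      labels: { severity: 'critical' },
    }

The shorter `for: '2m'` on the min-ISR alert, versus the 5m used elsewhere, presumably reflects its severity: write unavailability warrants paging quickly, while `keep_firing_for` keeps a briefly recovering ISR count from resolving and re-firing the page.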