feat(kafka-observ-lib): Expand kafka-observ-lib alerts/panels descriptions and add new alerts #1514
base: master
KafkaUnderMinISRPartitionCount (critical)
KafkaPreferredReplicaImbalance (warning)
Apologies for the delay on this review, I will try to get to it tomorrow :)
Dasomeone
left a comment
Nice, thank you @v-zhuravlev. Apologies for the delay on this one, I kept getting distracted with other things, but all good from my end here :D
aalhour
left a comment
Thank you for improving this. I have notes on two alerts:
KafkaOfflinePartitionCount
KafkaUnderMinISRPartitionCount
Happy to chat about them in the comments.
expr: |||
  sum by (%s) (%s) > 0
||| % [
  std.join(',', this.config.groupLabels),
If I am not mistaken, offline partitions in Kafka usually stem from individual instances (brokers). Do you also want to group by instance here? The group labels don't include it.
I see that the alert KafkaUnderReplicatedPartitionCount nicely includes the instance labels on line 168:
std.join(',', this.config.groupLabels + this.config.instanceLabels),
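To make the suggestion concrete, here is a minimal, self-contained sketch of the offline-partitions rule grouped by instance as well. The `config` values and the metric selector are hypothetical stand-ins so the snippet evaluates on its own; in the library they would come from `this.config` and the signal's `asRuleExpression()`, and the real label names may differ.

```jsonnet
// Hypothetical stand-ins for the library's config; real label names may differ.
local config = {
  groupLabels: ['job', 'kafka_cluster'],
  instanceLabels: ['instance'],
};

// Placeholder selector standing in for the signal's asRuleExpression() output.
local offlinePartitionsExpr = 'kafka_controller_kafkacontroller_offlinepartitionscount';

{
  alert: 'KafkaOfflinePartitionCount',
  expr: |||
    sum by (%s) (%s) > 0
  ||| % [
    // Concatenating the two arrays adds the instance label to the grouping.
    std.join(',', config.groupLabels + config.instanceLabels),
    offlinePartitionsExpr,
  ],
}
```

With these placeholder values the rendered expression groups by `job,kafka_cluster,instance`, so a firing alert would carry the instance label of the series that reported the offline partitions. Whether that is preferable to one alert per cluster is a judgment call for the author.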
  std.join(',', this.config.groupLabels),
  this.signals.brokerReplicaManager.underMinISRPartitions.asRuleExpression(),
],
'for': '2m',
I don't have data to prove this offhand, but 2m seems too short to absorb the fluctuations that normal operations can cause. Do you agree? I think 5m, in line with the other alerts, is appropriate and still short enough for a critical alert to fire promptly.
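As a rough sketch of the change I have in mind, only the `for` value would move from 2m to 5m. The `config` values and the metric selector below are hypothetical placeholders so the snippet evaluates standalone; the actual rule would keep using `this.config.groupLabels` and `this.signals.brokerReplicaManager.underMinISRPartitions.asRuleExpression()`.

```jsonnet
// Hypothetical stand-ins; the library supplies these via this.config and the signal.
local config = {
  groupLabels: ['job', 'kafka_cluster'],
};

// Placeholder selector standing in for underMinISRPartitions.asRuleExpression().
local underMinISRExpr = 'kafka_cluster_partition_underminisr';

{
  alert: 'KafkaUnderMinISRPartitionCount',
  expr: |||
    sum by (%s) (%s) > 0
  ||| % [
    std.join(',', config.groupLabels),
    underMinISRExpr,
  ],
  // Suggested: 5m instead of 2m, so the condition has to persist through
  // short-lived ISR changes before the alert fires.
  'for': '5m',
  labels: {
    severity: 'critical',
  },
}
```

The 5m value is a suggestion rather than something backed by measurements, so it is worth checking against real cluster data before settling on it.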
Here is what has been updated:
Fix typo in totalTime
Fix typo in preferredReplicaImbalance
Increase size of status rows
Update signals/panels descriptions
Fix typo in KafkaOfflinePartitionCount alert
Add new alerts
Expand alert descriptions with runbooks