-
Notifications
You must be signed in to change notification settings - Fork 22
feat: gp alerts #651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: gp alerts #651
Conversation
56c4f9d
to
0f84173
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #651 +/- ##
=======================================
Coverage ? 50.95%
=======================================
Files ? 13
Lines ? 1841
Branches ? 315
=======================================
Hits ? 938
Misses ? 826
Partials ? 77 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces comprehensive MySQL Group Replication alert rules for Prometheus monitoring. It restructures the existing alert system and adds CI testing for alert validation.
- Adds new alert group specifically for MySQL Group Replication scenarios (offline members, primary issues, conflicts)
- Restructures existing general MySQL alerts with improved naming and organization
- Implements CI pipeline for alert rule validation and unit testing
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
File | Description |
---|---|
src/prometheus_alert_rules/replication_rules.yaml | New alert rules for MySQL Group Replication monitoring |
src/prometheus_alert_rules/general_rules.yaml | Refactored general MySQL alerts with improved organization |
tests/alerts/test_replication_alerts.yaml | Unit tests for Group Replication alert rules |
tests/alerts/test_general_alerts.yaml | Unit tests for general MySQL alert rules |
.github/workflows/ci.yaml | Added CI job for alert rule validation and testing |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
summary: MySQL cluster member {{ $labels.instance }} is offline | ||
description: | | ||
The MySQL member is marked offline in the cluster, although the process might still be running. | ||
If this is unexptected, please check the logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a spelling error in 'unexptected'. It should be 'unexpected'.
If this is unexptected, please check the logs. | |
If this is unexpected, please check the logs. |
Copilot uses AI. Check for mistakes.
labels: | ||
severity: critical | ||
annotations: | ||
summary: MySQL cluster reports no primariy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a spelling error in 'primariy'. It should be 'primary'.
summary: MySQL cluster reports no primariy | |
summary: MySQL cluster reports no primary |
Copilot uses AI. Check for mistakes.
annotations: | ||
summary: MySQL cluster reports more than one primary. | ||
description: | | ||
MySQL reports more than one primary. This is can indicate a split-brain situation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a grammatical error in 'This is can indicate'. It should be 'This can indicate'.
MySQL reports more than one primary. This is can indicate a split-brain situation. | |
MySQL reports more than one primary. This can indicate a split-brain situation. |
Copilot uses AI. Check for mistakes.
summary: MySQL cluster member db1 is offline | ||
description: | | ||
The MySQL member is marked offline in the cluster, although the process might still be running. | ||
If this is unexptected, please check the logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a spelling error in 'unexptected'. It should be 'unexpected'.
If this is unexptected, please check the logs. | |
If this is unexpected, please check the logs. |
Copilot uses AI. Check for mistakes.
severity: critical | ||
member_role: PRIMARY | ||
exp_annotations: | ||
summary: MySQL cluster reports no primariy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a spelling error in 'primariy'. It should be 'primary'.
summary: MySQL cluster reports no primariy | |
summary: MySQL cluster reports no primary |
Copilot uses AI. Check for mistakes.
exp_annotations: | ||
summary: MySQL cluster reports more than one primary. | ||
description: | | ||
MySQL reports more than one primary. This is can indicate a split-brain situation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a grammatical error in 'This is can indicate'. It should be 'This can indicate'.
MySQL reports more than one primary. This is can indicate a split-brain situation. | |
MySQL reports more than one primary. This can indicate a split-brain situation. |
Copilot uses AI. Check for mistakes.
- alert: MySQLDown | ||
expr: "mysql_up == 0" | ||
expr: mysql_up == 0 | ||
for: 0m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Setting 'for: 0m' makes the alert fire immediately without any grace period. Consider if a brief delay (like 1m or 2m) would be more appropriate to avoid false positives from temporary network issues.
Copilot uses AI. Check for mistakes.
@@ -42,6 +42,19 @@ jobs: | |||
- name: Upload Coverage to Codecov | |||
uses: codecov/codecov-action@v5 | |||
|
|||
alert-test: | |||
name: Test Prometheus Alert Rules | |||
runs-on: ubuntu-latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runs-on: ubuntu-latest | |
runs-on: ubuntu-latest | |
timeout-minutes: 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not manage to trigger MySQLClusterNoPrimary
although I did simulate it.
max_over_time( | ||
count(mysql_perf_schema_replication_group_member_info{member_state="ONLINE"} == 1)[6h:] | ||
) | ||
for: 15m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be faster to trigger.
for: 15m | |
for: 5m |
If this is unexptected, please check the logs. | ||
LABELS = {{ $labels }}. | ||
|
||
- alert: MySQLClusterNoPrimary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Deezzir
For whatever reason, this is not firing properly when cluster looses its quorum.
The test this on k8s, you can do a kubectl delete pod -n <model> mysql-k8s-0 mysql-k8s-1
on the three node cluster where the primary is include in the delete command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not firing when all three units are offline
Do you mind sharing the metrics from the exporter at the time of simulation? |
Here: This is queried from the not killed unit, there are about 7 or 8 outputs in sequence, separated by |
LABELS = {{ $labels }}. | ||
|
||
- alert: MySQLClusterNoPrimary | ||
expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to test for an online primary
expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY"}) | |
expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY",member_state="ONLINE"}) |
Issue
The current setup lacks the alerts for Group Replication
Solution
The PR introduces a new alert group for Group Replication. Additionally, it restructures the alerts and creates a CI job to lint and unit test them.
Improves on: #631