Skip to content

Conversation

Deezzir
Copy link

@Deezzir Deezzir commented Aug 27, 2025

Issue

The current setup lacks the alerts for Group Replication

Solution

The PR introduces a new alert group for Group Replication. Additionally, it restructures the alerts and creates a CI job to lint and unit test them.

Improves on: #631

@Deezzir Deezzir force-pushed the DPE-7859-my-sql-add-basic-gr-metrics-to-charms branch from 56c4f9d to 0f84173 Compare August 27, 2025 02:07
Copy link

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@29437d3). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #651   +/-   ##
=======================================
  Coverage        ?   50.95%           
=======================================
  Files           ?       13           
  Lines           ?     1841           
  Branches        ?      315           
=======================================
  Hits            ?      938           
  Misses          ?      826           
  Partials        ?       77           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces comprehensive MySQL Group Replication alert rules for Prometheus monitoring. It restructures the existing alert system and adds CI testing for alert validation.

  • Adds new alert group specifically for MySQL Group Replication scenarios (offline members, primary issues, conflicts)
  • Restructures existing general MySQL alerts with improved naming and organization
  • Implements CI pipeline for alert rule validation and unit testing

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/prometheus_alert_rules/replication_rules.yaml New alert rules for MySQL Group Replication monitoring
src/prometheus_alert_rules/general_rules.yaml Refactored general MySQL alerts with improved organization
tests/alerts/test_replication_alerts.yaml Unit tests for Group Replication alert rules
tests/alerts/test_general_alerts.yaml Unit tests for general MySQL alert rules
.github/workflows/ci.yaml Added CI job for alert rule validation and testing

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

summary: MySQL cluster member {{ $labels.instance }} is offline
description: |
The MySQL member is marked offline in the cluster, although the process might still be running.
If this is unexptected, please check the logs.
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a spelling error in 'unexptected'. It should be 'unexpected'.

Suggested change
If this is unexptected, please check the logs.
If this is unexpected, please check the logs.

Copilot uses AI. Check for mistakes.

labels:
severity: critical
annotations:
summary: MySQL cluster reports no primariy
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a spelling error in 'primariy'. It should be 'primary'.

Suggested change
summary: MySQL cluster reports no primariy
summary: MySQL cluster reports no primary

Copilot uses AI. Check for mistakes.

annotations:
summary: MySQL cluster reports more than one primary.
description: |
MySQL reports more than one primary. This is can indicate a split-brain situation.
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a grammatical error in 'This is can indicate'. It should be 'This can indicate'.

Suggested change
MySQL reports more than one primary. This is can indicate a split-brain situation.
MySQL reports more than one primary. This can indicate a split-brain situation.

Copilot uses AI. Check for mistakes.

summary: MySQL cluster member db1 is offline
description: |
The MySQL member is marked offline in the cluster, although the process might still be running.
If this is unexptected, please check the logs.
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a spelling error in 'unexptected'. It should be 'unexpected'.

Suggested change
If this is unexptected, please check the logs.
If this is unexpected, please check the logs.

Copilot uses AI. Check for mistakes.

severity: critical
member_role: PRIMARY
exp_annotations:
summary: MySQL cluster reports no primariy
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a spelling error in 'primariy'. It should be 'primary'.

Suggested change
summary: MySQL cluster reports no primariy
summary: MySQL cluster reports no primary

Copilot uses AI. Check for mistakes.

exp_annotations:
summary: MySQL cluster reports more than one primary.
description: |
MySQL reports more than one primary. This is can indicate a split-brain situation.
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a grammatical error in 'This is can indicate'. It should be 'This can indicate'.

Suggested change
MySQL reports more than one primary. This is can indicate a split-brain situation.
MySQL reports more than one primary. This can indicate a split-brain situation.

Copilot uses AI. Check for mistakes.

- alert: MySQLDown
expr: "mysql_up == 0"
expr: mysql_up == 0
for: 0m
Copy link
Preview

Copilot AI Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Setting 'for: 0m' makes the alert fire immediately without any grace period. Consider if a brief delay (like 1m or 2m) would be more appropriate to avoid false positives from temporary network issues.

Copilot uses AI. Check for mistakes.

@@ -42,6 +42,19 @@ jobs:
- name: Upload Coverage to Codecov
uses: codecov/codecov-action@v5

alert-test:
name: Test Prometheus Alert Rules
runs-on: ubuntu-latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
runs-on: ubuntu-latest
runs-on: ubuntu-latest
timeout-minutes: 5

@canonical canonical deleted a comment from Copilot AI Aug 29, 2025
Copy link
Contributor

@paulomach paulomach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not manage to trigger MySQLClusterNoPrimary although I did simulate it.

max_over_time(
count(mysql_perf_schema_replication_group_member_info{member_state="ONLINE"} == 1)[6h:]
)
for: 15m
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be faster to trigger.

Suggested change
for: 15m
for: 5m

If this is unexptected, please check the logs.
LABELS = {{ $labels }}.

- alert: MySQLClusterNoPrimary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Deezzir
For whatever reason, this is not firing properly when cluster looses its quorum.
The test this on k8s, you can do a kubectl delete pod -n <model> mysql-k8s-0 mysql-k8s-1 on the three node cluster where the primary is include in the delete command.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not firing when all three units are offline

@Deezzir
Copy link
Author

Deezzir commented Aug 29, 2025

Did not manage to trigger MySQLClusterNoPrimary although I did simulate it.

Do you mind sharing the metrics from the exporter at the time of simulation?

@paulomach
Copy link
Contributor

Did not manage to trigger MySQLClusterNoPrimary although I did simulate it.

Do you mind sharing the metrics from the exporter at the time of simulation?

Here:
https://pastebin.canonical.com/p/X7rjZJSThk/

This is queried from the not killed unit, there are about 7 or 8 outputs in sequence, separated by #####

LABELS = {{ $labels }}.

- alert: MySQLClusterNoPrimary
expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to test for an online primary

Suggested change
expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY"})
expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY",member_state="ONLINE"})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants