
Conversation

@Gonlo2

@Gonlo2 Gonlo2 commented Sep 3, 2025

This PR resolves several race conditions in the MySQL Monitor and HostGroups code, mostly focused on Galera.

A clear description of the issue

In several parts of the code, it is assumed that releasing and reacquiring a lock between two protected sections is equivalent to having a single protected section. This is not the case, and it causes unpredictable behavior, in the worst case marking the entire cluster as offline. This PR acquires the relevant locks throughout the protected section; although this leads to longer lock hold times, it prevents ProxySQL from remaining in an inconsistent state if a node is marked as inactive and then active again. This PR is a rebase of most of the changes included in my #3063 PR, but using the latest version of ProxySQL.
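
To make the failure mode concrete, here is a minimal sketch (illustrative C++ only; the function and lock names are assumptions, not the actual ProxySQL code paths) of the difference between dropping the lock between two related steps and holding it across both:

```cpp
// Minimal sketch of the race described above; names are illustrative and
// do not correspond to the actual ProxySQL code.
#include <mutex>

static std::mutex servers_mutex;

// Buggy pattern: the lock is released between two logically related steps,
// so another thread can observe the intermediate state in which a node has
// been removed but not yet re-added, and may conclude the cluster is offline.
void update_node_status_buggy() {
    {
        std::lock_guard<std::mutex> guard(servers_mutex);
        // step 1: mark the node as inactive / remove it from the hostgroup
    }
    // <-- window: concurrent readers see the node missing here
    {
        std::lock_guard<std::mutex> guard(servers_mutex);
        // step 2: re-add the node with its updated status
    }
}

// Pattern applied by this PR: hold the lock across both steps so the
// intermediate state is never visible to other threads.
void update_node_status_fixed() {
    std::lock_guard<std::mutex> guard(servers_mutex);
    // step 1: mark the node as inactive / remove it
    // step 2: re-add the node with its updated status
}
```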

ProxySQL version

v3.0.2

The steps to reproduce the issue

Since this is a rebase of changes I made about 5 years ago, I don't remember exactly in which cases this error occurred, but it certainly prevented ProxySQL from recovering if any node in the cluster had gone down and then come back up.


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @Gonlo2, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and reliability of ProxySQL's MySQL Monitor and HostGroups by eliminating long-standing data race conditions. The changes ensure that critical operations on hostgroup configurations are performed under consistent and comprehensive locking, preventing scenarios where the system could enter an inconsistent state or incorrectly mark an entire cluster as offline. The focus is on robust concurrency control to maintain data integrity and operational continuity.

Highlights

  • Race Condition Resolution: This pull request addresses and resolves several critical race conditions within the MySQL Monitor and HostGroups components, particularly those affecting Galera clusters. The previous logic of releasing and reacquiring locks between protected sections led to unpredictable behavior and potential cluster outages.
  • Consistent Locking Strategy: The core fix involves refactoring the locking mechanisms to ensure that locks are acquired and held throughout the entire duration of critical sections. This prevents inconsistent states, especially when nodes transition between active and inactive statuses.
  • API Refinement for Locking: New _locked variants of existing methods (e.g., servers_add_locked, commit_locked, save_incoming_mysql_table_locked) have been introduced. These methods are designed to be called when the necessary locks are already held, promoting a clearer and safer locking pattern within the codebase.
  • Centralized Lock Management: The GloAdmin->mysql_servers_wrlock() and wrlock() calls have been consolidated to the entry points of operations that modify shared data, such as the commit() function, reducing the risk of missed locks or premature unlocks in various helper functions. A sketch of this pattern follows these highlights.
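
To illustrate the `_locked` pattern described in the last two highlights, here is a minimal sketch. The names `commit()` and `commit_locked()` come from this PR's summary; the class name, signatures, and rwlock usage shown here are assumptions, not the real ProxySQL implementation:

```cpp
// Sketch of the "_locked variant" pattern: public entry points take the
// lock, while *_locked helpers assume it is already held.
#include <pthread.h>

class HostGroups_Manager_Sketch {
    pthread_rwlock_t rwlock;
public:
    HostGroups_Manager_Sketch() { pthread_rwlock_init(&rwlock, nullptr); }
    ~HostGroups_Manager_Sketch() { pthread_rwlock_destroy(&rwlock); }

    // Public entry point: acquires the write lock once, then delegates.
    bool commit() {
        pthread_rwlock_wrlock(&rwlock);
        bool ret = commit_locked();
        pthread_rwlock_unlock(&rwlock);
        return ret;
    }

    // Precondition: the caller already holds the write lock. Helpers that
    // run inside an already-locked section call this variant directly, so
    // the lock is never dropped and re-acquired mid-operation.
    bool commit_locked() {
        // ... apply pending hostgroup/server changes ...
        return true;
    }
};
```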

@sonarqubecloud

sonarqubecloud bot commented Sep 3, 2025

Quality Gate failed

Failed conditions
25.0% Duplication on New Code (required ≤ 3%)
B Maintainability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


@renecannao
Contributor

Can one of the admins verify this patch?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses several race conditions within the MySQL Monitor and HostGroups modules by implementing a more robust locking strategy. The changes systematically widen the scope of locks to ensure that complex operations are fully atomic, preventing the inconsistent states that could arise from releasing and re-acquiring locks mid-operation. The introduction of _locked function variants is a clean and maintainable approach that properly separates the core logic from the concurrency control mechanisms. The removal of the more granular galera_set_writer_mutex in favor of these broader, more encompassing locks is consistent with the stated goal. These changes are a significant improvement for the stability and correctness of the system.

@noizu added the status/triage label Sep 30, 2025
@noizu assigned noizu and JavierJF and unassigned noizu Oct 14, 2025
@noizu
Collaborator

noizu commented Oct 14, 2025

@JavierJF please review

@JavierJF
Collaborator

Hi @Gonlo2,

First, thanks for the PR. We were aware of the potential issues between the sections you mention; while it is possible for some clusters and not for others, the scenario described is very unlikely. That said, we will address the issue, but there are several concerns about the approach taken in this PR:

  1. Contention increase: locks, especially write locks, should be taken only when needed. With these changes, every monitoring action would result in a global lock on the HostgroupManager servers. This is something to avoid, especially since the checks currently rely on SQLite3 queries.
  2. Change in lock meaning: wrlock is the global hostgroup manager lock, while mysql_servers_lock is a specific one. Ideally, the critical sections imposed by wrlock should be smaller than the ones imposed by mysql_servers_lock; otherwise, the lock specialization loses its meaning. Why have specialized locks if the general lock ends up covering all the scenarios? (A sketch of this concern follows.)
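
A minimal sketch of the contention concern; the member names and structure are purely illustrative and are not the actual ProxySQL locks:

```cpp
// If every monitoring check takes the general hostgroup-manager write lock,
// all monitor threads and all configuration paths serialize on one lock,
// instead of only the paths that actually touch the servers data.
#include <shared_mutex>

struct HostgroupManagerSketch {
    std::shared_mutex general_lock;  // "wrlock": guards the whole manager
    std::shared_mutex servers_lock;  // specialized: guards only servers data

    // Broad locking (the approach in this PR): simple and safe, but every
    // monitoring action contends with every other manager operation.
    void monitor_check_broad() {
        std::unique_lock<std::shared_mutex> guard(general_lock);
        // ... read/update server status ...
    }

    // Specialized locking (the intent behind having separate locks):
    // monitoring only contends with other users of the servers data.
    void monitor_check_specific() {
        std::unique_lock<std::shared_mutex> guard(servers_lock);
        // ... read/update server status ...
    }
};
```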

Due to the previous concerns, and after considering a couple of alternatives, we have decided that we will address this issue in a refactor, in which the dependency on SQLite3 will be removed for cluster monitoring. That progress will be tracked as part of #5164. Again, thanks for the contribution.

Best regards, Javier.

@Gonlo2
Author

Gonlo2 commented Oct 28, 2025

> Hi @Gonlo2,
>
> First, thanks for the PR. We were aware of the potential issues between the sections you mention; while it is possible for some clusters and not for others, the scenario described is very unlikely. That said, we will address the issue, but there are several concerns about the approach taken in this PR:

I should mention that I've had to look into this because this behavior has affected us in production several times, so I don't think it's that unusual in an environment like this.

>   1. Contention increase: locks, especially write locks, should be taken only when needed. With these changes, every monitoring action would result in a global lock on the HostgroupManager servers. This is something to avoid, especially since the checks currently rely on SQLite3 queries.

I don't like this solution either, but it's a simple change, and the alternative is a load balancer that can leave you stranded in the event of an error and requires human intervention to fix, which, in a production environment, is not a viable approach.

>   2. Change in lock meaning: wrlock is the global hostgroup manager lock, while mysql_servers_lock is a specific one. Ideally, the critical sections imposed by wrlock should be smaller than the ones imposed by mysql_servers_lock; otherwise, the lock specialization loses its meaning. Why have specialized locks if the general lock ends up covering all the scenarios?

I don't know what you mean by mysql_servers_lock. Could you elaborate a little on your answer? I'm not an expert on ProxySQL, and I opened this PR two months ago, so I don't remember the details anymore.

> Due to the previous concerns, and after considering a couple of alternatives, we have decided that we will address this issue in a refactor, in which the dependency on SQLite3 will be removed for cluster monitoring. That progress will be tracked as part of #5164. Again, thanks for the contribution.

I understand that it is not a clean approach, but the intention is to have a functional proxy. Since this is an update of PR #3063, created in 2020, I am concerned that the #5164 refactoring will be given low priority and will not be carried out in the short term. Would it be possible to at least set aside the fact that having so much code inside locks is bad practice, and review it from the point of view that, without this change, it is not viable to run ProxySQL in a production environment for fear that it will stop routing traffic?

