Fix data races in the MySQL Monitor and HostGroups #5107
base: v3.0
Conversation
Summary of Changes
Hello @Gonlo2, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the stability and reliability of ProxySQL's MySQL Monitor and HostGroups by eliminating long-standing data race conditions. The changes ensure that critical operations on hostgroup configurations are performed under consistent and comprehensive locking, preventing scenarios where the system could enter an inconsistent state or incorrectly mark an entire cluster as offline. The focus is on robust concurrency control to maintain data integrity and operational continuity.
Highlights
- Race Condition Resolution: This pull request addresses and resolves several critical race conditions within the MySQL Monitor and HostGroups components, particularly those affecting Galera clusters. The previous logic of releasing and reacquiring locks between protected sections led to unpredictable behavior and potential cluster outages.
- Consistent Locking Strategy: The core fix involves refactoring the locking mechanisms to ensure that locks are acquired and held throughout the entire duration of critical sections. This prevents inconsistent states, especially when nodes transition between active and inactive statuses.
- API Refinement for Locking: New `_locked` variants of existing methods (e.g., `servers_add_locked`, `commit_locked`, `save_incoming_mysql_table_locked`) have been introduced. These methods are designed to be called when the necessary locks are already held, promoting a clearer and safer locking pattern within the codebase.
- Centralized Lock Management: The `GloAdmin->mysql_servers_wrlock()` and `wrlock()` calls have been consolidated to the entry points of operations that modify shared data, such as the `commit()` function, reducing the risk of missed locks or premature unlocks in various helper functions (see the sketch below).
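To make the pattern concrete, here is a minimal C++ sketch of the `_locked` variant approach. The class and method names below are simplified illustrations, not the actual ProxySQL signatures:

```cpp
// Illustrative sketch only (simplified names, not the real ProxySQL classes):
// the public method owns lock acquisition, while the corresponding _locked
// variant assumes the caller already holds the lock.
#include <mutex>
#include <string>
#include <vector>

struct Server { std::string hostname; int port; };

class HostGroupsManager {
public:
    // Entry point: acquires the lock once for the whole operation.
    void commit() {
        std::lock_guard<std::mutex> guard(wrlock_);
        commit_locked();
    }

    // Composite operation: both steps run under a single lock acquisition,
    // with no unlock/relock window between them.
    void servers_add_then_commit(const std::vector<Server>& servers) {
        std::lock_guard<std::mutex> guard(wrlock_);
        servers_add_locked(servers);
        commit_locked();
    }

private:
    // _locked variants perform no locking of their own; holding wrlock_
    // is a documented precondition.
    void commit_locked() { /* flush incoming config to runtime ... */ }
    void servers_add_locked(const std::vector<Server>& /*servers*/) { /* ... */ }

    std::mutex wrlock_;
};
```

The key property is that composite operations never release the lock between steps, which is what consolidating lock acquisition at entry points such as `commit()` relies on.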
Can one of the admins verify this patch?
Code Review
This pull request effectively addresses several race conditions within the MySQL Monitor and HostGroups modules by implementing a more robust locking strategy. The changes systematically widen the scope of locks to ensure that complex operations are fully atomic, preventing the inconsistent states that could arise from releasing and re-acquiring locks mid-operation. The introduction of `_locked` function variants is a clean and maintainable approach that properly separates the core logic from the concurrency control mechanisms. The removal of the more granular `galera_set_writer_mutex` in favor of these broader, more encompassing locks is consistent with the stated goal. These changes are a significant improvement to the stability and correctness of the system.
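To illustrate that last point, here is a hedged sketch of the consolidation; apart from the `galera_set_writer_mutex` name taken from the review above, every identifier below is hypothetical and the logic is a simplification, not the real ProxySQL code:

```cpp
// Hypothetical sketch: a narrow per-transition mutex is dropped in favor of
// performing the transition under the broader lock that the enclosing
// operation already holds.
#include <mutex>

std::mutex hostgroups_wrlock;  // broad lock guarding hostgroup state

// Before: a dedicated mutex guarded only the writer switch, leaving a
// window between it and the surrounding hostgroup updates.
// std::mutex galera_set_writer_mutex;  // removed in favor of the broad lock

void set_galera_writer_locked(int node_id) {
    // Precondition: hostgroups_wrlock is held by the caller, so the writer
    // switch and the related hostgroup updates form one atomic unit.
    (void)node_id; /* ... update writer hostgroup membership ... */
}

void handle_monitor_event(int node_id) {
    std::lock_guard<std::mutex> guard(hostgroups_wrlock);
    set_galera_writer_locked(node_id);
    /* ... remaining updates under the same critical section ... */
}
```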
@JavierJF please review
Hi @Gonlo2, first of all, thanks for the PR. We were aware of the potential issues between the sections you mention, and while the scenario you describe is possible for some clusters and not for others, it is very unlikely. That said, we will address the issue, but there are several concerns about the approach taken in the PR:

Due to the previous concerns, and after considering a couple of alternatives, we have decided to address this issue in a refactor, in which the dependency of

Best regards, Javier.
I should mention that I've had to look into this because this behavior has affected us in production several times, so I don't think it's that unusual in an environment like this.
I don't like this solution either, but it's a simple change, and the alternative is a load balancer that can leave you stranded in the event of an error and requires human intervention to fix, which, in a production environment, is not a viable approach.
I don't know what you mean by
I understand that it is not a clean approach, but the intention is to have a functional proxy, and since this is an update of PR #3063, created in 2020, I am concerned that the #5164 refactoring will be given low priority and not carried out in the short term. Would it be possible to at least set aside the fact that having so much code inside locks is bad practice, and review it from the point of view that, without this, running ProxySQL in a production environment is not viable, for fear that it will stop routing traffic?




This PR resolves several race conditions in the MySQL Monitor and HostGroups code, mostly focused on Galera.
A clear description of the issue
In several parts of the code, it is assumed that releasing and reacquiring a lock between two protected sections is equivalent to a single protected section, but this is not the case: it causes unpredictable behavior and, in the worst case, marks the entire cluster as offline. This PR acquires these locks for the whole duration of the protected section, and although this leads to longer lock hold times, it prevents ProxySQL from remaining in an inconsistent state if a node is marked as inactive and then active again. This PR is a rebase of most of the changes included in my PR #3063, updated to the latest version of ProxySQL.
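The following minimal C++ sketch (hypothetical code, not taken from ProxySQL) shows why the two patterns are not equivalent: a concurrent reader can run in the window between the two critical sections and observe the intermediate state.

```cpp
// Hypothetical sketch: why two back-to-back critical sections are not
// equivalent to one.
#include <mutex>
#include <set>

std::mutex m;
std::set<int> online_nodes = {1, 2, 3};

// Buggy pattern: node 1 briefly disappears from the set between the two
// protected sections. A concurrent reader running in the gap can observe
// the partial state and, in the worst case, treat the cluster as offline.
void flap_node_buggy() {
    { std::lock_guard<std::mutex> g(m); online_nodes.erase(1); }
    // <-- race window: the lock is released here
    { std::lock_guard<std::mutex> g(m); online_nodes.insert(1); }
}

// Fixed pattern: both transitions happen under a single lock acquisition,
// so no other thread can ever observe the intermediate "node missing" state.
void flap_node_fixed() {
    std::lock_guard<std::mutex> g(m);
    online_nodes.erase(1);
    online_nodes.insert(1);
}
```

In the fixed variant the intermediate state is never visible to other threads, which is the property this PR enforces by holding the locks across the whole protected section.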
ProxySQL version
v3.0.2
The steps to reproduce the issue
Since this is a rebase of some changes that I made about 5 years ago, I don't remember in which cases this error occurred, but it surely prevented ProxySQL from recovering after a node in the cluster had gone down and come back up.