Moved configuration reloading to a single dedicated thread #5479
Conversation
Force-pushed from 6bc96d8 to 76eb1ed.
@cwperks @kumargu @shikharj05 See here for a draft of the changes for async config updates.

There is one issue which I did not anticipate: many tests rely on the fact that the changed configuration is instantly available after `TransportConfigUpdateAction` has been called and returned. With the async handling, this is no longer the case, so we have a number of test failures at the moment.

Of course, one could go ahead and adapt these tests to use Awaitility or something similar to wait for the updated configuration. But, in the end, this is also a behavioral change, and I am wondering whether we need to preserve the old behavior. Preserving it might be possible by creating an async version of `nodeOperation` in `TransportNodesAction` ... I was always wondering why this was not async; I am not sure if there are any reasons against it.
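For the Awaitility approach mentioned above, a minimal sketch of what such a test adaptation could look like (all type and method names here are hypothetical, for illustration only):

```java
import static org.awaitility.Awaitility.await;

import java.time.Duration;

// Hypothetical helper: with async config updates, a test can no longer assume that the
// new configuration is active right after the update call returns; instead it polls
// until the change becomes visible. SecurityConfigView is a made-up abstraction.
interface SecurityConfigView {
    boolean containsRole(String roleName);
}

class AsyncConfigTestSupport {
    static void waitForRole(SecurityConfigView configView, String roleName) {
        await().atMost(Duration.ofSeconds(10))
            .pollInterval(Duration.ofMillis(100))
            .until(() -> configView.containsRole(roleName));
    }
}
```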
@nibix should we consider keeping the behavior where REST API calls to the security APIs only return to the caller when all nodes have finished updating? If there is an async version of each security API, then I think it needs to be similar to async search, where a task ID is returned that you can then use to get the status of the update.

One thing that was discussed was creating a dedicated thread pool for security config updates (the size of the pool can even be 1 to ensure a single thread is performing updates at any given moment). Plugins can define thread pools like this: #5464 (comment). Then you can use the name of the thread pool in the constructor for `TransportConfigUpdateAction`.
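For reference, a rough sketch of how a plugin can register such a dedicated single-threaded pool (this only illustrates the suggestion above, not what this PR ends up doing; the pool name and queue size are made up, and constructor shapes can differ between OpenSearch versions):

```java
import java.util.List;

import org.opensearch.common.settings.Settings;
import org.opensearch.plugins.Plugin;
import org.opensearch.threadpool.ExecutorBuilder;
import org.opensearch.threadpool.FixedExecutorBuilder;

public class ExampleSecurityPlugin extends Plugin {

    public static final String CONFIG_UPDATE_POOL = "security_config_update"; // made-up name

    @Override
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        // size = 1: at most one config update runs at a time; queueSize bounds waiting updates
        return List.of(
            new FixedExecutorBuilder(settings, CONFIG_UPDATE_POOL, 1, 10, "thread_pool." + CONFIG_UPDATE_POOL)
        );
    }
}
```

The named executor could then presumably be obtained via threadPool.executor(CONFIG_UPDATE_POOL) and handed to the transport action.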
Yes, I am looking into this at the moment.
I do not think thread pools are a good way to go. If you use a thread pool, you do not have any way to deduplicate redundant update requests: if you fire 100 config update REST API calls in quick succession, you will also trigger 100 config updates (provided the thread pool queue size is configured to be that big), even though completing them might take several minutes after the final REST API call has returned. With the solution in this PR, redundant update requests are deduplicated.
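To make the deduplication argument concrete, here is a toy comparison (not code from this PR): submitting N reload tasks to a single-threaded executor still executes all N of them, while a coalescing flag collapses all requests arriving before the next reload into a single run.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ReloadDeduplicationDemo {
    public static void main(String[] args) throws InterruptedException {
        // Thread pool of size 1: every submitted "reload" is executed, one after the other.
        AtomicInteger executedReloads = new AtomicInteger();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 100; i++) {
            pool.submit(executedReloads::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        // Coalescing flag: requests arriving while a reload is already queued are merged.
        AtomicInteger scheduledReloads = new AtomicInteger();
        AtomicBoolean reloadQueued = new AtomicBoolean(false);
        for (int i = 0; i < 100; i++) {
            if (reloadQueued.compareAndSet(false, true)) {
                scheduledReloads.incrementAndGet(); // only the first request schedules a reload
            }
        }

        System.out.println("thread pool executed reloads: " + executedReloads.get()); // 100
        System.out.println("coalesced scheduled reloads:  " + scheduledReloads.get()); // 1
    }
}
```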
This is now implemented by creating an async version of `TransportNodesAction`. Potentially, we can/should move that to the core project.
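Conceptually (simplified, hypothetical signatures, not the actual class introduced here), the difference between the classic and the async variant is where the per-node response is produced:

```java
// Illustrative only: the real TransportNodesAction and ActionListener types in OpenSearch
// carry more generics and transport plumbing than shown here.
interface NodeResponseListener<R> {
    void onResponse(R response);
    void onFailure(Exception e);
}

// Classic style: the thread handling the node request is blocked until the node-level
// work (e.g. a configuration reload) has finished and the response exists.
abstract class SyncNodesAction<Req, Res> {
    protected abstract Res nodeOperation(Req request);
}

// Async style: nodeOperation merely triggers the work (e.g. signals the reload thread)
// and completes the listener once the reload is done, so no thread is held in the meantime.
abstract class AsyncNodesAction<Req, Res> {
    protected abstract void nodeOperation(Req request, NodeResponseListener<Res> listener);
}
```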
Codecov Report

❌ Patch coverage — additional details and impacted files:

@@              Coverage Diff              @@
## main #5479 +/- ##
==========================================
+ Coverage 73.15% 73.29% +0.14%
==========================================
Files 412 412
Lines 26168 26299 +131
Branches 3963 3984 +21
==========================================
+ Hits 19142 19276 +134
+ Misses 5125 5113 -12
- Partials 1901 1910 +9
DarshitChanpura left a comment:
Thank you @nibix for this fix! Left a few questions.
We invested some time into reproducing the issue in order to verify our fixes. Of course, we had to make some assumptions to fill in the unknowns. We reproduced the issue both on a Docker-based cluster and with a Java-based integration test. This is the code for the test:

Our manual tests were executed on OpenSearch 2.19.2; the Java test was executed on a recent OpenSearch version.

Both the manual test and the automatic test follow the same structure:
The output of the test (without full node stats, due to the length):

In the output, the phrase "when 5 were pending" refers to config REST API requests that had been sent but had not received a response yet. Theoretically, we should already get a timeout with 5 pending requests, because the size of the management thread pool is 5. In practice, we seem to need one more, because one of the pending threads finishes while we are waiting for the node stats response, and that thread then becomes free to handle the node stats request.

We then executed the same test with the commit from this PR included. With it, we could not reproduce the timeout, even though we had up to 8 pending config REST API calls:

We also captured thread dumps to check for any deadlocks or other issues. We only found what we expected: in the version without the optimization, all management threads were busy or waiting for the
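For readers without access to the linked test code, the general shape of such a reproduction could look roughly like the sketch below (hypothetical: the endpoints, counts, and the missing TLS/authentication setup are placeholders; this is not the actual test referenced above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: flood the cluster with security config updates (each of which, before this PR,
// can hold a MANAGEMENT thread while waiting for the reload to finish) and then check
// whether an unrelated management-pool API such as node stats still responds in time.
public class ManagementPoolExhaustionSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient(); // TLS/auth setup omitted for brevity
        String baseUrl = "https://localhost:9200";      // placeholder cluster address

        // Fire a number of config-changing requests without waiting for their responses.
        for (int i = 0; i < 10; i++) {
            HttpRequest update = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_plugins/_security/api/internalusers/test_user_" + i))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"password\": \"some-password-" + i + "\"}"))
                .build();
            client.sendAsync(update, HttpResponse.BodyHandlers.ofString());
        }

        // While those are pending, node stats competes for the same management thread pool.
        HttpRequest nodeStats = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/_nodes/stats"))
            .timeout(Duration.ofSeconds(10))
            .GET()
            .build();
        try {
            HttpResponse<String> response = client.send(nodeStats, HttpResponse.BodyHandlers.ofString());
            System.out.println("node stats responded with status " + response.statusCode());
        } catch (java.net.http.HttpTimeoutException e) {
            System.out.println("node stats timed out while config updates were pending");
        }
    }
}
```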
Force-pushed from 4223cb6 to a740b13.
From my side, this is ready to go. I have just rebased the branch and fixed the conflict in the changelog.
Force-pushed from a740b13 to adea1c7.
I have resolved the conflict. Please check again and re-approve if needed :)
Description
This moves the reloading process of the security configuration (including reading the index, parsing the configuration, and activating it) to a dedicated thread. Any code that wants to trigger a configuration reload now only signals that thread to perform the reload.

The main beneficiary is the `TransportConfigUpdateAction`. So far, this action waited on a lock until any ongoing configuration reload had completed. Thus, triggering the `TransportConfigUpdateAction` several times in environments with slow configuration reloads could exhaust the `MANAGEMENT` thread pool and lead to node failures.

The new code tries to minimize the number of reloads: if a reload request is already queued, any further reload request is merged into the one that is already queued.
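To illustrate the general pattern described above (a minimal sketch, not the actual ConfigurationRepository code; all names are made up), a dedicated reload thread with request merging could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: a single dedicated thread performs configuration reloads; callers only
// signal it. Requests arriving while a reload is already queued are merged into that run.
class DedicatedConfigReloader {

    // Completion handlers of requests that will be served by the next reload run.
    private final List<Runnable> queuedCompletionHandlers = new ArrayList<>();
    private boolean reloadQueued = false;

    DedicatedConfigReloader() {
        Thread reloadThread = new Thread(this::reloadLoop, "config-reload");
        reloadThread.setDaemon(true);
        reloadThread.start();
    }

    // Called by any code that wants the configuration reloaded; returns immediately.
    synchronized void requestReload(Runnable onCompleted) {
        queuedCompletionHandlers.add(onCompleted);
        reloadQueued = true;
        notifyAll(); // wake up the reload thread
    }

    private void reloadLoop() {
        while (true) {
            List<Runnable> handlers;
            synchronized (this) {
                while (!reloadQueued) {
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                // Take everything queued so far; requests arriving during the reload
                // will simply trigger another run afterwards.
                handlers = new ArrayList<>(queuedCompletionHandlers);
                queuedCompletionHandlers.clear();
                reloadQueued = false;
            }
            reloadConfigurationFromIndex(); // hypothetical: read index, parse, activate
            handlers.forEach(Runnable::run);
        }
    }

    private void reloadConfigurationFromIndex() {
        // placeholder for the actual reload work
    }
}
```

With this shape, any number of requestReload calls issued while a reload is running or queued result in exactly one additional reload run.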
Issues Resolved
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.