Improve CrossClusterAsyncQueryStopIT test resiliency #122219
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hey @smalyshev! I'd like to re-label this PR with The reason is that these PR labels determine what the CI bot will use to label test failures; and it's probably better if the CI bot labels failures in cross cluster tests with a label that pings you directly rather than requiring manual relabeling, or us pinging you by other means. I hope that's okay - feel free to undo my relabeling if you'd like to keep the original labels.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
quux00
left a comment
Questions left.
        .filter(t -> t.description().contains("_LuceneSourceOperator") == false)
        .toList();
    assertThat(reduceTasks, empty());
});
I know this assertBusy code isn't changed, but I don't think I understand what it is doing. I think this is waiting for the drivers to be cancelled and for all the Lucene operations to no longer be present? If so, can you add a comment to that effect to help later readers understand it?
TBH I don't fully understand it myself, it's Nhat's code, but it looks like it checks that the drivers don't have any other tasks except _LuceneSourceOperator. I am not 100% sure why _LuceneSourceOperator is excluded though. I personally don't feel confident enough with this to comment it, but this is not the purpose of this patch.
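For readers trying to follow the check discussed above, here is a minimal stand-alone sketch of the general pattern: retry an assertion until it passes or a timeout expires, where the assertion says that no task descriptions remain other than those mentioning `_LuceneSourceOperator`. This is not the Elasticsearch test framework's `assertBusy`; the class name, the helper names, and the use of plain strings for task descriptions are all assumptions made for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class BusyWaitSketch {
    // Hypothetical helper: keep only descriptions that do NOT mention _LuceneSourceOperator.
    static List<String> nonSourceDescriptions(List<String> descriptions) {
        return descriptions.stream()
            .filter(d -> d.contains("_LuceneSourceOperator") == false)
            .collect(Collectors.toList());
    }

    // Hypothetical retry loop: re-run the check until it stops throwing AssertionError
    // or the timeout elapses, then rethrow the last failure.
    static void retryUntilPasses(Runnable check, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        AssertionError last = null;
        do {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError e) {
                last = e; // remember the failure and try again
            }
            Thread.sleep(10);
        } while (System.nanoTime() < deadline);
        throw last;
    }
}
```

Under these assumptions, the check passes as soon as every remaining description mentions `_LuceneSourceOperator`, which matches the reading given in the comment above.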
    }
} finally {
    // Ensure proper cleanup if the test fails
    CountingPauseFieldPlugin.allowEmitting.countDown();
Is this the only substantive change in this PR? It's hard to tell what is new and what is not, as even the side-by-side view in GH is not helping.
Yes, this makes it so if the test fails, the locks are still removed and the async request is cleared. I've got some very confusing rare failures in the logs, and I suspect this is caused by one test's setup leaking into another. So I want to make it cleaner.
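To illustrate the idea, here is a rough self-contained sketch of the cleanup pattern, not the actual test code. The `CountDownLatch` stands in for `CountingPauseFieldPlugin.allowEmitting`, and the `asyncQueryDeleted` flag is a hypothetical placeholder for deleting the async query; the class and method names are assumptions as well.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class CleanupSketch {
    // Hypothetical stand-in for CountingPauseFieldPlugin.allowEmitting.
    static final CountDownLatch allowEmitting = new CountDownLatch(1);
    // Hypothetical flag standing in for "the async query was deleted".
    static final AtomicBoolean asyncQueryDeleted = new AtomicBoolean(false);

    static void runTestBody(Runnable body) {
        try {
            body.run();
        } finally {
            // Runs whether the body passed or threw: release anything waiting on the latch.
            // countDown() on a latch that is already at zero has no effect.
            allowEmitting.countDown();
            // Placeholder for removing the async query so it cannot leak into another test.
            asyncQueryDeleted.set(true);
        }
    }
}
```

With this shape, a failing assertion inside the body still propagates to the test runner, but the latch is released and the query is cleaned up first.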
quux00
left a comment
LGTM
💚 Backport successful
Force the lock to be reset when exiting the test method, and ensure the async query is always deleted, even on failure.