Avoid restarting data stream reindex when cluster is upgraded #125587
Merged: masseyke merged 16 commits into elastic:main from masseyke:data-stream-reindex-no-restart-on-upgrade on Mar 25, 2025.
Commits (16)
87fc468 Avoid restarting data stream reindex when cluster is upgraded (masseyke)
5f61f74 fixing a compilation error (masseyke)
231de38 killing task if it is already complete on a fresh node (masseyke)
12919f4 [CI] Auto commit changes from spotless
9b3ec98 scheduling task for future removal (masseyke)
b0fce5c Merge branch 'data-stream-reindex-no-restart-on-upgrade' of github.co… (masseyke)
1b5184c using state we already have to determine isComplete (masseyke)
d785813 Merge branch 'main' into data-stream-reindex-no-restart-on-upgrade (masseyke)
69f1ef8 minor fix to DataStreamsUpgradeIT (masseyke)
e091420 making sure that we cancel tasks (masseyke)
744870a Merge branch 'data-stream-reindex-no-restart-on-upgrade' of github.co… (masseyke)
deba319 avoiding forbidden method (masseyke)
1a7a56e Merge branch 'main' into data-stream-reindex-no-restart-on-upgrade (masseyke)
e1ff8a0 making sure time is not negative (masseyke)
02a1e64 fixing getTimeToLive (masseyke)
1c215ee removing unnecessary null check (masseyke)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Don't you have to schedule the completeTask for removal here, so the task is cleaned up after 24h?
Ah, good point -- the scheduling of the removal was on the threadpool on the old node, which won't carry over to this node. But I think I can just do task.markAsCompleted(); here. The user has already upgraded, meaning they know that it completed. And the information is really out of date at this point anyway (because the data stream is "old" for this new cluster).
If a node is terminated and the task moves to a new node (after completion), we wouldn't remove the task metadata otherwise.
Also note, the ScheduledCancellable returned by the line below keeps running on the old node in this case, making the behavior a bit indeterministic. As far as I remember, completeAndNotifyIfNeeded might throw if completed multiple times in some cases.
I would guess that the task would only be moved if the node it was on came down altogether, threadpool included, right? Regardless, though, the exception will be harmless, won't it?
I'll change it to schedule the removal in the future so that it continues to work once users intentionally run this on 9.x (I don't think they will want to until we're approaching 10.0).
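The "schedule the removal in the future" idea, together with the later commits "making sure time is not negative" and "fixing getTimeToLive", amounts to computing the remaining slice of the 24h retention window and clamping it at zero. A minimal sketch, assuming a 24h retention constant and an illustrative class/method name (this is not the actual PR code):

```java
import java.time.Instant;

// Illustrative sketch only: when a completed task lands on a fresh node,
// the removal scheduled on the old node's threadpool is lost, so the new
// node re-schedules it using whatever is left of the retention window,
// clamped so the delay is never negative.
public final class ReindexTaskTtl {
    static final long RETENTION_MILLIS = 24L * 60 * 60 * 1000; // assumed 24h window

    // Remaining time-to-live in ms; never negative.
    static long getTimeToLive(long completionTimeMillis, long nowMillis) {
        return Math.max(0L, completionTimeMillis + RETENTION_MILLIS - nowMillis);
    }

    public static void main(String[] args) {
        long now = Instant.now().toEpochMilli();
        // Completed 1h ago: 23h of the window remain.
        System.out.println(getTimeToLive(now - 3_600_000L, now));
        // Completed 25h ago (long outage before reassignment): clamp to 0.
        System.out.println(getTimeToLive(now - 25L * 3_600_000L, now));
    }
}
```

Clamping rather than passing a raw difference matters because a negative delay handed to a scheduler would either throw or fire with surprising timing.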
Actually, rethinking this, the task might also be moved for other reasons (a node might crash any time), we can't assume the upgrade has happened just because the task was moved. Not sure how important it is to keep the old state around for 24h, but it might lead to surprising results this way.
Yeah this change definitely doesn't assume the task was moved just because it's on a new node -- it is explicitly checking that the completion time is non-null, meaning it was completed on some other node.
It only keeps the task for 24h beyond whenever it completes. If a node crashes before it completes, it will not have been scheduled for removal.
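The distinction drawn in the last few comments (finished on some node vs. crashed mid-run, signalled by whether the persisted completion time is non-null) can be sketched as a small decision helper. All names here are hypothetical, not the PR's code:

```java
// Hypothetical sketch of the reassignment decision discussed above.
public final class ReassignmentDecision {
    static final long RETENTION_MILLIS = 24L * 60 * 60 * 1000; // assumed 24h window

    // A null completion time means the task never finished (e.g. the node
    // crashed mid-run), so no removal was ever scheduled and the work must
    // continue. A non-null completion time means it finished on some node,
    // so the fresh node only schedules cleanup for the remaining window.
    static String onReassigned(Long completionTimeMillis, long nowMillis) {
        if (completionTimeMillis == null) {
            return "continue-task";
        }
        long ttl = Math.max(0L, completionTimeMillis + RETENTION_MILLIS - nowMillis);
        return ttl == 0L ? "remove-now" : "schedule-removal";
    }

    public static void main(String[] args) {
        System.out.println(onReassigned(null, 0L));                 // continue-task
        System.out.println(onReassigned(0L, RETENTION_MILLIS + 1)); // remove-now
        System.out.println(onReassigned(0L, 1L));                   // schedule-removal
    }
}
```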
Yes, this is only ever a problem if a node crashes after completion, within those 24h.
👍 looks like that's already addressed