Avoid restarting data stream reindex when cluster is upgraded #125587
Merged: masseyke merged 16 commits into elastic:main from masseyke:data-stream-reindex-no-restart-on-upgrade on Mar 25, 2025
Commits:
87fc468 Avoid restarting data stream reindex when cluster is upgraded (masseyke)
5f61f74 fixing a compilation error (masseyke)
231de38 killing task if it is already complete on a fresh node (masseyke)
12919f4 [CI] Auto commit changes from spotless
9b3ec98 scheduling task for future removal (masseyke)
b0fce5c Merge branch 'data-stream-reindex-no-restart-on-upgrade' of github.co… (masseyke)
1b5184c using state we already have to determine isComplete (masseyke)
d785813 Merge branch 'main' into data-stream-reindex-no-restart-on-upgrade (masseyke)
69f1ef8 minor fix to DataStreamsUpgradeIT (masseyke)
e091420 making sure that we cancel tasks (masseyke)
744870a Merge branch 'data-stream-reindex-no-restart-on-upgrade' of github.co… (masseyke)
deba319 avoiding forbidden method (masseyke)
1a7a56e Merge branch 'main' into data-stream-reindex-no-restart-on-upgrade (masseyke)
e1ff8a0 making sure time is not negative (masseyke)
02a1e64 fixing getTimeToLive (masseyke)
1c215ee removing unnecessary null check (masseyke)
Don't you have to schedule the `completeTask` for removal here, so the task is cleaned up after 24h?
Ah, good point -- the scheduling of the removal was on the threadpool on the old node, which won't carry over to this node. But I think I can just do `task.markAsCompleted();` here. The user has already upgraded, meaning they know that it completed. And the information is really out of date at this point anyway (because the data stream is "old" for this new cluster).
If a node is terminated and the task moves to a new node (after completion), we wouldn't remove the task metadata otherwise.
Also note, the `ScheduledCancellable` returned by the line below keeps running on the old node in this case, making the behavior a bit nondeterministic. As far as I remember, `completeAndNotifyIfNeeded` might throw if completed multiple times in some cases.
I would guess that the task would only be moved if the node it was on came down altogether, threadpool included, right? Regardless though, the exception will be harmless won't it?
I'll change it to schedule the removal in the future so that it continues to work once users intentionally run this on 9.x (I don't think they will want to until we're approaching 10.0).
Actually, rethinking this, the task might also be moved for other reasons (a node might crash at any time), so we can't assume the upgrade has happened just because the task was moved. Not sure how important it is to keep the old state around for 24h, but it might lead to surprising results this way.
Yeah this change definitely doesn't assume the task was moved just because it's on a new node -- it is explicitly checking that the completion time is non-null, meaning it was completed on some other node.
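For readers following the thread, the check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `StartupDecision`, the enum `Action`, and the method `onTaskAssigned` are hypothetical names, and the only behavior taken from the discussion is that a non-null completion time means the reindex finished on some other node and should not be restarted.

```java
// Hypothetical sketch; none of these names come from the Elasticsearch code base.
final class StartupDecision {

    enum Action {
        RUN_REINDEX,           // no completion time recorded: the work still has to be done
        SCHEDULE_REMOVAL_ONLY  // already completed elsewhere: only clean up later
    }

    // completionTimeMillis is null until the task has completed on some node.
    static Action onTaskAssigned(Long completionTimeMillis) {
        return completionTimeMillis == null
            ? Action.RUN_REINDEX
            : Action.SCHEDULE_REMOVAL_ONLY;
    }
}
```

The decision depends only on persisted state, not on which node the task landed on, which is the point made in the comment above.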
It only keeps the task for 24h beyond whenever it completes. If a node crashes before it completes, it will not have been scheduled for removal.
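A rough sketch of how the remaining retention window might be computed when a completed task lands on a new node. The 24h retention and the need to keep the delay non-negative come from this thread and the commit messages; the class `TaskRetention` and the method `remainingTimeToLive` are hypothetical and are not the PR's actual `getTimeToLive` implementation.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; names are illustrative, not taken from the PR.
final class TaskRetention {

    // Retention period after completion, as described in the discussion.
    static final Duration RETENTION = Duration.ofHours(24);

    // Returns how long the completed task should still be kept,
    // clamped to zero so a scheduler never receives a negative delay.
    static Duration remainingTimeToLive(Instant completionTime, Instant now) {
        Duration elapsed = Duration.between(completionTime, now);
        Duration remaining = RETENTION.minus(elapsed);
        return remaining.isNegative() ? Duration.ZERO : remaining;
    }
}
```

With this shape, a task that completed 5 hours ago would be kept for another 19 hours, and one that completed more than 24 hours ago would be eligible for removal immediately.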
Yes, this is only ever a problem if a node crashes after completion within those 24h
👍 looks like that's already addressed