Avoid restarting data stream reindex when cluster is upgraded #125587
Conversation
Pinging @elastic/es-data-management (Team:Data Management)
```java
    ReindexDataStreamTaskParams params,
    PersistentTaskState persistentTaskState
) {
    if (isComplete(persistentTaskState)) {
```
Don't you have to schedule the completeTask for removal here, so the task is cleaned up after 24h?
Ah, good point -- the scheduling of the removal was on the threadpool on the old node, which won't carry over to this node. But I think I can just do task.markAsCompleted(); here. The user has already upgraded, meaning they know that it completed. And the information is really out of date at this point anyway (because the data stream is "old" for this new cluster).
If a node is terminated and the task moves to a new node (after completion), we wouldn't remove the task metadata otherwise.
Also note, the ScheduledCancellable returned by below line keeps running on the old node in this case, making the behavior a bit indeterministic. As far as I remember completeAndNotifyIfNeeded might throw if completed multiple times in some cases.
```java
threadPool.schedule(completeTask, timeToLive, threadPool.generic())
```
I would guess that the task would only be moved if the node it was on came down altogether, threadpool included, right? Regardless though, the exception will be harmless, won't it?
I'll change it to schedule the removal in the future so that it continues to work once users intentionally run this on 9.x (I don't think they will want to until we're approaching 10.0).
> The user has already upgraded, meaning they know that it completed. And the information is really out of date at this point anyway (because the data stream is "old" for this new cluster).
Actually, rethinking this, the task might also be moved for other reasons (a node might crash any time), we can't assume the upgrade has happened just because the task was moved. Not sure how important it is to keep the old state around for 24h, but it might lead to surprising results this way.
Yeah this change definitely doesn't assume the task was moved just because it's on a new node -- it is explicitly checking that the completion time is non-null, meaning it was completed on some other node.
It only keeps the task for 24h beyond whenever it completes. If a node crashes before it completes, it will not have been scheduled for removal.
Yes, this is only ever a problem if a node crashes after completion within those 24h
👍 looks like that's already addressed
...plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/task/ReindexDataStreamTask.java
…m:masseyke/elasticsearch into data-stream-reindex-no-restart-on-upgrade
LGTM 🎉
Looks good :-) 👍🏻
…m:masseyke/elasticsearch into data-stream-reindex-no-restart-on-upgrade
```java
private Long getCompletionTime(PersistentTaskState persistentTaskState) {
    if (persistentTaskState == null) {
        return null;
    } else {
        if (persistentTaskState instanceof ReindexDataStreamPersistentTaskState state) {
            return state.completionTime();
        } else {
            return null;
        }
    }
}
```
Nit: I don't think you need a null check prior to calling instanceof
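The nit holds because `instanceof` is defined to evaluate to false when its operand is null, so the explicit null check adds nothing. A standalone sketch (using a stand-in record rather than the real `ReindexDataStreamPersistentTaskState` class) of the simplified form:

```java
// Standalone sketch, not the Elasticsearch classes: `instanceof` already
// yields false for null, so the explicit null check is redundant.
public class InstanceofNullDemo {
    // Stand-in for ReindexDataStreamPersistentTaskState, holding a completion time.
    record TaskState(Long completionTime) {}

    // Simplified form of getCompletionTime: the pattern match covers the null case.
    static Long getCompletionTime(Object persistentTaskState) {
        if (persistentTaskState instanceof TaskState state) {
            return state.completionTime();
        }
        return null;
    }

    public static void main(String[] args) {
        assert getCompletionTime(null) == null;           // null never matches instanceof
        assert getCompletionTime("not a state") == null;  // wrong type also returns null
        assert getCompletionTime(new TaskState(42L)) == 42L;
        System.out.println("ok");
    }
}
```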
```java
private TimeValue getTimeToLive(long completionTimeInMillis) {
    return TimeValue.timeValueMillis(
        TASK_KEEP_ALIVE_TIME.millis() - Math.max(
```
Should this be Math.min?
Yep, thanks!
LGTM!
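The snippet in the PR is truncated, so the full expression here is an assumption, but it illustrates why `Math.min` is the right call: capping the elapsed time at the keep-alive window keeps the remaining TTL non-negative, whereas `Math.max(keepAlive, elapsed)` is always at least the keep-alive, making `keepAlive - max(...)` zero or negative for every input. All names below are simplified stand-ins:

```java
// Hedged sketch of the TTL computation under discussion; the real method
// returns a TimeValue, this stand-in works in raw milliseconds.
public class TtlSketch {
    static final long TASK_KEEP_ALIVE_MILLIS = 24L * 60 * 60 * 1000; // 24h, per the PR description

    // Remaining time-to-live for a completed task, given "now" and the completion time.
    static long timeToLiveMillis(long nowMillis, long completionTimeMillis) {
        long elapsed = nowMillis - completionTimeMillis;
        // Math.min caps elapsed at the keep-alive window, so the result is in [0, 24h].
        // With Math.max, max(24h, elapsed) >= 24h, so the subtraction could never be positive.
        return TASK_KEEP_ALIVE_MILLIS - Math.min(TASK_KEEP_ALIVE_MILLIS, elapsed);
    }

    public static void main(String[] args) {
        assert timeToLiveMillis(1_000L, 0L) == TASK_KEEP_ALIVE_MILLIS - 1_000L; // 1s elapsed
        assert timeToLiveMillis(2 * TASK_KEEP_ALIVE_MILLIS, 0L) == 0L;          // long past: clamps to 0
        System.out.println("ok");
    }
}
```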
💔 Backport failed
You can use sqren/backport to manually backport by running
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
This prevents a node from restarting a completed data stream reindex after the cluster has been upgraded from 8.x.
The persistent task for a data stream reindex is kept alive for 24 hours so that the user has time to check its status. When a cluster is upgraded, the persistent task moves to a new node. It currently does not check whether it was previously completed. So the persistent task running on (for example) 9.0 sees that the data stream contains nothing but "old" (8.18) indices, and unnecessarily begins reindexing all of those so that they are 9.0 indices.
This creates unnecessary work on the cluster. Worse, the second reindexing gives the backing indices names that do not match the naming pattern expected for system data stream backing indices, leading to a variety of problems when querying data.
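The shape of the fix described above can be sketched as a guard at the start of the task's node operation: if the persisted state already carries a completion time, the reindex is not restarted; only the cleanup is rescheduled for whatever remains of the 24h window. Everything below is a simplified stand-in, not the Elasticsearch persistent-task API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified stand-ins for the persistent-task machinery; names are hypothetical.
public class ReindexGuardSketch {
    static final long TASK_KEEP_ALIVE_MILLIS = TimeUnit.HOURS.toMillis(24);

    record TaskState(Long completionTime) {} // completionTime == null while still running

    interface Task { void markAsCompleted(); }

    static final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // On reassignment (e.g. after a rolling upgrade), a completed task is not restarted:
    // only its removal is rescheduled for the remainder of the keep-alive window.
    static boolean nodeOperation(Task task, TaskState state, long nowMillis, Runnable startReindex) {
        if (state != null && state.completionTime() != null) {
            long elapsed = nowMillis - state.completionTime();
            long ttl = TASK_KEEP_ALIVE_MILLIS - Math.min(TASK_KEEP_ALIVE_MILLIS, elapsed);
            scheduler.schedule(task::markAsCompleted, ttl, TimeUnit.MILLISECONDS);
            return false; // reindex not restarted
        }
        startReindex.run();
        return true; // no recorded completion: (re)start the reindex
    }

    public static void main(String[] args) {
        Task noop = () -> {};
        // Completed task: not restarted, cleanup scheduled instead.
        assert !nodeOperation(noop, new TaskState(0L), 1L, () -> {});
        // No state yet: the reindex starts.
        assert nodeOperation(noop, null, 1L, () -> {});
        scheduler.shutdownNow();
        System.out.println("ok");
    }
}
```

The boolean return is only for illustration; the point is that the completion check happens before any reindex work is kicked off on the new node.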