Make the health node available even if all nodes are marked for shutdown (#92193) #112825
Conversation
This fixes the behavior of the health node not being available as soon as all nodes are marked for shutdown but still running. A cluster without an operating health node returns "No disk usage data" through its health API.

This introduces a cluster setting to enable the new behavior; the default preserves the previous behavior.

I also had to modify the part of the code that immediately evicts any running task as soon as its node is marked for shutdown, which led to an endless loop of eviction -> reassignment -> eviction -> ...

Fixes elastic#92193
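A rough sketch of the guard this adds to the eviction path (simplified; the variable names here are illustrative, not the exact diff):

```java
// Simplified sketch, not the exact diff. Previously a task on a node marked for
// shutdown was unassigned unconditionally, which for the health node caused the
// endless eviction -> reassignment -> eviction loop described above.
boolean nodeShuttingDown = clusterState.metadata().nodeShutdowns().contains(assignedNodeId);
boolean exemptFromEviction = taskIsLightweight && allowLightweightAssignmentsToNodesShuttingDown;
if (nodeShuttingDown && exemptFromEviction == false) {
    // unassign the persistent task so the normal reassignment logic can move it
}
```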
Documentation preview:

Pinging @elastic/es-core-infra (Team:Core/Infra)

Pinging @elastic/es-data-management (Team:Data Management)

buildkite test this

@elasticmachine update branch

@dakrone @andreidan Could you have a look at this? Our client is asking for an update on this feature. I am relatively new to open source contributions, so if there is anything I still need to do for this to go through, feel free to let me know; I am happy to keep working on it.

Hi @nmesot, thank you for picking up this issue. I will start the review.
```
For certain lightweight, persistent tasks (like the health node), it makes sense to not
have this limitation.

NOTE: For the health node for example, this provides the benefit of having accurate
```
I don't think this note is needed here; it would be better to add this information to the changelog.
```java
private final AtomicReference<HealthNode> currentTask = new AtomicReference<>();
private final ClusterStateListener taskStarter;
private final ClusterStateListener shutdownListener;
private Boolean allowLightweightAssignmentsToNodesShuttingDown;
```
Why a `Boolean` and not a `boolean`?
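The usual pattern for a dynamic setting is a primitive `volatile boolean` field kept in sync by a settings update consumer, roughly (a sketch, assuming access to the node's `ClusterSettings` at construction time):

```java
// Read the initial value from the node settings, then keep the field updated
// dynamically. A primitive avoids an accidental null state and boxing on every read.
this.allowLightweightAssignmentsToNodesShuttingDown =
    CLUSTER_TASKS_ALLOCATION_ALLOW_LIGHTWEIGHT_ASSIGNMENTS_TO_NODES_SHUTTING_DOWN_SETTING.get(settings);
clusterSettings.addSettingsUpdateConsumer(
    CLUSTER_TASKS_ALLOCATION_ALLOW_LIGHTWEIGHT_ASSIGNMENTS_TO_NODES_SHUTTING_DOWN_SETTING,
    value -> this.allowLightweightAssignmentsToNodesShuttingDown = value
);
```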
```java
}

private static boolean isEveryNodeShuttingDown(ClusterState clusterState, ClusterChangedEvent event) {
    return clusterState.nodes().getAllNodes().stream().allMatch(dn -> event.state().metadata().nodeShutdowns().contains(dn.getId()));
```
Why use `clusterState` from `clusterService.state()` when you already have `event.state()`? Am I missing something here?
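One way the simplification could look (an untested sketch based on the quoted code):

```java
private static boolean isEveryNodeShuttingDown(ClusterChangedEvent event) {
    // Use the event's state for both the node list and the shutdown metadata,
    // so the method works off a single, consistent cluster state.
    return event.state()
        .nodes()
        .getAllNodes()
        .stream()
        .allMatch(dn -> event.state().metadata().nodeShutdowns().contains(dn.getId()));
}
```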
Looks like a great contribution overall, thanks!
Hi @nmesot, thank you again for putting in this effort and opening a very good PR. I have left some comments that will hopefully reduce the complexity a bit, but it looks really good.
```java
public static final Setting<Boolean> CLUSTER_TASKS_ALLOCATION_ALLOW_LIGHTWEIGHT_ASSIGNMENTS_TO_NODES_SHUTTING_DOWN_SETTING = Setting
    .boolSetting(
        "cluster.persistent_tasks.allocation.allow_lightweight_assignments_to_nodes_shutting_down",
        false,
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );
```
I think this change is safe enough that we do not need a switch. Each task that defines itself as lightweight should ensure that it is always safe to run that way, or else that it needs the flag. What do you think? A sketch of what that could look like follows below.
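Without the switch, the assignment check could rely on the task's own declaration alone, something like (a hypothetical sketch; the surrounding loop and the `executor` name are assumptions, not the PR's code):

```java
// Hypothetical sketch: the task's own declaration is the single source of truth,
// so no cluster setting is consulted here.
boolean nodeShuttingDown = currentState.metadata().nodeShutdowns().contains(node.getId());
if (nodeShuttingDown && executor.isLightweight() == false) {
    continue; // ordinary tasks still skip nodes that are shutting down
}
```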
```java
// nodes are available. An example of a lightweight task is the HealthNodeTask.
// This behavior needs to be enabled with the CLUSTER_TASKS_ALLOCATION_ALLOW_LIGHTWEIGHT_ASSIGNMENTS_TO_NODES_SHUTTING_DOWN_SETTING
// setting.
```
I think the comment is clear enough without the example. Someone can always check which tasks implement `isLightweight`.
```java
final List<DiscoveryNode> allNodes = currentState.nodes().stream().toList();
final List<DiscoveryNode> candidateNodes = allNodes.stream()
    .filter(dn -> currentState.metadata().nodeShutdowns().contains(dn.getId()) == false)
    .collect(Collectors.toCollection(ArrayList::new));
```
You could use `.toList()` here too.
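That is, roughly:

```java
final List<DiscoveryNode> candidateNodes = allNodes.stream()
    .filter(dn -> currentState.metadata().nodeShutdowns().contains(dn.getId()) == false)
    .toList();
```

Note that `.toList()` returns an unmodifiable list, which is fine as long as `candidateNodes` is not mutated afterwards.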
```java
    .filter(dn -> currentState.metadata().nodeShutdowns().contains(dn.getId()) == false)
    .collect(Collectors.toCollection(ArrayList::new));

if (candidateNodes.isEmpty() && allNodes.isEmpty() == false) { // all nodes are shutting down
```
I think creating a new collection `allNodes` is not necessary. I think `currentState.nodes().getNodes().isEmpty()` will not create new objects and therefore be more efficient.
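Putting both suggestions together, the snippet could reduce to something like this (a sketch, not a tested change):

```java
// Stream the nodes directly instead of materializing an intermediate allNodes list.
final List<DiscoveryNode> candidateNodes = currentState.nodes().stream()
    .filter(dn -> currentState.metadata().nodeShutdowns().contains(dn.getId()) == false)
    .toList();

if (candidateNodes.isEmpty() && currentState.nodes().getNodes().isEmpty() == false) { // all nodes are shutting down
```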
```java
 * {@link PersistentTasksClusterService#CLUSTER_TASKS_ALLOCATION_ALLOW_LIGHTWEIGHT_ASSIGNMENTS_TO_NODES_SHUTTING_DOWN_SETTING}
 * needs to be set for this behavior to apply.
 */
default boolean isLightweight() {
```
I am guessing this term was inspired by the original issue. However, this term is not established, and it was mostly used to describe the health node task. Since we do not have any other usage for this method, I suggest we give it a more specific name, something along the lines of `keepRunningDuringShutdown` or `canRunOnShuttingDownNodes`. What do you think?
++ for `canRunOnShuttingDownNodes`
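A sketch of the renamed hook, assuming the default stays `false` to match the previous behavior (the Javadoc body is adapted from the quoted excerpt, not the final wording):

```java
/**
 * Returns true if this task may stay assigned to, or be assigned to, a node that is
 * marked for shutdown.
 * {@link PersistentTasksClusterService#CLUSTER_TASKS_ALLOCATION_ALLOW_LIGHTWEIGHT_ASSIGNMENTS_TO_NODES_SHUTTING_DOWN_SETTING}
 * needs to be set for this behavior to apply.
 */
default boolean canRunOnShuttingDownNodes() {
    return false; // assumed default, preserving the previous behavior
}
```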