Prevent NPE when generating snapshot metrics before initial cluster state is set #136350

nicktindall · 2025-10-10T04:58:17Z

Metrics can be requested at any point in the node's lifecycle. We've seen examples of this happening before the initial cluster state is set.

This change moves the async shards-by-state and snapshots-by-state metric computation out of SnapshotService and into its own class. The new class will only ask the ClusterService for the cluster state when it is in lifecycleState STARTED. This should prevent attempts to read the state before it's present.

Resolves: ES-13022

elasticsearchmachine · 2025-10-10T05:09:35Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-10-10T05:09:35Z

Hi @nicktindall, I've created a changelog YAML for you.

nicktindall · 2025-10-10T05:33:51Z

Please hold off reviewing, much nicer approach inbound

nicktindall · 2025-10-10T06:25:26Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotMetricsService.java

+        if (shouldReturnSnapshotMetrics == false) {
+            return List.of();
+        }
+        return recalculateIfStale(clusterService.state()).snapshotStateMetrics();


There is a slight race if the node is demoted from master: the demotion could happen just after we evaluate shouldReturnSnapshotMetrics == true. It's no big deal and definitely not worth synchronizing.

We have the same issue with other Gauge metrics. I don't think it's important enough to address.

DaveCTurner

+1 on moving this to its own service class, but could we protect against early calls to clusterService.state() by checking clusterService.lifecycleState() == Lifecycle.State.STARTED (or using clusterService.addLifecycleListener()) instead? And then go back to checking the elected master (and the blocks) in the state being used for the stats rather than using a volatile field that isn't quite synchronized with the actual state?
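A minimal sketch of the suggestion above, using hypothetical stand-in types (`LifecycleState`, `FakeClusterState`, `FakeClusterService`, `GuardedSnapshotGauge`) rather than the real Elasticsearch classes. It assumes that reading the state before the initial state is set fails, which the stand-in simulates by throwing, and it takes the elected-master flag from the same state object that feeds the metric instead of from a separate field.

```java
import java.util.List;

// Hypothetical stand-ins for the types named in this thread; NOT the real classes.
enum LifecycleState { INITIALIZED, STARTED, STOPPED, CLOSED }

// Assumption: the state carries "is the local node the elected master" plus the metric input.
record FakeClusterState(boolean localNodeIsMaster, int snapshotsInProgress) {}

final class FakeClusterService {
    private volatile LifecycleState lifecycleState = LifecycleState.INITIALIZED;
    private volatile FakeClusterState state; // null until the initial state is set

    LifecycleState lifecycleState() {
        return lifecycleState;
    }

    FakeClusterState state() {
        FakeClusterState current = state;
        if (current == null) {
            // Simulates the failure this PR is about.
            throw new NullPointerException("initial cluster state not set yet");
        }
        return current;
    }

    void start(FakeClusterState initial) {
        state = initial;                          // set the state first...
        lifecycleState = LifecycleState.STARTED;  // ...then report STARTED
    }

    void applyState(FakeClusterState next) {
        state = next;
    }
}

final class GuardedSnapshotGauge {
    private final FakeClusterService clusterService;

    GuardedSnapshotGauge(FakeClusterService clusterService) {
        this.clusterService = clusterService;
    }

    List<Integer> getSnapshotsByState() {
        // Guard 1: never touch the state unless the service has started.
        if (clusterService.lifecycleState() != LifecycleState.STARTED) {
            return List.of();
        }
        // Guard 2: decide "am I master" from the very state the metric is computed from.
        FakeClusterState state = clusterService.state();
        if (state.localNodeIsMaster() == false) {
            return List.of();
        }
        return List.of(state.snapshotsInProgress());
    }
}
```

Because both the master check and the metric come from one state snapshot, there is no window in which the flag and the state disagree.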

nicktindall · 2025-10-13T00:08:20Z

server/src/main/java/org/elasticsearch/node/NodeConstruction.java

+        final CachingSnapshotAndShardByStateMetricsService cachingSnapshotAndShardByStateMetricsService =
+            new CachingSnapshotAndShardByStateMetricsService(clusterService);
+        snapshotMetrics.createSnapshotsByStateMetric(cachingSnapshotAndShardByStateMetricsService::getSnapshotsByState);
+        snapshotMetrics.createSnapshotShardsByStateMetric(cachingSnapshotAndShardByStateMetricsService::getShardsByState);


This behaviour (creation of the gauges) isn't tested in new code, but there are integration tests for these metrics already.

nicktindall · 2025-10-13T00:12:47Z

.../src/main/java/org/elasticsearch/snapshots/CachingSnapshotAndShardByStateMetricsService.java

+ * re-calculate the metrics if the {@link SnapshotsInProgress} has changed since the last time
+ * they were calculated.
+ */
+public class CachingSnapshotAndShardByStateMetricsService {


This could also probably be added to the SnapshotMetrics class, but I feel like it contains sufficient complexity to be its own thing. Also SnapshotMetrics is currently immutable (a record), and I think the cache would kind of taint that.
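To illustrate the "re-calculate only if the SnapshotsInProgress has changed" idea from the Javadoc above, here is a rough sketch. `FakeSnapshotsInProgress` and `CachingCounter` are hypothetical names, and the reference comparison rests on the assumption that an unchanged value is represented by the same immutable instance; the actual class may detect changes differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for SnapshotsInProgress: an immutable list of snapshot state names.
record FakeSnapshotsInProgress(List<String> snapshotStates) {}

final class CachingCounter {
    // Immutable pair of "the input we computed from" and "the result we computed".
    private record Cached(FakeSnapshotsInProgress source, Map<String, Long> countsByState) {}

    private Cached cached;   // this sketch assumes single-threaded callers
    int recalculations = 0;  // exposed only so the caching is observable

    Map<String, Long> countsByState(FakeSnapshotsInProgress current) {
        Cached local = cached;
        // Reference comparison (assumption): same instance means nothing changed.
        if (local == null || local.source() != current) {
            Map<String, Long> counts = new HashMap<>();
            for (String state : current.snapshotStates()) {
                counts.merge(state, 1L, Long::sum);
            }
            local = new Cached(current, Map.copyOf(counts));
            cached = local;
            recalculations++;
        }
        return local.countsByState();
    }
}
```

The mutable `cached` field is the part that would not fit inside an immutable record, which is one way to read the argument for a separate class.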

ywangd

LGTM

ywangd · 2025-10-13T03:08:44Z

.../src/main/java/org/elasticsearch/snapshots/CachingSnapshotAndShardByStateMetricsService.java

+        if (clusterService.lifecycleState() != Lifecycle.State.STARTED) {
+            return List.of();
+        }


TIL also :)

DaveCTurner

LGTM2 (one question about concurrency but it relates to behaviour that was there already)

DaveCTurner · 2025-10-13T07:33:07Z

.../src/main/java/org/elasticsearch/snapshots/CachingSnapshotAndShardByStateMetricsService.java

+public class CachingSnapshotAndShardByStateMetricsService {
+
+    private final ClusterService clusterService;
+    private CachedSnapshotStateMetrics cachedSnapshotStateMetrics;


Are we guaranteed that the metrics requests all come from the same thread (or at least, each one strictly happens-before the next)? If not, this should be volatile.
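A generic sketch of what the volatile variant could look like, assuming requests may arrive on different threads. `VolatileMemo` and its nested `Entry` record are hypothetical and not taken from the PR. The cached entry is immutable and is published through a single volatile field, so a write by one thread is visible to later reads by others; concurrent callers may still both recompute, with the last write winning.

```java
import java.util.function.Function;

final class VolatileMemo<K, V> {
    // Immutable entry: the key it was computed from and the computed value.
    private record Entry<A, B>(A key, B value) {}

    private final Function<K, V> compute;
    private volatile Entry<K, V> entry; // volatile: writes are visible to other threads

    VolatileMemo(Function<K, V> compute) {
        this.compute = compute;
    }

    V get(K key) {
        Entry<K, V> local = entry; // one volatile read, then work on the local copy
        if (local == null || local.key() != key) {
            // Key instance changed (or nothing cached yet): recompute and publish.
            local = new Entry<>(key, compute.apply(key));
            entry = local; // one volatile write
        }
        return local.value();
    }
}
```

Reading the field once into a local avoids seeing two different entries within one call, which matters when another thread replaces the entry in between.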

nicktindall added 3 commits October 10, 2025 14:10

Fix checkstyle

b3c35ce

Generate snapshot metrics from last applied state

4214472

Merge remote-tracking branch 'origin/main' into fix_npe_snapshot_metrics

f46ef6e

elasticsearchmachine added the v9.3.0 label Oct 10, 2025

nicktindall requested a review from ywangd October 10, 2025 05:07

nicktindall marked this pull request as ready for review October 10, 2025 05:08

elasticsearchmachine added the needs:triage label Oct 10, 2025

nicktindall added the >bug and :Distributed Coordination/Snapshot/Restore labels and removed the needs:triage label Oct 10, 2025

elasticsearchmachine added the Team:Distributed Coordination label Oct 10, 2025

Update docs/changelog/136350.yaml

4d7ab3d

nicktindall marked this pull request as draft October 10, 2025 05:34

nicktindall added 4 commits October 10, 2025 17:18

Split out metrics calculation

f1c7b95

Merge remote-tracking branch 'origin/main' into fix_npe_snapshot_metrics

9e926ae

Merge branch 'fix_npe_snapshot_metrics' of github.com:nicktindall/ela…

b312683

…sticsearch into fix_npe_snapshot_metrics

Fix changelog

8c72da5

nicktindall changed the title ~~Generate snapshot metrics from last applied state~~ Prevent NPE when generating snapshot metrics before initial cluster state is set Oct 10, 2025

nicktindall commented Oct 10, 2025

Clean up remnants

3c5bb9e

DaveCTurner reviewed Oct 10, 2025

nicktindall added 4 commits October 13, 2025 09:52

Remove redundant staleness check

3684b57

Add test for no-longer master

a9e46fe

Use ClusterService lifecycle to decide when to poll

1a79b92

Test whole lifecyle

6f8978c

nicktindall marked this pull request as ready for review October 13, 2025 00:06

nicktindall requested a review from a team as a code owner October 13, 2025 00:06

nicktindall commented Oct 13, 2025

nicktindall requested a review from DaveCTurner October 13, 2025 00:38

nicktindall added 3 commits October 13, 2025 11:41

Merge remote-tracking branch 'origin/main' into fix_npe_snapshot_metrics

ee4a21c

Make minimum nodes 2

e3146c6

Use original cluster state when creating snapshotsInProgress

20715a3

ywangd approved these changes Oct 13, 2025

DaveCTurner approved these changes Oct 13, 2025
