Conversation

nicktindall
Contributor

@nicktindall nicktindall commented Oct 10, 2025

Metrics can be requested at any point in the node's lifecycle. We've seen examples of this happening before the initial cluster state is set.

This change moves the async shards-by-state and snapshots-by-state metric computation out of SnapshotService and into its own class. The new class will only ask the ClusterService for the cluster state when it is in lifecycleState STARTED. This should prevent attempts to read the state before it's present.
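
A minimal sketch of the lifecycle guard described above. The types here (`LifecycleState`, `StubClusterService`, `LifecycleGuardedMetrics`) are hypothetical stand-ins written for illustration, not the real Elasticsearch classes, and the exception thrown by the stub only stands in for the NPE named in the PR title.

```java
import java.util.List;

// Hypothetical stand-in for Lifecycle.State.
enum LifecycleState { INITIALIZED, STARTED, STOPPED, CLOSED }

// Hypothetical stand-in for ClusterService: the state is unavailable until start.
final class StubClusterService {
    private volatile LifecycleState lifecycleState = LifecycleState.INITIALIZED;
    private volatile String state; // stands in for ClusterState; null until set

    LifecycleState lifecycleState() {
        return lifecycleState;
    }

    String state() {
        if (state == null) {
            // Assumption: models the failure seen when the state is read too early.
            throw new IllegalStateException("initial cluster state not set yet");
        }
        return state;
    }

    void start(String initialState) {
        this.state = initialState;
        this.lifecycleState = LifecycleState.STARTED;
    }
}

// Only asks for the cluster state once the service reports STARTED.
final class LifecycleGuardedMetrics {
    private final StubClusterService clusterService;

    LifecycleGuardedMetrics(StubClusterService clusterService) {
        this.clusterService = clusterService;
    }

    List<String> getShardsByState() {
        if (clusterService.lifecycleState() != LifecycleState.STARTED) {
            return List.of(); // too early: report nothing instead of reading the state
        }
        return List.of("shards computed from " + clusterService.state());
    }
}
```

With this shape, a metrics request that arrives before startup returns an empty list and never calls `state()`.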

Resolves: ES-13022

@nicktindall nicktindall requested a review from ywangd October 10, 2025 05:07
@nicktindall nicktindall marked this pull request as ready for review October 10, 2025 05:08
@elasticsearchmachine elasticsearchmachine added the needs:triage label Oct 10, 2025
@nicktindall nicktindall added the >bug and :Distributed Coordination/Snapshot/Restore labels and removed the needs:triage label Oct 10, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Oct 10, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@nicktindall
Contributor Author

Please hold off reviewing, much nicer approach inbound

@nicktindall nicktindall marked this pull request as draft October 10, 2025 05:34
@nicktindall nicktindall changed the title from "Generate snapshot metrics from last applied state" to "Prevent NPE when generating snapshot metrics before initial cluster state is set" Oct 10, 2025
if (shouldReturnSnapshotMetrics == false) {
    return List.of();
}
return recalculateIfStale(clusterService.state()).snapshotStateMetrics();
Contributor Author

There is a slight race if the node is demoted from master: the demotion could happen just after we evaluate shouldReturnSnapshotMetrics == true. It's no big deal and definitely not worth synchronizing.

Member

We have the same issue with other Gauge metrics. I don't think it's important enough to address.

Contributor

@DaveCTurner DaveCTurner left a comment

+1 on moving this to its own service class, but could we protect against early calls to clusterService.state() by checking clusterService.lifecycleState() == Lifecycle.State.STARTED (or using clusterService.addLifecycleListener()) instead? And then go back to checking the elected master (and the blocks) in the state being used for the stats rather than using a volatile field that isn't quite synchronized with the actual state?
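
One way to read this suggestion, as a rough sketch: check the lifecycle first, then read the state once and take both the elected-master flag and the block check from that same state object, so the two can never disagree with the data used for the stats. `StateStub`, its fields, and `StateDerivedMetrics` are hypothetical names for illustration only; the real `ClusterState` API is not reproduced here.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the parts of the cluster state that the metrics need.
record StateStub(boolean localNodeElectedMaster, boolean blocked, List<String> snapshotStates) {}

final class StateDerivedMetrics {
    // `started` stands in for clusterService.lifecycleState() == Lifecycle.State.STARTED.
    static List<String> snapshotsByState(boolean started, Supplier<StateStub> stateSupplier) {
        if (started == false) {
            return List.of(); // never touch the state before the service has started
        }
        final StateStub state = stateSupplier.get(); // read the state exactly once
        // Master-ness and blocks come from the same snapshot as the metrics themselves.
        if (state.localNodeElectedMaster() == false || state.blocked()) {
            return List.of();
        }
        return state.snapshotStates();
    }
}
```

Because every decision is derived from one immutable snapshot, no separate volatile flag has to be kept in sync with the applied state.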

@nicktindall nicktindall marked this pull request as ready for review October 13, 2025 00:06
@nicktindall nicktindall requested a review from a team as a code owner October 13, 2025 00:06
final CachingSnapshotAndShardByStateMetricsService cachingSnapshotAndShardByStateMetricsService =
    new CachingSnapshotAndShardByStateMetricsService(clusterService);
snapshotMetrics.createSnapshotsByStateMetric(cachingSnapshotAndShardByStateMetricsService::getSnapshotsByState);
snapshotMetrics.createSnapshotShardsByStateMetric(cachingSnapshotAndShardByStateMetricsService::getShardsByState);
Contributor Author

@nicktindall nicktindall Oct 13, 2025

This behaviour (creation of the gauges) isn't tested in new code, but there are integration tests for these metrics already.

* re-calculate the metrics if the {@link SnapshotsInProgress} has changed since the last time
* they were calculated.
*/
public class CachingSnapshotAndShardByStateMetricsService {
Contributor Author

This could probably also be added to the SnapshotMetrics class, but I feel it contains enough complexity to be its own thing. Also, SnapshotMetrics is currently immutable (a record), and I think the cache would kind of taint that.
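
To illustrate why the cache is kept apart from an immutable record, here is a small sketch of the "re-calculate only if the source changed" idea from the class Javadoc. `SnapshotsInProgressStub` and `RecalculateIfStaleCache` are hypothetical stand-ins, and comparing by reference identity is an assumption that only makes sense if the source object is immutable and replaced on every change. Thread safety is deliberately left out of this sketch.

```java
import java.util.Map;

// Hypothetical stand-in for SnapshotsInProgress: immutable, replaced whenever it changes.
record SnapshotsInProgressStub(Map<String, Integer> shardsByState) {}

final class RecalculateIfStaleCache {
    // Pairs the computed metrics with the exact source object they were computed from.
    private record Cached(SnapshotsInProgressStub source, Map<String, Integer> metrics) {}

    private Cached cached;  // mutable state: the reason this lives outside an immutable record
    int recalculations;     // exposed only so the sketch can show when work is redone

    Map<String, Integer> get(SnapshotsInProgressStub current) {
        Cached c = cached;
        // Assumption: a different reference means the snapshots in progress have changed.
        if (c == null || c.source() != current) {
            recalculations++;
            c = new Cached(current, Map.copyOf(current.shardsByState()));
            cached = c;
        }
        return c.metrics();
    }
}
```

Repeated requests against the same source object reuse the cached result; a new source object triggers exactly one recalculation.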

Member

@ywangd ywangd left a comment

LGTM

Comment on lines +44 to +46
if (clusterService.lifecycleState() != Lifecycle.State.STARTED) {
    return List.of();
}
Member

TIL 👍

Contributor Author

TIL also :)

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM2 (one question about concurrency but it relates to behaviour that was there already)

public class CachingSnapshotAndShardByStateMetricsService {

    private final ClusterService clusterService;
    private CachedSnapshotStateMetrics cachedSnapshotStateMetrics;
Contributor

Are we guaranteed that the metrics requests all come from the same thread (or at least, each one strictly happens-before the next)? If not, this should be volatile.
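
For reference, a sketch of what the volatile variant could look like if no happens-before ordering between metrics requests can be assumed. The thread does not say how this was resolved, so this is one possible shape and not the merged code; `VolatileMetricsCache` and its `Entry` record are hypothetical names.

```java
import java.util.List;
import java.util.function.Function;

final class VolatileMetricsCache {
    // Immutable holder: the key and its metrics are published together in one write.
    private record Entry(Object key, List<String> metrics) {}

    // volatile gives every reader a fully constructed Entry, whichever thread wrote it.
    private volatile Entry entry;

    List<String> get(Object key, Function<Object, List<String>> compute) {
        Entry e = entry; // single volatile read; later checks use this local copy
        if (e == null || e.key() != key) {
            e = new Entry(key, List.copyOf(compute.apply(key)));
            entry = e;   // single volatile write
        }
        return e.metrics();
    }
}
```

Two threads that race on a stale entry may both recompute, which wastes some work but never exposes a key paired with another key's metrics.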

Labels

>bug; :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs); Team:Distributed Coordination (meta label for the Distributed Coordination team); v9.3.0

4 participants