-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Prevent NPE when generating snapshot metrics before initial cluster state is set #136350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
Hi @nicktindall, I've created a changelog YAML for you. |
Please hold off reviewing, much nicer approach inbound |
if (shouldReturnSnapshotMetrics == false) { | ||
return List.of(); | ||
} | ||
return recalculateIfStale(clusterService.state()).snapshotStateMetrics(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a slight race if the node is demoted from master, it could happen just after we evaluate shouldReturnSnapshotMetrics == true
, but it's no big deal and definitely not worth synchronizing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the same issue with other Gauge metrics. I don't think it's important enough to address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on moving this to its own service class, but could we protect against early calls to clusterService.state()
by checking clusterService.lifecycleState() == Lifecycle.State.STARTED
(or using clusterService.addLifecycleListener()
) instead? And then go back to checking the elected master (and the blocks) in the state being used for the stats rather than using a volatile field that isn't quite synchronized with the actual state?
final CachingSnapshotAndShardByStateMetricsService cachingSnapshotAndShardByStateMetricsService = | ||
new CachingSnapshotAndShardByStateMetricsService(clusterService); | ||
snapshotMetrics.createSnapshotsByStateMetric(cachingSnapshotAndShardByStateMetricsService::getSnapshotsByState); | ||
snapshotMetrics.createSnapshotShardsByStateMetric(cachingSnapshotAndShardByStateMetricsService::getShardsByState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behaviour (creation of the gauges) isn't tested in new code, but there are integration tests for these metrics already.
* re-calculate the metrics if the {@link SnapshotsInProgress} has changed since the last time | ||
* they were calculated. | ||
*/ | ||
public class CachingSnapshotAndShardByStateMetricsService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also probably be added to the SnapshotMetrics
class, but I feel like it contains sufficient complexity to be it's own thing. Also SnapshotMetrics
is currently immutable (a record), and I think the cache would kind-of taint that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if (clusterService.lifecycleState() != Lifecycle.State.STARTED) { | ||
return List.of(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL also :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM2 (one question about concurrency but it relates to behaviour that was there already)
public class CachingSnapshotAndShardByStateMetricsService { | ||
|
||
private final ClusterService clusterService; | ||
private CachedSnapshotStateMetrics cachedSnapshotStateMetrics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we guaranteed that the metrics requests all come from the same thread (or at least, each one strictly happens-before the next)? If not, this should be volatile
.
Metrics can be requested at any point in the node's lifecycle. We've seen examples of this happening before the initial cluster state is set.
This change moves the async shards-by-state and snapshots-by-state metric computation out of
SnapshotService
and into its own class. The new class will only ask theClusterService
for the cluster state when it is in lifecycleStateSTARTED
. This should prevent attempts to read the state before it's present.Resolves: ES-13022