
Add metrics tracking shard time from unassigned to initialized/started #144521

Merged
inespot merged 13 commits into elastic:main from inespot:es-14351/shard_allocation_metrics
Mar 26, 2026

Conversation

@inespot
Contributor

@inespot inespot commented Mar 18, 2026

Extends ShardChangesObserver to emit two LongHistogram metrics that track how long a shard takes to go from UNASSIGNED to INITIALIZED to STARTED.

Relates to ES-14351.

There are two parts to this.

  1. Evaluate the baseline for shard assignment latency. Add metrics to track how long it typically takes for primary/replica shards to transition out of the unassigned state and eventually become STARTED, so we can define what a "normal wait" looks like. (tackled by this PR)
  2. Relax the yellow health check for unassigned replica shards. Keep cluster health green for a configurable duration (X) before moving to the noisy yellow state. There is already precedent for this in the codebase for the red state:
    private static boolean isUnassignedDueToNewInitialization(ProjectId projectId, ShardRouting routing, ClusterState state) {
        if (routing.active()) {
            return false;
        }
        // If the primary is inactive for unexceptional events in the cluster lifecycle, both the primary and the
        // replica are considered new initializations.
        ShardRouting primary = routing.primary()
            ? routing
            : state.routingTable(projectId).shardRoutingTable(routing.shardId()).primaryShard();
        return primary.active() == false && getInactivePrimaryHealth(primary) == ClusterHealthStatus.YELLOW;
    }

    (this second piece will be tackled in a follow-up PR).
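As a rough illustration of what part 1 measures, here is a minimal sketch of the duration recording. All names are hypothetical (FakeHistogram stands in for the real LongHistogram, and ShardTransitionMetricsSketch is not the actual ShardChangesObserver code); it only shows the shape of the idea: on a transition out of UNASSIGNED, record now minus unassignedTimeMillis, floored at 0 against clock skew.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Illustrative stand-in for the real LongHistogram; it just collects values.
class FakeHistogram {
    final List<Long> recorded = new ArrayList<>();
    void record(long valueMillis) {
        recorded.add(valueMillis);
    }
}

// Minimal sketch: record how long a shard spent UNASSIGNED once it
// starts INITIALIZING. Hypothetical names, not the PR's actual code.
class ShardTransitionMetricsSketch {
    private final FakeHistogram unassignedToInitializing = new FakeHistogram();
    private final LongSupplier currentTimeMillis;

    ShardTransitionMetricsSketch(LongSupplier currentTimeMillis) {
        this.currentTimeMillis = currentTimeMillis;
    }

    // Called when a shard transitions UNASSIGNED -> INITIALIZING.
    void onShardInitializing(long unassignedTimeMillis) {
        long durationMillis = currentTimeMillis.getAsLong() - unassignedTimeMillis;
        // Floor at 0 to guard against clock skew between nodes.
        unassignedToInitializing.record(Math.max(0, durationMillis));
    }

    List<Long> recordedDurations() {
        return unassignedToInitializing.recorded;
    }
}
```

The second histogram (INITIALIZED to STARTED) would follow the same pattern with a different start timestamp.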

@inespot
Contributor Author

inespot commented Mar 18, 2026

Design alternatives considered:

  • Record metrics inside ShardsAvailabilityHealthIndicatorService. But then metric emission would be tied to the health pipeline's calculate cadence rather than to actual shard transitions, so we would end up with less accurate data.
  • Instead of the observer pattern, compare the routing table before and after each allocation round directly in AllocationService methods. This could allow more flexible filtering of which transitions to record, but requires keeping the call sites up to date as the code evolves, and makes the logic more complex.
  • Use the existing AllocationBalancingRoundMetrics or DesiredBalanceMetrics. The former felt like the wrong abstraction (it tracks balancing round metrics, not the current state of the cluster). The latter feels more related but tracks a snapshot of current allocation state, not event latency. It would also require more scaffolding to plug in transition times.

This approach (ShardChangesObserver) does stretch the class's single responsibility of logging, but metrics and logging are closely related enough that co-locating them seems reasonable?

Open to other opinions/happy to refactor though!

@inespot inespot marked this pull request as ready for review March 19, 2026 04:27
@inespot inespot added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >non-issue labels Mar 19, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 19, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@inespot
Contributor Author

inespot commented Mar 19, 2026

To note

  • ⚠️ This will also need a serverless-side PR to fix the AllocationService constructor in StatelessSnapshotResiliencyTests (if we move forward with this design; see above for the alternatives considered)

  • In case of a node restart or a node leaving the cluster (Reason.NODE_LEFT and Reason.NODE_RESTARTING), allocation of the unassigned shard may be delayed (up to the configured index.unassigned.node_left.delayed_timeout) because it is more efficient to wait and see whether the node returns than to trigger recovery immediately.
    We add the unassignment reason as an attribute on the metric so those cases can be distinguished.

  • The metrics will be recorded even if the cluster state publication fails. That should be acceptable given the "stats/baseline" purpose of these metrics.

  • Only the master node emits these metrics, since the observer fires during cluster state task execution. No cleanup should be needed on master demotion since, unlike gauges, histograms are point-in-time recordings with no accumulated state to reset (but let me know if I missed something here)

@inespot
Contributor Author

inespot commented Mar 19, 2026

@DiannaHohensee, added you as an optional reviewer since this change touches the Allocation classes!

@DaveCTurner
Contributor

I'm supportive of more metrics in this area, but in terms of ES-14351 I was thinking we could use the index.creation_date setting as the start time. We care about shards which become unassigned later on in their lifecycles; it's particularly newly-created shards that need to be excluded from these alerts.

Thinking about it more, though, there are some other reasons for a shard to be unassigned without deserving an alert, and I think we have everything we need in the routing table in UnassignedInfo#unassignedTimeMillis and UnassignedInfo.Reason (i.e. alert on ALLOCATION_FAILED, NODE_LEFT, REINITIALIZED, REALLOCATED_REPLICA, PRIMARY_FAILED or NODE_RESTARTING but allow a grace period for the others).

@inespot
Contributor Author

inespot commented Mar 19, 2026

I'm supportive of more metrics in this area, but in terms of ES-14351 I was thinking we could use the index.creation_date setting as the start time. We care about shards which become unassigned later on in their lifecycles; it's particularly newly-created shards that need to be excluded from these alerts.

I agree those metrics go a little beyond the ES-14351 initial goal (avoiding noise while "new index shards are still being allocated"). However, adding baseline visibility into how long allocation typically takes (UNASSIGNED -> INITIALIZED -> STARTED), and how that varies by UnassignedInfo.Reason, is generally useful both for ES-14351 and if we want to start excluding more cases than "newly-created shards" in the future.

Thinking about it more, though, there are some other reasons for a shard to be unassigned without deserving an alert, and I think we have everything we need in the routing table in UnassignedInfo#unassignedTimeMillis and UnassignedInfo.Reason (i.e. alert on ALLOCATION_FAILED, NODE_LEFT, REINITIALIZED, REALLOCATED_REPLICA, PRIMARY_FAILED or NODE_RESTARTING but allow a grace period for the others).

Yep, using index.creation_date for the "new index" case would require a metadata lookup and doesn't generalize as well if we move beyond that one scenario. Separately, the existing health computation already handles the "index exists but shard routings aren't created yet" case:

if (shards.isEmpty()) { // might be since none has been created yet (two phase index creation)
    computeStatus = ClusterHealthStatus.RED;
}
So it seems reasonable to treat that as a distinct signal from "shard became unassigned at time T and is still unassigned after some grace period"? Open to other opinions though!

@inespot
Contributor Author

inespot commented Mar 19, 2026

I also like the concept of triggering the grace/alert logic based on UnassignedInfo.Reason (and unassignedTimeMillis) because I think it will give us more flexibility over time. It's "pluggable": we can toggle which reasons get a grace period without introducing extra plumbing. Based on the existing code in ShardsAvailabilityHealthIndicatorService, it seems like the logic would not be too complex either. But I'll start drafting a PR to make this more concrete
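To make that concrete, here is a minimal sketch of what reason-based grace logic could look like, using the alert/grace split of reasons suggested earlier in the thread. All names here are hypothetical (a local Reason enum stands in for UnassignedInfo.Reason, which has more values than shown), so this is a shape, not the real implementation:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of reason-based grace logic; illustrative names only.
class GracePeriodSketch {
    enum Reason { INDEX_CREATED, ALLOCATION_FAILED, NODE_LEFT, REINITIALIZED,
                  REALLOCATED_REPLICA, PRIMARY_FAILED, NODE_RESTARTING }

    // Reasons that should turn health yellow without any grace period.
    static final Set<Reason> ALERT_IMMEDIATELY = EnumSet.of(
        Reason.ALLOCATION_FAILED, Reason.NODE_LEFT, Reason.REINITIALIZED,
        Reason.REALLOCATED_REPLICA, Reason.PRIMARY_FAILED, Reason.NODE_RESTARTING);

    static boolean shouldAlert(Reason reason, long unassignedTimeMillis,
                               long nowMillis, long gracePeriodMillis) {
        if (ALERT_IMMEDIATELY.contains(reason)) {
            return true;
        }
        // All other reasons stay quiet until the grace period elapses.
        return nowMillis - unassignedTimeMillis >= gracePeriodMillis;
    }
}
```

Toggling which reasons get a grace period would then be a one-line change to the set, without touching the health indicator's plumbing.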

@DaveCTurner
Contributor

handles the “index exists but shard routings aren’t created yet” case:

This logic dates back to 2016. I didn't know that was ever a case, but it's not any more at least - there is no "two-phase index creation", we populate the routing table with UNASSIGNED entries in the same cluster state update that creates the index. I think we should be able to assert shards.size() == numberOfShards here?

@DiannaHohensee
Contributor

The approach plugging metrics into ShardChangesObserver looks good to me 👍 Thanks for the ping!

CC'ing @nicktindall because he's been working on ES-13621, which similarly goes through Observers.

Contributor

@nicktindall nicktindall left a comment


Yep, this all seems sensible to me 👍

@inespot
Contributor Author

inespot commented Mar 25, 2026

Finally defeated the flaky CI tests! 🥳
@DaveCTurner @DiannaHohensee @nicktindall, from the comments above, it sounds like these metrics would be useful and that the current design looks sound.
Are we OK to merge this as-is, or would you rather wait for the linked ES-14351 PRs / ES-13621 to land first?

@nicktindall
Contributor

I can't speak for the others, but it's a yes from me. Dianna is on PTO today, and she's already given it a thumbs up so probably OK to merge with an approval from David?

@DaveCTurner
Contributor

I don't have much of an opinion here and also don't have a lot of review capacity at the moment so I'd rather delegate this to the others. I'm still broadly supportive of the idea and it seems orthogonal to ES-14351, no need to do things in either order.

@inespot
Contributor Author

inespot commented Mar 25, 2026

That makes sense, thanks both! Given that, @nicktindall would you be ok owning the remaining steps of review and potential PR approval (given the ES-13621 / observer overlap), or would you rather someone else from Distributed own it?

Contributor

@nicktindall nicktindall left a comment


Still LGTM, just a couple of questions/comments


private static Map<UnassignedInfo.Reason, Map<String, Object>> buildAttributesByReason(boolean primary) {
    return Arrays.stream(UnassignedInfo.Reason.values())
        .collect(Collectors.toUnmodifiableMap(r -> r, r -> Map.of("es_shard_primary", primary, "es_shard_reason", r.name())));
Contributor


I feel like in the past I've had issues with attribute values being something other than a string. I think the TestTelemetryProvider copes with it but the real one doesn't. Please just verify that because it may have changed or I may have misremembered.

Contributor Author


r.name() is a String so that one should work.

For the es_shard_primary boolean, I traced through the code, and it seems this should be handled properly:

static Attributes fromMap(String metricName, Map<String, Object> attributes) {
    if (attributes == null || attributes.isEmpty()) {
        return Attributes.empty();
    }
    MetricValidator.assertValidAttributeNames(metricName, attributes);
    var builder = Attributes.builder();
    attributes.forEach((k, v) -> {
        if (v instanceof String value) {
            builder.put(k, value);
        } else if (v instanceof Long value) {
            builder.put(k, value);
        } else if (v instanceof Integer value) {
            builder.put(k, value);
        } else if (v instanceof Byte value) {
            builder.put(k, value);
        } else if (v instanceof Short value) {
            builder.put(k, value);
        } else if (v instanceof Double value) {
            builder.put(k, value);
        } else if (v instanceof Float value) {
            builder.put(k, value);
        } else if (v instanceof Boolean value) {
            builder.put(k, value);
        } else {
            throw new IllegalArgumentException("attributes do not support value type of [" + v.getClass().getCanonicalName() + "]");
        }
    });
    return builder.build();
}

There are also other example paths in Elasticsearch using boolean attributes, e.g. the system_thread setting in:

private static void recordPhaseLatency(
    LongHistogram histogramMetric,
    long tookInNanos,
    ShardSearchRequest request,
    Long timeRangeFilterFromMillis
) {
    Map<String, Object> attributes = SearchRequestAttributesExtractor.extractAttributes(
        request,
        timeRangeFilterFromMillis,
        request.nowInMillis()
    );
    histogramMetric.record(TimeUnit.NANOSECONDS.toMillis(tookInNanos), attributes);
}

Contributor


Great, thanks for double-checking!

UnassignedInfo info = unassignedShard.unassignedInfo();
if (info != null) {
    long durationMillis = currentTimeMillisSupplier.getAsLong() - info.unassignedTimeMillis();
    unassignedToInitializingDuration.record(Math.max(0, durationMillis), attributes(info, initializedShard));
Contributor


I'm curious about the need to floor this value at 0, is this just to accommodate slight clock skew between the nodes?

Otherwise, if we are expecting to see missing unassigned times for some reason, is recording a 0 in that case going to dilute the histogram?

Contributor Author


I'm curious about the need to floor this value at 0, is this just to accommodate slight clock skew between the nodes?

Yep, that's the only reason! I went through the code and unassignedTimeMillis should always be a real timestamp.

Otherwise, if we are expecting to see missing unassigned times for some reason, is recording a 0 in that case going to dilute the histogram?

Clock skew large enough to produce a negative value would be unusual in practice, so I'm not too worried about this. Considering that, I'd rather floor to 0 than skip the recording entirely. Otherwise we could silently drop actual data points, which would make the histogram's total count slightly undercount actual transitions; that seems worse than a rare artificial 0ms entry. Happy to reconsider if you feel strongly though!

Contributor


Thanks for explaining, agree that we should just round to zero if it's due to clock skew

Contributor

@nicktindall nicktindall left a comment


Ship it!

@inespot inespot merged commit fab219c into elastic:main Mar 26, 2026
36 checks passed

Labels

:Distributed/Distributed (A catch all label for anything in the Distributed Area. Please avoid if you can.)
>non-issue
serverless-linked (Added by automation, don't add manually)
Team:Distributed (Meta label for distributed team.)
v9.4.0
