
Add metrics tracking shard time from unassigned to initialized/started #144521

Merged
inespot merged 13 commits into elastic:main from inespot:es-14351/shard_allocation_metrics
Mar 26, 2026

Conversation

@inespot
Contributor

@inespot inespot commented Mar 18, 2026

Extends ShardChangesObserver to emit two LongHistogram metrics that track how long a shard takes to go from UNASSIGNED to INITIALIZED to STARTED.

Relates to ES-14351.

There are two parts to this.

  1. Evaluate the baseline for shard assignment latency. Add metrics to track how long it typically takes for primary/replica shards to transition out of the unassigned state and eventually become STARTED, so we can define what a "normal wait" looks like. (tackled by this PR)
  2. Relax the yellow health check for unassigned replica shards. Keep cluster health green for a configurable duration (X) before moving to the noisy yellow state. There is already precedent for this in the codebase for the red state:
    private static boolean isUnassignedDueToNewInitialization(ProjectId projectId, ShardRouting routing, ClusterState state) {
        if (routing.active()) {
            return false;
        }
        // If the primary is inactive for unexceptional events in the cluster lifecycle, both the primary and the
        // replica are considered new initializations.
        ShardRouting primary = routing.primary()
            ? routing
            : state.routingTable(projectId).shardRoutingTable(routing.shardId()).primaryShard();
        return primary.active() == false && getInactivePrimaryHealth(primary) == ClusterHealthStatus.YELLOW;
    }

    (this second piece will be tackled in a follow-up PR).
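As a rough illustration of what part 1 measures, here is a minimal sketch of the duration recording. All names are hypothetical (FakeHistogram stands in for the real LongHistogram, and ShardTransitionMetricsSketch is not the actual ShardChangesObserver code); it only shows the shape of the idea: on a transition out of UNASSIGNED, record now minus unassignedTimeMillis, floored at 0 against clock skew.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Illustrative stand-in for the real LongHistogram; it just collects values.
class FakeHistogram {
    final List<Long> recorded = new ArrayList<>();
    void record(long valueMillis) {
        recorded.add(valueMillis);
    }
}

// Minimal sketch: record how long a shard spent UNASSIGNED once it
// starts INITIALIZING. Hypothetical names, not the PR's actual code.
class ShardTransitionMetricsSketch {
    private final FakeHistogram unassignedToInitializing = new FakeHistogram();
    private final LongSupplier currentTimeMillis;

    ShardTransitionMetricsSketch(LongSupplier currentTimeMillis) {
        this.currentTimeMillis = currentTimeMillis;
    }

    // Called when a shard transitions UNASSIGNED -> INITIALIZING.
    void onShardInitializing(long unassignedTimeMillis) {
        long durationMillis = currentTimeMillis.getAsLong() - unassignedTimeMillis;
        // Floor at 0 to guard against clock skew between nodes.
        unassignedToInitializing.record(Math.max(0, durationMillis));
    }

    List<Long> recordedDurations() {
        return unassignedToInitializing.recorded;
    }
}
```

The second histogram (INITIALIZED to STARTED) would follow the same pattern with a different start timestamp.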

@inespot
Contributor Author

inespot commented Mar 18, 2026

Design alternatives considered:

  • Record metrics inside ShardsAvailabilityHealthIndicatorService. But then metric emission would be tied to the health pipeline's calculate cadence rather than to actual shard transitions, so we would end up with less accurate data.
  • Instead of the observer pattern, compare the routing table before and after each allocation round directly in AllocationService methods. This could allow more flexible filtering of which transitions to record, but requires keeping the call sites up to date as the code evolves, and makes the logic more complex.
  • Use the existing AllocationBalancingRoundMetrics or DesiredBalanceMetrics. The former felt like the wrong abstraction (it tracks balancing round metrics, not the current state of the cluster). The latter feels more related but tracks a snapshot of current allocation state, not event latency. It would also require more scaffolding to plug in transition times.

This approach (ShardChangesObserver) does stretch the class's single responsibility of logging, but metrics and logging are closely related enough that co-locating them seems reasonable?

Open to other opinions/happy to refactor though!

@inespot inespot marked this pull request as ready for review March 19, 2026 04:27
@inespot inespot added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >non-issue labels Mar 19, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 19, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@inespot
Contributor Author

inespot commented Mar 19, 2026

To note

  • ⚠️ This will also need a serverless-side PR to fix the AllocationService constructor in StatelessSnapshotResiliencyTests (if we move forward with this design; see above for the alternatives considered)

  • In case of a node restart or a node leaving the cluster (Reason.NODE_LEFT and Reason.NODE_RESTARTING), allocation of the unassigned shard may be delayed (up to the configured index.unassigned.node_left.delayed_timeout) because it is more efficient to wait and see whether the node returns than to trigger recovery immediately.
    We add the unassignment reason as an attribute on the metric so those cases can be distinguished.

  • The metrics will be recorded even if the cluster state publication fails. That should be acceptable given the "stats/baseline" purpose of these metrics.

  • Only the master node emits these metrics, since the observer fires during cluster state task execution. No cleanup should be needed on master demotion since, unlike gauges, histograms are point-in-time recordings with no accumulated state to reset (but let me know if I missed something here)

@inespot
Contributor Author

inespot commented Mar 19, 2026

@DiannaHohensee, added you as an optional reviewer since this change touches the Allocation classes!

@DaveCTurner
Contributor

I'm supportive of more metrics in this area, but in terms of ES-14351 I was thinking we could use the index.creation_date setting as the start time. We care about shards which become unassigned later on in their lifecycles; it's particularly newly-created shards that need to be excluded from these alerts.

Thinking about it more, though, there are some other reasons for a shard to be unassigned without deserving an alert, and I think we have everything we need in the routing table in UnassignedInfo#unassignedTimeMillis and UnassignedInfo.Reason (i.e. alert on ALLOCATION_FAILED, NODE_LEFT, REINITIALIZED, REALLOCATED_REPLICA, PRIMARY_FAILED or NODE_RESTARTING but allow a grace period for the others).

@inespot
Contributor Author

inespot commented Mar 19, 2026

I'm supportive of more metrics in this area, but in terms of ES-14351 I was thinking we could use the index.creation_date setting as the start time. We care about shards which become unassigned later on in their lifecycles; it's particularly newly-created shards that need to be excluded from these alerts.

I agree those metrics go a little beyond the ES-14351 initial goal (avoiding noise while "new index shards are still being allocated"). However, adding baseline visibility into how long allocation typically takes (UNASSIGNED -> INITIALIZED -> STARTED), and how that varies by UnassignedInfo.Reason, is generally useful both for ES-14351 and if we want to start excluding more cases than "newly-created shards" in the future.

Thinking about it more, though, there are some other reasons for a shard to be unassigned without deserving an alert, and I think we have everything we need in the routing table in UnassignedInfo#unassignedTimeMillis and UnassignedInfo.Reason (i.e. alert on ALLOCATION_FAILED, NODE_LEFT, REINITIALIZED, REALLOCATED_REPLICA, PRIMARY_FAILED or NODE_RESTARTING but allow a grace period for the others).

Yep, using index.creation_date for the "new index" case would require a metadata lookup and doesn't generalize as well if we move beyond that one scenario. Separately, the existing health computation already handles the "index exists but shard routings aren't created yet" case:

if (shards.isEmpty()) { // might be since none has been created yet (two phase index creation)
    computeStatus = ClusterHealthStatus.RED;
}
So it seems reasonable to treat that as a distinct signal from "shard became unassigned at time T and is still unassigned after some grace period"? Open to other opinions though!

@inespot
Contributor Author

inespot commented Mar 19, 2026

I also like the concept of triggering the grace/alert logic based on UnassignedInfo.Reason (and unassignedTimeMillis) because I think it will give us more flexibility over time. It's "pluggable": we can toggle which reasons get a grace period without introducing extra plumbing. Based on the existing code in ShardsAvailabilityHealthIndicatorService, it seems like the logic would not be too complex either. But I'll start drafting a PR to make this more concrete
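To make that concrete, here is a minimal sketch of what reason-based grace logic could look like, using the alert/grace split of reasons suggested earlier in the thread. All names here are hypothetical (a local Reason enum stands in for UnassignedInfo.Reason, which has more values than shown), so this is a shape, not the real implementation:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of reason-based grace logic; illustrative names only.
class GracePeriodSketch {
    enum Reason { INDEX_CREATED, ALLOCATION_FAILED, NODE_LEFT, REINITIALIZED,
                  REALLOCATED_REPLICA, PRIMARY_FAILED, NODE_RESTARTING }

    // Reasons that should turn health yellow without any grace period.
    static final Set<Reason> ALERT_IMMEDIATELY = EnumSet.of(
        Reason.ALLOCATION_FAILED, Reason.NODE_LEFT, Reason.REINITIALIZED,
        Reason.REALLOCATED_REPLICA, Reason.PRIMARY_FAILED, Reason.NODE_RESTARTING);

    static boolean shouldAlert(Reason reason, long unassignedTimeMillis,
                               long nowMillis, long gracePeriodMillis) {
        if (ALERT_IMMEDIATELY.contains(reason)) {
            return true;
        }
        // All other reasons stay quiet until the grace period elapses.
        return nowMillis - unassignedTimeMillis >= gracePeriodMillis;
    }
}
```

Toggling which reasons get a grace period would then be a one-line change to the set, without touching the health indicator's plumbing.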

@DaveCTurner
Contributor

handles the “index exists but shard routings aren’t created yet” case:

This logic dates back to 2016. I didn't know that was ever a case, but it's not any more at least - there is no "two-phase index creation", we populate the routing table with UNASSIGNED entries in the same cluster state update that creates the index. I think we should be able to assert shards.size() == numberOfShards here?

@DiannaHohensee
Contributor

The approach plugging metrics into ShardChangesObserver looks good to me 👍 Thanks for the ping!

CC'ing @nicktindall because he's been working on ES-13621, which similarly goes through Observers.

Contributor

@nicktindall nicktindall left a comment


Yep, this all seems sensible to me 👍

@inespot
Contributor Author

inespot commented Mar 25, 2026

Finally defeated the flaky CI tests! 🥳
@DaveCTurner @DiannaHohensee @nicktindall, from the comments above, it sounds like these metrics would be useful and that the current design looks sound.
Are we OK to merge this as-is, or would you rather wait for the linked ES-14351 PRs / ES-13621 to land first?

@nicktindall
Contributor

I can't speak for the others, but it's a yes from me. Dianna is on PTO today, and she's already given it a thumbs up so probably OK to merge with an approval from David?

@DaveCTurner
Contributor

I don't have much of an opinion here and also don't have a lot of review capacity at the moment so I'd rather delegate this to the others. I'm still broadly supportive of the idea and it seems orthogonal to ES-14351, no need to do things in either order.

@inespot
Contributor Author

inespot commented Mar 25, 2026

That makes sense, thanks both! Given that, @nicktindall would you be ok owning the remaining steps of review and potential PR approval (given the ES-13621 / observer overlap), or would you rather someone else from Distributed own it?

Contributor

@nicktindall nicktindall left a comment


Still LGTM, just a couple of questions/comments


private static Map<UnassignedInfo.Reason, Map<String, Object>> buildAttributesByReason(boolean primary) {
    return Arrays.stream(UnassignedInfo.Reason.values())
        .collect(Collectors.toUnmodifiableMap(r -> r, r -> Map.of("es_shard_primary", primary, "es_shard_reason", r.name())));
Contributor


I feel like in the past I've had issues with attribute values being something other than a string. I think the TestTelemetryProvider copes with it but the real one doesn't. Please just verify that because it may have changed or I may have misremembered.

Contributor Author


r.name() is a String so that one should work.

For the es_shard_primary boolean, I traced through the code, and it seems this should be handled properly:

static Attributes fromMap(String metricName, Map<String, Object> attributes) {
    if (attributes == null || attributes.isEmpty()) {
        return Attributes.empty();
    }
    MetricValidator.assertValidAttributeNames(metricName, attributes);
    var builder = Attributes.builder();
    attributes.forEach((k, v) -> {
        if (v instanceof String value) {
            builder.put(k, value);
        } else if (v instanceof Long value) {
            builder.put(k, value);
        } else if (v instanceof Integer value) {
            builder.put(k, value);
        } else if (v instanceof Byte value) {
            builder.put(k, value);
        } else if (v instanceof Short value) {
            builder.put(k, value);
        } else if (v instanceof Double value) {
            builder.put(k, value);
        } else if (v instanceof Float value) {
            builder.put(k, value);
        } else if (v instanceof Boolean value) {
            builder.put(k, value);
        } else {
            throw new IllegalArgumentException("attributes do not support value type of [" + v.getClass().getCanonicalName() + "]");
        }
    });
    return builder.build();
}

There are also other example paths in Elasticsearch using boolean attributes, e.g. the system_thread setting in:

private static void recordPhaseLatency(
    LongHistogram histogramMetric,
    long tookInNanos,
    ShardSearchRequest request,
    Long timeRangeFilterFromMillis
) {
    Map<String, Object> attributes = SearchRequestAttributesExtractor.extractAttributes(
        request,
        timeRangeFilterFromMillis,
        request.nowInMillis()
    );
    histogramMetric.record(TimeUnit.NANOSECONDS.toMillis(tookInNanos), attributes);
}

Contributor


Great, thanks for double-checking!

UnassignedInfo info = unassignedShard.unassignedInfo();
if (info != null) {
    long durationMillis = currentTimeMillisSupplier.getAsLong() - info.unassignedTimeMillis();
    unassignedToInitializingDuration.record(Math.max(0, durationMillis), attributes(info, initializedShard));
Contributor


I'm curious about the need to floor this value at 0, is this just to accommodate slight clock skew between the nodes?

Otherwise, if we are expecting to see missing unassigned times for some reason, is recording a 0 in that case going to dilute the histogram?

Contributor Author


I'm curious about the need to floor this value at 0, is this just to accommodate slight clock skew between the nodes?

Yep, that's the only reason! I went through the code and unassignedTimeMillis should always be a real timestamp.

Otherwise, if we are expecting to see missing unassigned times for some reason, is recording a 0 in that case going to dilute the histogram?

Clock skew large enough to produce a negative value would be unusual in practice, so I'm not too worried about this. Considering that, I'd rather floor to 0 than skip the recording entirely. Otherwise we could silently drop actual data points, which would make the histogram's total count slightly undercount actual transitions; that seems worse than a rare artificial 0ms entry. Happy to reconsider if you feel strongly though!

Contributor


Thanks for explaining, agree that we should just round to zero if it's due to clock skew

Contributor

@nicktindall nicktindall left a comment


Ship it!

@inespot inespot merged commit fab219c into elastic:main Mar 26, 2026
36 checks passed

Labels

:Distributed/Distributed (A catch all label for anything in the Distributed Area. Please avoid if you can.)
>non-issue
serverless-linked (Added by automation, don't add manually)
Team:Distributed (Meta label for distributed team.)
v9.4.0
