Add metrics tracking shard time from unassigned to initialized/started#144521
Add metrics tracking shard time from unassigned to initialized/started#144521inespot merged 13 commits intoelastic:mainfrom
Conversation
…d state Extends ShardChangesObserver to emit two LongHistogram metrics that track how long a shard takes to go from UNASSIGNED to INITIALIZED to STARTED. Relates to ES-14351
|
Design alternatives considered:
This approach ( Open to other opinions/happy to refactor though! |
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
To note
|
|
@DiannaHohensee, added you as an optional reviewer since this change touches the Allocation classes! |
|
I'm supportive of more metrics in this area, but in terms of ES-14351 I was thinking we could use the Thinking about it more tho there are some other reasons for a shard to be unassigned without deserving an alert, and I think we have everything we need in the routing table in |
I agree those metrics go a little bit beyond the ES-14351 initial goal (avoid noise while “new index shards are still being allocated"). However, adding baseline visibility into how long allocation typically takes (
Yep, using So it seems reasonable to treat that as a distinct signal from “shard became unassigned at time T and is still unassigned after some grace period.”? Open to other opinions though! |
|
I also like the concept of triggering the grace/alert logic based on |
This logic dates back to 2016. I didn't know that was ever a case, but it's not any more at least - there is no "two-phase index creation", we populate the routing table with |
|
The approach plugging metrics into ShardChangesObserver looks good to me 👍 Thanks for the ping! CC'ing @nicktindall because he's been working on ES-13621, which similarly goes through Observers. |
nicktindall
left a comment
There was a problem hiding this comment.
Yep, this all seems sensible to me 👍
|
Finally defeated the flaky CI tests! 🥳 |
|
I can't speak for the others, but it's a yes from me. Dianna is on PTO today, and she's already given it a thumbs up so probably OK to merge with an approval from David? |
|
I don't have much of an opinion here and also don't have a lot of review capacity at the moment so I'd rather delegate this to the others. I'm still broadly supportive of the idea and it seems orthogonal to ES-14351, no need to do things in either order. |
|
That makes sense, thanks both! Given that, @nicktindall would you be ok owning the remaining steps of review and potential PR approval (given the ES-13621 / observer overlap), or would you rather someone else from Distributed own it? |
nicktindall
left a comment
There was a problem hiding this comment.
Still LGTM, just a couple of questions/comments
|
|
||
| private static Map<UnassignedInfo.Reason, Map<String, Object>> buildAttributesByReason(boolean primary) { | ||
| return Arrays.stream(UnassignedInfo.Reason.values()) | ||
| .collect(Collectors.toUnmodifiableMap(r -> r, r -> Map.of("es_shard_primary", primary, "es_shard_reason", r.name()))); |
There was a problem hiding this comment.
I feel like in the past I've had issues with attribute values being something other than a string. I think the TestTelemetryProvider copes with it but the real one doesn't. Please just verify that because it may have changed or I may have misremembered.
There was a problem hiding this comment.
r.name() is a String so that one should work.
For the es_shard_primary boolean, I traced down the code, it seems that this should be handled properly:
There are also other example paths in Elasticsearch using boolean attributes, e.g the system_thread setting in:
There was a problem hiding this comment.
Great, thanks for double-checking!
| UnassignedInfo info = unassignedShard.unassignedInfo(); | ||
| if (info != null) { | ||
| long durationMillis = currentTimeMillisSupplier.getAsLong() - info.unassignedTimeMillis(); | ||
| unassignedToInitializingDuration.record(Math.max(0, durationMillis), attributes(info, initializedShard)); |
There was a problem hiding this comment.
I'm curious about the need to floor this value at 0, is this just to accommodate slight clock skew between the nodes?
Otherwise, if we are expecting to see missing unassigned times for some reason, is recording a 0 in that case going to dilute the histogram?
There was a problem hiding this comment.
I'm curious about the need to floor this value at 0, is this just to accommodate slight clock skew between the nodes?
Yep, that's the only reason! I went through the code and unassignedTimeMillis should always be a real timestamp.
Otherwise, if we are expecting to see missing unassigned times for some reason, is recording a 0 in that case going to dilute the histogram?
Clock skew large enough to produce a negative value would be unusual in practice so not too worried about this. Considering that, I'd rather floor to 0 than skip the recording entirely. Otherwise we could silently be dropping actual data points would make the histogram's total count slightly undercount actual transitions, which seems worse than a rare artificial 0ms entry. Happy to reconsider if you feel strongly though!
There was a problem hiding this comment.
Thanks for explaining, agree that we should just round to zero if it's due to clock skew
Extends ShardChangesObserver to emit two LongHistogram metrics that track how long a shard takes to go from
UNASSIGNEDtoINITIALIZEDtoSTARTED.Relates to ES-14351.
There are two parts to this.
STARTED, so we can define what a "normal wait" looks like. (tackled by this PR)elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java
Lines 470 to 480 in d7f7f28
(this second piece will be tackled in a follow-up PR).