
Conversation


@DiannaHohensee commented Aug 11, 2025

The TransportNodeUsageStatsForThreadPoolsAction now takes the max latency of any task currently queued in the write thread pool queue AND the previously collected max queue latency of any task dequeued since the last call. This covers the possibility that queue times can rise greatly before being reflected in execution: imagine all the write threads are stalled or have long-running tasks. This action feeds a max queue latency stat to the ClusterInfo. Follow-up from ES-12233.

Adds additional IT testing to exercise both forms of queue latency, a follow-up for ES-12316.


Completing the [follow-up testing Henning requested previously](https://github.com/elastic/elasticsearch/pull/131480/files#r2243610807).
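
For readers skimming the description, a minimal sketch of the idea, using illustrative names rather than the actual Elasticsearch classes: the reported stat is the max of (a) the age of the oldest task still sitting in the write queue and (b) the largest queue latency seen for tasks dequeued since the previous poll, so a pool whose threads are all stalled still surfaces rising latency.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names and structure are assumptions, not the real classes.
class MaxQueueLatencySketch {
    // Max queue latency of tasks dequeued since the last poll; reset when read.
    private final AtomicLong maxDequeuedLatencyMillis = new AtomicLong();

    // Called by a worker thread when it pulls a task off the write queue.
    void onTaskDequeued(long queuedForMillis) {
        maxDequeuedLatencyMillis.accumulateAndGet(queuedForMillis, Math::max);
    }

    // Called when usage stats are collected for ClusterInfo.
    long pollMaxQueueLatencyMillis(long oldestQueuedTaskAgeMillis) {
        long dequeuedMax = maxDequeuedLatencyMillis.getAndSet(0);
        // Max of the front-of-queue age and the dequeued max: if all write threads
        // are stalled and nothing is being dequeued, the first term keeps growing.
        return Math.max(oldestQueuedTaskAgeMillis, dequeuedMax);
    }
}
```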

@DiannaHohensee self-assigned this Aug 11, 2025
@DiannaHohensee requested a review from a team as a code owner August 11, 2025 17:23
@DiannaHohensee added the >enhancement, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.2.0 labels Aug 11, 2025
@elasticsearchmachine commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine commented:

Hi @DiannaHohensee, I've created a changelog YAML for you.

* that is returned, which waits for total-write-threads + 1 callers. The caller can release the tasks by calling
* {@code barrier.await()} or interrupt them with {@code barrier.reset()}.
*/
public CyclicBarrier blockDataNodeIndexing(String dataNodeName) {
@DiannaHohensee (PR author) commented:

I stole this from some serverless testing.
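
As context for the javadoc above, a minimal sketch of the pattern it describes (assumed names and structure, not the actual serverless test utility): each write thread gets a task that parks on a shared barrier sized at write-threads + 1, so the test thread is the final party that either releases them via await() or breaks them out via reset().

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;

// Illustrative only; not the real blockDataNodeIndexing implementation.
final class BlockWritersSketch {
    static CyclicBarrier blockAllWriteThreads(ExecutorService writeExecutor, int writeThreadCount) {
        // writeThreadCount parties park inside the pool, plus one party for the test thread.
        CyclicBarrier barrier = new CyclicBarrier(writeThreadCount + 1);
        for (int i = 0; i < writeThreadCount; i++) {
            writeExecutor.submit(() -> {
                try {
                    barrier.await(); // parked until the test calls await(), or reset() breaks the barrier
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (BrokenBarrierException e) {
                    // the test called reset() to release the workers without running anything further
                }
            });
        }
        return barrier;
    }
}
```

While the barrier holds every write thread, any further indexing requests have to wait in the write queue, which is exactly the condition the queue-latency stat is meant to expose.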

);
}

public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException {
@DiannaHohensee (PR author) commented:

Pulling this from another test file, so it can be reused.

@henningandersen commented:

I think we do not need to reuse this - so prefer to keep it local instead. Ideally we'd not have such unqualified waits, but we can look at that separately.

@DiannaHohensee (PR author) commented:

Removed. ee2159a
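
The helper's body is not shown in this thread; purely for context, a minimal sketch of what a wall-clock wait with this signature typically looks like (an assumption, not the code that was removed in ee2159a): sleep until at least elapsedMillis of real time has passed, re-checking because sleep can return before the full duration has elapsed.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the removed helper.
final class WaitSketch {
    public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException {
        final long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(elapsedMillis);
        long remainingNanos;
        // Loop until the deadline is reached, since sleep may wake slightly early.
        while ((remainingNanos = deadlineNanos - System.nanoTime()) > 0) {
            TimeUnit.NANOSECONDS.sleep(remainingNanos);
        }
    }
}
```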

@DiannaHohensee changed the title from "Add second max queue latency stat to ClusterInfo" to "Send max of two types of max queue latency to ClusterInfo" Aug 11, 2025
@henningandersen left a comment:

Left a number of comments, otherwise looks good.

I wonder if we really need this in a first version, but now that it is here, it is fine to get in.


@DiannaHohensee (PR author) left a comment:

Thanks for the review, I've addressed the feedback.

I also realized that the peek() method was returning Nanos, not Millis, so I fixed that (a23c716).
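
For context on the nanos-vs-millis fix (illustrative only; the actual peek() touched in a23c716 is not shown here): queue-insertion timestamps taken with System.nanoTime() have to be converted before the value is reported alongside millisecond stats.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only; not the actual peek() method.
final class PeekSketch {
    // Age of the oldest queued task, reported in milliseconds.
    static long peekOldestTaskQueueLatencyMillis(long oldestTaskInsertionNanoTime) {
        long queuedForNanos = System.nanoTime() - oldestTaskInsertionNanoTime;
        return TimeUnit.NANOSECONDS.toMillis(queuedForNanos); // the stat is consumed in millis
    }
}
```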


@henningandersen left a comment:

LGTM.

@DiannaHohensee added the auto-merge-without-approval label Aug 14, 2025
@elasticsearchmachine merged commit 97a6dc8 into elastic:main Aug 14, 2025
33 checks passed
@DiannaHohensee deleted the 2025/08/11/max-of-two-latencies branch August 14, 2025 16:27
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Aug 15, 2025
…2675)

