
Conversation

@mhl-b
Contributor

@mhl-b mhl-b commented Jul 25, 2025

This PR changes Thread Pool utilization reporting intervals from dynamic (based on the caller's polling frequency) to static (time-frame based).

Originally there was a single consumer of the utilization metric, running at a fixed interval, so we calculated utilization on the fly by comparing the previous invocation time with the current one. And it worked, kind of... With growing demand for utilization metrics from shard balancing and allocation, we have new consumers, and this polling mechanism does not scale well: the thread pool needs to create a separate tracker for each caller.

This PR introduces total execution time measurement for the thread pool per fixed time frame. By default the thread pool measures time in 1 second frames and keeps the last 30 frames in memory. When we calculate utilization we look at the past 30 frames, so utilization is at most 1 second stale.
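For illustration, a minimal sketch of the framed-accumulation idea, ignoring concurrency (the class and names below are made up for this description and are not the actual FramedTimeTracker code):

    // Illustrative only: execution time is added to the frame covering "now";
    // reads sum the retained frames, so the result is at most one frame stale.
    class FramedAccumulatorSketch {
        private final long frameNanos;   // frame length, e.g. 1 second
        private final long[] frames;     // ring buffer of per-frame execution time, e.g. 30 frames
        private long currentFrame;       // index of the frame covering "now"

        FramedAccumulatorSketch(long frameNanos, int frameCount) {
            this.frameNanos = frameNanos;
            this.frames = new long[frameCount];
        }

        synchronized void addExecutionTime(long nowNanos, long executionNanos) {
            long frame = nowNanos / frameNanos;
            // clear slots we rolled past since the last update (at most one full lap)
            for (long f = currentFrame + 1; f <= frame && f <= currentFrame + frames.length; f++) {
                frames[(int) (f % frames.length)] = 0;
            }
            currentFrame = Math.max(currentFrame, frame);
            frames[(int) (frame % frames.length)] += executionNanos;
        }

        synchronized long totalTime() {
            long sum = 0;
            for (long t : frames) {
                sum += t;
            }
            return sum;
        }
    }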

A new FramedTimeTracker encapsulates the frame tracking logic and comes with its own set of tests. A JMH benchmark was also added.

A few options were considered for concurrent access to FramedTimeTracker: synchronized methods, a read-write lock, and a non-locking approach.
The synchronized variant uses private long fields for tracking. The read-write-lock variant takes the write lock when updating the frame and the read lock when updating atomic fields within a frame. Both are in the commit history. The non-locking algorithm uses frame windows, an atomic flag for updates, and Thread.onSpinWait while an update is in progress; see the sketch below.
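Roughly, the non-locking update pattern looks like the sketch below (illustrative only; the class, fields and methods are made up, and the actual implementation keeps a window of several frames rather than a single one):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.LongAdder;

    // Sketch of the pattern: writers add to a LongAdder in the current window; a single
    // CAS-guarded rotation installs a new window when the frame boundary is crossed,
    // and threads that lose the race spin briefly instead of blocking.
    class NonLockingFrameSketch {
        private record Window(long frameIndex, LongAdder executionTime) {}

        private final AtomicReference<Window> window = new AtomicReference<>(new Window(0, new LongAdder()));
        private final AtomicBoolean rotating = new AtomicBoolean();

        void add(long frameIndex, long nanos) {
            Window w = window.get();
            if (w.frameIndex() < frameIndex) {
                if (rotating.compareAndSet(false, true)) {
                    try {
                        window.set(new Window(frameIndex, new LongAdder()));
                    } finally {
                        rotating.set(false);
                    }
                } else {
                    while (rotating.get()) {
                        Thread.onSpinWait();   // another thread is rotating; busy-wait briefly
                    }
                }
                w = window.get();
            }
            w.executionTime().add(nanos);
        }
    }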

Currently the non-locking algorithm is preferable. It takes a bit more memory than the locking variants to keep track of extra frames, LongAdders as counters, and a few atomics, but that is a good tradeoff for the low performance overhead, which is almost identical to a baseline that just runs busy CPU cycles.

The baseline is 10000 busy CPU cycles, which is about 16 microseconds. Our write thread pool does have small, sub-millisecond tasks.
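For context, the baseline is essentially JMH's busy-work primitive; the shape of the two benchmark methods is roughly the sketch below (not the actual ThreadPoolUtilizationBenchmark source; the tracker call is only indicated in a comment):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.infra.Blackhole;

    public class UtilizationBenchmarkSketch {

        @Benchmark
        @BenchmarkMode(Mode.SampleTime)
        public void baseline() {
            Blackhole.consumeCPU(10_000);   // ~16us of busy work, same as the baseline above
        }

        @Benchmark
        @BenchmarkMode(Mode.SampleTime)
        public void startAndEnd(Blackhole bh) {
            long start = System.nanoTime();
            Blackhole.consumeCPU(10_000);   // same busy work...
            long elapsed = System.nanoTime() - start;
            // ...plus the tracking under test; the real benchmark records this through the
            // tracker, e.g. tracker.add(elapsed) (hypothetical call)
            bh.consume(elapsed);
        }
    }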

Latest result from my machine

./gradlew -p benchmarks run --args 'ThreadPoolUtilizationBenchmark'

Benchmark                                           (callIntervalTicks)  (frameDurationMs)  (reportingDurationMs)    Mode       Cnt       Score   Error  Units
ThreadPoolUtilizationBenchmark.StartAndEnd                        10000               1000                  10000  sample  89636316      20.173 ± 0.012  us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.00                  10000               1000                  10000  sample                16.576          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.50                  10000               1000                  10000  sample                17.152          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.90                  10000               1000                  10000  sample                23.264          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.95                  10000               1000                  10000  sample                23.360          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.99                  10000               1000                  10000  sample                23.776          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.999                 10000               1000                  10000  sample                75.008          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.9999                10000               1000                  10000  sample              1327.104          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p1.00                  10000               1000                  10000  sample             30474.240          us/op
ThreadPoolUtilizationBenchmark.baseline                           10000               1000                  10000  sample  87279514      20.947 ± 0.032  us/op
ThreadPoolUtilizationBenchmark.baseline:p0.00                     10000               1000                  10000  sample                16.032          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.50                     10000               1000                  10000  sample                16.864          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.90                     10000               1000                  10000  sample                23.040          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.95                     10000               1000                  10000  sample                23.072          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.99                     10000               1000                  10000  sample                23.200          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.999                    10000               1000                  10000  sample               256.000          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.9999                   10000               1000                  10000  sample              3907.584          us/op
ThreadPoolUtilizationBenchmark.baseline:p1.00                     10000               1000                  10000  sample            140247.040          us/op

@mhl-b mhl-b force-pushed the framed-thread-pool-utilization branch from 23fe464 to 8345da4 on July 26, 2025 05:35
@mhl-b mhl-b changed the title framed-time-tracker Time framed thread-pool utilization Jul 26, 2025
@mhl-b mhl-b marked this pull request as ready for review July 26, 2025 05:59
@mhl-b mhl-b requested a review from a team as a code owner July 26, 2025 05:59
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 26, 2025
@mhl-b mhl-b added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team and removed needs:triage Requires assignment of a team area label labels Jul 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

Comment on lines 581 to 586
* @param trackExecutionTime Whether to track execution stats
* @param trackUtilization enables thread-pool utilization metrics
* @param utilizationInterval when utilization is enabled, specifies interval for measurement
* @param trackOngoingTasks Whether to track ongoing task execution time, not just finished tasks
* @param trackMaxQueueLatency Whether to track max queue latency.
* @param executionTimeEwmaAlpha The alpha seed for execution time EWMA (ExponentiallyWeightedMovingAverage).
Contributor

Nit - mixed param description capitalization and ending periods. (What is the team preference here? It looks to be inconsistent across the codebase.)

Comment on lines 273 to 275
assert intervalNano > 0;
this.interval = intervalNano;
this.timeNow = timeNow;
Contributor

@JeremyDahlgren JeremyDahlgren Jul 29, 2025

Nit - is it worth using a this(...) to call the other constructor and eliminate the duplicate code? (I know it is only a few lines, just wanted to mention it.)

Contributor Author

truly nit

Contributor

The other way around might be nice

    FramedTimeTracker(long intervalNano) {
        this(intervalNano, System::nanoTime);
    }

🤷‍♀️

Contributor

truly nit

I find this worth addressing; we would not want these constructors to get out of sync in the future.

Your comment makes it hard to tell whether you are rejecting or accepting the proposal, hence my comment here.

Another approach would be to remove this helper constructor; it seems to be used only once, and it is neither unreasonable nor unusual to let client code pass in the time-tracking mechanism it wants to use.

Contributor Author

@mhl-b mhl-b Jul 30, 2025

No offence to Jeremy about "truly nit" :) I did talk with him about this PR.
Of course I will change it; if it catches the eye of a reader, I don't have a strong preference. The time provider can go into the tracking config with System::nanoTime as the default.

Contributor

@henningandersen henningandersen left a comment

I think this adds staleness, overhead and complexity. I am inclined to forgo this for now and focus on the real balancing improvement work. I added a separate suggestion for how we could handle this instead, though I think we should simply live with what we have for now to ensure progress.

 */
public synchronized long previousFrameTime() {
    updateFrame0(timeNow.get());
    return previousTime;
Contributor

I think this is up to 30s stale. I think for this to work well we'd have to track it every second or so and then sum up all the prior frames. This adds even more overhead.

I think we should instead accept the original time-tracking's inaccuracy. The approach currently in the code is good enough, but we could consider moving the reset part out into the consuming code. The node-usage stats could then track the prior value themselves, either on the node or on the master.
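Roughly, that alternative could look like the sketch below (illustrative names only, not code from this PR): the pool exposes a monotonically increasing total, and each consumer keeps its previous sample and derives the delta itself.

    // Sketch: the consumer tracks its own previous cumulative sample and derives utilization.
    class ConsumerSideUtilizationSketch {
        private long lastTotalNanos;
        private long lastSampleNanos;

        // totalExecutedNanos: cumulative execution time reported by the pool (assumed accessor)
        double sample(long totalExecutedNanos, long nowNanos, int maxPoolSize) {
            long deltaWork = totalExecutedNanos - lastTotalNanos;
            long deltaTime = nowNanos - lastSampleNanos;
            lastTotalNanos = totalExecutedNanos;
            lastSampleNanos = nowNanos;
            return deltaTime <= 0 ? 0.0 : (double) deltaWork / maxPoolSize / deltaTime;
        }
    }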

Contributor Author

Addressed staleness in 36374fc.

I measure time in 1 second frames and use a 30 second window, so it's at most 1 second stale yet accurate over the last 30 seconds. Updated the benchmark; overhead is still negligible.

@mhl-b
Contributor Author

mhl-b commented Jul 30, 2025

@henningandersen there are a few things that motivated this change rather than keeping the existing logic:

  1. Utilization tracking requires its own infrastructure to get into ClusterInfo; we cannot embed it into NodeStats, yet essentially it is a node stat. Its own infrastructure means a transport message with sender/receiver and wiring into ClusterInfo, while node stats are already available and wired into ClusterInfo.
  2. There are errors in the reported utilization numbers; we can see values over 100%. This impacts the Balancing Simulator, since we model movement by changing utilization.
  3. I can't imagine we would live with the current implementation of utilization for long; it is a pretty generic metric for the thread pool. Once we start measuring utilization purely in the thread pool rather than per caller, we would have to remove all the code we would have written for the balancing work. That sounds like double work to me. Maybe we can deliver the first version of load balancing a few days faster, but then we create much more work for the future.

Also, the time interval is configurable and can be set to lower values. It's easy to retain multiple frames for reporting purposes, for example 30 frames at a 1 second interval.

What I'm saying is that it would be really nice to put it into NodeStats; it would reduce the scope of changes for the write-load balancing work. But to put it into node stats it should be independent from the callers. Also, the latest version has a non-locking algorithm with negligible overhead.

edit:

It's easy to retain multiple frames for reporting purposes, for example 30 frames at a 1 second interval.

This is done in 36374fc.

    throw new IllegalStateException("No operation defined for [" + utilizationTrackingPurpose + "]");
}

public double utilization() {
    return (double) framedTimeTracker.totalTime() / (double) getMaximumPoolSize() / (double) framedTimeTracker.reportingInterval();
Contributor

This bit of logic feels like it should be in the tracker itself perhaps?

Contributor Author

I was thinking about it. I don't think the frame tracker has to know anything about threads or pool size.
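For intuition about the division above: utilization is the total executed time as a fraction of the pool's thread-time capacity over the reporting window. A worked example with made-up numbers:

    // Made-up numbers: 8 threads and a 30s reporting interval give 240s of thread-time capacity.
    double totalTimeNanos = 120_000_000_000.0;          // 120s of task execution observed in the window
    double maxPoolSize = 8;
    double reportingIntervalNanos = 30_000_000_000.0;   // 30s window
    double utilization = totalTimeNanos / maxPoolSize / reportingIntervalNanos;   // 0.5, i.e. 50%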

false,
false,
DEFAULT_EXECUTION_TIME_EWMA_ALPHA_FOR_TEST
);
Contributor

I know this is peripheral, but I think we should move to using the builder for DEFAULT and DO_NOT_TRACK. All those flags are meaningless without mouseovers, more so now that we've added another one.

final var now = nowTime / frameDuration;
final var frameWindow = getWindow(now);
frameWindow.frames[0].ongoingTasks.increment();
frameWindow.frames[0].startEndDiff.add((now + 1) * frameDuration - nowTime);
Contributor

Nit: can this be frameDuration - (nowTime % frameDuration)? Not sure if that's more efficient, but it's fewer operations.
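Both expressions compute the time remaining in the current frame; since now = nowTime / frameDuration uses integer division, they are equal for non-negative times. A quick check with made-up values:

    // Equivalence check for the two expressions (integer division, non-negative times).
    long frameDuration = 1_000;
    long nowTime = 3_250;
    long now = nowTime / frameDuration;                    // 3
    long a = (now + 1) * frameDuration - nowTime;          // 4000 - 3250 = 750
    long b = frameDuration - (nowTime % frameDuration);    // 1000 - 250  = 750
    assert a == b;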

        newFrames[i] = frame;
    }
    return new FrameWindow(newFrames);
}
Contributor

Seems like more object creation going on than I would have hoped for. I know it's probably necessary for safety, but if I'm reading it right, I think it means we create a new

  • FrameWindow
  • Frame[]
  • Frame, each of which includes 2 x LongAdder

every second for a 30s/1s tracker?

Contributor Author

Correct; if I reuse the window there is a risk of a race condition. For reuse I would need to reset frames, and it might happen that start/end gets the window and tries to update a frame at the same moment the window update happens.

I would rather think about collapsing old frames into one big past frame, since it's unlikely to change. Usually only the current and current-1 frames race.
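A rough sketch of that idea (not from this PR; the class and fields below are illustrative): frames that writers can no longer touch get folded into a single accumulated total, so only the current frame and its predecessor stay as live LongAdders.

    import java.util.concurrent.atomic.LongAdder;

    // Illustrative only: keep live LongAdders just for the frames that may still race
    // with writers, and fold everything older into one immutable total.
    class CollapsedWindowSketch {
        final long currentFrameIndex;
        final LongAdder currentFrame = new LongAdder();    // still receiving writes
        final LongAdder previousFrame = new LongAdder();   // stragglers may still land here
        final long collapsedPastTotal;                     // sum of all older, effectively immutable frames

        CollapsedWindowSketch(long currentFrameIndex, long collapsedPastTotal) {
            this.currentFrameIndex = currentFrameIndex;
            this.collapsedPastTotal = collapsedPastTotal;
        }

        long totalTime() {
            return collapsedPastTotal + previousFrame.sum() + currentFrame.sum();
        }
    }

Expiring contributions that fall out of the 30 second reporting window would still need handling, which is presumably why the per-frame array exists in the first place.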


Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0
