
Conversation

@mhl-b
Contributor

@mhl-b mhl-b commented Jul 25, 2025

This PR changes Thread Pool utilization reporting intervals from dynamic (based on the caller's polling frequency) to static (time-frame based).

Originally there was a single consumer of the utilization metric, running at a fixed interval, so we calculated utilization on the fly by comparing the previous invocation time with the current one. And it worked, kind of... With growing demand for utilization metrics from shard balancing and allocation, we have new consumers, and this polling mechanism does not scale well: the thread pool needs to create a separate tracker for each caller.

This PR introduces total execution time measurement for the thread pool per fixed time frame. By default the thread pool measures time in 1 second frames and keeps the last 30 frames in memory. When we calculate utilization we look at the past 30 frames, so utilization is at most 1 second stale.
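For illustration, a minimal sketch of the framed-accumulation idea, ignoring concurrency (the class and names below are made up for this description and are not the actual FramedTimeTracker code):

    // Illustrative only: execution time is added to the frame covering "now";
    // reads sum the retained frames, so the result is at most one frame stale.
    class FramedAccumulatorSketch {
        private final long frameNanos;   // frame length, e.g. 1 second
        private final long[] frames;     // ring buffer of per-frame execution time, e.g. 30 frames
        private long currentFrame;       // index of the frame covering "now"

        FramedAccumulatorSketch(long frameNanos, int frameCount) {
            this.frameNanos = frameNanos;
            this.frames = new long[frameCount];
        }

        synchronized void addExecutionTime(long nowNanos, long executionNanos) {
            long frame = nowNanos / frameNanos;
            // clear slots we rolled past since the last update (at most one full lap)
            for (long f = currentFrame + 1; f <= frame && f <= currentFrame + frames.length; f++) {
                frames[(int) (f % frames.length)] = 0;
            }
            currentFrame = Math.max(currentFrame, frame);
            frames[(int) (frame % frames.length)] += executionNanos;
        }

        synchronized long totalTime() {
            long sum = 0;
            for (long t : frames) {
                sum += t;
            }
            return sum;
        }
    }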

A new FramedTimeTracker encapsulates the frame tracking logic and comes with its own set of tests. A JMH benchmark was also added.

A few options were considered for concurrent access to FramedTimeTracker: synchronized methods, a read-write lock, and a non-locking approach.
The synchronized variant uses private long fields for tracking. The read-write-lock variant takes the write lock when updating the frame and the read lock when updating atomic fields within a frame. Both are in the commit history. The non-locking algorithm uses frame windows, an atomic flag for updates, and Thread.onSpinWait while an update is in progress; see the sketch below.
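Roughly, the non-locking update pattern looks like the sketch below (illustrative only; the class, fields and methods are made up, and the actual implementation keeps a window of several frames rather than a single one):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.LongAdder;

    // Sketch of the pattern: writers add to a LongAdder in the current window; a single
    // CAS-guarded rotation installs a new window when the frame boundary is crossed,
    // and threads that lose the race spin briefly instead of blocking.
    class NonLockingFrameSketch {
        private record Window(long frameIndex, LongAdder executionTime) {}

        private final AtomicReference<Window> window = new AtomicReference<>(new Window(0, new LongAdder()));
        private final AtomicBoolean rotating = new AtomicBoolean();

        void add(long frameIndex, long nanos) {
            Window w = window.get();
            if (w.frameIndex() < frameIndex) {
                if (rotating.compareAndSet(false, true)) {
                    try {
                        window.set(new Window(frameIndex, new LongAdder()));
                    } finally {
                        rotating.set(false);
                    }
                } else {
                    while (rotating.get()) {
                        Thread.onSpinWait();   // another thread is rotating; busy-wait briefly
                    }
                }
                w = window.get();
            }
            w.executionTime().add(nanos);
        }
    }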

Currently the non-locking algorithm is preferable. It takes a bit more memory than the locking variants to keep track of extra frames, LongAdders as counters, and a few atomics, but that is a good tradeoff for the low performance overhead, which is almost identical to a baseline that just runs busy CPU cycles.

The baseline is 10000 busy CPU cycles, which is about 16 microseconds. Our write thread pool does have small, sub-millisecond tasks.
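For context, the baseline is essentially JMH's busy-work primitive; the shape of the two benchmark methods is roughly the sketch below (not the actual ThreadPoolUtilizationBenchmark source; the tracker call is only indicated in a comment):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.infra.Blackhole;

    public class UtilizationBenchmarkSketch {

        @Benchmark
        @BenchmarkMode(Mode.SampleTime)
        public void baseline() {
            Blackhole.consumeCPU(10_000);   // ~16us of busy work, same as the baseline above
        }

        @Benchmark
        @BenchmarkMode(Mode.SampleTime)
        public void startAndEnd(Blackhole bh) {
            long start = System.nanoTime();
            Blackhole.consumeCPU(10_000);   // same busy work...
            long elapsed = System.nanoTime() - start;
            // ...plus the tracking under test; the real benchmark records this through the
            // tracker, e.g. tracker.add(elapsed) (hypothetical call)
            bh.consume(elapsed);
        }
    }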

Latest result from my machine

./gradlew -p benchmarks run --args 'ThreadPoolUtilizationBenchmark'

Benchmark                                           (callIntervalTicks)  (frameDurationMs)  (reportingDurationMs)    Mode       Cnt       Score   Error  Units
ThreadPoolUtilizationBenchmark.StartAndEnd                        10000               1000                  10000  sample  89636316      20.173 ± 0.012  us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.00                  10000               1000                  10000  sample                16.576          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.50                  10000               1000                  10000  sample                17.152          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.90                  10000               1000                  10000  sample                23.264          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.95                  10000               1000                  10000  sample                23.360          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.99                  10000               1000                  10000  sample                23.776          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.999                 10000               1000                  10000  sample                75.008          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p0.9999                10000               1000                  10000  sample              1327.104          us/op
ThreadPoolUtilizationBenchmark.StartAndEnd:p1.00                  10000               1000                  10000  sample             30474.240          us/op
ThreadPoolUtilizationBenchmark.baseline                           10000               1000                  10000  sample  87279514      20.947 ± 0.032  us/op
ThreadPoolUtilizationBenchmark.baseline:p0.00                     10000               1000                  10000  sample                16.032          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.50                     10000               1000                  10000  sample                16.864          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.90                     10000               1000                  10000  sample                23.040          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.95                     10000               1000                  10000  sample                23.072          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.99                     10000               1000                  10000  sample                23.200          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.999                    10000               1000                  10000  sample               256.000          us/op
ThreadPoolUtilizationBenchmark.baseline:p0.9999                   10000               1000                  10000  sample              3907.584          us/op
ThreadPoolUtilizationBenchmark.baseline:p1.00                     10000               1000                  10000  sample            140247.040          us/op

@mhl-b mhl-b force-pushed the framed-thread-pool-utilization branch from 23fe464 to 8345da4 on July 26, 2025 05:35
@mhl-b mhl-b changed the title framed-time-tracker Time framed thread-pool utilization Jul 26, 2025
@mhl-b mhl-b marked this pull request as ready for review July 26, 2025 05:59
@mhl-b mhl-b requested a review from a team as a code owner July 26, 2025 05:59
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 26, 2025
@mhl-b mhl-b added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team and removed needs:triage Requires assignment of a team area label labels Jul 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

Comment on lines 581 to 586
* @param trackExecutionTime Whether to track execution stats
* @param trackUtilization enables thread-pool utilization metrics
* @param utilizationInterval when utilization is enabled, specifies interval for measurement
* @param trackOngoingTasks Whether to track ongoing task execution time, not just finished tasks
* @param trackMaxQueueLatency Whether to track max queue latency.
* @param executionTimeEwmaAlpha The alpha seed for execution time EWMA (ExponentiallyWeightedMovingAverage).
Contributor

Nit - mixed param description capitalization and ending periods. (What is the team preference here? It looks to be inconsistent across the codebase.)

Comment on lines 273 to 275
assert intervalNano > 0;
this.interval = intervalNano;
this.timeNow = timeNow;
Contributor

@JeremyDahlgren JeremyDahlgren Jul 29, 2025

Nit - is it worth using a this(...) to call the other constructor and eliminate the duplicate code? (I know it is only a few lines, just wanted to mention it.)

Contributor Author

truly nit

Contributor

The other way around might be nice

    FramedTimeTracker(long intervalNano) {
        this(intervalNano, System::nanoTime);
    }

🤷‍♀️

Contributor

truly nit

I find this worth addressing; we would not want these constructors to get out of sync in the future.

Your comment makes it hard to tell whether you are rejecting or accepting the proposal, hence my comment here.

Another approach would be to remove this helper constructor; it seems to be used only once, and it is neither unreasonable nor unusual to let client code pass in the time-tracking mechanism it wants to use.

Contributor Author

@mhl-b mhl-b Jul 30, 2025

No offence to Jeremy about "truly nit" :) I did talk with him about this PR.
Of course I will change it; if it catches the eye of a reader, I don't have a strong preference. The time provider can go into the tracking config with System::nanoTime as the default.

Contributor

@henningandersen henningandersen left a comment

I think this adds staleness, overhead and complexity. I am inclined to forgo this for now and focus on the real balancing improvement work. I added a separate suggestion for how we could handle this instead, though I think we should simply live with what we have for now to ensure progress.

 */
public synchronized long previousFrameTime() {
    updateFrame0(timeNow.get());
    return previousTime;
Contributor

I think this is up to 30s stale. I think for this to work well we'd have to track it every second or so and then sum up all the prior frames. This adds even more overhead.

I think we should instead accept the original time-tracking's inaccuracy. The approach currently in the code is good enough, but we could consider moving the reset part out into the consuming code. The node-usage stats could then track the prior value themselves, either on the node or on the master.
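Roughly, that alternative could look like the sketch below (illustrative names only, not code from this PR): the pool exposes a monotonically increasing total, and each consumer keeps its previous sample and derives the delta itself.

    // Sketch: the consumer tracks its own previous cumulative sample and derives utilization.
    class ConsumerSideUtilizationSketch {
        private long lastTotalNanos;
        private long lastSampleNanos;

        // totalExecutedNanos: cumulative execution time reported by the pool (assumed accessor)
        double sample(long totalExecutedNanos, long nowNanos, int maxPoolSize) {
            long deltaWork = totalExecutedNanos - lastTotalNanos;
            long deltaTime = nowNanos - lastSampleNanos;
            lastTotalNanos = totalExecutedNanos;
            lastSampleNanos = nowNanos;
            return deltaTime <= 0 ? 0.0 : (double) deltaWork / maxPoolSize / deltaTime;
        }
    }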

Contributor Author

Addressed staleness in 36374fc.

I measure time in 1 second frames and use a 30 second window, so it's at most 1 second stale yet accurate over the last 30 seconds. Updated the benchmark; overhead is still negligible.

@mhl-b
Contributor Author

mhl-b commented Jul 30, 2025

@henningandersen there are a few things that motivated this change rather than keeping the existing logic:

  1. Utilization tracking requires its own infrastructure to get into ClusterInfo; we cannot embed it into NodeStats, yet essentially it is a node stat. Its own infrastructure means a transport message with sender/receiver and wiring into ClusterInfo, while node stats are already available and wired into ClusterInfo.
  2. There are errors in the reported utilization numbers; we can see values over 100%. This impacts the Balancing Simulator, since we model movement by changing utilization.
  3. I can't imagine we would live with the current implementation of utilization for long; it is a pretty generic metric for the thread pool. Once we start measuring utilization purely in the thread pool rather than per caller, we would have to remove all the code we would have written for the balancing work. That sounds like double work to me. Maybe we can deliver the first version of load balancing a few days faster, but then we create much more work for the future.

Also, the time interval is configurable and can be set to lower values. It's easy to retain multiple frames for reporting purposes, for example 30 frames at a 1 second interval.

What I'm saying is that it would be really nice to put it into NodeStats; it would reduce the scope of changes for the write-load balancing work. But to put it into node stats it should be independent from the callers. Also, the latest version has a non-locking algorithm with negligible overhead.

edit:

It's easy to retain multiple frames for reporting purposes, for example 30 frames at a 1 second interval.

This is done in 36374fc.

    throw new IllegalStateException("No operation defined for [" + utilizationTrackingPurpose + "]");
}

public double utilization() {
    return (double) framedTimeTracker.totalTime() / (double) getMaximumPoolSize() / (double) framedTimeTracker.reportingInterval();
Contributor

This bit of logic feels like it should be in the tracker itself perhaps?

Contributor Author

I was thinking about it. I don't think the frame tracker has to know anything about threads or pool size.
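For intuition about the division above: utilization is the total executed time as a fraction of the pool's thread-time capacity over the reporting window. A worked example with made-up numbers:

    // Made-up numbers: 8 threads and a 30s reporting interval give 240s of thread-time capacity.
    double totalTimeNanos = 120_000_000_000.0;          // 120s of task execution observed in the window
    double maxPoolSize = 8;
    double reportingIntervalNanos = 30_000_000_000.0;   // 30s window
    double utilization = totalTimeNanos / maxPoolSize / reportingIntervalNanos;   // 0.5, i.e. 50%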

false,
false,
DEFAULT_EXECUTION_TIME_EWMA_ALPHA_FOR_TEST
);
Contributor

I know this is peripheral, but I think we should move to using the builder for DEFAULT and DO_NOT_TRACK. All those flags are meaningless without mouseovers, more so now that we've added another one.

final var now = nowTime / frameDuration;
final var frameWindow = getWindow(now);
frameWindow.frames[0].ongoingTasks.increment();
frameWindow.frames[0].startEndDiff.add((now + 1) * frameDuration - nowTime);
Contributor

Nit: can this be frameDuration - (nowTime % frameDuration)? Not sure if that's more efficient, but it's fewer operations.
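Both expressions compute the time remaining in the current frame; since now = nowTime / frameDuration uses integer division, they are equal for non-negative times. A quick check with made-up values:

    // Equivalence check for the two expressions (integer division, non-negative times).
    long frameDuration = 1_000;
    long nowTime = 3_250;
    long now = nowTime / frameDuration;                    // 3
    long a = (now + 1) * frameDuration - nowTime;          // 4000 - 3250 = 750
    long b = frameDuration - (nowTime % frameDuration);    // 1000 - 250  = 750
    assert a == b;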

        newFrames[i] = frame;
    }
    return new FrameWindow(newFrames);
}
Contributor

Seems like more object creation going on than I would have hoped for. I know it's probably necessary for safety, but if I'm reading it right, I think it means we create a new

  • FrameWindow
  • Frame[]
  • Frame, each of which includes 2 x LongAdder

every second for a 30s/1s tracker?

Contributor Author

Correct; if I reuse the window there is a risk of a race condition. For reuse I would need to reset frames, and it might happen that start/end gets the window and tries to update a frame at the same moment the window update happens.

I would rather think about collapsing old frames into one big past frame, since it's unlikely to change. Usually only the current and current-1 frames race.
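A rough sketch of that idea (not from this PR; the class and fields below are illustrative): frames that writers can no longer touch get folded into a single accumulated total, so only the current frame and its predecessor stay as live LongAdders.

    import java.util.concurrent.atomic.LongAdder;

    // Illustrative only: keep live LongAdders just for the frames that may still race
    // with writers, and fold everything older into one immutable total.
    class CollapsedWindowSketch {
        final long currentFrameIndex;
        final LongAdder currentFrame = new LongAdder();    // still receiving writes
        final LongAdder previousFrame = new LongAdder();   // stragglers may still land here
        final long collapsedPastTotal;                     // sum of all older, effectively immutable frames

        CollapsedWindowSketch(long currentFrameIndex, long collapsedPastTotal) {
            this.currentFrameIndex = currentFrameIndex;
            this.collapsedPastTotal = collapsedPastTotal;
        }

        long totalTime() {
            return collapsedPastTotal + previousFrame.sum() + currentFrame.sum();
        }
    }

Expiring contributions that fall out of the 30 second reporting window would still need handling, which is presumably why the per-frame array exists in the first place.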


Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0
