
Conversation

@nicktindall (Contributor):

Initial work after discussion around thread-pool utilization tracking.

@nicktindall nicktindall added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Jul 24, 2025
public final class TaskExecutionTimeTrackingEsThreadPoolExecutor extends EsThreadPoolExecutor {
public static final int QUEUE_LATENCY_HISTOGRAM_BUCKETS = 18;
private static final int[] LATENCY_PERCENTILES_TO_REPORT = { 50, 90, 99 };
private static final long UTILISATION_REFRESH_INTERVAL_NANOS = TimeValue.timeValueSeconds(45).nanos();
@nicktindall (Contributor, Author):
Perhaps this should be a setting, so we can experiment?

* Get the most recent utilization value calculated
*/
public double getUtilization() {
return lastUtilization;
@nicktindall (Contributor, Author):
Perhaps we should also call recalculateUtilizationIfDue in here, in case there is zero activity in the pool? Probably not an issue for the write pool.

A reviewer (Contributor):
I think we want that to avoid it being infinitely stale, but then you get into a task tracking issue?

Also, it seems like the value can now be 30s old instead of current?

@nicktindall (Contributor, Author):
I added a recalculate here, in case we end up using this. You're right: because we are recalculating independently of the polling, it's possible the utilisation is up to TaskTrackingConfig#utilizationRefreshInterval old. Given that it's an average over the same interval, and given our current goal of acting only on persistent hot-spots, I think that's probably OK, but hopefully we can get a better utilisation measure.
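The time-gated refresh pattern being discussed could look roughly like the sketch below. This is illustrative only, not the actual Elasticsearch implementation: the class name, field names, and the injected clock and recalculation suppliers are all hypothetical, chosen so the idea (the getter refreshes the cached value at most once per interval, even if the pool is otherwise idle) is testable in isolation.

```java
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch of a time-gated utilization refresh. The getter
// recalculates at most once per refresh interval, so the cached value
// cannot become infinitely stale even when the pool sees no activity.
class UtilizationHolder {
    private final long refreshIntervalNanos;
    private final LongSupplier nanoClock;      // injected so tests can control time
    private final DoubleSupplier recalculate;  // the (potentially expensive) utilization computation
    private long lastRecalculatedAtNanos;
    private double lastUtilization;

    UtilizationHolder(long refreshIntervalNanos, LongSupplier nanoClock, DoubleSupplier recalculate) {
        this.refreshIntervalNanos = refreshIntervalNanos;
        this.nanoClock = nanoClock;
        this.recalculate = recalculate;
        // Backdate the timestamp so the very first read triggers a recalculation
        this.lastRecalculatedAtNanos = nanoClock.getAsLong() - refreshIntervalNanos;
    }

    /** Get the most recent utilization value, refreshing it first if a refresh is due. */
    synchronized double getUtilization() {
        long now = nanoClock.getAsLong();
        if (now - lastRecalculatedAtNanos >= refreshIntervalNanos) {
            lastUtilization = recalculate.getAsDouble();
            lastRecalculatedAtNanos = now;
        }
        return lastUtilization;
    }
}
```

Routing every reader through the gated getter means the value is never more than one refresh interval old, at the cost of the recalculation occasionally running on a caller's thread rather than on the polling thread.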

private boolean trackOngoingTasks = false;
private boolean trackMaxQueueLatency = false;
private double ewmaAlpha = DEFAULT_EXECUTION_TIME_EWMA_ALPHA_FOR_TEST;
private TimeValue utilizationRefreshInterval = TimeValue.timeValueSeconds(30);
@nicktindall (Contributor, Author):
I suspect that 30s might be too short given we saw > 1.0 utilization in the APM metrics which were calculated every 60s. Perhaps we should use 60 or split the difference and do 45?

@nicktindall nicktindall marked this pull request as ready for review July 24, 2025 07:29
@nicktindall nicktindall requested a review from a team as a code owner July 24, 2025 07:29
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jul 24, 2025
@mhl-b (Contributor) commented Jul 24, 2025:

I thought of a different approach when we talked outside. I'll try to summarize; happy to send a PR if you like it.

We can trade liveness for accuracy when measuring thread-pool utilization. That means reporting
utilization from the past interval (30 sec / 1 min) is enough, since there are no "real-time" actions based on this
metric.

A couple of terms I use below:

  • Interval - the time duration over which utilization is measured
  • Frame - the sequence number of an interval, so Frame * Interval is the frame's start time

The approach is simple: pollUtilization returns the execution time of the previous frame, since we know exactly
which tasks finished in, or were running during, the previous frame. When a task finishes, the current and
previous frame stats are updated.

We need to consider the following cases:

  • there are no tasks in frame
  • task started and finished in same frame
  • task started and still running
  • task started before current frame and finished in current
  • task started before current and still running

pseudocode:

Class variables:
currentFrame
currentFrameExecutionTime // for finished tasks
previousFrame
previousFrameExecutionTime // for finished tasks
currentTasks

afterExecute(task):
  endFrame = task.endTime / interval
  maybeResetFrame(endFrame)
  startFrame = task.startTime / interval
  if (startFrame == currentFrame):
    currentFrameExecutionTime += task.endTime - task.startTime # task started and finished in same frame
  else:
    currentFrameExecutionTime += task.endTime - currentFrame * interval
    if(startFrame == previousFrame):
      previousFrameExecutionTime += currentFrame * interval - task.startTime # task started in previous frame
    else:
      previousFrameExecutionTime += interval # task started before previous frame


# first time seeing this frame, or the frame hasn't been updated for a long time
maybeResetFrame(nowFrame):
  if (nowFrame == currentFrame):
    pass
  else if (nowFrame - currentFrame == 1):
    previousFrameExecutionTime = currentFrameExecutionTime
    currentFrameExecutionTime = 0
    previousFrame = currentFrame
    currentFrame = nowFrame
  else:
    previousFrameExecutionTime = 0
    currentFrameExecutionTime = 0
    currentFrame = nowFrame
    previousFrame = nowFrame -1

# returns utilization for the previous frame, covering completed and still-running tasks
pollUtilization():
  nowFrame = timeNow / interval
  maybeResetFrame(nowFrame)
  totalTime = previousFrameExecutionTime
  for (task in currentTasks):
      startFrame = task.startTime / interval
      if (startFrame == previousFrame):
        totalTime += currentFrame * interval - task.startTime # task started in previous frame, still running
      else if (startFrame < previousFrame):
        totalTime += interval # task started before previous frame, still running
  return totalTime / (threadPoolSize * interval) # can be cached by frame key

afterExecute is lightweight, just updating a few numbers.
pollUtilization is a bit heavier because currentTasks needs concurrent access, probably a copy. There is also still a race condition when a task finishes while we are looping through currentTasks; I need to work on that.
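For concreteness, the pseudocode above can be translated into a self-contained Java sketch like the one below. This is an illustration, not the eventual implementation: the class and field names are hypothetical, the clock is injected for testability, and whole-method synchronization sidesteps the race mentioned above at the cost of contention on the tracker.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical frame-based utilization tracker, following the pseudocode above.
// Utilization is reported for the *previous* frame, trading liveness for accuracy.
class FrameUtilizationTracker {
    private final long intervalNanos;
    private final int threadPoolSize;
    private final LongSupplier nanoClock; // injected so tests can control time

    private long currentFrame;
    private long previousFrame;
    private long currentFrameExecutionTime;  // finished tasks only
    private long previousFrameExecutionTime; // finished tasks only
    private final Map<Object, Long> currentTasks = new ConcurrentHashMap<>(); // task -> start time

    FrameUtilizationTracker(long intervalNanos, int threadPoolSize, LongSupplier nanoClock) {
        this.intervalNanos = intervalNanos;
        this.threadPoolSize = threadPoolSize;
        this.nanoClock = nanoClock;
        this.currentFrame = nanoClock.getAsLong() / intervalNanos;
        this.previousFrame = currentFrame - 1;
    }

    synchronized void beforeExecute(Object task) {
        currentTasks.put(task, nanoClock.getAsLong());
    }

    synchronized void afterExecute(Object task) {
        long startTime = currentTasks.remove(task);
        long endTime = nanoClock.getAsLong();
        maybeResetFrame(endTime / intervalNanos);
        long startFrame = startTime / intervalNanos;
        if (startFrame == currentFrame) {
            currentFrameExecutionTime += endTime - startTime; // started and finished in the same frame
        } else {
            currentFrameExecutionTime += endTime - currentFrame * intervalNanos;
            if (startFrame == previousFrame) {
                previousFrameExecutionTime += currentFrame * intervalNanos - startTime; // started in previous frame
            } else {
                previousFrameExecutionTime += intervalNanos; // ran through the whole previous frame
            }
        }
    }

    // First time seeing this frame, or the frame hasn't been updated for a long time
    private void maybeResetFrame(long nowFrame) {
        if (nowFrame == currentFrame) {
            return;
        } else if (nowFrame - currentFrame == 1) {
            previousFrameExecutionTime = currentFrameExecutionTime;
            currentFrameExecutionTime = 0;
            previousFrame = currentFrame;
            currentFrame = nowFrame;
        } else {
            previousFrameExecutionTime = 0;
            currentFrameExecutionTime = 0;
            currentFrame = nowFrame;
            previousFrame = nowFrame - 1;
        }
    }

    // Returns utilization for the previous frame, covering completed and still-running tasks
    synchronized double pollUtilization() {
        maybeResetFrame(nanoClock.getAsLong() / intervalNanos);
        long totalTime = previousFrameExecutionTime;
        for (long startTime : currentTasks.values()) {
            long startFrame = startTime / intervalNanos;
            if (startFrame == previousFrame) {
                totalTime += currentFrame * intervalNanos - startTime; // still running, started in previous frame
            } else if (startFrame < previousFrame) {
                totalTime += intervalNanos; // still running, spanned the whole previous frame
            }
        }
        return (double) totalTime / (threadPoolSize * intervalNanos);
    }
}
```

With a 100ns interval and a single thread, a task running from t=10 to t=60 yields a utilization of 0.5 once t advances into the next frame, since 50ns of work landed in a 100ns frame.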

@nicktindall (Contributor, Author):

> afterExecute is lightweight, just updating a few numbers. pollUtilization is a bit heavier because currentTasks needs concurrent access, probably a copy. There is also still a race condition when a task finishes while we are looping through currentTasks; I need to work on that.

I like it, but I think it's better to avoid the dependency on current tasks. Not all thread pools track running tasks, and to me it seems quite expensive to do (i.e. you wouldn't want to turn it on across the board). I'd be inclined to keep it simple and revisit if the accuracy is a problem?

@mhl-b (Contributor) commented Jul 25, 2025:

As discussed in Slack, I will work on my proposal, addressing the drawbacks of tracking ongoing tasks:
#131898


Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0


4 participants