Improved reproduction of scaling EsExecutors bug #124667 to work with max pool size > 1. #125045
Conversation
…k with max pool size > 0. Relates to elastic#124867, ES-10640
Pinging @elastic/es-core-infra (Team:Core/Infra)
Btw, here's another reproduction attempt I tried to use initially. This repeatedly fills the work pool with work and lets it complete at the same time using a CyclicBarrier:

private void testScalingWithEmptyCore(EsThreadPoolExecutor esExecutor) {
    class Scheduler implements AutoCloseable {
        final ExecutorService scheduler = Executors.newSingleThreadExecutor();
        final long keepAliveNanos = esExecutor.getKeepAliveTime(TimeUnit.NANOSECONDS);
        final int maximumPoolSize = esExecutor.getMaximumPoolSize();
        final boolean isEsScalingQueue = esExecutor.getQueue() instanceof EsExecutors.ExecutorScalingQueue<?>;
        final Semaphore success = new Semaphore(0);
        final CyclicBarrier barrier = new CyclicBarrier(maximumPoolSize + 1);
        volatile int remaining = resetRemaining();

        private int resetRemaining() {
            return between(10, 200);
        }

        final Runnable work = new AbstractRunnable() {
            @Override
            public void onFailure(Exception e) {
                fail(e);
            }

            @Override
            protected void doRun() throws Exception {
                barrier.await(); // wait for all work + testPool to be ready to proceed
            }
        };

        final Runnable continuation = () -> {
            if (remaining > 0) {
                remaining--;
                testPoolAsync();
            } else {
                remaining = resetRemaining(); // reset for next round
                success.release();
            }
        };

        final Runnable testPool = new AbstractRunnable() {
            @Override
            public void onFailure(Exception e) {
                fail(e);
            }

            @Override
            protected void doRun() throws Exception {
                for (int count = 0; count < maximumPoolSize;) {
                    esExecutor.execute(work);
                    if (isEsScalingQueue && removeQueuedWork()) {
                        Thread.yield(); // yield and try again
                    } else {
                        count++;
                    }
                }
                barrier.await(); // wait for all work to be running
                if (keepAliveNanos > 0) {
                    var targetNanoTime = System.nanoTime() + keepAliveNanos + between(-1_000, 1_000);
                    while (System.nanoTime() < targetNanoTime) {
                        Thread.yield();
                    }
                }
                esExecutor.execute(continuation);
            }

            // remove work that is queued due to ExecutorScalingQueue so we can be sure all work is running
            private boolean removeQueuedWork() {
                boolean workWasQueued = false;
                Runnable queuedWork;
                while ((queuedWork = ThreadContext.unwrap(esExecutor.getQueue().poll())) != null) {
                    logger.trace(
                        "{} was queued [poolSize={}, maximumPoolSize={}, activeCount={}, remaining={}]",
                        queuedWork == work ? "WORK" : "OTHER", // could be EsThreadPoolExecutor.WORKER_PROBE
                        esExecutor.getPoolSize(),
                        maximumPoolSize,
                        esExecutor.getActiveCount(),
                        remaining
                    );
                    workWasQueued |= queuedWork == work;
                }
                return workWasQueued;
            }
        };

        public void testPoolAsync() {
            scheduler.execute(testPool);
        }

        @Override
        public void close() {
            success.release();
            scheduler.shutdownNow();
        }
    }

    try (var scheduler = new Scheduler()) {
        for (int i = 0; i < 5000; i++) {
            scheduler.testPoolAsync();
            safeAcquire(scheduler.success);
        }
    } finally {
        ThreadPool.terminate(esExecutor, 1, TimeUnit.SECONDS);
    }
}
DaveCTurner left a comment
I don't have a problem with this being hard to reproduce - in practice we run thousands of iterations of these things every day, so we will notice a bug eventually.
Thanks for the feedback @DaveCTurner ... was a bit distracted this week with on-week duties, but will wrap this up mid next week 👍
@DaveCTurner I've addressed your feedback
DaveCTurner left a comment
LGTM (couple of nits)
…EsExecutorsTests.java Co-authored-by: David Turner <[email protected]>
…EsExecutorsTests.java Co-authored-by: David Turner <[email protected]>
Reproduction of scaling EsExecutors bug #124667 to work with max pool size > 1.
However, this doesn't reproduce well and requires lots of iterations (tests.iters) to catch the bug for max pool size > 1 (and hardly ever reproduces when using keep-alive). Open for suggestions / ideas, but leaning towards considering this sufficient.
Note, a seemingly obvious approach would be to block (max - 1) threads in the pool. This will cause work to starve in a very similar way to the core=0/max=1 case. However, there's a significant difference: despite the pool having capacity for a spare worker, the work is starved by blocked workers rather than by the fact that no worker is running at all. As soon as any of the workers is unblocked, the pool will continue processing the task queue. The fact that the pool queues work despite having spare capacity isn't optimal either, but I'm considering that a separate issue.
Relates to #124867, ES-10640
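To illustrate the distinction above, here's a minimal, self-contained sketch against a plain java.util.concurrent.ThreadPoolExecutor rather than the Elasticsearch scaling executor (the class name and pool parameters are made up for the example): with (max - 1) workers blocked the pool still has a spare worker slot, yet the extra task merely queues, and releasing any blocked worker drains the queue again.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BlockedWorkersSketch {
    public static void main(String[] args) throws Exception {
        int max = 4;
        // Plain JDK pool with core = max - 1 and an unbounded queue: once the core
        // workers exist, new tasks queue instead of spawning the spare worker.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            max - 1, max, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        CountDownLatch release = new CountDownLatch(1);
        // Block (max - 1) workers, analogous to the "obvious" reproduction approach.
        for (int i = 0; i < max - 1; i++) {
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        CountDownLatch done = new CountDownLatch(1);
        pool.execute(done::countDown); // queues behind the blocked workers despite the spare slot

        // Unlike the core=0/max=1 bug, unblocking any worker lets the pool
        // drain the queue, so the task is merely delayed, not stranded.
        release.countDown();
        if (!done.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("queued task never ran");
        }
        pool.shutdownNow();
    }
}

The queued task here is delayed only while the workers stay blocked, which is why this setup doesn't exercise the starvation seen in the core=0/max=1 case, where no worker is running at all.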