Conversation

@mosche (Contributor) commented Mar 17, 2025

Extends the reproduction of scaling EsExecutors bug #124667 to work with max pool size > 1.
However, this doesn't reproduce reliably and requires lots of iterations (tests.iters) to catch the bug for max pool size > 1 (and it hardly ever reproduces when using a keep-alive). I'm open to suggestions / ideas, but I'm leaning towards considering this sufficient.

Note: a seemingly obvious approach would be to block (max - 1) threads in the pool. This causes work to starve in a way that looks very similar to the core=0/max=1 case. However, there is a significant difference: although the pool has capacity for a spare worker, the work is starved by the blocked workers rather than by the fact that no worker is running at all. As soon as any of the workers is unblocked, the pool continues processing the task queue. That the pool queues work despite having spare capacity isn't optimal either, but I'm treating that as a separate issue.
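
As a rough, illustrative sketch of that difference (plain JDK only, not EsExecutors; the class name and parameters are invented for the example, and core = max - 1 with an unbounded queue is merely a stand-in for how the scaling queue can keep work queued while the pool stays below its maximum size):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BlockedWorkersSketch {
    public static void main(String[] args) throws Exception {
        int max = 4;
        // core = max - 1 with an unbounded queue: the queue always accepts new work,
        // so the pool never grows to its maximum size (rough analogy, not EsExecutors)
        ThreadPoolExecutor pool = new ThreadPoolExecutor(max - 1, max, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < max - 1; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // block (max - 1) workers
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        CountDownLatch done = new CountDownLatch(1);
        pool.execute(done::countDown); // queued behind the blocked workers

        // starved while the workers are blocked, despite the spare capacity in the pool ...
        System.out.println("ran while blocked: " + done.await(500, TimeUnit.MILLISECONDS));
        // ... but as soon as any worker is unblocked the queue is processed again
        release.countDown();
        System.out.println("ran after unblocking: " + done.await(5, TimeUnit.SECONDS));
        pool.shutdown();
    }
}

Running this prints false while the workers are blocked and true once they are released: the queued work is delayed behind the blocked workers rather than starved indefinitely, which is why that setup doesn't reproduce the bug.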

Relates to #124867, ES-10640

@mosche mosche added >test Issues or PRs that are addressing/adding tests :Core/Infra/Core Core issues without another label labels Mar 17, 2025
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team v9.1.0 labels Mar 17, 2025
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

@mosche mosche requested a review from a team March 18, 2025 07:43
@mosche (Contributor, Author) commented Mar 21, 2025

Btw, here's another reproduction attempt I tried initially. It repeatedly fills the pool with work and lets it all complete at the same time using a CyclicBarrier. My expectation was that this would trigger the starvation issue for both core size = 0 and core size > 0. However, it looks like we don't run into the issue if multiple threads expire at around the same time.

private void testScalingWithEmptyCore(EsThreadPoolExecutor esExecutor) {
        class Scheduler implements AutoCloseable {
            final ExecutorService scheduler = Executors.newSingleThreadExecutor();
            final long keepAliveNanos = esExecutor.getKeepAliveTime(TimeUnit.NANOSECONDS);
            final int maximumPoolSize = esExecutor.getMaximumPoolSize();
            final boolean isEsScalingQueue = esExecutor.getQueue() instanceof EsExecutors.ExecutorScalingQueue<?>;

            final Semaphore success = new Semaphore(0);
            final CyclicBarrier barrier = new CyclicBarrier(maximumPoolSize + 1);

            volatile int remaining = resetRemaining();

            private int resetRemaining() {
                return between(10, 200);
            }

            final Runnable work = new AbstractRunnable() {
                @Override
                public void onFailure(Exception e) {
                    fail(e);
                }

                @Override
                protected void doRun() throws Exception {
                    barrier.await(); // wait for all work + testPool to be ready to proceed
                }
            };

            final Runnable continuation = () -> {
                if (remaining > 0) {
                    remaining--;
                    testPoolAsync();
                } else {
                    remaining = resetRemaining(); // reset for next round
                    success.release();
                }
            };

            final Runnable testPool = new AbstractRunnable() {
                @Override
                public void onFailure(Exception e) {
                    fail(e);
                }

                @Override
                protected void doRun() throws Exception {
                    for (int count = 0; count < maximumPoolSize;) {
                        esExecutor.execute(work);
                        if (isEsScalingQueue && removeQueuedWork()) {
                            Thread.yield(); // yield and try again
                        } else {
                            count++;
                        }
                    }
                    barrier.await(); // wait for all work to be running
                    if (keepAliveNanos > 0) {
                        var targetNanoTime = System.nanoTime() + keepAliveNanos + between(-1_000, 1_000);
                        while (System.nanoTime() < targetNanoTime) {
                            Thread.yield();
                        }
                    }
                    esExecutor.execute(continuation);
                }

                // remove work that is queued due to ExecutorScalingQueue so we can be sure all work is running
                private boolean removeQueuedWork() {
                    boolean workWasQueued = false;
                    Runnable queuedWork;
                    while ((queuedWork = ThreadContext.unwrap(esExecutor.getQueue().poll())) != null) {
                        logger.trace(
                            "{} was queued [poolSize={}, maximumPoolSize={}, activeCount={}, remaining={}]",
                            queuedWork == work ? "WORK" : "OTHER", // could be EsThreadPoolExecutor.WORKER_PROBE
                            esExecutor.getPoolSize(),
                            maximumPoolSize,
                            esExecutor.getActiveCount(),
                            remaining
                        );
                        workWasQueued |= queuedWork == work;
                    }
                    return workWasQueued;
                }
            };

            public void testPoolAsync() {
                scheduler.execute(testPool);
            }

            @Override
            public void close() {
                success.release();
                scheduler.shutdownNow();
            }
        }

        try (var scheduler = new Scheduler()) {
            for (int i = 0; i < 5000; i++) {
                scheduler.testPoolAsync();
                safeAcquire(scheduler.success);

            }
        } finally {
            ThreadPool.terminate(esExecutor, 1, TimeUnit.SECONDS);
        }
    }

@mosche mosche marked this pull request as draft March 21, 2025 13:50
@DaveCTurner (Contributor) left a comment

I don't have a problem with this being hard-to-reproduce - in practice we run thousands of iterations of these things every day, we will notice a bug eventually.

@mosche (Contributor, Author) commented Mar 28, 2025

Thanks for the feedback @DaveCTurner ... I was a bit distracted this week being on-week, but I'll wrap this up mid next week 👍

@mosche mosche marked this pull request as ready for review April 3, 2025 09:29
@mosche (Contributor, Author) commented Apr 4, 2025

@DaveCTurner I've addressed your feedback

@DaveCTurner (Contributor) left a comment

LGTM (couple of nits)

@mosche mosche added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 7, 2025
@mosche mosche changed the title from "Improved reproduction of scaling EsExecutors bug #124667 to work with max pool size > 0." to "Improved reproduction of scaling EsExecutors bug #124667 to work with max pool size > 1." Apr 7, 2025
@mosche mosche merged commit 0360db2 into elastic:main Apr 7, 2025
16 of 17 checks passed
@mosche mosche deleted the ktlo/esExecutorBug_reproduction branch April 7, 2025 13:40
