
Conversation

@joker-eph
Collaborator

Python multiprocessing is limited to at most 60 workers:

https://github.com/python/cpython/blob/6bc65c30ff1fd0b581a2c93416496fc720bc442c/Lib/concurrent/futures/process.py#L669-L672

Since the limit applies per process pool, we can work around it on Windows by using multiple pools when we actually want more workers.
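
A minimal sketch of that pool-splitting arithmetic (the helper name is hypothetical; the 60-worker constant mirrors the cap in CPython's process pool linked above):

```python
import math

# CPython's process pool on Windows cannot wait on more than ~60 worker
# handles at once (a _winapi.WaitForMultipleObjects limitation), hence the cap.
WINDOWS_MAX_WORKERS_PER_POOL = 60

def split_workers(total_workers, max_per_pool=WINDOWS_MAX_WORKERS_PER_POOL):
    """Return (num_pools, workers_per_pool) such that num_pools pools of
    workers_per_pool workers can serve the requested total_workers."""
    num_pools = max(1, math.ceil(total_workers / max_per_pool))
    workers_per_pool = min(total_workers, max_per_pool)
    return num_pools, workers_per_pool
```

For example, 61 requested workers yield two pools of 60 workers each, matching the ceil-division in the diff below.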

Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a workaround for Windows multiprocessing limitations that cap worker processes at 60 per pool. Instead of restricting the total number of workers, the solution creates multiple process pools when more than 60 workers are requested on Windows.

Key changes:

  • Removed the hard limit of 60 workers from the usable_core_count() function
  • Implemented multi-pool architecture in test execution that distributes workers and tests across multiple pools
  • Added logging to inform users when the workaround is active

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File overview:

  • llvm/utils/lit/lit/util.py: Removes the Windows-specific worker-count limitation from usable_core_count()
  • llvm/utils/lit/lit/run.py: Implements the multi-pool workaround for the Windows 60-worker limit, with test-distribution logic
Comments suppressed due to low confidence (1)

llvm/utils/lit/lit/run.py:1 (on "import multiprocessing")

  • The test indexing is incorrect when using multiple pools. The idx from the async_results enumeration doesn't correspond to the correct index in self.tests because tests are distributed across pools with different starting indices.

@llvmbot
Member

llvmbot commented Sep 9, 2025

@llvm/pr-subscribers-testing-tools

Author: Mehdi Amini (joker-eph)

Changes

Python multiprocessing is limited to at most 60 workers:

https://github.com/python/cpython/blob/6bc65c30ff1fd0b581a2c93416496fc720bc442c/Lib/concurrent/futures/process.py#L669-L672

Since the limit applies per process pool, we can work around it on Windows by using multiple pools when we actually want more workers.


Full diff: https://github.com/llvm/llvm-project/pull/157759.diff

2 Files Affected:

  • (modified) llvm/utils/lit/lit/run.py (+45-10)
  • (modified) llvm/utils/lit/lit/util.py (-5)
diff --git a/llvm/utils/lit/lit/run.py b/llvm/utils/lit/lit/run.py
index 62070e824e87f..b24a50911eb71 100644
--- a/llvm/utils/lit/lit/run.py
+++ b/llvm/utils/lit/lit/run.py
@@ -72,25 +72,60 @@ def _execute(self, deadline):
             if v is not None
         }
 
-        pool = multiprocessing.Pool(
-            self.workers, lit.worker.initialize, (self.lit_config, semaphores)
+        # Windows has a limit of 60 workers per pool, so we need to use multiple pools
+        # if we have more than 60 workers requested
+        max_workers_per_pool = 60 if os.name == "nt" else self.workers
+        num_pools = max(
+            1, (self.workers + max_workers_per_pool - 1) // max_workers_per_pool
         )
+        workers_per_pool = min(self.workers, max_workers_per_pool)
 
-        async_results = [
-            pool.apply_async(
-                lit.worker.execute, args=[test], callback=self.progress_callback
+        if num_pools > 1:
+            self.lit_config.note(
+                "Using %d pools with %d workers each (Windows worker limit workaround)"
+                % (num_pools, workers_per_pool)
             )
-            for test in self.tests
-        ]
-        pool.close()
+
+        # Create multiple pools
+        pools = []
+        for i in range(num_pools):
+            pool = multiprocessing.Pool(
+                workers_per_pool, lit.worker.initialize, (self.lit_config, semaphores)
+            )
+            pools.append(pool)
+
+        # Distribute tests across pools
+        tests_per_pool = (len(self.tests) + num_pools - 1) // num_pools
+        async_results = []
+        test_to_pool_map = {}
+
+        for pool_idx, pool in enumerate(pools):
+            start_idx = pool_idx * tests_per_pool
+            end_idx = min(start_idx + tests_per_pool, len(self.tests))
+            pool_tests = self.tests[start_idx:end_idx]
+
+            for test in pool_tests:
+                ar = pool.apply_async(
+                    lit.worker.execute, args=[test], callback=self.progress_callback
+                )
+                async_results.append(ar)
+                test_to_pool_map[ar] = pool
+
+        # Close all pools
+        for pool in pools:
+            pool.close()
 
         try:
             self._wait_for(async_results, deadline)
         except:
-            pool.terminate()
+            # Terminate all pools on exception
+            for pool in pools:
+                pool.terminate()
             raise
         finally:
-            pool.join()
+            # Join all pools
+            for pool in pools:
+                pool.join()
 
     def _wait_for(self, async_results, deadline):
         timeout = deadline - time.time()
diff --git a/llvm/utils/lit/lit/util.py b/llvm/utils/lit/lit/util.py
index b03fd8bc22693..b1552385ccc53 100644
--- a/llvm/utils/lit/lit/util.py
+++ b/llvm/utils/lit/lit/util.py
@@ -121,11 +121,6 @@ def usable_core_count():
     except AttributeError:
         n = os.cpu_count() or 1
 
-    # On Windows with more than 60 processes, multiprocessing's call to
-    # _winapi.WaitForMultipleObjects() prints an error and lit hangs.
-    if platform.system() == "Windows":
-        return min(n, 60)
-
     return n
 
 def abs_path_preserve_drive(path):
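
The run.py hunk above waits on all AsyncResults against one shared deadline. A minimal sketch of that pattern (hypothetical function name, not lit's actual implementation):

```python
import multiprocessing
import time

def wait_for_all(async_results, deadline):
    # Wait on each AsyncResult in turn, shrinking the timeout as the deadline
    # nears; results finish in any order, we only need all done by the deadline.
    for ar in async_results:
        timeout = deadline - time.time()
        if timeout <= 0:
            raise TimeoutError("deadline exceeded while waiting on workers")
        ar.wait(timeout)

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    results = [pool.apply_async(pow, (2, i)) for i in range(4)]
    pool.close()
    wait_for_all(results, time.time() + 30)
    print([r.get() for r in results])  # [1, 2, 4, 8]
    pool.join()
```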

@joker-eph joker-eph force-pushed the lit_windows branch 2 times, most recently from 6c7d15a to 5328d27 Compare September 16, 2025 20:23
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

@joker-eph joker-eph force-pushed the lit_windows branch 2 times, most recently from 2ba9107 to b3e4997 Compare September 16, 2025 20:46
@joker-eph joker-eph requested a review from Copilot September 16, 2025 20:47
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Contributor

@boomanaiden154 boomanaiden154 left a comment

Are you able to add unit tests for the logic that figures out the number of pools to create/workers per pool, and the logic that distributes the tests? It looks correct to me after spending some time thinking about it, but I would rather not rely on my thinking being correct.

This adds a bit of complexity, but I think it's very much worth it given high core count systems are becoming increasingly popular. Thanks for working on this.
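
A sketch of such a unit test for the slicing logic (the distribute_tests helper is hypothetical, mirroring the diff's tests_per_pool ceil-division; it is not lit's actual test):

```python
def distribute_tests(num_tests, num_pools):
    # Mirrors the diff: ceil-divide tests across pools into contiguous slices.
    tests_per_pool = (num_tests + num_pools - 1) // num_pools
    slices = []
    for pool_idx in range(num_pools):
        start = pool_idx * tests_per_pool
        end = min(start + tests_per_pool, num_tests)
        if start < end:
            slices.append((start, end))
    return slices

def check_partition(num_tests, num_pools):
    # Every test index must appear exactly once, in order, across all slices.
    covered = [i for s, e in distribute_tests(num_tests, num_pools)
               for i in range(s, e)]
    assert covered == list(range(num_tests))

for n, p in [(0, 1), (1, 4), (20, 3), (100, 7), (59, 60)]:
    check_partition(n, p)
```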

@joker-eph
Collaborator Author

Are you able to add unit tests for the logic that figures out the number of pools to create/workers per pool, and the logic that distributes the tests? It looks correct to me after spending some time thinking about it, but I would rather not rely on my thinking being correct.

Right, I tried it locally by changing the WINDOWS_MAX_WORKERS_PER_POOL value to a smaller number, using various numbers of workers, and checking the distribution of workers across the pools :)

I added a test!

@joker-eph joker-eph force-pushed the lit_windows branch 2 times, most recently from 6238641 to c9e4f40 Compare September 17, 2025 21:26
Contributor

@boomanaiden154 boomanaiden154 left a comment

LGTM. Thanks for pushing this through.

If you have benchmarking numbers on a high core count Windows machine, it would be nice if you could throw them in this thread.

@joker-eph
Collaborator Author

I lost access to the machines unfortunately :(

@joker-eph joker-eph enabled auto-merge (squash) December 1, 2025 11:09
@joker-eph joker-eph merged commit 577cd6f into llvm:main Dec 1, 2025
9 of 10 checks passed
@joker-eph joker-eph deleted the lit_windows branch December 1, 2025 11:45
@llvm-ci
Collaborator

llvm-ci commented Dec 1, 2025

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-sie-ubuntu-fast running on sie-linux-worker while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/144/builds/41237

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'lit :: windows-pools.py' FAILED ********************
Exit Code: 127

Command Output (stdout):
--
# RUN: at line 3
rm -Rf /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/utils/lit/tests/Output/windows-pools.py.tmp.dir && mkdir -p /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/utils/lit/tests/Output/windows-pools.py.tmp.dir
# executed command: rm -Rf /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/utils/lit/tests/Output/windows-pools.py.tmp.dir
# note: command had no output on stdout or stderr
# executed command: mkdir -p /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/utils/lit/tests/Output/windows-pools.py.tmp.dir
# note: command had no output on stdout or stderr
# RUN: at line 4
python -c "for i in range(20): open(rf'/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/utils/lit/tests/Output/windows-pools.py.tmp.dir/file{i}.txt', 'w').write('RUN:')"
# executed command: python -c 'for i in range(20): open(rf'"'"'/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/utils/lit/tests/Output/windows-pools.py.tmp.dir/file{i}.txt'"'"', '"'"'w'"'"').write('"'"'RUN:'"'"')'
# .---command stderr------------
# | 'python': command not found
# `-----------------------------
# error: command failed with exit status: 127

--

********************


@llvm-ci
Collaborator

llvm-ci commented Dec 1, 2025

LLVM Buildbot has detected a new failure on builder clang-m68k-linux-cross running on suse-gary-m68k-cross while building llvm at step 5 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/27/builds/19658

Here is the relevant piece of the build log for the reference
Step 5 (ninja check 1) failure: stage 1 checked (failure)
******************** TEST 'lit :: windows-pools.py' FAILED ********************
Exit Code: 127

Command Output (stdout):
--
# RUN: at line 3
rm -Rf /var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/utils/lit/tests/Output/windows-pools.py.tmp.dir && mkdir -p /var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/utils/lit/tests/Output/windows-pools.py.tmp.dir
# executed command: rm -Rf /var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/utils/lit/tests/Output/windows-pools.py.tmp.dir
# executed command: mkdir -p /var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/utils/lit/tests/Output/windows-pools.py.tmp.dir
# RUN: at line 4
python -c "for i in range(20): open(rf'/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/utils/lit/tests/Output/windows-pools.py.tmp.dir/file{i}.txt', 'w').write('RUN:')"
# executed command: python -c 'for i in range(20): open(rf'"'"'/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/utils/lit/tests/Output/windows-pools.py.tmp.dir/file{i}.txt'"'"', '"'"'w'"'"').write('"'"'RUN:'"'"')'
# .---command stderr------------
# | 'python': command not found
# `-----------------------------
# error: command failed with exit status: 127

--

********************


@llvm-ci
Collaborator

llvm-ci commented Dec 1, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-ubuntu-fast running on as-builder-4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/27268

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'lit :: windows-pools.py' FAILED ********************
Exit Code: 127

Command Output (stdout):
--
# RUN: at line 3
rm -Rf /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/utils/lit/tests/Output/windows-pools.py.tmp.dir && mkdir -p /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/utils/lit/tests/Output/windows-pools.py.tmp.dir
# executed command: rm -Rf /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/utils/lit/tests/Output/windows-pools.py.tmp.dir
# executed command: mkdir -p /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/utils/lit/tests/Output/windows-pools.py.tmp.dir
# RUN: at line 4
python -c "for i in range(20): open(rf'/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/utils/lit/tests/Output/windows-pools.py.tmp.dir/file{i}.txt', 'w').write('RUN:')"
# executed command: python -c 'for i in range(20): open(rf'"'"'/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/utils/lit/tests/Output/windows-pools.py.tmp.dir/file{i}.txt'"'"', '"'"'w'"'"').write('"'"'RUN:'"'"')'
# .---command stderr------------
# | 'python': command not found
# `-----------------------------
# error: command failed with exit status: 127

--

********************


aahrun pushed a commit to aahrun/llvm-project that referenced this pull request Dec 1, 2025
@asb
Contributor

asb commented Dec 1, 2025

It looks to me like we need logic similar to that in mlir/test/lit/cfg.py for finding the Python executable lit is configured to use, and then doing:

tools.extend(
    [
        ToolSubst("%PYTHON", python_executable, unresolved="ignore"),
    ]
)

Then the lit test can use %PYTHON in the relevant RUN line.
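
A minimal sketch of that substitution in a lit cfg file (assumptions: config is the lit configuration object the cfg file receives, and sys.executable is the interpreter running lit; this registers the substitution directly rather than via ToolSubst):

```python
import sys

# Substitute %PYTHON in RUN lines with the interpreter lit itself runs under,
# so tests don't depend on a bare `python` being on PATH.
config.substitutions.append(("%PYTHON", sys.executable))
```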

@joker-eph
Collaborator Author

Fixed in 235d44d

@asb
Contributor

asb commented Dec 1, 2025

Thank you! I can confirm that has worked for the bots I administer.

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Dec 3, 2025
kcloudy0717 pushed a commit to kcloudy0717/llvm-project that referenced this pull request Dec 4, 2025
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025