
Parallelize CI test suites #4111

Open
taalexander wants to merge 16 commits into NVIDIA:main from taalexander:feature/parallelize-ci

Conversation

@taalexander
Collaborator

@taalexander taalexander commented Mar 6, 2026

Based on a discussion with @Renaud-K and the testing bottlenecks identified in #3960, this work attempts to enable better concurrent testing. The problem with simply setting -j on the existing tests is that they would then use parallelism internally as well (lit tests running in parallel, or worse, parallel lit tests each using OpenMP to parallelize simulations). This led to massive contention and overall slower tests.

This PR attempts to structure concurrency with the hope it speeds up CI overall:

  • Teach ctest how many cores each job will use, limiting OpenMP jobs to 2 cores
  • Balance test parallelism (-j) against the 2 cores reserved per OpenMP job
  • Parallelize Python tests using pytest-xdist, and consolidate them into a single pytest invocation to avoid repeated loading overhead
  • Continue to limit GPU tests to being sequential to avoid GPU contention.
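The core budgeting in the bullets above can be sketched as a small shell helper. This is a hypothetical sketch, not the actual run_tests.sh logic; the function name and derivation are illustrative.

```shell
# Hypothetical sketch: reserve 2 cores per OpenMP-enabled test and derive
# the ctest -j value from the total core budget.
ctest_jobs() {
  total_cores=$1
  omp_slots=2                        # cores reserved per OpenMP job
  jobs=$(( total_cores / omp_slots ))
  [ "$jobs" -lt 1 ] && jobs=1        # never go below one parallel job
  echo "$jobs"
}

# On a 16-core runner this would suggest:
echo "OMP_NUM_THREADS=2 ctest -j $(ctest_jobs 16)"
```

The point of the division is that -j and OMP_NUM_THREADS multiply: without the budget, N parallel jobs each spawning M OpenMP threads oversubscribe the machine N*M-fold.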

@taalexander taalexander requested review from Renaud-K and mitchdz March 6, 2026 19:31
@github-actions

github-actions bot commented Mar 8, 2026

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


@mitchdz
Collaborator

mitchdz commented Mar 9, 2026

I see for comparison:

[timing comparison screenshot not captured in this page export]

This is not as great a speedup as I hoped. I think the parallelism doesn't help much because the longest tests are the ones that have to run sequentially. Also, since build-and-test is usually not the last job to finish, this PR doesn't have a noticeable impact on the end-to-end time of the CI workflow; we are usually gated by the installer workflow.

@taalexander
Collaborator Author

@mitchdz

Which is not as great of a speedup as I hoped

Agreed, it does seem like a marginal speed-up. Do we want to proceed? IMO it's "free" provided it remains stable (which I'm not sure we have enough information yet to ascertain).

I wonder if this should be paired with a follow-up pass through the existing tests to identify bottlenecks and turn down any knobs they may have (e.g. shots, qubit counts), provided they are not explicitly testing those capabilities?

@Renaud-K
Collaborator

Renaud-K commented Mar 10, 2026

I just completed a serial run. Here are the tests sorted by decreasing runtime:

sort -k3 -nr Testing/Temporary/CTestCostData.txt
ctest-targettests 1 1139.14
tensornet_fp32_BuilderTester.checkExplicitMeasurements 1 357.25
tensornet_fp32_AsyncTester.checkExplicitMeasurements 1 344.6
pycudaq_EvolveDynamicsOperatorBatching 1 214.8
tensornet_mps_fp32_BuilderTester.checkExplicitMeasurements 1 183.923
tensornet_mps_fp32_AsyncTester.checkExplicitMeasurements 1 181.58
nvqpp_Dynamics_Operator_Batching_Snippets 1 180.729
tensornet_mps_BuilderTester.checkExplicitMeasurements 1 166.489
tensornet_mps_AsyncTester.checkExplicitMeasurements 1 163.463
tensornet_BuilderTester.checkExplicitMeasurements 1 133.659
tensornet_AsyncTester.checkExplicitMeasurements 1 129.677
tensornet_BuilderTester.checkExplicitMeasurements_PathReuse 1 123.449
ctest-nvqpp 1 51.4938

And here is a breakdown computed by Codex which, I checked, adds up to the total:

  1. tensornet: 2640.671 s (53.41%)
  2. ctest: 1190.634 s (24.08%)
  3. nvqpp: 343.187 s (6.94%)
  4. pycudaq: 255.116 s (5.16%)
  5. braket: 71.275 s (1.44%)
  6. custatevec: 61.723 s (1.25%)
  7. EvolveTester: 58.537 s (1.18%)
  8. qci: 47.780 s (0.97%)
  9. quantinuum: 38.406 s (0.78%)
  10. BatchedEvolveTester: 29.987 s (0.61%)
  11. Other (all remaining families): 206.822 s (4.18%)

So tensornet and the targettests are taking the most time. Unfortunately, parallelizing tensornet is not trivial: there are a lot of very short tests and a few long tests, so there is a high probability of the long tests running at the same time.

If I run just these tests with -j 2, we quickly hit diminishing returns:

ctest -j 2  -R Tester.checkExplicitMeasurements
Test project /workspaces/cuda-quantum/build/Release
      Start  851: tensornet_fp32_BuilderTester.checkExplicitMeasurements
      Start  878: tensornet_fp32_AsyncTester.checkExplicitMeasurements
 1/18 Test  #878: tensornet_fp32_AsyncTester.checkExplicitMeasurements ....................   Passed  1077.25 sec
      Start  988: tensornet_mps_fp32_BuilderTester.checkExplicitMeasurements
 2/18 Test  #851: tensornet_fp32_BuilderTester.checkExplicitMeasurements ..................   Passed  1089.28 sec
      Start 1016: tensornet_mps_fp32_AsyncTester.checkExplicitMeasurements
 3/18 Test  #988: tensornet_mps_fp32_BuilderTester.checkExplicitMeasurements ..............   Passed  310.94 sec
      Start  710: tensornet_mps_BuilderTester.checkExplicitMeasurements
 4/18 Test #1016: tensornet_mps_fp32_AsyncTester.checkExplicitMeasurements ................   Passed  313.91 sec
      Start  738: tensornet_mps_AsyncTester.checkExplicitMeasurements

Run in parallel, they take far longer than the sum of their serial runtimes.

These tests seem to be GPU bound:

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.112                Driver Version: 581.95         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 3500 Ada Gene...    On  |   00000000:01:00.0  On |                  Off |
| N/A   60C    P4             26W /   55W |   11977MiB /  12282MiB |     80%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Utilization is at 80% and the memory is almost full, so I am reluctant to raise -j any further.

I think we need to look at each test family separately, because each family may benefit from different techniques. My experience with targettests was with the remote runs, which are based on MPI and fork child processes for each CPU. @1tnguyen may have some valuable input on tensornet. But I don't think we should blanket-propagate OMP_NUM_THREADS throughout the build system.

@1tnguyen
Collaborator

  • Continue to limit GPU tests to being sequential to avoid GPU contention.

I haven't looked at the code changes in detail, but I fully agree with this point from the description, which should partly address the concern about parallelizing tensornet tests (#4111 (comment)).

With respect to checkExplicitMeasurements test time, we could specialize the test condition for tensornet backends (e.g., reduce the number of shots and number of rounds). This would keep the coverage while reducing the test time. We did a round of test time reduction for tensornet backends a long time ago (#903). Perhaps it's time to do another round of trimming :)

@Renaud-K
Collaborator

I looked into the targettests lit tests. We would need to allow OMP_NUM_THREADS through on this line, otherwise the setting will have no effect.

I then asked Codex for a small parameter sweep over OMP_NUM_THREADS and -j, and got this back:

  - (OMP_NUM_THREADS=2, -j10): 371s
  - (4,10): 372s
  - (2,12): 392s
  - (1,20): 403s
  - (3,8): 415s
  - (4,8): 426s
  - (2,8): 429s
  - (4,5): 546s
  - (5,4): 641s
  - (6,4): 655s

I have 20 CPUs on my WSL machine, so half the CPUs with twice the threads seems the optimal choice.
This takes targettests from 1139.14 s to 371 s, a 3x speed-up, saving about 12 minutes.
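A sweep like the one above can be scripted; a minimal sketch follows. The -R filter is an assumption, and the real sweep would wrap each invocation with `time` rather than just printing it.

```shell
# Hypothetical sweep driver: emit one ctest invocation per
# (OMP_NUM_THREADS, -j) pair from the table above.
sweep_cmds() {
  for cfg in "2 10" "4 10" "2 12" "1 20" "3 8" "4 8" "2 8" "4 5" "5 4" "6 4"; do
    set -- $cfg                    # $1 = OMP_NUM_THREADS, $2 = -j value
    echo "OMP_NUM_THREADS=$1 ctest -j $2 -R targettests"
  done
}
sweep_cmds
```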

The slowest tests are

  - TargetConfig/check_compile.cpp: 365.87s
  - Remote-Sim/args_synthesis.cpp: 124.82s
  - execution/state_preparation_vector_sizes.cpp: 74.64s
  - execution/qspan_slices.cpp: 61.26s
  - execution/state_preparation_vector.cpp: 58.44s
  - execution/load_value.cpp: 57.63s
  - execution/uccsd.cpp: 57.04s

They are slow because they are constructed as follows:

// RUN: for target in $(nvq++ --list-targets); do echo "Testing target: ${target}"; nvq++ --library-mode --target ${target} %s; done
// RUN: for target in $(nvq++ --list-targets); do echo "Testing target: ${target}"; nvq++ --enable-mlir --target ${target} %s; done

or

// RUN: nvq++ %s -o %t && %t | FileCheck %s

// Quantum emulators
// RUN: nvq++ --target infleqtion --emulate %s -o %t && %t | FileCheck %s
// RUN: nvq++ --target quantinuum --emulate %s -o %t && %t | FileCheck %s
// RUN: nvq++ --target ionq       --emulate %s -o %t && %t | FileCheck %s
// RUN: nvq++ --target iqm        --emulate %s -o %t && IQM_QPU_QA=%iqm_tests_dir/Crystal_5.txt  %t | FileCheck %s
// RUN: nvq++ --target oqc        --emulate %s -o %t && %t | FileCheck %s
// RUN: if %braket_avail; then nvq++ --target braket --emulate %s -o %t && %t | FileCheck %s; fi
// RUN: if %qci_avail; then nvq++ --target qci --emulate %s -o %t && %t | FileCheck %s; fi
// RUN: if %quantum_machines_avail; then nvq++ --target quantum_machines --emulate %s -o %t && %t | FileCheck %s; fi

They run their targets serially, either explicitly or in a loop. A next step could be to break them up so they can be parallelized. If we can break up the longest poles (365.87 s and 124.82 s), we should be able to bring the overall time down to maybe 80 s.
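One possible shape for that break-up, sketched as lit directives. The split-out filenames are hypothetical, and note that lit runs the RUN lines of a single file sequentially, so the real parallelism gain comes from moving targets into separate test files:

```cpp
// Sketch: instead of one file looping over every target...
// RUN: for target in $(nvq++ --list-targets); do nvq++ --target ${target} %s; done
// ...split the targets into separate test files, e.g.:
//
// check_compile_quantinuum.cpp:
// RUN: nvq++ --target quantinuum --emulate %s -o %t && %t | FileCheck %s
//
// check_compile_ionq.cpp:
// RUN: nvq++ --target ionq --emulate %s -o %t && %t | FileCheck %s
```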

@copy-pr-bot

copy-pr-bot bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
- Add PROCESSORS ${NPROC} to ctest-nvqpp and ctest-targettests so ctest
  reserves all cores when these lit suites run (matches existing
  pycudaq-mlir behavior).
- Add RESOURCE_LOCK "gpu" to all gpu_required gtest tests so ctest
  serializes GPU tests even under ctest -j N without label filtering.
- Make CUDAQ_LIT_JOBS dynamic: defaults to min(nproc, 8) instead of
  hardcoded 8. Still overridable via -DCUDAQ_LIT_JOBS=<n>.
- Add comment explaining CUDAQ_TEST_OMP_SLOTS rationale for reviewer.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
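A sketch of what the CMake side of the commit above might look like. PROCESSORS and RESOURCE_LOCK are standard CTest test properties; the test names and the NPROC derivation here are illustrative, not the actual CMakeLists.txt contents.

```cmake
include(ProcessorCount)
ProcessorCount(NPROC)

# Let ctest account for the lit suites using all cores when scheduling.
set_tests_properties(ctest-targettests ctest-nvqpp
  PROPERTIES PROCESSORS ${NPROC})

# Serialize GPU tests against each other, even under ctest -j N,
# without needing label filtering or a separate -j 1 phase.
set_tests_properties(some_gpu_required_test
  PROPERTIES RESOURCE_LOCK "gpu")
```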
GPU tests now have RESOURCE_LOCK "gpu" in CMakeLists.txt, so ctest
serializes them automatically. The separate -j 1 GPU phase is no
longer needed.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
The Quantinuum backend tests had two xdist-unsafe patterns:
1. Three mock-server files shared port 62440 and a session-scoped
   fixture that was copy-pasted identically across all three.
2. Two LocalEmulation files shared $HOME/FakeConfig2.config with
   unguarded os.remove in teardown.

Fix by extracting shared fixtures into conftest.py and using
xdist_group markers to keep files that share resources on the
same worker:
- quantinuum_mock_server: session-scoped fixture (server + creds)
- quantinuum_emulation_creds: function-scoped fixture (creds file)
- xdist_group("quantinuum_mock") for the three mock-server files
- xdist_group("quantinuum_emulation") for the two emulation files

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
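The pattern described in this commit can be sketched roughly as below. The fixture body is a placeholder (the real fixture starts the mock server on the shared port and yields credentials), and the group marker only takes effect when the suite runs with pytest-xdist's `--dist loadgroup`.

```python
# conftest.py sketch (hypothetical bodies): a shared session fixture plus an
# xdist_group marker keep tests that share one resource on the same worker.
import pytest

@pytest.fixture(scope="session")
def quantinuum_mock_server():
    # Real fixture: start the mock server on the shared port, yield creds,
    # then shut it down. Here we just yield a marker value.
    yield "mock-server"

# In a test file that uses the shared server:
@pytest.mark.xdist_group("quantinuum_mock")
def test_uses_shared_server(quantinuum_mock_server):
    assert quantinuum_mock_server == "mock-server"
```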
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


The targettests lit config sanitizes the environment and only passes
through explicitly listed variables. OMP_NUM_THREADS was missing, so
the thread budget set by run_tests.sh had no effect on target tests.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
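The fix can be sketched as the usual lit allow-list pattern. The variable list and helper name here are illustrative, not the actual lit.cfg contents.

```python
import os

# Lit configs sanitize the environment, copying through only an allow-list
# of variables. Adding OMP_NUM_THREADS to the list lets the thread budget
# set by the test driver reach the target tests.
PASSTHROUGH_VARS = ["HOME", "PATH", "OMP_NUM_THREADS"]

def sanitized_environment(source_env):
    """Return only the allow-listed variables from source_env."""
    return {k: source_env[k] for k in PASSTHROUGH_VARS if k in source_env}

print(sanitized_environment({"OMP_NUM_THREADS": "2", "SECRET_TOKEN": "x"}))
# -> {'OMP_NUM_THREADS': '2'}
```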
Collaborator

@Renaud-K left a comment


Thank you.


Labels

testing Relates to testing


5 participants