
Parallelize CI test suites #4111

Open
taalexander wants to merge 16 commits into NVIDIA:main from taalexander:feature/parallelize-ci

Conversation

@taalexander
Collaborator

@taalexander taalexander commented Mar 6, 2026

Based on a discussion with @Renaud-K and the testing bottlenecks identified in #3960, this work attempts to enable better concurrent testing. The problem with simply setting -j on the existing tests is that they would then use parallelism internally as well (lit tests running in parallel, or worse, parallel lit tests each using OpenMP to parallelize simulations). This led to massive contention and overall slower tests.

This PR attempts to structure concurrency with the hope it speeds up CI overall:

  • Teach ctest how many cores each job will use, limiting OpenMP jobs to 2 cores
  • Balance test parallelism (-j) against the 2 cores reserved per OpenMP job
  • Parallelize Python tests using pytest-xdist, and consolidate them into a single pytest invocation to avoid repeated loading overhead
  • Continue to limit GPU tests to being sequential to avoid GPU contention.
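The core budgeting in the bullets above can be sketched as a small shell helper. This is a hypothetical sketch, not the actual run_tests.sh logic; the function name and derivation are illustrative.

```shell
# Hypothetical sketch: reserve 2 cores per OpenMP-enabled test and derive
# the ctest -j value from the total core budget.
ctest_jobs() {
  total_cores=$1
  omp_slots=2                        # cores reserved per OpenMP job
  jobs=$(( total_cores / omp_slots ))
  [ "$jobs" -lt 1 ] && jobs=1        # never go below one parallel job
  echo "$jobs"
}

# On a 16-core runner this would suggest:
echo "OMP_NUM_THREADS=2 ctest -j $(ctest_jobs 16)"
```

The point of the division is that -j and OMP_NUM_THREADS multiply: without the budget, N parallel jobs each spawning M OpenMP threads oversubscribe the machine N*M-fold.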

@taalexander taalexander requested review from Renaud-K and mitchdz March 6, 2026 19:31
@github-actions

github-actions bot commented Mar 8, 2026

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


@mitchdz
Collaborator

mitchdz commented Mar 9, 2026

I see for comparison:

[timing comparison screenshot not captured in this page export]

This is not as great a speedup as I hoped. I think the parallelism doesn't help much because the longest tests are the ones that have to run sequentially. Also, since build-and-test is usually not the last job to finish, this PR doesn't have a noticeable impact on the end-to-end time of the CI workflow; we are usually gated by the installer workflow.

@taalexander
Collaborator Author

@mitchdz

Which is not as great of a speedup as I hoped

Agreed, it does seem like a marginal speed-up. Do we want to proceed? IMO it's "free" provided it remains stable (which I'm not sure we have enough information yet to ascertain).

I wonder if this should be paired with a follow-up pass through the existing tests to identify bottlenecks and turn down any knobs they may have (e.g. shots, qubit counts), provided they are not explicitly testing those capabilities?

@Renaud-K
Collaborator

Renaud-K commented Mar 10, 2026

I just completed a serial run. Here are the tests sorted by decreasing runtime:

sort -k3 -nr Testing/Temporary/CTestCostData.txt
ctest-targettests 1 1139.14
tensornet_fp32_BuilderTester.checkExplicitMeasurements 1 357.25
tensornet_fp32_AsyncTester.checkExplicitMeasurements 1 344.6
pycudaq_EvolveDynamicsOperatorBatching 1 214.8
tensornet_mps_fp32_BuilderTester.checkExplicitMeasurements 1 183.923
tensornet_mps_fp32_AsyncTester.checkExplicitMeasurements 1 181.58
nvqpp_Dynamics_Operator_Batching_Snippets 1 180.729
tensornet_mps_BuilderTester.checkExplicitMeasurements 1 166.489
tensornet_mps_AsyncTester.checkExplicitMeasurements 1 163.463
tensornet_BuilderTester.checkExplicitMeasurements 1 133.659
tensornet_AsyncTester.checkExplicitMeasurements 1 129.677
tensornet_BuilderTester.checkExplicitMeasurements_PathReuse 1 123.449
ctest-nvqpp 1 51.4938

And here is a breakdown computed by Codex which, I checked, adds up to the total:

  1. tensornet: 2640.671 s (53.41%)
  2. ctest: 1190.634 s (24.08%)
  3. nvqpp: 343.187 s (6.94%)
  4. pycudaq: 255.116 s (5.16%)
  5. braket: 71.275 s (1.44%)
  6. custatevec: 61.723 s (1.25%)
  7. EvolveTester: 58.537 s (1.18%)
  8. qci: 47.780 s (0.97%)
  9. quantinuum: 38.406 s (0.78%)
  10. BatchedEvolveTester: 29.987 s (0.61%)
  11. Other (all remaining families): 206.822 s (4.18%)

So tensornet and the targettests are taking the most time. Unfortunately, parallelizing tensornet is not trivial: there are a lot of very short tests and a few long tests, so there is a high probability of the long tests running at the same time.

If I run just these tests with -j 2, we quickly hit diminishing returns:

ctest -j 2  -R Tester.checkExplicitMeasurements
Test project /workspaces/cuda-quantum/build/Release
      Start  851: tensornet_fp32_BuilderTester.checkExplicitMeasurements
      Start  878: tensornet_fp32_AsyncTester.checkExplicitMeasurements
 1/18 Test  #878: tensornet_fp32_AsyncTester.checkExplicitMeasurements ....................   Passed  1077.25 sec
      Start  988: tensornet_mps_fp32_BuilderTester.checkExplicitMeasurements
 2/18 Test  #851: tensornet_fp32_BuilderTester.checkExplicitMeasurements ..................   Passed  1089.28 sec
      Start 1016: tensornet_mps_fp32_AsyncTester.checkExplicitMeasurements
 3/18 Test  #988: tensornet_mps_fp32_BuilderTester.checkExplicitMeasurements ..............   Passed  310.94 sec
      Start  710: tensornet_mps_BuilderTester.checkExplicitMeasurements
 4/18 Test #1016: tensornet_mps_fp32_AsyncTester.checkExplicitMeasurements ................   Passed  313.91 sec
      Start  738: tensornet_mps_AsyncTester.checkExplicitMeasurements

Run in parallel, they take far longer than the sum of their serial runtimes.

These tests seem to be GPU bound:

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.112                Driver Version: 581.95         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 3500 Ada Gene...    On  |   00000000:01:00.0  On |                  Off |
| N/A   60C    P4             26W /   55W |   11977MiB /  12282MiB |     80%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Utilization is at 80% and the memory is almost full, so I am reluctant to raise -j any further.

I think we need to look at each test family separately, because each family may benefit from different techniques. My experience with targettests was with the remote runs, which are based on MPI and fork child processes for each CPU. @1tnguyen may have some valuable input on tensornet. But I don't think we should blanket-propagate OMP_NUM_THREADS throughout the build system.

@1tnguyen
Collaborator

  • Continue to limit GPU tests to being sequential to avoid GPU contention.

I haven't looked at the code changes in detail, but I fully agree with this point from the description, which should partly address the concern about parallelizing tensornet tests (#4111 (comment)).

With respect to checkExplicitMeasurements test time, we could specialize the test condition for tensornet backends (e.g., reduce the number of shots and number of rounds). This would keep the coverage while reducing the test time. We did a round of test time reduction for tensornet backends a long time ago (#903). Perhaps it's time to do another round of trimming :)

@Renaud-K
Collaborator

I looked into the targettests lit tests. We would need to allow OMP_NUM_THREADS through on this line, otherwise the setting will have no effect.

I then asked Codex for a small parameter sweep over OMP_NUM_THREADS and -j, and got this back:

  - (OMP_NUM_THREADS=2, -j10): 371s
  - (4,10): 372s
  - (2,12): 392s
  - (1,20): 403s
  - (3,8): 415s
  - (4,8): 426s
  - (2,8): 429s
  - (4,5): 546s
  - (5,4): 641s
  - (6,4): 655s

I have 20 CPUs on my WSL machine, so half the CPUs with twice the threads seems the optimal choice.
This takes targettests from 1139.14 s to 371 s, a 3x speed-up, saving about 12 minutes.
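A sweep like the one above can be scripted; a minimal sketch follows. The -R filter is an assumption, and the real sweep would wrap each invocation with `time` rather than just printing it.

```shell
# Hypothetical sweep driver: emit one ctest invocation per
# (OMP_NUM_THREADS, -j) pair from the table above.
sweep_cmds() {
  for cfg in "2 10" "4 10" "2 12" "1 20" "3 8" "4 8" "2 8" "4 5" "5 4" "6 4"; do
    set -- $cfg                    # $1 = OMP_NUM_THREADS, $2 = -j value
    echo "OMP_NUM_THREADS=$1 ctest -j $2 -R targettests"
  done
}
sweep_cmds
```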

The slowest tests are

  - TargetConfig/check_compile.cpp: 365.87s
  - Remote-Sim/args_synthesis.cpp: 124.82s
  - execution/state_preparation_vector_sizes.cpp: 74.64s
  - execution/qspan_slices.cpp: 61.26s
  - execution/state_preparation_vector.cpp: 58.44s
  - execution/load_value.cpp: 57.63s
  - execution/uccsd.cpp: 57.04s

They are slow because they are constructed as follows:

// RUN: for target in $(nvq++ --list-targets); do echo "Testing target: ${target}"; nvq++ --library-mode --target ${target} %s; done
// RUN: for target in $(nvq++ --list-targets); do echo "Testing target: ${target}"; nvq++ --enable-mlir --target ${target} %s; done

or

// RUN: nvq++ %s -o %t && %t | FileCheck %s

// Quantum emulators
// RUN: nvq++ --target infleqtion --emulate %s -o %t && %t | FileCheck %s
// RUN: nvq++ --target quantinuum --emulate %s -o %t && %t | FileCheck %s
// RUN: nvq++ --target ionq       --emulate %s -o %t && %t | FileCheck %s
// RUN: nvq++ --target iqm        --emulate %s -o %t && IQM_QPU_QA=%iqm_tests_dir/Crystal_5.txt  %t | FileCheck %s
// RUN: nvq++ --target oqc        --emulate %s -o %t && %t | FileCheck %s
// RUN: if %braket_avail; then nvq++ --target braket --emulate %s -o %t && %t | FileCheck %s; fi
// RUN: if %qci_avail; then nvq++ --target qci --emulate %s -o %t && %t | FileCheck %s; fi
// RUN: if %quantum_machines_avail; then nvq++ --target quantum_machines --emulate %s -o %t && %t | FileCheck %s; fi

They run their targets serially, either explicitly or in a loop. A next step could be to break them up so they can be parallelized. If we can break up the longest poles (365.87 s and 124.82 s), we should be able to bring the overall time down to maybe 80 s.
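One possible shape for that break-up, sketched as lit directives. The split-out filenames are hypothetical, and note that lit runs the RUN lines of a single file sequentially, so the real parallelism gain comes from moving targets into separate test files:

```cpp
// Sketch: instead of one file looping over every target...
// RUN: for target in $(nvq++ --list-targets); do nvq++ --target ${target} %s; done
// ...split the targets into separate test files, e.g.:
//
// check_compile_quantinuum.cpp:
// RUN: nvq++ --target quantinuum --emulate %s -o %t && %t | FileCheck %s
//
// check_compile_ionq.cpp:
// RUN: nvq++ --target ionq --emulate %s -o %t && %t | FileCheck %s
```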

@copy-pr-bot

copy-pr-bot bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
- Add PROCESSORS ${NPROC} to ctest-nvqpp and ctest-targettests so ctest
  reserves all cores when these lit suites run (matches existing
  pycudaq-mlir behavior).
- Add RESOURCE_LOCK "gpu" to all gpu_required gtest tests so ctest
  serializes GPU tests even under ctest -j N without label filtering.
- Make CUDAQ_LIT_JOBS dynamic: defaults to min(nproc, 8) instead of
  hardcoded 8. Still overridable via -DCUDAQ_LIT_JOBS=<n>.
- Add comment explaining CUDAQ_TEST_OMP_SLOTS rationale for reviewer.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
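A sketch of what the CMake side of the commit above might look like. PROCESSORS and RESOURCE_LOCK are standard CTest test properties; the test names and the NPROC derivation here are illustrative, not the actual CMakeLists.txt contents.

```cmake
include(ProcessorCount)
ProcessorCount(NPROC)

# Let ctest account for the lit suites using all cores when scheduling.
set_tests_properties(ctest-targettests ctest-nvqpp
  PROPERTIES PROCESSORS ${NPROC})

# Serialize GPU tests against each other, even under ctest -j N,
# without needing label filtering or a separate -j 1 phase.
set_tests_properties(some_gpu_required_test
  PROPERTIES RESOURCE_LOCK "gpu")
```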
GPU tests now have RESOURCE_LOCK "gpu" in CMakeLists.txt, so ctest
serializes them automatically. The separate -j 1 GPU phase is no
longer needed.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
The Quantinuum backend tests had two xdist-unsafe patterns:
1. Three mock-server files shared port 62440 and a session-scoped
   fixture that was copy-pasted identically across all three.
2. Two LocalEmulation files shared $HOME/FakeConfig2.config with
   unguarded os.remove in teardown.

Fix by extracting shared fixtures into conftest.py and using
xdist_group markers to keep files that share resources on the
same worker:
- quantinuum_mock_server: session-scoped fixture (server + creds)
- quantinuum_emulation_creds: function-scoped fixture (creds file)
- xdist_group("quantinuum_mock") for the three mock-server files
- xdist_group("quantinuum_emulation") for the two emulation files

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
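The pattern described in this commit can be sketched roughly as below. The fixture body is a placeholder (the real fixture starts the mock server on the shared port and yields credentials), and the group marker only takes effect when the suite runs with pytest-xdist's `--dist loadgroup`.

```python
# conftest.py sketch (hypothetical bodies): a shared session fixture plus an
# xdist_group marker keep tests that share one resource on the same worker.
import pytest

@pytest.fixture(scope="session")
def quantinuum_mock_server():
    # Real fixture: start the mock server on the shared port, yield creds,
    # then shut it down. Here we just yield a marker value.
    yield "mock-server"

# In a test file that uses the shared server:
@pytest.mark.xdist_group("quantinuum_mock")
def test_uses_shared_server(quantinuum_mock_server):
    assert quantinuum_mock_server == "mock-server"
```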
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


The targettests lit config sanitizes the environment and only passes
through explicitly listed variables. OMP_NUM_THREADS was missing, so
the thread budget set by run_tests.sh had no effect on target tests.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
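The fix can be sketched as the usual lit allow-list pattern. The variable list and helper name here are illustrative, not the actual lit.cfg contents.

```python
import os

# Lit configs sanitize the environment, copying through only an allow-list
# of variables. Adding OMP_NUM_THREADS to the list lets the thread budget
# set by the test driver reach the target tests.
PASSTHROUGH_VARS = ["HOME", "PATH", "OMP_NUM_THREADS"]

def sanitized_environment(source_env):
    """Return only the allow-listed variables from source_env."""
    return {k: source_env[k] for k in PASSTHROUGH_VARS if k in source_env}

print(sanitized_environment({"OMP_NUM_THREADS": "2", "SECRET_TOKEN": "x"}))
# -> {'OMP_NUM_THREADS': '2'}
```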
Collaborator

@Renaud-K left a comment


Thank you.


Labels

testing Relates to testing


5 participants