CUDA Quantum Docs Bot: A preview of the documentation can be found here.
I see for comparison: This is not as great a speedup as I hoped. I think the parallelism doesn't add much because the longer tests are the ones that have to run sequentially. Also, since build-and-test is usually not the last job to finish, this PR doesn't have a noticeable impact on the end-to-end time of the CI workflow; we are usually gated by the installer workflow.
Agreed. It does seem like a marginal speedup. Do we want to proceed? IMO, it's "free" provided it remains stable (which I'm not sure we have enough info yet to ascertain). I wonder if this should be paired with a follow-up pass through the existing tests to identify bottlenecks and turn down any knobs (e.g. shots, qubits) they may have (provided they are not explicitly testing those capabilities)?
I just completed a serial run. Here are the tests sorted in decreasing runtime: And here is a breakdown computed by codex which, I checked, adds up to the total. So tensornet and the targettests are taking the most time. Unfortunately, parallelizing tensornet is not trivial: there are a lot of very short tests and a few long tests, so there is a high probability of the long tests running at the same time. If I run just these tests with -j2, we quickly get diminishing returns: they take a lot more time than their sum. These tests seem to be GPU bound: utilization sits around 80% and GPU memory is almost full, so I am reluctant to apply an even higher job count. I think we need to look at each test family separately, because each family may benefit from different techniques. My experience with targettests was with the remote runs, based on MPI, forking child processes for each CPU. @1tnguyen may have some valuable input on tensornet. But I don't think we should have a blanket propagation of OMP_NUM_THREADS throughout the build system.
I haven't looked at the code changes in detail, but I fully agree with this point from the description, which should partly address the concern about parallelizing tensornet tests (#4111 (comment)). With respect to |
I looked into the targettests lit tests. We would need to allow Then asking codex for a small parameter sweep on I have 20 CPUs on my WSL machine, so half the CPUs with twice the threads seems the optimal choice. The slowest tests are They are slow because they are constructed as follows: or They run tests serially, explicitly or in a loop. The next job could be to break them up so they can be parallelized. If we can break up the longest poles, which are 365.87s and 124.82s, we should be able to bring the overall time down to maybe 80s.
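The "half the CPUs, twice the threads" pick can be illustrated with a small sketch; the grid values and the helper name are hypothetical, not the actual sweep that was run:

```python
def sweep_grid(ncpus):
    # Candidate (lit_jobs, OMP_NUM_THREADS) pairs to time against each
    # other; the grid values here are illustrative only.
    return [(ncpus, 1), (ncpus // 2, 2), (ncpus // 4, 4)]

# On a 20-CPU machine, sweep_grid(20)[1] == (10, 2): half the CPUs as lit
# jobs with two OpenMP threads each, the configuration that came out fastest.
```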
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
- Add PROCESSORS ${NPROC} to ctest-nvqpp and ctest-targettests so ctest
reserves all cores when these lit suites run (matches existing
pycudaq-mlir behavior).
- Add RESOURCE_LOCK "gpu" to all gpu_required gtest tests so ctest
serializes GPU tests even under ctest -j N without label filtering.
- Make CUDAQ_LIT_JOBS dynamic: defaults to min(nproc, 8) instead of
hardcoded 8. Still overridable via -DCUDAQ_LIT_JOBS=<n>.
- Add comment explaining CUDAQ_TEST_OMP_SLOTS rationale for reviewer.
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
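As a rough Python rendering of the dynamic default described above (the actual logic lives in CMake; the function name here is hypothetical):

```python
import os

def default_lit_jobs(override=None, cap=8):
    # -DCUDAQ_LIT_JOBS=<n> still wins when given; otherwise use
    # min(nproc, cap) rather than a hardcoded 8.
    if override is not None:
        return int(override)
    return min(os.cpu_count() or 1, cap)
```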
GPU tests now have RESOURCE_LOCK "gpu" in CMakeLists.txt, so ctest serializes them automatically. The separate -j 1 GPU phase is no longer needed.
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
The Quantinuum backend tests had two xdist-unsafe patterns:
1. Three mock-server files shared port 62440 and a session-scoped
fixture that was copy-pasted identically across all three.
2. Two LocalEmulation files shared $HOME/FakeConfig2.config with
unguarded os.remove in teardown.
Fix by extracting shared fixtures into conftest.py and using
xdist_group markers to keep files that share resources on the
same worker:
- quantinuum_mock_server: session-scoped fixture (server + creds)
- quantinuum_emulation_creds: function-scoped fixture (creds file)
- xdist_group("quantinuum_mock") for the three mock-server files
- xdist_group("quantinuum_emulation") for the two emulation files
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
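A heavily simplified sketch of what the extracted conftest.py could look like; the stdlib HTTP server stands in for the real Quantinuum mock server, and only the fixture/marker shape follows the commit:

```python
import threading
import http.server

import pytest

@pytest.fixture(scope="session")
def quantinuum_mock_server():
    # One shared server per session replaces the three copy-pasted fixtures
    # that all bound port 62440; port 0 here lets the OS pick a free one.
    server = http.server.HTTPServer(
        ("127.0.0.1", 0), http.server.BaseHTTPRequestHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    yield server
    server.shutdown()

# In each of the three mock-server test files, pin the whole file to one
# pytest-xdist worker so tests sharing the server never run in parallel:
pytestmark = pytest.mark.xdist_group("quantinuum_mock")
```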
The targettests lit config sanitizes the environment and only passes through explicitly listed variables. OMP_NUM_THREADS was missing, so the thread budget set by run_tests.sh had no effect on target tests.
Signed-off-by: Thomas Alexander <talexander@nvidia.com>
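A hedged sketch of the mechanism (the real lit config and its pass-through list differ; the variable names here are illustrative):

```python
# Variables explicitly allowed through the sanitized test environment.
# The fix amounts to adding "OMP_NUM_THREADS" to a list like this one.
PASSTHROUGH_VARS = ["PATH", "HOME", "OMP_NUM_THREADS"]

def sanitized_environment(parent_env):
    # Drop everything except the explicitly listed variables.
    return {k: parent_env[k] for k in PASSTHROUGH_VARS if k in parent_env}
```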
Based on a discussion with @Renaud-K and the testing bottlenecks identified in #3960, this work attempts to enable better concurrent testing. The issue with setting -j on the existing tests is that they would then also use parallelism internally (either the lit tests running in parallel, or, even worse, parallel lit tests using OpenMP to parallelize simulations). This would lead to massive contention and overall slower tests.
This PR attempts to structure concurrency with the hope it speeds up CI overall:
- Tell ctest how many cores we will use for each job, limiting OpenMP jobs to 2 cores
- Run ctest in parallel (-j) with 2 cores per OpenMP job to balance concurrency
- Parallelize the Python tests with pytest-xdist and also consolidate them into a single pytest invocation to avoid loading overhead
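The core-budgeting idea can be sketched as follows (the helper name and constant are hypothetical, not part of the PR):

```python
import os

OMP_SLOTS = 2  # cores reserved for each OpenMP-using test job

def ctest_parallelism(ncpus=None):
    # Run `ctest -j <jobs>` with OMP_NUM_THREADS=OMP_SLOTS so that all
    # concurrent OpenMP jobs together stay within the machine's cores.
    ncpus = ncpus or os.cpu_count() or 1
    jobs = max(1, ncpus // OMP_SLOTS)
    return jobs, OMP_SLOTS
```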