This repository uses two distinct mechanisms to speed up computation. They address different bottlenecks and should not be confused.
Where: experiments/parallel_solve_batch.py, invoked from main_experiment_workflow.py when solving many on-disk instances.
Idea: run different instances (different JSON inputs) in different OS processes, using the multiprocessing `spawn` start method with `concurrent.futures.ProcessPoolExecutor`. Each worker executes `_parallel_solve_one_worker`: it imports the solver class from `_PARALLEL_SOLVER_IMPORTS`, calls `solver.solve(instance, run_config)`, and writes a standard solution payload via `build_solution_record`.
Supported `solver_name` values in the pool: `cudaq`, `cirq`, `brute_force`, `simulated_annealing` (see `_PARALLEL_SOLVER_IMPORTS` in the same module).
Configuration:
- CUDA-Q: `cudaq_max_parallel_instances` in the merged experiment YAML, overridden by `HTSP_CUDAQ_MAX_PARALLEL_INSTANCES` when set.
- CPU backends (`cirq`, `brute_force`, `simulated_annealing`): `cpu_max_parallel_instances` / `HTSP_CPU_MAX_PARALLEL_INSTANCES`.
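The override semantics can be sketched like this. The helper name and signature are hypothetical; the key and environment-variable names come from the configuration notes above, and the "env var wins when set" rule mirrors them.

```python
# Assumed resolution order: environment variable (when set) overrides the
# merged-YAML value; otherwise fall back to a default.
import os


def effective_parallel_count(yaml_config, yaml_key, env_var, default=1):
    """Hypothetical helper: resolve the parallel-instance count for one backend."""
    raw = os.environ.get(env_var)
    if raw is not None:
        return int(raw)
    return int(yaml_config.get(yaml_key, default))


config = {"cudaq_max_parallel_instances": 4}
os.environ.pop("HTSP_CUDAQ_MAX_PARALLEL_INSTANCES", None)
print(effective_parallel_count(config, "cudaq_max_parallel_instances",
                               "HTSP_CUDAQ_MAX_PARALLEL_INSTANCES"))  # 4

os.environ["HTSP_CUDAQ_MAX_PARALLEL_INSTANCES"] = "2"
print(effective_parallel_count(config, "cudaq_max_parallel_instances",
                               "HTSP_CUDAQ_MAX_PARALLEL_INSTANCES"))  # 2
```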
Important limits:
- QAOA inside one instance is still sequential — only different instances overlap in time.
- CUDA-Q: each process holds its own GPU context and memory footprint; reduce parallel count if you hit OOM.
- CPU: set `OMP_NUM_THREADS=1` (and limit MKL/OpenBLAS threads if applicable) so many workers do not oversubscribe cores — see development.md.
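One way to apply these caps from Python rather than the shell is to set the variables before any math library is imported; the ordering is what matters, since BLAS/OpenMP thread pools are sized at import time.

```python
import os

# Cap per-worker math-library threading *before* importing numpy/scipy so the
# limits take effect. With N pool workers this keeps total threads near N
# instead of N x (machine cores). setdefault respects values already exported.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ.setdefault(var, "1")

import numpy as np  # must come after the caps above
```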
Reproducibility: effective parallel counts are recorded under `solver_config` in the solution JSON (e.g. `cudaq_max_parallel_instances_effective`, `cpu_max_parallel_instances_effective`).
Related: calibration tool estimate_lambdas.py uses ProcessPoolExecutor only for certain CPU solvers across instances; CUDA-Q runs sequentially there.
The second mechanism is in-process batch vectorization.
Where:
- utils/costs_batch.py — `batch_qubo_costs`, `batch_tqudo_costs`: vectorized objectives for many assignments at once.
- solvers/brute_force/solver.py — enumerates assignments in chunks (`_BRUTE_FORCE_ASSIGNMENT_CHUNK_SIZE`, default 8192), decodes bit patterns with `unpack_qubo_bitmatrix`, and evaluates a batch of costs per chunk.
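The chunked enumeration in the second bullet can be sketched as follows. The helper names here are illustrative stand-ins (the actual decoding lives in `unpack_qubo_bitmatrix`); the chunk size mirrors the stated default of 8192.

```python
import numpy as np

CHUNK_SIZE = 8192  # mirrors the default of _BRUTE_FORCE_ASSIGNMENT_CHUNK_SIZE


def unpack_bitmatrix(indices, n_bits):
    """Decode integer assignment indices into a (len(indices), n_bits) 0/1 matrix."""
    # Right-shift each index by every bit position, then mask off the low bit.
    shifts = np.arange(n_bits, dtype=np.uint64)
    return (indices[:, None] >> shifts) & 1


def iter_assignment_chunks(n_bits, chunk_size=CHUNK_SIZE):
    """Yield all 2**n_bits assignments, chunk_size rows at a time."""
    total = 1 << n_bits
    for start in range(0, total, chunk_size):
        idx = np.arange(start, min(start + chunk_size, total), dtype=np.uint64)
        yield unpack_bitmatrix(idx, n_bits)
```

Each yielded chunk is a dense array that can be handed to a batch cost function in one call, keeping the Python-level loop to `2**n / chunk_size` iterations.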
Idea: Within one brute-force solve, evaluate thousands of candidate assignments per NumPy operation instead of a Python loop per assignment.
What this does not do: it does not parallelize across CPU cores by itself (single process). It reduces per-assignment overhead. Cross-instance parallelism (above) is still the way to run many brute-force jobs concurrently.
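A minimal sketch of the vectorized objective, in the spirit of `batch_qubo_costs` (the function name and signature here are illustrative, not the repo's API): evaluate the QUBO cost x^T Q x for a whole batch of assignments with a single einsum call instead of a Python loop per assignment.

```python
import numpy as np


def batch_qubo_costs_sketch(bitmatrix, Q):
    """bitmatrix: (batch, n) 0/1 array; Q: (n, n) QUBO matrix -> (batch,) costs."""
    X = bitmatrix.astype(np.float64)
    # Computes x_b^T Q x_b for every row b in one vectorized pass.
    return np.einsum("bi,ij,bj->b", X, Q, X)


Q = np.array([[1.0, -2.0],
              [0.0, 3.0]])
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
print(batch_qubo_costs_sketch(X, Q))  # [0. 1. 3. 2.]
```

The whole batch is evaluated inside optimized NumPy kernels, which is exactly the per-assignment overhead reduction described above; running many such solves concurrently still requires the process pool.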
| Mechanism | Parallelizes | Typical use |
|---|---|---|
| Process pool (`run_parallel_solve_batch`) | Multiple instances across multiple processes | Large experiment grids on disk |
| `costs_batch` + brute-force chunks | Many assignments in one process | Fast exact search for one instance |
For reusing the process-pool pattern outside QAOA, see extending_quantum_solvers.md and parallel_workers_general_workloads.md.