Parallel processing and performance tuning
This page explains how parallel processing works in current CSUBST releases and
how to benchmark csubst search reproducibly.
The options below mainly affect csubst search, especially:
- substitution-tensor generation
- expected-state projection
- branch-combination generation
- cb/cbs reducer workloads
- branch-distance calculations
Core controls:
- `--threads`
- `--parallel_backend auto|multiprocessing|threading`
- `--parallel_chunk_factor`
- `--parallel_chunk_factor_reducer`
- `--sub_tensor_backend auto|dense|sparse`
- `--sub_tensor_sparse_density_cutoff`
- `--sub_tensor_auto_sparse_min_elements`
- `--output_stat`
Threshold-style controls:
- `--parallel_min_items_sub_tensor` / `--parallel_min_items_per_job_sub_tensor`
- `--parallel_min_items_node_union` / `--parallel_min_items_per_job_node_union`
- `--parallel_min_items_nc_matrix` / `--parallel_min_items_per_job_nc_matrix`
- `--parallel_min_items_cb` / `--parallel_min_items_per_job_cb`
- `--parallel_min_rows_cbs` / `--parallel_min_rows_per_job_cbs`
- `--parallel_min_items_branch_dist` / `--parallel_min_items_per_job_branch_dist`
- `--parallel_min_items_expected_state` / `--parallel_min_items_per_job_expected_state`
Current defaults worth remembering:
- `--parallel_backend auto` resolves to `multiprocessing`
- `--parallel_chunk_factor 1`
- `--parallel_chunk_factor_reducer 4`
- `--sub_tensor_backend auto`
- `--output_stat any2any,any2dif,any2spe`
`--output_stat` materially changes workload size, so always keep it fixed while benchmarking.
A reproducible single-thread baseline:

```
csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 1 \
  --output_stat any2any,any2dif,any2spe \
  --branch_dist no \
  --calibrate_longtail no
```

Compare `--threads 1,2,4,8` with all other flags fixed.
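The thread sweep can be scripted. The sketch below is a dry run that only prints each command so the sweep can be inspected first; drop the `echo` to execute for real. File names are the placeholders from the baseline command above.

```shell
# Dry-run sweep over thread counts; remove `echo` to actually run each benchmark.
for t in 1 2 4 8; do
  echo csubst search \
    --alignment_file alignment.fa.gz \
    --rooted_tree_file tree.nwk \
    --foreground foreground.txt \
    --threads "$t" \
    --output_stat any2any,any2dif,any2spe \
    --branch_dist no \
    --calibrate_longtail no
done
```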
On macOS:
```
/usr/bin/time -l -p csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 4 \
  --branch_dist no \
  --calibrate_longtail no \
  --sub_tensor_backend sparse \
  --parallel_backend auto
```

Track:
- `real` for wall-clock runtime
- `maximum resident set size` for peak RAM
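If the timing output is saved to a file (a hypothetical `time.log` here, e.g. via `2> time.log`, since `time` writes to stderr), the two tracked numbers can be pulled out with `awk`:

```shell
# Extract wall-clock seconds and peak RSS from saved `/usr/bin/time -l -p` output.
# `time.log` is a hypothetical file name.
real_sec=$(awk '/^real/ {print $2}' time.log)
peak_ram=$(awk '/maximum resident set size/ {print $1}' time.log)
echo "real=${real_sec}s peak_ram=${peak_ram}B"
```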
Try the following in order:
- keep `--parallel_backend multiprocessing` (or `auto`)
- compare `--sub_tensor_backend dense`, `sparse`, and `auto`
- tune `--parallel_chunk_factor` and `--parallel_chunk_factor_reducer`
- raise or lower the `--parallel_min_items_*` thresholds so small workloads do not pay process-startup overhead
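As one concrete combination of the tunings above (the values are illustrative placeholders, not recommendations; good settings depend on the dataset):

```shell
# Illustrative run combining a backend choice with coarser chunking.
csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 8 \
  --sub_tensor_backend sparse \
  --parallel_chunk_factor 2 \
  --parallel_chunk_factor_reducer 8
```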
Each `--parallel_min_items_*` or `--parallel_min_rows_*` option defines when a
given workload is large enough to parallelize. The matching
`--parallel_min_items_per_job_*` option defines the minimum amount of work
assigned to each worker.
That means you can tune two things independently:
- when parallelism turns on
- how coarsely the workload is split once parallelism is enabled
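For example, one on-threshold/per-job pair could be tuned like this for the `cb` workload (the numeric values are hypothetical placeholders):

```shell
# Illustrative values only: turn cb parallelism on later, and split it coarsely.
csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 8 \
  --parallel_min_items_cb 100000 \
  --parallel_min_items_per_job_cb 20000
```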
After changing backend or chunking settings, compare outputs from:
- dense vs sparse reducers
- one thread vs many threads
Recommended comparisons on `csubst_cb_2.tsv`:
- `omegaC*`
- core `OC*`, `EC*`, `dNC*`, `dSC*` columns
- branch-level substitution counters (`S_sub_*`, `N_sub_*`)
Expected behavior:
- `omegaC*` and the core OC/EC/dN/dS statistics should match exactly or differ only by tiny floating-point noise
- branch-level accumulations may show extremely small differences due to reduction order
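A quick way to check two runs is a byte comparison followed by a tolerance-aware numeric diff. The sketch below assumes the outputs from the two runs were written to hypothetical `dense/` and `sparse/` directories and that rows and columns are in identical order in both files:

```shell
# Compare two csubst_cb_2.tsv files; report any cell differing by more than 1e-9.
if cmp -s dense/csubst_cb_2.tsv sparse/csubst_cb_2.tsv; then
  echo "identical"
else
  paste dense/csubst_cb_2.tsv sparse/csubst_cb_2.tsv |
    awk -F'\t' 'NR>1 { n=NF/2; for (i=1;i<=n;i++) { d=$i-$(i+n); if (d<0) d=-d; if (d>1e-9) printf "row %d col %d: |diff|=%g\n", NR, i, d } }'
fi
```

Non-numeric columns (such as branch IDs) evaluate to 0 under `awk` subtraction, so only numeric cells are flagged.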
- Benchmark the CLI directly rather than through inline `python - <<'PY'` snippets when using multiprocessing.
- On macOS, `/usr/bin/time -l` is the easiest way to collect peak-RAM numbers.
| sub_tensor_backend | threads | parallel_backend | real (s) | peak RAM (bytes) |
|---|---|---|---|---|
| dense | 1 | auto | ... | ... |
| dense | 2 | auto | ... | ... |
| sparse | 1 | auto | ... | ... |
| sparse | 2 | auto | ... | ... |
Keeping this table in issue comments or release notes makes regressions much easier to spot.
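Rows for such a table can be appended mechanically as benchmarks finish. The five shell variables below are hypothetical placeholders for measured values, and `bench_table.md` is an assumed scratch file:

```shell
# Append one measured configuration as a Markdown table row.
backend=sparse; threads=4; parallel_backend=auto; real_sec=12.3; peak_ram=123456
printf '| %s | %s | %s | %s | %s |\n' \
  "$backend" "$threads" "$parallel_backend" "$real_sec" "$peak_ram" >> bench_table.md
```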