Parallel processing and performance tuning

Kenji Fukushima edited this page Mar 6, 2026 · 7 revisions

This page explains how parallel processing works in current CSUBST releases and how to benchmark csubst search reproducibly.

Scope

The options below mainly affect csubst search, especially:

  • substitution-tensor generation
  • expected-state projection
  • branch-combination generation
  • cb / cbs reducer workloads
  • branch-distance calculations

Relevant CLI options

Core controls:

  • --threads
  • --parallel_backend auto|multiprocessing|threading
  • --parallel_chunk_factor
  • --parallel_chunk_factor_reducer
  • --sub_tensor_backend auto|dense|sparse
  • --sub_tensor_sparse_density_cutoff
  • --sub_tensor_auto_sparse_min_elements
  • --output_stat

Threshold-style controls:

  • --parallel_min_items_sub_tensor
  • --parallel_min_items_per_job_sub_tensor
  • --parallel_min_items_node_union
  • --parallel_min_items_per_job_node_union
  • --parallel_min_items_nc_matrix
  • --parallel_min_items_per_job_nc_matrix
  • --parallel_min_items_cb
  • --parallel_min_items_per_job_cb
  • --parallel_min_rows_cbs
  • --parallel_min_rows_per_job_cbs
  • --parallel_min_items_branch_dist
  • --parallel_min_items_per_job_branch_dist
  • --parallel_min_items_expected_state
  • --parallel_min_items_per_job_expected_state

Current defaults worth remembering:

  • --parallel_backend auto resolves to multiprocessing
  • --parallel_chunk_factor 1
  • --parallel_chunk_factor_reducer 4
  • --sub_tensor_backend auto
  • --output_stat any2any,any2dif,any2spe

--output_stat materially changes workload size, so always keep it fixed while benchmarking.
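The chunk-factor defaults above are easier to reason about with a small sketch. This is an illustrative model of how a chunk factor is commonly used to split work across workers; the function name and exact splitting formula are assumptions for illustration, not taken from the CSUBST source.

```python
# ASSUMED semantics: a chunk factor of k splits N work items into roughly
# threads * k chunks, so larger factors give finer-grained scheduling at the
# cost of more per-chunk overhead. Illustrative only.

def chunk_sizes(n_items, threads, chunk_factor):
    """Split n_items into roughly threads * chunk_factor chunks."""
    n_chunks = max(1, threads * chunk_factor)
    base, extra = divmod(n_items, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]

print(chunk_sizes(10, threads=4, chunk_factor=1))  # [3, 3, 2, 2]
print(len(chunk_sizes(10, threads=4, chunk_factor=4)))  # 16 smaller chunks
```

Under this reading, `--parallel_chunk_factor 1` keeps map-style work in one chunk per worker, while the reducer default of 4 trades extra scheduling overhead for better load balance.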

Practical tuning workflow

1. Establish a single-thread baseline

csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 1 \
  --output_stat any2any,any2dif,any2spe \
  --branch_dist no \
  --calibrate_longtail no

2. Sweep thread counts

Compare --threads 1, 2, 4, and 8 with all other flags held fixed.
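Once the sweep finishes, it helps to summarize it as speedup and parallel efficiency relative to the 1-thread baseline. The helper below is a generic sketch (the timings are placeholder numbers, not real measurements).

```python
# Compute speedup and efficiency from measured wall-clock times of a
# thread sweep. The example timings are hypothetical placeholders.

def scaling_report(times_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    baseline = times_by_threads[1]
    return {
        n: (baseline / t, baseline / t / n)
        for n, t in sorted(times_by_threads.items())
    }

# Hypothetical `real` seconds for --threads 1, 2, 4, 8:
report = scaling_report({1: 120.0, 2: 66.0, 4: 40.0, 8: 31.0})
for n, (speedup, eff) in report.items():
    print(f"threads={n}: speedup={speedup:.2f}x, efficiency={eff:.0%}")
```

Efficiency well below 100% at low thread counts usually points at serial sections or workloads below the `--parallel_min_items_*` thresholds.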

3. Measure runtime and peak RAM together

On macOS:

/usr/bin/time -l -p csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 4 \
  --branch_dist no \
  --calibrate_longtail no \
  --sub_tensor_backend sparse \
  --parallel_backend auto

Track:

  • real for wall-clock runtime
  • maximum resident set size for peak RAM
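Both numbers can be pulled out of the `/usr/bin/time -l -p` stderr output mechanically. A minimal parsing sketch (the sample text is a shortened, hypothetical excerpt of the real output; on macOS the resident-set figure is reported in bytes):

```python
import re

def parse_time_l(text):
    """Extract wall-clock seconds and peak RSS bytes from /usr/bin/time -l -p output."""
    real = float(re.search(r"^real\s+([\d.]+)", text, re.M).group(1))
    rss = int(re.search(r"^\s*(\d+)\s+maximum resident set size", text, re.M).group(1))
    return real, rss

sample = """\
real 41.72
user 150.11
sys 3.02
  2147483648  maximum resident set size
       12345  page reclaims
"""
print(parse_time_l(sample))  # (41.72, 2147483648)
```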

4. If scaling is poor

Try the following in order:

  1. keep --parallel_backend multiprocessing (or auto)
  2. compare --sub_tensor_backend dense, sparse, and auto
  3. tune --parallel_chunk_factor and --parallel_chunk_factor_reducer
  4. raise or lower the --parallel_min_items_* thresholds so small workloads do not pay process-startup overhead

What the thresholds do

Each --parallel_min_items_* or --parallel_min_rows_* option defines when a given workload is large enough to parallelize. The matching --parallel_min_items_per_job_* option defines the minimum work that should be assigned to each worker.

That means you can tune two things independently:

  • when parallelism turns on
  • how coarsely the workload is split once parallelism is enabled
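The two knobs can be sketched as a small gating function. This is an ASSUMED model of the gating logic for illustration only; the function name and exact formulas are not taken from the CSUBST source, but they capture the two independent decisions described above.

```python
# Sketch of two-knob parallelism gating: min_items decides WHETHER to
# parallelize; min_items_per_job caps HOW MANY jobs are worth launching.
# Illustrative semantics, not CSUBST's actual implementation.

def plan_jobs(n_items, threads, min_items, min_items_per_job):
    """Return the number of worker jobs to launch (1 = stay serial)."""
    if threads <= 1 or n_items < min_items:
        return 1  # workload too small: skip process-startup overhead
    # Cap the job count so each job still receives min_items_per_job items.
    return max(1, min(threads, n_items // min_items_per_job))

print(plan_jobs(500,  8, min_items=1000, min_items_per_job=200))  # 1: below threshold
print(plan_jobs(5000, 8, min_items=1000, min_items_per_job=200))  # 8: plenty of work
print(plan_jobs(1200, 8, min_items=1000, min_items_per_job=500))  # 2: coarser jobs
```

In this model, raising `min_items` keeps small workloads serial, while raising `min_items_per_job` coarsens the split without changing when parallelism turns on.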

Accuracy guardrails after tuning

After changing backend or chunking settings, compare outputs from:

  • dense vs sparse reducers
  • one thread vs many threads

Recommended comparisons on csubst_cb_2.tsv:

  • omegaC*
  • core OC*, EC*, dNC*, dSC* columns
  • branch-level substitution counters (S_sub_*, N_sub_*)

Expected behavior:

  • omegaC* and the core OC*/EC*/dNC*/dSC* statistics should match exactly or differ only by tiny floating-point noise
  • branch-level accumulations may show extremely small differences due to reduction order
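A tolerance-based comparison makes "tiny floating-point noise" concrete. The sketch below uses generic placeholder values rather than real csubst output; for full `csubst_cb_2.tsv` files the same idea applies column by column.

```python
# Compare a numeric column from a dense run against a sparse run, allowing
# only floating-point noise from differing reduction orders. Values are
# placeholders, not real csubst output.
import math

def columns_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )

dense  = [0.873214, 1.204117, 0.356902]
sparse = [0.873214, 1.204117 + 1e-13, 0.356902]
print(columns_match(dense, sparse))  # True: only reduction-order noise
```

Anything that fails at these tolerances deserves investigation before the tuning change is adopted.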

Environment notes

  • Benchmark the CLI directly rather than through inline `python - <<'PY'` heredoc snippets when using multiprocessing; on macOS the spawn start method re-imports the main module, which scripts fed via stdin cannot provide.
  • On macOS, /usr/bin/time -l is the easiest way to collect peak-RAM numbers.

Minimal benchmark matrix template

| sub_tensor_backend | threads | parallel_backend | real sec | peak RAM bytes |
| --- | --- | --- | --- | --- |
| dense | 1 | auto | ... | ... |
| dense | 2 | auto | ... | ... |
| sparse | 1 | auto | ... | ... |
| sparse | 2 | auto | ... | ... |

Keeping this table in issue comments or release notes makes regressions much easier to spot.
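Filling in the table by hand gets tedious across sweeps; a small helper can render measured rows straight into Markdown for pasting. This is an optional convenience sketch with placeholder values, not part of CSUBST.

```python
# Render benchmark rows as a Markdown table matching the matrix above.
# Row values here are hypothetical placeholders.

HEADER = ["sub_tensor_backend", "threads", "parallel_backend", "real sec", "peak RAM bytes"]

def render_matrix(rows):
    lines = ["| " + " | ".join(HEADER) + " |",
             "| " + " | ".join("---" for _ in HEADER) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

print(render_matrix([
    ("dense",  1, "auto", 120.4, 2147483648),
    ("sparse", 1, "auto",  95.1, 1073741824),
]))
```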

Clone this wiki locally