Parallel processing and performance tuning

Kenji Fukushima edited this page Mar 6, 2026 · 7 revisions

This page explains how parallel processing works in current CSUBST releases and how to benchmark csubst search reproducibly.

Scope

The options below mainly affect csubst search, especially:

  • substitution-tensor generation
  • expected-state projection
  • branch-combination generation
  • cb / cbs reducer workloads
  • branch-distance calculations

Relevant CLI options

Core controls:

  • --threads
  • --parallel_backend auto|multiprocessing|threading
  • --parallel_chunk_factor
  • --parallel_chunk_factor_reducer
  • --sub_tensor_backend auto|dense|sparse
  • --sub_tensor_sparse_density_cutoff
  • --sub_tensor_auto_sparse_min_elements
  • --output_stat

Threshold-style controls:

  • --parallel_min_items_sub_tensor
  • --parallel_min_items_per_job_sub_tensor
  • --parallel_min_items_node_union
  • --parallel_min_items_per_job_node_union
  • --parallel_min_items_nc_matrix
  • --parallel_min_items_per_job_nc_matrix
  • --parallel_min_items_cb
  • --parallel_min_items_per_job_cb
  • --parallel_min_rows_cbs
  • --parallel_min_rows_per_job_cbs
  • --parallel_min_items_branch_dist
  • --parallel_min_items_per_job_branch_dist
  • --parallel_min_items_expected_state
  • --parallel_min_items_per_job_expected_state

Current defaults worth remembering:

  • --parallel_backend auto resolves to multiprocessing
  • --parallel_chunk_factor 1
  • --parallel_chunk_factor_reducer 4
  • --sub_tensor_backend auto
  • --output_stat any2any,any2dif,any2spe

--output_stat materially changes workload size, so always keep it fixed while benchmarking.
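The chunk-factor defaults above are easier to reason about with a small sketch. This is an illustrative model of how a chunk factor is commonly used to split work across workers; the function name and exact splitting formula are assumptions for illustration, not taken from the CSUBST source.

```python
# ASSUMED semantics: a chunk factor of k splits N work items into roughly
# threads * k chunks, so larger factors give finer-grained scheduling at the
# cost of more per-chunk overhead. Illustrative only.

def chunk_sizes(n_items, threads, chunk_factor):
    """Split n_items into roughly threads * chunk_factor chunks."""
    n_chunks = max(1, threads * chunk_factor)
    base, extra = divmod(n_items, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]

print(chunk_sizes(10, threads=4, chunk_factor=1))  # [3, 3, 2, 2]
print(len(chunk_sizes(10, threads=4, chunk_factor=4)))  # 16 smaller chunks
```

Under this reading, `--parallel_chunk_factor 1` keeps map-style work in one chunk per worker, while the reducer default of 4 trades extra scheduling overhead for better load balance.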

Practical tuning workflow

1. Establish a single-thread baseline

csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 1 \
  --output_stat any2any,any2dif,any2spe \
  --branch_dist no \
  --calibrate_longtail no

2. Sweep thread counts

Compare --threads 1, 2, 4, and 8 with all other flags held fixed.
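Once the sweep finishes, it helps to summarize it as speedup and parallel efficiency relative to the 1-thread baseline. The helper below is a generic sketch (the timings are placeholder numbers, not real measurements).

```python
# Compute speedup and efficiency from measured wall-clock times of a
# thread sweep. The example timings are hypothetical placeholders.

def scaling_report(times_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    baseline = times_by_threads[1]
    return {
        n: (baseline / t, baseline / t / n)
        for n, t in sorted(times_by_threads.items())
    }

# Hypothetical `real` seconds for --threads 1, 2, 4, 8:
report = scaling_report({1: 120.0, 2: 66.0, 4: 40.0, 8: 31.0})
for n, (speedup, eff) in report.items():
    print(f"threads={n}: speedup={speedup:.2f}x, efficiency={eff:.0%}")
```

Efficiency well below 100% at low thread counts usually points at serial sections or workloads below the `--parallel_min_items_*` thresholds.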

3. Measure runtime and peak RAM together

On macOS:

/usr/bin/time -l -p csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --threads 4 \
  --branch_dist no \
  --calibrate_longtail no \
  --sub_tensor_backend sparse \
  --parallel_backend auto

Track:

  • real for wall-clock runtime
  • maximum resident set size for peak RAM
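Both numbers can be pulled out of the `/usr/bin/time -l -p` stderr output mechanically. A minimal parsing sketch (the sample text is a shortened, hypothetical excerpt of the real output; on macOS the resident-set figure is reported in bytes):

```python
import re

def parse_time_l(text):
    """Extract wall-clock seconds and peak RSS bytes from /usr/bin/time -l -p output."""
    real = float(re.search(r"^real\s+([\d.]+)", text, re.M).group(1))
    rss = int(re.search(r"^\s*(\d+)\s+maximum resident set size", text, re.M).group(1))
    return real, rss

sample = """\
real 41.72
user 150.11
sys 3.02
  2147483648  maximum resident set size
       12345  page reclaims
"""
print(parse_time_l(sample))  # (41.72, 2147483648)
```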

4. If scaling is poor

Try the following in order:

  1. keep --parallel_backend multiprocessing (or auto)
  2. compare --sub_tensor_backend dense, sparse, and auto
  3. tune --parallel_chunk_factor and --parallel_chunk_factor_reducer
  4. raise or lower the --parallel_min_items_* thresholds so small workloads do not pay process-startup overhead

What the thresholds do

Each --parallel_min_items_* or --parallel_min_rows_* option defines when a given workload is large enough to parallelize. The matching --parallel_min_items_per_job_* option defines the minimum work that should be assigned to each worker.

That means you can tune two things independently:

  • when parallelism turns on
  • how coarsely the workload is split once parallelism is enabled
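The two knobs can be sketched as a small gating function. This is an ASSUMED model of the gating logic for illustration only; the function name and exact formulas are not taken from the CSUBST source, but they capture the two independent decisions described above.

```python
# Sketch of two-knob parallelism gating: min_items decides WHETHER to
# parallelize; min_items_per_job caps HOW MANY jobs are worth launching.
# Illustrative semantics, not CSUBST's actual implementation.

def plan_jobs(n_items, threads, min_items, min_items_per_job):
    """Return the number of worker jobs to launch (1 = stay serial)."""
    if threads <= 1 or n_items < min_items:
        return 1  # workload too small: skip process-startup overhead
    # Cap the job count so each job still receives min_items_per_job items.
    return max(1, min(threads, n_items // min_items_per_job))

print(plan_jobs(500,  8, min_items=1000, min_items_per_job=200))  # 1: below threshold
print(plan_jobs(5000, 8, min_items=1000, min_items_per_job=200))  # 8: plenty of work
print(plan_jobs(1200, 8, min_items=1000, min_items_per_job=500))  # 2: coarser jobs
```

In this model, raising `min_items` keeps small workloads serial, while raising `min_items_per_job` coarsens the split without changing when parallelism turns on.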

Accuracy guardrails after tuning

After changing backend or chunking settings, compare outputs from:

  • dense vs sparse reducers
  • one thread vs many threads

Recommended comparisons on csubst_cb_2.tsv:

  • omegaC*
  • core OC*, EC*, dNC*, dSC* columns
  • branch-level substitution counters (S_sub_*, N_sub_*)

Expected behavior:

  • omegaC* and the core OC*/EC*/dNC*/dSC* statistics should match exactly or differ only by tiny floating-point noise
  • branch-level accumulations may show extremely small differences due to reduction order
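A tolerance-based comparison makes "tiny floating-point noise" concrete. The sketch below uses generic placeholder values rather than real csubst output; for full `csubst_cb_2.tsv` files the same idea applies column by column.

```python
# Compare a numeric column from a dense run against a sparse run, allowing
# only floating-point noise from differing reduction orders. Values are
# placeholders, not real csubst output.
import math

def columns_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )

dense  = [0.873214, 1.204117, 0.356902]
sparse = [0.873214, 1.204117 + 1e-13, 0.356902]
print(columns_match(dense, sparse))  # True: only reduction-order noise
```

Anything that fails at these tolerances deserves investigation before the tuning change is adopted.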

Environment notes

  • Benchmark the CLI directly rather than through inline `python - <<'PY'` heredoc snippets when using multiprocessing; on macOS the spawn start method re-imports the main module, which scripts fed via stdin cannot provide.
  • On macOS, /usr/bin/time -l is the easiest way to collect peak-RAM numbers.

Minimal benchmark matrix template

| sub_tensor_backend | threads | parallel_backend | real sec | peak RAM bytes |
| --- | --- | --- | --- | --- |
| dense | 1 | auto | ... | ... |
| dense | 2 | auto | ... | ... |
| sparse | 1 | auto | ... | ... |
| sparse | 2 | auto | ... | ... |

Keeping this table in issue comments or release notes makes regressions much easier to spot.
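Filling in the table by hand gets tedious across sweeps; a small helper can render measured rows straight into Markdown for pasting. This is an optional convenience sketch with placeholder values, not part of CSUBST.

```python
# Render benchmark rows as a Markdown table matching the matrix above.
# Row values here are hypothetical placeholders.

HEADER = ["sub_tensor_backend", "threads", "parallel_backend", "real sec", "peak RAM bytes"]

def render_matrix(rows):
    lines = ["| " + " | ".join(HEADER) + " |",
             "| " + " | ".join("---" for _ in HEADER) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

print(render_matrix([
    ("dense",  1, "auto", 120.4, 2147483648),
    ("sparse", 1, "auto",  95.1, 1073741824),
]))
```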

Clone this wiki locally