Benchmarking the Two-Locus Framework and LD Calculator

To assess how the two-locus framework performs relative to the LD Calculator, I ran a series of benchmarks. In our initial benchmarks, the two-locus framework was considerably slower than the LD Calculator (up to 10x slower). We fixed some of the low-hanging inefficiencies in the code, and I'm happy to report that the two-locus framework is now within 2x the speed of the LD Calculator, and in some cases faster. I have written a document that details the changes and the reasoning behind them: [benchmarks.pdf](https://github.com/user-attachments/files/22598838/benchmarks.pdf). For brevity, the highlights are listed below.

Here is a brief description of the changes that we made to increase performance:
1. **Base case**: The original code that is currently in tskit ([link](https://github.com/lkirk/tskit/commit/9821725e4706f61baa005e3def6fad354083973c)).
2. **Malloc out of hot loop**: Preallocate a structure of arrays used for
   temporary calculations instead of allocating and freeing for every pair of
   sites ([link](https://github.com/lkirk/tskit/pull/12/commits/462b6d936e088b5b72e4e2197a2201ee7d134670)).
3. **Refactor Bit Arrays**: Refactor bit array interface, removing the need for
   temporary arrays in many cases. All functions now take a row index as a
   parameter ([link](https://github.com/lkirk/tskit/pull/13/commits/8971aa9ef1d0c8bc422034bc1fb152746dfd24f3)).
4. **Precompute Allele Counts and Biallelic Summary Function**: Store precomputed
   bit arrays for each sample set and allele and the count of samples for each
   allele. Introduce a biallelic summary function that avoids multiple redundant
   computations, leaving the original normalized summary function for
   multiallelic sites ([link](https://github.com/lkirk/tskit/pull/14/commits/eb568405b371b8314e9ad6d4a1f928169b38b9eb)).
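To make the idea behind changes 3 and 4 concrete, here is a minimal Python sketch (not the tskit C implementation). It assumes each site's genotypes are available as a list of allele indices, uses Python integers as stand-ins for bit arrays, and uses $r^2$ purely as an illustrative statistic. The names `allele_bitsets` and `biallelic_r2` are hypothetical. Per-allele bit sets and counts are computed once per site, so each pair of biallelic sites needs only one AND and one popcount.

```python
def popcount(bits):
    """Number of set bits (samples) in an integer used as a bit array."""
    return bin(bits).count("1")


def allele_bitsets(genotypes):
    """Precompute, for one site, allele -> (sample bit set, sample count)."""
    bits = {}
    for sample, allele in enumerate(genotypes):
        bits[allele] = bits.get(allele, 0) | (1 << sample)
    return {allele: (b, popcount(b)) for allele, b in bits.items()}


def biallelic_r2(site_a, site_b, num_samples):
    """r^2 between allele 1 at two biallelic sites, from precomputed data."""
    bits_a, n_a = site_a[1]
    bits_b, n_b = site_b[1]
    # The only per-pair work: intersect the two sample sets and count.
    n_ab = popcount(bits_a & bits_b)
    p_a = n_a / num_samples
    p_b = n_b / num_samples
    d = n_ab / num_samples - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical toy data: 3 biallelic sites, 4 haploid samples.
genotype_matrix = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]
# Precomputation happens once per site, outside the pairwise loop.
sites = [allele_bitsets(g) for g in genotype_matrix]
r2_matrix = [[biallelic_r2(a, b, 4) for b in sites] for a in sites]
```

In the toy data, sites 0 and 2 carry allele 1 on disjoint sample sets of equal size, so their $r^2$ is 1.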

Python and C tests are all passing for each of these patches.

We benchmark the changes cumulatively (each change is layered on top of the previous ones) on a set of tree sequences with the following parameters:
|parameter|value|
|---|---|
|r (recombination rate)|1e-8|
|$N_e$ (effective population size)|1e4|
|L (sequence length)|2e6|

We use `ploidy=1` with sample sizes ranging from $10^2$ to $10^5$ and sample 15 replicates to obtain these results:

<img width="1624" height="952" alt="Image" src="https://github.com/user-attachments/assets/11c38cc3-6a60-47ef-9d8b-fe41cd4f55c5" />

In these plots, the `Relative difference` is $(\textrm{tl} - \textrm{ldc}) / \textrm{ldc}$, where ldc is the LD Calculator and tl is the two-locus framework.
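As a reading aid for this metric, here is a small Python sketch. It assumes the plotted quantity is wall-clock runtime and that replicates are timed independently. The helper names `time_replicates` and `relative_difference` are hypothetical and are not part of the benchmark code linked above.

```python
import time


def time_replicates(func, replicates=15):
    """Time `func` once per replicate and return the elapsed seconds."""
    timings = []
    for _ in range(replicates):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def relative_difference(tl, ldc):
    """(tl - ldc) / ldc, with ldc (the LD Calculator) as the baseline."""
    return (tl - ldc) / ldc


# Positive values: the two-locus framework is slower than the baseline.
print(relative_difference(tl=2.0, ldc=1.0))  # 1.0, i.e. 2x the runtime
# Negative values: the two-locus framework is faster than the baseline.
print(relative_difference(tl=0.5, ldc=1.0))  # -0.5, i.e. half the runtime
```

Under that reading, the "within 2x" result corresponds to a relative difference of at most 1.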

Next, we generated a larger tree sequence with `L=6e6` and sample sizes ranging from $10^3$ to $10^6$. Comparing all optimizations on the smaller (panel **A**) and larger (panel **B**) tree sequences, we get the following results:

<img width="1139" height="1129" alt="Image" src="https://github.com/user-attachments/assets/e5dfa9e2-0cf4-4dbf-a6dc-38bd0ecaf7d2" />

I'm not sure that there's any desire to deprecate or remove the LD Calculator at this point, but this at least provides some real-world numbers on the relative performance of the two methods. I'd like to incorporate these patches into the codebase if at all possible.

cc @jeromekelleher @petrelharp @apragsdale
