Skip to content

Commit 7ec1ff3

Browse files
lkirkjeromekelleher
authored andcommitted
Implement a two-locus C API for count statistics.
We also fully implement two-site statistics, mirroring the functionality of the LdCalculator. There are a number of tasks that need to be done before this is production ready, but it is a start. We've opened a number of issues in github for improving the work that was done for this PR and for implementing the python API. The general approach for the two-site statistics is to gather all of the samples below each mutation over the whole tree sequence (we will be implementing windows in the near future). To do this, we iterate over each tree in the tree sequence once and perform an in-order traversal starting at each mutation's node, storing the samples we encounter along the way. We then perform another step that examines the parentage of each mutation and removes samples that exist in child mutations. With the samples for each allele, we perform intersections to get haplotype weights, which are then passed to summary functions that fill out a stats matrix. We also implement a few happy-path tests that test the functionality of sample sets and multi-allelic sites. We also cover all of the non memory-related error cases in the C api. We plan to do more in-depth testing once the python API is implemented. This PR also introduces the concept of a "bit array" in the form of tsk_bit_array_t, which needs some further refinement. In essence, this data structure stores all of the possible samples in a bit set where one sample takes up one bit of information and storage space. This enables a few functions that are critical to this code: 1) it's a relatively memory efficient method of storing sample identities 2) we can efficiently intersect the arrays (I believe much of this is autovectorized by gcc) 3) we can use tricks like SWAR (SIMD within an array) to count the number of samples in a set. It's important to note that this is also where the algorithm spends ~50% of it's time. It can use more improvement in both the API design and the performance characteristics.
1 parent d57ec83 commit 7ec1ff3

File tree

6 files changed

+1473
-1
lines changed

6 files changed

+1473
-1
lines changed

c/tests/test_core.c

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -526,6 +526,84 @@ test_avl_random(void)
526526
validate_avl(sizeof(keys) / sizeof(*keys), keys);
527527
}
528528

529+
static void
530+
test_bit_arrays(void)
531+
{
532+
// NB: This test is only valid for the 32 bit implementation of bit arrays. If we
533+
// were to change the chunk size of a bit array, we'd need to update these tests
534+
tsk_bit_array_t arr;
535+
tsk_bit_array_init(&arr, 90, 1);
536+
for (tsk_bit_array_value_t i = 0; i < 20; i++) {
537+
tsk_bit_array_add_bit(&arr, i);
538+
}
539+
tsk_bit_array_add_bit(&arr, 63);
540+
tsk_bit_array_add_bit(&arr, 65);
541+
542+
// these assertions are only valid for 32-bit values
543+
CU_ASSERT_EQUAL_FATAL(arr.data[0], 1048575);
544+
CU_ASSERT_EQUAL_FATAL(arr.data[1], 2147483648);
545+
CU_ASSERT_EQUAL_FATAL(arr.data[2], 2);
546+
547+
// verify our assumptions about bit array counting
548+
CU_ASSERT_EQUAL_FATAL(tsk_bit_array_count(&arr), 22);
549+
550+
tsk_bit_array_free(&arr);
551+
552+
// create a length-2 array with 64 bit capacity
553+
tsk_bit_array_init(&arr, 64, 2);
554+
tsk_bit_array_t arr_row1, arr_row2;
555+
556+
// select the first and second row
557+
tsk_bit_array_get_row(&arr, 0, &arr_row1);
558+
tsk_bit_array_get_row(&arr, 1, &arr_row2);
559+
560+
// fill the first 50 bits of the first row
561+
for (tsk_bit_array_value_t i = 0; i < 50; i++) {
562+
tsk_bit_array_add_bit(&arr_row1, i);
563+
}
564+
// fill bits 20-40 of the second row
565+
for (tsk_bit_array_value_t i = 20; i < 40; i++) {
566+
tsk_bit_array_add_bit(&arr_row2, i);
567+
}
568+
569+
// verify our assumptions about row selection
570+
CU_ASSERT_EQUAL_FATAL(arr.data[0], 4294967295);
571+
CU_ASSERT_EQUAL_FATAL(arr.data[1], 262143);
572+
CU_ASSERT_EQUAL_FATAL(arr_row1.data[0], 4294967295);
573+
CU_ASSERT_EQUAL_FATAL(arr_row1.data[1], 262143);
574+
575+
CU_ASSERT_EQUAL_FATAL(arr.data[2], 4293918720);
576+
CU_ASSERT_EQUAL_FATAL(arr.data[3], 255);
577+
CU_ASSERT_EQUAL_FATAL(arr_row2.data[0], 4293918720);
578+
CU_ASSERT_EQUAL_FATAL(arr_row2.data[1], 255);
579+
580+
// subtract the second from the first row, store in first
581+
tsk_bit_array_subtract(&arr_row1, &arr_row2);
582+
583+
// verify our assumptions about subtraction
584+
CU_ASSERT_EQUAL_FATAL(arr_row1.data[0], 1048575);
585+
CU_ASSERT_EQUAL_FATAL(arr_row1.data[1], 261888);
586+
587+
tsk_bit_array_t int_result;
588+
tsk_bit_array_init(&int_result, 64, 1);
589+
590+
// their intersection should be zero
591+
tsk_bit_array_intersect(&arr_row1, &arr_row2, &int_result);
592+
CU_ASSERT_EQUAL_FATAL(int_result.data[0], 0);
593+
CU_ASSERT_EQUAL_FATAL(int_result.data[1], 0);
594+
595+
// now, add them back together, storing back in a
596+
tsk_bit_array_add(&arr_row1, &arr_row2);
597+
598+
// now, their intersection should be the subtracted chunk (20-40)
599+
tsk_bit_array_intersect(&arr_row1, &arr_row2, &int_result);
600+
CU_ASSERT_EQUAL_FATAL(int_result.data[0], 4293918720);
601+
CU_ASSERT_EQUAL_FATAL(int_result.data[1], 255);
602+
603+
tsk_bit_array_free(&int_result);
604+
tsk_bit_array_free(&arr);
605+
}
606+
529607
static void
530608
test_meson_version(void)
531609
{
@@ -554,6 +632,7 @@ main(int argc, char **argv)
554632
{ "test_avl_sequential", test_avl_sequential },
555633
{ "test_avl_interleaved", test_avl_interleaved },
556634
{ "test_avl_random", test_avl_random },
635+
{ "test_bit_arrays", test_bit_arrays },
557636
{ "test_meson_version", test_meson_version },
558637
{ NULL, NULL },
559638
};

0 commit comments

Comments
 (0)