Skip to content

Commit 097b928

Browse files
arthur gillyclaude
andcommitted
Fix dichotomization method, add tiered precision, allele harmonization in first pass
- Add --use-beta-sign flag implementing the Southam et al. 2017 dichotomization method (sign of beta instead of p-value transform) for correlation estimation. The old method severely underestimates inter-study correlation under the null. - Add tiered arbitrary precision: long double -> 200 -> 500 -> 1000 digits, with safe fallbacks for p-values too extreme for any tier. - Add allele harmonization in first pass so that --use-beta-sign works correctly when effect alleles differ between input files. - Fix missing return in initialise() (undefined behavior, caused crashes). - Update README with new build instructions and --use-beta-sign option. - Add test_data/ with GIANT Height analysis notebook, null simulation script, and generate.sh to reproduce all data files from scratch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ba474d5 commit 097b928

File tree

8 files changed

+1021
-61
lines changed

8 files changed

+1021
-61
lines changed

CHANGELOG.md

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,68 @@
1+
# Changelog
2+
3+
## v1.1.0
4+
5+
### Bug fix: dichotomization method for correlation estimation
6+
7+
The original implementation used `b_transform(p) = Phi^{-1}(1-p) <= 0` to dichotomize
8+
p-values before computing tetrachoric correlation between studies. This ignores the
9+
direction of effect and severely underestimates the inter-study correlation,
10+
leading to undercorrection for sample overlap.
11+
12+
The paper (Southam et al. 2017) describes a different procedure:
13+
`z_k = Phi^{-1}(p_k/2) * sgn(beta_k)`, then dichotomize on the sign of the
14+
resulting z-score. This method correctly recovers the inter-study correlation
15+
under the null hypothesis.
16+
17+
**Null simulation results** (100,000 variants under the null):
18+
19+
| True rho | Old method rho-hat | New method rho-hat | Old lambda_GC | New lambda_GC |
20+
|----------|--------------------|--------------------|---------------|---------------|
21+
| 0.0 | -0.005 | 0.006 | 1.017 | 1.011 |
22+
| 0.1 | 0.008 | 0.097 | 1.090 | 1.042 |
23+
| 0.3 | 0.054 | 0.284 | 1.258 | 1.129 |
24+
| 0.5 | 0.159 | 0.475 | 1.375 | 1.181 |
25+
| 0.7 | 0.360 | 0.670 | 1.401 | 1.185 |
26+
27+
The new method is enabled with `--use-beta-sign` and will become the default in a
28+
future version. The old method remains the default for backwards compatibility.
29+
30+
Credit: Brady Ryan (University of Michigan) identified the discrepancy between the
31+
paper and the implementation.
32+
33+
### Bug fix: allele harmonization in first pass
34+
35+
The first pass (correlation matrix estimation) did not harmonize alleles between
36+
input files. When effect alleles were swapped between studies, the sign of beta
37+
was incorrect, producing wrong tetrachoric correlation estimates. The first pass
38+
now detects allele flips and negates beta accordingly, mirroring the allele
39+
harmonization already present in the second pass (meta-analysis).
40+
41+
### Bug fix: missing return in `initialise()`
42+
43+
The `initialise()` function (`int inline initialise(int, char**)`) had no `return`
44+
statement. This is undefined behavior in C++ and caused crashes (double-free) on
45+
some compilers with `-O3`.
46+
47+
### Improvement: tiered arbitrary precision arithmetic
48+
49+
The previous implementation had a single fallback from `long double` to
50+
`cpp_dec_float<200>`, which overflowed for p-values below ~1e-200.
51+
52+
All four transform functions (`b_transform`, `z_transform` x2, `z_transform_fess`)
53+
now use a 4-tier precision cascade:
54+
55+
```
56+
long double -> cpp_dec_float<200> -> cpp_dec_float<500> -> cpp_dec_float<1000> -> safe fallback
57+
```
58+
59+
The final fallback caps z-scores at +/-38 (equivalent to p ~ 1e-315), which is
60+
reached only for the most extreme p-values in GWAS (e.g. GPC3 height locus at
61+
p = 2.8e-1135).
62+
63+
### Build changes
64+
65+
- Makefile now uses system Boost (`/usr/include`, `lib/`) instead of hardcoded
66+
Sanger cluster paths. Original paths preserved as comments.
67+
- Removed `-static` flag (dynamic linking). For static builds, pass
68+
`CFLAGS="-O3 -std=c++11 -static"` to make.

CLAUDE.md

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
# CLAUDE.md
2+
3+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4+
5+
## Project Overview
6+
7+
METACARPA (META-analysis in C++ Accounting for Relatedness, using arbitrary Precision Arithmetic) is a C++11 bioinformatics tool for meta-analysing genetic association studies with overlapping or related samples when the degree of overlap is unknown. It combines the p-value correction method of Province and Borecki (2013) with the effect-size method of Lin and Sullivan (2009).
8+
9+
## Build
10+
11+
The project is a single C++ source file compiled via Make. Boost 1.60.0+ is required (program_options, serialization, multiprecision, uBLAS, math).
12+
13+
```bash
14+
# Build (produces static Linux x64 binary at src/metacarpa)
15+
make -C src/
16+
17+
# The Makefile IDIR and LDIR variables must point to your Boost headers and libs
18+
```
19+
20+
Compiler flags: `-O3 -std=c++11 -static`. Links against: `boost_program_options`, `boost_serialization`, `pthread`.
21+
22+
## Architecture
23+
24+
The entire implementation lives in a single file: `src/metacarpa.cpp` (~1850 lines).
25+
26+
**Two-phase algorithm:**
27+
1. **Matrix calculation** — Reads all input files, identifies overlapping variant positions across studies, and computes a variance-covariance correlation matrix using tetrachoric correlation of binarized p-values. The matrix is serialized to disk (`.matrix.txt`) for reuse across runs.
28+
2. **Single-point meta-analysis** — Re-reads input files and performs per-variant meta-analysis using the precomputed correlation matrix, producing corrected effect sizes, Z-scores, and p-values.
29+
30+
**Key components in `metacarpa.cpp`:**
31+
- `studies_correlation` class — Stores and manages correlation matrices indexed by study "mask" (bitmask of which studies contribute to a variant). Handles Boost serialization (XML/text).
32+
- `count_data` struct (nested in `studies_correlation`) — 2x2 contingency tables for tetrachoric correlation.
33+
- `output_record` struct — Holds per-variant meta-analysis results.
34+
- `meta_analyse()` — Core computation: inverts correlation submatrix, computes corrected Z-scores and effect sizes.
35+
- `z_transform()` / `z_transform_fess()` — P-value to Z-score conversion with automatic fallback to `cpp_dec_float<200>` arbitrary precision for extremely small p-values (< ~1e-29).
36+
- `InvertMatrix()` — LU-decomposition matrix inversion via Boost uBLAS.
37+
- `tetrachoric()` — Estimates correlation between studies from binary-transformed p-value vectors.
38+
39+
**Data flow:** Parse CLI args → open sorted input files → first pass builds correlation matrix → serialize matrix → second pass performs per-variant meta-analysis → write tab-delimited output.
40+
41+
## Key Constraints
42+
43+
- Input files must be sorted by chromosome then position.
44+
- Alleles are case-sensitive (expects capitals); allele flipping is supported but strand flipping is not.
45+
- Sample sizes can be per-variant (via `--size-col`) or constant (appended to filename: `-I file,N`).
46+
- The correlation matrix can be precomputed with `-x` (stop after matrix) and reused with `-m` for different traits or chunked data.
47+
48+
## No Test Suite
49+
50+
There is no automated test framework. Validation is done externally via the [metacarpa-simulation](https://bitbucket.org/agilly/metacarpa-simulation) project.

README.md

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,9 +15,24 @@ METACARPA is compiled as a static Linux x64 executable. You can download it in t
1515
1616
### Compilation
1717

18-
The src directory contains a single source file and Makefile. The [Makefile](src/Makefile) is written for a compiler that supports C++ 11. Please note that you will also need a copy of the [Boost](http://www.boost.org) mathematical libraries on your machine (the binaries were compiled using Boost 1.60.0).
18+
The src directory contains a single source file and Makefile. A C++11 compiler and [Boost](http://www.boost.org) 1.60.0+ are required (headers + `program_options` and `serialization` libraries).
1919

20-
After compiling the Boost libraries, you will be provided with the `-I` and `-L` arguments to change in the Makefile.
20+
On Debian/Ubuntu:
21+
```bash
22+
sudo apt-get install libboost-program-options-dev libboost-serialization-dev
23+
```
24+
25+
Then build:
26+
```bash
27+
cd src && make
28+
```
29+
30+
The Makefile expects Boost headers in `/usr/include` and libraries in `lib/` (relative to `src/`). Adjust `IDIR` and `LDIR` in the Makefile if your Boost installation is elsewhere.
31+
32+
For a static binary (no runtime dependencies):
33+
```bash
34+
cd src && c++ -O3 -std=c++11 -static -I/usr/include -L/usr/lib/x86_64-linux-gnu -o metacarpa metacarpa.cpp -lboost_program_options -lboost_serialization -lpthread
35+
```
2136

2237
## Program Options
2338

@@ -54,6 +69,9 @@ Options description :
5469
-x [ --stop ] Stop METACARPA after generating the matrix.
5570
-d [ --debug ] Toggles an extremely verbose output, for debugging
5671
purposes only.
72+
--use-beta-sign Use sign of beta for dichotomization in correlation
73+
matrix calculation (Southam et al. 2017), instead of
74+
the default p-value transform. Recommended.
5775
```
5876

5977
## Data preparation

src/Makefile

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,11 @@
11
#IDIR=/nfs/users/nfs_a/ag15/boost_1_55_0
2-
IDIR=/nfs/team144/software/boost_1.6/boost_1_60_0
2+
#IDIR=/nfs/team144/software/boost_1.6/boost_1_60_0
3+
IDIR=/usr/include
34
CC=c++
45
#LDIR=/nfs/users/nfs_a/ag15/boost_1_55_0/stage/lib
5-
LDIR=/nfs/team144/software/boost_1.6/boost_1_60_0/stage/lib
6-
CFLAGS=-O3 -std=c++11 -static -I$(IDIR) -L$(LDIR)
6+
#LDIR=/nfs/team144/software/boost_1.6/boost_1_60_0/stage/lib
7+
LDIR=lib
8+
CFLAGS=-O3 -std=c++11 -I$(IDIR) -L$(LDIR)
79
LIBS=-lboost_program_options -lboost_serialization -lpthread
810

911
metacapa: metacarpa.cpp

0 commit comments

Comments
 (0)