
Commit 3966deb — Merge pull request #70 from ArndtLab/diagnostics
improve interface and diagnostics
2 parents 93c6304 + 7b8df27

File tree: 13 files changed, +528 -311 lines

Project.toml (1 addition, 1 deletion)

@@ -1,6 +1,6 @@
 name = "HetDister"
 uuid = "50651ce3-0423-45d2-b99c-8ea4267d2717"
-version = "0.10.7"
+version = "0.11.0"
 authors = ["Tommaso Stentella <stentell@molgen.mpg.de> and contributors"]

 [deps]

docs/make.jl (1 addition)

@@ -15,6 +15,7 @@ makedocs(;
     pages=[
         "Home" => "index.md",
         "Tutorial" => "tutorial.md",
+        "Diagnostics" => "diagnostics.md"
     ],
     warnonly=[:missing_docs],
 )

docs/src/diagnostics.md (new file, 36 additions)
@@ -0,0 +1,36 @@
+# Diagnostics
+
+Let's say that we have inferred some demographic models and
+obtained the most probable one as explained in the [Tutorial](@ref):
+```julia
+results = demoinfer(segments, 1:8, mu, rho)
+best = compare_models(results.fits)
+```
+First we can print
+```julia
+best.converged
+```
+which indicates whether the maximum likelihood optimization
+converged. If that is not the case, we can inspect further
+```julia
+best.opt.optim_result.original
+```
+to get more details and decide whether this flag is correct.
+If the optimization truly did not converge, a greedier search may
+be needed (see [`FitOptions`](@ref)), with a larger number of
+iterations and/or more optimization time allowed.
+
+We can also compute z-score residuals to assess goodness of fit
+```julia
+wth = yth .* diff(h.edges[1])
+resid = (h.weights .- wth) ./ sqrt.(wth)
+```
+Because the probabilistic model is Poisson, bins in the tail may have
+a skewed distribution of residuals, but the others should closely
+follow a standard normal.
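The z-score construction above can be checked on synthetic data. Below is an illustrative Python sketch (not the package API); `expected` plays the role of `wth`, and the counts are drawn from the assumed Poisson model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical expected counts per bin (the role of wth above)
expected = np.linspace(1000.0, 50.0, 100)

# Observed counts drawn from the Poisson model assumed in the text
observed = rng.poisson(expected)

# z-score residuals: approximately standard normal when counts are large
resid = (observed - expected) / np.sqrt(expected)
```

With all expected counts well above ~20, `resid` should have mean near 0 and standard deviation near 1; skew only appears in sparse tail bins.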
31+
We can also assess the correlation structure of neighboring residuals
32+
```julia
33+
ps = HetDister.residstructure(resid)
34+
```
35+
This function return a vector of right tail p values from t-tests for
36+
correlation of neighbouring residuals (see the function doc).
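The sliding-window t-test for lag-1 correlation can be mirrored outside Julia. The Python sketch below (illustrative; the function name and defaults are taken from the Julia code in this commit, not an official port) uses the same statistic `t = c*sqrt((frame-2)/(1-c^2))` with a right-tail p-value from a t-distribution with `frame-2` degrees of freedom:

```python
import numpy as np
from scipy import stats

def resid_structure(residuals, frame=None):
    """Right-tail p-values from t-tests for lag-1 correlation of residuals
    in a sliding window of size `frame` (mirrors the Julia residstructure)."""
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    if frame is None:
        frame = n // 20
    ps = np.ones(n - frame)
    for i in range(len(ps)):
        # correlation between a window and its one-step shift
        c = np.corrcoef(residuals[i + 1 : i + frame + 1],
                        residuals[i : i + frame])[0, 1]
        t = c * np.sqrt((frame - 2) / (1.0 - c * c))
        ps[i] = stats.t.sf(t, df=frame - 2)  # right tail
    return ps

# On uncorrelated residuals the p-values should look roughly uniform
rng = np.random.default_rng(0)
ps = resid_structure(rng.standard_normal(200), frame=40)
```

Small p-values in a region indicate positively correlated neighboring residuals there, i.e. structure the fitted model does not capture.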

docs/src/tutorial.md (25 additions, 7 deletions)
@@ -1,4 +1,4 @@
-## Tutorial
+# Tutorial

 To run the package, first install julia ([here](https://julialang.org/downloads/)).
 To create a local environment with the package `cd` into your work directory and
@@ -18,7 +18,7 @@ You can now load the installed packages:
 using HetDister, HistogramBinnings, CSV, DataFrames, DataFramesMeta
 ```

-### Preparing input data
+## Preparing input data
 For example, suppose you have a `.vcf` file with called variants you want to analyze. Then, in the most basic case, you may compute distances between heterozygous SNPs as follows:
 ```julia
 f = "/myproject/myfavouritespecies.vcf"
@@ -41,10 +41,26 @@ ils = df.POS[2:end] .- df.POS[1:end-1]
 ```
 Now we have a vector of intervals `ils`.

-### Fitting demographic models
+## Fitting and comparing demographic models
 The tool requires three inputs: a vector of IBS segment lengths, a mutation rate and
-a recombination rate (both per bp per generation). First, IBSs need to be placed into
-a histogram.
+a recombination rate (both per bp per generation). Additionally, we need to choose
+a range of demographic models with epochs of piecewise constant effective size.
+
+In the simplest use case we can just call
+```julia
+mu = 1e-8
+rho = 1e-8
+results = demoinfer(ils, 1:8, mu, rho)
+```
+and the 8 models will be saved in `results.fits`.
+Then we can obtain the most probable model in the set with
+```julia
+best = compare_models(results.fits)
+```
+
+### More advanced options
+First, the IBS spectrum is obtained as a histogram; the binning can be
+controlled by the `adapt_histogram` function and its keyword arguments
 ```julia
 h = adapt_histogram(ils)
 mu = 1e-8
@@ -53,10 +69,12 @@ rho = 1e-8
 Then we set up the `FitOptions` object that contains several parameters for the optimization.
 We stick with default values and only initialize with the required inputs:
 ```julia
-fop = FitOptions(sum(ils), mu, rho)
+fop = FitOptions(sum(ils), length(ils), mu, rho)
 ```
 And we fit 8 different models, with a number of epochs in the range 1 to 8:
 ```julia
 results = demoinfer(h_obs, 1:8, fop)
 ```
-The fitted models can be accessed with `results.fits`.
+The fitted models can be accessed with `results.fits`.
+
+See [Diagnostics](@ref) to inspect the result and assess goodness of fit and optimization convergence.
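The interval computation in the tutorial (`df.POS[2:end] .- df.POS[1:end-1]`) is just first differences of sorted variant positions; a minimal Python equivalent with made-up positions:

```python
import numpy as np

# Hypothetical sorted positions of heterozygous SNPs on one chromosome
pos = np.array([105, 230, 231, 980, 1500])

# Distances between consecutive heterozygous sites, i.e. the tutorial's
# df.POS[2:end] .- df.POS[1:end-1] written with np.diff
ils = np.diff(pos)
print(ils)  # → [125   1 749 520]
```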

src/HetDister.jl (52 additions, 12 deletions)
@@ -20,12 +20,16 @@ include("mle_optimization.jl")
 include("sequential_fit.jl")
 include("corrections.jl")

-export pre_fit, pre_fit!, demoinfer, compare_models, correctestimate!,
-    get_para, evd, sds, pop_sizes, durations,
+export pre_fit!, demoinfer, compare_models, sample_model_epochs!,
+    correctestimate!,
+    get_para, evd, sds, pop_sizes, durations, get_covar,
     compute_residuals,
     adapt_histogram,
     FitResult, FitOptions,
-    laplacekingman, mldsmcp
+    laplacekingman, mldsmcp,
+    extbps,
+    lineages, cumulative_lineages, crediblehistory,
+    sampleN, quantilesN


 function integral_ws(edges::AbstractVector{<:Real}, mu::Float64, TN::Vector)
@@ -85,6 +89,25 @@ function compute_residuals(h1::Histogram, h2::Histogram; fc1 = 1.0, fc2 = 1.0)
     return residuals
 end

+"""
+    residstructure(residuals::AbstractVector{<:Real}; frame::Int = length(residuals)÷20)
+
+Compute the p-values for the correlation between adjacent residuals in a sliding window of size `frame`.
+The p-value is the right tail of the t-distribution.
+"""
+function residstructure(residuals::AbstractVector{<:Real};
+        frame::Int = length(residuals)÷20
+    )
+    ps = ones(length(residuals) - frame)
+    for i in eachindex(ps)
+        c = cor(residuals[i+1:i+frame], residuals[i:i+frame-1])
+        t = c * sqrt((frame - 2)/(1 - c^2))
+        p = StatsAPI.pvalue(Distributions.TDist(frame - 2), t; tail=:right)
+        ps[i] = p
+    end
+    ps
+end
+
 function CustomEdgeVector(; lo = 1, hi = 10, nbins::Integer)
     @assert (lo > 0) && (hi > 0) && (nbins > 0) && (hi > lo)
     lo = floor(Int, lo)
@@ -103,14 +126,23 @@ function CustomEdgeVector(; lo = 1, hi = 10, nbins::Integer)
 end

 """
-    adapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=800, tailthr::Int=1)
+    adapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=0, tailthr::Int=0)

 Build a histogram from `segments` logbinned between `lo` and `hi`
-with `nbins` bins.
+with `nbins` bins. `nbins` is automatically determined by default.

-The upper limit is adapted to ensure logspacing with the requested `nbins`.
+The upper limit is adapted to ensure logspacing with the requested `nbins`. The adaptive strategy is such that the
+last bin has at least `tailthr` segments.
 """
-function adapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=800, tailthr::Int=1)
+function adapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=0, tailthr::Int=0)
+    if iszero(nbins)
+        if length(segments) > 1e7
+            nbins = 1600
+        else
+            nbins = 800
+        end
+    end
+    @assert nbins > 0
     h_obs = Histogram(CustomEdgeVector(;lo, hi, nbins))
     @assert !isempty(segments)
     append!(h_obs, segments)
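The log-binning that `adapt_histogram` builds on can be sketched generically. The Python below uses `numpy.geomspace` for log-spaced edges (illustrative only; the real `CustomEdgeVector` also rounds edges to integers and adapts the upper limit):

```python
import numpy as np

# Hypothetical IBS segment lengths
segments = np.array([3, 10, 47, 120, 1000, 52, 8, 300])

# Log-spaced bin edges between lo and hi, as in a log-binned histogram
lo, hi, nbins = 1, 2000, 8
edges = np.geomspace(lo, hi, nbins + 1)

# Count segments per logarithmic bin
counts, _ = np.histogram(segments, bins=edges)
```

Log-binning keeps the bin count manageable while resolving both the many short segments and the rare long ones that carry most of the demographic signal.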
@@ -149,10 +181,18 @@ function compare_mlds(segs1::AbstractVector{<:Integer}, segs2::AbstractVector{<:
     append!(h2, segs2)
     return compare_mlds!(h1, h2, theta1, theta2)
 end
-function compare_mlds!(h1, h2, theta1, theta2) # add !
+
+"""
+    compare_mlds!(h1::Histogram, h2::Histogram, theta1::Float64, theta2::Float64)
+
+The same as `compare_mlds`, except that it takes two histograms and mutates them.
+Return values are the same as `compare_mlds`.
+"""
+function compare_mlds!(h1, h2, theta1, theta2)
     # 1 is the target lattice, i.e. with biggest theta
     length(h1.weights) == length(h2.weights) && @assert any(h1.weights .!= h2.weights)
     @assert theta1 != theta2
+    @assert any(h1.weights .> 0) && any(h2.weights .> 0)
     swap = false
     if theta1 < theta2
         temp = deepcopy(h1)
@@ -166,9 +206,9 @@ function compare_mlds!(h1, h2, theta1, theta2) # add !
     tw = zeros(Float64, length(h1.weights))
     w2 = h2.weights
     factor = sum(h1.weights) / sum(h2.weights)
-    t = 1
-    f = 1
-    while t < length(edges1) && f < length(edges2) && h1.weights[t] > 0 && w2[f] > 0
+    t = 1 # target
+    f = 1 # following
+    while t < length(edges1) && f < length(edges2)
         st, en = edges1[t], edges1[t+1]
         width = edges2[f+1] - edges2[f]
         if st <= edges2[f] < edges2[f+1] < en

@@ -200,7 +240,7 @@ function compare_mlds!(h1, h2, theta1, theta2) # add !

     rs = midpoints(h1.edges[1]) * theta1
     sigmasq = h1.weights .+ tw * factor^2
-    maxl = min(t,f,length(h1.weights),length(tw))
+    maxl = min(findlast(w2 .> 0), findlast(h1.weights .> 0))
     if swap
         return rs[1:maxl], tw[1:maxl] * factor, h1.weights[1:maxl], sigmasq[1:maxl]
     else
