petrelharp/euroibd
This is the collection of code used in the analysis of the European POPRES data (see Nelson et al., http://www.ncbi.nlm.nih.gov/pubmed/18760391; obtain access to the data at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000145.v1.p1) for the paper "The geography of recent genetic ancestry across Europe", by Peter Ralph and Graham Coop, published May 7, 2013, in PLoS Biology, at http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001555 .

To see what makes what, see the makefile. Most of the files are pretty clean, but a few have some clutter. None will actually do anything without first obtaining the POPRES data and running BEAGLE (http://faculty.washington.edu/browning/beagle/beagle.html) on them.

Please use these if they are useful, and credit us if appropriate.

--------------

If you are looking to use the inference method, look for the functions sinv() and squoosh(). Also, linv() gives the least-squares approximation, useful for finding a good starting point. For instance, see the block in inversion-writeup-plots.R that sets things up to infer histories for all the pairs of populations:

    # Set-up: the operator L and the false positive rate estimates:
    gens <- cumsum( rep(1:36, each=10) )
    L <- error.disc.trans( lenbins=lenbins, gens=gens, chrlens=.chrlens[setdiff(seq_along(.chrlens),omitchroms)] )  # this takes a while!! do it once and save.
    fp <- disc.fp( lenbins, chrlens=.chrlens )
    fp <- runmed( fp, k=15 )

    # Set-up for initial least-squares estimate:
    S <- as.vector( smooth( hist( blocks$maplen, breaks=c(lenbins,100), plot=FALSE )$counts ) )
    S <- S/mean(S)
    S[lenbins>20] <- exp( predict( lm( log(S) ~ lenbins + I(lenbins^2), subset=(lenbins>20) & (S>0) ), newdata=list(lenbins=lenbins[lenbins>20]) ) )
    est.Sigma <- function ( np, X, alpha=100 ) {
        # estimate Sigma by combining the population average with the smoothed, observed data,
        # where Sigma is the covariance matrix of X/np
        smX <- as.vector( smooth(X) )
        ppp <- ( alpha*np/(2543640+alpha*np) )  # 2543640 = total number of pairs
        Sigma <- ppp*smX + (1-ppp)*sum(smX)*S/sum(S)
        return( Sigma/np^2 )
    }
    Ksvd <- svd( 1/sqrt(as.vector(S)) * L )  # pre-compute SVD of L
    total.npairs <- choose( length(unique(blocks$id1)), 2 )
    linverse <- linv( X=lendist/total.npairs, S=as.vector(S), Sigma=est.Sigma(np=total.npairs,X=lendist), Ksvd=Ksvd, maxk=30, fp=fp )
    sminverse <- smoothed.nonneg( lS=linverse, Ksvd=Ksvd, bestk=15, lam=1 )

    # Here is where we get the actual estimate:
    ans <- sinv( lendist, sminverse, L, fp, npairs=total.npairs, lam=0, gamma=1e6 )

    # squoosh iterates sinv, looking for the smoothest solution that drops
    # no more than 2 units of log-likelihood:
    smooth <- squoosh( lendist, xinit=ans, L, fp, npairs=ans$npairs, minlen=minlen, gamma=100, lam=0, fitfn="loglik", relthresh=2/ans$npairs, twid=.05 )

--------------

Notes:

* The .gmap files provide a mapping from the endpoint indices that BEAGLE uses to genetic distances (e.g. in centiMorgans). Important: BEAGLE uses 0-based indices. (See the sketch after this list.)
* BEAGLE 4.0 is out, and is improved over the version we used (3.3).
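For illustration, here is a minimal sketch of how one might use a .gmap file to convert BEAGLE's 0-based endpoint indices into genetic positions. This is not code from the repository, and the file layout it assumes (two whitespace-separated columns: 0-based marker index, position in cM) is an assumption; check the actual .gmap files before relying on it. The helper names read.gmap and beagle.to.cM are hypothetical.

    # A minimal sketch, not the repository's code: assumes each .gmap file is
    # whitespace-separated with two columns, a 0-based marker index (as used
    # by BEAGLE) and the genetic position in centiMorgans.
    read.gmap <- function (file) {
        read.table( file, header=FALSE, col.names=c("index","cM") )
    }

    # Convert a vector of BEAGLE 0-based endpoint indices to cM positions.
    # R vectors are 1-based, hence the "+1" when indexing into the map.
    beagle.to.cM <- function (endpoints, gmap) {
        stopifnot( all( endpoints >= 0 & endpoints < nrow(gmap) ) )
        gmap$cM[ endpoints + 1 ]
    }

    # Example: the genetic length of a block with BEAGLE endpoints 120 and 345
    # would be beagle.to.cM(345, gmap) - beagle.to.cM(120, gmap).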