Skip to content

Commit 9ed60b4

Browse files
authored
Merge pull request #1 from CSB5/master
Update accordingly
2 parents 636563a + c2bc8ed commit 9ed60b4

37 files changed

+3605
-9637
lines changed

README.md

Lines changed: 23 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,10 @@
11
# BEEM
2-
BEEM is an approach to infer models for microbial community dynamics based on metagenomic sequencing data (16S or shotgun-metagenomics). It is based on the commonly used [generalized Lotka-Volterra modelling](https://en.wikipedia.org/wiki/Generalized_Lotka–Volterra_equation) (gLVM) framework. BEEM uses an iterative EM-like algorithm to simultaneously infer scaling factors (microbial biomass) and model parameters (microbial growth rate and interaction terms) and can thus work directly with the relative abundance values that are obtained with metagenomic sequencing. A preprint describing this work will be posted on bioRxiv soon.
32

4-
Note: BEEM stands for **B**iomass **E**stimation and model inference with an **E**xpectation **M**aximization-like algorithm.
3+
<img src="logo.png" height="200" align="right" />
4+
5+
BEEM is an approach to infer models for microbial community dynamics based on metagenomic sequencing data (16S or shotgun-metagenomics). It is based on the commonly used [generalized Lotka-Volterra modelling](https://en.wikipedia.org/wiki/Generalized_Lotka–Volterra_equation) (gLVM) framework. BEEM uses an iterative EM-like algorithm to simultaneously infer scaling factors (microbial biomass) and model parameters (microbial growth rate and interaction terms) from **longitudinal** data and can thus work directly with the relative abundance values that are obtained with metagenomic sequencing.
6+
7+
Note: BEEM stands for **B**iomass **E**stimation and model inference with an **E**xpectation **M**aximization-like algorithm. We have now extended the BEEM framework to be able to work with cross-sectional data (BEEM-static, check out our R package [here](https://github.com/CSB5/BEEM-static)).
58

69
## Dependencies
710

@@ -12,10 +15,19 @@ BEEM was written in R (>3.3.1) and requires the following packages:
1215
- pspline
1316
- monomvn
1417

15-
BEEM scripts can be loaded with the following command in R:
18+
The BEEM functions can be loaded in R directly with the following commands:
19+
20+
```r
21+
beem = RCurl::getURL("https://raw.githubusercontent.com/CSB5/BEEM/master/emFunctions.r", ssl.verifypeer = FALSE)
22+
eval(parse(text = beem))
23+
```
24+
25+
Alternatively the repository together with the example data can be cloned/downloaded. The functions are then loaded with the following commands in R:
26+
1627
```r
17-
source('path/to/this/repo/emFunctions.r')
28+
source('local/path/to/beem/emFunctions.r')
1829
```
30+
1931
## Input data
2032

2133
The input files for BEEM should have the same format as described in the manual for [MDSINE](https://bitbucket.org/MDSINE/mdsine/). The following two files are required by BEEM:
@@ -41,21 +53,21 @@ We have provided several sample input files that were also analyzed in our manus
4153

4254
#### Data from [Props et. al. (2016)](https://www.nature.com/articles/ismej2016117)
4355

44-
- OTU count `table: isme_analysis/counts.sel.txt`
45-
- Metadata: `isme_analysis/metadata.sel.txt`
56+
- OTU count `table: props_et_al_analysis/counts.sel.txt`
57+
- Metadata: `props_et_al_analysis/metadata.sel.txt`
4658

4759
#### Data from [Gibbons et. al. (2017)](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005364)
4860

49-
- OTU count table: `time_series_analysis/{DA,DB,M3,F4}.counts.txt`
50-
- Metadata: `time_series_analysis/{DA,DB,M3,F4}.metadata.txt`
61+
- OTU count table: `gibbons_et_al_analysis/{DA,DB,M3,F4}.counts.txt`
62+
- Metadata: `gibbons_et_al_analysis/{DA,DB,M3,F4}.metadata.txt`
5163

5264
## Usage
5365

5466
### Basic Usage (R commands)
5567

5668
```r
5769
## Load functions
58-
source('path/to/this/repo/emFunctions.r')
70+
source('emFunctions.r')
5971
## Read inputs
6072
counts <- read.table('counts.txt', head=F, row.names=1)
6173
metadata <- read.table('metadata.txt', head=T)
@@ -79,10 +91,10 @@ BEEM estimated parameters is an R `data.frame` (a table) with the following colu
7991

8092
### Analyses in the manuscript
8193

82-
The commands for reproducing the analysis reportd in the manuscript are presented as two jupyter notebooks: (1) [notebook for Props et. al.](https://github.com/CSB5/BEEM/blob/master/isme.ipynb) and (2) [notebook for Gibbons et. al.](https://github.com/CSB5/BEEM/blob/master/time_series_meta.ipynb).
94+
The commands for reproducing the analysis reportd in the manuscript are presented as two jupyter notebooks: (1) [notebook for Props et. al.](https://github.com/CSB5/BEEM/blob/master/props_et_al.ipynb) and (2) [notebook for Gibbons et. al.](https://github.com/CSB5/BEEM/blob/master/gibbons_et_al.ipynb).
8395

8496
## Citation
85-
C Li, L Tucker-Kellogg & N Nagarajan. (2018). System Biology Modeling with Compositional Microbiome Data Reveals Personalized Gut Microbial Dynamics and Keystone Species. [*BioRxiv*](https://www.biorxiv.org/content/early/2018/03/27/288803).
97+
C Li, L Tucker-Kellogg & N Nagarajan. (2018). An expectation-maximization-like algorithm enables accurate ecological modeling using longitudinal metagenome sequencing data [*BioRxiv*](https://www.biorxiv.org/content/early/2018/07/17/288803).
8698

8799
## Contact
88100
Please direct any questions or feedback to Chenhao Li (lich@gis.a-star.edu.sg) and Niranjan Nagarajan (nagarajann@gis.a-star.edu.sg).

emFunctions.r

Lines changed: 51 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -536,7 +536,7 @@ NORM <- function(tss, gradients, perturbInd, metadata, rmSp, params, ncpu=10, sc
536536
list(w.tmp, mse)
537537
}
538538

539-
mse.mat <- (matrix(unlist((w.list[[2]])), p-1)) ##* apply(abs(Xs),2,function(x)x/sum(x))
539+
mse.mat <- (matrix(unlist((w.list[[2]])), nrow(tss)-1)) ##* apply(abs(Xs),2,function(x)x/sum(x))
540540
w <- unlist(w.list[[1]])
541541

542542
mse <- mean(c(mse.mat), na.rm = TRUE)
@@ -594,8 +594,10 @@ suggestRefs <- function(dat, meta, scaling=1){
594594
cv <- apply(cv,2,mean)
595595
}
596596
message("The following species is recommended as the reference:")
597-
message(names(sort(cv[!(fil1 | fil2 | fil3)])[1]))
598-
data.frame(cv=cv, index=1:length(cv), hasZero=fil1, isTooHigh=fil3, isTooLow=fil2)[order(cv),]
597+
sel <- names(sort(cv[!(fil1 | fil2 | fil3)])[1])
598+
message(sel)
599+
list(table=data.frame(cv=cv, index=1:length(cv), hasZero=fil1, isTooHigh=fil3, isTooLow=fil2)[order(cv),],
600+
selected=which(sps==sel))
599601
}
600602

601603
#' Function to estimate biomass and parameters simultaneously
@@ -623,13 +625,13 @@ EM <- function(dat, meta, forceBreak=NULL, useSpline=TRUE,
623625
}
624626
refRank <- suggestRefs(dat, meta)
625627
if(is.null(refSp)){
626-
message("No input for reference species, selecting one with the lowest coefficient of variation...")
627-
refSp <- refRank$index[1]
628+
message("BEEM selecting reference species as default...")
629+
refSp <- refRank$selected
628630
}
629631
if(sum(dat[refSp,]==0)>0){
630632
message("[!]: The reference species has zero abundance in some samples. This will treated as non-zeros by adding a pseudo count.")
631633
}
632-
if(refRank[refSp, 1]>0.9) {
634+
if(refRank$table[rownames(dat)[refSp], 1]>0.9) {
633635
message("[!]: The reference species has high CV (>90%). Parameter estiamtes might be inaccurate (check the trace of weighted mse for convergence).")
634636
}
635637
message(paste0("Reference species: ", rownames(dat)[refSp]))
@@ -714,10 +716,10 @@ EM <- function(dat, meta, forceBreak=NULL, useSpline=TRUE,
714716
#' @param biomass biomass data following MDSINE's biomass data format
715717
#' @param forceBreak force to break the trajectory to handle pulsed perturbation (or species invasion) (default: NULL)
716718
#' @param dev deviation (dev * mad) from the median to be considered as outliers (default:Inf, no filtering)
717-
#' @param ncpu maximal number of CPUs used (default:10)
719+
#' @param ncpu maximal number of CPUs used (default:4)
718720
#' @param infer_flag run inference (default:TRUE)
719721
param.infer <- function(dat, metadata, biomass,
720-
forceBreak=NULL, dev=Inf, ncpu=10, infer_flag=TRUE){
722+
forceBreak=NULL, dev=Inf, ncpu=4, infer_flag=TRUE){
721723
registerDoMC(ncpu)
722724
log.transform <- function(x){
723725
tmp <- log(x)
@@ -756,6 +758,36 @@ param.infer <- function(dat, metadata, biomass,
756758
BLASSO(X=Xs, P=NULL, Ys=Ys, Fs=isOutlier, ncpu=ncpu, rmSp=0, vnames=rownames(dat))
757759
}
758760

761+
#' Diagnose the EM process
762+
#' @param beem.obj BEEM output list
763+
#' @param counts counts data following MDSINE's OTU table
764+
inspectEM <- function(beem.obj, counts){
765+
if(NROW(counts) <7){
766+
warning('You have less than 7 species. The estimation of parameters might be inaccurate.')
767+
}
768+
trace.mse.weighted <- beem.obj$trace.mse.weighted
769+
idx <- round(length(trace.mse.weighted)/25):length(trace.mse.weighted)
770+
if(min(trace.mse.weighted) > trace.mse.weighted[2]){
771+
warning('Optimization failed.') ## worse fit than CSS (first iteration)
772+
return(NA)
773+
}
774+
if( sum(trace.mse.weighted[idx] > 1e-5)/length(idx) > 0.3 ){
775+
warning('Poor fitting detected.') ## too many iterations with large MSE
776+
return(NA)
777+
}
778+
if( cor(trace.mse.weighted[idx], 1:length(idx), method='spearman') < -0.5){
779+
## dicreasing error
780+
return(0)
781+
}
782+
fit <- lm(trace.mse.weighted[idx]/mad(trace.mse.weighted[idx]) ~ idx)
783+
if(sqrt(median(resid(fit)^2)) > 0.5 && coef(fit)[2]>0 ) {
784+
warning('Poor convergence detected.') ## not smooth enough
785+
return(NA)
786+
}
787+
return(0)
788+
}
789+
790+
759791
#' Inferring biomass from BEEM results
760792
#' @param beem.obj BEEM output list
761793
biomassFromEM <- function(beem.obj){
@@ -770,13 +802,12 @@ biomassFromEM <- function(beem.obj){
770802
#' @param beem.obj BEEM output list
771803
#' @param counts counts data following MDSINE's OTU table
772804
#' @param metadata metadata following MDSINE's metadata format
805+
#' @param sparse use the sparse mode to estimate the parameters (default: TRUE)
773806
#' @param forceBreak force to break the trajectory to handle pulsed perturbation (or species invasion) (default: NULL)
774-
#' @param ncpu maximal number of CPUs used (default:10)
807+
#' @param ncpu maximal number of CPUs used (default:4)
775808
#' @param enforceLogistic re-estimate the self-interaction parameters (enforce to negative values)
776-
paramFromEM <- function(beem.obj, counts, metadata, forceBreak=NULL, ncpu=10, enforceLogistic=FALSE){
777-
if(NROW(counts) <7){
778-
warning('You have less than 7 species. The estimation of parameters might be inaccurate.')
779-
}
809+
paramFromEM <- function(beem.obj, counts, metadata, sparse=TRUE, forceBreak=NULL, ncpu=4, enforceLogistic=FALSE){
810+
inspectEM(beem.obj, counts)
780811
registerDoMC(ncpu)
781812
trace.mse <- beem.obj$trace.mse
782813
min.mse <- min(trace.mse)
@@ -791,6 +822,13 @@ paramFromEM <- function(beem.obj, counts, metadata, forceBreak=NULL, ncpu=10, en
791822
beem.biomass <- apply(beem.obj$trace.biomass[,em.idx],1,median)
792823
beem.param <- apply(beem.obj$trace.params[,em.idx],1,median)
793824
}
825+
826+
if(!sparse){
827+
message('Re-estimating parameters with the non-sparse mode...')
828+
return(param.infer(dat=counts, metadata=metadata, biomass=beem.biomass,
829+
forceBreak=forceBreak, ncpu=ncpu)$mdsine)
830+
}
831+
794832
p <- nrow(counts)
795833
## solve for interaction matrix
796834
beem.a <- beem.param[1:p]

0 commit comments

Comments
 (0)