Skip to content

Latest commit

 

History

History
245 lines (180 loc) · 13.6 KB

File metadata and controls

245 lines (180 loc) · 13.6 KB

Cufflinks Lung transcriptome analysis: Lung, HLO, Definitive endoderm, and ES cells

David R. Hill, PhD. Department of Internal Medicine Division of Gastroenterology. University of Michigan Medical Center 1150 W. Medical Center Dr. Ann Arbor MI 48109

hilldr@med.umich.edu

Methods

Sequencing of HLOs was performed by the UM DNA Sequencing Core, using Illumina Hi-Seq platform. All HLO sequences were deposited in the EMBL-EBI ArrayExpress database using Annotare 2.0 and are catlogued unter the accession number E-MTAB-3339. The UM Bioinformatics Core downloaded the reads files from the Sequencing Core storage, and concatenated those into a single .fastq file for each sample. The UM Bioinformatics Core also downloaded reads files from EBI-AE database (Adult lung Samples) and NCBI-GEO (SRA) database (Fetal lung samples). We evaluated the quality of the raw reads data for each sample using FastQC (version 0.10.1) to identify features of the data that may indicate quality problems (e.g. low quality scores, over-represented sequences, inappropriate GC content, etc.). Initial QC report indicated over-representation of Illumina adapter sequences in samples from EBI-AE data set and NCBI-GEO data set. We trimmed off the adapter sequences from the reads using Cutadapt (version 0.9.5). We used the software package Tuxedo Suite for alignment, differential expression analysis, and post-analysis diagnostics. Briefly, we aligned reads to the reference transcriptome (UCSC hg19) using TopHat (version 2.0.9) and Bowtie (version 2.1.0.0)\citep{Langmead:2009}. We used default parameter settings for alignment, with the exception of: “–b2-very-sensitive” telling the software to spend extra time searching for valid alignments, as well as “–no-coverage-search” and “–no-novel-juncs” to limit the read mapping to known transcripts. In addition, we used FastQC for a second round of quality control (post-alignment), to ensure that only high quality data would be input to expression quantitation and differential expression analysis. Sequence retrieval, concatenation, and quality control analysis were conducted using 64-bit Red Hat Enterprise Linux version 6.3 at the UM Advanced Research Computing center.

We used Cufflinks/CuffNorm (version 2.2.1) for expression quantitation and differential expression analysis \citep{Trapnell:2012}, using UCSC hg19.fa as the reference genome sequence and UCSC hg19.gtf as the reference transcriptome annotation. For this analysis, we used parameter settings: “–multi-read-correct” to adjust expression calculations for reads that map in more than one locus, as well as “–compatible-hits-norm” and “–upper-quartile –norm” for normalization of expression values. Normalized FPKM tables were generated using the CuffNorm function found in Cufflinks. Transcriptional quantitation analysis in Cufflinks was conducted using the 64-bit Debian Linux stable version 7.8 (“Wheezy”) platform.

The complete FPKM matrix, containing frequency counts for all 24,010 genes contained in the reference genome for all 23 RNAseq samples, was evaluated using unscaled principle component analysis (PCA) to visualize and quantify multi-dimensional variation between samples. Of the 24,010 genes annotated in the reference genome, 2,815 (11.7%) were not detected in the RNAseq analysis of any of the 23 samples. Principle components were calculated using the function ‘prcomp’ found in the R (version 3.1.2) statistical programming language and plotted using the R package ‘ggplot2’.

Hierarchical cluster analysis based on the Canberra distance \citep{Lance:1966} between FPKM vectors was used to classify discrete RNAseq samples according to the degree of total transcriptional dissimilarity indicated by the normalized FPKM values. Bootstrap analysis was used to assesses the uncertainty in the assigned hierarchical clustering relationships. 10,000 bootstraping iterations were generated by repeatedly randomly sampling the FPKM dataset. The bootstrap probability (BP) of a cluster is defined as the frequency of a given relationship among the bootstrap replicates. Multiscale bootstrap resampling was used to calculate an approximately unbiased (AU) p-value for a given relationship, with AU > 95 indicating a high degree of statistical significance. Analyses were conducted using R package ‘pvclust’. Spearman correlation was applied as an additional assessment of the cumulative degree of correlation among RNAseq datasets. In addition, we computed Spearman rank correlation coefficients (rho) in a pairwise manner among all 23 RNAseq samples using the complete normalized FPKM data. The Spearman coefficients were plotted as a heatmap using the function ‘heatmap.2’ in the R package ‘gplots’.

Results

Figures

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
## Clear temporary memory
rm(list=ls())

library(ggplot2)
library(RColorBrewer)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Principle component analysis of FPKM matrix

## FMPK matrix input
data <- read.table("./DATA/normout021215/genes.fpkm_table",header=TRUE,sep="\t")

data1 <-data

attr.table <- read.table("./DATA/normout021215/genes.attr_table",header=TRUE,sep="\t")

data1$gene_short_name <- attr.table$gene_short_name

write.csv(data1, file="./DATA/base_dataset_cufflinksFPKM021215.csv")

#col.names <- c("gene_id","HIO_0","HIO_1","HIO_2","HLO_0","HLO_1","HLO_2","HLO_3","HLO_4","HLO_5","Lung.A_0","SI.Dist_0","Lung.A_1","SI.Dist_1","Lung.A_2","SI.Dist_2","Kidney.A_0","SI.Dist_3","Lung.A_3","Lung.A_4","SI.Duo_0","Kidney.A_1","SI.Duo_1","Kidney.A_2","Kidney.A_3","DefEnd_0","DefEnd_1","DefEnd_2","ES_0","ES_1","ES_2","SI.F_0","SI.F_1","SI.F_2","Kidney.F_0","Kidney.F_1","Kidney.F_2","SI.F_3","SI.F_4","Kidney.F_3","Kidney.F_4","Lung.F_0","Lung.F_1","Lung.F_2","Lung.F_3","Lung.F_4","Lung.F_5")

#colnames(data) <- col.names
num.data <- data[,sapply(data,is.numeric)]


HIO <- grep("HIO",colnames(num.data),ignore.case=T)
HLO <- grep("HLO",colnames(num.data),ignore.case=T)
aL <- grep("Lung.A",colnames(num.data),ignore.case=T)
aS <- grep("Stomach.A",colnames(num.data),ignore.case=T)
aC <- grep("Colon.A",colnames(num.data),ignore.case=T)
aSI <- grep("SI.D",colnames(num.data), ignore.case=T)
DE <- grep("DE",colnames(num.data), ignore.case=T)
ES <- grep("ES",colnames(num.data), ignore.case=T)
fL <- grep("Lung.F",colnames(num.data), ignore.case=T)
fS <- grep("Stomach.F",colnames(num.data), ignore.case=T)
fSI <-grep("SI.F",colnames(num.data),ignore.case=T)
st <- grep("Stomach", colnames(num.data),ignore.case=T)

group <- gsub('.{2}$', '', colnames(num.data))


pca.data <- num.data[apply(num.data, 1, sd, na.rm=TRUE) != 0,]
data.n <- data1[apply(data1, 1, sd, na.rm=TRUE) != 0,]
rownames(data.n) <- make.names(data.n$gene_short_name,unique=TRUE)
data.n <-data.n[,sapply(data.n,is.numeric)]
#pca.data <- num.data[,-st]



pca <- prcomp(t(num.data),scale=FALSE,center=TRUE,na.action=na.omit)

scores <- data.frame(colnames(pca.data), pca$x[,1:ncol(pca$x)],group)
scores$color <- c("#377EB8","#377EB8","#377EB8","#E41A1C","#E41A1C","#E41A1C","#984EA3","#984EA3","#984EA3","#984EA3","#984EA3","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#FF7F00","#FF7F00","#FF7F00","#FF7F00","#FF7F00","#FF7F00")
#scores <- subset(scores, scores$colnames != "SI.F_0")

library(ggplot2)
theme <- theme(legend.position="right",legend.title=element_blank(),legend.background = element_rect(fill="white", size=.5, linetype="dotted"),panel.grid.minor=element_blank(), panel.grid.major=element_blank())

pc1.2 <- qplot(x=PC1, y=PC2, data=scores) + theme + 
 # scale_fill_brewer(palette="Set3")+
  scale_fill_manual(values = colorRampPalette(brewer.pal(length(unique(scores$group)), "Set1"))(length(unique(scores$group)))) +
  geom_point(shape=21,aes(fill=factor(group)), size=8)  

pc1.3 <- qplot(x=PC1, y=PC3, data=scores) + theme +
 # scale_fill_brewer(palette="Set3")+
  scale_fill_manual(values = colorRampPalette(brewer.pal(length(unique(scores$group)), "Set1"))(length(unique(scores$group)))) +
  geom_point(shape=21,aes(fill=factor(group)), size=8) 

pc2.3 <- qplot(x=PC2, y=PC3, data=scores) + theme +
  #scale_fill_brewer(palette="Set3")+
  scale_fill_manual(values = colorRampPalette(brewer.pal(length(unique(scores$group)), "Set1"))(length(unique(scores$group)))) +
  geom_point(shape=21,aes(fill=factor(group)), size=8) 

multiplot(pc1.2,pc1.3,pc2.3,cols=3)

./DATA/normout021215/DATA/pcaplot.jpg

print(pc1.2)

./DATA/normout021215/DATA/pcaplot1a.jpg

3D scatterplot of unscaled PCA

library(scatterplot3d)
scatterplot3d(scores$PC1,scores$PC2,scores$PC3,bg=scores$color,pch=21,grid=TRUE,angle=45,axis=TRUE,cex.symbols=3,xlab="PC1",ylab="PC2",zlab="PC3",lty.axis=1,lty.hide=2,lty.hplot=2,label.tick.marks=FALSE)

./DATA/normout021215/DATA/3Dpcaplot1.jpg

Hierarchical cluster analysis of FPKM matrix

# outsourced to debian server as pvclust1.r
library(pvclust)
result <- pvclust(pca.data,method.dist="canberra",method.hclust="complete",nboot=10000)
save(result,file="./DATA/normout021215/DATA/result")
load(file="./DATA/normout021215/DATA/result")
print(plot(result,hang=-1,float=0.008,cex.pv=0.3,font.pv=4,print.num=FALSE,print.pv=TRUE,col.pv=c(2,3,8),lwd=1,cex=0.6,main="",col="grey30",ylab="Distance"))

./DATA/normout021215/DATA/pvclust1.jpg

Gene expression correlation matrix

cor1 <- cor(pca.data,method="spearman")

## Load libraries
library(matrixStats)
library(gplots)
library(RColorBrewer)

# creates a color palette from blue to white to red in "n" increments
my_palette <- colorRampPalette(c("white","white","blue"))(n = 299)

# defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(0,0.5,length=100),  #blue
               seq(0.5,0.9,length=100),
               seq(0.9,1,length=100))   #red    

cor.mat <- as.matrix(cor1)

result <- heatmap.2(cor.mat,
                    notecol="black",     
                    #scale="row",na.rm=T,
                    density.info="none",
                    key.xlab="Correlation",
                    key.title="",   # turns off density plot inside color legend
                    trace="none",         # turns off trace lines inside the heat map
                    margins =c(8,8),     # widens margins around plot
                    col=my_palette,       # use on color palette defined earlier 
                    breaks=col_breaks,    # enable color transition at specified limits
                    dendrogram="both",    # do not draw dendrogram
                    Colv=T,
                    Rowv=T,
                    srtCol=45,
                    cexRow= 0.5,
                    cexCol = 0.5,
                    keysize=0.25,
                    labRow = rownames(cor.mat),
                    labCol = colnames(cor.mat),
                    #rowsep=(order(GOterm$V3)[!duplicated(sort(GOterm$V3))]-1),
                    lwid = c(1.2,5),
                    lhei = c(1.2,5)
                    )          

print(result)

./DATA/normout021215/DATA/corrmatrix1.jpg