RNA-Seq measures RNA expression levels in a biological sample using high-throughput cDNA sequencing, producing a count of the reads aligning to each gene. Noise in RNA-Seq read count data is commonly modelled as following a negative binomial distribution, in which the variance is a quadratic function of the expression level. However, many statistical, machine learning, and visualization methods work best when the noise in a data set has equal variance. Varistran is an R package that uses Anscombe's [-@Anscombe1948] variance stabilizing transformation for the negative binomial distribution to transform RNA-Seq count data, so that the noise has equal variance across all measured gene expression levels. The transformed data may be treated as log~2~ transformed gene expression levels, but with variability reduced at low read counts. Varistran also includes a function to open a Shiny report with simple diagnostic visualizations, including a plot to assess how effective the variance stabilization was, a biplot of samples and genes, and a heatmap. These visualizations allow defective samples, sample mislabeling, and batch effects to be easily identified.
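To illustrate the idea of variance stabilization, here is a minimal Python sketch of the delta-method form of the transformation for a negative binomial with variance $\mu + \phi\mu^2$. It is not Varistran's implementation: the function names `vst_nb` and `approx_sd_after_vst` are hypothetical, the dispersion value is purely illustrative, and Anscombe's small-count correction constants are omitted. The constant offset of 2 is chosen here so that the result approaches log~2~ of the count at high counts.

```python
import math


def vst_nb(count, dispersion):
    """Simplified variance-stabilizing transform for negative binomial counts.

    Hypothetical helper: 2*log2(sqrt(x) + sqrt(x + 1/dispersion)) - 2.
    This equals (2 / ln 2) * asinh(sqrt(dispersion * x)) plus a constant,
    and approaches log2(x) as x grows large.
    """
    return 2.0 * math.log2(
        math.sqrt(count) + math.sqrt(count + 1.0 / dispersion)
    ) - 2.0


def approx_sd_after_vst(mean, dispersion):
    """Delta-method estimate of the noise standard deviation after transforming.

    Multiplies the negative binomial standard deviation at `mean` by the
    slope of the transform there (estimated with a central difference).
    """
    sd = math.sqrt(mean + dispersion * mean ** 2)
    h = 1e-4 * mean
    slope = (vst_nb(mean + h, dispersion) - vst_nb(mean - h, dispersion)) / (2.0 * h)
    return slope * sd


if __name__ == "__main__":
    dispersion = 0.1  # illustrative value, not taken from any data set
    for mean in (1.0, 10.0, 100.0, 10000.0):
        print(mean, vst_nb(mean, dispersion), approx_sd_after_vst(mean, dispersion))
```

Under this first-order approximation the standard deviation after transformation is $\sqrt{\phi}/\ln 2$ at every mean, which is the sense in which the noise has equal variance. The transform is also finite at a count of zero, unlike a plain log~2~.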