Site statistics
When accuMUlate identifies a putative mutation, it produces a set of summary statistics that might assist in identifying false positives. A brief description of these statistics is given in the page describing the program's output. This page gives a more detailed description of these statistics and a guide to their interpretation.
The first five statistical columns are simply counts of a specific type of sequencing read at a given site.
-
Depth (Column 11).
- This is simply the total sequencing depth at a given site after the mapping and base-quality filters have been applied. Sites with unusually high sequencing coverage may represent repeat elements or gene duplications that are not included in the reference genome.
-
Number of FWD (Column 12) and REV (Column 13) orientation reads supporting a mutation.
- These columns give the number of sequencing reads in the putatively mutant sample that support the putative mutation (i.e. the most likely allele of the mutant). The counts are separated into those mapping to the forward strand of the reference genome and those mapping to the reverse strand. A preponderance of one or the other class of read may suggest that an abnormality upstream or downstream of the putative mutation is causing mapping errors that produce a false-positive mutation.
-
N. ancestral in mutant (Column 14)
- This is the number of reads coming from the mutant sample and matching the most probable ancestral allele. You would expect very few such reads in true mutations from an experiment with haploid descendant lines. On the other hand, diploid heterozygous mutants arising from a homozygous diploid ancestor should contain approximately equal numbers of ancestral and non-ancestral alleles.
-
N. mutant in WT (Column 15)
- This is the number of reads coming from a WT sample (ancestor and all non-mutant lines) that contain a putatively mutant allele. The presence of putatively mutant alleles in multiple samples suggests that mismapping in this region is producing a false-positive mutation.
Each of the final four columns is the result of one of two statistical tests. More details about these tests are provided below.
-
Mapping quality difference (Column 16).
- An Anderson-Darling test statistic comparing the distribution of mapping quality scores in all reads that contain an apparently mutant allele with the mapping quality scores of reads that contain an ancestral allele. A large score for this statistic might suggest an apparent mutation is caused by mismapped reads.
-
Insert size difference (Column 17).
- If paired-end sequencing is used in an experiment then the value for this column is an Anderson-Darling test statistic for a difference in inferred insert size (i.e. the genomic distance between FWD and REV reads from the same molecule) between all read-pairs containing an apparently mutant allele and those containing an ancestral allele. As with the MQ difference, a large score for this statistic might suggest an apparent mutation is caused by mismapped reads. In particular the presence of indels or copy number variants close to the apparent mutation could lead to large scores for this statistic.
-
Strand bias (Column 18).
- Fisher's exact test for an association between the allele that a read supports (mutant or ancestral) and the strand to which that read maps. This is the same comparison that is used as the worked example in the description of Fisher's exact test below.
-
Pair-mapping rate difference (Column 19).
- If paired-end sequencing is used in an experiment then the value for this column is a p-value of a Fisher's exact test for an association between the allele that a read supports (ancestral or mutant) and whether or not that read's pair was successfully mapped to the reference genome.
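As a rough illustration of how the columns above might be combined when screening candidate sites, here is a minimal Python sketch. It assumes the output is tab-separated and that the 1-based column numbers match those listed above. The function names and every threshold are hypothetical placeholders for illustration, not recommendations from accuMUlate.

```python
import math

# 1-based column numbers as listed on this page.
COLUMNS = {
    "depth": 11, "fwd": 12, "rev": 13, "anc_in_mutant": 14, "mutant_in_wt": 15,
    "mq_ad": 16, "insert_ad": 17, "strand_p": 18, "pair_p": 19,
}


def parse_site(line, sep="\t"):
    """Pull the nine statistics out of one output line (tab delimiter is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    site = {}
    for name, col in COLUMNS.items():
        try:
            site[name] = float(fields[col - 1])
        except ValueError:
            # Non-numeric entries are treated as missing.
            site[name] = math.nan
    return site


def flag_site(site, max_depth=150, max_ad=3.784, min_p=0.01, max_mutant_in_wt=0):
    """Return the reasons a site looks suspicious. All thresholds are illustrative only."""
    flags = []
    if site["depth"] > max_depth:
        flags.append("high_depth")
    if site["fwd"] == 0 or site["rev"] == 0:
        flags.append("single_strand_support")
    if site["mutant_in_wt"] > max_mutant_in_wt:
        flags.append("mutant_allele_in_wt")
    if site["mq_ad"] > max_ad:
        flags.append("mq_difference")
    if site["insert_ad"] > max_ad:
        flags.append("insert_size_difference")
    if site["strand_p"] < min_p:
        flags.append("strand_bias")
    if site["pair_p"] < min_p:
        flags.append("pair_mapping_difference")
    return flags
```

The count of ancestral reads in the mutant (Column 14) is parsed but deliberately not flagged, because its expected value depends on ploidy as described above. Missing values become NaN, so the comparisons involving them are simply false and raise no flag.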
The two-sample Anderson-Darling test is a non-parametric test of the hypothesis that two samples could have been drawn from the same underlying distribution. Compared with similar tests, it has the advantage of not requiring a particular underlying distribution to be assumed. accuMUlate reports the test statistic produced by the Anderson-Darling test. Though there is no closed-form formula for converting these statistics into p-values, it is possible to find the statistics that correspond to given significance levels via simulation. The following table should give you a feel for how to interpret the A-D statistic.
| Significance level (p-value) | 0.05 | 0.01 | 0.001 | 0.0001 |
|---|---|---|---|---|
| A-D test statistic (critical value) | 1.964 | 3.784 | 6.497 | 9.308 |
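The table can be used as a simple lookup. The Python sketch below treats the first row of the table as significance levels and the second row as the corresponding A-D statistics, and reports the most stringent tabulated level that a given statistic reaches. The function name `ad_significance` is hypothetical, and the result is only a bracket, not an exact p-value.

```python
# (significance level, A-D statistic) pairs copied from the table above.
AD_CRITICAL_VALUES = [(0.05, 1.964), (0.01, 3.784), (0.001, 6.497), (0.0001, 9.308)]


def ad_significance(statistic):
    """Return the smallest tabulated level whose critical value the statistic
    meets or exceeds, or None if it falls below every tabulated value."""
    reached = None
    for level, critical in AD_CRITICAL_VALUES:
        if statistic >= critical:
            reached = level
    return reached
```

For example, a statistic of 4.2 in the mapping quality column would be reported as significant at the 0.01 level but not at the 0.001 level.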
The R package kSamples provides a function, ad.pval, which allows you to obtain a p-value for an A-D statistic. For the two-sample test used here you should set m, the degrees of freedom in the test, to 1. Scholz and Stephens (1987) contains a detailed description of the A-D test and its derivation:
Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson-Darling tests. Journal of the American Statistical Association, 82(399), 918-924. doi: http://dx.doi.org/10.1080/01621459.1987.10478517
Fisher's exact test is used to test for association between two different categorical variables. The test produces a p-value, which gives the probability of observing data at least as extreme as those observed if there were no association between the two categorical variables.
accuMUlate uses this test to compare reads supporting the mutant and ancestral alleles at a given site. For instance, consider a putatively mutant site at which most of the reads supporting the mutant allele were mapped to the forward strand of the reference genome, while reads containing the ancestral allele are evenly distributed among the forward and reverse strands.
| | FWD | REV |
|---|---|---|
| Mutant | 10 | 1 |
| Ancestral | 50 | 50 |
Using these data, Fisher's exact test gives a p-value of ~0.01. Thus, the excess of forward-strand reads among the mutant-supporting reads is statistically significant. On the other hand, the following more balanced data produce a p-value of ~0.53:
| | FWD | REV |
|---|---|---|
| Mutant | 4 | 7 |
| Ancestral | 50 | 50 |
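The two p-values quoted above can be checked with a short, standard-library-only Python sketch of the two-sided Fisher's exact test. This is an illustration of the test itself, not accuMUlate's own implementation, and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    With the row and column totals held fixed, the top-left cell follows a
    hypergeometric distribution. The p-value is the total probability of all
    tables that are no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Probability that the top-left cell equals k under no association.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Rows are Mutant and Ancestral; columns are FWD and REV.
p_biased = fisher_exact_two_sided(10, 1, 50, 50)    # roughly 0.01
p_balanced = fisher_exact_two_sided(4, 7, 50, 50)   # roughly 0.53
```

If SciPy is available, `scipy.stats.fisher_exact` with its default two-sided alternative should give matching values for the same tables.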