Releases: jsh58/Genrich
Version 0.6.1
Version 0.6
- New default peak-calling parameters:
-p 0.01 -a 200.0. This also means that, by default, q-values are not calculated. - In ATAC-seq mode, the ends of fragments are adjusted by +5/-5 to account for Tn5 transposase occupancy, when determining the centers of cut site intervals. This adjustment can be avoided with
-D. - The number of warning messages of reads extending off the ends of chromosomes has been limited to 128 (per input file).
- A convenience option (
-S) was added, which prevents Genrich from checking the sort order of the input SAM/BAM file. The input file is still parsed under the assumption that it is sorted by queryname, so this option should be used judiciously.
Version 0.5
-
The null statistical model was changed to the log-normal distribution. This was determined to be a more appropriate choice than other distributions (including the previously used exponential) from examining ChIP-seq control samples.
-
The minimum length (
-l) is now a regular peak-calling parameter, not an alternative peak-calling method. The default value is 0bp. Those who prefer peak-calling by minimum length and not AUC can run-a 0 -l <int>. -
Two convenience options were added:
-X: option to skip peak-calling, but perform all alignment parsing (including identifying PCR duplicates)-P: option to call peaks directly from a bedgraph-ish log file (-f) produced by a previous Genrich run
Version 0.4
-
Option to remove PCR duplicates (
-r). The procedure is explained here. Incorporates ideas fromfindDups9.py, in particular the process of evaluating multimapping reads/fragments.- A log file listing duplicates can be produced via
-R <file>.
- A log file listing duplicates can be produced via
-
A reasonably complete README was added to the homepage.
-
Input SAM/BAM files are now required to be queryname-sorted (
samtools sort -n). -
A bug in alignment parsing (paired secondary alignments with matching positions) was fixed.
Version 0.3
-
New default peak-calling method: area under the curve (AUC). For a peak to be called, the total significance of the region must exceed a minimum value (
-a <float>, default 20.0).- The total significance is calculated as the sum of the -log(q) values above the
-qthreshold over the length of the region (i.e. the area under the -log(q) "curve"). If a-pthreshold is specified, the area under the -log(p) curve is calculated. - The maximum gap parameter (
-g <int>) still allows multiple regions to be linked. - No minimum length is required for a peak to be called.
- Can be overridden by specifying
-l <int>, in which case peak-calling reverts to the previous method, with the given minimum length for peaks.
- The total significance is calculated as the sum of the -log(q) values above the
-
Option to provide a BED file of genomic regions to exclude from analysis (
-E <file>).- The regions will affect peak calls, such that no peak may extend into or around an excluded region.
- The regions' lengths will be subtracted from the genome length calculated by the program.
- In the output log files, excluded regions will have treatment/control pileup values of
0.0and p-/q-values ofNA. - Multiple BED files can be specified, comma-separated (or space-separated, in quotes).
-
Accessory script
findNs.pyto produce a BED file of 'N' homopolymers from a fasta file (e.g. a reference genome). The output can (and should) be given to Genrich via-E(above). -
Option to keep unpaired alignments, with lengths changed to a given value, has been changed to
-w <int>(formerly-a <int>).
Version 0.2
-
Multiple replicates are now analyzed separately, with p-values calculated for each. At each position, the multiple p-values are then combined by Fisher's method, before conversion to q-values.
-
Three optional output log files:
-f <file>: with one replicate, it lists treatment/control pileups, p- and q-values, and significance for each interval; with multiple replicates, it lists p-values of each replicate, combined p-value, q-value, and significance for each interval.-k <file>: for each replicate, sequentially, it lists a header line (# treatment file: <name>; control file: <name>), followed by treatment/control pileups and a p-value for each interval. This is the way to examine pileup values with multiple replicates, since the-ffile will not supply them.-b <file>: an unsorted BED file of the reads/fragments analyzed. The 4th column gives the read name, number of alignments, 'T'reatment or 'C'ontrol, and sample number (0-based), e.g.SRR5427885.57_2_T_0.
-
Option to analyze reads in ATAC-seq mode (
-j). Instead of analyzing full fragments, the program uses "cut site intervals" centered on the ends of fragments. The interval lengths are determined by the-dparameter (def. 100bp). -
Control sample pileups are now scaled to match the treatment, based on fragment/interval lengths.
-
Fixed bug related to sorting alignments by scores.
-
Fixed bugs related to reference sequences (chromosomes) missing from a SAM/BAM file. Reads aligning to chromosomes either on the "skipped" list (
-eargument) or not appearing in the header of the treatment SAM/BAM file will not be analyzed.
Version 0.1
- Analyzes secondary alignments:
- Keeps alignments whose alignment scores (
AS) are within a specified value (-s <float>) of the best alignment. The default value of-s 0means that only alignments judged as equivalent by the aligner will be kept. - Each of the
nalignments for a read/fragment is counted as1/nfor the pileup. - To avoid excessive memory usage and the imprecision inherent in floating-point values, a maximum of 10 alignments per read is analyzed. Reads with more than 10 alignments will be subsampled based on the best alignment scores; in the case of ties, alignments appearing first in the SAM/BAM are favored.
- Keeps alignments whose alignment scores (
- Input SAM/BAM files are required to be name sorted (
samtools sort -n) - A simple hashtable is implemented for efficiently compiling p-values to calculate q-values. The q-value calculation is still based on the Benjamini-Hochberg procedure.
Version 0.0
Peak-calling for genomic enrichment assays of treatment sample(s) (-t [<file>]+), with or without control sample(s) (-c [<file>]+). Input files must be in SAM/BAM format. Incorporates ideas from SAMtoBED, removeChrom, and MACS2 v2.1.2_dev.
- Control of analysis of alignments:
- Properly paired alignments only (default); fragments are inferred appropriately
- Also keeping unpaired alignments:
- As they appear in the SAM/BAM file (
-y) - Length increased to specified value (
-a <int>) - Length increased to average value of paired alignments (
-x)
- As they appear in the SAM/BAM file (
- Filtering options:
- Chromosomes (reference sequences) to ignore (
-e <arg>) - Minimum mapping quality (
-m <int>)
- Chromosomes (reference sequences) to ignore (
- Peak-calling options:
- Maximum q-value (
-q <float>, def. 0.05) - Maximum p-value (
-p <float>) - Minimum length of a peak (
-l <int>, def. 100bp) - Maximum distance between significant sites (
-g <int>, def. 100bp)
- Maximum q-value (
- Output options:
- Output peak file, in ENCODE narrowPeak format (
-o <file) - Output bedgraph file listing treatment/control pileups and p/q values (
-b <file>)
- Output peak file, in ENCODE narrowPeak format (