Commit 40f0496

Update README.md
1 parent cbe443c commit 40f0496

1 file changed: +9 -8 lines changed

README.md

Lines changed: 9 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -66,25 +66,26 @@ This workflow offers several key advantages for RNA-seq analysis over existing p
6666
The workflow performs the following steps that produce the outlined results:
6767

6868
- **Processing:**
69-
- Automatically verifies (using samtools view) that the `read_typ` (single/paired-end) specified in the annotation matches the actual flags within the input BAM files, preventing downstream errors (`.check_read_type/{sample}.done`).
69+
- Automatically verifies (using samtools view) that the `read_type` (single/paired-end) specified in the annotation matches the actual flags within the input BAM files, preventing downstream errors (`.check_read_type/{sample}.done`).
7070
- Combines multiple input raw/unaligned/unmapped [uBAM](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format) files per sample into a single stream (using `samtools merge`).
7171
- Converts the merged BAM stream into FASTQ format, handling paired-end interleaving (using `samtools fastq`).
7272
- Processes the FASTQ stream for adapter trimming and quality filtering using `fastp`, generating QC reports (`fastp/{sample}/`).
73-
> [!NOTE]
74-
> `fastp` adapter auto-detection is disabled because we use STDIN mode (i.e., stream the data through pipes) to be disk space efficient.
7573
- De-interleaves the filtered FASTQ stream into separate compressed R1 and R2 files for paired-end data, or compresses directly for single-end data using shell commands and `pigz`.
74+
> [!NOTE]
75+
> `fastp` adapter auto-detection is disabled because we use STDIN mode (i.e., stream the data through pipes) to be disk space efficient.
7676
- **Quantification:**
7777
- Uses STAR `GeneCounts` to quantify reads per gene based on the specified Ensembl reference genome and annotation (`star/{sample}/`).
7878
- Handles unstranded, forward-stranded, and reverse-stranded library protocols based on the `strandedness` column.
7979
- Aggregates counts into a single matrix (`counts/counts.csv`).
8080
- **Annotation:**
8181
- Outputs gene annotations (`counts/gene_annotation.csv`).
8282
- Retrieves gene annotations (Ensembl ID, gene symbol, biotype, description) from Ensembl using `biomaRt`.
83-
- Calculates **exon-based** GC content and cumulative exon length for each gene, suitable for poly(A) selected libraries (e.g., .
84-
> [!NOTE]
85-
> Gene annotation can take a while since it depends on the availability of external data sources accessed via `biomaRt`.
86-
> GC-content and length are **exon-based**: In poly(A)‑selected libraries (such as Illumina TruSeq, Smart-seq or QuantSeq), the sequencing reads mainly come from exonic regions. Therefore, potential correction for GC bias and gene length should ideally use exon‑level GC content and effective exon length rather than whole‑gene metrics that include introns.
87-
- Outputs a sample annotation table containing sample-wise general MultiQC statistics (`counts/sample_annotation.csv`).
83+
- Calculates **exon-based** GC content and cumulative exon length for each gene, suitable for poly(A) selected libraries.
84+
- Outputs a sample annotation table containing sample-wise general MultiQC statistics (`counts/sample_annotation.csv`).
85+
> [!NOTE]
86+
> Gene annotation can take a while since it depends on the availability of external data sources accessed via `biomaRt`.
87+
>
88+
> GC-content and length are **exon-based**: In poly(A)‑selected libraries (such as Illumina TruSeq, Smart-seq or QuantSeq), the sequencing reads mainly come from exonic regions. Therefore, potential correction for GC bias and gene length should ideally use exon‑level GC content and effective exon length rather than whole‑gene metrics that include introns.
8889
- **QC & Reporting:**
8990
- Employs RSeQC tools to generate key quality metrics like strand specificity and read distribution across genomic features (`rseqc/`).
9091
- Aggregates QC metrics from fastp, STAR and RSeQC into a single report using MultiQC (`report/multiqc_report.html`) with [AI summaries](https://seqera.io/blog/ai-summaries-multiqc/).
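
The Quantification bullets above describe choosing counts by library protocol and then aggregating them into one matrix. The following is a minimal Python sketch of that idea, not the workflow's actual implementation. It relies on the documented layout of STAR's `ReadsPerGene.out.tab` (column 2 unstranded, column 3 read strand matches the RNA, column 4 reverse, with four `N_*` summary rows at the top). The labels `unstranded`, `forward`, and `reverse`, and the function names, are assumptions for illustration; the README does not state which values the `strandedness` column accepts.

```python
import csv
import io

# 0-based column index in ReadsPerGene.out.tab for each library protocol.
# The keys are hypothetical labels for the `strandedness` column.
STRAND_COLUMN = {"unstranded": 1, "forward": 2, "reverse": 3}


def read_gene_counts(handle, strandedness):
    """Return {gene_id: count} from one ReadsPerGene.out.tab file handle."""
    column = STRAND_COLUMN[strandedness]
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("N_"):
            continue  # skip N_unmapped, N_multimapping, N_noFeature, N_ambiguous
        counts[row[0]] = int(row[column])
    return counts


def build_matrix(per_sample):
    """Combine {sample: {gene: count}} into a header row plus one row per gene."""
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    rows = [["gene_id"] + samples]
    for gene in genes:
        # Genes absent from a sample are filled with 0 (an assumption).
        rows.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return rows


if __name__ == "__main__":
    demo = (
        "N_unmapped\t10\t10\t10\n"
        "N_multimapping\t2\t2\t2\n"
        "N_noFeature\t3\t30\t4\n"
        "N_ambiguous\t1\t0\t1\n"
        "ENSG01\t5\t1\t4\n"
        "ENSG02\t7\t0\t7\n"
    )
    counts = read_gene_counts(io.StringIO(demo), "reverse")
    for row in build_matrix({"sampleA": counts}):
        print(",".join(str(value) for value in row))
```

Picking the wrong column for a stranded library typically leaves most reads in `N_noFeature`, which is one reason the strand-specificity check from RSeQC is useful alongside this step.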

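The note on exon-based GC content and length can be illustrated with a small calculation. This is a hedged Python sketch, not the workflow's code (the README states that annotations are retrieved via `biomaRt`, i.e. in R). It assumes that "cumulative exon length" means the length of the union of all exons of a gene, so overlapping exons from different transcripts are counted once; that definition, the function names, and the 0-based half-open coordinates are all assumptions.

```python
def merge_intervals(exons):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double-counting bases.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exon_gc_and_length(sequence, exons):
    """Return (gc_fraction, cumulative_exon_length) over the union of exons.

    `sequence` is the gene's genomic sequence; `exons` are 0-based,
    half-open coordinates relative to that sequence.
    """
    exonic = "".join(sequence[s:e] for s, e in merge_intervals(exons)).upper()
    length = len(exonic)
    gc = sum(exonic.count(base) for base in "GC")
    return (gc / length if length else 0.0, length)


if __name__ == "__main__":
    gene = "GGCCAAAATTTTGCGC"              # toy 16 bp gene with an AT-rich intron
    exons = [(0, 4), (2, 6), (12, 16)]     # the first two exons overlap
    print(exon_gc_and_length(gene, exons))
    print(exon_gc_and_length(gene, [(0, len(gene))]))  # whole-gene metric
```

In the toy gene the intron is AT-rich, so the whole-gene GC fraction is lower than the exon-based one, which is the discrepancy the note warns about for poly(A)-selected libraries.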