nf-irods-to-fastq

Overview

This Nextflow pipeline retrieves samples from iRODS storage, converts CRAM/BAM files to FASTQ format, and optionally uploads the results to an FTP server. It collects sample metadata from iRODS along the way and provides three main operations: metadata discovery, CRAM-to-FASTQ conversion, and FTP upload.

Quick start

If you only need FASTQ files from iRODS, first create an actions/samples.csv file with a single column:

sample
rPCNSL14736682
rPCNSL14736778
...

and run

nextflow run cellgeni/nf-irods-to-fastq --cram2fastq --samples actions/samples.csv

Contents of Repo

  • main.nf — the main Nextflow pipeline that orchestrates all workflows
  • nextflow.config — configuration script for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
  • subworkflows/ — collection of subworkflows for different pipeline stages
  • modules/ — collection of reusable modules for various tasks
  • configs/ — configuration files for different pipeline components
  • examples/ — example input files demonstrating various input formats

Pipeline Workflow

  1. Sample Discovery: Reads sample information from CSV, TSV, or JSON input files
  2. Metadata Retrieval: Searches iRODS for CRAM files associated with samples and retrieves metadata
  3. File Download: Downloads CRAM/BAM files from iRODS storage
  4. Format Conversion: Converts CRAM/BAM files to FASTQ format using samtools
  5. Quality Control: Calculates read lengths and applies ATAC-seq specific formatting if needed
  6. File Concatenation: Combines FASTQ files by sample and read type
  7. Checksum Calculation: Generates MD5 checksums for data integrity verification
  8. FTP Upload: Optionally uploads processed FASTQ files to specified FTP servers

Pipeline Parameters

Required Parameters (choose one):

  • --samples — Path to a CSV, TSV, or JSON file containing sample information with a sample or sample_id column
  • --crams — Path to a CSV or TSV file containing CRAM file information with columns: sample, cram_path, fastq_prefix
  • --fastqs — Path to a CSV file containing FASTQ file information with columns: sample, path

Operation Flags:

  • --cram2fastq — Enable CRAM-to-FASTQ conversion (used with --samples or --crams)
  • --toftp — Enable FTP upload (used with --fastqs)

Optional Parameters:

  • --output_dir — Output directory for pipeline results (default: "results")
  • --publish_mode — File publishing mode (default: "copy")
  • --index_format — Index format formula for samtools (default: "i*i*")
  • --format_atac — Apply ATAC-seq specific formatting (default: true)
  • --ignore_patterns — Comma-separated patterns to ignore when finding CRAMs (default: "*_phix.cram,*yhuman*,*#888.cram")
  • --irods_zone — iRODS zone to search (default: "seq")
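
To illustrate what the default --ignore_patterns value would exclude, here is a minimal Python sketch. It assumes the patterns are shell-style globs matched against the CRAM file name only; the pipeline's own matching may work differently. The function name is_ignored and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Default value of --ignore_patterns, as documented above.
DEFAULT_IGNORE_PATTERNS = "*_phix.cram,*yhuman*,*#888.cram"


def is_ignored(cram_path: str, patterns: str = DEFAULT_IGNORE_PATTERNS) -> bool:
    """Return True if the file name matches any comma-separated glob pattern.

    Assumption: patterns apply to the file name, not the full iRODS path.
    """
    name = PurePosixPath(cram_path).name
    return any(
        fnmatchcase(name, pattern.strip())
        for pattern in patterns.split(",")
        if pattern.strip()
    )


# Hypothetical iRODS paths, for illustration only.
paths = [
    "/seq/24133/24133_1#4.cram",
    "/seq/24133/24133_1#888.cram",
    "/seq/24133/24133_1#4_phix.cram",
    "/seq/24133/24133_1#4_yhuman.cram",
]
kept = [p for p in paths if not is_ignored(p)]
print(kept)
```

Under these assumptions, only the first path survives the filter.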

FTP Parameters (required when using --toftp):

  • --ftp_host — FTP server hostname (default: "ftp-private.ebi.ac.uk")
  • --username — FTP username
  • --password — FTP password
  • --ftp_path — Target path on FTP server

Note: When using --toftp, you must also provide --fastqs with a CSV file containing FASTQ paths.
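
The parameter rules above can be summarized as a small pre-flight check. This is a sketch of the documented constraints, not the pipeline's own validation code; the function name check_params and the message wording are hypothetical.

```python
def check_params(params: dict) -> list[str]:
    """Return a list of problems with a parameter combination (empty if none)."""
    problems = []

    # Exactly one input mode may be used per run.
    inputs = [key for key in ("samples", "crams", "fastqs") if params.get(key)]
    if len(inputs) != 1:
        problems.append("provide exactly one of --samples, --crams, --fastqs")

    # --cram2fastq is documented for use with --samples or --crams.
    if params.get("cram2fastq") and not (params.get("samples") or params.get("crams")):
        problems.append("--cram2fastq requires --samples or --crams")

    # --toftp needs --fastqs plus the FTP settings that have no default.
    if params.get("toftp"):
        if not params.get("fastqs"):
            problems.append("--toftp requires --fastqs")
        for key in ("username", "password", "ftp_path"):
            if not params.get(key):
                problems.append(f"--toftp requires --{key}")

    return problems


ok = check_params({"samples": "examples/samples.csv", "cram2fastq": True})
bad = check_params({"samples": "examples/samples.csv", "toftp": True})
print(ok)
print(bad)
```

--ftp_host is not checked because it has a documented default.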

Input File Formats

The pipeline supports multiple input formats for different operation modes:

Option 1: Sample Discovery (--samples)

Provide a sample or sample_id column, optionally with other metadata columns (such as study_title), to find CRAM files on iRODS.

CSV format:

sample,study_title
4861STDY7135911,Study_Name
4861STDY7135912,Study_Name
Human_colon_16S8000511,Human_colon_16S

TSV format:

sample	study_title
4861STDY7135911	Study_Name
4861STDY7135912	Study_Name

JSON format:

[
  {"sample": "4861STDY7135911", "study_title": "Study_Name"},
  {"sample": "4861STDY7135912", "study_title": "Study_Name"}
]
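
Before launching a run, it can help to confirm that a --samples file parses and carries a sample or sample_id value in every record. The Python sketch below is one way to do that. It assumes the format is chosen by file extension and that JSON input is a list of objects, as in the examples above; read_sample_ids is a hypothetical helper, not part of the pipeline.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_sample_ids(path: str) -> list[str]:
    """Read sample identifiers from a CSV, TSV, or JSON file."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".json":
        rows = json.loads(file.read_text())
        if not isinstance(rows, list):
            raise ValueError("JSON input must be a list of objects")
    elif suffix in (".csv", ".tsv"):
        delimiter = "\t" if suffix == ".tsv" else ","
        with file.open(newline="") as handle:
            rows = list(csv.DictReader(handle, delimiter=delimiter))
    else:
        raise ValueError(f"unsupported input format: {suffix}")

    ids = []
    for number, row in enumerate(rows, start=1):
        # Accept either column name, preferring `sample`.
        value = row.get("sample") or row.get("sample_id")
        if not value or not str(value).strip():
            raise ValueError(f"record {number}: missing sample or sample_id")
        ids.append(str(value).strip())
    return ids


# Demo on a temporary TSV file with the same layout as the example above.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "samples.tsv"
    demo.write_text("sample\tstudy_title\n4861STDY7135911\tStudy_Name\n")
    sample_ids = read_sample_ids(str(demo))
print(sample_ids)
```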

Option 2: Direct CRAM Processing (--crams)

Specify sample, cram_path, and fastq_prefix columns to directly process known CRAM files.

CSV format:

sample,cram_path,fastq_prefix
4861STDY7135911,/seq/24133/24133_1#4.cram,4861STDY7135911_S1_L001
4861STDY7135911,/seq/24133/24133_2#2.cram,4861STDY7135911_S1_L002

Option 3: FASTQ Upload (--fastqs)

Specify sample and path columns for FASTQ files to upload. Note: this requires a CSV file, not a directory path.

sample,path
4861STDY7135911,results/fastqs/4861STDY7135911/4861STDY7135911_S1_L001_I1_001.fastq.gz
4861STDY7135911,results/fastqs/4861STDY7135911/4861STDY7135911_S1_L001_R1_001.fastq.gz
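
Because --fastqs expects a CSV rather than a directory, FASTQ files that were produced outside this pipeline need a listing first. The following Python sketch builds one. It assumes the files sit in one subdirectory per sample, with the directory name used as the sample name; write_fastqs_csv is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def write_fastqs_csv(fastq_dir: str, csv_path: str) -> int:
    """Write a sample,path CSV for every *.fastq.gz under fastq_dir/{sample}/.

    Assumption: the sample name is the name of the directory holding the file.
    Returns the number of data rows written.
    """
    files = sorted(Path(fastq_dir).glob("*/*.fastq.gz"))
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "path"])
        for file in files:
            writer.writerow([file.parent.name, str(file)])
    return len(files)


# Demo on an empty, temporary directory tree that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "fastqs" / "4861STDY7135911"
    sample_dir.mkdir(parents=True)
    for read in ("I1", "R1"):
        (sample_dir / f"4861STDY7135911_S1_L001_{read}_001.fastq.gz").touch()
    csv_file = Path(tmp) / "fastqs.csv"
    n_rows = write_fastqs_csv(str(Path(tmp) / "fastqs"), str(csv_file))
    with csv_file.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
print(n_rows, [row["sample"] for row in rows])
```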

Examples

System Requirements Setup

Prepare your environment on Sanger's farm22:

module load cellgen/nextflow/24.10.0
module load cellgen/irods
module load cellgen/singularity
module load python-3.11.6
export LSB_DEFAULT_USERGROUP=<YOURGROUP>

Initialize iRODS connection:

iinit

Basic Usage Examples

1. Sample Metadata Discovery:

nextflow run main.nf --samples ./examples/samples.csv

This generates a metadata/ directory with:

metadata/
├── getmetadata.log     # warnings and processing information
└── metadata.tsv       # sample metadata from iRODS

2. CRAM-to-FASTQ Conversion:

nextflow run main.nf --cram2fastq --crams metadata/metadata.tsv

3. Complete Pipeline (Discovery + Conversion):

nextflow run main.nf --samples ./examples/samples.csv --cram2fastq

Note: The pipeline does not currently support end-to-end operation combining CRAM conversion with FTP upload in a single command. To upload converted FASTQ files, you must first run the conversion step, then use the generated fastqs.csv file for FTP upload in a separate command.

4. FTP Upload:

nextflow run main.nf --toftp --fastqs ./examples/fastqs.csv --username "annotare" --password "annotare1" --ftp_host "ftp-private.ebi.ac.uk" --ftp_path "/path/to/ftp/dir"

5. End-to-End Pipeline (two-step process):

# Step 1: Discovery and conversion
nextflow run main.nf --samples ./examples/samples.csv --cram2fastq

# Step 2: Upload the generated fastqs.csv (after step 1 completes)
nextflow run main.nf --toftp --fastqs ./results/fastqs.csv --username "annotare" --password "annotare1" --ftp_host "ftp-private.ebi.ac.uk" --ftp_path "/path/to/ftp/dir"
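
If you prefer to script the two steps, a thin wrapper can run them in order and stop when the conversion fails. The Python sketch below only assembles the two command lines shown above; build_commands and run_all are hypothetical helpers, and the credentials are placeholders that you would supply yourself.

```python
import subprocess


def build_commands(samples, ftp_path, username, password,
                   output_dir="results", ftp_host="ftp-private.ebi.ac.uk"):
    """Return the conversion and upload commands as argument lists."""
    convert = [
        "nextflow", "run", "main.nf",
        "--samples", samples,
        "--cram2fastq",
        "--output_dir", output_dir,
    ]
    upload = [
        "nextflow", "run", "main.nf",
        "--toftp",
        # fastqs.csv is written to the output directory by the conversion step.
        "--fastqs", f"{output_dir}/fastqs.csv",
        "--username", username,
        "--password", password,
        "--ftp_host", ftp_host,
        "--ftp_path", ftp_path,
    ]
    return [convert, upload]


def run_all(commands):
    """Run commands in order; stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the upload never starts after a failed conversion.
        subprocess.run(command, check=True)


# Placeholders, not real credentials.
commands = build_commands(
    "./examples/samples.csv", "/path/to/ftp/dir", "<USERNAME>", "<PASSWORD>"
)
for command in commands:
    print(" ".join(command))
# Call run_all(commands) on a machine where nextflow and iRODS are set up.
```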

Advanced Usage Examples

Custom Output Directory:

nextflow run main.nf \
    --samples ./examples/samples.csv \
    --cram2fastq \
    --output_dir "my_results"

Disable ATAC Formatting:

nextflow run main.nf \
    --samples ./examples/samples.csv \
    --cram2fastq \
    --format_atac false

Expected Output Structure

After Metadata Discovery:

metadata/
├── getmetadata.log
└── metadata.tsv

After CRAM-to-FASTQ Conversion:

results/
├── fastqs/
│   └── {sample}/
│       ├── {sample}_S1_L001_I1_001.fastq.gz
│       ├── {sample}_S1_L001_R1_001.fastq.gz
│       ├── {sample}_S1_L001_R2_001.fastq.gz
│       └── ...
├── fastqs.csv                    # Generated CSV file listing all FASTQ paths
└── metadata_final.tsv            # Final metadata file

After FTP Upload:

Additional files in results/:

├── concatenated/                  # Concatenated FASTQ files by sample
│   ├── {sample}_S1_I1_001.fastq.gz
│   ├── {sample}_S1_R1_001.fastq.gz
│   └── {sample}_S1_R2_001.fastq.gz
└── md5checksums.txt              # MD5 checksums of uploaded files
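
The concatenated file names above drop the lane field (L001, L002, ...) from the per-lane names. As a rough illustration of that grouping, the Python sketch below merges lanes by sample and read type. It relies on the fact that gzip files can be joined byte-for-byte into one valid gzip stream. The regular expression and the function concatenate_by_read are assumptions made for this example, not the pipeline's CONCATENATE_FASTQS implementation.

```python
import gzip
import re
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path

# {sample}_S1_L001_R1_001.fastq.gz -> prefix "{sample}_S1", read "R1"
LANE_NAME = re.compile(r"^(?P<prefix>.+)_L\d{3}_(?P<read>[IR]\d)_001\.fastq\.gz$")


def concatenate_by_read(fastq_dir: Path, out_dir: Path) -> list[Path]:
    """Join per-lane FASTQ files into one file per (prefix, read type)."""
    groups = defaultdict(list)
    for file in sorted(fastq_dir.glob("*.fastq.gz")):
        match = LANE_NAME.match(file.name)
        if match:
            groups[(match["prefix"], match["read"])].append(file)

    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for (prefix, read), lanes in sorted(groups.items()):
        target = out_dir / f"{prefix}_{read}_001.fastq.gz"
        with target.open("wb") as sink:
            for lane in lanes:  # already sorted, so L001 precedes L002
                with lane.open("rb") as source:
                    shutil.copyfileobj(source, sink)
        outputs.append(target)
    return outputs


# Demo with two tiny, made-up lanes for one sample.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "fastqs" / "sampleA"
    src.mkdir(parents=True)
    for lane, seq in (("L001", "ACGT"), ("L002", "TTGA")):
        with gzip.open(src / f"sampleA_S1_{lane}_R1_001.fastq.gz", "wt") as fh:
            fh.write(f"@read_{lane}\n{seq}\n+\nIIII\n")
    merged = concatenate_by_read(src, Path(tmp) / "concatenated")
    merged_names = [path.name for path in merged]
    with gzip.open(merged[0], "rt") as fh:
        merged_text = fh.read()
print(merged_names)
```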

System Requirements

  • Nextflow: Version 25.04.4 or higher
  • Singularity: For containerized execution
  • iRODS client: Access to iRODS commands (iget, imeta, etc.)
  • LSF: For job submission on HPC clusters (configured for Sanger's environment)

Error Handling

  • Invalid input files: Pipeline validates CSV/TSV headers and JSON structure
  • Missing samples: Warnings are logged for samples not found in iRODS
  • Missing required fields: Pipeline validates presence of required columns (sample/sample_id, cram_path, fastq_prefix)
  • Empty sample values: Pipeline checks for non-empty sample identifiers
  • Checksum verification: MD5 checksums are calculated for data integrity verification
  • FTP upload failures: Failed uploads are logged with detailed error messages
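
To verify a file independently of the pipeline, an MD5 digest can be computed in fixed-size chunks so that large FASTQ files never need to fit in memory. This is a generic Python sketch; the helper name md5sum is hypothetical, and the layout of the pipeline's md5checksums.txt is not assumed here.

```python
import hashlib
import tempfile
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: the chunked digest matches a one-shot digest of the same bytes.
payload = b"@read1\nACGT\n+\nIIII\n" * 1000
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq"
    demo.write_bytes(payload)
    checksum = md5sum(demo, chunk_size=4096)
print(checksum == hashlib.md5(payload).hexdigest())
```

Comparing the value computed locally with the one computed after transfer is enough to detect a corrupted upload.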

Monitoring and Logging

The pipeline generates comprehensive reports in the reports/ directory:

  • Timeline report: Visual timeline of task execution
  • Execution report: Detailed resource usage and performance metrics
  • Trace file: Complete execution trace for debugging

Pipeline Flow Diagram

---
title: Nextflow pipeline for retrieving CRAM files from iRODS and converting them to FASTQ
---
flowchart TB
    subgraph findcrams["IRODS_FINDCRAMS"]
        direction LR
        v0([IRODS_FIND])
        v1([IRODS_GETMETADATA])
        v2([makeFastqPrefix])
        v3([COMBINE_METADATA])
    end
    
    subgraph downloadcrams["IRODS_DOWNLOADCRAMS"]
        direction LR
        v4([IRODS_GETFILE])
        v5([CRAM2FASTQ])
        v6([COMBINE_METADATA])
    end
    
    subgraph fastq2ftp["FASTQS2FTP"]
        direction LR
        v7([CONCATENATE_FASTQS])
        v8([CALCULATE_MD5])
        v9([UPLOAD2FTP])
    end
    
    v0 --> v1 --> v2 --> v3
    v4 --> v5 --> v6
    v7 --> v8
    v7 --> v9
    
    findcrams -.-> downloadcrams -.-> fastq2ftp

Usage Notes

  • Only one input mode can be used per pipeline run (--samples, --crams, OR --fastqs)
  • When using --samples, the pipeline will automatically discover associated CRAM files in iRODS
  • Input files for --samples must contain either a sample or a sample_id column
  • The pipeline automatically handles 10X ATAC-seq specific file naming conventions
  • FASTQ files are concatenated by sample and read type for easier downstream processing
  • FTP uploads require both the --toftp flag and the --fastqs parameter pointing to a CSV file (not a directory)
  • End-to-end processing (CRAM conversion + FTP upload) requires two separate pipeline runs
  • Large CRAM files may take considerable time to download and convert depending on network bandwidth
  • The pipeline is optimized for batch processing of multiple samples simultaneously
  • The pipeline writes a fastqs.csv file to the output directory after CRAM conversion, which can be used for subsequent FTP uploads
