Pan-Canadian-Genome-Library/molecular-data-submission-workflow is a Nextflow pipeline that automates the validation, packaging, and submission of molecular genomics data and associated metadata to the Pan-Canadian Genome Library (PCGL) data repository. The pipeline ensures data integrity, validates metadata compliance, handles file uploads, and generates comprehensive submission receipts for tracking and audit purposes. The workflow adopts the nf-core framework and best-practice guidelines to ensure reproducibility, portability, and scalability.
For detailed information about workflow components, prerequisites, and system architecture, see the Complete Introduction Guide [TODO] for a comprehensive workflow overview, subworkflows, modules, and technical details.
The workflow consists of five main subworkflows:
- Dependency Checking - Validates input files, verifies metadata completeness, and ensures all submission prerequisites are met before processing
- Metadata Payload Generation - Transforms input metadata into standardized JSON payloads that comply with PCGL data model requirements
- Data Validation - Performs comprehensive validation of data file integrity, metadata format compliance, and cross-validation between data and metadata
- Data Uploading - Handles secure file transfer to object storage (file-transfer) and submits metadata to PCGL repositories (file-manager and clinical systems)
- Receipt Generation - Creates comprehensive batch receipts and summary reports for submission tracking, audit trails, and troubleshooting
System Requirements:

- Nextflow: Version 22.04.0 or newer with DSL2 support
- Container Engine: One of the following:
  - Docker (recommended)
  - Singularity/Apptainer
  - Conda (alternative, but containers are preferred)
- Java: Java 17 (or later, up to 24)
- Bash: Bash 3.2 (or later)
- Memory: Minimum 8GB RAM recommended
- Storage: Sufficient disk space for input data, intermediate files, and outputs
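The 8 GB memory recommendation can be checked before launching a run. A minimal sketch, Linux only since it reads `/proc/meminfo` (not available on macOS):

```shell
# Sketch: check the host against the 8 GB RAM recommendation.
# Linux only: /proc/meminfo does not exist on macOS.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_gb=$((mem_kb / 1024 / 1024))
if [ "$mem_gb" -ge 8 ]; then
  echo "memory OK: ${mem_gb} GiB"
else
  echo "warning: only ${mem_gb} GiB RAM detected"
fi
```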
- PCGL API Token: Valid authentication token with submission permissions for your study. [TODO: add links to PCGL API token]
- Network Access: Connectivity to the PCGL submission endpoints:
  - File Manager service
  - File Transfer service
  - Clinical submission service
- Study Registration: Your study must be registered in the PCGL system. Please contact PCGL Admin for more info.
- Participant Registration: The participants in your submission batch must be registered in the PCGL system.
- Biospecimen Entities: Please provide the metadata for any dependent Biospecimen Entities that have not yet been submitted.
Please refer to the Input Documentation [TODO] for comprehensive parameter descriptions and file format specifications.
- Molecular Data Files:
  - Supported formats: CRAM, BAM, VCF, BCF
  - Files must include appropriate index files (e.g., .crai, .bai, .tbi, .csi)
  - Files must be accessible from the specified `path_to_files_directory`
- Metadata Files:
  - Required: `file_metadata.tsv`, `analysis_metadata.tsv`
  - Optional:
    - Biospecimen: `specimen_metadata.tsv`, `sample_metadata.tsv`, `experiment_metadata.tsv`, `read_group_metadata.tsv`
    - Analysis: `workflow_metadata.tsv`
  - All metadata files must be in tab-separated (TSV) format
  - Files must comply with the custom data model of your study, which is the combination of the PCGL Base and Extensions Data Models. For the latest version of the PCGL Base Data Model, please see the latest release folder.
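A local pre-flight check along these lines can catch missing index files and malformed TSVs before a submission run. This is an illustrative sketch, not part of the workflow; the file names, column names, and directory below are invented for the example:

```shell
# Sketch of a local pre-flight check; file names here are examples only.
dir=$(mktemp -d)
touch "$dir/sampleA.bam" "$dir/sampleA.bam.bai"
touch "$dir/sampleB.cram"                      # index deliberately missing

# 1) Every BAM/CRAM should have a companion index file.
missing=0
for f in "$dir"/*.bam "$dir"/*.cram; do
  [ -e "$f" ] || continue                      # skip unmatched glob patterns
  case "$f" in
    *.bam)  idx="$f.bai"  ;;
    *.cram) idx="$f.crai" ;;
  esac
  [ -e "$idx" ] || { echo "missing index for $f"; missing=$((missing + 1)); }
done
echo "files missing an index: $missing"

# 2) Metadata TSVs should have a consistent column count on every row.
tsv="$dir/file_metadata.tsv"
printf 'fileName\tfileType\tfileMd5sum\n'  > "$tsv"
printf 'sampleA.bam\tBAM\tabc123\n'       >> "$tsv"
bad_rows=$(awk -F'\t' 'NR==1 {n=NF; next} NF!=n {c++} END {print c+0}' "$tsv")
echo "rows with inconsistent column count: $bad_rows"
```

The workflow's Data Validation subworkflow performs the authoritative checks; a sketch like this only saves a round trip when something obvious is missing.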
- Install Nextflow:
  Please refer to this page on how to set up Nextflow. Make sure to test your setup before running the workflow on actual data.
- Install a Container Engine:
  - Docker: Follow the Docker installation guide
  - Singularity: Follow the Singularity installation guide
- Verify Installation:

  ```shell
  nextflow info
  docker --version    # or: singularity --version
  ```

  > [!NOTE]
  > Please refer to the System Requirements section for the required versions.
- Obtain a PCGL API Token:
  - Contact your PCGL administrator or data coordinator to begin registration for API key access
  - Log in to CILogon with your credentials
  - Ensure the token has appropriate permissions for your study
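To keep the token out of your shell history, you can read it from a file at run time. This is a sketch only; the variable name `PCGL_TOKEN` and the placeholder value are illustrative, not workflow requirements:

```shell
# Sketch: load the API token from a file instead of typing it inline.
# The variable name PCGL_TOKEN and the token value are examples only.
tokfile=$(mktemp)
printf 'example-token-value' > "$tokfile"     # in practice, your real token
PCGL_TOKEN=$(cat "$tokfile")
echo "token loaded (${#PCGL_TOKEN} characters)"
```

The loaded value can then be passed to the pipeline as `--token "$PCGL_TOKEN"`.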
For more detailed usage instructions, including additional command-line examples and advanced configuration options, see the Complete Usage Guide [TODO].
First, prepare your data and metadata files in a data directory structure, e.g.:

```
input/
├── metadata/
│   ├── file_metadata.tsv          # Required: File information
│   ├── analysis_metadata.tsv      # Required: Analysis details
│   ├── workflow_metadata.tsv      # Optional: Workflow information
│   ├── read_group_metadata.tsv    # Optional: Read group details
│   ├── experiment_metadata.tsv    # Optional: Experiment information
│   ├── specimen_metadata.tsv      # Optional: Specimen details
│   └── sample_metadata.tsv        # Optional: Sample information
└── data/
    └── [your data files - CRAM, BAM, etc.]
```
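The skeleton of this layout can be created with a few commands; this sketch includes only the two required metadata files:

```shell
# Sketch: create the minimal input layout (required metadata files only).
root=$(mktemp -d)
mkdir -p "$root/input/metadata" "$root/input/data"
touch "$root/input/metadata/file_metadata.tsv" \
      "$root/input/metadata/analysis_metadata.tsv"
ls "$root/input"
```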
The following step will attempt to submit files and their corresponding analysis and file metadata. It assumes that the prior Submission Dependencies, such as Study, Participant, and Biospecimen entities (e.g., specimen, sample, experiment, or read_group), have all been registered. If you are not submitting sequencing reads, you should also provide the metadata of the workflow that generated the files.
```shell
nextflow run Pan-Canadian-Genome-Library/molecular-data-submission-workflow \
    --study_id "YOUR_STUDY_ID" \
    --token "YOUR_ACCESS_TOKEN" \
    --path_to_files_directory "/path/to/data" \
    --file_metadata "/path/to/file_metadata.tsv" \
    --analysis_metadata "/path/to/analysis_metadata.tsv" \
    --outdir results \
    -profile docker,sd4h_prod
```
The following step will attempt to submit files together with their corresponding analysis and biospecimen metadata. New records will be registered. Existing records will be verified and flagged if discrepancies are found; otherwise they are cross-referenced and skipped.
```shell
nextflow run Pan-Canadian-Genome-Library/molecular-data-submission-workflow \
    --study_id "YOUR_STUDY_ID" \
    --token "YOUR_ACCESS_TOKEN" \
    --path_to_files_directory "/path/to/data" \
    --file_metadata "metadata/file_metadata.tsv" \
    --analysis_metadata "metadata/analysis_metadata.tsv" \
    --workflow_metadata "metadata/workflow_metadata.tsv" \
    --read_group_metadata "metadata/read_group_metadata.tsv" \
    --experiment_metadata "metadata/experiment_metadata.tsv" \
    --specimen_metadata "metadata/specimen_metadata.tsv" \
    --sample_metadata "metadata/sample_metadata.tsv" \
    --outdir results \
    -profile docker,sd4h_prod
```
The workflow generates comprehensive outputs to track and verify your data submission. For detailed information about output files and how to interpret results, see:
- Output Documentation [TODO] - Complete output structure and file descriptions
- Receipt Guide - How to understand and interpret batch receipts
- Batch Receipt Files (`<outdir>/receipt_aggregate/`):
  - `<batch_id>_batch_receipt.json` - Submission summary in JSON format
  - `<batch_id>_batch_receipt.tsv` - Submission summary in tabular format
- Successful submissions receive unique PCGL analysis IDs for tracking
- Failed submissions include detailed error messages for troubleshooting
- Batch receipts contain complete submission history and metadata for audit purposes
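A batch receipt can be grepped for failures before digging into the full report. The JSON shape below (an `analyses` array with `analysisId` and `status` fields) is assumed for illustration only; check the Receipt Guide for the actual schema:

```shell
# Sketch: count failed entries in a batch receipt JSON.
# The field names and values here are assumed, not the real PCGL schema.
receipt=$(mktemp)
cat > "$receipt" <<'EOF'
{"analyses":[{"analysisId":"AN-1","status":"SUCCESS"},
             {"analysisId":"AN-2","status":"FAILED"}]}
EOF
failed=$(grep -c '"status":"FAILED"' "$receipt")
echo "failed analyses: $failed"
```

For anything beyond a quick count, the `<batch_id>_batch_receipt.tsv` file is easier to filter in a spreadsheet or with `awk`.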
Before running the pipeline on your data, we recommend testing it with the provided test datasets:
- The Testing Guide provides comprehensive instructions on how to test the workflow with the included test datasets
- Test datasets are provided in the `tests/test_data/` directory
- Both minimal and comprehensive testing scenarios are covered
Pan-Canadian-Genome-Library/molecular-data-submission-workflow was originally written by Edmund Su and Linda Xiang.
We thank the following people for their extensive assistance in the development of this pipeline:
If you would like to contribute to this pipeline, please see the contributing guidelines [TODO].
For troubleshooting common issues, error resolution, and getting help:
- Troubleshooting Guide [TODO] - Common problems, solutions, and debugging tips
If you encounter issues not covered in the troubleshooting guide, please:
- Check the GitHub Issues for existing solutions
- Create a new issue with detailed error messages and system information
- Contact the PCGL administrator or data coordinator
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
> The nf-core framework for community-curated bioinformatics pipelines.
>
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
>
> Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.