NationalGenomicsInfrastructure/NGI_Element_benchmark

 
 

Element vs Illumina benchmark

This repository contains resources related to the article:

Whole genome sequencing with AVITI and NovaSeq X Plus reveals comparable performance with contextual biases

Pontus Höjer, Johannes Alneberg, Pär Lundin, Tom Martin, Julia Hauenstein, Helena Fällmar, Magnus Lindell, Christian Natanaelsson, Susana Häggqvist, Adam Ameur, Jessica Nordlund, Robert Månsson Welinder

bioRxiv 2025.10.10.681584; doi: https://doi.org/10.1101/2025.10.10.681584

Please use this to cite this work.

Data organisation

  • data/wgs: Raw Element and Illumina FASTQs and PacBio BAMs are organized here. To download the data, see the Data download section.
  • analysis: Analysis runs on data, either using Snakemake scripts or existing Nextflow nf-core/sarek workflows
  • resources: Genome annotations and reference data
  • notebooks: Jupyter notebooks for data processing and visualization
  • scripts: Python scripts used in Snakemake workflows
  • figures/svg: SVG figures
  • envs: Environment-related files

Code to figure/table

The Jupyter notebook and workflow folder (Snakemake or Nextflow/nf-core) used to generate each figure or table are listed below:

| Figure/Table | Notebook | Workflow dir |
| --- | --- | --- |
| Figure 1b | duplicates.ipynb | analysis/nfcore_sarek_rerun |
| Figure 1c | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Figure 1d | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Figure 2a | differential_coverage.ipynb | analysis/differential_coverage |
| Figure 2b | variant_calling_benchmarks_allchr.ipynb | analysis/variant_call_benchmarking_allchr |
| Figure 2c | variant_calling_benchmarks_allchr.ipynb | analysis/variant_call_benchmarking_allchr |
| Figure 2d | variant_calling_benchmarks_allchr.ipynb | analysis/variant_call_benchmarking_allchr |
| Figure 3a | samtools_stats_per_read.ipynb | analysis/error_rate |
| Figure 3b | fraguracy_error_rate.ipynb | analysis/error_rate |
| Figure 3c | samtools_stats_per_read_insert_size.ipynb | analysis/fragment_length_qual_dependence |
| Figure 3d | compare_read_stack_multiple.ipynb | analysis/stack_reads |
| Figure 4b | g4_soft_clipped.ipynb | analysis/soft_clipped |
| Figure 4c | stratification_error_rate.ipynb | analysis/stratification_error_rate |
| Figure 4d | compare_read_stack_multiple.ipynb | analysis/stack_reads |
| Figure 4e | stratification_error_rate.ipynb | analysis/stratification_error_rate |
| Supplementary Table 2 | | analysis/illumina_dups_per_lane |
| Supplementary Table 5 | g4_overlap.ipynb | analysis/g4_overlap |
| Supplementary Figure 2 | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Supplementary Figure 3 | duplicates.ipynb | analysis/chr20_duplicate |
| Supplementary Figure 4 | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Supplementary Figure 5 | samtools_stats_per_read.ipynb | analysis/error_rate |
| Supplementary Figure 6 | samtools_stats_per_read_public.ipynb | analysis/public_data |
| Supplementary Figure 7 | fraguracy_error_rate.ipynb | analysis/error_rate |

Reproducing analysis

To reproduce the analysis done for this work, perform the following steps:

  1. Clone (git clone ...) this repository
  2. Download Element/Illumina FASTQs and PacBio BAMs
  3. Prepare environments and software to run the analysis, see Environment setup and configuration
  4. Download and prepare resource files, e.g. the reference genome
  5. Run nf-core/sarek
  6. Run secondary analysis using Snakemake workflows

Data download

Illumina + Element FASTQ download

FASTQs are publicly available on ENA: https://www.ebi.ac.uk/ena/browser/view/PRJEB90663

To download the short-read FASTQs, run the bash script download_fastqs.sh:

bash download_fastqs.sh

The script downloads the FASTQs into the folder structure required to run nf-core/sarek. Information about the libraries and their indexing can be found in libraries.md.
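Large downloads are occasionally truncated, so it can be worth testing the compressed files before starting a long pipeline run. The following is a minimal sketch, not part of this repository: the function name check_fastqs, the default directory data/wgs, and the *.fastq.gz / *.fq.gz file patterns are assumptions.

```shell
# check_fastqs DIR: run `gzip -t` on every gzip-compressed FASTQ below DIR.
# Hypothetical helper; file name patterns are assumed, adjust as needed.
check_fastqs() {
    dir="${1:-data/wgs}"
    # "! -exec ... -print" prints only the files for which gzip -t fails
    bad=$(find "$dir" -type f \( -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        ! -exec gzip -t {} \; -print 2>/dev/null)
    if [ -n "$bad" ]; then
        echo "Corrupt or truncated files:"
        echo "$bad"
        return 1
    fi
    echo "All FASTQs under $dir passed gzip -t"
}
```

Calling check_fastqs data/wgs from the repository root returns a non-zero status if any file fails the test. It only checks the gzip stream, not the FASTQ content.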

PacBio BAM download

Aligned BAMs are available on ENA: https://www.ebi.ac.uk/ena/browser/view/PRJEB95775

A download script is provided at data/wgs/PacBio_HiFi_BAMs/download_pacbio_bam_ena.sh

Run it from within that directory, i.e.

cd data/wgs/PacBio_HiFi_BAMs
bash download_pacbio_bam_ena.sh

Public Element FreeStyle dataset

FASTQs can be downloaded from here: https://go.elementbiosciences.com/human-whole-genome-sequencing-third-party (dataset JM-L825-HG002). Place the FASTQs in the data/wgs/Element_Freestyle folder.

Environment setup and configuration

Workflow dependencies

The Snakemake workflows require Snakemake v8.20.1 to run (later versions may also work); see the Snakemake documentation.

The nf-core/sarek pipeline is based on Nextflow; see the Nextflow documentation for setup. Nextflow v24.10.4 was used in this work.
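A quick way to compare an installed version against these requirements is sketched below. The helper names version_ge and report_versions are hypothetical, and the comparison relies on `sort -V` (version sort), which is available in GNU coreutils but may be missing on other systems.

```shell
# version_ge A B: succeeds when version A is equal to or newer than version B.
# Assumes a `sort` that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report_versions: print the installed Snakemake and Nextflow versions.
# Defined only; call it manually once both tools are on PATH.
report_versions() {
    found=$(snakemake --version)
    if version_ge "$found" 8.20.1; then
        echo "snakemake $found: OK"
    else
        echo "snakemake $found: older than 8.20.1"
    fi
    nextflow -v    # compare by eye against v24.10.4
}
```

Version sort is used rather than a plain string comparison because, for example, 8.3.0 is older than 8.20.1 even though it sorts later alphabetically.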

Executables

Some of the Snakemake workflows require executables to be downloaded and added to the configuration.

Download the required executables below

There is also a utility script envs/download_containers_executables.sh to download executables and containers.

Add absolute paths to snakemake_config.yaml

Containers

Here is a list of containers required to run the Snakemake workflows

Pull images from the URLs or tag using singularity

singularity pull <image.img> <url/tag>

or Apptainer

apptainer pull <image.img> <url/tag>

There is also a utility script, envs/download_containers_executables.sh, to download executables and containers. Choose Apptainer or Singularity with the -t flag, e.g. bash download_containers_executables.sh -t apptainer. The script downloads to the current directory.

Finally, add the absolute paths to the config snakemake_config.yaml
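To collect the absolute paths for the config, something like the following sketch could be used. The function name list_abs_paths is hypothetical, and the .img and .sif extensions are assumptions based on the image names used in this README. The keys expected in snakemake_config.yaml are not shown here, so the paths still have to be copied in by hand.

```shell
# list_abs_paths DIR: print the absolute path of every container image in DIR.
# Hypothetical helper; the .img/.sif extensions are assumed.
list_abs_paths() {
    dir="${1:-.}"
    for f in "$dir"/*.img "$dir"/*.sif; do
        [ -e "$f" ] || continue    # skip patterns that matched nothing
        realpath "$f"
    done
}
```

Running list_abs_paths in the download directory prints one path per image, ready to paste into the config.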

Avidity Manuscript environment container

The envs/avidity folder contains a conda environment YAML and a Singularity definition file for building the container required by the following Snakemake analysis workflows:

  • analysis/chr20_duplicates/Snakefile
  • analysis/stack_reads/Snakefile

To generate the required container using Singularity

cd envs/avidity
singularity build avidity.sif Singularity.def

or Apptainer

cd envs/avidity
apptainer build avidity.sif Singularity.def

Add the container image path to the snakemake_config.yaml config.

Jupyter notebook execution

Jupyter notebooks in the notebooks folder were executed using a conda environment defined in envs/environment.yml. Create the conda environment using:

conda env create -f envs/environment.yml

Resources

Genome download and setup

A utility Snakemake workflow downloads the reference genome and generates the required files. Snakemake v8.20.1 or later is required to run it, and the samtools container (see Containers) is needed for indexing.

cd resources/GRCh38_GIABv3/
snakemake -j 4 -k -p --use-singularity

The FASTA must be un-bgzipped due to a sarek issue (see nf-core/sarek#1741).

GIAB stratifications

Genome in a Bottle (GIAB) stratifications (v3.5) can be downloaded using the link below:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/genome-stratifications-GRCh38@all.tar.gz

To download and extract the data into the resources folder, run:

wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/genome-stratifications-GRCh38@all.tar.gz
tar -xzf genome-stratifications-GRCh38@all.tar.gz -C resources

Avidity codebase

Some of the analysis relies on scripts developed for the paper Arslan et al. Sequencing by avidity enables high accuracy with low reagent consumption. Nat Biotechnol 42, 132–138 (2024).

These are available at this repository: https://github.com/Elembio/AvidityManuscript2023 (specifically commit 701be395c892d00beca69693536ad600d209eec2).

Clone this into the scripts folder using the following command:

git clone --revision=701be395c892d00beca69693536ad600d209eec2 --depth=1 https://github.com/Elembio/AvidityManuscript2023.git scripts/AvidityManuscript2023
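The --revision option of git clone appears to be available only in recent Git releases (2.49 and later, as far as we can tell). For older Git versions, a shallow fetch of the single commit is a possible fallback. The function name fetch_commit below is hypothetical, and the approach assumes the server allows fetching a commit by its SHA.

```shell
# fetch_commit URL SHA DEST: shallow-fetch one commit into DEST and check it out.
# Hypothetical fallback for Git versions without `git clone --revision`.
fetch_commit() {
    url="$1"; sha="$2"; dest="$3"
    git init -q "$dest" &&
    git -C "$dest" remote add origin "$url" &&
    git -C "$dest" fetch -q --depth=1 origin "$sha" &&
    git -C "$dest" checkout -q FETCH_HEAD    # leaves a detached HEAD at SHA
}

# Usage from the repository root (not executed here):
# fetch_commit https://github.com/Elembio/AvidityManuscript2023.git \
#     701be395c892d00beca69693536ad600d209eec2 scripts/AvidityManuscript2023
```

Either route should leave the same commit checked out in scripts/AvidityManuscript2023.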

Running workflows

Running nf-core/sarek on Element/Illumina data

Nextflow v24.10.4 is required for running sarek.

Download the FASTQs and genome reference.

The configs for running sarek are found in the analysis/nfcore_sarek_rerun folder with one subfolder for each data source.

├── aviti_hq   # Element AVITI CB
├── aviti_ngi  # Element AVITI CB FS
└── xplus_sns  # Illumina NovaSeq XPLUS 

Within each is a *params_relpath.yaml file defining the necessary parameters. Do not use the *params.yaml files, as they are configured for a specific HPC resource.

The nextflow_no_tower.config configures the FASTP step to cap reads to 150 bp but perform no other trimming.

Modify the command below with your profile of choice (e.g. singularity) and run it within the subfolder containing the corresponding parameter YAML, e.g.

cd analysis/nfcore_sarek_rerun/xplus_sns
nextflow run nf-core/sarek -r 3.4.2 -profile <profile> -c ../nextflow_no_tower.config -params-file xplus_sns_params_relpath.yaml
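To run all three data sources in sequence, the command above could be wrapped in a loop. This is a sketch under assumptions: the helper names find_params and run_all_sarek are hypothetical, and each subfolder is assumed to contain exactly one file matching *params_relpath.yaml (only the xplus_sns file name is confirmed above).

```shell
# find_params DIR: print the name of the single *params_relpath.yaml in DIR.
# Fails if there is no such file or more than one.
find_params() {
    set -- "$1"/*params_relpath.yaml    # expand the glob into positional args
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "expected exactly one *params_relpath.yaml" >&2
        return 1
    fi
    basename "$1"
}

# run_all_sarek PROFILE: run sarek in each data-source subfolder in turn.
# Must be called from the repository root; stops at the first failure.
run_all_sarek() {
    profile="$1"
    for source in aviti_hq aviti_ngi xplus_sns; do
        (
            cd "analysis/nfcore_sarek_rerun/$source" || exit 1
            params=$(find_params .) || exit 1
            nextflow run nf-core/sarek -r 3.4.2 -profile "$profile" \
                -c ../nextflow_no_tower.config -params-file "$params"
        ) || return 1
    done
}
```

For example, run_all_sarek singularity would process aviti_hq, aviti_ngi and xplus_sns one after another. Each run happens in a subshell so the working directory is restored between data sources.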

Running nf-core/sarek on public Element FreeStyle data

Besides the data generated for this study, we also relied on public data, see e.g. analysis/public_data. One of the datasets requires mapping; to download it, see Public Element FreeStyle dataset.

Mapping was performed using nf-core/sarek, and the parameters, configs and samplesheets are found in the analysis/nfcore_sarek_public folder. Follow the instructions in Running nf-core/sarek on Element/Illumina data.

Running secondary snakemake analysis

Snakefiles are found in the analysis folder and require Snakemake v8.20.1 or later to run. Most Snakemake workflows require nf-core/sarek to be run first.

Configurations for the required resource files, executables and containers are specified in a YAML file: snakemake_config.yaml. Make sure all paths are updated before you run.
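Before launching a workflow, the paths in the config can be checked for existence. The sketch below is not part of this repository: the function name check_config_paths is hypothetical, and it assumes simple `key: /absolute/path` lines without spaces in the paths, which may not match the actual layout of snakemake_config.yaml.

```shell
# check_config_paths FILE: report absolute paths in a YAML file that do not exist.
# Hypothetical helper; only handles simple "key: /abs/path" lines.
check_config_paths() {
    config="${1:-snakemake_config.yaml}"
    missing=0
    # Take the value after the first ": ", drop trailing comments and quotes,
    # and keep only values that start with "/"
    paths=$(awk -F': *' '!/^[ \t]*#/ && NF >= 2 {
        sub(/ *#.*/, "", $2)
        gsub(/["'\'' ]/, "", $2)
        if ($2 ~ /^\//) print $2
    }' "$config")
    for p in $paths; do
        if [ ! -e "$p" ]; then
            echo "MISSING: $p"
            missing=1
        fi
    done
    return "$missing"
}
```

The function returns a non-zero status if any listed path is missing, so it can be used as a guard in front of a snakemake call.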
