This repository contains resources related to the article:
Whole genome sequencing with AVITI and NovaSeq X Plus reveals comparable performance with contextual biases
Pontus Höjer, Johannes Alneberg, Pär Lundin, Tom Martin, Julia Hauenstein, Helena Fällmar, Magnus Lindell, Christian Natanaelsson, Susana Häggqvist, Adam Ameur, Jessica Nordlund, Robert Månsson Welinder
bioRxiv 2025.10.10.681584; doi: https://doi.org/10.1101/2025.10.10.681584
Please use this to cite this work.
data/wgs: Raw Element & Illumina FASTQs and PacBio BAMs organized under here. To download data, see Data download section.analysis: Analysis runs on data, either using Snakemake scripts or existing Nextflow nf-core/sarek workflowsresources: Genome annotations and reference datanotebooks: Jupyter notebooks for data processing and visualizationscripts: Python scripts used in Snakemake workflowsfigures/svg: SVG figuresenv: Environment related files
Jupyter notebooks with workflow folder (Snakemake or Nextflow/nf-core) used to generate each figure/table is specified below:
| Figure/Table | Notebook | Workflow dir |
|---|---|---|
| Figure 1b | duplicates.ipynb | analysis/nfcore_sarek_rerun |
| Figure 1c | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Figure 1d | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Figure 2a | differential_coverage.ipynb | analysis/differential_coverage |
| Figure 2b | variant_calling_benchmarks_allchr.ipynb | analysis/variant_call_benchmarking_allchr |
| Figure 2c | variant_calling_benchmarks_allchr.ipynb | analysis/variant_call_benchmarking_allchr |
| Figure 2d | variant_calling_benchmarks_allchr.ipynb | analysis/variant_call_benchmarking_allchr |
| Figure 3a | samtools_stats_per_read.ipynb | analysis/error_rate |
| Figure 3b | fraguracy_error_rate.ipynb | analysis/error_rate |
| Figure 3c | samtools_stats_per_read_insert_size.ipynb | analysis/fragment_length_qual_dependence |
| Figure 3d | compare_read_stack_multiple.ipynb | analysis/stack_reads |
| Figure 4b | g4_soft_clipped.ipynb | analysis/soft_clipped |
| Figure 4c | stratification_error_rate.ipynb | analysis/stratification_error_rate |
| Figure 4d | compare_read_stack_multiple.ipynb | analysis/stack_reads |
| Figure 4e | stratification_error_rate.ipynb | analysis/stratification_error_rate |
| Supplementary Table 2 | analysis/illumina_dups_per_lane |
|
| Supplementary Table 5 | g4_overlap.ipynb | analysis/g4_overlap |
| Supplementary Figure 2 | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Supplementary Figure 3 | duplicates.ipynb | analysis/chr20_duplicate |
| Supplementary Figure 4 | samtools_stats_all.ipynb | analysis/nfcore_sarek_rerun |
| Supplementary Figure 5 | samtools_stats_per_read.ipynb | analysis/error_rate |
| Supplementary Figure 6 | samtools_stats_per_read_public.ipynb | analysis/public_data |
| Supplementary Figure 7 | fraguracy_error_rate.ipynb | analysis/error_rate |
To reproduce the analysis done for this work perform the following steps
- Clone (
git clone ...) this repository - Download Element/Illumina FASTQs and PacBio BAMs
- Prepare enviroments and software to run analysis, see Environment setup and configuration
- Download and prepare resource files e.g. genome.
- Run nf-core/sarek
- Run secondary analysis using snakemake worflows
FASTQs are publically available on ENA: https://www.ebi.ac.uk/ena/browser/view/PRJEB90663
To download the short read fastqs there is a bash script download_fastqs.sh. I.e run:
bash download_fastqs.sh
The script will download FASTQs in the same folder structure to be able to run nf-core sarek. Information about the libraries and their indexing can be found here: libraries.md
Aligned BAMs are available on ENA: https://www.ebi.ac.uk/ena/browser/view/PRJEB95775
There is a download bash script in data/wgs/PacBio_HiFi_BAMs/download_pacbio_bam_ena.sh
Run this within the directory i.e.
cd data/wgs/PacBio_HiFi_BAMs
bash download_pacbio_bam_ena.sh
FASTQs be downloaded from here: https://go.elementbiosciences.com/human-whole-genome-sequencing-third-party (dataset JM-L825-HG002). Place the FASTQs in the data/wgs/Element_Freestyle
folder.
The Snakemake workflows require snakemake v8.20.1 (later version may also work) to run, see documentation here
The nf-core/sarek pipeline is based on Nextflow, see documentation to setup Nextflow. Nextflow v24.10.4 was used in this work.
Some of the Snakemake workflows requires executables to be downloaded and added to the configurations.
Download the required executables below
- mosdepth_d4: https://github.com/brentp/mosdepth/releases/download/v0.3.10/mosdepth_d4
- fraguracy: https://github.com/brentp/fraguracy/releases/download/v0.2.4/fraguracy
There is also a utility script envs/download_containers_executables.sh to download executables and containers.
Add absolute paths to snakemake_config.yaml
Here is a list of containers required to run the Snakemake workflows
- bcftools: https://depot.galaxyproject.org/singularity/bcftools:1.18--h8b25389_0
- bedtools_2.31.1: https://depot.galaxyproject.org/singularity/bedtools:2.31.1--hf5e1c6e_2
- bedtools: https://depot.galaxyproject.org/singularity/bedtools:2.30.0--hc088bd4_0
- d4tools: docker://clinicalgenomics/d4tools:2.0
- deepvariant: docker://quay.io/nf-core/deepvariant:1.5.0
- gatk: https://depot.galaxyproject.org/singularity/gatk4:4.5.0.0--py36hdfd78af_0
- happy: docker://community.wave.seqera.io/library/hap.py_rtg-tools:2ebb433f3ce976d3
- mosdepth: https://depot.galaxyproject.org/singularity/mosdepth:0.3.8--hd299d5a_0
- multiqc: https://depot.galaxyproject.org/singularity/multiqc:1.21--pyhdfd78af_0
- pandas: https://depot.galaxyproject.org/singularity/pandas:1.5.2
- picard: https://depot.galaxyproject.org/singularity/picard:3.0.0--hdfd78af_1
- rtgtools: docker://realtimegenomics/rtg-tools:3.12.1
- samtools: https://depot.galaxyproject.org/singularity/samtools:1.19.2--h50ea8bc_0
Pull images from the URLs or tag using singularity
singularity pull <image.img> <url/tag>
or Apptainer
apptainer pull <image.img> <url/tag>
There is also a utility script envs/download_containers_executables.sh to download executables and containers, apptainer or singularity is specified using the -t flag, e.g. bash download_containers_executables.sh -t apptainer. The script downloads to the current directory.
Finally add the absolute paths to the config snakemake_config.yaml
In the envs/avidity folder is a conda environment YAML and singularity definition file to generate a container required for running the following snakemake analysis workflows.
analysis/chr20_duplicates/Snakefileanalysis/stack_reads/Snakefile
To generate the required container using Singularity
cd envs/avidity
singularity build avidity.sif Singularity.def
or Apptainer
cd envs/avidity
apptainer build avidity.sif Singularity.def
Add the container image path to the snakemake_config.yaml config.
Jupyter notebooks in the folder notebooks were executed using a conda as defined in envs/environment.yml. Create the conda environment using:
conda env create -f envs/environment.yml
The reference genome can be downloaded and generate required files using a utility Snakemake workflow. snakemake v8.20.1 or later is required to run and the samtools container (see Containers) is needed for indexing.
cd resources/GRCh38_GIABv3/
snakemake -j 4 -k -p --use-singularity
The FASTA must be un-bgzipped due to a sarek issue (see nf-core/sarek#1741).
Genome in a bottle (GIAB) stratifications (v3.5) can be downloaded using the link below
To download and extract the data in to the resource folder run:
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/genome-stratifications-GRCh38@all.tar.gz
tar -xzf genome-stratifications-GRCh38@all.tar.gz -C resources
Some of the analysis relies on scripts developed for the paper Arslan et al. Sequencing by avidity enables high accuracy with low reagent consumption. Nat Biotechnol 42, 132–138 (2024).
These are available at this repository: https://github.com/Elembio/AvidityManuscript2023 (specifically commit 701be395c892d00beca69693536ad600d209eec2).
Clone this into the scripts folder using the following command:
git clone --revision=701be395c892d00beca69693536ad600d209eec2 --depth=1 https://github.com/Elembio/AvidityManuscript2023.git scripts/AvidityManuscript2023
Nextflow v24.10.4 is required for running sarek.
Download the FASTQs and genome reference.
The configs for running sarek are found in the analysis/nfcore_sarek_rerun folder with one subfolder for each data source.
├── aviti_hq # Element AVITI CB
├── aviti_ngi # Element AVITI CB FS
└── xplus_sns # Illumina NovaSeq XPLUS
Within each is as *params_relpath.yaml defining the necessary parameters. Do not use the *params.yaml files as they are configured for a specific HPC resource.
The nextflow_no_tower.config configures the FASTP step to cap reads to 150 bp but perform no other trimming.
Modify the command below with your profile for choise (e.g. singularity) and run within the subfolder with the corresponding parameter YAML. E.g.
cd analysis/nfcore_sarek_rerun/xplus_sns
nextflow run nf-core/sarek -r 3.4.2 -profile <profile> -c ../nextflow_no_tower.config -params-file xplus_sns_params_relpath.yaml
Besides the data generated to this study we also relied on public data, see e.g. analysis/public_data. One of the datasets require mapping, to download see here.
Mapping was performed using nf-core/sarek and parameters/configs/samplesheets are found in the analysis/nfcore_sarek_public folder. Follow instructions from here
Snakefiles are found in the analysis folder and requires snakemake v8.20.1 or later to run. Most snakemake workflows requires nf-core/sarek to be run first.
Configurations for resource files, executables and containers needed are specified in a YAML: snakemake_config.yaml. Make sure all paths are updated before you run.