Assemble Hepatitis C virus genomes from Illumina enrichment sequencing

About HCVTyper

folkehelseinstituttet/hcvtyper is a bioinformatics pipeline used at the Norwegian Institute of Public Health. It is designed for highly variable viruses, and for viruses that are likely to appear as co-infections between multiple strains, such as Hepatitis C virus. The pipeline identifies the most likely major and minor strains in a sample sequenced on the Illumina platform, maps the reads to these references using Bowtie2, and creates consensus sequences. For Hepatitis C virus the pipeline can also run a GLUE analysis to identify drug resistance mutations.

Requirements

The pipeline only requires Nextflow and Docker in order to run. Note that you must be able to run Docker as a non-root user as described here.

Important

HCV-GLUE is currently only available with the Docker profile. We recommend that you always run the pipeline with Docker.

Run the pipeline

Beyond Nextflow and Docker, the pipeline does not require any installation, only an internet connection. The pipeline is typically run with the following command:

nextflow run folkehelseinstituttet/hcvtyper -r v1.1.3 \
    --input samplesheet.csv \
    --outdir <OUTDIR> \
    -profile docker

Nextflow automatically pulls the pipeline from the GitHub repository when it is launched. Here, release v1.1.3 is downloaded and run. If you omit -r, the code from the master branch is used, but we recommend always specifying a branch or release with -r.

If you want to download a local copy of the pipeline you can run:

nextflow pull folkehelseinstituttet/hcvtyper -r v1.0.6

Again, -r is optional.

Test the pipeline

To run a minimal test:

nextflow run folkehelseinstituttet/hcvtyper -profile docker,test

This only checks that you can get the pipeline up and running; it does not execute all steps, such as HCV-GLUE. The results will be in a directory called minimal_test.

To run a full test on a real dataset, type:

# First download the test dataset using nf-core/fetchngs
nextflow run nf-core/fetchngs -profile docker --input 'https://raw.githubusercontent.com/folkehelseinstituttet/hcvtyper/refs/heads/dev/assets/test_ids.csv' --outdir full_test

# Then run the pipeline on the downloaded dataset
nextflow run folkehelseinstituttet/hcvtyper -profile docker,test_full

This will download an HCV Illumina dataset from the SRA and run the entire pipeline. The results will be in a directory called full_test. Note that the pipeline will by default download and use the Kraken 2 PlusPFP-8 database. This requires at least 5 GB of free disk space and takes a few minutes to download and unpack. In addition, the default requirements of 12 CPUs and 72 GB of memory have been overridden to 8 CPUs and 50 GB for this test.
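If you need a similar override for your own runs, resource caps can be set in a custom config file passed with -c. A minimal sketch (the resourceLimits directive requires Nextflow 24.04 or newer; the values shown are examples only):

```groovy
// custom.config -- cap per-process resource requests (example values)
process {
    resourceLimits = [ cpus: 8, memory: 50.GB ]
}
```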

Required parameters

Samplesheet input

You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. It has to be a comma-separated file with three columns and a header row, as shown below. Sample names can contain numbers and underscores (_), but not spaces, dots (.) or other symbols. The fastq_1 and fastq_2 columns must contain the full paths to the gzipped, paired fastq files for the sample.

sample,fastq_1,fastq_2
Sample_1,/path/to/sample1_fastq_R1.fastq.gz,/path/to/sample1_fastq_R2.fastq.gz
Sample_2,/path/to/sample2_fastq_R1.fastq.gz,/path/to/sample2_fastq_R2.fastq.gz

The samplesheet is passed to the pipeline with the --input parameter, e.g.: --input assets/samplesheet_illumina.csv

An example samplesheet has been provided with the pipeline in the assets directory.

File naming requirements:

  • FASTQ files should be gzipped and paired-end
  • Files should follow the naming pattern: *_R1.fastq.gz and *_R2.fastq.gz (or similar R1/R2 designation)
  • All FASTQ files for a project should be organized in a single directory or subdirectories

Creating a samplesheet automatically: If you have many samples, you can use the provided Docker container to generate a samplesheet from a directory containing FASTQ files. The paired fastq files can be in subdirectories; in that case, point to the directory above them. You must also point to an existing directory where the samplesheet should be written. The container can be run like this:

# Generate samplesheet from a directory containing FASTQ files
docker run --rm \
    -v /path/to/fastq/directory:/data \
    -v /path/to/output/directory:/out \
    ghcr.io/jonbra/viralseq_utils:latest \
    /data /out/samplesheet.csv
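For simple directory layouts, a rough shell alternative (not part of the pipeline) can build the same CSV, assuming the *_R1.fastq.gz / *_R2.fastq.gz naming convention described above:

```shell
# Sketch only: build a samplesheet from paired *_R1.fastq.gz files.
# Set FASTQ_DIR to the directory containing the fastq files first.
FASTQ_DIR=${FASTQ_DIR:-/path/to/fastq/directory}
{
  echo "sample,fastq_1,fastq_2"
  find "$FASTQ_DIR" -name '*_R1.fastq.gz' 2>/dev/null | sort | while read -r r1; do
      r2=${r1%_R1.fastq.gz}_R2.fastq.gz          # expected mate file
      sample=$(basename "$r1" _R1.fastq.gz)      # sample name from file name
      [ -f "$r2" ] && echo "$sample,$r1,$r2"     # emit only complete pairs
  done
} > samplesheet.csv
```

The Docker container above remains the supported route; this loop is only a convenience for flat or simple directory structures.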

Output directory

The output directory is specified using the --outdir parameter, e.g.: --outdir results

Profiles

The pipeline can be run with different profiles, which determine how it is executed. The default profile is docker, which runs the pipeline in Docker containers. You can also use the singularity or conda profiles if you prefer those environments. Set the profile with the -profile parameter, e.g.: -profile docker, -profile singularity or -profile conda.

Provide parameters in a file

The different parameters can be provided in a file using the argument -params-file path/to/params-file.yml. The file can be either YAML-formatted:

input: 'samplesheet.csv'
outdir: 'results'

or JSON-formatted:

{
  "input": "samplesheet.csv",
  "outdir": "results"
}

Optional parameters

Kraken2 databases

The pipeline uses Kraken2 for two purposes. One is to classify the reads against a general database to get a broad overview of the taxonomic diversity within the sample (e.g., are there a lot of human reads?). The second is to classify the reads against a specific HCV-database and then use only the classified reads for the rest of the pipeline. This is done to reduce the computational load and time needed to run mapping and de novo assembly.

By default, the pipeline will download and use the PlusPFP-8 database compiled by Ben Langmead for the broad classification. This requires downloading and unpacking a fairly large file (>5 GB); we recommend that you download and unpack it yourself and specify the path to the database with the --kraken_all_db parameter.

For the HCV-specific classification, the pipeline uses a small, bundled database consisting of around 200 different HCV strains. You can specify a custom HCV database using the --kraken_focused_db parameter.

HCV reference sequences

The pipeline comes with a set of about 200 HCV reference sequences downloaded from NCBI; see the file data/blast_db/HCVgenosubtypes_8.5.19_clean.fa. The fasta headers have been modified to begin with the genotype and subtype (e.g., 1a, 3b) followed by an underscore and the NCBI accession number (e.g., 1a_AF009606). You can, for example, add or remove HCV strains by modifying this file; remember to format the fasta headers accordingly. The file is used in the mapping and in the analysis of the de novo assembled contigs to identify genotype and subtype. Provide the path to the file like this: --references /path/to/HCV-sequences.fasta.
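If you edit the reference file, a quick sanity check (a sketch, not part of the pipeline) can list any headers that do not follow the genotype/subtype-then-accession convention:

```shell
# Print fasta headers that do not start with a genotype digit, an optional
# subtype letter, and an underscore (e.g. >1a_AF009606).
# Empty output means the headers are consistently formatted.
grep '^>' HCV-sequences.fasta | grep -Ev '^>[1-8][a-z]?_' || true
```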

Co-infections (major and minor strains)

The pipeline first maps all HCV-classified reads against all HCV reference sequences. It then identifies the reference with the most mapped reads and uses that reference's genotype and subtype to call the major genotype and subtype. To identify a potential co-infection (minor strain), the pipeline selects the reference that belongs to a different genotype than the major strain (except for genotypes 1a and 1b, which are considered different enough to be distinguished in a co-infection) and has the highest coverage (i.e., percent of the genome covered by 5 or more reads). By default, a strain must have at least 500 mapped reads and 30% genome coverage to be considered a minor strain at all. These thresholds can be overridden with the parameters --minRead and --minCov.
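The threshold logic can be sketched as a simple filter over a per-reference statistics table. The file name and column layout here are illustrative only, not the pipeline's actual output format:

```shell
# Hypothetical per-reference mapping statistics (illustrative format only):
cat > reference_stats.csv <<'EOF'
reference,mapped_reads,pct_genome_covered
2b_ref1,1200,85
4a_ref2,120,10
EOF

# Keep only candidates passing the default --minRead (500) and --minCov (30)
# thresholds; here only 2b_ref1 qualifies as a potential minor strain.
awk -F, 'NR > 1 && $2 >= 500 && $3 >= 30' reference_stats.csv
```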

Note that a recombinant strain between subtypes 2k and 1b is present in the database. If this is detected, the pipeline will not allow a co-infection with either genotype 1 or genotype 2.

Starting and stopping the pipeline

If the pipeline crashes, or is stopped deliberately, it can be restarted from the last completed step by running the same command with the -resume option. Read more about resuming a Nextflow pipeline here.

Customizing the pipeline

The arguments given to the various sub-tools can be changed in several ways; perhaps the easiest is to create a custom config file, as described in more detail here.
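As an illustration (the process name and arguments here are hypothetical; check the pipeline's module names before use), extra arguments for a tool can be supplied via ext.args in a custom config passed with -c custom.config:

```groovy
// custom.config -- process selector and arguments are examples only
process {
    withName: 'BOWTIE2_ALIGN' {
        ext.args = '--very-sensitive-local'
    }
}
```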

Output files

The pipeline generates a comprehensive set of output files from various processes to facilitate result interpretation and quality control. By default, many intermediate files are published to help you understand the analysis. You can customize which files are published by modifying the publishDir settings in the configuration files. For example, to disable publishing for a specific process:

withName: 'PROCESS_NAME' {
    publishDir = [enabled: false]
}

Main output files

Summary.csv

The primary output file containing per-sample genotyping and quality metrics. Key columns include:

Read statistics:

  • sampleName - Sample identifier
  • total_raw_reads - Total number of raw reads
  • total_trimmed_reads - Reads after quality trimming
  • total_classified_reads - Reads classified by Kraken2 as target organism
  • total_mapped_reads - Reads mapped to all reference genomes
  • fraction_mapped_reads_vs_median - Fraction of mapped reads relative to median across all samples. Useful for identifying outliers in a sequencing batch.

Genotyping results:

  • Major_genotype_mapping / Minor_genotype_mapping - Identified genotypes (major/minor variants) from the reference mapping
  • Major_reference / Minor_reference - Closest references identified when mapping against all references; these were used for genotyping and re-mapping
  • major_typable / minor_typable - Whether the sample meets quality thresholds for reliable genotyping (YES/NO)

Mapping statistics (major/minor):

  • Reads_withdup_mapped_major/minor - Mapped reads including duplicates
  • Reads_nodup_mapped_major/minor - Mapped reads after duplicate removal
  • Percent_reads_mapped_of_trimmed_with_dups_major/minor - Percentage of trimmed reads that mapped, duplicates included
  • Major/Minor_cov_breadth_min_5/10 - Percentage of reference covered at ≥5× or ≥10× depth
  • Major/Minor_avg_depth - Average sequencing depth across the reference

HCV-specific outputs (if applicable):

  • GLUE_genotype / GLUE_subtype - Genotype and subtype determined by HCV-GLUE. "Typable" only if this matches the mapping genotype.
  • Reference - GLUE reference sequence
  • Drug resistance markers for NS3/4A inhibitors (glecaprevir, grazoprevir, paritaprevir, voxilaprevir)
  • Drug resistance markers for NS5A inhibitors (daclatasvir, elbasvir, ledipasvir, ombitasvir, pibrentasvir, velpatasvir)
  • Drug resistance markers for NS5B inhibitors (dasabuvir, sofosbuvir)
  • *_mut columns - Detailed mutation information
  • *_mut_short columns - Abbreviated mutation notation

Technical metadata:

  • sequencer_id - Sequencing instrument identifier
  • pipeline_version - Version of the folkehelseinstituttet/hcvtyper pipeline
  • HCV_project_version - HCV-GLUE version
  • GLUE_engine_version - GLUE engine version
  • PHE_drug_resistance_extension_version - Version of the Public Health England (PHE) drug resistance extension applied in HCV-GLUE

MultiQC Report

A comprehensive HTML report (multiqc_report.html) that summarizes:

  • Run information and pipeline parameters
  • Command line and configuration used
  • Pipeline version and software versions
  • Quality control metrics (FastQC, trimming statistics)
  • Read classification and mapping statistics
  • Genotyping results and drug resistance summaries (for HCV)
  • Visualization of coverage and variant distributions

The MultiQC report provides an interactive overview of all samples and is the recommended starting point for result interpretation.

Additional output directories

  • fastqc/ - Raw and trimmed read quality reports
  • fastp/ or cutadapt/ - Read trimming logs and statistics
  • kraken2/ - Taxonomic classification reports
  • samtools/ - BAM file statistics and mapping metrics
  • bowtie2/ or tanoti/ - Alignment files and indices
  • spades/ - De novo assembly results (if enabled)
  • blast/ - BLAST results against reference database
  • hcvglue/ - HCV-GLUE genotyping and resistance reports (for HCV samples)
  • pipeline_info/ - Execution reports, timeline, and software versions

Citations

If you use folkehelseinstituttet/hcvtyper for your analysis, please cite it using the following DOI: https://doi.org/10.1101/2025.10.21.683612

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
