- About HCVTyper
- Requirements
- Run the pipeline
- Test the pipeline
- Required parameters
- Optional parameters
- Starting and stopping the pipeline
- Customizing the pipeline
- Output files
- Citations
folkehelseinstituttet/hcvtyper is a bioinformatics pipeline used at the Norwegian Institute of Public Health. It is designed for highly variable viruses, and for viruses likely to appear as co-infections between multiple strains, such as Hepatitis C Virus. The pipeline identifies the most likely major and minor strain in a sample sequenced on the Illumina platform, maps the reads to these references using Bowtie2, and creates consensus sequences. For Hepatitis C Virus the pipeline can also run a GLUE analysis to identify drug resistance mutations.
The pipeline only requires Nextflow and Docker in order to run. Note that you must be able to run Docker as a non-root user as described here.
Important
HCV-GLUE is currently only available with the Docker profile. We recommend that you always run the pipeline with Docker.
The pipeline does not require any installation, only an internet connection. The pipeline is typically run with the following command:
nextflow run folkehelseinstituttet/hcvtyper -r v1.1.3 \
--input samplesheet.csv \
--outdir <OUTDIR> \
-profile docker
Nextflow automatically pulls the pipeline from the GitHub repo when it is launched. Here, release v1.1.3 is downloaded and run. If you omit -r, the code from the master branch is used, but we recommend always specifying a branch or release with -r.
If you want to download a local copy of the pipeline you can run:
nextflow pull folkehelseinstituttet/hcvtyper -r v1.0.6
Again, -r is optional.
To run a minimal test:
nextflow run folkehelseinstituttet/hcvtyper -profile docker,test
This only verifies that you can get the pipeline up and running; it does not run the entire pipeline (for example, HCV-GLUE is skipped). The results will be in a directory called minimal_test.
To run a full test on a real dataset type:
# First download the test dataset using nf-core/fetchngs
nextflow run nf-core/fetchngs -profile docker --input 'https://raw.githubusercontent.com/folkehelseinstituttet/hcvtyper/refs/heads/dev/assets/test_ids.csv' --outdir full_test
# Then run the pipeline on the downloaded dataset
nextflow run folkehelseinstituttet/hcvtyper -profile docker,test_full
This will download a HCV Illumina dataset from SRA and run the entire pipeline. The results will be in a directory called full_test.
Note that the pipeline will by default download and use the Kraken 2 PlusPFP-8 database. This requires at least 5 GB of free disk space and takes a few minutes to download and unpack. In addition, the default resource requirements of 12 CPUs and 72 GB of memory have been overridden to 8 CPUs and 50 GB for this test.
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown below. The sample names can contain numbers and underscores (_), but not spaces, dots (.) or other symbols. The fastq_1 and fastq_2 columns must contain the full path to the gzipped paired fastq files corresponding to the same sample.
sample,fastq_1,fastq_2
Sample_1,/path/to/sample1_fastq_R1.fastq.gz,/path/to/sample1_fastq_R2.fastq.gz
Sample_2,/path/to/sample2_fastq_R1.fastq.gz,/path/to/sample2_fastq_R2.fastq.gz
The samplesheet is input to the pipeline using the --input parameter, e.g.:
--input assets/samplesheet_illumina.csv
An example samplesheet has been provided with the pipeline in the assets directory.
File naming requirements:
- FASTQ files should be gzipped and paired-end
- Files should follow the naming pattern `*_R1.fastq.gz` and `*_R2.fastq.gz` (or a similar R1/R2 designation)
- All FASTQ files for a project should be organized in a single directory or its subdirectories
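The naming pattern above also makes it easy to script samplesheet creation yourself. The function below is a minimal sketch (not part of the pipeline; the function name is our own) that pairs `*_R1.fastq.gz` files with their `*_R2.fastq.gz` mates, recursing into subdirectories:

```shell
#!/bin/sh
# Sketch: build a samplesheet from paired *_R1/*_R2 fastq.gz files
# found under a directory (including subdirectories).
make_samplesheet() {
  fastq_dir=$1
  out_csv=$2
  # Header row expected by the pipeline
  echo "sample,fastq_1,fastq_2" > "$out_csv"
  # For every R1 file, derive the matching R2 name and the sample name
  find "$fastq_dir" -name '*_R1.fastq.gz' | sort | while read -r r1; do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    sample=$(basename "$r1" _R1.fastq.gz)
    # Only emit a row when the mate file actually exists
    [ -f "$r2" ] && echo "$sample,$r1,$r2" >> "$out_csv"
  done
}
```

Unpaired R1 files are silently skipped; check the resulting CSV against your expected sample count before launching the pipeline.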
Creating a samplesheet automatically: If you have many samples, you can use the provided Docker container to automatically generate a samplesheet from a directory containing FASTQ files. The paired fastq files can be in subdirectories, and you need to point to the directory above the sub-directories. You must also point to an existing directory where you want to write the samplesheet. The container can be run like this:
# Generate samplesheet from a directory containing FASTQ files
docker run --rm \
-v /path/to/fastq/directory:/data \
-v /path/to/output/directory:/out \
ghcr.io/jonbra/viralseq_utils:latest \
/data /out/samplesheet.csv

The output directory is specified using the --outdir parameter, e.g.:
--outdir results
The pipeline can be run using different profiles, which determine how the pipeline is executed. The default profile is docker, which uses Docker containers to run the pipeline. You can also use the singularity or conda profiles if you prefer those environments. Set the profile with the -profile parameter, e.g. -profile docker, -profile singularity, or -profile conda.
The different parameters can be provided in a file using the argument -params-file path/to/params-file.yml. The file can be either YAML-formatted:
input: 'samplesheet.csv'
outdir: 'results'

or JSON-formatted:
{
"input": "samplesheet.csv",
"outdir": "results"
}

The pipeline uses Kraken2 for two purposes. One is to classify the reads against a general database to get a broad overview of the taxonomic diversity within the sample (e.g., are there a lot of human reads?). The second is to classify the reads against a specific HCV database and then use only the classified reads for the rest of the pipeline. This is done to reduce the computational load and time needed to run mapping and de novo assembly.
By default, the pipeline will download and use the PlusPFP-8 database compiled by Ben Langmead for the broad classification. This requires the download and unpacking of a fairly large file (>5 GB), and we recommend that you download and unpack it yourself and specify the path to the database using the --kraken_all_db parameter.
For the HCV-specific classification, the pipeline uses a small, bundled database consisting of around 200 different HCV strains. You can specify a custom HCV database using the --kraken_focused_db parameter.
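If you have downloaded a database yourself, point the pipeline at it on the command line. A sketch (the database path is a placeholder, not a real location):

```shell
nextflow run folkehelseinstituttet/hcvtyper -r v1.1.3 \
    --input samplesheet.csv \
    --outdir results \
    --kraken_all_db /path/to/k2_pluspfp_08gb \
    -profile docker
```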
The pipeline comes with a provided set of about 200 HCV reference sequences downloaded from NCBI; see the file data/blast_db/HCVgenosubtypes_8.5.19_clean.fa. The fasta headers have been modified to begin with the genotype and subtype information (e.g., 1a, 3b) followed by an underscore and the NCBI accession number (e.g., 1a_AF009606). You can, for example, add or remove HCV strains by modifying this file; remember to format the fasta headers accordingly. This file is then used in the mapping and in the analysis of the de novo assembled contigs to identify genotype and subtype. Provide the path to this file like this: --references /path/to/HCV-sequences.fasta.
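For reference, an entry in a custom references file follows this header pattern (the sequence itself is elided here):

```
>1a_AF009606
ACTG...
```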
The pipeline will first map all HCV-classified reads against all HCV reference sequences. It then identifies the reference sequence with the most mapped reads and uses the genotype and subtype information from this reference to call the major genotype and subtype. To identify a potential co-infection (minor strain), the pipeline identifies the reference that belongs to a different genotype than the major strain (except for genotypes 1a and 1b, which are considered different enough that we can distinguish them in a co-infection) and has the highest coverage (i.e., percent of the genome covered by 5 or more reads). By default, a strain must have at least 500 mapped reads and 30% genome coverage to be considered a minor strain at all. These thresholds can be overridden using the parameters --minRead and --minCov.
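For example, to relax the minor-strain thresholds (the values below are illustrative, not recommendations):

```shell
nextflow run folkehelseinstituttet/hcvtyper -r v1.1.3 \
    --input samplesheet.csv \
    --outdir results \
    --minRead 250 \
    --minCov 20 \
    -profile docker
```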
Note that there is a recombinant strain between subtypes 2k and 1b present in the database. If this is detected, the pipeline will not allow for a co-infection with either genotypes 1 or 2.
If the pipeline crashes, or is stopped deliberately, it can be restarted from the last completed step by running the same command with the -resume option added. Read more about resuming a Nextflow pipeline here.
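For example:

```shell
nextflow run folkehelseinstituttet/hcvtyper -r v1.1.3 \
    --input samplesheet.csv \
    --outdir results \
    -profile docker \
    -resume
```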
The arguments given to the various sub-tools can be changed in several ways; perhaps the easiest is to create a custom config file, described in more detail here.
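As a sketch, such a config file could look like this. The process name and the extra arguments below are assumptions for illustration, not values taken from the pipeline:

```
// custom.config
process {
    withName: 'BOWTIE2_ALIGN' {
        // Hypothetical example: pass extra arguments to the mapping tool
        ext.args = '--very-sensitive-local'
    }
}
```

The file is then supplied at launch with -c custom.config.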
The pipeline generates a comprehensive set of output files from various processes to facilitate result interpretation and quality control. By default, many intermediate files are published to help you understand the analysis. You can customize which files are published by modifying the publishDir settings in the configuration files. For example, to disable publishing for a specific process:
withName: 'PROCESS_NAME' {
publishDir = [enabled: false]
}

The primary output file containing per-sample genotyping and quality metrics. Key columns include:
Read statistics:
- `sampleName` - Sample identifier
- `total_raw_reads` - Total number of raw reads
- `total_trimmed_reads` - Reads after quality trimming
- `total_classified_reads` - Reads classified by Kraken2 as the target organism
- `total_mapped_reads` - Reads mapped to all reference genomes
- `fraction_mapped_reads_vs_median` - Fraction of mapped reads relative to the median across all samples; useful for identifying outliers in a sequencing batch
Genotyping results:
- `Major_genotype_mapping` / `Minor_genotype_mapping` - Identified genotypes (major/minor variants) from the reference mapping
- `Major_reference` / `Minor_reference` - Closest references identified in the mapping against all references; these were used for genotyping and re-mapping
- `major_typable` / `minor_typable` - Whether the sample meets quality thresholds for reliable genotyping (YES/NO)
Mapping statistics (major/minor):
- `Reads_withdup_mapped_major/minor` - Mapped reads including duplicates
- `Reads_nodup_mapped_major/minor` - Mapped reads after duplicate removal
- `Percent_reads_mapped_of_trimmed_with_dups_major/minor` - Percentage of trimmed reads that mapped, duplicates included
- `Major/Minor_cov_breadth_min_5/10` - Percentage of the reference covered at ≥5× or ≥10× depth
- `Major/Minor_avg_depth` - Average sequencing depth across the reference
HCV-specific outputs (if applicable):
- `GLUE_genotype` / `GLUE_subtype` - Genotype and subtype determined by HCV-GLUE. "Typable" only if this matches the mapping genotype.
- `Reference` - GLUE reference sequence
- Drug resistance markers for NS3/4A inhibitors (glecaprevir, grazoprevir, paritaprevir, voxilaprevir)
- Drug resistance markers for NS5A inhibitors (daclatasvir, elbasvir, ledipasvir, ombitasvir, pibrentasvir, velpatasvir)
- Drug resistance markers for NS5B inhibitors (dasabuvir, sofosbuvir)
- `*_mut` columns - Detailed mutation information
- `*_mut_short` columns - Abbreviated mutation notation
Technical metadata:
- `sequencer_id` - Sequencing instrument identifier
- `pipeline_version` - Version of the folkehelseinstituttet/hcvtyper pipeline
- `HCV_project_version` - HCV-GLUE version
- `GLUE_engine_version` - GLUE engine version
- `PHE_drug_resistance_extension_version` - Version of the Public Health England (PHE) drug resistance extension applied in HCV-GLUE
A comprehensive HTML report (multiqc_report.html) that summarizes:
- Run information and pipeline parameters
- Command line and configuration used
- Pipeline version and software versions
- Quality control metrics (FastQC, trimming statistics)
- Read classification and mapping statistics
- Genotyping results and drug resistance summaries (for HCV)
- Visualization of coverage and variant distributions
The MultiQC report provides an interactive overview of all samples and is the recommended starting point for result interpretation.
- `fastqc/` - Raw and trimmed read quality reports
- `fastp/` or `cutadapt/` - Read trimming logs and statistics
- `kraken2/` - Taxonomic classification reports
- `samtools/` - BAM file statistics and mapping metrics
- `bowtie2/` or `tanoti/` - Alignment files and indices
- `spades/` - De novo assembly results (if enabled)
- `blast/` - BLAST results against the reference database
- `hcvglue/` - HCV-GLUE genotyping and resistance reports (for HCV samples)
- `pipeline_info/` - Execution reports, timeline, and software versions
If you use folkehelseinstituttet/hcvtyper for your analysis, please cite it using the following doi: https://doi.org/10.1101/2025.10.21.683612
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

