Long read de novo assembly

Assembly of the Saccharomyces cerevisiae genome using PacBio HiFi reads.

Set up

-Set up a new directory for this project.  
-The dataset consists of PacBio HiFi reads from the yeast Saccharomyces cerevisiae.

The data can be downloaded from the NCBI Sequence Read Archive (SRA); the run used here is SRR13577846.
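A minimal sketch of the setup step, assuming the project directory is called `Assembly` (the directory entered before running MultiQC) and that the reads live in `data/` (the path used by the later commands). The `PROJECT_DIR` variable is hypothetical, and the SRA Toolkit commands in the comments are only one possible way to fetch the run; the download method actually used is not stated here.

```shell
# Hypothetical project root; override with PROJECT_DIR=/some/path
PROJECT_DIR="${PROJECT_DIR:-Assembly}"

# data/ holds the raw reads, as assumed by the commands further down
mkdir -p "$PROJECT_DIR/data"

# One possible way to fetch the run with the SRA Toolkit (not run here):
#   prefetch SRR13577846
#   fasterq-dump SRR13577846 -O "$PROJECT_DIR/data"
#   gzip "$PROJECT_DIR/data/SRR13577846.fastq"
echo "Project directory ready: $PROJECT_DIR"
```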

Software

FastQC (Version 0.11.9)
Install FastQC using conda

conda install -c bioconda fastqc

Hifiasm (Version 0.18.7)
Install Hifiasm

conda install -c bioconda hifiasm

QUAST (Version 5.2.0)
Install using conda
Create an environment for QUAST

conda create --name quast_env  
conda activate quast_env  
conda install -c bioconda quast

Busco (Version 5.4.4)

conda create --name busco  
conda activate busco
conda install -c conda-forge -c bioconda busco=5.4.4

MultiQC

conda create -n multiqc
conda activate multiqc
conda install -c conda-forge -c bioconda multiqc
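Before starting, it can help to confirm that each tool is reachable. The sketch below defines a hypothetical helper, `check_tools`, that reports which commands are on the `PATH`. Because QUAST, BUSCO and MultiQC are installed in separate conda environments here, a tool is only reported as found while its environment is active.

```shell
# check_tools NAME... : report whether each command is on the PATH.
# Returns the number of missing commands (0 means everything was found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Command names as they are invoked in this README
check_tools fastqc hifiasm quast busco multiqc \
  || echo "Some tools are missing; activate the matching conda environment."
```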

Check quality of the data, FASTQC

Here we check the quality of the reads with FastQC. Overall read quality is good, but FastQC flags an unexpected GC content distribution, and the per-base GC content is uneven at both ends of the reads.

Check the quality with FastQC

# FastQC writes its report next to the input file by default
fastqc data/SRR13577846.fastq.gz
xdg-open data/SRR13577846_fastqc.html
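The per-module verdicts can also be read without opening the HTML report, because the FastQC zip archive contains a tab-separated `summary.txt` (status, module name, file name). The sketch below counts the statuses in a toy `summary.txt`. The statuses shown are invented for illustration and are not the results of this run.

```shell
# Toy stand-in for the summary.txt inside a FastQC zip archive.
# The statuses below are invented for illustration, not results from this run.
summary_file="$(mktemp)"
printf 'PASS\tBasic Statistics\treads.fastq.gz\n'             >  "$summary_file"
printf 'WARN\tPer sequence GC content\treads.fastq.gz\n'      >> "$summary_file"
printf 'FAIL\tPer base sequence content\treads.fastq.gz\n'    >> "$summary_file"
printf 'PASS\tSequence Length Distribution\treads.fastq.gz\n' >> "$summary_file"

# Count modules per status (column 1 is PASS, WARN or FAIL)
status_counts="$(awk -F'\t' '{ n[$1]++ }
  END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }' "$summary_file")"
echo "$status_counts"

# List only the modules that need a closer look
awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$summary_file"
rm -f "$summary_file"
```

For the real report, the same awk programs could read the output of `unzip -p data/SRR13577846_fastqc.zip SRR13577846_fastqc/summary.txt`, assuming the archive sits next to the reads.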

Assembly

Run hifiasm

Hifiasm is a de novo genome assembler built primarily for PacBio HiFi reads. It follows the overlap-layout-consensus (OLC) approach: it finds overlaps between reads, builds an assembly graph from those overlaps, and derives the contig sequences from the graph.

# -o sets the output prefix, -t the number of threads
hifiasm -o output -t 10 data/SRR13577846.fastq.gz
# Convert the primary contigs from GFA to FASTA
awk '/^S/{print ">"$2;print $3}' output.bp.p_ctg.gfa > output.bp.p_ctg.fa
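To see what the awk conversion does, the sketch below runs the same program on a tiny made-up GFA file. Only segment (`S`) lines carry sequence, so each one becomes a FASTA record, while link (`L`) lines are skipped. The contig names and sequences are invented for illustration.

```shell
# Tiny made-up GFA: two segment (S) lines and one link (L) line
gfa_file="$(mktemp)"
printf 'S\tptg000001l\tACGTACGT\tLN:i:8\n'     >  "$gfa_file"
printf 'S\tptg000002l\tGGCC\tLN:i:4\n'         >> "$gfa_file"
printf 'L\tptg000001l\t+\tptg000002l\t+\t0M\n' >> "$gfa_file"

# Same awk program as above: keep S lines, emit ">name" then the sequence
fasta="$(awk '/^S/{print ">"$2; print $3}' "$gfa_file")"
echo "$fasta"

# Sanity check: the number of FASTA records should equal the number of S lines
n_segments="$(grep -c '^S' "$gfa_file")"
n_records="$(printf '%s\n' "$fasta" | grep -c '^>')"
echo "segments=$n_segments records=$n_records"
rm -f "$gfa_file"
```

The same record count check can be applied to `output.bp.p_ctg.gfa` and `output.bp.p_ctg.fa` after the real conversion.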

Quality assessment using QUAST

QUAST evaluates the quality of genome assemblies. When a reference genome is supplied, it compares the assembly to the reference and reports metrics that describe the accuracy and completeness of the assembly.

Finding the reference

The reference genome serves as the benchmark for the assembly, because it is a high-quality sequence that has been previously validated. The S. cerevisiae reference used here (GCF_000146045.2, R64) can be found on NCBI: https://www.ncbi.nlm.nih.gov/genome/?term=Saccharomyces%20cerevisiae%5B0rganism%5D&cmd=DetailsSearch

Run Quast

quast 2_assembly/output.bp.p_ctg.fa -r Refrence/GCF_000146045.2_R64_genomic.fna

Find information by looking at report

cd quast_results/results_2023_02_23_09_11_14/
cat report.txt

REPORT:
The QUAST report summarises the quality of the assembly: the number of contigs, the size of the largest contig, the total assembly length, the N50 and NG50 values, and the number of misassemblies. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. NG50 is the same measure computed against the reference genome length instead of the assembly length. Together these metrics indicate how contiguous, complete and accurate the assembly is.

Assembly statistics for this data:
Contigs → 36
Largest contig → 1505909
Total length → 12502635
N50 → 805283
NG50 → 805283
Misassemblies → 111
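The N50 definition can be made concrete with a short sketch. The helper `n50` below is hypothetical (it is not part of QUAST): it sorts contig lengths from longest to shortest and prints the length at which the running total first reaches half of the assembly. The toy lengths are invented.

```shell
# n50 LENGTH... : print the N50 of the given contig lengths
n50() {
  printf '%s\n' "$@" | sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        # Stop at the first contig where the running sum covers half the total
        if (running * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Toy contig lengths (invented): total = 100, half = 50
# Sorted: 40 30 20 10 -> running sums 40, 70 -> N50 = 30
toy_n50="$(n50 10 40 20 30)"
echo "N50 = $toy_n50"
```

Replacing `total` with the reference genome length would give NG50 instead. For a real assembly, the lengths could come from the FASTA file, for example with `awk '!/^>/{print length($0)}' output.bp.p_ctg.fa`, which works here because the conversion step writes each contig on a single line.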

Quality assessment using BUSCO

BUSCO checks the completeness of a genome assembly by searching it for a set of near-universal single-copy orthologs expected in the chosen lineage. The share of these genes that is found complete, fragmented or missing indicates how complete the assembly is.

Run Busco

busco -i 2_assembly/output.bp.p_ctg.fa --auto-lineage -o Busco -m genome

Find information by looking at report

cat Busco/short_summary.specific.saccharomycetes_odb10.Busco.txt

The BUSCO report gives the number of complete, fragmented and missing BUSCOs in the assembly. For this assembly, 2129 of the 2137 BUSCOs in the saccharomycetes_odb10 set (99.6%) were complete, 2 were fragmented and 6 were missing, which indicates a highly complete assembly.

Complete → 2129
Fragmented → 2
Missing → 6
Total BUSCOs → 2137
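BUSCO results are usually quoted as percentages. A small sketch, using only the counts listed above, converts them and checks that the three categories account for every BUSCO searched. The variable names are chosen for illustration.

```shell
# Counts copied from the BUSCO summary above
complete=2129
fragmented=2
missing=6
total=2137

# Percentage of the lineage set in each category, one decimal place
busco_pct="$(awk -v c="$complete" -v f="$fragmented" -v m="$missing" -v n="$total" \
  'BEGIN { printf "C:%.1f%% F:%.1f%% M:%.1f%% n:%d", 100*c/n, 100*f/n, 100*m/n, n }')"
echo "$busco_pct"

# The three categories should account for every BUSCO searched
if [ $((complete + fragmented + missing)) -eq "$total" ]; then
  echo "counts add up"
fi
```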

MultiQC Report

MultiQC searches a given directory for analysis logs and compiles an HTML report. It is a general-purpose tool, well suited to summarising the output of several bioinformatics tools at once.

Run MultiQC

cd Assembly
multiqc ./

Find information by looking at report

cat multiqc_data/multiqc_general_stats.txt

The report collects quality control metrics for sample SRR13577846. According to FastQC, the dataset contains 117525 reads with an average length of 9391 bp, 38% GC content and 0.85% duplicate sequences. FastQC_percent_fails is the share of FastQC modules that failed, so 20% means that one in five modules reported a failure, not that 20% of the reads failed. QUAST reports a total assembly length of 12502635 bp and an N50 of 805283 bp.

Sample → SRR13577846
FastQC_percent_duplicates → 0.853946558668
FastQC_percent_gc → 38.0
FastQC_avg_sequence_length → 9391.45693257
FastQC_total_sequences → 117525.0
FastQC_percent_fails → 20.0
QUAST_Total_length → 12502635.0
QUAST_N50 → 805283.0
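The general stats file is tab-separated, with one header row and one row per sample, which is hard to read with `cat` when there are many columns. The sketch below transposes such a file into one "name: value" line per column. The stand-in file uses a few column names and values from the listing above; its single-row layout is an assumption about the real file.

```shell
# Stand-in for multiqc_general_stats.txt: a header row plus one sample row,
# tab-separated. Names and values are copied from the listing above;
# the single-row layout is an assumption.
stats_file="$(mktemp)"
printf 'Sample\tFastQC_percent_gc\tFastQC_total_sequences\tQUAST_N50\n' >  "$stats_file"
printf 'SRR13577846\t38.0\t117525.0\t805283.0\n'                        >> "$stats_file"

# Print each column as "name: value", one per line
stats_listing="$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
          { for (i = 1; i <= NF; i++) print header[i] ": " $i }
' "$stats_file")"
echo "$stats_listing"
rm -f "$stats_file"
```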
