SagharT/Assembly

Long read de novo assembly

Assembly of the Saccharomyces cerevisiae genome using PacBio HiFi reads.

Set up

- Set up a new directory for this project.  
- The dataset consists of PacBio HiFi reads from the yeast Saccharomyces cerevisiae.

The data can be downloaded from the NCBI Sequence Read Archive (SRA).
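
One possible way to fetch the reads is with the SRA Toolkit (`prefetch` and `fasterq-dump`). This is only a sketch: the README does not say which download method was used, and it assumes the run accession is SRR13577846 (taken from the file name used later) and that the reads go into `data/`. The helper names `run` and `fetch_reads` are made up for illustration. By default the function only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical helper: print a command (dry run) or execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: download one SRA run and gzip the resulting FASTQ.
# Assumes the SRA Toolkit is installed and the run holds single-end reads,
# so that fasterq-dump writes <accession>.fastq.
fetch_reads() {
  accession="$1"
  outdir="$2"
  run mkdir -p "$outdir"
  run prefetch "$accession"
  run fasterq-dump "$accession" --outdir "$outdir"
  run gzip "$outdir/$accession.fastq"
}

# Dry run: shows the four commands without touching the network.
fetch_reads SRR13577846 data
```

Running it with `DRY_RUN=0 fetch_reads SRR13577846 data` would perform the download, assuming the accession above is the right one.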

Software

FastQC (Version 0.11.9)
Install FastQC using conda

conda install -c bioconda fastqc 

Hifiasm (Version 0.18.7)
Install Hifiasm

conda install -c bioconda hifiasm 

Quast (Version 5.2.0)
Install using conda
Create an environment for installing Quast

conda create --name quast_env  
conda activate quast_env  
conda install -c conda-forge -c bioconda quast=5.2.0

Busco (Version 5.4.4)

conda create --name busco  
conda activate busco
conda install -c conda-forge -c bioconda busco=5.4.4

MultiQC

conda create -n multiqc
conda activate multiqc
conda install -c conda-forge -c bioconda multiqc

Check the quality of the data with FastQC

Here we check the quality of the reads with FastQC. In general, the quality of the reads is good. FastQC does flag an unexpected GC content distribution and uneven per-base GC content at both ends of the reads.

Run FastQC

fastqc data/SRR13577846.fastq.gz
xdg-open data/SRR13577846_fastqc.html
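
As a quick cross-check of the basic numbers FastQC reports (total sequences and average read length), the reads can also be counted directly. The sketch below assumes standard four-line FASTQ records; `fastq_stats` is a hypothetical helper name, and the two-read file is toy data, not the real dataset.

```shell
# Hypothetical helper: count reads and compute the mean read length
# of a gzipped FASTQ file (assumes 4 lines per record).
fastq_stats() {
  gzip -dc < "$1" | awk 'NR % 4 == 2 { n++; total += length($0) }
    END { if (n > 0) printf "reads=%d mean_length=%.2f\n", n, total / n }'
}

# Toy example: two reads of length 4 and 6.
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip -c > "$tmp"
fastq_stats "$tmp"   # prints: reads=2 mean_length=5.00
rm -f "$tmp"
```

On the real data this would be called as `fastq_stats data/SRR13577846.fastq.gz`.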

Assembly

Run hifiasm

Hifiasm is a de novo genome assembler designed for PacBio HiFi reads. It finds overlaps between reads, builds an assembly graph from them, and derives contig sequences from that graph, in the spirit of the overlap-layout-consensus (OLC) approach. With the output prefix used here, the primary contigs are written to output.bp.p_ctg.gfa.

hifiasm -s stats.txt -o output -t 10 data/SRR13577846.fastq.gz
# Convert the primary contigs from GFA (.gfa) to FASTA (.fa)
awk '/^S/{print ">"$2;print $3}'  output.bp.p_ctg.gfa > output.bp.p_ctg.fa
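
To see what the conversion does, here is the same awk program applied to a tiny made-up GFA. Segment (`S`) lines carry the contig name in column 2 and the sequence in column 3; every other line type is skipped. The contig names and sequences are invented for illustration, and `gfa_to_fasta` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper around the awk one-liner used above.
gfa_to_fasta() {
  awk '/^S/ { print ">" $2; print $3 }' "$1"
}

# Toy GFA: two segment lines and one non-segment line (ignored).
tmp=$(mktemp)
printf 'S\tptg000001l\tACGTACGT\tLN:i:8\n' >  "$tmp"
printf 'A\tptg000001l\t0\t+\tread1\t0\t8\n' >> "$tmp"
printf 'S\tptg000002l\tGGCC\tLN:i:4\n'      >> "$tmp"

gfa_to_fasta "$tmp"
# prints:
# >ptg000001l
# ACGTACGT
# >ptg000002l
# GGCC
rm -f "$tmp"
```

A simple sanity check on the real output is `grep -c '^>' output.bp.p_ctg.fa`, which should match the number of contigs reported later.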

Quality assessment using Quast

QUAST evaluates the quality of genome assemblies. When a reference genome is supplied, it compares the assembly to that reference and reports metrics that describe the accuracy and completeness of the assembly.

Finding the reference

The reference genome serves as the benchmark for the comparison, since it is a high-quality genome that has been previously validated. The Saccharomyces cerevisiae reference can be found through NCBI Genome: https://www.ncbi.nlm.nih.gov/genome/?term=Saccharomyces%20cerevisiae%5B0rganism%5D&cmd=DetailsSearch

Run Quast

quast 2_assembly/output.bp.p_ctg.fa -r Refrence/GCF_000146045.2_R64_genomic.fna

Find information by looking at report

cd results_2023_02_23_09_11_14/
cat report.txt

REPORT:
The QUAST report provides statistics about the quality of the genome assembly: the number of contigs, the size of the largest contig, the total length of the assembly, the N50 and NG50 values, and the number of misassemblies. N50 is the contig length such that contigs of that length or longer contain at least half of the total assembly length; NG50 is the same measure computed against the length of the reference genome instead of the assembly. These metrics can be used to assess the completeness and accuracy of the assembly.

Assembly statistics for this data (lengths in bp):
Contigs → 36
Largest contig → 1505909
Total length → 12502635
N50 → 805283
NG50 → 805283
Misassemblies → 111
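
For intuition about N50 and NG50, the sketch below recomputes them from contig lengths: sort the contigs from longest to shortest and report the length at which the running sum first reaches half of the assembly length (N50) or half of a given reference length (NG50). This is not QUAST's own code; `contig_lengths` and `n50` are hypothetical helper names and the five-contig FASTA is toy data.

```shell
# Hypothetical helper: print one contig length per line for a FASTA file
# (sequence lines may be wrapped).
contig_lengths() {
  awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
       { n += length($0) }
       END { if (seen) print n }' "$1"
}

# Hypothetical helper: read lengths on stdin and print N50.
# With a reference length as first argument, print NG50 instead.
n50() {
  sort -rn | awk -v ref="${1:-0}" '{ len[NR] = $1; total += $1 }
    END {
      half = (ref > 0 ? ref : total) / 2
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running >= half) { print len[i]; exit }
      }
    }'
}

# Toy assembly with contig lengths 6, 5, 4, 3, 2 (total 20).
tmp=$(mktemp)
printf '>a\nACGTAC\n>b\nACGTA\n>c\nACGT\n>d\nACG\n>e\nAC\n' > "$tmp"
contig_lengths "$tmp" | n50       # prints 5  (6 + 5 = 11 >= 10)
contig_lengths "$tmp" | n50 30    # prints 4  (6 + 5 + 4 = 15 >= 15)
rm -f "$tmp"
```

On the real assembly, `contig_lengths 2_assembly/output.bp.p_ctg.fa | n50` should agree with the N50 reported by QUAST, provided QUAST's minimum contig length filter does not remove any contigs.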

Quality assessment using Busco

BUSCO checks the completeness of a genome assembly by searching it for a set of conserved, near-universal single-copy orthologs.
The ortholog set comes from a lineage-specific database, and the proportion of these genes found complete, fragmented, or missing indicates the completeness and quality of the assembly.

Run Busco

busco -i 2_assembly/output.bp.p_ctg.fa --auto-lineage -o Busco -m genome

Find information by looking at report

cat short_summary.specific.saccharomycetes_odb10.Busco.txt

The BUSCO report gives the number of complete, fragmented, and missing BUSCOs in the assembly, which can be used to assess its quality and completeness. For this data, out of 2137 BUSCOs in the saccharomycetes_odb10 set, 2129 were complete (about 99.6%), 2 were fragmented, and 6 were missing.

Complete → 2129
Fragmented → 2
Missing → 6
Total BUSCOs → 2137
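
The counts above can be turned into percentages with a little arithmetic. The sketch below is just that calculation; `busco_percent` is a hypothetical helper name and the output format is made up for this example rather than copied from BUSCO's summary line.

```shell
# Hypothetical helper: convert complete / fragmented / missing counts
# into percentages of the total number of BUSCOs.
busco_percent() {
  awk -v c="$1" -v f="$2" -v m="$3" 'BEGIN {
    total = c + f + m
    printf "C:%.1f%% F:%.1f%% M:%.1f%% n:%d\n",
           100 * c / total, 100 * f / total, 100 * m / total, total
  }'
}

busco_percent 2129 2 6   # prints: C:99.6% F:0.1% M:0.3% n:2137
```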

MultiQC Report

MultiQC searches a given directory for analysis logs and compiles an HTML report. It is a general-purpose tool for summarising the output of numerous bioinformatics tools.

Run MultiQC

cd Assembly
multiqc ./

Find information by looking at report

cat multiqc_data/multiqc_general_stats.txt

The report provides quality control metrics for the sample SRR13577846. According to FastQC, the sample has 0.85% duplicate sequences, 38% GC content, an average sequence length of about 9391 bp, and a total of 117525 sequences. The FastQC_percent_fails value of 20 is the percentage of FastQC modules that failed, not the percentage of reads. QUAST contributes assembly statistics, including a total assembly length of 12502635 bp and an N50 of 805283 bp.

Sample → SRR13577846
FastQC_percent_duplicates → 0.853946558668
FastQC_percent_gc → 38.0
FastQC_avg_sequence_length → 9391.45693257
FastQC_total_sequences → 117525.0
FastQC_percent_fails → 20.0
QUAST_Total_length → 12502635.0
QUAST_N50 → 805283.0
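
The general stats file is tab-separated, with one header row and one row per sample, so a listing like the one above can be produced by pairing each column name with its value. This is a minimal sketch under that assumption; `transpose_stats` is a hypothetical helper name, and the three-column file is toy data using column names from the listing above (column names can differ between MultiQC versions).

```shell
# Hypothetical helper: print "column → value" for every sample row
# of a tab-separated table with a header line.
transpose_stats() {
  awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
    { for (i = 1; i <= NF; i++) print name[i] " → " $i }' "$1"
}

# Toy table with three of the columns shown above.
tmp=$(mktemp)
printf 'Sample\tFastQC_percent_gc\tQUAST_N50\n' >  "$tmp"
printf 'SRR13577846\t38.0\t805283.0\n'          >> "$tmp"
transpose_stats "$tmp"
# prints:
# Sample → SRR13577846
# FastQC_percent_gc → 38.0
# QUAST_N50 → 805283.0
rm -f "$tmp"
```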
