Assembly of the Saccharomyces cerevisiae genome using PacBio HiFi reads.
-Set up a new directory for this project.
-The dataset consists of PacBio HiFi reads from the yeast Saccharomyces cerevisiae (run SRR13577846).
-The data can be downloaded from the NCBI Sequence Read Archive (SRA).
FastQC (Version 0.11.9)
Install FastQC using conda
conda install -c bioconda fastqc
Hifiasm (Version 0.18.7)
Install Hifiasm using conda
conda install -c bioconda hifiasm
QUAST (Version 5.2.0)
Install QUAST using conda
Make a separate environment for QUAST
conda create --name quast_env
conda activate quast_env
conda install -c bioconda quast
BUSCO (Version 5.4.4)
conda create --name busco
conda activate busco
conda install -c conda-forge -c bioconda busco=5.4.4
MultiQC
conda create -n multiqc
conda activate multiqc
conda install -c conda-forge -c bioconda multiqc
Here we check the quality of the reads with FastQC. In general, the quality of the reads is good. FastQC does flag an unexpected GC content distribution and uneven per-base GC content at both ends of the reads.
fastqc data/SRR13577846.fastq.gz
xdg-open data/SRR13577846_fastqc.html
Hifiasm is a genome assembler for PacBio HiFi reads (it can also make use of Oxford Nanopore reads). It follows the overlap-layout-consensus (OLC) approach: it finds overlaps between reads, lays them out in an assembly graph, and derives consensus contigs from that graph until the final assembly is produced.
hifiasm -o output -t 10 data/SRR13577846.fastq.gz
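hifiasm writes its contigs in GFA format, where each segment line (starting with S) holds a contig name in the second column and its sequence in the third. As a rough illustration of the conversion step that follows, the sketch below applies the same kind of awk extraction to a tiny made-up GFA file. The function name gfa_to_fasta and the toy contig names are assumptions for illustration only.

```shell
# gfa_to_fasta FILE
# Print every segment ("S") line of a GFA file as a FASTA record:
# column 2 becomes the header, column 3 the sequence.
gfa_to_fasta() {
  awk '/^S/{print ">"$2; print $3}' "$1"
}

# Toy GFA (tab-separated, made up): two segments and one non-segment line.
tmp_gfa=$(mktemp)
printf 'S\tptg000001l\tACGTACGT\nA\tptg000001l\t0\t+\tread1\t0\t8\nS\tptg000002l\tGGCC\n' > "$tmp_gfa"

gfa_to_fasta "$tmp_gfa"
# prints:
# >ptg000001l
# ACGTACGT
# >ptg000002l
# GGCC

rm -f "$tmp_gfa"
```

Lines that do not start with S (such as the A line in the toy file) are skipped, so only contig sequences end up in the FASTA output.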
#convert the primary contigs from .gfa to .fa (FASTA file)
awk '/^S/{print ">"$2;print $3}' output.bp.p_ctg.gfa > output.bp.p_ctg.fa
QUAST is a tool for evaluating the quality of genome assemblies. It computes contiguity metrics and, when given a reference genome, aligns the assembly to it to assess accuracy and completeness.
The reference genome provides a benchmark for the assembly, as it is a high-quality genome that has been previously validated. The Saccharomyces cerevisiae reference is available from NCBI: https://www.ncbi.nlm.nih.gov/genome/?term=Saccharomyces%20cerevisiae%5B0rganism%5D&cmd=DetailsSearch
quast 2_assembly/output.bp.p_ctg.fa -r Refrence/GCF_000146045.2_R64_genomic.fna
cd quast_results/results_2023_02_23_09_11_14/
cat report.txt
REPORT:
The QUAST report provides statistics on the quality of the genome assembly: the number of contigs, the size of the largest contig, the total length of the assembly, the N50 and NG50 values, and the number of misassemblies. N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases; NG50 is the same measure computed against the reference genome length instead of the assembly length. These metrics can be used to assess the contiguity and accuracy of the assembly.
Assembly statistics for this dataset:
Contigs → 36
Largest contig → 1505909 bp
Total length → 12502635 bp
N50 → 805283 bp
NG50 → 805283 bp
Misassemblies → 111
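To make the N50 and NG50 definitions concrete, here is a minimal sketch that computes both from a list of contig lengths. The function name nx50, the toy lengths, and the 400 bp genome size are all made up for illustration; QUAST's own implementation may differ in details.

```shell
# nx50 LENGTHS_FILE [GENOME_SIZE]
# Reads one contig length per line, sorts longest first, and prints the
# contig length at which the running total first reaches half of the target.
# Target = sum of all lengths (N50), or GENOME_SIZE if given (NG50).
nx50() {
  sort -rn "$1" | awk -v g="${2:-0}" '
    { len[NR] = $1; total += $1 }
    END {
      target = (g > 0 ? g : total) / 2
      for (i = 1; i <= NR; i++) {
        run += len[i]
        if (run >= target) { print len[i]; exit }
      }
    }'
}

# Toy example: six contigs totalling 290 bp.
lengths=$(mktemp)
printf '%s\n' 80 70 50 40 30 20 > "$lengths"

nx50 "$lengths"        # N50: prints 70 (80 + 70 = 150 >= 145)
nx50 "$lengths" 400    # NG50 for a hypothetical 400 bp genome: prints 50

rm -f "$lengths"
```

For a real assembly, the lengths file could be produced from the FASTA, for example with awk '/^>/{next}{print length($0)}', assuming each sequence sits on a single line as in the FASTA written by the awk conversion used earlier.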
BUSCO checks the completeness of a genome assembly by searching it for a set of near-universal single-copy orthologs, conserved genes that are expected to be present in a given lineage.
BUSCO is run on the genome (or transcriptome) of interest with a lineage-specific gene set. Here --auto-lineage lets BUSCO choose the lineage, which resolved to saccharomycetes_odb10.
busco -i 2_assembly/output.bp.p_ctg.fa --auto-lineage -o Busco -m genome
cat Busco/short_summary.specific.saccharomycetes_odb10.Busco.txt
The BUSCO report gives the number of complete, fragmented, and missing BUSCOs in the assembly, which can be used to assess assembly quality and completeness. For this dataset, out of 2137 BUSCOs, 2129 were complete, 2 were fragmented, and 6 were missing.
Complete → 2129
Fragmented → 2
Missing → 6
Total BUSCOs → 2137
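BUSCO results are usually quoted as percentages of the total. As a small sketch, the counts above can be converted like this; busco_percent is a hypothetical helper name, not part of BUSCO.

```shell
# busco_percent COMPLETE FRAGMENTED MISSING
# Print each category as a percentage of the total number of BUSCOs.
busco_percent() {
  awk -v c="$1" -v f="$2" -v m="$3" 'BEGIN {
    total = c + f + m
    printf "Complete: %.2f%%\n",   100 * c / total
    printf "Fragmented: %.2f%%\n", 100 * f / total
    printf "Missing: %.2f%%\n",    100 * m / total
  }'
}

busco_percent 2129 2 6
# prints:
# Complete: 99.63%
# Fragmented: 0.09%
# Missing: 0.28%
```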
MultiQC searches a given directory for analysis logs and compiles an HTML report. It is a general-purpose tool, well suited to summarising the output from numerous bioinformatics tools.
cd Assembly
multiqc .
cat multiqc_data/multiqc_general_stats.txt
The report provides quality control metrics for the sample SRR13577846. According to FastQC, the sample has 0.85% duplicate sequences, 38% GC content, an average sequence length of 9391 base pairs, and a total of 117525 sequences. The report also indicates that 20% of the FastQC modules failed. QUAST provides assembly statistics, including a total assembly length of 12502635 base pairs and an N50 value of 805283 base pairs.
Sample → SRR13577846
FastQC_percent_duplicates → 0.853946558668
FastQC_percent_gc → 38.0
FastQC_avg_sequence_length → 9391.45693257
FastQC_total_sequences → 117525.0
FastQC_percent_fails → 20.0
QUAST_Total_length → 12502635.0
QUAST_N50 → 805283.0
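The read count and average read length reported above can be cross-checked directly from the reads. The sketch below assumes a standard four-line-per-record FASTQ; fastq_stats is a hypothetical helper name and the two toy reads are made up.

```shell
# fastq_stats [FILE]
# Print the number of reads and the mean read length of an uncompressed
# FASTQ file (or standard input if no file is given).
# Assumes exactly four lines per record, with the sequence on line 2.
fastq_stats() {
  awk 'NR % 4 == 2 { n++; bases += length($0) }
       END { if (n > 0) printf "reads=%d mean_length=%.2f\n", n, bases / n }' "${1:--}"
}

# Toy example: two reads of length 4 and 6.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | fastq_stats
# prints: reads=2 mean_length=5.00
```

On the real data this could be run as gzip -dc data/SRR13577846.fastq.gz | fastq_stats, and the result should be comparable to the FastQC_total_sequences and FastQC_avg_sequence_length values.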