Advanced plant genome assembly pipeline for Oryza sativa using HiFi reads, k-mer profiling (Jellyfish), and Hifiasm
This repository contains a high-performance bioinformatics pipeline for assembling the Oryza sativa genome using PacBio HiFi reads (Accession: SRR35146894).
Compared to bacterial assembly, this project handles a larger eukaryotic genome, requiring advanced k-mer profiling and specialized assemblers for long reads.
- Data Retrieval: Automated SRA data download via
sra-tools. - K-mer Profiling (Jellyfish): Multi-k analysis (k=21, 51) to estimate genome size, heterozygosity, and repeat content.
- Long-read Assembly (Hifiasm): State-of-the-art assembler specifically designed for PacBio HiFi reads.
- HPC Optimized: Configured for multi-threading (up to 20 threads) and memory-intensive processing.
- Setup: Workspace initialization.
- K-mer Counting: Generating histograms with
jellyfishto assess data quality. - Format Conversion: Preparing sequences for the assembly engine.
- Assembly: Executing
hifiasmfor high-fidelity contig generation.
The k-mer analysis is a crucial step in plant genomics. It allows us to:
- Predict the genome size.
- Identify the level of duplication and repeats.
- Optimize assembly parameters.
Analogy: If assembly is building a house, K-mer analysis is checking the quality and quantity of bricks before the first wall is even raised.
bash
git clone https://github.com/daniil-11-ger/oryza-sativa-genome-assembly.git
bash scripts/assembly_pipeline.sh
Scaffolding (RagTag): Organized contigs into chromosomes using a reference-guided approach (IRGSP-1.0).
Organelle Sorting: Identified and separated chloroplast and mitochondrial genomes (nc_001320.1 and nc_011033.1).
Gene Prediction (Augustus): Performed ab initio gene prediction with extrinsic hints specifically for Oryza sativa.
https://daniil-11-ger.github.io/oryza-sativa-genome-assembly/
- Repeat Masking: Refine the EDTA/RepeatMasker stage for better soft-masking before annotation.
- Functional Annotation: Using BLAST/InterProScan to assign biological functions to the predicted genes.
- Synteny Analysis: Comparing the Kasalath assembly with other rice cultivars.