oryza-sativa-genome-assembly

Advanced plant genome assembly pipeline for Oryza sativa using HiFi reads, k-mer profiling (Jellyfish), and Hifiasm

Plant Genome Assembly: Oryza sativa (Rice)

This repository contains a bioinformatics pipeline for assembling the Oryza sativa genome from PacBio HiFi reads (SRA accession: SRR35146894).

Project Features

Unlike a bacterial assembly, this project targets a much larger, repeat-rich eukaryotic genome (roughly 400 Mb for rice), so it relies on k-mer profiling before assembly and on an assembler built for long, accurate reads.

Key Components:

Data Retrieval: Automated SRA data download via sra-tools.
K-mer Profiling (Jellyfish): Multi-k analysis (k=21, 51) to estimate genome size, heterozygosity, and repeat content.
Long-read Assembly (Hifiasm): State-of-the-art assembler specifically designed for PacBio HiFi reads.
HPC Optimized: Configured for multi-threading (up to 20 threads) and memory-intensive processing.
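
The k-mer profiling component could be sketched as below. This is a minimal illustration, not the repository's actual script: the input name `reads.fastq`, the `-s 1G` hash size, and the `DRY_RUN` switch are assumptions introduced here. Note that Jellyfish reads uncompressed FASTA/FASTQ, so gzipped reads would need to be decompressed first.

```shell
# Sketch: count canonical k-mers with Jellyfish and write a histogram per k.
# READS, THREADS, the 1G hash size, and DRY_RUN are illustrative assumptions.
READS="${READS:-reads.fastq}"
THREADS="${THREADS:-20}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

kmer_profile() {
  k="$1"
  # -C counts a k-mer and its reverse complement as one canonical k-mer
  run jellyfish count -C -m "$k" -s 1G -t "$THREADS" -o "k${k}.jf" "$READS"
  # histo writes one "multiplicity count" pair per line
  run jellyfish histo -t "$THREADS" -o "k${k}.histo" "k${k}.jf"
}
```

To profile both k values named above, one could call `for k in 21 51; do kmer_profile "$k"; done`. Setting `DRY_RUN=1` first prints the commands without running them, which is handy for checking paths on a login node.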

Pipeline Workflow

Setup: Workspace initialization.
K-mer Counting: Generating histograms with jellyfish to assess data quality.
Format Conversion: Preparing sequences for the assembly engine.
Assembly: Executing hifiasm for high-fidelity contig generation.
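
The assembly step above can be sketched as follows. The function names, the default thread count, and the file names in the usage note are assumptions for illustration. The second function is a common follow-up rather than a step named in this README: hifiasm writes its contigs as GFA, and the sequences usually need to be extracted to FASTA for downstream tools.

```shell
# Sketch: assemble HiFi reads with hifiasm, then extract contigs from the GFA.
# Function names and defaults are illustrative assumptions.
assemble() {
  reads="$1"
  prefix="$2"
  threads="${3:-20}"
  # -o sets the output prefix, -t the number of threads
  hifiasm -o "$prefix" -t "$threads" "$reads"
}

# Convert GFA segment lines (S <name> <sequence> ...) into FASTA records.
gfa_to_fasta() {
  awk '$1 == "S" { print ">" $2; print $3 }' "$1"
}
```

A run might look like `assemble reads.fastq.gz rice_asm` followed by `gfa_to_fasta rice_asm.bp.p_ctg.gfa > rice_asm.fa`. The `.bp.p_ctg.gfa` suffix is what recent hifiasm versions use for primary contigs in HiFi-only mode; check the files your version actually produces.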

K-mer Analysis Insights

K-mer analysis is a crucial early step in plant genomics. It allows us to:

Predict the genome size.
Identify the level of duplication and repeats.
Optimize assembly parameters.

Analogy: If assembly is building a house, k-mer analysis is checking the quality and quantity of bricks before the first wall is even raised.
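
As a rough illustration of the genome size prediction, the sketch below divides the total number of non-error k-mers by the depth of the main histogram peak. It is a crude estimate under stated assumptions: the function name and the fixed error cutoff (`min_cov`, default 5) are hypothetical, and the method ignores heterozygous and repeat peaks that model-fitting tools account for.

```shell
# Sketch: estimate genome size from a Jellyfish histogram
# (one "multiplicity count" pair per line).
# The error cutoff min_cov is an assumed parameter; pick it from the
# valley between the error peak and the main peak in your own data.
estimate_genome_size() {
  histo="$1"
  min_cov="${2:-5}"
  awk -v min="$min_cov" '
    $1 >= min {
      total += $1 * $2                         # k-mers seen at this multiplicity
      if ($2 > best) { best = $2; peak = $1 }  # most frequent multiplicity
    }
    END {
      if (peak > 0) printf "peak=%d size=%.0f\n", peak, total / peak
    }' "$histo"
}
```

For example, `estimate_genome_size k21.histo 5` prints the peak depth and the estimated size in bases. Comparing the results for k=21 and k=51 gives a quick sanity check on the estimate.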

How to Run

Clone the repository

git clone https://github.com/daniil-11-ger/oryza-sativa-genome-assembly.git

Run the pipeline

bash scripts/assembly_pipeline.sh

Pipeline Stages

Scaffolding (RagTag): Organized contigs into chromosomes using a reference-guided approach (IRGSP-1.0).

Organelle Sorting: Identified and separated the chloroplast and mitochondrial genomes (reference accessions NC_001320.1 and NC_011033.1).

Gene Prediction (Augustus): Performed ab initio gene prediction with extrinsic hints specifically for Oryza sativa.
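
For the scaffolding stage, a reference-guided RagTag call could look like the sketch below. The function name, file names, output directory, and `DRY_RUN` switch are assumptions for illustration; the repository's own invocation may differ.

```shell
# Sketch: reference-guided scaffolding with RagTag.
# Names, defaults, and DRY_RUN are illustrative assumptions.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

scaffold() {
  reference="$1"                 # e.g. the IRGSP-1.0 reference FASTA
  contigs="$2"                   # contigs extracted from the hifiasm output
  outdir="${3:-ragtag_output}"
  threads="${4:-20}"
  # ragtag.py scaffold takes the reference first, then the query assembly
  run ragtag.py scaffold -o "$outdir" -t "$threads" "$reference" "$contigs"
}
```

Calling `scaffold IRGSP-1.0.fa rice_asm.fa` would place the scaffolds in `ragtag_output/`, where RagTag writes `ragtag.scaffold.fasta`.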

Website

https://daniil-11-ger.github.io/oryza-sativa-genome-assembly/

Future Directions

Repeat Masking: Refine the EDTA/RepeatMasker stage for better soft-masking before annotation.
Functional Annotation: Use BLAST/InterProScan to assign biological functions to the predicted genes.
Synteny Analysis: Compare the Kasalath assembly with other rice cultivars.
