Bioinformatics pipeline for variant calling and phylogenetic analysis.
```
phytoakmeter/
└── cg_bioinfo_v1/
    └── variant_calling/
        ├── conda_environments/   # Conda environment specifications
        ├── data/                 # Raw sequencing data (excluded from Git)
        ├── logs/                 # Job execution logs (tracked in Git)
        ├── results/              # Analysis outputs (excluded from Git)
        └── scripts/              # Analysis scripts (tracked in Git)
            ├── python/           # Python scripts
            └── shell/            # Shell scripts for pipeline steps
```
Note: Only conda_environments/, logs/, and scripts/ are tracked in Git. Data and results directories are excluded due to file size.
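This split is typically enforced with a `.gitignore` at the repository root; a minimal sketch (the actual file is not shown here):

```gitignore
# Large sequencing inputs and analysis outputs stay out of version control
data/
results/
```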
The variant calling pipeline consists of 20 sequential steps executed via shell scripts:
- `00_setup.sh` - Initial setup and configuration
- `01_fastp.sh` - Adapter trimming and quality filtering
- `02_fastqc.sh` - Quality assessment of reads
- `03_multiqc.sh` - Aggregate QC reports
- `04_bwa_index.sh` - Index reference genome for BWA
- `05_bwa_map.sh` - Map reads to reference genome
- `06_add_rg.sh` - Add read group information
- `07_remove_dup.sh` - Remove PCR duplicates
- `08_gatk_index.sh` - Index files for GATK
- `09_indel_real.sh` - Indel realignment
- `10_mpileup.sh` - Generate variant pileup
- `11_concat.sh` - Concatenate variant files across samples/chromosomes
- `12_filter.sh` - Initial variant filtering
- `13_hardfilter.sh` - Apply hard filters to variants
- `14_subsetting.sh` - Extract variant subsets
- `15_conversion.sh` - Format conversion for downstream analysis
- `16_admixture.sh` - Population structure analysis with ADMIXTURE
- `17_popinfo.sh` - Extract population information
- `18_raxml.sh` - Phylogenetic tree construction with RAxML
- `19_phyloinfo.sh` - Process phylogenetic results
- `scripts/python/vcf2phylip.py` - Convert VCF to PHYLIP format
Each bioinformatics tool uses a dedicated conda environment to ensure reproducibility and avoid dependency conflicts. All environment specifications are in conda_environments/.
- `cg_bioinfo_v1_R.yml` - R and related packages
- `cg_bioinfo_v1_admixture.yml` - ADMIXTURE for population structure
- `cg_bioinfo_v1_bcftools.yml` - BCFtools for variant manipulation
- `cg_bioinfo_v1_bwa.yml` - BWA aligner
- `cg_bioinfo_v1_fastp.yml` - Fastp for read trimming
- `cg_bioinfo_v1_fastqc.yml` - FastQC for quality control
- `cg_bioinfo_v1_gatk.yml` - GATK for variant calling
- `cg_bioinfo_v1_multiqc.yml` - MultiQC for report aggregation
- `cg_bioinfo_v1_picard.yml` - Picard tools
- `cg_bioinfo_v1_plink.yml` - PLINK for genetic analysis
- `cg_bioinfo_v1_raxml.yml` - RAxML for phylogenetics
- `cg_bioinfo_v1_samtools.yml` - SAMtools for alignment processing
```shell
cd cg_bioinfo_v1/variant_calling/conda_environments/

# Create all environments
for yml in *.yml; do
    conda env create -f "$yml"
done
```

Or create individually:

```shell
conda env create -f cg_bioinfo_v1_fastp.yml
conda env create -f cg_bioinfo_v1_bwa.yml
# etc.
```

Each script requires its corresponding environment:

```shell
# Example: fastp step
conda activate cg_bioinfo_v1_fastp
bash scripts/shell/01_fastp.sh

# Example: BWA mapping
conda activate cg_bioinfo_v1_bwa
bash scripts/shell/05_bwa_map.sh
```
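Because every step depends on its own environment, a small lookup helper can make the pairing explicit and catch mix-ups early. This is a hypothetical sketch (only a few steps mapped; the mapping follows the `cg_bioinfo_v1_<tool>` naming above, and `run_step` assumes `conda run` is available):

```shell
# env_for: return the conda environment name for a given step script.
# The mapping is an assumption based on the cg_bioinfo_v1_<tool> pattern.
env_for() {
  case "$1" in
    01_fastp.sh)                    echo cg_bioinfo_v1_fastp ;;
    02_fastqc.sh)                   echo cg_bioinfo_v1_fastqc ;;
    03_multiqc.sh)                  echo cg_bioinfo_v1_multiqc ;;
    04_bwa_index.sh|05_bwa_map.sh)  echo cg_bioinfo_v1_bwa ;;
    *) echo "no environment mapped for: $1" >&2; return 1 ;;
  esac
}

# run_step: execute one pipeline step inside its environment,
# without a manual activate/deactivate cycle.
run_step() {
  conda run -n "$(env_for "$1")" bash "scripts/shell/$1"
}

env_for 05_bwa_map.sh
```

Extending the `case` table to all 20 steps would let the whole pipeline be driven by a single loop over the step scripts.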
- Prepare input data
  - Place raw FASTQ files in `data/`
  - Ensure the reference genome is available
- Create conda environments (one-time setup)

  ```shell
  cd cg_bioinfo_v1/variant_calling/conda_environments/
  for yml in *.yml; do conda env create -f "$yml"; done
  ```

- Execute pipeline sequentially

  ```shell
  cd cg_bioinfo_v1/variant_calling/
  # Activate appropriate environment for each step
  conda activate cg_bioinfo_v1_fastp
  bash scripts/shell/00_setup.sh
  bash scripts/shell/01_fastp.sh
  # Continue through all steps...
  ```

- Monitor execution
  - Check logs in `logs/01_logs_fastp/`, `logs/02_logs_fastqc/`, etc.
  - Each step generates job-specific log files
Logs are organized by pipeline step:
```
logs/
├── 01_logs_fastp/
├── 02_logs_fastqc/
├── 03_logs_multiqc/
├── 04_logs_bwa_index/
└── ...
```
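A quick way to review a finished run is to scan these per-step directories for error messages. A sketch, assuming failures are reported with the word "error" somewhere in the log text (which may not hold for every tool):

```shell
# logs_clean: succeed only if no log file under the given directory
# mentions "error" (case-insensitive).
logs_clean() {
  ! grep -rqi "error" "$1" 2>/dev/null
}

# Demo against a throwaway layout mirroring logs/<step>/ (hypothetical content)
demo=$(mktemp -d)
mkdir -p "$demo/01_logs_fastp" "$demo/05_logs_bwa_map"
echo "fastp finished; all reads passed filtering" > "$demo/01_logs_fastp/job.log"
echo "ERROR: reference index not found"           > "$demo/05_logs_bwa_map/job.log"

logs_clean "$demo/01_logs_fastp"   && echo "01_fastp: clean"
logs_clean "$demo/05_logs_bwa_map" || echo "05_bwa_map: check its log"
```

In this repository the same check would be pointed at `logs/01_logs_fastp/` and friends after each step.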
- Input data (`data/`): Raw sequencing files - excluded from Git
- Results (`results/`): All analysis outputs - excluded from Git
- Scripts (`scripts/`): All analysis code - tracked in Git
- Logs (`logs/`): Execution logs - tracked in Git for reproducibility
- Environments (`conda_environments/`): Environment specs - tracked in Git
This repository tracks scripts, logs, and conda environments only.
```shell
# After editing scripts or adding logs
git add .
git commit -m "Updated filtering parameters in 12_filter.sh"
git push
```

```shell
git pull
git status          # See what changed
git log --oneline   # View commit history
```

- Always activate the correct conda environment before running each script
- Check logs after each step to verify successful completion
- Commit script changes with descriptive messages
- Keep data and results separate from the Git repository
- Document parameter changes in commit messages
- If a script fails, check the corresponding log file in `logs/XX_logs_TOOLNAME/`
- Ensure the correct conda environment is activated
- Verify input files exist in expected locations
- Check that previous pipeline steps completed successfully
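The last two checks lend themselves to automation: before launching step N, confirm that the files step N-1 should have produced exist and are non-empty. A sketch with hypothetical paths (the `results/` layout is not documented here):

```shell
# require: fail (non-zero) if any expected file is missing or empty,
# so a step aborts before wasting compute on incomplete inputs.
require() {
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
}

# Demo with a throwaway file; in the pipeline this might guard a step
# like 05_bwa_map.sh, e.g.: require results/01_fastp/sample1.trimmed.fastq.gz
demo=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$demo/sample1.trimmed.fastq"
require "$demo/sample1.trimmed.fastq" && echo "inputs present"
```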
For questions or issues, please open an issue on GitHub or contact the repository maintainer.