nnamremmizxilef/phytoakmeter

Phytoakmeter: Variant Calling Pipeline

Bioinformatics pipeline for variant calling and phylogenetic analysis.

Repository Structure

phytoakmeter/
└── cg_bioinfo_v1/
    └── variant_calling/
        ├── conda_environments/  # Conda environment specifications
        ├── data/               # Raw sequencing data (excluded from Git)
        ├── logs/               # Job execution logs (tracked in Git)
        ├── results/            # Analysis outputs (excluded from Git)
        └── scripts/            # Analysis scripts (tracked in Git)
            ├── python/         # Python scripts
            └── shell/          # Shell scripts for pipeline steps

Note: Only conda_environments/, logs/, and scripts/ are tracked in Git. Data and results directories are excluded due to file size.

Pipeline Overview

The variant calling pipeline consists of 20 shell scripts (00 to 19) that run sequentially, grouped into the stages below, plus one helper Python script:

1. Quality Control & Preprocessing

  • 00_setup.sh - Initial setup and configuration
  • 01_fastp.sh - Adapter trimming and quality filtering
  • 02_fastqc.sh - Quality assessment of reads
  • 03_multiqc.sh - Aggregate QC reports

2. Reference Preparation & Read Alignment

  • 04_bwa_index.sh - Index reference genome for BWA
  • 05_bwa_map.sh - Map reads to reference genome
  • 06_add_rg.sh - Add read group information
  • 07_remove_dup.sh - Remove PCR duplicates

3. Variant Calling Preparation

  • 08_gatk_index.sh - Index files for GATK
  • 09_indel_real.sh - Indel realignment

4. Variant Detection

  • 10_mpileup.sh - Generate variant pileup
  • 11_concat.sh - Concatenate variant files across samples/chromosomes

5. Variant Filtering & Processing

  • 12_filter.sh - Initial variant filtering
  • 13_hardfilter.sh - Apply hard filters to variants
  • 14_subsetting.sh - Extract variant subsets
  • 15_conversion.sh - Format conversion for downstream analysis

6. Population Structure & Phylogenetics

  • 16_admixture.sh - Population structure analysis with ADMIXTURE
  • 17_popinfo.sh - Extract population information
  • 18_raxml.sh - Phylogenetic tree construction with RAxML
  • 19_phyloinfo.sh - Process phylogenetic results

7. Additional Analysis

  • scripts/python/vcf2phylip.py - Convert VCF to PHYLIP format

Environment Setup

Each bioinformatics tool uses a dedicated conda environment to ensure reproducibility and avoid dependency conflicts. All environment specifications are in conda_environments/.

Available Environments

  • cg_bioinfo_v1_R.yml - R and related packages
  • cg_bioinfo_v1_admixture.yml - ADMIXTURE for population structure
  • cg_bioinfo_v1_bcftools.yml - BCFtools for variant manipulation
  • cg_bioinfo_v1_bwa.yml - BWA aligner
  • cg_bioinfo_v1_fastp.yml - Fastp for read trimming
  • cg_bioinfo_v1_fastqc.yml - FastQC for quality control
  • cg_bioinfo_v1_gatk.yml - GATK for variant calling
  • cg_bioinfo_v1_multiqc.yml - MultiQC for report aggregation
  • cg_bioinfo_v1_picard.yml - Picard tools
  • cg_bioinfo_v1_plink.yml - PLINK for genetic analysis
  • cg_bioinfo_v1_raxml.yml - RAxML for phylogenetics
  • cg_bioinfo_v1_samtools.yml - SAMtools for alignment processing

Creating Environments

cd cg_bioinfo_v1/variant_calling/conda_environments/

# Create all environments
for yml in *.yml; do
    conda env create -f "$yml"
done

Or create individually:

conda env create -f cg_bioinfo_v1_fastp.yml
conda env create -f cg_bioinfo_v1_bwa.yml
# etc.
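
Re-running the loop above fails for environments that already exist. The following is a minimal sketch of a guard that skips them. Both function names are hypothetical, and it assumes the `name:` field inside each YAML file matches the file's basename (as the activation examples in this README suggest).

```shell
# Hypothetical helper: derive the environment name from a spec file name.
# Assumes the `name:` field inside each YAML matches the file's basename.
env_name_from_yml() {
    basename "$1" .yml
}

# Hypothetical helper: create an environment only if it does not exist yet.
create_env_if_missing() {
    local yml="$1"
    local name
    name="$(env_name_from_yml "$yml")"
    # `conda env list` prints one environment per line, name in the first column.
    if conda env list | awk '{print $1}' | grep -Fxq "$name"; then
        echo "skip: $name already exists"
    else
        conda env create -f "$yml"
    fi
}

# Usage (run from conda_environments/):
#   for yml in *.yml; do create_env_if_missing "$yml"; done
```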

Activating Environments

Each script requires its corresponding environment:

# Example: fastp step
conda activate cg_bioinfo_v1_fastp
bash scripts/shell/01_fastp.sh

# Example: BWA mapping
conda activate cg_bioinfo_v1_bwa
bash scripts/shell/05_bwa_map.sh
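
As an alternative to activating each environment by hand, `conda run -n` executes a command inside a named environment. The wrapper below is only a sketch: `run_step` and the `DRY_RUN` variable are hypothetical names, and the environment for each script has to be supplied by the caller because this README only confirms the pairing for the fastp and BWA steps.

```shell
# Hypothetical wrapper: run one pipeline script inside its environment.
#   $1 = tool suffix of the environment (e.g. fastp, bwa)
#   $2 = script file name under scripts/shell/
# Set DRY_RUN=1 to print the command instead of executing it.
run_step() {
    local env="cg_bioinfo_v1_$1"
    local script="scripts/shell/$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "conda run -n $env bash $script"
    else
        conda run -n "$env" bash "$script"
    fi
}

# Usage (run from cg_bioinfo_v1/variant_calling/):
#   run_step fastp 01_fastp.sh
#   run_step bwa 05_bwa_map.sh
```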

Usage

Running the Pipeline

  1. Prepare input data

    • Place raw FASTQ files in data/
    • Ensure reference genome is available
  2. Create conda environments (one-time setup)

    cd cg_bioinfo_v1/variant_calling/conda_environments/
    for yml in *.yml; do conda env create -f "$yml"; done
  3. Execute pipeline sequentially

    cd cg_bioinfo_v1/variant_calling/
    
    # Activate appropriate environment for each step
    conda activate cg_bioinfo_v1_fastp
    bash scripts/shell/00_setup.sh
    bash scripts/shell/01_fastp.sh
    
    # Continue through all steps...
  4. Monitor execution

    • Check logs in logs/01_logs_fastp/, logs/02_logs_fastqc/, etc.
    • Each step generates job-specific log files
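
One quick way to monitor the logs is to search them for the word "error". This is a heuristic sketch, not a description of what the pipeline scripts write: the function name is hypothetical, the log format is not documented here, and some tools print "error" in harmless contexts (for example, an error-rate statistic).

```shell
# Hypothetical helper: list log files that mention "error" (case-insensitive).
# Prints matching file paths; returns 0 if any were found, 1 if none.
find_error_logs() {
    local log_dir="$1"
    grep -rli "error" "$log_dir"
}

# Usage (run from cg_bioinfo_v1/variant_calling/):
#   find_error_logs logs/01_logs_fastp
```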

Log Organization

Logs are organized by pipeline step:

logs/
├── 01_logs_fastp/
├── 02_logs_fastqc/
├── 03_logs_multiqc/
├── 04_logs_bwa_index/
└── ...
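
The directory names listed above follow the pattern `NN_logs_NAME`, where `NN_NAME.sh` is the script name. Assuming the remaining steps follow the same pattern (the listing is truncated, so this is not confirmed), a script's log directory can be derived as in this sketch; the function name is hypothetical.

```shell
# Hypothetical helper: map a step script to its log directory.
# Assumes every step follows the NN_logs_NAME pattern shown in the listing.
log_dir_for_script() {
    local base step name
    base="$(basename "$1" .sh)"   # e.g. 04_bwa_index
    step="${base%%_*}"            # e.g. 04
    name="${base#*_}"             # e.g. bwa_index
    echo "logs/${step}_logs_${name}"
}

# Usage:
#   ls "$(log_dir_for_script scripts/shell/04_bwa_index.sh)"
```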

Data Management

  • Input data (data/): Raw sequencing files - excluded from Git
  • Results (results/): All analysis outputs - excluded from Git
  • Scripts (scripts/): All analysis code - tracked in Git
  • Logs (logs/): Execution logs - tracked in Git for reproducibility
  • Environments (conda_environments/): Environment specs - tracked in Git

Git Workflow

This repository tracks scripts, logs, and conda environments only.

Making Changes

# After editing scripts or adding logs
git add .
git commit -m "Updated filtering parameters in 12_filter.sh"
git push
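
Because `git add .` stages everything that is not ignored, it can help to look for unexpectedly large files in the tracked directories first. The sketch below uses `find` with a size threshold; the function name and the threshold in the usage line are illustrative assumptions, not project settings.

```shell
# Hypothetical helper: list files under a directory larger than a size limit.
#   $1 = directory to scan
#   $2 = size limit in KiB
find_large_files() {
    find "$1" -type f -size +"$2"k
}

# Usage: list log files larger than roughly 50 MiB before staging
#   find_large_files logs 51200
```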

Pulling Updates

git pull

Checking Status

git status          # See what changed
git log --oneline   # View commit history

Best Practices

  1. Always activate the correct conda environment before running each script
  2. Check logs after each step to verify successful completion
  3. Commit script changes with descriptive messages
  4. Keep data and results separate from the Git repository
  5. Document parameter changes in commit messages

Troubleshooting

  • If a script fails, check the corresponding log file in logs/XX_logs_TOOLNAME/
  • Ensure the correct conda environment is activated
  • Verify input files exist in expected locations
  • Check that previous pipeline steps completed successfully
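
The input check above can be scripted. This is a minimal sketch with a hypothetical function name; the file extension in the usage line is an assumption, since this README does not state how the FASTQ files are named.

```shell
# Hypothetical helper: check that every given file exists and is non-empty.
# Prints each missing or empty path to stderr; returns 1 if any check failed.
require_files() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (file pattern is an assumption):
#   require_files data/*.fastq.gz || exit 1
```

If the glob matches nothing, the shell passes the literal pattern to the function, which then reports it as missing.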

Contact

For questions or issues, please open an issue on GitHub or contact the repository maintainer.

License

No license has been specified.

About

Git repository of the phytoakmeter folder on hyperion.
