Bioinformatics pipeline for variant calling and phylogenetic analysis.
```
phytoakmeter/
└── cg_bioinfo_v1/
    └── variant_calling/
        ├── conda_environments/   # Conda environment specifications
        ├── data/                 # Raw sequencing data (excluded from Git)
        ├── logs/                 # Job execution logs (tracked in Git)
        ├── results/              # Analysis outputs (excluded from Git)
        └── scripts/              # Analysis scripts (tracked in Git)
            ├── python/           # Python scripts
            └── shell/            # Shell scripts for pipeline steps
```
Note: Only conda_environments/, logs/, and scripts/ are tracked in Git. Data and results directories are excluded due to file size.
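This split is typically enforced with a `.gitignore` at the repository root; a minimal sketch (the actual file is not shown here):

```gitignore
# Large sequencing inputs and analysis outputs stay out of version control
data/
results/
```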
The variant calling pipeline consists of 20 sequential steps executed via shell scripts:
- `00_setup.sh` - Initial setup and configuration
- `01_fastp.sh` - Adapter trimming and quality filtering
- `02_fastqc.sh` - Quality assessment of reads
- `03_multiqc.sh` - Aggregate QC reports
- `04_bwa_index.sh` - Index reference genome for BWA
- `05_bwa_map.sh` - Map reads to reference genome
- `06_add_rg.sh` - Add read group information
- `07_remove_dup.sh` - Remove PCR duplicates
- `08_gatk_index.sh` - Index files for GATK
- `09_indel_real.sh` - Indel realignment
- `10_mpileup.sh` - Generate variant pileup
- `11_concat.sh` - Concatenate variant files across samples/chromosomes
- `12_filter.sh` - Initial variant filtering
- `13_hardfilter.sh` - Apply hard filters to variants
- `14_subsetting.sh` - Extract variant subsets
- `15_conversion.sh` - Format conversion for downstream analysis
- `16_admixture.sh` - Population structure analysis with ADMIXTURE
- `17_popinfo.sh` - Extract population information
- `18_raxml.sh` - Phylogenetic tree construction with RAxML
- `19_phyloinfo.sh` - Process phylogenetic results
- `scripts/python/vcf2phylip.py` - Convert VCF to PHYLIP format
Each bioinformatics tool uses a dedicated conda environment to ensure reproducibility and avoid dependency conflicts. All environment specifications are in conda_environments/.
- `cg_bioinfo_v1_R.yml` - R and related packages
- `cg_bioinfo_v1_admixture.yml` - ADMIXTURE for population structure
- `cg_bioinfo_v1_bcftools.yml` - BCFtools for variant manipulation
- `cg_bioinfo_v1_bwa.yml` - BWA aligner
- `cg_bioinfo_v1_fastp.yml` - Fastp for read trimming
- `cg_bioinfo_v1_fastqc.yml` - FastQC for quality control
- `cg_bioinfo_v1_gatk.yml` - GATK for variant calling
- `cg_bioinfo_v1_multiqc.yml` - MultiQC for report aggregation
- `cg_bioinfo_v1_picard.yml` - Picard tools
- `cg_bioinfo_v1_plink.yml` - PLINK for genetic analysis
- `cg_bioinfo_v1_raxml.yml` - RAxML for phylogenetics
- `cg_bioinfo_v1_samtools.yml` - SAMtools for alignment processing
```shell
cd cg_bioinfo_v1/variant_calling/conda_environments/

# Create all environments
for yml in *.yml; do
    conda env create -f "$yml"
done
```

Or create individually:

```shell
conda env create -f cg_bioinfo_v1_fastp.yml
conda env create -f cg_bioinfo_v1_bwa.yml
# etc.
```

Each script requires its corresponding environment:

```shell
# Example: fastp step
conda activate cg_bioinfo_v1_fastp
bash scripts/shell/01_fastp.sh

# Example: BWA mapping
conda activate cg_bioinfo_v1_bwa
bash scripts/shell/05_bwa_map.sh
```
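Because every step depends on its own environment, a small lookup helper can make the pairing explicit and catch mix-ups early. This is a hypothetical sketch (only a few steps mapped; the mapping follows the `cg_bioinfo_v1_<tool>` naming above, and `run_step` assumes `conda run` is available):

```shell
# env_for: return the conda environment name for a given step script.
# The mapping is an assumption based on the cg_bioinfo_v1_<tool> pattern.
env_for() {
  case "$1" in
    01_fastp.sh)                    echo cg_bioinfo_v1_fastp ;;
    02_fastqc.sh)                   echo cg_bioinfo_v1_fastqc ;;
    03_multiqc.sh)                  echo cg_bioinfo_v1_multiqc ;;
    04_bwa_index.sh|05_bwa_map.sh)  echo cg_bioinfo_v1_bwa ;;
    *) echo "no environment mapped for: $1" >&2; return 1 ;;
  esac
}

# run_step: execute one pipeline step inside its environment,
# without a manual activate/deactivate cycle.
run_step() {
  conda run -n "$(env_for "$1")" bash "scripts/shell/$1"
}

env_for 05_bwa_map.sh
```

Extending the `case` table to all 20 steps would let the whole pipeline be driven by a single loop over the step scripts.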
- Prepare input data
  - Place raw FASTQ files in `data/`
  - Ensure the reference genome is available
- Create conda environments (one-time setup)

  ```shell
  cd cg_bioinfo_v1/variant_calling/conda_environments/
  for yml in *.yml; do conda env create -f "$yml"; done
  ```

- Execute pipeline sequentially

  ```shell
  cd cg_bioinfo_v1/variant_calling/
  # Activate appropriate environment for each step
  conda activate cg_bioinfo_v1_fastp
  bash scripts/shell/00_setup.sh
  bash scripts/shell/01_fastp.sh
  # Continue through all steps...
  ```

- Monitor execution
  - Check logs in `logs/01_logs_fastp/`, `logs/02_logs_fastqc/`, etc.
  - Each step generates job-specific log files
Logs are organized by pipeline step:
```
logs/
├── 01_logs_fastp/
├── 02_logs_fastqc/
├── 03_logs_multiqc/
├── 04_logs_bwa_index/
└── ...
```
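A quick way to review a finished run is to scan these per-step directories for error messages. A sketch, assuming failures are reported with the word "error" somewhere in the log text (which may not hold for every tool):

```shell
# logs_clean: succeed only if no log file under the given directory
# mentions "error" (case-insensitive).
logs_clean() {
  ! grep -rqi "error" "$1" 2>/dev/null
}

# Demo against a throwaway layout mirroring logs/<step>/ (hypothetical content)
demo=$(mktemp -d)
mkdir -p "$demo/01_logs_fastp" "$demo/05_logs_bwa_map"
echo "fastp finished; all reads passed filtering" > "$demo/01_logs_fastp/job.log"
echo "ERROR: reference index not found"           > "$demo/05_logs_bwa_map/job.log"

logs_clean "$demo/01_logs_fastp"   && echo "01_fastp: clean"
logs_clean "$demo/05_logs_bwa_map" || echo "05_bwa_map: check its log"
```

In this repository the same check would be pointed at `logs/01_logs_fastp/` and friends after each step.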
- Input data (`data/`): Raw sequencing files - excluded from Git
- Results (`results/`): All analysis outputs - excluded from Git
- Scripts (`scripts/`): All analysis code - tracked in Git
- Logs (`logs/`): Execution logs - tracked in Git for reproducibility
- Environments (`conda_environments/`): Environment specs - tracked in Git
This repository tracks scripts, logs, and conda environments only.
```shell
# After editing scripts or adding logs
git add .
git commit -m "Updated filtering parameters in 12_filter.sh"
git push
```

```shell
git pull
git status          # See what changed
git log --oneline   # View commit history
```

- Always activate the correct conda environment before running each script
- Check logs after each step to verify successful completion
- Commit script changes with descriptive messages
- Keep data and results separate from the Git repository
- Document parameter changes in commit messages
- If a script fails, check the corresponding log file in `logs/XX_logs_TOOLNAME/`
- Ensure the correct conda environment is activated
- Verify input files exist in expected locations
- Check that previous pipeline steps completed successfully
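The last two checks lend themselves to automation: before launching step N, confirm that the files step N-1 should have produced exist and are non-empty. A sketch with hypothetical paths (the `results/` layout is not documented here):

```shell
# require: fail (non-zero) if any expected file is missing or empty,
# so a step aborts before wasting compute on incomplete inputs.
require() {
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
}

# Demo with a throwaway file; in the pipeline this might guard a step
# like 05_bwa_map.sh, e.g.: require results/01_fastp/sample1.trimmed.fastq.gz
demo=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$demo/sample1.trimmed.fastq"
require "$demo/sample1.trimmed.fastq" && echo "inputs present"
```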
For questions or issues, please open an issue on GitHub or contact the repository maintainer.