An educational project focused on analyzing genomic data using Principal Component Analysis (PCA).
The goal is to reproduce and visualize the PCA results from the publication:
Oliveira S. et al., Genome-wide variation in the Angolan Namib Desert reveals unique pre-Bantu ancestry, Sci. Adv. 9 (2023)
-
Data Preprocessing (via PLINK)
- SNP filtering
- Removing individuals with low call rate
- LD pruning
- Selecting autosomal SNPs only
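The four steps above could be sketched as PLINK 1.9 invocations built from Python. This is an illustration only: the file prefixes (`raw_data`, `clean`) and every threshold (`--geno 0.05`, `--maf 0.01`, `--mind 0.1`, `--indep-pairwise 50 5 0.2`) are assumptions, not the values used in the notebook or in the publication.

```python
import shlex


def plink_preprocessing_commands(bfile="raw_data", out_prefix="clean"):
    """Return PLINK 1.9 commands (as argument lists) for the preprocessing steps.

    All prefixes and thresholds are hypothetical placeholders.
    """
    filtered = f"{out_prefix}_filtered"
    pruned = f"{out_prefix}_pruned"
    return [
        # Autosomal SNPs only, SNP filtering (missingness, minor allele
        # frequency) and removal of individuals with a low call rate.
        ["plink", "--bfile", bfile, "--autosome",
         "--geno", "0.05", "--maf", "0.01", "--mind", "0.1",
         "--make-bed", "--out", filtered],
        # LD pruning: writes <filtered>.prune.in and <filtered>.prune.out.
        ["plink", "--bfile", filtered,
         "--indep-pairwise", "50", "5", "0.2", "--out", filtered],
        # Keep only the SNPs that survived pruning.
        ["plink", "--bfile", filtered, "--extract", f"{filtered}.prune.in",
         "--make-bed", "--out", pruned],
    ]


for cmd in plink_preprocessing_commands():
    print(shlex.join(cmd))
```

The commands are only printed here. To actually run them, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where `plink` is on the `PATH`.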
-
PCA Analysis
- Computing PCs (PC1–PC20)
- Explained variance ratios
- Visualization: PC1 vs PC2, PC2 vs PC3, etc.
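A minimal sketch of the PC computation and explained variance ratios, run on synthetic genotypes. It uses NumPy's SVD directly rather than the notebook's own code; the function name `genotype_pca`, the two simulated populations, and all sizes are assumptions made for illustration.

```python
import numpy as np


def genotype_pca(genotypes, n_components=20):
    """PCA on a samples x SNPs matrix of 0/1/2 allele counts."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)               # centre each SNP column
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, len(S))
    pcs = U[:, :k] * S[:k]               # sample coordinates on PC1..PCk
    explained = S**2 / np.sum(S**2)      # share of total variance per PC
    return pcs, explained[:k]


rng = np.random.default_rng(0)
# Synthetic data: two hypothetical populations with shifted allele frequencies.
freqs_a = rng.uniform(0.1, 0.9, size=500)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.15, size=500), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freqs_a, size=(30, 500)),
    rng.binomial(2, freqs_b, size=(30, 500)),
])

pcs, ratio = genotype_pca(geno, n_components=20)
print(pcs.shape)           # one row per sample, one column per PC
print(ratio[:3].round(3))  # explained variance ratio of PC1..PC3
```

Plotting `pcs[:, 0]` against `pcs[:, 1]` gives the PC1 vs PC2 view, `pcs[:, 1]` against `pcs[:, 2]` the PC2 vs PC3 view, and so on.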
-
Visualization & Interpretation
- Coloring samples by population
- Overlaying geographical coordinates (synthetic and/or from supplementary materials)
- Checking correlations between PCs and coordinates
- Clustering in PC space (PyCaret, KMeans, etc.)
- PCA clusters broadly reflect geographical population structure.
- Clustering confirms separation into groups consistent with ethnic and regional divisions.
- Some plots (especially maps and interactive visualizations) may render correctly only inside Jupyter or Colab environments.
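The correlation check and the clustering step can be illustrated on synthetic data. The sketch below is not the notebook's code: the helper names, the simulated PC scores, and the synthetic latitude/longitude values are all assumptions, and a plain NumPy k-means (Lloyd's algorithm with a deterministic farthest-first start) stands in for the PyCaret and KMeans calls mentioned above.

```python
import numpy as np


def pc_coordinate_correlations(pcs, coords):
    """Pearson r between each PC (column of pcs) and each coordinate column."""
    n_pc = pcs.shape[1]
    r = np.corrcoef(np.hstack([pcs, coords]), rowvar=False)
    return r[:n_pc, n_pc:]   # rows: PCs, columns: latitude, longitude


def kmeans(points, k, n_iter=50):
    """Lloyd's k-means with farthest-first initial centres; returns labels."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


rng = np.random.default_rng(1)
# Synthetic PC1/PC2 scores for two hypothetical, well-separated groups.
pc_scores = np.vstack([
    rng.normal([-8, 0], 1.0, size=(40, 2)),
    rng.normal([8, 0], 1.0, size=(40, 2)),
])
# Synthetic coordinates: latitude tracks PC1, longitude is unrelated noise.
coords = np.column_stack([
    -16.0 + 0.1 * pc_scores[:, 0] + rng.normal(0, 0.2, size=80),
    13.0 + rng.normal(0, 0.5, size=80),
])

corr = pc_coordinate_correlations(pc_scores, coords)
labels = kmeans(pc_scores, k=2)
print(corr.round(2))
print(np.bincount(labels))
```

With scikit-learn installed, the clustering step would more commonly be written as `KMeans(n_clusters=2).fit_predict(pc_scores)`; the hand-rolled version is used here only to keep the example free of extra dependencies.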
-
Install dependencies:
pip install -r requirements.txt
-
Launch the notebook locally:
jupyter notebook Practice_PCA_genotypes_final.ipynb
- Python 3.10+
- PLINK v1.9
- pandas, numpy, scikit-learn
- matplotlib, seaborn, plotly, folium
- pycaret (for clustering)
The dataset was provided already filtered and LD-pruned as part of the course.
All major preprocessing steps are still included in the notebook for clarity and reproducibility.