Machine learning system for predicting genetic disorders using genomic, clinical, and demographic data. Implements robust preprocessing, feature selection, and multi-model classification (RF, XGBoost, LightGBM, CatBoost) with cross-validation to support early, data-driven genetic risk assessment.

dyneth02/Genome-based-Disorder-Prediction-System

Genetic Disorder Prediction System

🧬 Overview

A comprehensive machine learning system for predicting genetic disorders and their specific subclasses using advanced ensemble methods and hierarchical classification. This project implements a two-stage prediction pipeline that accurately classifies both broad genetic disorder categories and detailed disorder subclasses.

🎯 Features

  • Two-Stage Hierarchical Prediction: Genetic Disorder → Disorder Subclass
  • Multiple ML Models: Random Forest, XGBoost, Logistic Regression, LightGBM, CatBoost, Linear SVC
  • Advanced Preprocessing: Strategic data leakage prevention and feature engineering
  • 5-Fold Cross Validation: Robust model evaluation with hyperparameter tuning
  • Comprehensive Metrics: Accuracy, ROC-AUC, Precision, Recall, F1-Score
  • Web Application: Full-stack deployment with FastAPI backend and React frontend
  • Model Interpretability: Feature importance analysis and probability outputs

📊 Model Architecture

Parent Model (Genetic Disorder Classification)

  • Target: Mitochondrial genetic inheritance disorders, Multifactorial genetic inheritance disorders, Single-gene inheritance diseases
  • Primary Algorithm: Optimized Random Forest with SMOTE
  • Performance: 85.2% accuracy in 5-fold cross-validation (see Performance Results)
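To make the SMOTE step concrete, here is a minimal standard-library sketch of the core idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_oversample` and its parameters are hypothetical and are not taken from this repository. In practice a maintained implementation such as `SMOTE` from imbalanced-learn would normally be used instead.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between neighbours.

    Illustrative only: `minority` is a list of numeric feature vectors
    belonging to the under-represented class (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6)
```

Oversampling should be applied to the training portion of each cross-validation fold only. Synthetic points derived from validation samples would otherwise inflate the reported scores.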

Child Model (Disorder Subclass Classification)

  • Target: Cancer, Cystic fibrosis, Diabetes, Down syndrome, Huntington's disease, Klinefelter syndrome, Leber's hereditary optic neuropathy, Leigh syndrome, Turner syndrome
  • Architecture: Hierarchical model using parent probabilities as features
  • Integration: Seamless two-stage prediction pipeline
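The data flow of the two-stage design can be sketched as follows. This is an illustration under assumptions, not the repository's code: `predict_two_stage`, `stub_parent`, and `stub_child` are hypothetical names, and the stubs stand in for trained models that return class probabilities.

```python
# Parent class labels as listed in the Parent Model section above.
PARENT_CLASSES = [
    "Mitochondrial genetic inheritance disorders",
    "Multifactorial genetic inheritance disorders",
    "Single-gene inheritance diseases",
]


def predict_two_stage(features, parent_proba, child_proba):
    """Run the parent model, then feed its probabilities to the child model."""
    parent_probs = parent_proba(features)
    # Append parent probabilities in a fixed class order so the child model
    # always sees them in the same feature positions.
    child_features = list(features) + [parent_probs[c] for c in PARENT_CLASSES]
    child_probs = child_proba(child_features)
    return {
        "genetic_disorder": max(parent_probs, key=parent_probs.get),
        "disorder_subclass": max(child_probs, key=child_probs.get),
        "probabilities": {"parent": parent_probs, "child": child_probs},
    }


def stub_parent(features):
    """Hypothetical stand-in for the trained parent model."""
    return dict(zip(PARENT_CLASSES, [0.1, 0.2, 0.7]))


def stub_child(features):
    """Hypothetical stand-in for the child model (two subclasses only)."""
    single_gene = features[-1]  # last appended parent probability
    return {"Huntington's disease": single_gene, "Diabetes": 1.0 - single_gene}


result = predict_two_stage([35.0, 7.0], stub_parent, stub_child)
```

When training the child model this way, the parent probabilities used as features are usually generated out-of-fold, so the child does not learn from predictions the parent made on its own training rows.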

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Node.js 16+
  • Git

Installation

Backend (FastAPI)

cd backend
pip install -r requirements.txt

# Start the backend server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Frontend (React)

cd frontend
npm install

# Start the development server
npm run dev

Usage

  1. Access Web Interface: Open http://localhost:5173 in your browser
  2. Input Patient Data: Fill in the comprehensive medical form
  3. Get Predictions: Receive instant genetic disorder risk assessments
  4. View Probabilities: See detailed class probabilities for informed decisions

📁 Project Structure

genome-based-disorder-prediction-system/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── models/         # ML model implementations
│   │   ├── routers/        # API endpoints
│   │   └── services/       # Business logic
│   ├── models/             # Trained model artifacts
│   └── requirements.txt
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/     # React components
│   │   └── App.jsx         # Main application
│   └── package.json
├── notebooks/              # Jupyter notebooks for analysis
├── data/                   # Dataset files
├── docs/                   # Documentation
└── README.md

🔧 Technical Implementation

Machine Learning Pipeline

  1. Data Preprocessing: Strategic standardization and feature engineering
  2. Model Training: 5-fold cross-validation with RandomizedSearchCV
  3. Hierarchical Prediction: Two-stage Random Forest architecture
  4. Performance Evaluation: Comprehensive metrics and visualization
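Steps 1 and 2 interact: to avoid leakage, standardization statistics have to come from the training part of each fold only. The sketch below shows that pattern with the standard library. The helper names `stratified_folds` and `standardize_fold` and the toy data are assumptions for illustration. With scikit-learn, placing the scaler and the classifier in one `Pipeline` passed to `RandomizedSearchCV` has the same effect.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin
    return folds


def standardize_fold(values, train_idx, val_idx):
    """Scale one feature using mean/std computed from training rows only."""
    train = [values[i] for i in train_idx]
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against zero variance

    def scale(i):
        return (values[i] - mu) / sigma

    return [scale(i) for i in train_idx], [scale(i) for i in val_idx]


# Toy data: 10 samples of class "a", 5 of class "b", one numeric feature.
labels = ["a"] * 10 + ["b"] * 5
values = [float(i) for i in range(15)]
folds = stratified_folds(labels, k=5)
for fold in folds:
    train_idx = [i for other in folds if other is not fold for i in other]
    train_scaled, val_scaled = standardize_fold(values, train_idx, fold)
    # A model would be fitted on train_scaled and scored on val_scaled here.
```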

Key Technologies

  • Backend: FastAPI, Scikit-learn, Pandas, NumPy
  • Frontend: React, Vite, Tailwind CSS
  • ML Libraries: Scikit-learn, XGBoost, LightGBM, CatBoost
  • Deployment: Docker-ready, RESTful APIs

📈 Performance Results

Cross-Validation Metrics (5-Fold)

| Model Stage | Accuracy | ROC-AUC | Precision | Recall |
|---|---|---|---|---|
| Parent (Genetic Disorder) | 85.2% | 0.912 | 0.847 | 0.852 |
| Child (Disorder Subclass) | 83.7% | 0.894 | 0.831 | 0.837 |
| Overall System | 84.5% | 0.903 | 0.839 | 0.845 |
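In every row the Recall value equals Accuracy. That is what support-weighted averaging produces, because weighted recall and accuracy are mathematically identical. Whether the project used weighted averaging is an assumption. The sketch below, with the hypothetical helper `weighted_metrics` and made-up labels, shows how such figures can be computed from true and predicted labels.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision and recall."""
    n = len(y_true)
    support = Counter(y_true)    # true samples per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    # Per-class precision is 0 when the class was never predicted.
    precision = sum(
        (support[c] / n) * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    recall = sum((support[c] / n) * (correct[c] / support[c]) for c in support)
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]
metrics = weighted_metrics(y_true, y_pred)
```

For these toy labels, accuracy and weighted recall are both 4/6, while weighted precision is 5/9.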

Feature Importance

Top predictive features include:

  • Genetic markers and inheritance patterns
  • Clinical test results and blood work
  • Patient demographics and family history
  • Symptom presentations and severity scores

🏗️ API Documentation

Prediction Endpoint

POST /api/predict
Content-Type: application/json

{
  "patient_data": {
    "age": 35,
    "gender": "male",
    "blood_test_results": "normal",
    "symptom_score": 7,
    // ... other features
  }
}

Response

{
  "genetic_disorder": "Single-gene inheritance diseases",
  "disorder_subclass": "Huntington's disease",
  "probabilities": {
    "parent": { /* Genetic disorder probabilities */ },
    "child": { /* Subclass probabilities */ }
  },
  "confidence_score": 0.89
}
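A minimal client for this endpoint, using only the Python standard library, might look like the sketch below. The base URL assumes the backend is running locally on port 8000 as in Quick Start. The helper names `build_predict_request` and `predict` are hypothetical, and the payload contains only the fields shown in the example request.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: uvicorn port from Quick Start


def build_predict_request(patient_data, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/predict."""
    body = json.dumps({"patient_data": patient_data}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(patient_data, base_url=BASE_URL, timeout=10):
    """Send the request and decode the JSON response."""
    request = build_predict_request(patient_data, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not require a running server.
request = build_predict_request(
    {"age": 35, "gender": "male", "blood_test_results": "normal", "symptom_score": 7}
)
```

With the server running, `predict({...})` should return a dictionary shaped like the response above, so the predicted subclass would be available as `result["disorder_subclass"]`.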

🔬 Model Training

To retrain the models with your data:

# Run the training pipeline
python -m backend.app.models.train_pipeline

# Or use the Jupyter notebook
jupyter notebook notebooks/model_training.ipynb

🌟 Key Innovations

  1. Strategic Data Leakage Prevention: Preprocessing designed to keep leaked information out of the features
  2. Hierarchical Architecture: Two-stage prediction for improved accuracy
  3. Ensemble Methods: Combines multiple algorithms for robust performance
  4. Real-time Deployment: Production-ready web application
  5. Comprehensive Evaluation: Extensive metrics and visualization

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Supervisors: Mr. Prasanna Sumathipala, Mr. Samadhi Chathuranga, Mr. Chan, Ms. Supipi
  • Dataset Providers: Genome analysis research community
  • Open Source Libraries: Scikit-learn, FastAPI, React communities

📞 Support

For support and questions:

🔮 Future Enhancements

  • Integration with electronic health records
  • Real-time learning capabilities
  • Multi-omics data integration
  • Mobile application development
  • Advanced explainable AI features

⭐ Star this repository if you find it helpful!

Last updated: October 2025
