A machine learning system that predicts genetic disorders and their specific subclasses using ensemble methods and hierarchical classification. The project implements a two-stage prediction pipeline: the first stage classifies the broad genetic disorder category, and the second stage classifies the detailed disorder subclass.
- Two-Stage Hierarchical Prediction: Genetic Disorder → Disorder Subclass
- Multiple ML Models: Random Forest, XGBoost, Logistic Regression, LightGBM, CatBoost, Linear SVC
- Advanced Preprocessing: Strategic data leakage prevention and feature engineering
- 5-Fold Cross Validation: Robust model evaluation with hyperparameter tuning
- Comprehensive Metrics: Accuracy, ROC-AUC, Precision, Recall, F1-Score
- Web Application: Full-stack deployment with FastAPI backend and React frontend
- Model Interpretability: Feature importance analysis and probability outputs
- Target: Mitochondrial genetic inheritance disorders, Multifactorial genetic inheritance disorders, Single-gene inheritance diseases
- Primary Algorithm: Optimized Random Forest with SMOTE
- Performance: >85% accuracy with comprehensive cross-validation
- Target: Cancer, Cystic fibrosis, Diabetes, Down syndrome, Huntington's disease, Klinefelter syndrome, Leber's hereditary optic neuropathy, Leigh syndrome, Turner syndrome
- Architecture: Hierarchical model using parent probabilities as features
- Integration: Seamless two-stage prediction pipeline
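The two-stage architecture described above can be sketched as follows. This is a minimal illustration, not the project's actual training code: the data is synthetic, the parent/child label mapping is a toy one, and the helper name `predict_two_stage` is hypothetical. It assumes scikit-learn and NumPy are installed. The parent probabilities fed to the child model during training are produced out-of-fold, which is one way to keep the second stage from learning on probabilities the parent generated for rows it had already seen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data: 9 child classes grouped into 3 parent classes.
X, y_child = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=9, n_clusters_per_class=1, random_state=0,
)
y_parent = y_child // 3  # toy mapping: each parent owns three subclasses

# Stage 1: parent model. Out-of-fold probabilities are used as training
# features for stage 2 so they are never predictions on already-seen rows.
parent = RandomForestClassifier(n_estimators=50, random_state=0)
oof_parent_proba = cross_val_predict(
    parent, X, y_parent, cv=5, method="predict_proba"
)
parent.fit(X, y_parent)

# Stage 2: child model trained on original features plus parent probabilities.
child = RandomForestClassifier(n_estimators=50, random_state=0)
child.fit(np.hstack([X, oof_parent_proba]), y_child)


def predict_two_stage(X_new):
    """Return (parent_proba, child_proba) for new samples."""
    parent_proba = parent.predict_proba(X_new)
    child_proba = child.predict_proba(np.hstack([X_new, parent_proba]))
    return parent_proba, child_proba


parent_proba, child_proba = predict_two_stage(X[:5])
print(parent_proba.shape, child_proba.shape)
```

At inference time the child model always receives the parent's probabilities for the same sample, so both stages can be served from a single call.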
- Python 3.8+
- Node.js 16+
- Git
cd backend
pip install -r requirements.txt
# Start the backend server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
cd frontend
npm install
# Start the development server
npm run dev
- Access Web Interface: Open http://localhost:5173 in your browser
- Input Patient Data: Fill in the comprehensive medical form
- Get Predictions: Receive instant genetic disorder risk assessments
- View Probabilities: See detailed class probabilities for informed decisions
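Predictions can presumably also be requested without the web form by calling the `POST /api/predict` endpoint documented later in this README. The sketch below uses only the Python standard library. It assumes the backend is reachable at `http://localhost:8000` (the port from the `uvicorn` command above) and that the endpoint is mounted at exactly `/api/predict`. The helper names `build_predict_request` and `top_class` are hypothetical, and the four patient fields are only the ones shown in the API example; the real endpoint likely requires more.

```python
import json
import urllib.request

# Assumed base URL, derived from the uvicorn command in this README.
API_URL = "http://localhost:8000/api/predict"


def build_predict_request(patient_data, url=API_URL):
    """Build (but do not send) a JSON POST request for the predict endpoint."""
    body = json.dumps({"patient_data": patient_data}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_class(probabilities):
    """Return the (label, probability) pair with the highest probability."""
    return max(probabilities.items(), key=lambda item: item[1])


request = build_predict_request(
    {
        "age": 35,
        "gender": "male",
        "blood_test_results": "normal",
        "symptom_score": 7,
    }
)

# With the backend running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
#     print(result["genetic_disorder"], result["disorder_subclass"])
#     print(top_class(result["probabilities"]["parent"]))
```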
genome-based-disorder-prediction-system/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── models/          # ML model implementations
│   │   ├── routers/         # API endpoints
│   │   └── services/        # Business logic
│   ├── models/              # Trained model artifacts
│   └── requirements.txt
├── frontend/                # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   └── App.jsx          # Main application
│   └── package.json
├── notebooks/               # Jupyter notebooks for analysis
├── data/                    # Dataset files
├── docs/                    # Documentation
└── README.md
- Data Preprocessing: Strategic standardization and feature engineering
- Model Training: 5-fold cross-validation with RandomizedSearchCV
- Hierarchical Prediction: Two-stage Random Forest architecture
- Performance Evaluation: Comprehensive metrics and visualization
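The model training step (5-fold cross-validation with `RandomizedSearchCV`) might look roughly like the sketch below. The data is synthetic and the search space is an assumption chosen for illustration, not the project's actual grid. SMOTE is left out to keep the example dependent on scikit-learn only; if it were added, placing it inside an imbalanced-learn `Pipeline` would keep resampling restricted to the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed patient features and parent labels.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=8,
    n_classes=3, random_state=0,
)

# Hypothetical search space; the real project may tune other parameters.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Stratified 5-fold split keeps class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=cv,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.best_estimator_` is refit on the full training data by default and could then be saved as a model artifact.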
- Backend: FastAPI, Scikit-learn, Pandas, NumPy
- Frontend: React, Vite, Tailwind CSS
- ML Libraries: Scikit-learn, XGBoost, LightGBM, CatBoost
- Deployment: Docker-ready, RESTful APIs
| Model Stage | Accuracy | ROC-AUC | Precision | Recall |
|---|---|---|---|---|
| Parent (Genetic Disorder) | 85.2% | 0.912 | 0.847 | 0.852 |
| Child (Disorder Subclass) | 83.7% | 0.894 | 0.831 | 0.837 |
| Overall System | 84.5% | 0.903 | 0.839 | 0.845 |
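Metrics like those in the table above can be computed with `sklearn.metrics`. The sketch below uses a tiny hand-made set of labels and probabilities, not the project's data. It assumes weighted averaging across classes; the README does not state the averaging mode, but recall equal to accuracy in every row of the table is consistent with it, since weighted-average recall always equals accuracy.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Toy ground truth and predicted class probabilities for three classes.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])
y_pred = y_proba.argmax(axis=1)  # predicted class = most probable class

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
# One-vs-rest ROC-AUC needs the full probability matrix, not hard labels.
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
```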
Top predictive features include:
- Genetic markers and inheritance patterns
- Clinical test results and blood work
- Patient demographics and family history
- Symptom presentations and severity scores
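A ranking of predictive features like the one above is typically read from a fitted Random Forest's `feature_importances_` attribute. The sketch below shows the idea on synthetic data with placeholder feature names; it is not the project's analysis code, and the names `feature_0` to `feature_7` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; a real run would use the dataset's column names.
feature_names = [f"feature_{i}" for i in range(8)]

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.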
POST /api/predict
Content-Type: application/json
{
  "patient_data": {
    "age": 35,
    "gender": "male",
    "blood_test_results": "normal",
    "symptom_score": 7,
    // ... other features
  }
}
Response:
{
  "genetic_disorder": "Single-gene inheritance diseases",
  "disorder_subclass": "Huntington's disease",
  "probabilities": {
    "parent": { /* Genetic disorder probabilities */ },
    "child": { /* Subclass probabilities */ }
  },
  "confidence_score": 0.89
}
To retrain the models with your data:
# Run the training pipeline
python -m backend.app.models.train_pipeline
# Or use the Jupyter notebook
jupyter notebook notebooks/model_training.ipynb
- Data Leakage Prevention: Preprocessing designed to keep target information out of the training features
- Hierarchical Architecture: Two-stage prediction for improved accuracy
- Ensemble Methods: Combined multiple algorithms for robust performance
- Real-time Deployment: Production-ready web application
- Comprehensive Evaluation: Extensive metrics and visualization
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Supervisors: Mr. Prasanna Sumathipala, Mr. Samadhi Chathuranga, Mr. Chan, Ms. Supipi
- Dataset Providers: Genome analysis research community
- Open Source Libraries: Scikit-learn, FastAPI, React communities
For support and questions:
- Create an Issue
- Email: [email protected]
- Integration with electronic health records
- Real-time learning capabilities
- Multi-omics data integration
- Mobile application development
- Advanced explainable AI features
⭐ Star this repository if you find it helpful!
Last updated: October 2025