A deep learning project that analyzes Patient Discharge Summaries to predict health progress and goal achievement using fine-tuned Bio_ClinicalBERT transformer models.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Dataset
- Model Architecture
- Usage
- Training Process
- Results
- Visualization
- Requirements
- Contributing
- License
- Acknowledgments
## Overview

The Gain Extraction Model is an NLP system designed to automatically assess patient health improvements from clinical discharge summaries. By leveraging the Bio_ClinicalBERT transformer model, this project achieves high accuracy in determining whether patients have met their healthcare goals and made meaningful progress during treatment.
- Analyze patient discharge summaries efficiently
- Predict patient health progress with high accuracy
- Support healthcare providers in tracking patient outcomes
- Provide automated, scalable health assessment tools
## Features

- State-of-the-Art NLP: Utilizes Bio_ClinicalBERT, trained on MIMIC-III clinical notes
- Automated Preprocessing: Comprehensive text cleaning and normalization pipeline
- Class Imbalance Handling: Intelligent majority downsampling techniques
- GPU Optimization: CUDA-accelerated training and inference
- Comprehensive Evaluation: Detailed metrics including confusion matrices and accuracy scores
- TensorBoard Integration: Real-time training visualization
- Production-Ready: Modular code structure for easy deployment
## Project Structure

```
.
├── brief-review-of-gain-prediction.ipynb   # Main analysis notebook
├── brog_train.csv                          # Training dataset
├── brog_test.csv                           # Testing dataset
├── requirements.txt                        # Python dependencies
├── readme.txt                              # Project notes
├── save_tokenizer_bcb/                     # Saved tokenizer files
├── save_model_bcb/                         # Pre-trained model files
└── fine_tuned_model_bcb/                   # Fine-tuned model checkpoints
```
## Installation

### Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- 8GB+ RAM
- 10GB+ free disk space
### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/bilalhameed248/Brief-Review-Of-Gain-Prediction-Model.git
   cd Brief-Review-Of-Gain-Prediction-Model
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Download the required NLTK data:

   ```python
   import nltk
   nltk.download('stopwords')
   nltk.download('wordnet')
   nltk.download('punkt')
   ```

## Dataset

The project uses clinical discharge summaries with binary classification:
- Class 0: No significant health improvement
- Class 1: Positive health progress/goal achievement
- `brog_train.csv`: Training dataset with labeled examples
- `brog_test.csv`: Test dataset for model evaluation
The preprocessing pipeline includes:

1. Text Cleaning
   - Removal of special characters and punctuation
   - Lowercase conversion
   - Pattern removal
2. Linguistic Processing
   - Stop word removal
   - Lemmatization using WordNet
   - Word tokenization
3. Class Balancing
   - Majority downsampling
   - Removal of low-information samples (< 8 words)
## Model Architecture

- Base Model: `emilyalsentzer/Bio_ClinicalBERT`
- Training Data: MIMIC-III database (~880M words from ICU patient notes)

Architecture details:

- Transformer-based encoder
- 768 hidden dimensions
- 12 attention heads
- Fine-tuned for binary sequence classification
- Input: Tokenized text sequences (max_length=130)
- Output: Binary classification (0/1)
- Loss Function: Cross-entropy
- Optimizer: AdamW
- Learning Rate: 1e-5

## Usage

1. Open the Jupyter notebook:

   ```bash
   jupyter notebook brief-review-of-gain-prediction.ipynb
   ```

2. Run all cells sequentially, or execute specific sections:
- Data Loading & Preprocessing
- Model Training
- Evaluation & Visualization
### Inference Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('./save_tokenizer_bcb/')
model = AutoModelForSequenceClassification.from_pretrained('./fine_tuned_model_bcb/')
model.eval()

# Prepare input
text = "Patient shows significant improvement in mobility and pain management."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=130)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()
print(f"Prediction: {'Positive Progress' if prediction == 1 else 'No Significant Progress'}")
```

## Training Process

### Hyperparameters

| Parameter | Value |
|---|---|
| Batch Size | 4 |
| Learning Rate | 1e-5 |
| Max Steps | 1000 |
| Warmup Steps | 500 |
| Evaluation Strategy | Every 5 steps |
| Max Sequence Length | 130 tokens |
- Data Split: 80% train, 10% validation, 10% test
- Tokenization: Padding and truncation to max_length
- Fine-tuning: Transfer learning from Bio_ClinicalBERT
- Evaluation: Continuous validation monitoring
- Checkpoint Saving: Best model selection
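An 80/10/10 split as described above can be produced in two passes of `train_test_split`. The frame below is synthetic; in the project the split would be applied to the `brog_*` CSVs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset
df = pd.DataFrame({"text": [f"note {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})

# Carve off 20%, then halve it into validation and test (10% + 10%),
# stratifying so both classes appear in every split.
train_df, holdout = train_test_split(df, test_size=0.2,
                                     stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(holdout, test_size=0.5,
                                   stratify=holdout["label"], random_state=42)
print(len(train_df), len(val_df), len(test_df))  # 80 10 10
```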
Launch TensorBoard to visualize training metrics:
```bash
cd fine_tuned_model_bcb
tensorboard --logdir=runs
```

Access at: http://localhost:6006/
## Results

| Metric | Pre-trained Model | Fine-tuned Model |
|---|---|---|
| Accuracy | 50% | 90% |
| Improvement | Baseline | +40 percentage points |
The model demonstrates strong performance in both positive and negative class predictions, with detailed confusion matrices available in the notebook visualizations.
- ✅ Significant accuracy improvement after fine-tuning
- ✅ Balanced performance across both classes
- ✅ Effective handling of clinical terminology
- ✅ Robust to various text lengths and formats
## Visualization

The project includes comprehensive visualizations:
- Class Distribution: Bar plots showing label balance
- Common Phrases: Most frequent terms per class
- Confusion Matrices: Prediction accuracy breakdown
- Training Curves: Loss and accuracy over time (TensorBoard)
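A confusion matrix like those in the notebook can be rendered with seaborn. The labels and predictions below are hypothetical placeholders for the real test-set outputs:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                 xticklabels=["No Progress", "Progress"],
                 yticklabels=["No Progress", "Progress"])
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
plt.savefig("confusion_matrix.png")
```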
## Requirements

See `requirements.txt` for full details. Key dependencies:

```
torch>=2.0.0
transformers>=4.30.0
datasets>=2.12.0
pandas>=1.5.0
numpy>=1.24.0
scikit-learn>=1.2.0
seaborn>=0.12.0
matplotlib>=3.7.0
nltk>=3.8
accelerate>=0.20.0
evaluate>=0.4.0
pynvml>=11.5.0
tensorboard>=2.13.0
```

## Acknowledgments

- Bio_ClinicalBERT: Emily Alsentzer et al. for the pre-trained clinical BERT model
- MIMIC-III: Beth Israel Deaconess Medical Center for the clinical database
- Hugging Face: For the Transformers library and model hub
- PyTorch: For the deep learning framework
Made with ❤️ for better healthcare outcomes

⭐ Star this repo if you find it helpful!