A deep learning project that analyzes Patient Discharge Summaries to predict health progress and goal achievement using fine-tuned Bio_ClinicalBERT transformer models.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Dataset
- Model Architecture
- Usage
- Training Process
- Results
- Visualization
- Requirements
- Contributing
- License
- Acknowledgments
## Overview

The Gain Extraction Model is an NLP system designed to automatically assess patient health improvements from clinical discharge summaries. By leveraging the Bio_ClinicalBERT transformer model, this project achieves high accuracy in determining whether patients have met their healthcare goals and made meaningful progress during treatment.
- Analyze patient discharge summaries efficiently
- Predict patient health progress with high accuracy
- Support healthcare providers in tracking patient outcomes
- Provide automated, scalable health assessment tools
## Features

- State-of-the-Art NLP: Utilizes Bio_ClinicalBERT, trained on MIMIC-III clinical notes
- Automated Preprocessing: Comprehensive text cleaning and normalization pipeline
- Class Imbalance Handling: Intelligent majority downsampling techniques
- GPU Optimization: CUDA-accelerated training and inference
- Comprehensive Evaluation: Detailed metrics including confusion matrices and accuracy scores
- TensorBoard Integration: Real-time training visualization
- Production-Ready: Modular code structure for easy deployment
## Project Structure

```
.
├── brief-review-of-gain-prediction.ipynb   # Main analysis notebook
├── brog_train.csv                          # Training dataset
├── brog_test.csv                           # Testing dataset
├── requirements.txt                        # Python dependencies
├── readme.txt                              # Project notes
├── save_tokenizer_bcb/                     # Saved tokenizer files
├── save_model_bcb/                         # Pre-trained model files
└── fine_tuned_model_bcb/                   # Fine-tuned model checkpoints
```
## Installation

### Prerequisites

- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- 8GB+ RAM
- 10GB+ free disk space
### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/bilalhameed248/Brief-Review-Of-Gain-Prediction-Model.git
   cd Brief-Review-Of-Gain-Prediction-Model
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Download the required NLTK data:

   ```python
   import nltk
   nltk.download('stopwords')
   nltk.download('wordnet')
   nltk.download('punkt')
   ```

## Dataset

The project uses clinical discharge summaries with binary classification:
- Class 0: No significant health improvement
- Class 1: Positive health progress/goal achievement
- `brog_train.csv`: Training dataset with labeled examples
- `brog_test.csv`: Test dataset for model evaluation
The preprocessing pipeline includes:

1. Text Cleaning
   - Removal of special characters and punctuation
   - Lowercase conversion
   - Pattern removal
2. Linguistic Processing
   - Stop word removal
   - Lemmatization using WordNet
   - Word tokenization
3. Class Balancing
   - Majority downsampling
   - Removal of low-information samples (< 8 words)
## Model Architecture

- Base Model: `emilyalsentzer/Bio_ClinicalBERT`
- Training Data: MIMIC-III database (~880M words from ICU patient notes)

Architecture details:

- Transformer-based encoder
- 768 hidden dimensions
- 12 attention heads
- Fine-tuned for binary sequence classification
- Input: Tokenized text sequences (max_length=130)
- Output: Binary classification (0/1)
- Loss Function: Cross-entropy
- Optimizer: AdamW
- Learning Rate: 1e-5

## Usage

1. Open the Jupyter notebook:

   ```bash
   jupyter notebook brief-review-of-gain-prediction.ipynb
   ```

2. Run all cells sequentially, or execute specific sections:
- Data Loading & Preprocessing
- Model Training
- Evaluation & Visualization
### Inference Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('./save_tokenizer_bcb/')
model = AutoModelForSequenceClassification.from_pretrained('./fine_tuned_model_bcb/')
model.eval()

# Prepare input
text = "Patient shows significant improvement in mobility and pain management."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=130)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()
print(f"Prediction: {'Positive Progress' if prediction == 1 else 'No Significant Progress'}")
```

## Training Process

### Hyperparameters

| Parameter | Value |
|---|---|
| Batch Size | 4 |
| Learning Rate | 1e-5 |
| Max Steps | 1000 |
| Warmup Steps | 500 |
| Evaluation Strategy | Every 5 steps |
| Max Sequence Length | 130 tokens |
- Data Split: 80% train, 10% validation, 10% test
- Tokenization: Padding and truncation to max_length
- Fine-tuning: Transfer learning from Bio_ClinicalBERT
- Evaluation: Continuous validation monitoring
- Checkpoint Saving: Best model selection
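An 80/10/10 split as described above can be produced in two passes of `train_test_split`. The frame below is synthetic; in the project the split would be applied to the `brog_*` CSVs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset
df = pd.DataFrame({"text": [f"note {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})

# Carve off 20%, then halve it into validation and test (10% + 10%),
# stratifying so both classes appear in every split.
train_df, holdout = train_test_split(df, test_size=0.2,
                                     stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(holdout, test_size=0.5,
                                   stratify=holdout["label"], random_state=42)
print(len(train_df), len(val_df), len(test_df))  # 80 10 10
```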
Launch TensorBoard to visualize training metrics:
```bash
cd fine_tuned_model_bcb
tensorboard --logdir=runs
```

Access at: http://localhost:6006/
## Results

| Metric | Pre-trained Model | Fine-tuned Model |
|---|---|---|
| Accuracy | 50% | 90% |
| Improvement | Baseline | +40 percentage points |
The model demonstrates strong performance in both positive and negative class predictions, with detailed confusion matrices available in the notebook visualizations.
- ✅ Significant accuracy improvement after fine-tuning
- ✅ Balanced performance across both classes
- ✅ Effective handling of clinical terminology
- ✅ Robust to various text lengths and formats
## Visualization

The project includes comprehensive visualizations:
- Class Distribution: Bar plots showing label balance
- Common Phrases: Most frequent terms per class
- Confusion Matrices: Prediction accuracy breakdown
- Training Curves: Loss and accuracy over time (TensorBoard)
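A confusion matrix like those in the notebook can be rendered with seaborn. The labels and predictions below are hypothetical placeholders for the real test-set outputs:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                 xticklabels=["No Progress", "Progress"],
                 yticklabels=["No Progress", "Progress"])
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
plt.savefig("confusion_matrix.png")
```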
## Requirements

See `requirements.txt` for full details. Key dependencies:

```
torch>=2.0.0
transformers>=4.30.0
datasets>=2.12.0
pandas>=1.5.0
numpy>=1.24.0
scikit-learn>=1.2.0
seaborn>=0.12.0
matplotlib>=3.7.0
nltk>=3.8
accelerate>=0.20.0
evaluate>=0.4.0
pynvml>=11.5.0
tensorboard>=2.13.0
```

## Acknowledgments

- Bio_ClinicalBERT: Emily Alsentzer et al. for the pre-trained clinical BERT model
- MIMIC-III: Beth Israel Deaconess Medical Center for the clinical database
- Hugging Face: For the Transformers library and model hub
- PyTorch: For the deep learning framework
Made with ❤️ for better healthcare outcomes

⭐ Star this repo if you find it helpful!