Skip to content

bilalhameed248/Brief-Review-Of-Gain-Prediction-Model

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

7 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ₯ Gain Extraction Model - Patient Progress Prediction

A deep learning project that analyzes Patient Discharge Summaries to predict health progress and goal achievement using fine-tuned Bio_ClinicalBERT transformer models.

Python PyTorch Transformers License


πŸ“‹ Table of Contents


🎯 Overview

The Gain Extraction Model is an advanced NLP system designed to automatically assess patient health improvements from clinical discharge summaries. By leveraging the Bio_ClinicalBERT transformer model, this project achieves high accuracy in determining whether patients have met their healthcare goals and made meaningful progress during treatment.

Key Objectives

  • πŸ“Š Analyze patient discharge summaries efficiently
  • 🎯 Predict patient health progress with high accuracy
  • πŸ₯ Support healthcare providers in tracking patient outcomes
  • ⚑ Provide automated, scalable health assessment tools

✨ Features

  • State-of-the-Art NLP: Utilizes Bio_ClinicalBERT, specifically trained on MIMIC-III clinical notes
  • Automated Preprocessing: Comprehensive text cleaning and normalization pipeline
  • Class Imbalance Handling: Intelligent majority downsampling techniques
  • GPU Optimization: CUDA-accelerated training and inference
  • Comprehensive Evaluation: Detailed metrics including confusion matrices and accuracy scores
  • TensorBoard Integration: Real-time training visualization
  • Production-Ready: Modular code structure for easy deployment

πŸ“ Project Structure

.
β”œβ”€β”€ brief-review-of-gain-prediction.ipynb  # Main analysis notebook
β”œβ”€β”€ brog_train.csv                          # Training dataset
β”œβ”€β”€ brog_test.csv                           # Testing dataset
β”œβ”€β”€ requirements.txt                        # Python dependencies
β”œβ”€β”€ readme.txt                              # Project notes
β”œβ”€β”€ save_tokenizer_bcb/                     # Saved tokenizer files
β”œβ”€β”€ save_model_bcb/                         # Pre-trained model files
└── fine_tuned_model_bcb/                   # Fine-tuned model checkpoints

πŸš€ Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (recommended)
  • 8GB+ RAM
  • 10GB+ free disk space

Setup

  1. Clone the repository
git clone https://github.com/bilalhameed248/Brief-Review-Of-Gain-Prediction-Model.git
cd gain-prediction-project
  1. Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install dependencies
pip install -r requirements.txt
  1. Download NLTK data
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

πŸ“Š Dataset

Data Description

The project uses clinical discharge summaries with binary classification:

  • Class 0: No significant health improvement
  • Class 1: Positive health progress/goal achievement

Data Files

  • brog_train.csv: Training dataset with labeled examples
  • brog_test.csv: Test dataset for model evaluation

Data Preprocessing

The preprocessing pipeline includes:

  1. Text Cleaning

    • Removal of special characters and punctuation
    • Lowercase conversion
    • Pattern removal
  2. Linguistic Processing

    • Stop word removal
    • Lemmatization using WordNet
    • Word tokenization
  3. Class Balancing

    • Majority downsampling
    • Removal of low-information samples (< 8 words)

🧠 Model Architecture

Bio_ClinicalBERT

Base Model: emilyalsentzer/Bio_ClinicalBERT

Training Data: MIMIC-III database (~880M words from ICU patient notes)

Architecture Details:

  • Transformer-based encoder
  • 768 hidden dimensions
  • 12 attention heads
  • Fine-tuned for binary sequence classification

Model Configuration

- Input: Tokenized text sequences (max_length=130)
- Output: Binary classification (0/1)
- Loss Function: Cross-entropy
- Optimizer: AdamW
- Learning Rate: 1e-5

πŸ’» Usage

Quick Start

  1. Open the Jupyter notebook
jupyter notebook brief-review-of-gain-prediction.ipynb
  1. Run all cells sequentially or execute specific sections:
    • Data Loading & Preprocessing
    • Model Training
    • Evaluation & Visualization

Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('./save_tokenizer_bcb/')
model = AutoModelForSequenceClassification.from_pretrained('./fine_tuned_model_bcb/')

# Prepare input
text = "Patient shows significant improvement in mobility and pain management."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=130)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1)
    
print(f"Prediction: {'Positive Progress' if prediction == 1 else 'No Significant Progress'}")

πŸŽ“ Training Process

Training Configuration

Parameter Value
Batch Size 4
Learning Rate 1e-5
Max Steps 1000
Warmup Steps 500
Evaluation Strategy Every 5 steps
Max Sequence Length 130 tokens

Training Pipeline

  1. Data Split: 80% train, 10% validation, 10% test
  2. Tokenization: Padding and truncation to max_length
  3. Fine-tuning: Transfer learning from Bio_ClinicalBERT
  4. Evaluation: Continuous validation monitoring
  5. Checkpoint Saving: Best model selection

Monitoring Training

Launch TensorBoard to visualize training metrics:

cd fine_tuned_model_bcb
tensorboard --logdir=runs

Access at: http://localhost:6006/


πŸ“ˆ Results

Performance Metrics

Metric Pre-trained Model Fine-tuned Model
Accuracy 50% 90%
Improvement Baseline +40%

Confusion Matrix

The model demonstrates strong performance in both positive and negative class predictions, with detailed confusion matrices available in the notebook visualizations.

Key Findings

  • βœ… Significant accuracy improvement after fine-tuning
  • βœ… Balanced performance across both classes
  • βœ… Effective handling of clinical terminology
  • βœ… Robust to various text lengths and formats

πŸ“Š Visualization

The project includes comprehensive visualizations:

  • Class Distribution: Bar plots showing label balance
  • Common Phrases: Most frequent terms per class
  • Confusion Matrices: Prediction accuracy breakdown
  • Training Curves: Loss and accuracy over time (TensorBoard)

πŸ“¦ Requirements

See requirements.txt for full details. Key dependencies:

torch>=2.0.0
transformers>=4.30.0
datasets>=2.12.0
pandas>=1.5.0
numpy>=1.24.0
scikit-learn>=1.2.0
seaborn>=0.12.0
matplotlib>=3.7.0
nltk>=3.8
accelerate>=0.20.0
evaluate>=0.4.0
pynvml>=11.5.0
tensorboard>=2.13.0

πŸ™ Acknowledgments

  • Bio_ClinicalBERT: Emily Alsentzer et al. for the pre-trained clinical BERT model
  • MIMIC-III: Beth Israel Deaconess Medical Center for the clinical database
  • Hugging Face: For the Transformers library and model hub
  • PyTorch: For the deep learning framework

Made with ❀️ for better healthcare outcomes

⭐ Star this repo if you find it helpful!

About

Health Progress Prediction and Goal Attainment Analysis in Patient Discharge Summaries with BERT and Tensorflow. - Feb 2022 - Jun 2023

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors