A deep learning-based project for automated classification of clinical assessment text, identifying whether a "Diagnosis affecting recovery" is present in initial evaluations.
This project implements a fine-tuned BERT-based model (Bio_ClinicalBERT) to analyze clinical assessment text and determine if the documentation contains evidence of diagnosis affecting patient recovery. The model processes clinical impressions and evaluations to support healthcare documentation quality assurance by identifying whether recovery-impacting diagnoses are properly documented.
- Pre-trained Clinical Model: Utilizes `emilyalsentzer/Bio_ClinicalBERT`, optimized for clinical text
- Text Preprocessing: Advanced NLP preprocessing with lemmatization and stopword removal
- Binary Classification: Identifies presence/absence of diagnosis affecting recovery
- High-Confidence Threshold: 95% confidence threshold for predictions
- Sentence-Level Analysis: Tokenizes and analyzes text at sentence granularity
- GPU Acceleration: CUDA-enabled training and inference
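To illustrate the sentence-level analysis above, here is a minimal sketch of splitting a clinical note into sentences before classification. The project itself relies on NLTK tokenization; the regex splitter and `split_sentences` name here are stand-ins for illustration only.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace.
    A stand-in for NLTK tokenization, which the project actually uses."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

note = "Patient reports knee pain. Diabetes may slow wound healing. Plan: PT twice weekly."
for sentence in split_sentences(note):
    print(sentence)  # each sentence would be classified individually
```

Each resulting sentence can then be fed to the classifier independently, so a single recovery-impacting statement is not diluted by surrounding documentation.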
```
.
├── fd_train.csv             # Training dataset
├── fd_test.csv              # Testing dataset
├── requirements.txt         # Python dependencies
├── Untitled.ipynb           # Main notebook with training & inference
├── save_tokenizer_bcb/      # Saved tokenizer (generated)
├── save_model_bcb/          # Saved base model (generated)
└── fine_tuned_model_bcb/    # Fine-tuned model checkpoints (generated)
```
- Deep Learning: PyTorch, Transformers (Hugging Face)
- NLP: NLTK, WordNetLemmatizer
- Data Processing: Pandas, NumPy
- Model Training: Hugging Face Trainer API
- Evaluation: scikit-learn, Hugging Face Evaluate
- Visualization: Seaborn
- GPU Management: CUDA, Numba, pynvml
1. Clone the repository:

   ```bash
   git clone https://github.com/bilalhameed248/Diagnosis-Effecting-Patients-Recovery-Detection.git
   cd Diagnosis-Effecting-Patients-Recovery-Detection
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download NLTK data:

   ```python
   import nltk
   nltk.download('stopwords')
   nltk.download('popular')
   ```

4. Prepare your data: ensure `fd_train.csv` and `fd_test.csv` contain the columns `sentence` (clinical text) and `label` (binary labels, 0/1).

5. Run the training cells in `Untitled.ipynb`:

   ```python
   # Load and preprocess data
   dataset = load_dataset('csv', data_files={
       'train': './fd_train.csv',
       'test': './fd_test.csv'
   })

   # Train model
   trainer.train()
   ```
Training configuration:
- Epochs: 10
- Batch size: 4
- Max sequence length: 512 tokens
- Evaluation strategy: Per epoch
Use the `inference_why_skill_care()` function for predictions:

```python
content = """
Patient evaluation text here...
"""
inference_why_skill_care(content)
```

Output: returns whether a diagnosis affecting recovery is found, with 95%+ confidence.
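One plausible way the 95% threshold can be applied is to softmax the model's two output logits and report the positive class only when its probability clears the bar. The `softmax` and `thresholded_label` helpers below are illustrative names, not functions from the repository.

```python
import math

CONFIDENCE_THRESHOLD = 0.95  # the 95% threshold described above

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def thresholded_label(logits):
    """Return 1 only when the positive class clears the confidence bar."""
    probs = softmax(logits)
    return 1 if probs[1] >= CONFIDENCE_THRESHOLD else 0

print(thresholded_label([0.1, 4.0]))  # well-separated logits: confident positive
print(thresholded_label([1.0, 1.2]))  # close logits: falls back to negative
```

Thresholding this way trades recall for precision, which suits a documentation quality-assurance setting where false positives are costly.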
The model includes:
- Accuracy metric evaluation
- Confusion matrix analysis
- Classification reports
- Checkpoint saving (top 2 models)
```python
TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    save_total_limit=2,
    evaluation_strategy="epoch"
)
```

- Lemmatization using WordNet
- Stopword removal (English)
- Lowercasing and special character removal
- Whitespace normalization
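The preprocessing steps above can be sketched as a single function. This is a simplified pure-Python approximation: the project uses NLTK's full English stopword list and `WordNetLemmatizer`, while this sketch uses a tiny inline stopword set and omits lemmatization.

```python
import re

# Tiny stand-in stopword set; the project uses NLTK's full English list.
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "in", "with"}

def preprocess(text):
    text = text.lower()                       # lowercasing
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # special character removal
    text = re.sub(r'\s+', ' ', text).strip()  # whitespace normalization
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopword removal
    # Lemmatization (WordNetLemmatizer in the project) is omitted here.
    return ' '.join(tokens)

print(preprocess("The patient, with DM-II, is slow to heal."))
```

Applying the same function at train and inference time keeps the model's input distribution consistent.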
- GPU: CUDA-compatible GPU recommended
- VRAM: Minimum 8GB for batch size 4
- Storage: ~2GB for model checkpoints
Monitor GPU usage:
```python
from pynvml import *
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
```

The model evaluates clinical documentation to determine if it contains evidence of a diagnosis affecting patient recovery. Key metrics tracked:
- Training/validation accuracy
- Confusion matrix
- Per-epoch performance
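The two headline metrics can be computed by hand as a sanity check. This is a minimal pure-Python sketch; the project itself uses scikit-learn and Hugging Face Evaluate, and the helper names below are illustrative.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows = actual class (0/1), columns = predicted class (0/1)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(confusion_matrix_2x2(y_true, y_pred))  # [[2, 0], [1, 2]]
print(accuracy(y_true, y_pred))              # 0.8
```

For an imbalanced QA dataset, the confusion matrix is the more informative of the two, since accuracy alone can look strong while missing most positive cases.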
For questions or support, please open an issue in the repository.