This directory contains scripts to generate pseudo labels for the ReX-CXR dataset using LLM-based label extraction.
The ReX dataset contains only free-text radiology reports (Findings and Impression) without structured disease labels. To enable label-aware contrastive learning, we use an LLM to extract 13 disease labels from the clinical text.
- `pseudo_labeling_rex.py` - Main script for pseudo labeling the entire ReX dataset
- `test_pseudo_labeling.py` - Test script to validate on a small sample before running the full dataset
- `README_PSEUDO_LABELING.md` - This file
The following 13 conditions are extracted (matching MIMIC-CXR labels):
- Atelectasis - Partial collapse of lung tissue
- Cardiomegaly - Enlarged heart
- Consolidation - Dense opacification of lung tissue
- Edema - Fluid accumulation in lungs
- Enlarged Cardiomediastinum - Widening of heart/mediastinal silhouette
- Fracture - Bone fracture (usually ribs)
- Lung Lesion - Nodule, mass, or focal lung abnormality
- Lung Opacity - Any area of increased opacity
- No Finding - Explicitly normal/unremarkable study
- Pleural Effusion - Fluid in pleural space
- Pleural Other - Other pleural abnormalities
- Pneumonia - Infectious consolidation/infiltrate
- Pneumothorax - Air in pleural space
- `1` - Condition is present (explicitly mentioned)
- `0` - Condition is absent (explicitly stated as normal/negative)
- `-1` - Condition is uncertain or not mentioned
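The request/response handling can be sketched as below. The actual prompt lives in `labeling_prompt` inside `pseudo_labeling_rex.py`, so the message text and the JSON reply schema here are illustrative assumptions, not the script's exact wording:

```python
import json

LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
          "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
          "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia", "Pneumothorax"]

def build_messages(findings: str, impression: str) -> list:
    """Assemble a chat-completion request asking for a JSON map of label -> {1, 0, -1}."""
    system = ("You are a radiology report labeler. For each of the following "
              "conditions, output 1 (present), 0 (absent), or -1 (uncertain or "
              "not mentioned) as a single JSON object: " + ", ".join(LABELS))
    user = f"Findings: {findings}\nImpression: {impression}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def parse_labels(raw_reply: str) -> dict:
    """Parse the LLM reply; any missing or malformed label falls back to -1."""
    try:
        parsed = json.loads(raw_reply)
    except json.JSONDecodeError:
        parsed = {}
    out = {}
    for lab in LABELS:
        val = parsed.get(lab, -1)
        out[lab] = int(val) if val in (1, 0, -1) else -1
    return out
</imports>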
Before processing the full dataset, test on a small sample to validate the extraction:
```bash
python /data/code/CXR_embedding_research/test_pseudo_labeling.py
```

This will:
- Load 5 sample reports from ReX dataset
- Extract labels using the LLM
- Display the results for manual inspection
If the test results look reasonable, run the full pseudo labeling:
```bash
python /data/code/CXR_embedding_research/pseudo_labeling_rex.py
```

This will:
- Load the entire ReX dataset from `/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug.csv`
- Extract labels for all samples using 64 parallel workers
- Save results to `/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug_labeled.csv`
- Log progress to `/data/code/CXR_embedding_research/pseudo_labeling_logs.txt`
Monitor the log file in real time:

```bash
tail -f /data/code/CXR_embedding_research/pseudo_labeling_logs.txt
```

- Parallel workers: 64 threads (adjustable via the `max_workers` parameter)
- Retry logic: 3 attempts per sample if extraction fails
- Error handling: failed extractions result in all labels set to -1 (uncertain)
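The parallel-extraction-with-retries pattern can be sketched as follows. The worker function `extract_fn` is a stand-in for the actual per-report extraction call in `pseudo_labeling_rex.py`:

```python
from concurrent.futures import ThreadPoolExecutor

LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
          "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
          "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia", "Pneumothorax"]

def label_with_retries(report: str, extract_fn, max_attempts: int = 3) -> dict:
    """Try extraction up to max_attempts times; on total failure, mark all labels -1."""
    for _ in range(max_attempts):
        try:
            return extract_fn(report)
        except Exception:
            continue
    return {lab: -1 for lab in LABELS}

def label_all(reports: list, extract_fn, max_workers: int = 64) -> list:
    """Label every report in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(label_with_retries, r, extract_fn) for r in reports]
        return [f.result() for f in futures]
```

Collecting `f.result()` in submission order keeps the output aligned with the input rows, so the label columns can be appended to the dataframe directly.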
The output CSV will have the same columns as the input, plus 13 additional label columns:
```text
PatientID, AccessionNumber, ..., Findings, Impression, ...,
Atelectasis, Cardiomegaly, Consolidation, Edema,
Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity,
No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax
```
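A quick sanity check that all 13 label columns made it into the output can be sketched like this (a minimal helper, assuming the column names listed above; in practice, pass the dataframe loaded from the labeled CSV):

```python
import pandas as pd

LABEL_COLS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia", "Pneumothorax",
]

def missing_label_columns(df: pd.DataFrame) -> list:
    """Return any expected label columns absent from the labeled dataframe."""
    return [c for c in LABEL_COLS if c not in df.columns]
```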
After completion, the script will print:
- Total samples processed
- Success/failure rate
- Label distribution for each condition
You can manually inspect some samples:
```python
import pandas as pd

df = pd.read_csv('/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug_labeled.csv')

# Check samples with "Pneumonia"
pneumonia_cases = df[df['Pneumonia'] == 1]
print(pneumonia_cases[['Findings', 'Impression', 'Pneumonia']].head())

# Check label distribution
for col in ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
            'Enlarged Cardiomediastinum', 'Fracture', 'Lung Lesion',
            'Lung Opacity', 'No Finding', 'Pleural Effusion',
            'Pleural Other', 'Pneumonia', 'Pneumothorax']:
    print(f"{col}: +1={sum(df[col]==1)}, 0={sum(df[col]==0)}, -1={sum(df[col]==-1)}")
```

If extraction fails:
- Check that the local LLM server is running at `http://localhost:4000/v1`
- Verify the API key is correct
- Check the log file for error messages
To tune throughput or output variability:
- Reduce `max_workers` to avoid overwhelming the LLM server
- Adjust the `temperature` parameter (currently 0.1) for more/less variation

If label quality is poor:
- Review the `labeling_prompt` in `pseudo_labeling_rex.py`
- Adjust the labeling rules or examples
- Consider manual validation on a sample
After pseudo labeling:
- Validation: Manually check a random sample of 50-100 labels for accuracy
- Train with labels: Use the labeled ReX data in `train_with_accelerate.py` with label-aware InfoNCE loss
- Compare: Evaluate whether pseudo labels improve model performance vs. the text-similarity-only approach
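For the validation step, a reproducible random sample can be drawn with pandas (the column subset here is an illustrative choice; `seed` pins the sample so reviewers see the same rows):

```python
import pandas as pd

def draw_validation_sample(df: pd.DataFrame, n: int = 100, seed: int = 0) -> pd.DataFrame:
    """Draw a reproducible random sample of reports plus labels for manual review."""
    cols = ['Findings', 'Impression', 'Pneumonia', 'Pleural Effusion']
    return df.sample(n=min(n, len(df)), random_state=seed)[cols]
```

In practice, load the labeled CSV with `pd.read_csv(...)`, pass it to this helper, and export the sample to a spreadsheet for annotators.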
- Original ReX dataset: https://physionet.org/content/rexgradient-cxr/
- MIMIC-CXR labels: https://physionet.org/content/mimic-cxr/