This directory contains scripts to generate pseudo labels for the ReX-CXR dataset using LLM-based label extraction.
The ReX dataset contains only free-text radiology reports (Findings and Impression) without structured disease labels. To enable label-aware contrastive learning, we use an LLM to extract 13 disease labels from the clinical text.
- `pseudo_labeling_rex.py` - Main script for pseudo labeling the entire ReX dataset
- `test_pseudo_labeling.py` - Test script to validate on a small sample before running the full dataset
- `README_PSEUDO_LABELING.md` - This file
The following 13 conditions are extracted (matching MIMIC-CXR labels):
- Atelectasis - Partial collapse of lung tissue
- Cardiomegaly - Enlarged heart
- Consolidation - Dense opacification of lung tissue
- Edema - Fluid accumulation in lungs
- Enlarged Cardiomediastinum - Widening of heart/mediastinal silhouette
- Fracture - Bone fracture (usually ribs)
- Lung Lesion - Nodule, mass, or focal lung abnormality
- Lung Opacity - Any area of increased opacity
- No Finding - Explicitly normal/unremarkable study
- Pleural Effusion - Fluid in pleural space
- Pleural Other - Other pleural abnormalities
- Pneumonia - Infectious consolidation/infiltrate
- Pneumothorax - Air in pleural space
- `1` - Condition is present (explicitly mentioned)
- `0` - Condition is absent (explicitly stated as normal/negative)
- `-1` - Condition is uncertain or not mentioned
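The request/response handling can be sketched as below. The actual prompt lives in `labeling_prompt` inside `pseudo_labeling_rex.py`, so the message text and the JSON reply schema here are illustrative assumptions, not the script's exact wording:

```python
import json

LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
          "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
          "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia", "Pneumothorax"]

def build_messages(findings: str, impression: str) -> list:
    """Assemble a chat-completion request asking for a JSON map of label -> {1, 0, -1}."""
    system = ("You are a radiology report labeler. For each of the following "
              "conditions, output 1 (present), 0 (absent), or -1 (uncertain or "
              "not mentioned) as a single JSON object: " + ", ".join(LABELS))
    user = f"Findings: {findings}\nImpression: {impression}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def parse_labels(raw_reply: str) -> dict:
    """Parse the LLM reply; any missing or malformed label falls back to -1."""
    try:
        parsed = json.loads(raw_reply)
    except json.JSONDecodeError:
        parsed = {}
    out = {}
    for lab in LABELS:
        val = parsed.get(lab, -1)
        out[lab] = int(val) if val in (1, 0, -1) else -1
    return out
</imports>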
Before processing the full dataset, test on a small sample to validate the extraction:
```bash
python /data/code/CXR_embedding_research/test_pseudo_labeling.py
```

This will:
- Load 5 sample reports from ReX dataset
- Extract labels using the LLM
- Display the results for manual inspection
If the test results look reasonable, run the full pseudo labeling:
```bash
python /data/code/CXR_embedding_research/pseudo_labeling_rex.py
```

This will:
- Load the entire ReX dataset from `/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug.csv`
- Extract labels for all samples using 64 parallel workers
- Save results to `/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug_labeled.csv`
- Log progress to `/data/code/CXR_embedding_research/pseudo_labeling_logs.txt`
Monitor the log file in real time:

```bash
tail -f /data/code/CXR_embedding_research/pseudo_labeling_logs.txt
```

- Parallel workers: 64 threads (adjustable via the `max_workers` parameter)
- Retry logic: 3 attempts per sample if extraction fails
- Error handling: failed extractions result in all labels set to -1 (uncertain)
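The parallel-extraction-with-retries pattern can be sketched as follows. The worker function `extract_fn` is a stand-in for the actual per-report extraction call in `pseudo_labeling_rex.py`:

```python
from concurrent.futures import ThreadPoolExecutor

LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
          "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
          "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia", "Pneumothorax"]

def label_with_retries(report: str, extract_fn, max_attempts: int = 3) -> dict:
    """Try extraction up to max_attempts times; on total failure, mark all labels -1."""
    for _ in range(max_attempts):
        try:
            return extract_fn(report)
        except Exception:
            continue
    return {lab: -1 for lab in LABELS}

def label_all(reports: list, extract_fn, max_workers: int = 64) -> list:
    """Label every report in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(label_with_retries, r, extract_fn) for r in reports]
        return [f.result() for f in futures]
```

Collecting `f.result()` in submission order keeps the output aligned with the input rows, so the label columns can be appended to the dataframe directly.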
The output CSV will have the same columns as the input, plus 13 additional label columns:
```text
PatientID, AccessionNumber, ..., Findings, Impression, ...,
Atelectasis, Cardiomegaly, Consolidation, Edema,
Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity,
No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax
```
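A quick sanity check that all 13 label columns made it into the output can be sketched like this (a minimal helper, assuming the column names listed above; in practice, pass the dataframe loaded from the labeled CSV):

```python
import pandas as pd

LABEL_COLS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia", "Pneumothorax",
]

def missing_label_columns(df: pd.DataFrame) -> list:
    """Return any expected label columns absent from the labeled dataframe."""
    return [c for c in LABEL_COLS if c not in df.columns]
```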
After completion, the script will print:
- Total samples processed
- Success/failure rate
- Label distribution for each condition
You can manually inspect some samples:
```python
import pandas as pd

df = pd.read_csv('/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug_labeled.csv')

# Check samples with "Pneumonia"
pneumonia_cases = df[df['Pneumonia'] == 1]
print(pneumonia_cases[['Findings', 'Impression', 'Pneumonia']].head())

# Check label distribution
for col in ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
            'Enlarged Cardiomediastinum', 'Fracture', 'Lung Lesion',
            'Lung Opacity', 'No Finding', 'Pleural Effusion',
            'Pleural Other', 'Pneumonia', 'Pneumothorax']:
    print(f"{col}: +1={sum(df[col]==1)}, 0={sum(df[col]==0)}, -1={sum(df[col]==-1)}")
```

If extraction fails:
- Check that the local LLM server is running at `http://localhost:4000/v1`
- Verify the API key is correct
- Check the log file for error messages
To tune throughput or output variability:
- Reduce `max_workers` to avoid overwhelming the LLM server
- Adjust the `temperature` parameter (currently 0.1) for more/less variation

If label quality is poor:
- Review the `labeling_prompt` in `pseudo_labeling_rex.py`
- Adjust the labeling rules or examples
- Consider manual validation on a sample
After pseudo labeling:
- Validation: Manually check a random sample of 50-100 labels for accuracy
- Train with labels: Use the labeled ReX data in `train_with_accelerate.py` with label-aware InfoNCE loss
- Compare: Evaluate whether pseudo labels improve model performance vs. the text-similarity-only approach
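For the validation step, a reproducible random sample can be drawn with pandas (the column subset here is an illustrative choice; `seed` pins the sample so reviewers see the same rows):

```python
import pandas as pd

def draw_validation_sample(df: pd.DataFrame, n: int = 100, seed: int = 0) -> pd.DataFrame:
    """Draw a reproducible random sample of reports plus labels for manual review."""
    cols = ['Findings', 'Impression', 'Pneumonia', 'Pleural Effusion']
    return df.sample(n=min(n, len(df)), random_state=seed)[cols]
```

In practice, load the labeled CSV with `pd.read_csv(...)`, pass it to this helper, and export the sample to a spreadsheet for annotators.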
- Original ReX dataset: https://physionet.org/content/rexgradient-cxr/
- MIMIC-CXR labels: https://physionet.org/content/mimic-cxr/