Pseudo Labeling for ReX Dataset

This directory contains scripts to generate pseudo labels for the ReX-CXR dataset using LLM-based label extraction.

Background

The ReX dataset contains only free-text radiology reports (Findings and Impression) without structured disease labels. To enable label-aware contrastive learning, we use an LLM to extract 13 disease labels from the clinical text.

Files

  • pseudo_labeling_rex.py - Main script for pseudo labeling the entire ReX dataset
  • test_pseudo_labeling.py - Test script that validates extraction on a small sample before running on the full dataset
  • README_PSEUDO_LABELING.md - This file

Disease Labels

The following 13 conditions are extracted (matching MIMIC-CXR labels):

  1. Atelectasis - Partial collapse of lung tissue
  2. Cardiomegaly - Enlarged heart
  3. Consolidation - Dense opacification of lung tissue
  4. Edema - Fluid accumulation in lungs
  5. Enlarged Cardiomediastinum - Widening of heart/mediastinal silhouette
  6. Fracture - Bone fracture (usually ribs)
  7. Lung Lesion - Nodule, mass, or focal lung abnormality
  8. Lung Opacity - Any area of increased opacity
  9. No Finding - Explicitly normal/unremarkable study
  10. Pleural Effusion - Fluid in pleural space
  11. Pleural Other - Other pleural abnormalities
  12. Pneumonia - Infectious consolidation/infiltrate
  13. Pneumothorax - Air in pleural space

Label Values

  • 1 - Condition is present (explicitly mentioned)
  • 0 - Condition is absent (explicitly stated as normal/negative)
  • -1 - Condition is uncertain or not mentioned
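Assuming the LLM is asked to return a JSON object keyed by condition name (an assumption for illustration; the actual response contract is defined by the labeling_prompt in pseudo_labeling_rex.py), mapping a response onto these three values might look like:

```python
import json

# The 13 conditions listed above; ordering here is illustrative.
LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax",
]

def parse_llm_labels(raw_response: str) -> dict:
    """Map an LLM JSON response onto the 13 labels.

    Missing, unparseable, or out-of-range entries default to -1
    (uncertain), mirroring the script's error-handling convention.
    """
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    labels = {}
    for name in LABELS:
        value = parsed.get(name, -1)
        # Coerce anything outside {1, 0, -1} back to uncertain.
        labels[name] = value if value in (1, 0, -1) else -1
    return labels
```

For example, parse_llm_labels('{"Pneumonia": 1, "Edema": 0}') returns Pneumonia=1, Edema=0, and -1 for the remaining eleven conditions.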

Usage

Step 1: Test on Sample Data

Before processing the full dataset, test on a small sample to validate the extraction:

python /data/code/CXR_embedding_research/test_pseudo_labeling.py

This will:

  • Load 5 sample reports from ReX dataset
  • Extract labels using the LLM
  • Display the results for manual inspection

Step 2: Run Full Pseudo Labeling

If the test results look reasonable, run the full pseudo labeling:

python /data/code/CXR_embedding_research/pseudo_labeling_rex.py

This will:

  • Load the entire ReX dataset from /data/ReXGradient-160K/metadata/train_with_view_embeddings_aug.csv
  • Extract labels for all samples using 64 parallel workers
  • Save results to /data/ReXGradient-160K/metadata/train_with_view_embeddings_aug_labeled.csv
  • Log progress to /data/code/CXR_embedding_research/pseudo_labeling_logs.txt

Step 3: Monitor Progress

Monitor the log file in real-time:

tail -f /data/code/CXR_embedding_research/pseudo_labeling_logs.txt

Performance

  • Parallel workers: 64 threads (adjustable via the max_workers parameter)
  • Retry logic: 3 attempts per sample if extraction fails
  • Error handling: Failed extractions result in all labels set to -1 (uncertain)
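The retry and fallback behavior above can be sketched as follows; extract_fn stands in for the script's actual per-sample extraction call, whose name and signature may differ:

```python
from concurrent.futures import ThreadPoolExecutor

# The 13 condition labels (illustrative constant).
LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax",
]

def label_with_retries(extract_fn, report_text, attempts=3):
    """Try extraction up to `attempts` times; fall back to all -1 on failure."""
    for _ in range(attempts):
        try:
            return extract_fn(report_text)
        except Exception:
            continue
    # Every attempt failed: mark all labels uncertain.
    return {name: -1 for name in LABELS}

def label_all(extract_fn, reports, max_workers=64):
    """Label reports in parallel; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: label_with_retries(extract_fn, r), reports))
```

Using threads rather than processes fits this workload: each worker spends its time waiting on the LLM server, so the GIL is not a bottleneck.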

Expected Output

The output CSV will have the same columns as the input, plus 13 additional label columns:

PatientID, AccessionNumber, ..., Findings, Impression, ...,
Atelectasis, Cardiomegaly, Consolidation, Edema,
Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity,
No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax
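A quick sanity check that every label column actually made it into the output CSV (missing_label_columns is a hypothetical helper, not part of the scripts):

```python
import csv

LABEL_COLUMNS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax",
]

def missing_label_columns(csv_path: str) -> set:
    """Return the expected label columns absent from the CSV header."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return set(LABEL_COLUMNS) - set(header)
```

An empty return value means all 13 columns are present.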

Quality Assurance

After completion, the script will print:

  • Total samples processed
  • Success/failure rate
  • Label distribution for each condition

You can manually inspect some samples:

import pandas as pd

df = pd.read_csv('/data/ReXGradient-160K/metadata/train_with_view_embeddings_aug_labeled.csv')

# Check samples with "Pneumonia"
pneumonia_cases = df[df['Pneumonia'] == 1]
print(pneumonia_cases[['Findings', 'Impression', 'Pneumonia']].head())

# Check label distribution
for col in ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
            'Enlarged Cardiomediastinum', 'Fracture', 'Lung Lesion',
            'Lung Opacity', 'No Finding', 'Pleural Effusion',
            'Pleural Other', 'Pneumonia', 'Pneumothorax']:
    print(f"{col}: 1={(df[col] == 1).sum()}, 0={(df[col] == 0).sum()}, -1={(df[col] == -1).sum()}")

Troubleshooting

Issue: LLM not responding

  • Check if the local LLM server is running at http://localhost:4000/v1
  • Verify API key is correct
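One way to check reachability from Python (an illustrative helper; it assumes the server exposes the standard OpenAI-compatible /models endpoint):

```python
import urllib.request
import urllib.error

def llm_server_reachable(base_url="http://localhost:4000/v1", timeout=3.0):
    """Return True if anything answers at the server's /models endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (possibly demanding auth), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        return False
```

A False result points to the server being down or the URL/port being wrong; an HTTP error status (e.g. 401) instead suggests the API key is the problem.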

Issue: Low success rate

  • Check log file for error messages
  • Reduce max_workers to avoid overwhelming the LLM server
  • Adjust the temperature parameter (currently 0.1) to control output variability

Issue: Inconsistent labels

  • Review the labeling_prompt in pseudo_labeling_rex.py
  • Adjust labeling rules or examples
  • Consider manual validation on a sample

Next Steps

After pseudo labeling:

  1. Validation: Manually check a random sample of 50-100 labels for accuracy
  2. Train with labels: Use the labeled ReX data in train_with_accelerate.py with label-aware InfoNCE loss
  3. Compare: Evaluate if pseudo labels improve model performance vs. text-similarity-only approach
