162 changes: 150 additions & 12 deletions docs/tcga_pan_can_atlas/ohif-viewer.md
@@ -1,21 +1,159 @@
# OHIF Viewer for TCGA CT Scans
OHIF URLs for CT Scans were obtained from the Imaging Data Commons and added as
resource data to all tcga_pan_can_2018 studies,
# TCGA Imaging Data Integration with cBioPortal

This document describes the process of acquiring TCGA imaging studies from the NCI Imaging Data Commons (IDC) and integrating them as patient resources in cBioPortal's TCGA Pan Cancer Atlas 2018 studies. The images can be viewed using OHIF (radiology) and SLIM (pathology) viewers within the portal.

## Data Structure in IDC

IDC organizes imaging data hierarchically:

```
Collections → Patients → Studies → Series → Instances
```

- **Study**: One complete imaging exam for a patient, regardless of modality (e.g., "CT Chest with Contrast", "MRI Brain"). One patient could have undergone multiple imaging studies over time.
- **Series**: Individual image batches within a study (e.g., different CT scan sections or sequences)

**Note**: While a single study may contain multiple imaging modalities, data was extracted at the study level and then split by modality for cBioPortal integration. This modality-level organization allows users to selectively access specific imaging types (e.g., CT scans vs. H&E slides) for each patient.
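
As a rough illustration of the study-to-modality split described in the note, the sketch below reduces series-level rows to one entry per patient, study, and modality. The patient ID and UIDs are made up, and the row layout is an assumption that only mirrors the column names used later in this document.

```python
# Toy series-level rows; all identifiers are invented for illustration only.
series_rows = [
    {"PatientID": "TCGA-XX-0001", "StudyInstanceUID": "1.2.3",
     "SeriesInstanceUID": "1.2.3.1", "Modality": "CT"},
    {"PatientID": "TCGA-XX-0001", "StudyInstanceUID": "1.2.3",
     "SeriesInstanceUID": "1.2.3.2", "Modality": "CT"},
    {"PatientID": "TCGA-XX-0001", "StudyInstanceUID": "1.2.3",
     "SeriesInstanceUID": "1.2.3.3", "Modality": "PT"},
    {"PatientID": "TCGA-XX-0001", "StudyInstanceUID": "4.5.6",
     "SeriesInstanceUID": "4.5.6.1", "Modality": "SM"},
]

# One entry per (patient, study, modality): a study holding both CT and PT
# series yields two entries, so each modality can become its own resource.
per_modality = sorted(
    {(r["PatientID"], r["StudyInstanceUID"], r["Modality"]) for r in series_rows}
)

for entry in per_modality:
    print(entry)
```

Here the study `1.2.3` contributes two entries (CT and PT) even though it has three series, which is the granularity at which resources are attached to a patient.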

## Available Imaging Modalities

| Code | Modality | Viewer | Resource ID | Resource Tab Name |
|------|----------|--------|-------------|-------------------|
| CR | Computed Radiography | OHIF | IDC_OHIF_CR | Computed Radiography |
| CT | Computed Tomography | OHIF | IDC_OHIF_CT | CT Scan |
| DX | Digital Radiography | OHIF | IDC_OHIF_DX | Digital Radiography |
| MG | Mammography | OHIF | IDC_OHIF_MG | Mammography |
| MR | Magnetic Resonance | OHIF | IDC_OHIF_MR | Magnetic Resonance |
| NM | Nuclear Medicine | OHIF | IDC_OHIF_NM | Nuclear Medicine |
| PT | Positron Emission Tomography | OHIF | IDC_OHIF_PT | PET Scan |
| SM | Slide Microscopy (H&E) | SLIM | IDC_SLIM | H&E Slide |

**Note**: Annotation (ANN), Segmentation (SEG) and Structured Report (SR) are excluded as standalone resources since they are automatically loaded with their parent imaging studies in OHIF. Other (OT) modality is also excluded.
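
Before generating resources it can help to confirm that every modality code in the data is either mapped or deliberately excluded. The following is a minimal sketch of such a check; the helper name `classify_modalities` is hypothetical, and `US` is used only as an example of a code that does not appear in the table above.

```python
# Modality codes from the table above, mapped to their cBioPortal resource IDs.
MODALITY_TO_RESOURCE_ID = {
    "CR": "IDC_OHIF_CR",
    "CT": "IDC_OHIF_CT",
    "DX": "IDC_OHIF_DX",
    "MG": "IDC_OHIF_MG",
    "MR": "IDC_OHIF_MR",
    "NM": "IDC_OHIF_NM",
    "PT": "IDC_OHIF_PT",
    "SM": "IDC_SLIM",
}

# Modalities deliberately left out as standalone resources.
EXCLUDED_MODALITIES = {"ANN", "SEG", "SR", "OT"}


def classify_modalities(codes):
    """Split modality codes into (mapped, excluded, unknown) sets."""
    codes = set(codes)
    mapped = codes & set(MODALITY_TO_RESOURCE_ID)
    excluded = codes & EXCLUDED_MODALITIES
    unknown = codes - mapped - excluded
    return mapped, excluded, unknown


mapped, excluded, unknown = classify_modalities(["CT", "SM", "SEG", "US"])
print(sorted(mapped), sorted(excluded), sorted(unknown))
```

A non-empty `unknown` set would indicate a modality that needs either a new row in the table or an explicit exclusion.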

## Implementation Steps

## Steps
### 1. Download Data from IDC

Go to IDC and select TCGA -> CT Scans. Download a manifest with all the links to ~/Downloads/idc_ohif.tsv.
The [idc-index](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part2_searching_basics.ipynb) package was used to programmatically download the UIDs for all TCGA collections from IDC.
```python
from idc_index import IDCClient

client = IDCClient()

client.sql_query("SELECT * FROM index WHERE collection_id LIKE '%tcga%'").to_csv("idc_tcga.txt", sep='\t', index=False)
```

### 2. Generate data

Generate patient-level resource files for each TCGA Pan Cancer study by linking patients to their imaging studies via OHIF and SLIM viewer URLs.

```python
import pandas as pd
import numpy as np
import os

df_tcga = pd.read_csv("idc_tcga.txt", sep='\t', dtype=str)

# Drop modalities that are not standalone resources (ANN, SEG, SR) and the Other (OT) type
df_tcga = df_tcga[~df_tcga['Modality'].isin(['ANN', 'SEG', 'SR', 'OT'])]

# Keep one row per unique (PatientID, StudyInstanceUID, Modality) combination
cols = ['PatientID', 'StudyInstanceUID', 'Modality']
resource_df = df_tcga[cols].drop_duplicates().reset_index(drop=True)

# Map modality codes to cBioPortal resource IDs
modality_map = {
    'CR': 'IDC_OHIF_CR',
    'CT': 'IDC_OHIF_CT',
    'DX': 'IDC_OHIF_DX',
    'MG': 'IDC_OHIF_MG',
    'MR': 'IDC_OHIF_MR',
    'NM': 'IDC_OHIF_NM',
    'PT': 'IDC_OHIF_PT',
    'SM': 'IDC_SLIM'
}
resource_df['Modality'] = resource_df['Modality'].replace(modality_map)

# Turn each StudyInstanceUID into a viewer URL (SLIM for slides, OHIF otherwise)
resource_df['StudyInstanceUID'] = np.where(
    resource_df['Modality'] == 'IDC_SLIM',
    'https://viewer.imaging.datacommons.cancer.gov/slim/studies/' + resource_df['StudyInstanceUID'],
    'https://viewer.imaging.datacommons.cancer.gov/viewer/' + resource_df['StudyInstanceUID']
)

### 2. Curate coadread_tcga semi-manually
# Rename columns
resource_df = resource_df[['PatientID', 'Modality', 'StudyInstanceUID']]
resource_df.columns = ['PATIENT_ID', 'RESOURCE_ID', 'URL']

Add all *resource* files for coad and read data and link patients to samples. This was done semi-manually.
# Write filtered resource files per study
dh_files_path = "/Users/madupurr/Github/datahub/public"

### 3. Do the rest using a command one liner
# Map each RESOURCE_ID to its display name in the resource definition file
definition_map = {
    'IDC_OHIF_CR': 'Computed Radiography',
    'IDC_OHIF_CT': 'CT Scan',
    'IDC_OHIF_DX': 'Digital Radiography',
    'IDC_OHIF_MG': 'Mammography',
    'IDC_OHIF_MR': 'Magnetic Resonance',
    'IDC_OHIF_NM': 'Nuclear Medicine',
    'IDC_OHIF_PT': 'PET Scan',
    'IDC_SLIM': 'H&E Slide'
}

for st in os.listdir(dh_files_path):
    if 'tcga_pan_can_atlas_2018' in st:
        patient_file = os.path.join(dh_files_path, st, 'data_clinical_patient.txt')
        resource_file = os.path.join(dh_files_path, st, 'data_resource_patient.txt')
        resource_def_file = os.path.join(dh_files_path, st, 'data_resource_definition.txt')

        # The first four lines of the clinical file are metadata headers
        clinical_df = pd.read_csv(patient_file, sep='\t', skiprows=4, dtype=str)
        patient_ids = clinical_df['PATIENT_ID'].unique()

        # Filter resource_df for only matching patients
        filtered_resource_df = resource_df[resource_df['PATIENT_ID'].isin(patient_ids)]

        # Write to tab-separated file
        filtered_resource_df = filtered_resource_df.sort_values(by='PATIENT_ID').reset_index(drop=True)
        filtered_resource_df.to_csv(resource_file, sep='\t', index=False)
        print(f"Written {len(filtered_resource_df)} resources to {resource_file}")

        # Get unique RESOURCE_IDs for this study that are in definition_map
        resource_ids_in_study = filtered_resource_df['RESOURCE_ID'].unique()

        # Build resource definition DataFrame only if there are RESOURCE_IDs
        if len(resource_ids_in_study) > 0:
            resource_def_df = pd.DataFrame({
                'RESOURCE_ID': resource_ids_in_study,
                'DISPLAY_NAME': [definition_map[rid] for rid in resource_ids_in_study],
                'RESOURCE_TYPE': 'PATIENT',
                'DESCRIPTION': [definition_map[rid] for rid in resource_ids_in_study],
                'OPEN_BY_DEFAULT': 'TRUE',
                'PRIORITY': 1
            })

            # Write to data_resource_definition.txt file
            resource_def_df.to_csv(resource_def_file, sep='\t', index=False)
            print(f"Written definitions to {resource_def_file}")


Generate them for the rest:
```bash
for f in $(cut -f2 ~/Downloads/idc_ohif.tsv | gsort | uniq | grep tcga_ | grep -v Filters | grep -v coad | grep -v read); do (head -1 coadread_tcga_pan_can_atlas_2018/data_resource_patient.txt; cut -f1,2,4 ~/Downloads/idc_ohif.tsv | tail -n +9 | grep $f | cut -f1,3 | awk -vFS='\t' -vOFS='\t' '{$1=substr($1,0,12); $3="https://viewer.imaging.datacommons.cancer.gov/viewer/"$2; $2="IDC_OHIF_V2"; print $0}' | gsort -k1,1 | uniq | rev | uniq -f2 | rev; ) > ${f/tcga_/}_tcga_pan_can_atlas_2018/*data_resource*patient*; done
```

Note: a few patients have multiple CT scans. It is not clear how these scans differ; the above command just selects the first one.
#### Output Files

For each TCGA Pan-Cancer Atlas 2018 study, the script generates two files:

1. **`data_resource_patient.txt`**: Links patients to their imaging studies
   - Columns: `PATIENT_ID`, `RESOURCE_ID`, `URL`

2. **`data_resource_definition.txt`**: Defines the resource types available in the study
   - Columns: `RESOURCE_ID`, `DISPLAY_NAME`, `RESOURCE_TYPE`, `DESCRIPTION`, `OPEN_BY_DEFAULT`, `PRIORITY`

#### Viewer URLs

- **OHIF Viewer** (for radiology imaging): `https://viewer.imaging.datacommons.cancer.gov/viewer/{StudyInstanceUID}`
- **SLIM Viewer** (for slide microscopy): `https://viewer.imaging.datacommons.cancer.gov/slim/studies/{StudyInstanceUID}`
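
The two URL patterns above can be wrapped in a small helper. This is a sketch rather than part of the original script: the function name `viewer_url` and the sample UID are hypothetical, and it assumes that `SM` is the only modality routed to SLIM, as in the modality table.

```python
OHIF_BASE = "https://viewer.imaging.datacommons.cancer.gov/viewer/"
SLIM_BASE = "https://viewer.imaging.datacommons.cancer.gov/slim/studies/"


def viewer_url(study_instance_uid: str, modality: str) -> str:
    """Return the IDC viewer URL for a study, choosing SLIM for slide microscopy."""
    base = SLIM_BASE if modality == "SM" else OHIF_BASE
    return base + study_instance_uid


# "1.2.3" is a placeholder, not a real StudyInstanceUID.
print(viewer_url("1.2.3", "CT"))
print(viewer_url("1.2.3", "SM"))
```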

#### Notes

- Only patients already present in the clinical data files are included in the resource files
- All resource files are sorted by `PATIENT_ID` for consistency
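
The two properties in these notes can be checked after generation. The sketch below uses in-memory toy files with invented patient IDs and UIDs; the helper names `read_tsv` and `check_resources` are hypothetical, and the four skipped lines reflect the `skiprows=4` assumption made when the script reads the clinical file.

```python
import csv

# Toy clinical file: four metadata lines, then the header and data rows.
clinical_text = (
    "#Patient Identifier\tSex\n"
    "#Patient identifier\tSex\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "PATIENT_ID\tSEX\n"
    "TCGA-XX-0001\tFemale\n"
    "TCGA-XX-0002\tMale\n"
)

# Toy resource file in the PATIENT_ID / RESOURCE_ID / URL layout.
resource_text = (
    "PATIENT_ID\tRESOURCE_ID\tURL\n"
    "TCGA-XX-0001\tIDC_OHIF_CT\thttps://viewer.imaging.datacommons.cancer.gov/viewer/1.2.3\n"
    "TCGA-XX-0002\tIDC_SLIM\thttps://viewer.imaging.datacommons.cancer.gov/slim/studies/4.5.6\n"
)


def read_tsv(text, skip_lines=0):
    """Parse tab-separated text into a list of dicts, skipping leading lines."""
    lines = text.splitlines()[skip_lines:]
    return list(csv.DictReader(lines, delimiter="\t"))


def check_resources(resource_rows, clinical_rows):
    """Report whether resource rows use known patients and are sorted by PATIENT_ID."""
    clinical_ids = {row["PATIENT_ID"] for row in clinical_rows}
    ids = [row["PATIENT_ID"] for row in resource_rows]
    return {
        "all_patients_known": set(ids) <= clinical_ids,
        "sorted_by_patient": ids == sorted(ids),
    }


report = check_resources(read_tsv(resource_text), read_tsv(clinical_text, skip_lines=4))
print(report)
```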
Git LFS file not shown
4 changes: 2 additions & 2 deletions public/acc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/blca_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/brca_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/cesc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/chol_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/dlbc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/esca_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/gbm_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/hnsc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/kich_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/kirc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/kirp_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/laml_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/lgg_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/lihc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/luad_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/lusc_tcga_pan_can_atlas_2018/data_resource_patient.txt
4 changes: 2 additions & 2 deletions public/meso_tcga_pan_can_atlas_2018/data_resource_patient.txt