@@ -5,57 +5,46 @@ We gratefully acknowledge [Google Public Data Program](https://console.cloud.goo
 {% endhint %}
 
 ``` mermaid
-graph TB
-DCM["<b>DICOM FILES (.dcm)</b><br/>Named by crdc_instance_uuid, grouped by crdc_series_uuid"]
-
-DCM -->|"stored in"| BUCKETS
-
-subgraph BUCKETS["CLOUD STORAGE BUCKETS (AWS S3 + GCS mirrors)"]
-direction LR
-B1["idc-open-data<br/>~90%, CC BY"]
-B2["idc-open-data-two / idc1<br/>head scans"]
-B3["idc-open-data-cr / cr<br/>~4%, CC BY-NC"]
-end
-
-B1 & B2 & B3 -->|"all 3 buckets imported"| PROXY
-B1 -->|"replicated into"| GHC
-
-subgraph STORES["DICOMweb / DICOM STORES"]
-direction LR
-PROXY["IDC Public Proxy<br/>No auth, 100% coverage"]
-GHC["Google Healthcare API<br/>Auth required, ~96% coverage"]
-end
-
-GHC -->|"DICOM metadata exported to"| BQ
-
-subgraph BQ["BigQuery (GCP auth + billing)"]
-BQ_DESC["All 4000+ DICOM tags <br/>Tables: dicom_all, dicom_metadata, clinical"]
-end
-
-BQ -->|"~50 key columns queried via SQL"| IDX
-BQ -->|"tables exported to"| S3BQ
-S3BQ["Parquet files in AWS S3"]
-
-subgraph IDX["idc-index PARQUET FILES (no auth)"]
-IDX_DESC["~50 key columns per series, bundled in Python package <br/>Auto-loaded: index, prior_versions_index<br/>On-demand: collections, seg, sm, ann, clinical, contrast"]
-end
-
-IDX -.->|"SeriesInstanceUID for DICOMweb queries"| STORES
-IDX -.->|"series_aws_url / crdc_series_uuid maps to bucket paths"| BUCKETS
-
-style DCM fill:#fff3e0,stroke:#FF9800,stroke-width:2px,color:#000
-style BUCKETS fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
-style B1 fill:#e8f4fd,stroke:#2196F3,color:#000
-style B2 fill:#e8f4fd,stroke:#2196F3,color:#000
-style B3 fill:#e8f4fd,stroke:#2196F3,color:#000
-style STORES fill:#f3e5f5,stroke:#9C27B0,stroke-width:2px,color:#000
-style PROXY fill:#f3e5f5,stroke:#9C27B0,color:#000
-style GHC fill:#f3e5f5,stroke:#9C27B0,color:#000
-style BQ fill:#fce4ec,stroke:#E91E63,stroke-width:2px,color:#000
-style BQ_DESC fill:#fce4ec,stroke:none,color:#000
-style IDX fill:#e8f5e9,stroke:#4CAF50,stroke-width:2px,color:#000
-style S3BQ fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
-style IDX_DESC fill:#e8f5e9,stroke:none,color:#000
+flowchart TB
+subgraph BUCKETS["CLOUD STORAGE BUCKETS (AWS S3 + GCS mirrors)"]
+direction LR
+B1["gs://idc-open-data<br>s3://idc-open-data<br>~90%, CC BY"]
+B2["gs://idc-open-idc1<br>s3://idc-open-data-two<br>potential head scans, CC BY"]
+B3["gs://idc-open-cr<br>s3://idc-open-data-cr<br>~4%, CC BY-NC"]
+end
+subgraph STORES["DICOMweb / DICOM STORES"]
+direction LR
+PROXY["IDC DICOM store<br>IDC Public Proxy in front of Google Healthcare DICOM store<br>No auth, 100% coverage"]
+GHC["Google Healthcare DICOM store<br>Google Healthcare API<br>Auth required, >95% coverage"]
+end
+subgraph BQ["BigQuery (GCP auth + billing)<br> All 4000+ DICOM tags <br>Tables: dicom_all, dicom_metadata, derived metadata, clinical"]
+end
+subgraph IDX["idc-index PARQUET FILES (no auth)<br>~50 key columns per series, bundled in Python package <br>Auto-loaded: index, prior_versions_index<br>On-demand: collections, seg, sm, ann, clinical, contrast"]
+end
+DCM["<b>DICOM FILES (.dcm)</b><br>Named by crdc_instance_uuid, grouped by crdc_series_uuid"] -- stored in --> BUCKETS
+B1 -- DICOM instances<BR>imported into --> PROXY
+B2 -- DICOM instances<BR>imported into --> PROXY
+B3 -- DICOM instances<BR>imported into --> PROXY
+B1 -- DICOM instances<BR>imported into --> GHC
+PROXY -- DICOM metadata exported to --> BQ
+BQ -- ~50 key columns queried via SQL --> IDX
+BQ -- tables exported to --> S3BQ["Parquet files in AWS S3"]
+IDX -. SeriesInstanceUID for DICOMweb queries .-> STORES
+IDX -. series_aws_url / crdc_series_uuid maps to bucket paths .-> BUCKETS
+
+style DCM fill:#fff3e0,stroke:#FF9800,stroke-width:2px,color:#000
+style BUCKETS fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
+style B1 fill:#e8f4fd,stroke:#2196F3,color:#000
+style B2 fill:#e8f4fd,stroke:#2196F3,color:#000
+style B3 fill:#e8f4fd,stroke:#2196F3,color:#000
+style PROXY fill:#f3e5f5,stroke:#9C27B0,color:#000
+style GHC fill:#f3e5f5,stroke:#9C27B0,color:#000
+style BQ fill:#fce4ec,stroke:#E91E63,stroke-width:2px,color:#000
+%% style BQ_DESC fill:#fce4ec,stroke:none,color:#000
+style IDX fill:#e8f5e9,stroke:#4CAF50,stroke-width:2px,color:#000
+style S3BQ fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
+%% style IDX_DESC fill:#e8f5e9,stroke:none,color:#000
+style STORES fill:#f3e5f5,stroke:#9C27B0,stroke-width:2px,color:#000
 ```
 
 Let's start with the overall principles of how we organize data in IDC.
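
The diagram states that DICOM files are named by `crdc_instance_uuid` and grouped by `crdc_series_uuid`, and that each bucket has an S3 name and a GCS name. A minimal Python sketch of how a bucket path might be composed from those identifiers is shown here. The exact layout `<bucket>/<crdc_series_uuid>/<crdc_instance_uuid>.dcm` is an assumption inferred from the diagram labels, and the group keys (`main`, `head`, `cr`), the helper names, and the UUID values are hypothetical and used only for illustration.

```python
# (GCS bucket, S3 bucket) pairs, copied from the bucket labels in the diagram.
# The dictionary keys are hypothetical shorthand, not IDC terminology.
BUCKET_PAIRS = {
    "main": ("idc-open-data", "idc-open-data"),
    "head": ("idc-open-idc1", "idc-open-data-two"),
    "cr": ("idc-open-cr", "idc-open-data-cr"),
}


def series_prefix(group: str, crdc_series_uuid: str, cloud: str = "aws") -> str:
    """Return the assumed bucket prefix that holds all files of one series."""
    gcs_bucket, s3_bucket = BUCKET_PAIRS[group]
    if cloud == "aws":
        return f"s3://{s3_bucket}/{crdc_series_uuid}/"
    if cloud == "gcs":
        return f"gs://{gcs_bucket}/{crdc_series_uuid}/"
    raise ValueError(f"unknown cloud: {cloud!r} (expected 'aws' or 'gcs')")


def instance_url(group: str, crdc_series_uuid: str,
                 crdc_instance_uuid: str, cloud: str = "aws") -> str:
    """Return the assumed URL of a single .dcm file inside a series prefix."""
    prefix = series_prefix(group, crdc_series_uuid, cloud)
    return f"{prefix}{crdc_instance_uuid}.dcm"


if __name__ == "__main__":
    # Placeholder identifiers; real values come from the index columns.
    print(instance_url("head", "SERIES-UUID", "INSTANCE-UUID", cloud="gcs"))
```

In practice the `series_aws_url` column shown in the diagram already carries the series-level path, so a helper like this would only be needed when working from the raw UUIDs.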