Commit 2a011e1

fedorov authored and gitbook-bot committed
GITBOOK-468: change request with no subject merged in GitBook
1 parent 94faa9d commit 2a011e1

1 file changed: +40 -51 lines changed

data/organization-of-data/files-and-metadata.md

Lines changed: 40 additions & 51 deletions
@@ -5,57 +5,46 @@ We gratefully acknowledge [Google Public Data Program](https://console.cloud.goo
{% endhint %}

```mermaid
-graph TB
-DCM["<b>DICOM FILES (.dcm)</b><br/>Named by crdc_instance_uuid, grouped by crdc_series_uuid"]
-
-DCM -->|"stored in"| BUCKETS
-
-subgraph BUCKETS["CLOUD STORAGE BUCKETS (AWS S3 + GCS mirrors)"]
-direction LR
-B1["idc-open-data<br/>~90%, CC BY"]
-B2["idc-open-data-two / idc1<br/>head scans"]
-B3["idc-open-data-cr / cr<br/>~4%, CC BY-NC"]
-end
-
-B1 & B2 & B3 -->|"all 3 buckets imported"| PROXY
-B1 -->|"replicated into"| GHC
-
-subgraph STORES["DICOMweb / DICOM STORES"]
-direction LR
-PROXY["IDC Public Proxy<br/>No auth, 100% coverage"]
-GHC["Google Healthcare API<br/>Auth required, ~96% coverage"]
-end
-
-GHC -->|"DICOM metadata exported to"| BQ
-
-subgraph BQ["BigQuery (GCP auth + billing)"]
-BQ_DESC["All 4000+ DICOM tags &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br/>Tables: dicom_all, dicom_metadata, clinical"]
-end
-
-BQ -->|"~50 key columns queried via SQL"| IDX
-BQ -->|"tables exported to"| S3BQ
-S3BQ["Parquet files in AWS S3"]
-
-subgraph IDX["idc-index PARQUET FILES (no auth)"]
-IDX_DESC["~50 key columns per series, bundled in Python package &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br/>Auto-loaded: index, prior_versions_index<br/>On-demand: collections, seg, sm, ann, clinical, contrast"]
-end
-
-IDX -.->|"SeriesInstanceUID for DICOMweb queries"| STORES
-IDX -.->|"series_aws_url / crdc_series_uuid maps to bucket paths"| BUCKETS
-
-style DCM fill:#fff3e0,stroke:#FF9800,stroke-width:2px,color:#000
-style BUCKETS fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
-style B1 fill:#e8f4fd,stroke:#2196F3,color:#000
-style B2 fill:#e8f4fd,stroke:#2196F3,color:#000
-style B3 fill:#e8f4fd,stroke:#2196F3,color:#000
-style STORES fill:#f3e5f5,stroke:#9C27B0,stroke-width:2px,color:#000
-style PROXY fill:#f3e5f5,stroke:#9C27B0,color:#000
-style GHC fill:#f3e5f5,stroke:#9C27B0,color:#000
-style BQ fill:#fce4ec,stroke:#E91E63,stroke-width:2px,color:#000
-style BQ_DESC fill:#fce4ec,stroke:none,color:#000
-style IDX fill:#e8f5e9,stroke:#4CAF50,stroke-width:2px,color:#000
-style S3BQ fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
-style IDX_DESC fill:#e8f5e9,stroke:none,color:#000
+flowchart TB
+ subgraph BUCKETS["CLOUD STORAGE BUCKETS (AWS S3 + GCS mirrors)"]
+    direction LR
+        B1["gs://idc-open-data<br>s3://idc-open-data<br>~90%, CC BY"]
+        B2["gs://idc-open-idc1<br>s3://idc-open-data-two<br>potential head scans, CC BY"]
+        B3["gs://idc-open-cr<br>s3://idc-open-data-cr<br>~4%, CC BY-NC"]
+  end
+ subgraph STORES["DICOMweb / DICOM STORES"]
+    direction LR
+        PROXY["IDC DICOM store<br>IDC Public Proxy in front of Google Healthcare DICOM store<br>No auth, 100% coverage"]
+        GHC["Google Healthcare DICOM store<br>Google Healthcare API<br>Auth required, >95% coverage"]
+  end
+ subgraph BQ["BigQuery (GCP auth + billing)<br>&nbsp; &nbsp;&nbsp; &nbsp;All 4000+ DICOM tags &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br>Tables: dicom_all, dicom_metadata, derived metadata, clinical"]
+  end
+ subgraph IDX["idc-index PARQUET FILES (no auth)<br>~50 key columns per series, bundled in Python package &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<br>Auto-loaded: index, prior_versions_index<br>On-demand: collections, seg, sm, ann, clinical, contrast"]
+  end
+    DCM["<b>DICOM FILES (.dcm)</b><br>Named by crdc_instance_uuid, grouped by crdc_series_uuid"] -- stored in --> BUCKETS
+    B1 -- DICOM instances<BR>imported into --> PROXY
+    B2 -- DICOM instances<BR>imported into --> PROXY
+    B3 -- DICOM instances<BR>imported into --> PROXY
+    B1 -- DICOM instances<BR>imported into --> GHC
+    PROXY -- DICOM metadata exported to --> BQ
+    BQ -- ~50 key columns queried via SQL --> IDX
+    BQ -- tables exported to --> S3BQ["Parquet files in AWS S3"]
+    IDX -. SeriesInstanceUID for DICOMweb queries .-> STORES
+    IDX -. series_aws_url / crdc_series_uuid maps to bucket paths .-> BUCKETS
+
+    style DCM fill:#fff3e0,stroke:#FF9800,stroke-width:2px,color:#000
+    style BUCKETS fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
+    style B1 fill:#e8f4fd,stroke:#2196F3,color:#000
+    style B2 fill:#e8f4fd,stroke:#2196F3,color:#000
+    style B3 fill:#e8f4fd,stroke:#2196F3,color:#000
+    style PROXY fill:#f3e5f5,stroke:#9C27B0,color:#000
+    style GHC fill:#f3e5f5,stroke:#9C27B0,color:#000
+    style BQ fill:#fce4ec,stroke:#E91E63,stroke-width:2px,color:#000
+    %% style BQ_DESC fill:#fce4ec,stroke:none,color:#000
+    style IDX fill:#e8f5e9,stroke:#4CAF50,stroke-width:2px,color:#000
+    style S3BQ fill:#e8f4fd,stroke:#2196F3,stroke-width:2px,color:#000
+    %% style IDX_DESC fill:#e8f5e9,stroke:none,color:#000
+    style STORES fill:#f3e5f5,stroke:#9C27B0,stroke-width:2px,color:#000
```
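
As a rough illustration of the file-to-bucket relationship in the diagram (files named by `crdc_instance_uuid`, grouped by `crdc_series_uuid`), the Python sketch below builds the GCS and S3 URLs for a single instance. The bucket names are taken from the diagram labels. The exact object layout `<bucket>/<crdc_series_uuid>/<crdc_instance_uuid>.dcm`, the group keys (`main`, `head`, `cr`), and the helper name `instance_urls` are assumptions made for this example and are not confirmed by the diff.

```python
# Bucket pairs as labeled in the diagram: (GCS bucket, AWS S3 bucket).
# The dictionary keys are hypothetical labels used only in this sketch.
BUCKETS = {
    "main": ("idc-open-data", "idc-open-data"),      # ~90%, CC BY
    "head": ("idc-open-idc1", "idc-open-data-two"),  # potential head scans, CC BY
    "cr": ("idc-open-cr", "idc-open-data-cr"),       # ~4%, CC BY-NC
}


def instance_urls(group: str, crdc_series_uuid: str, crdc_instance_uuid: str) -> tuple[str, str]:
    """Return (gs_url, s3_url) for one DICOM instance.

    ASSUMPTION: objects are stored as
    <bucket>/<crdc_series_uuid>/<crdc_instance_uuid>.dcm
    """
    gcs_bucket, s3_bucket = BUCKETS[group]
    key = f"{crdc_series_uuid}/{crdc_instance_uuid}.dcm"
    return f"gs://{gcs_bucket}/{key}", f"s3://{s3_bucket}/{key}"


if __name__ == "__main__":
    # Placeholder identifiers, not real IDC UUIDs.
    gs_url, s3_url = instance_urls("cr", "series-uuid", "instance-uuid")
    print(gs_url)
    print(s3_url)
```

In practice the `series_aws_url` column of idc-index already carries the series-level bucket path, so a lookup there is likely preferable to assembling paths by hand.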

Let's start with the overall principles of how we organize data in IDC.
