data/data-model.md
Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
1
1
# Data model
2
2
3
-
IDC relies on DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The _Entity-Relationship (E-R) diagram_ and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in [this page](https://mermaid.js.org/syntax/entityRelationshipDiagram.html) from Mermaid documentation).
3
+
IDC relies on the DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The _Entity-Relationship (E-R) diagram_ and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in [this page](https://mermaid.js.org/syntax/entityRelationshipDiagram.html) from Mermaid documentation).
4
4
5
5
```mermaid
6
6
erDiagram
@@ -45,12 +45,12 @@ erDiagram
45
45
46
46
```
47
47
48
-
IDC content is organized in **Collections**: groups of DICOM files that were collected through certain research activity.
48
+
IDC content is organized in **Collections**: groups of DICOM files that were collected through certain research activity. We sometimes refer to these as **Original Collections** to distinguish them from Analysis Results collections described below.
49
49
50
50
Collections are organized into **Programs**, which group related collections, or those collections that were contributed under the same funding initiative or a consortium. Example: TCGA program contains TCGA-GBM, TCGA-BRCA and other collections. You will see Collections nested under Programs in the upper left section of the [IDC Portal](https://portal.imaging.datacommons.cancer.gov/explore/). You will also see the list of collections that meet the filter criteria in the top table on the right-hand side of the portal interface. 
51
51
52
52
Individual DICOM files included in the collection contain attributes that organize content according to the [data-model.md](../dicom/data-model.md "mention").
53
53
54
-
Each collection will contain data for one or more case, or **patient**. Data for the individual patient is organized in DICOM **studies**, which group images corresponding to a single imaging exam/enconter, and collected in a given session. Studies are composed of DICOM **series**, which in turn consist of DICOM **instances**. Each DICOM instance correspond to a single file on disk. As an example, in radiology imaging, individual instances would correspond to image slices in multi-slice acquisitions, and in digital pathology you will see a separate file/instance for each resolution layer of the image pyramid. When using IDC Portal, you will never encounter individual instances - you will only see them if you download data to your computer.
54
+
Each collection will contain data for one or more cases, or **patients**. Data for the individual patient is organized in DICOM **studies**, which group images corresponding to a single imaging exam/encounter, and collected in a given session. Studies are composed of DICOM **series**, which in turn consist of DICOM **instances**. Each DICOM instance corresponds to a single file on disk. As an example, in radiology imaging, individual instances would correspond to image slices in multi-slice acquisitions, and in digital pathology you will see a separate file/instance for each resolution layer of the image pyramid. When using the IDC Portal, you will never encounter individual instances - you will only see them if you download data to your computer.
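The hierarchy described above can be sketched as nested containers. This is only an illustration: the class and field names are hypothetical, chosen to mirror the terms used in the text, and are not part of any IDC API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    sop_instance_uid: str  # one DICOM instance corresponds to one file on disk


@dataclass
class Series:
    series_instance_uid: str
    instances: List[Instance] = field(default_factory=list)


@dataclass
class Study:
    study_instance_uid: str
    series: List[Series] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class Collection:
    collection_id: str
    patients: List[Patient] = field(default_factory=list)


def count_instances(collection: Collection) -> int:
    """Count the files you would get when downloading the whole collection."""
    return sum(
        len(series.instances)
        for patient in collection.patients
        for study in patient.studies
        for series in study.series
    )
```

Counting instances walks every level of the hierarchy, which is why the number of downloaded files can be much larger than the number of series shown in the Portal.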
55
55
56
-
**Analysis results collection** is a very important concept in IDC. These contain analysis results that were not contributed as part of any specific collection. Such analysis results might be contributed by investigators unrelated to those that submitted the analyzed images, and may span images across multiple collections.
56
+
The **Analysis results collection** is a very important concept in IDC. An analysis result is the DICOM encoded result of some analysis performed on data from one or more original collections. Such analysis results are often contributed by investigators unrelated to those that submitted the analyzed images, and may span images across multiple collections.
data/data-versioning.md
Lines changed: 11 additions & 11 deletions
@@ -2,9 +2,9 @@
2
2
3
3
## Summary
4
4
5
-
IDC updates its data offering at the intervals of 2-4 months, with the data releases timing driven by the availability of new data, updates of existing data, introduction of new capabilities and various priority considerations. You can see the historical summary of IDC releases in [this page](data-release-notes.md#idc-releases-summary-view). 
5
+
IDC updates its data offering at intervals of 2-4 months, with the timing of data releases driven by the availability of new data, updates of existing data, introduction of new capabilities and various priority considerations. You can see the historical summary of IDC releases in [this page](data-release-notes.md#idc-releases-summary-view).
6
6
7
-
When you work with IDC data at any given time, you should be aware of the data release version. If you build cohorts using filters or queries, the result of those queries will change as the IDC content is evolving. Building queries that refer to the specific data release version will ensure that the result is the same.
7
+
When you work with IDC data at any given time, you should be aware of the data release version. If you build cohorts using filters or queries against the IDC idc\_current and idc\_current\_clinical datasets, the result of those queries may change as the IDC content is evolving. Building queries that refer to the specific data release version will ensure that the result is the same.
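As a minimal sketch of pinning a query to a release, the helper below rewrites references to the `idc_current` dataset so that they point at a versioned dataset following the `idc_v<number>` pattern. The function name and the version number in the usage note are hypothetical; the helper deliberately leaves `idc_current_clinical` untouched, since the naming of versioned clinical datasets is not covered here.

```python
import re


def pin_to_release(sql: str, version: int) -> str:
    """Replace the `idc_current` dataset name with `idc_v<version>`.

    The word boundaries keep longer names such as `idc_current_clinical`
    unchanged, because `_` counts as a word character.
    """
    if version < 1:
        raise ValueError("IDC release numbers start at 1")
    return re.sub(r"\bidc_current\b", f"idc_v{version}", sql)
```

For example, `pin_to_release("SELECT * FROM `bigquery-public-data.idc_current.dicom_all`", 17)` would return the same query against `bigquery-public-data.idc_v17.dicom_all`, where 17 is just an illustrative release number.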
8
8
9
9
Here is how you can learn what version of IDC data you are interacting with, depending on what interface to the data you are using:
10
10
@@ -19,7 +19,7 @@ from idc_index import IDCClient
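```python
from idc_index import IDCClient

# Version of the IDC data release that this idc-index installation indexes
idc_version = IDCClient.get_idc_version()
```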
21
21
22
-
* **BigQuery**: within `bigquery-public-data` project, `idc_current` dataset contains table "views" to effectively provide an alias for the latest IDC data release. To find the actual IDC data release number, expand the list of datasets under `bigquery-public-data` project, and search for the ones that follow the pattern \`idc\_v\<number>\`. The one with the largest number corresponds to the latest released version, and will match the content in `idc_current` (related Google bug [here](https://issuetracker.google.com/issues/324112186)).
22
+
* **BigQuery**: in the `bigquery-public-data` project, the `idc_current` datasets are effectively aliases of the latest IDC BQ datasets. To find the actual IDC data release number, expand the list of datasets under the `bigquery-public-data` project, and search for the ones that follow the pattern \`idc\_v\<number>\`. The one with the largest number corresponds to the latest released version, and will match the content in `idc_current` (related Google bug [here](https://issuetracker.google.com/issues/324112186)).
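The manual search for the largest `idc_v<number>` dataset could be automated roughly as follows. This sketch assumes you have already obtained the list of dataset names by some other means (for example, by copying them from the BigQuery console); the function name and sample names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches only datasets named exactly idc_v<number>, e.g. idc_v10
_VERSIONED_DATASET = re.compile(r"^idc_v(\d+)$")


def latest_idc_version(dataset_ids: Iterable[str]) -> Optional[int]:
    """Return the largest release number among idc_v<number> dataset names."""
    matches = (_VERSIONED_DATASET.match(name) for name in dataset_ids)
    versions = [int(m.group(1)) for m in matches if m]
    return max(versions, default=None)
```

The release numbers are compared as integers rather than as strings, so that `idc_v10` is correctly ranked above `idc_v9`.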
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (extremely rare).
32
+
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (very rare).
33
33
34
34
Users interact with IDC using one of the following interfaces to define cohorts, and then perform analyses on these cohorts:
35
35
36
36
* [IDC Portal](https://portal.imaging.datacommons.cancer.gov/explore/) directly or using [IDC API](https://learn.canceridc.dev/api/getting-started): while this approach is most convenient, it allows searching using a small subset of attributes, defines cohorts only in terms of cases that meet the defined criteria, and has very limited options for combining multiple search criteria
37
37
* [IDC BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc_current\&t=dicom_all\&page=table) tables via [SQL interface](https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction): this approach is most powerful, as it allows the use of [any of the DICOM metadata attributes](https://cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema) to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows defining a cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)
38
38
39
-
The goal of IDC versioning is to create a series of "snapshots” over time of the entirety of the evolving IDC imaging dataset, such that searching an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here “identify” particularly means providing URLs or other access methods to the corresponding physical data objects.
39
+
The goal of IDC versioning is to create a series of "snapshots” over time of the entirety of the evolving IDC imaging dataset, such that searching with respect to an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here “identify” particularly means providing URLs or other access methods to the corresponding physical data objects.
40
40
41
41
In order to reproduce the result of such analysis, it must be possible to precisely recreate a cohort. For this purpose an IDC cohort as defined in the Portal is specified and saved as a filter applied against a specified IDC data version. Alternatively, the cohort can be defined as an SQL query, or as a list of unique identifiers selecting specific files within a defined data release version.
42
42
43
43
Because an IDC version exactly defines the set of data against which the filter/query is applied, and because all versions of all data, except data removed due to PHI/PII concerns, should continue to be available, a cohort is therefore persistent over the course of the evolution of IDC data.
44
44
45
-
46
-
47
45
There are various reasons that can cause modification of the existing collections in IDC:
48
46
49
47
* images for new patients can be added to an existing collection;
@@ -55,17 +53,19 @@ These and other possible changes mean that DICOM instances, series and studies c
55
53
56
54
Because DICOM `SOPInstanceUIDs`, `SeriesInstanceUIDs` or `StudyInstanceUIDs` can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a [_UUID_](https://en.wikipedia.org/wiki/Universally_unique_identifier) to uniquely identify it and differentiate it from other versions of the same DICOM object.
57
55
56
+
In the same way, patients and collections are versioned, and each version of each patient and collection is assigned a UUID. As with instance, series and study UUIDs, the UUID of the patient and collection to which an instance belongs is available in the auxiliary\_metadata\_table.
57
+
58
58
{% hint style="info" %}
59
59
It is very important to appreciate the difference between DICOM Unique Identifiers (UIDs) and CRDC Universally Unique Identifiers (UUIDs) assigned at the various levels of the DICOM hierarchy:
60
60
61
-
* **DICOM UID**s are available as DICOM metadata attributes within the DICOM files for each DICOM Study, Series and Instance. Those UIDs follow the conventions of the DICOM UI Value Representation. DICOM UIDs are not versioned. I.e., if a DICOM study is augmented with a new DICOM series, DICOM `StudyInstanceUID` will not change. If an instance within an existing DICOM series is modified, DICOM `SeriesInstanceUID` or the `SOPInstanceUID` of the modified instance may or may not change.
61
+
* **DICOM UID**s are available as DICOM metadata attributes within the DICOM files for each DICOM Study, Series and Instance. Those UIDs follow the conventions of the DICOM UI Value Representation. DICOM UIDs are not versioned. I.e., if a DICOM study is augmented with a new DICOM series, the DICOM `StudyInstanceUID` will not change. If an instance within an existing DICOM series is modified, DICOM `SeriesInstanceUID` or the `SOPInstanceUID` of the modified instance may or may not change.
62
62
* **IDC UUID**s are **not** available as DICOM metadata attributes - they are generated for the DICOM studies, series and instances at the time of data ingestion, and are available in the IDC BigQuery tables. IDC UUIDs are tied to the content of the entity they correspond to. I.e., if anything within a DICOM study/series/instance is changed in a given IDC data release, a new UUID at the corresponding level of data hierarchy will be generated, while the previous version will be indexed and available via the prior UUID.
63
63
{% endhint %}
64
64
65
-
The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, seriesand studies. This set is defined in terms of the corresponding set of instance UUIDs, series UUIDs and study UUIDs. This means that if, e.g., some version of an instance having UUID _UUIDx_ that was in IDC version _Vm_ is changed, a new UUID, _UUIDy_, will be assigned to the new instance version. Subsequent IDC versions, _Vm+1_, _Vm+2, ..._ will include that new instance version identified by _UUIDy_ unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes.
65
+
The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, series, studies, patients and collections. This set is defined in terms of the corresponding set of instance, series, study, patient and collection UUIDs. This means that if, e.g., some version of an instance having UUID _UUIDx_ that was in IDC version _Vm_ is changed, a new UUID, _UUIDy_, will be assigned to the new instance version. Subsequent IDC versions, _Vm+1_, _Vm+2, ..._ will include that new instance version identified by _UUIDy_ unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes, and similarly for patients and collections.
66
66
67
-
A corollary is that only a single version of an instance, seriesor study is in an IDC version.
67
+
A corollary is that only a single version of an instance, series, study, patient or collection is in an IDC version.
68
68
69
-
Note that instances, series and studies do not have an explicit version number in their metadata. Versioning of an object is implicit in the associated UUIDs.
69
+
Note that the DICOM metadata does not include such version information. Versioning of an object is implicit in the associated UUIDs.
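The relationship between invariant DICOM UIDs and versioned UUIDs can be illustrated with a small sketch. It is not how IDC ingestion is implemented: the class is hypothetical, and the use of random `uuid4` values is an assumption made only to show when a new UUID appears.

```python
import uuid
from typing import Dict


class VersionedSeries:
    """Toy model of one DICOM series whose versions are tracked by UUID."""

    def __init__(self, series_instance_uid: str) -> None:
        # The DICOM UID stays the same across versions of the series
        self.series_instance_uid = series_instance_uid
        # The UUID identifies the current version of the series
        self.series_uuid = str(uuid.uuid4())
        # SOPInstanceUID -> UUID of the current version of that instance
        self.instance_uuids: Dict[str, str] = {}

    def put_instance(self, sop_instance_uid: str) -> str:
        """Add a new instance, or a changed version of an existing one.

        Either way the instance version gets a fresh UUID, and the
        composition of the series changes, so the series gets one too.
        """
        instance_uuid = str(uuid.uuid4())
        self.instance_uuids[sop_instance_uid] = instance_uuid
        self.series_uuid = str(uuid.uuid4())
        return instance_uuid
```

Calling `put_instance` twice with the same `SOPInstanceUID` leaves a single entry in `instance_uuids`, which reflects the corollary that only one version of an instance is present in a given IDC version.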
70
70
71
71
As we will see in [Organization of data](organization-of-data/organization-of-data-v1.md), the UUID of a (version of an) instance, and the UUID of the (version of a) series to which it belongs, are used in forming the object (file) name of the corresponding GCS and AWS objects. In addition, each instance version has a corresponding GA4GH DRS object, identified by a GUID based on the instance version's UUID. Refer to the [GA4GH DRS Objects](organization-of-data/organization-of-data-v2-through-v13-deprecated/guids-and-uuids.md) section for details.
data/downloading-data/downloading-data-with-s5cmd.md
Lines changed: 5 additions & 8 deletions
@@ -1,4 +1,4 @@
1
-
# Downloading data with s5cmd
1
+
# s5cmd
2
2
3
3
{% hint style="info" %}
4
4
Make sure you first review the [Downloading data](./) section to learn about the simpler interfaces that provide access to IDC data.
@@ -11,7 +11,7 @@ With this approach you will follow a a 2-step process covered on this page:
11
11
* **Step 1:** create a manifest - a list of the storage bucket URLs of the files to be downloaded. If you want to download the content of the cohort defined in the IDC Portal, [export the `s5cmd` manifest first](../../portal/cohort-manifests.md), and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;
12
12
* **Step 2**: given the manifest, download files to your computer or to a cloud VM using the `s5cmd` command line tool.
13
13
14
-
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our ["Getting started" tutorial series](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks/getting\_started), which demonstrates how to query and download IDC data!
14
+
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our ["Getting started" tutorial series](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks/getting_started), which demonstrates how to query and download IDC data!
15
15
16
16
### Step 1: Create the manifest
17
17
@@ -21,7 +21,7 @@ You will need to complete prerequisites described in [getting-started-with-gcp.m
21
21
22
22
A download manifest can be created using either the IDC Portal, or by executing a BQ query. **If you have generated a manifest using the IDC Portal, as discussed** [**here**](../../portal/cohort-manifests.md)**, proceed to Step 2!** In the remainder of this section we describe creating a manifest from a BigQuery query.
23
23
24
-
The [`dicom_all`](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc\_current\&t=dicom\_all\&page=table) BigQuery table discussed in [this documentation article](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#bigquery-tables) can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The `gcs_url` and `aws_url` columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
24
+
The [`dicom_all`](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc_current\&t=dicom_all\&page=table) BigQuery table discussed in [this documentation article](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#bigquery-tables) can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The `gcs_url` and `aws_url` columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
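If you already have a list of values from the `aws_url` or `gcs_url` column, turning it into a manifest could look like the sketch below. It assumes that each manifest line is an `s5cmd` `cp` command, which is one form that `s5cmd run` accepts; a manifest exported from the Portal may be formatted differently. The function name and the example URL in the usage note are hypothetical.

```python
from typing import Iterable, List


def manifest_lines(urls: Iterable[str], destination: str = ".") -> List[str]:
    """Build one `cp <url> <destination>` line per distinct URL.

    dict.fromkeys removes duplicates while keeping the original order.
    """
    return [f"cp {url} {destination}" for url in dict.fromkeys(urls)]
```

For instance, `manifest_lines(["s3://example-bucket/series/instance.dcm"])` would produce the single line `cp s3://example-bucket/series/instance.dcm .`, which you could then write to a text file, one line per file to download.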
25
25
26
26
Start with the query templates provided below, modify them based on your needs, and save the result in a file `query.txt`. The specific values for `PatientID`, `SeriesInstanceUID`, `StudyInstanceUID` are chosen to serve as examples.
27
27
@@ -111,11 +111,8 @@ Install `s5cmd` following the instructions in [https://github.com/peak/s5cmd#ins
111
111
112
112
You can verify if your setup was successful by running the following command: it should successfully download one file from IDC.