Commit a4a8a3d

Bill Clifford authored and gitbook-bot committed
GITBOOK-461: Bill's Jan 12 changes
1 parent 41cc5f4 commit a4a8a3d

10 files changed

+1085
-112
lines changed

data/data-model.md

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,6 @@
11
# Data model
22

3-
IDC relies on DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The _Entity-Relationship (E-R) diagram_ and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in [this page](https://mermaid.js.org/syntax/entityRelationshipDiagram.html) from Mermaid documentation).
3+
IDC relies on the DICOM data model for organizing images and image-derived data. At the same time, IDC includes certain attributes and data types that are outside of the DICOM data model. The _Entity-Relationship (E-R) diagram_ and examples below summarize a simplified view of the IDC data model (you will find the explanation of how to interpret the notation used in this E-R diagram in [this page](https://mermaid.js.org/syntax/entityRelationshipDiagram.html) from Mermaid documentation).
44

55
```mermaid
66
erDiagram
@@ -45,12 +45,12 @@ erDiagram
4545
4646
```
4747

48-
IDC content is organized in **Collections**: groups of DICOM files that were collected through certain research activity.
48+
IDC content is organized in **Collections**: groups of DICOM files that were collected through certain research activity. We sometimes refer to these as **Original Collections** to distinguish them from Analysis Results collections described below.
4949

5050
Collections are organized into **Programs**, which group related collections, or those collections that were contributed under the same funding initiative or a consortium. Example: TCGA program contains TCGA-GBM, TCGA-BRCA and other collections. You will see Collections nested under Programs in the upper left section of the [IDC Portal](https://portal.imaging.datacommons.cancer.gov/explore/). You will also see the list of collections that meet the filter criteria in the top table on the right-hand side of the portal interface. 
5151

5252
Individual DICOM files included in the collection contain attributes that organize content according to the [data-model.md](../dicom/data-model.md "mention"). 
5353

54-
Each collection will contain data for one or more case, or **patient**. Data for the individual patient is organized in DICOM **studies**, which group images corresponding to a single imaging exam/enconter, and collected in a given session. Studies are composed of DICOM **series**, which in turn consist of DICOM **instances**. Each DICOM instance correspond to a single file on disk. As an example, in radiology imaging, individual instances would correspond to image slices in multi-slice acquisitions, and in digital pathology you will see a separate file/instance for each resolution layer of the image pyramid. When using IDC Portal, you will never encounter individual instances - you will only see them if you download data to your computer.
54+
Each collection will contain data for one or more cases, or **patients**. Data for the individual patient is organized in DICOM **studies**, which group images corresponding to a single imaging exam/encounter, and collected in a given session. Studies are composed of DICOM **series**, which in turn consist of DICOM **instances**. Each DICOM instance corresponds to a single file on disk. As an example, in radiology imaging, individual instances would correspond to image slices in multi-slice acquisitions, and in digital pathology you will see a separate file/instance for each resolution layer of the image pyramid. When using the IDC Portal, you will never encounter individual instances - you will only see them if you download data to your computer.
5555

56-
**Analysis results collection** is a very important concept in IDC. These contain analysis results that were not contributed as part of any specific collection. Such analysis results might be contributed by investigators unrelated to those that submitted the analyzed images, and may span images across multiple collections.
56+
The **Analysis results collection** is a very important concept in IDC. An analysis result is the DICOM encoded result of some analysis performed on data from one or more original collections. Such analysis results are often contributed by investigators unrelated to those that submitted the analyzed images, and may span images across multiple collections.
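
The hierarchy described in this file (program, collection, patient, study, series, instance) can be sketched as nested records. This is only an illustration of the containment relationships, not IDC's actual schema: the class and field names are hypothetical, and the identifier values are placeholders. Only the `TCGA` program and the `TCGA-GBM` collection names come from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    # One DICOM instance corresponds to a single file on disk.
    sop_instance_uid: str


@dataclass
class Series:
    series_instance_uid: str
    instances: List[Instance] = field(default_factory=list)


@dataclass
class Study:
    # Groups the series acquired in a single imaging exam/encounter.
    study_instance_uid: str
    series: List[Series] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class Collection:
    collection_id: str
    program: str  # Programs group related collections.
    patients: List[Patient] = field(default_factory=list)


def count_instances(collection: Collection) -> int:
    """Count the files (instances) reachable from a collection."""
    return sum(
        len(series.instances)
        for patient in collection.patients
        for study in patient.studies
        for series in study.series
    )


# Placeholder identifiers, chosen only to make the example readable.
example = Collection(
    collection_id="TCGA-GBM",
    program="TCGA",
    patients=[
        Patient(
            patient_id="patient-1",
            studies=[
                Study(
                    study_instance_uid="1.2.3",
                    series=[
                        Series(
                            series_instance_uid="1.2.3.1",
                            instances=[Instance("1.2.3.1.1"), Instance("1.2.3.1.2")],
                        )
                    ],
                )
            ],
        )
    ],
)
```

An analysis results collection does not fit neatly into this tree, since its series may refer to images from several original collections. A fuller model would need a cross-reference rather than strict nesting.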

data/data-versioning.md

Lines changed: 11 additions & 11 deletions
@@ -2,9 +2,9 @@
22

33
## Summary
44

5-
IDC updates its data offering at the intervals of 2-4 months, with the data releases timing driven by the availability of new data, updates of existing data, introduction of new capabilities and various priority considerations. You can see the historical summary of IDC releases in [this page](data-release-notes.md#idc-releases-summary-view). 
5+
IDC updates its data offering at intervals of 2-4 months, with the data releases timing driven by the availability of new data, updates of existing data, introduction of new capabilities and various priority considerations. You can see the historical summary of IDC releases in [this page](data-release-notes.md#idc-releases-summary-view). 
66

7-
When you work with IDC data at any given time, you should be aware of the data release version. If you build cohorts using filters or queries, the result of those queries will change as the IDC content is evolving. Building queries that refer to the specific data release version will ensure that the result is the same.
7+
When you work with IDC data at any given time, you should be aware of the data release version. If you build cohorts using filters or queries against the IDC idc\_current and idc\_current\_clinical datasets, the result of those queries may change as the IDC content is evolving. Building queries that refer to the specific data release version will ensure that the result is the same.
88

99
Here is how you can learn what version of IDC data you are interacting with, depending on what interface to the data you are using:
1010

@@ -19,7 +19,7 @@ from idc_index import IDCClient
1919
idc_version = IDCClient.get_idc_version()
2020
```
2121

22-
* **BigQuery**: within `bigquery-public-data`project, `idc_current`dataset contains table "views" to effectively provide an alias for the latest IDC data release. To find the actual IDC data release number, expand the list of datasets under `bigquery-public-data`project, and search for the ones that follow the pattern \`idc\_v\<number>\`. The one with the largest number corresponds to the latest released version, and will match the content in `idc_current` (related Google bug [here](https://issuetracker.google.com/issues/324112186)).
22+
* **BigQuery**: the `bigquery-public-data` project, `idc_current` datasets are effectively aliases of the latest IDC BQ datasets. To find the actual IDC data release number, expand the list of datasets under `bigquery-public-data`project, and search for the ones that follow the pattern \`idc\_v\<number>\`. The one with the largest number corresponds to the latest released version, and will match the content in `idc_current` (related Google bug [here](https://issuetracker.google.com/issues/324112186)).
2323

2424
<figure><img src="../.gitbook/assets/image (38).png" alt="" width="408"><figcaption></figcaption></figure>
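
As a rough sketch of what "referring to a specific data release version" can look like, the helpers below build fully qualified BigQuery table names and pick the latest release from a list of dataset names. The function names are hypothetical, the version number `17` is only an example value, and the sketch assumes that versioned datasets follow the `idc_v<number>` pattern described above and contain the same `dicom_all` table as `idc_current`.

```python
import re
from typing import Iterable, Optional

PROJECT = "bigquery-public-data"


def qualified_table(table: str, version: Optional[int] = None) -> str:
    """Return a fully qualified table name.

    With no version, the name points at idc_current, whose content changes
    with every release. With a version, it points at a fixed release.
    """
    dataset = "idc_current" if version is None else f"idc_v{version}"
    return f"{PROJECT}.{dataset}.{table}"


def latest_version(dataset_names: Iterable[str]) -> int:
    """Pick the largest <number> among datasets named exactly idc_v<number>."""
    versions = [
        int(match.group(1))
        for name in dataset_names
        if (match := re.fullmatch(r"idc_v(\d+)", name))
    ]
    if not versions:
        raise ValueError("no idc_v<number> datasets found")
    return max(versions)


def series_count_query(version: Optional[int] = None) -> str:
    """Build an example query; pinning the version keeps the result stable."""
    table = qualified_table("dicom_all", version)
    return f"SELECT COUNT(DISTINCT SeriesInstanceUID) AS n_series FROM `{table}`"
```

Calling `series_count_query(17)` produces a query against a fixed release, while `series_count_query()` targets `idc_current` and may return a different count after the next release.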
2525

@@ -29,21 +29,19 @@ idc_version = IDCClient.get_idc_version()
2929

3030
## Implementation details
3131

32-
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (extremely rare).
32+
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (very rare).
3333

3434
Users interact with IDC using one of the following interfaces to define cohorts, and then perform analyses on these cohorts:
3535

3636
* [IDC Portal](https://portal.imaging.datacommons.cancer.gov/explore/) directly or using [IDC API](https://learn.canceridc.dev/api/getting-started): while this approach is most convenient, it allows searching using a small subset of attributes, defines cohorts only in terms of cases that meet the defined criteria, and has very limited options for combining multiple search criteria
3737
* [IDC BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc_current\&t=dicom_all\&page=table) tables via [SQL interface](https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction): this approach is most powerful, as it allows the use of [any of the DICOM metadata attributes](https://cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema) to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows to define cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)
3838

39-
The goal of IDC versioning is to create a series of "snapshots” over time of the entirety of the evolving IDC imaging dataset, such that searching an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here “identify” particularly means providing URLs or other access methods to the corresponding physical data objects.
39+
The goal of IDC versioning is to create a series of "snapshots” over time of the entirety of the evolving IDC imaging dataset, such that searching with respect to an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here “identify” particularly means providing URLs or other access methods to the corresponding physical data objects.
4040

4141
In order to reproduce the result of such analysis, it must be possible to precisely recreate a cohort. For this purpose an IDC cohort as defined in the Portal is specified and saved as a filter applied against a specified IDC data version. Alternatively, the cohort can be defined as an SQL query, or as a list of unique identifiers selecting specific files within a defined data release version.
4242

4343
Because an IDC version exactly defines the set of data against which the filter/query is applied, and because all versions of all data, except data removed due to PHI/PII concerns, should continue to be available, a cohort is therefore persistent over the course of the evolution of IDC data.
4444

45-
46-
4745
There are various reasons that can cause modification of the existing collections in IDC:
4846

4947
* images for new patients can be added to an existing collections;
@@ -55,17 +53,19 @@ These and other possible changes mean that DICOM instances, series and studies c
5553

5654
Because DICOM `SOPInstanceUIDs`, `SeriesInstanceUIDs` or `StudyInstanceUIDs` can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a [_UUID_](https://en.wikipedia.org/wiki/Universally_unique_identifier) to uniquely identify it and differentiate it from other versions of the same DICOM object.
5755

56+
In the same way patients and collections are versioned and each version of each patient and collection is assigned a unique UUID. As with instance, series and study UUIDs, the UUID of the patient and collection to which an instance belongs is available in the auxiliary\_metadata\_table.
57+
5858
{% hint style="info" %}
5959
It is very important to appreciate the difference between DICOM Unique Identifiers (UIDs) and CRDC Universally Unique Identifiers (UUIDs) assigned at the various levels of the DICOM hierarchy:
6060

61-
* **DICOM UID**s are available as DICOM metadata attributes within the DICOM files for each DICOM Study, Series and Instance. Those UIDs follow the conventions of the DICOM UI Value Representation. DICOM UIDs are not versioned. I.e., if a DICOM study is augmented with a new DICOM series, DICOM `StudyInstanceUID` will not change. If an instance within an existing DICOM series is modified, DICOM `SeriesInstanceUID` or the `SOPInstanceUID` of the modified instance may or may not change.
61+
* **DICOM UID**s are available as DICOM metadata attributes within the DICOM files for each DICOM Study, Series and Instance. Those UIDs follow the conventions of the DICOM UI Value Representation. DICOM UIDs are not versioned. I.e., if a DICOM study is augmented with a new DICOM series, the DICOM `StudyInstanceUID` will not change. If an instance within an existing DICOM series is modified, DICOM `SeriesInstanceUID` or the `SOPInstanceUID` of the modified instance may or may not change.
6262
* **IDC UUID**s are **not** available as DICOM metadata attributes - they are generated for the DICOM studies, series and instances at the time of data ingestion, and are available in the IDC BigQuery tables. IDC UUIDs are tied to the content of the entity they correspond to. I.e., if anything within a DICOM study/series/instance is changed in a given IDC data release, a new UUID at the corresponding level of data hierarchy will be generated, while the previous version will be indexed and available via the prior UUID.&#x20;
6363
{% endhint %}
6464

65-
The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, series and studies. This set is defined in terms of the corresponding set of instance UUIDs, series UUIDs and study UUIDs. This means that if, e.g., some version of an instance having UUID _UUIDx_ that was in IDC version _Vm_ is changed, a new UUID, _UUIDy_, will be assigned to the new instance version. Subsequent IDC versions, _Vm+1_, _Vm+2, ..._ will include that new instance version identified by _UUIDy_ unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes.
65+
The data in each IDC version, then, can be thought of as some set of versioned DICOM instances, series, studies, patients and collections. This set is defined in terms of the corresponding set of instance, series, study, patient and collection UUIDs. This means that if, e.g., some version of an instance having UUID _UUIDx_ that was in IDC version _Vm_ is changed, a new UUID, _UUIDy_, will be assigned to the new instance version. Subsequent IDC versions, _Vm+1_, _Vm+2, ..._ will include that new instance version identified by _UUIDy_ unless and until that instance is again changed. Similarly if the composition of some series changes, either because an instance in the series is changed, or an instance is added or removed from that series, a new UUID is assigned to the new version of that series and identifies that version of the series in subsequent IDC versions. Similarly, a study is assigned a new UUID when its composition changes, and similarly for patients and collections.
6666

67-
A corollary is that only a single version of an instance, series or study is in an IDC version.
67+
A corollary is that only a single version of an instance, series, study, patient or collection is in an IDC version.
6868

69-
Note that instances, series and studies do not have an explicit version number in their metadata. Versioning of an object is implicit in the associated UUIDs.
69+
Note that the DICOM does not include such version information. Versioning of an object is implicit in the associated UUIDs.
7070

7171
As we will see in [Organization of data](organization-of-data/organization-of-data-v1.md), the UUID of a (version of an) instance, and the UUID of the (version of a) series to which it belongs, are used in forming the object (file) name of the corresponding GCS and AWS objects. In addition, each instance version has a corresponding GA4GH DRS object, identified by a GUID based on the instance version's UUID. Refer to the [GA4GH DRS Objects](organization-of-data/organization-of-data-v2-through-v13-deprecated/guids-and-uuids.md) section for details.
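
The relationship between stable DICOM UIDs and versioned UUIDs can be illustrated with a small registry. This is a minimal sketch under stated assumptions, not a description of how IDC ingestion works: the class name, the use of a content digest to detect changes, and the use of random `uuid4` values are all choices made for the example.

```python
import uuid
from typing import Dict, Tuple


class VersionedObjects:
    """Tracks the current version of each object, keyed by its DICOM UID."""

    def __init__(self) -> None:
        # DICOM UID -> (content digest, UUID of the current version)
        self._current: Dict[str, Tuple[str, str]] = {}

    def ingest(self, dicom_uid: str, content_digest: str) -> str:
        """Return the UUID for this content, minting a new one on change."""
        known = self._current.get(dicom_uid)
        if known is not None and known[0] == content_digest:
            # Unchanged content keeps its UUID across releases.
            return known[1]
        # New or changed content gets a new UUID; the DICOM UID stays the same.
        new_uuid = str(uuid.uuid4())
        self._current[dicom_uid] = (content_digest, new_uuid)
        return new_uuid


registry = VersionedObjects()
uuid_x = registry.ingest("1.2.3", "digest-a")       # version Vm
uuid_same = registry.ingest("1.2.3", "digest-a")    # Vm+1, unchanged
uuid_y = registry.ingest("1.2.3", "digest-b")       # Vm+2, content changed
```

Here `uuid_x` and `uuid_same` are equal, while `uuid_y` differs even though the DICOM UID `1.2.3` (a placeholder) never changed. The same idea would apply one level up: a series, study, patient or collection would receive a new UUID whenever its composition changes.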

data/downloading-data/downloading-data-with-s5cmd.md

Lines changed: 5 additions & 8 deletions
@@ -1,4 +1,4 @@
1-
# Downloading data with s5cmd
1+
# s5cmd
22

33
{% hint style="info" %}
44
Make sure you first review the[ ](./)[Downloading data](./) section to learn about the simpler interfaces that provide access to IDC data.
@@ -11,7 +11,7 @@ With this approach you will follow a a 2-step process covered on this page:
1111
* **Step 1:** create a manifest - a list of the storage bucket URLs of the files to be downloaded. if you want to download the content of the cohort defined in the IDC Portal, [export the `s5cmd` manifest fist](../../portal/cohort-manifests.md), and proceed to Step 2. Alternatively, you can use BigQuery SQL as discussed below to generate the manifest;
1212
* **Step 2**: given the manifest, download files to your computer or to a cloud VM using `s5cmd` command line tool.
1313

14-
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our ["Getting started" tutorial series](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks/getting\_started), which demonstrates how to query and download IDC data!
14+
To learn more about using Google BigQuery SQL with IDC, check out part 3 of our ["Getting started" tutorial series](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks/getting_started), which demonstrates how to query and download IDC data!
1515

1616
### Step 1: Create the manifest
1717

@@ -21,7 +21,7 @@ You will need to complete prerequisites described in [getting-started-with-gcp.m
2121

2222
A download manifest can be created using either the IDC Portal, or by executing a BQ query. **If you have generated a manifest using the IDC Portal, as discussed** [**here**](../../portal/cohort-manifests.md)**, proceed to Step 2!** In the remainder of this section we describe creating a manifest from a BigQuery query.
2323

24-
The [`dicom_all`](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc\_current\&t=dicom\_all\&page=table) BigQuery table discussed in [this documentation article](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#bigquery-tables) can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The `gcs_url` and `aws_url` columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
24+
The [`dicom_all`](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc_current\&t=dicom_all\&page=table) BigQuery table discussed in [this documentation article](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#bigquery-tables) can be used to subset the files you need based on the DICOM metadata attributes as needed, utilizing the SQL query interface. The `gcs_url` and `aws_url` columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.
2525

2626
Start with the query templates provided below, modify them based on your needs, and save the result in a file `query.txt`. The specific values for `PatientID`, `SeriesInstanceUID`, `StudyInstanceUID` are chosen to serve as examples.
2727

@@ -111,11 +111,8 @@ Install `s5cmd` following the instructions in [https://github.com/peak/s5cmd#ins
111111

112112
You can verify if your setup was successful by running the following command: it should successfully download one file from IDC.
113113

114-
{% code overflow="wrap" %}
115-
```shell
116-
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com cp s3://public-datasets-idc/cdac3f73-4fc9-4e0d-913b-b64aa3100977/902b4588-6f10-4342-9c80-f1054e67ee83.dcm .
117-
```
118-
{% endcode %}
114+
<pre class="language-shell" data-overflow="wrap"><code class="lang-shell"><strong>s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com cp s3://idc-open-data/cdac3f73-4fc9-4e0d-913b-b64aa3100977/902b4588-6f10-4342-9c80-f1054e67ee83.dcm .
115+
</strong></code></pre>
119116

120117
Once `s5cmd` is installed, you can use `s5cmd run` command to download the files corresponding to the manifest.
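
The two steps can be sketched in Python as follows: write one `cp` line per file into a manifest, then assemble the `s5cmd run` invocation. The helper names are hypothetical, and the sketch assumes the URLs are already in `s3://` form, as in the verification command above. It only builds the command and does not execute it.

```python
from typing import Iterable, List

# Endpoint used in the verification command above.
GCS_ENDPOINT = "https://storage.googleapis.com"


def manifest_lines(urls: Iterable[str], destination: str = ".") -> List[str]:
    """Turn storage URLs into s5cmd `cp` commands, one per file."""
    return [f"cp {url} {destination}" for url in urls]


def s5cmd_run_command(manifest_path: str, endpoint_url: str = GCS_ENDPOINT) -> List[str]:
    """Build the argument list for running all commands in a manifest."""
    return [
        "s5cmd",
        "--no-sign-request",
        "--endpoint-url",
        endpoint_url,
        "run",
        manifest_path,
    ]


# The single object used in the setup check above.
urls = [
    "s3://idc-open-data/cdac3f73-4fc9-4e0d-913b-b64aa3100977/"
    "902b4588-6f10-4342-9c80-f1054e67ee83.dcm"
]
lines = manifest_lines(urls)
command = s5cmd_run_command("manifest.txt")
```

Writing `lines` to `manifest.txt` and passing `command` to `subprocess.run` would start the download, provided `s5cmd` is installed and on the `PATH`.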
121118
