
Commit 44c4179

Jill Grant authored
Merge pull request #1914 from mamtagiri/main
update genomics open data lake doc
2 parents 8557609 + 6f4bf76 commit 44c4179

12 files changed: +25 -62 lines changed

articles/open-datasets/dataset-1000-genomes.md

Lines changed: 10 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -9,8 +9,6 @@ ms.date: 07/10/2024
99

1010
# 1000 Genomes
1111

12-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
13-
1412
The 1000 Genomes Project ran between 2008 and 2015, to create the largest public catalog of human variation and genotype data. The final data set contains data for 2,504 individuals from 26 populations and 84 million identified variants. For more information, visit the 1000 Genome Project [website](https://www.internationalgenome.org/) and these publications:
1513

1614
[Pilot Analysis: A map of human genome variation from population-scale sequencing Nature 467, 1061-1073 (28 October 2010)](https://www.nature.com/articles/nature09534)
@@ -33,6 +31,16 @@ This dataset is a mirror of [this](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) FT
3331

3432
This dataset contains approximately 815 TB of data. It receives daily updates.
3533

34+
## Storage location
35+
36+
This dataset is stored in the West US 2 and West Central US Azure regions. We recommend locating compute resources in West US 2 or West Central US for affinity.
37+
38+
## Data access
39+
40+
West US 2:"https://dataset1000genomes.blob.core.windows.net/dataset'"
41+
42+
West Central US: "https://dataset1000genomes-secondary.blob.core.windows.net/dataset"
43+
3644
## Use Terms
3745

3846
Following the final publications, data from the 1000 Genomes Project is publicly available, without embargo, to anyone for use under the terms provided by the [dataset source](http://www.internationalgenome.org/data). Use of the data should be cited per details available in the 1000 Genome Project [FAQ resource](https://www.internationalgenome.org/faq).
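
The two endpoints added under "Data access" in this file are Azure Blob Storage container URLs. As a minimal sketch (not part of the commit), the snippet below builds a Blob service List Blobs request URL for either region using only the Python standard library. The helper names `container_url` and `list_blobs_url` are hypothetical, the `-secondary` suffix rule is inferred from the URLs in this diff, and the sketch assumes the stray `'` after `/dataset` in the West US 2 line is a typo. Whether the containers permit anonymous listing is also an assumption, so the network call is left commented out.

```python
from urllib.parse import urlencode

ACCOUNT = "dataset1000genomes"  # storage account name taken from the diff
CONTAINER = "dataset"           # container name taken from the diff


def container_url(account, region, container=CONTAINER):
    """Return the container URL for a region listed in the diff.

    Assumption: 'West Central US' maps to the '<account>-secondary' host,
    as the URLs in this commit suggest.
    """
    suffix = {"West US 2": "", "West Central US": "-secondary"}[region]
    return f"https://{account}{suffix}.blob.core.windows.net/{container}"


def list_blobs_url(base, prefix="", max_results=10):
    """Build a List Blobs REST URL (restype=container&comp=list)."""
    params = {"restype": "container", "comp": "list", "maxresults": max_results}
    if prefix:
        params["prefix"] = prefix
    return f"{base}?{urlencode(params)}"


url = list_blobs_url(container_url(ACCOUNT, "West US 2"))
print(url)

# To try the request (assumes anonymous list access is enabled):
# import urllib.request
# print(urllib.request.urlopen(url).read()[:500])
```

If the containers turn out to require credentials, the same URLs would need a SAS query string or an authenticated client instead.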

articles/open-datasets/dataset-clinvar-annotations.md

Lines changed: 10 additions & 7 deletions
@@ -9,8 +9,6 @@ ms.date: 06/13/2024
99

1010
# ClinVar Annotations
1111

12-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
13-
1412
The [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) resource is a freely accessible, public archive of reports - with supporting evidence - about the relationships among human variations and phenotypes. It facilitates access to and communication about the claimed relationships between human variation and observed health status, and about the history of that interpretation. It provides access to a broader set of clinical interpretations that researchers can incorporate into genomics workflows and applications.
1513

1614
Visit the [Data Dictionary](https://www.ncbi.nlm.nih.gov/projects/clinvar/ClinVarDataDictionary.pdf) and the [FAQ resource](https://www.ncbi.nlm.nih.gov/clinvar/docs/faq/) for more information about the data.
@@ -20,26 +18,31 @@ Visit the [Data Dictionary](https://www.ncbi.nlm.nih.gov/projects/clinvar/ClinVa
2018
## Data source
2119

2220
This dataset is a mirror of the National Library of Medicine ClinVar [FTP resource](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/).
21+
[FTP resource](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/)
22+
23+
[FTP Overview](https://www.ncbi.nlm.nih.gov/clinvar/docs/ftp_primer/)
2324

2425
## Data update frequency
2526

2627
This dataset receives daily updates.
2728

28-
## Data Access
29+
## Storage location
2930

30-
[FTP resource](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/)
31+
This dataset is stored in the West US 2 and West Central US Azure regions. We recommend locating compute resources in West US 2 or West Central US for affinity.
3132

32-
[FTP Overview](https://www.ncbi.nlm.nih.gov/clinvar/docs/ftp_primer/)
33+
## Data Access
34+
35+
West US 2:"https://datasetclinvar.blob.core.windows.net/dataset'"
36+
West Central US: "https://datasetclinvar-secondary.blob.core.windows.net/dataset"
3337

3438
## Use Terms
39+
3540
Data is available without restrictions. More information and citation details, see [Accessing and using data in ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/).
3641

3742
## Contact
3843

3944
For any questions or feedback about this dataset, contact [[email protected]](mailto:[email protected]).
4045

41-
## Data access
42-
4346
### Azure Notebooks
4447

4548
# [azure-storage](#tab/azure-storage)

articles/open-datasets/dataset-encode.md

Lines changed: 0 additions & 4 deletions
@@ -8,8 +8,6 @@ ms.date: 04/16/2021
88

99
# ENCODE: Encyclopedia of DNA Elements
1010

11-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
12-
1311
The [Encyclopedia of DNA Elements (ENCODE) Consortium](https://www.encodeproject.org/help/project-overview/) is an ongoing international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). ENCODE's goal is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
1412

1513
ENCODE investigators employ various assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a diverse range of RNA sources, comparative genomics, integrative bioinformatic methods, and human curation. Regulatory elements are typically investigated through DNA hypersensitivity assays, assays of DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, that is, modified histones, transcription factors, chromatin regulators, and RNA-binding proteins, followed by sequencing.
@@ -34,8 +32,6 @@ West US 2: 'https://datasetencode.blob.core.windows.net/dataset'
3432

3533
West Central US: 'https://datasetencode-secondary.blob.core.windows.net/dataset'
3634

37-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2019-10-10&si=prod&sr=c&sig=9qSQZo4ggrCNpybBExU8SypuUZV33igI11xw0P7rB3c%3D
38-
3935
## Use Terms
4036

4137
External data users may freely download, analyze, and publish results based on any ENCODE data without restrictions, regardless of type or size, and includes no grace period for ENCODE data producers, either as individual members or as part of the Consortium. Researchers using unpublished ENCODE data are encouraged to contact the data producers to discuss possible publications. The Consortium will continue to publish the results of its own analysis efforts in independent publications.

articles/open-datasets/dataset-gatk-resource-bundle.md

Lines changed: 1 addition & 16 deletions
@@ -8,7 +8,6 @@ ms.date: 04/16/2021
88

99
# GATK Resource Bundle
1010

11-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
1211

1312
The [GATK resource bundle](https://gatk.broadinstitute.org/hc/articles/360035890811-Resource-bundle) is a collection of standard files for working with human resequencing data with the GATK.
1413

@@ -39,49 +38,35 @@ This dataset is stored in the West US 2 and West Central US Azure regions. Alloc
3938
West US 2: 'https://datasetgatkbestpractices.blob.core.windows.net/dataset'
4039

4140
West Central US: 'https://datasetgatkbestpractices-secondary.blob.core.windows.net/dataset'
42-
43-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2020-04-08&si=prod&sr=c&sig=6SaDfKtXAIfdpO%2BkvNA%2FsTNmNij%2Byh%2F%2F%2Bf98WAUqs7I%3D
4441

4542
2. datasetgatklegacybundles
4643

4744
West US 2: 'https://datasetgatklegacybundles.blob.core.windows.net/dataset'
4845

4946
West Central US: 'https://datasetgatklegacybundles-secondary.blob.core.windows.net/dataset'
50-
51-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2020-04-08&si=prod&sr=c&sig=xBfxOPBqHKUCszzwbNCBYF0k9osTQjKnZbEjXCW7gU0%3D
5247

5348
3. datasetgatktestdata
5449

5550
West US 2: 'https://datasetgatktestdata.blob.core.windows.net/dataset'
5651

5752
West Central US: 'https://datasetgatktestdata-secondary.blob.core.windows.net/dataset'
58-
59-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2020-04-08&si=prod&sr=c&sig=fzLts1Q2vKjuvR7g50vE4HteEHBxTcJbNvf%2FZCeDMO4%3D
6053

6154
4. datasetpublicbroadref
6255

6356
West US 2: 'https://datasetpublicbroadref.blob.core.windows.net/dataset'
6457

6558
West Central US: 'https://datasetpublicbroadref-secondary.blob.core.windows.net/dataset'
66-
67-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D
6859

6960
South Central US: 'https://datasetpublicbroadrefsc.blob.core.windows.net/dataset'
7061

71-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2023-01-03&st=2024-02-12T19%3A56%3A11Z&se=2029-02-13T19%3A56%3A00Z&sr=c&sp=rl&sig=oGiNUGZ08PaabHVNtIiVEpJ1kcyqcL6ZadQcuN2ns%2FM%3D
72-
7362
6. datasetbroadpublic
7463

7564
West US 2: 'https://datasetbroadpublic.blob.core.windows.net/dataset'
7665

7766
West Central US: 'https://datasetbroadpublic-secondary.blob.core.windows.net/dataset'
78-
79-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2020-04-08&si=prod&sr=c&sig=u%2Bg2Ab7WKZEGiAkwlj6nKiEeZ5wdoJb10Az7uUwis%2Fg%3D
8067

8168
South Central US: 'https://datasetbroadpublicsc.blob.core.windows.net/dataset'
82-
83-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sv=2023-01-03&st=2024-02-12T19%3A58%3A33Z&se=2029-02-13T19%3A58%3A00Z&sr=c&sp=rl&sig=C2lDhe1uwu%2FJnC9rbQO65G6%2BdEUQ%2Fl0VheXrlnIQVAs%3D
84-
69+
8570
## Use Terms
8671

8772
Visit the [GATK resource bundle official site](https://gatk.broadinstitute.org/hc/articles/360035890811-Resource-bundle).
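
The GATK file lists several storage accounts, each with per-region endpoints, and recommends placing compute near the data. One way to act on that is to keep the endpoints in a lookup table and choose the one that matches the compute region. The sketch below does this for two of the accounts shown in the diff. The function name `pick_endpoint` is hypothetical, and falling back to West US 2 when no regional copy exists is an assumption of this sketch, not something the docs state.

```python
# Endpoints copied from the diff above (two of the listed accounts).
ENDPOINTS = {
    "datasetgatkbestpractices": {
        "West US 2": "https://datasetgatkbestpractices.blob.core.windows.net/dataset",
        "West Central US": "https://datasetgatkbestpractices-secondary.blob.core.windows.net/dataset",
    },
    "datasetpublicbroadref": {
        "West US 2": "https://datasetpublicbroadref.blob.core.windows.net/dataset",
        "West Central US": "https://datasetpublicbroadref-secondary.blob.core.windows.net/dataset",
        "South Central US": "https://datasetpublicbroadrefsc.blob.core.windows.net/dataset",
    },
}


def pick_endpoint(account, compute_region, default_region="West US 2"):
    """Return the endpoint co-located with the compute region, if one exists.

    Assumption: when the account has no copy in the compute region,
    fall back to `default_region`.
    """
    regions = ENDPOINTS[account]
    return regions.get(compute_region, regions[default_region])


print(pick_endpoint("datasetpublicbroadref", "South Central US"))
print(pick_endpoint("datasetgatkbestpractices", "South Central US"))
```

The second call falls back to the West US 2 endpoint because the diff lists no South Central US copy for `datasetgatkbestpractices`.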

articles/open-datasets/dataset-human-reference-genomes.md

Lines changed: 0 additions & 6 deletions
@@ -8,8 +8,6 @@ ms.date: 04/16/2021
88

99
# Human Reference Genomes
1010

11-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
12-
1311
This dataset includes two human-genome references assembled by the [Genome Reference Consortium](https://www.ncbi.nlm.nih.gov/grc): Hg19 and Hg38.
1412

1513
For more information on Hg19 (GRCh37) data, see the [GRCh37 report at NCBI](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/).
@@ -44,12 +42,8 @@ West US 2: 'https://datasetreferencegenomes.blob.core.windows.net/dataset'
4442

4543
West Central US: 'https://datasetreferencegenomes-secondary.blob.core.windows.net/dataset'
4644

47-
[SAS Token](/azure/storage/common/storage-sas-overview): sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=JtQoPFqiC24GiEB7v9zHLi4RrA2Kd1r%2F3iFt2l9%2FlV8%3D
48-
4945
South Central US: 'https://datasetreferencegenomesc.blob.core.windows.net/dataset'
5046

51-
[SAS Token](/azure/storage/common/storage-sas-overview): sv=2023-01-03&st=2024-02-12T20%3A07%3A21Z&se=2029-02-13T20%3A07%3A00Z&sr=c&sp=rl&sig=ASZYVyhqLOXKsT%2BcTR8MMblFeI4uZ%2Bnno%2FCnQk2RaFs%3D
52-
5347
## Use Terms
5448

5549
Data is available without restrictions. For more information and citation details, see the [NCBI Reference Sequence Database site](https://www.ncbi.nlm.nih.gov/refseq/).

articles/open-datasets/dataset-illumina-platinum-genomes.md

Lines changed: 4 additions & 8 deletions
@@ -8,9 +8,7 @@ ms.date: 04/16/2021
88

99
# Illumina Platinum Genomes
1010

11-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
12-
13-
Whole-genome sequencing is enabling researchers worldwide to characterize the human genome more fully and accurately. This requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes as a benchmark. Illumina has generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree. Illumina has called variants in each genome using a range of currently available algorithms.
11+
Whole-genome sequencing is enabling researchers worldwide to characterize the human genome more fully and accurately. This effort requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes as a benchmark. Illumina generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree. Illumina called variants in each genome using a range of currently available algorithms.
1412

1513
For more information on the data, see the official [Illumina site](https://www.illumina.com/platinumgenomes.html).
1614

@@ -34,8 +32,6 @@ West US 2: 'https://datasetplatinumgenomes.blob.core.windows.net/dataset'
3432

3533
West Central US: 'https://datasetplatinumgenomes-secondary.blob.core.windows.net/dataset'
3634

37-
[SAS Token](/azure/storage/common/storage-sas-overview): sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=FFfZ0QaDcnEPQmWsshtpoYOjbzd4jtwIWeK%2Fc4i9MqM%3D
38-
3935
## Use Terms
4036

4137
Data is available without restrictions. For more information and citation details, see the [official Illumina site](https://www.illumina.com/platinumgenomes.html).
@@ -55,7 +51,7 @@ For any questions or feedback about the dataset, contact platinumgenomes@illumin
5551

5652
## Getting the Illumina Platinum Genomes from Azure Open Datasets and Doing Initial Analysis
5753

58-
Use Jupyter notebooks, GATK, and Picard to do the following:
54+
Use Jupyter notebooks, GATK, and Picard in analyses such as:
5955

6056
1. Annotate genotypes using VariantFiltration
6157
2. Select Specific Variants
@@ -77,7 +73,7 @@ This notebook requires the following libraries:
7773

7874
## Getting the Genomics data from Azure Open Datasets
7975

80-
Several public genomics data has been uploaded as an Azure Open Dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service linked to this open dataset. You can find examples of data calling procedure from Azure Open Dataset for `Illumina Platinum Genomes` datasets in below:
76+
Several public genomics data has been uploaded as an Azure Open Dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service linked to this open dataset. You can find examples of data calling procedure from Azure Open Dataset for `Illumina Platinum Genomes` datasets as:
8177

8278
### Downloading the specific 'Illumina Platinum Genomes'
8379

@@ -164,7 +160,7 @@ Extract fields from a VCF file to a tab-delimited table. This tool extracts spec
164160

165161
INFO/site-level fields:
166162

167-
Use the `-F` argument to extract INFO fields; each field will occupy a single column in the output file. The field can be any standard VCF column (for example, CHROM, ID, QUAL) or any annotation name in the INFO field (for example, AC, AF). The tool also supports the following fields:
163+
Use the `-F` argument to extract INFO fields; each field occupies a single column in the output file. The field can be any standard VCF column (for example, CHROM, ID, QUAL) or any annotation name in the INFO field (for example, AC, AF). The tool also supports the following fields:
168164

169165
EVENTLENGTH (length of the event)
170166
TRANSITION (1 for a bi-allelic transition (SNP), 0 for bi-allelic transversion (SNP), -1 for INDELs and multi-allelics)

articles/open-datasets/dataset-immunecode.md

Lines changed: 0 additions & 2 deletions
@@ -38,8 +38,6 @@ West US 2: 'https://dataset1000genomes.blob.core.windows.net/dataset'
3838

3939
West Central US: 'https://dataset1000genomes-secondary.blob.core.windows.net/dataset'
4040

41-
[SAS Token](/azure/storage/common/storage-sas-overview): sv=2019-10-10&si=prod&sr=c&sig=9nzcxaQn0NprMPlSh4RhFQHcXedLQIcFgbERiooHEqM%3D
42-
4341
## Use terms
4442

4543
To learn more about the data use terms refer the [publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7418738/) and [Terms of Use](https://clients.adaptivebiotech.com/terms-of-use).

articles/open-datasets/dataset-open-cravat.md

Lines changed: 0 additions & 4 deletions
@@ -8,8 +8,6 @@ ms.date: 04/16/2021
88

99
# OpenCravat: Open Custom Ranked Analysis of Variants Toolkit
1010

11-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
12-
1311
OpenCRAVAT is a Python package that performs genomic variant interpretation including variant impact, annotation, and scoring. OpenCRAVAT has a modular architecture with a wide variety of analysis modules and annotation resources that can be selected and installed/run based on the needs of a given study.
1412

1513
For more information on the data, see the [OpenCravat](https://opencravat.org/).
@@ -34,8 +32,6 @@ West US 2: 'https://datasetopencravat.blob.core.windows.net/dataset'
3432

3533
West Central US: 'https://datasetopencravat-secondary.blob.core.windows.net/dataset'
3634

37-
[SAS Token](/azure/storage/common/storage-sas-overview): sv=2020-04-08&st=2021-03-11T23%3A50%3A01Z&se=2025-07-26T22%3A50%3A00Z&sr=c&sp=rl&sig=J9J9wnJOXsmEy7TFMq9wjcxjXDE%2B7KhGpCUL4elsC14%3D
38-
3935
## Use Terms
4036

4137
OpenCRAVAT is available with a GPLv3 license. Most data sources are free for non-commercial use. For commercial use, consult the institutional contacts for each data source.

articles/open-datasets/dataset-open-targets.md

Lines changed: 0 additions & 4 deletions
@@ -8,8 +8,6 @@ ms.date: 04/16/2021
88

99
# Open Targets
1010

11-
[!INCLUDE [Open Dataset access change notice](./includes/open-datasets-change-note.md)]
12-
1311
The Open Targets Platform is a data resource to facilitate the systematic identification and prioritization of potential therapeutic drug targets. This resource integrates publicly available datasets, including those datasets that are generated by the Open Targets consortium, to build and score target-disease associations, aiding in the identification and prioritization of drug targets. Additionally, it incorporates pertinent annotation information about targets, diseases, phenotypes, drugs, and their key relationships.
1412

1513
The Open Targets Genetics highlights variant-centric statistical evidence to allow both prioritization of candidate causal variants at trait-associated loci and identification of potential drug targets. It collects and combines genetic associations gathered from published literature as well as newly derived data from sources like UK Biobank and FinnGen. Additionally, it includes functional genomics information such as chromatin conformation and interactions, along with quantitative trait loci (eQTLs, pQTLs, and sQTLs). Large-scale pipelines apply statistical fine-mapping across thousands of trait-associated loci to resolve association signals and link each variant to its proximal and distal target genes using a 'Locus2Gene' assessment. Integrated cross-trait colocalisation analyses and linking to detailed pharmaceutical compounds extend the capacity of Open Targets Genetics to explore drug repositioning opportunities and shared genetic architecture.
@@ -35,8 +33,6 @@ This dataset is stored in the West US 2 Azure region. Allocating compute resourc
3533

3634
West US 2: `https://datasetopentargets.blob.core.windows.net/dataset`
3735

38-
[SAS Token](/azure/storage/common/storage-sas-overview): sv=2019-10-10&si=prod&sr=c&sig=9nzcxaQn0NprMPlSh4RhFQHcXedLQIcFgbERiooHEqM%3D
39-
4036

4137
## Use terms
4238

articles/open-datasets/dataset-panancestry-uk-bio-bank.md

Lines changed: 0 additions & 2 deletions
@@ -29,8 +29,6 @@ This dataset is stored in the East US Azure region. We recommend locating comput
2929

3030
East US: 'https://datasetpanukbb.blob.core.windows.net/dataset'
3131

32-
[SAS Token](/azure/storage/common/storage-sas-overview): ?sp=rl&st=2023-05-17T21:26:19Z&se=2050-05-18T05:26:19Z&spr=https&sv=2022-11-02&sr=c&sig=MGvVbVHbmkGKmWmfkHzpcaJEf5G0ljLnBQy6cbrmR%2FA%3D
33-
3432
## Use Terms
3533

3634
The GWAS results data produced by the Pan-UKB are available free of restrictions under the Creative Commons Attribution 4.0 International (CC BY 4.0). The team requests that you acknowledge and give attribution to both the Pan-UKB project and UK Biobank, and link back to the relevant page, wherever possible. Full terms of use can be found [here](https://pan.ukbb.broadinstitute.org/downloads)
