Merge pull request #4675 from mamtagiri/main

denrea · web-flow · commit 51c94deb9613 · 2025-05-09T10:50:12.000-07:00
[SCOPED] dataset change notice
diff --git a/articles/open-datasets/dataset-1000-genomes.md b/articles/open-datasets/dataset-1000-genomes.md
@@ -8,6 +8,7 @@ ms.date: 07/10/2024
 ---
 
 # 1000 Genomes
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
 
 The 1000 Genomes Project ran between 2008 and 2015, to create the largest public catalog of human variation and genotype data. The final data set contains data for 2,504 individuals from 26 populations and 84 million identified variants. For more information, visit the 1000 Genome Project [website](https://www.internationalgenome.org/) and these publications:
 
diff --git a/articles/open-datasets/dataset-clinvar-annotations.md b/articles/open-datasets/dataset-clinvar-annotations.md
@@ -9,6 +9,8 @@ ms.date: 06/13/2024
 
 # ClinVar Annotations
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) resource is a freely accessible, public archive of reports - with supporting evidence - about the relationships among human variations and phenotypes. It facilitates access to and communication about the claimed relationships between human variation and observed health status, and about the history of that interpretation. It provides access to a broader set of clinical interpretations that researchers can incorporate into genomics workflows and applications.
 
 Visit the [Data Dictionary](https://www.ncbi.nlm.nih.gov/projects/clinvar/ClinVarDataDictionary.pdf) and the [FAQ resource](https://www.ncbi.nlm.nih.gov/clinvar/docs/faq/) for more information about the data.
diff --git a/articles/open-datasets/dataset-encode.md b/articles/open-datasets/dataset-encode.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # ENCODE: Encyclopedia of DNA Elements
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The [Encyclopedia of DNA Elements (ENCODE) Consortium](https://www.encodeproject.org/help/project-overview/) is an ongoing international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). ENCODE's goal is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
 
 ENCODE investigators employ various assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a diverse range of RNA sources, comparative genomics, integrative bioinformatic methods, and human curation. Regulatory elements are typically investigated through DNA hypersensitivity assays, assays of DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, that is, modified histones, transcription factors, chromatin regulators, and RNA-binding proteins, followed by sequencing.
diff --git a/articles/open-datasets/dataset-gatk-resource-bundle.md b/articles/open-datasets/dataset-gatk-resource-bundle.md
@@ -7,7 +7,7 @@ ms.date: 04/16/2021
 ---
 
 # GATK Resource Bundle
-
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
 
 The [GATK resource bundle](https://gatk.broadinstitute.org/hc/articles/360035890811-Resource-bundle) is a collection of standard files for working with human resequencing data with the GATK.
 
diff --git a/articles/open-datasets/dataset-gnomad.md b/articles/open-datasets/dataset-gnomad.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # Genome Aggregation Database (gnomAD)
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The [Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org/) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
 
 [!INCLUDE [Open Dataset usage notice](./includes/open-datasets-usage-note.md)]
diff --git a/articles/open-datasets/dataset-human-reference-genomes.md b/articles/open-datasets/dataset-human-reference-genomes.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # Human Reference Genomes
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 This dataset includes two human-genome references assembled by the [Genome Reference Consortium](https://www.ncbi.nlm.nih.gov/grc): Hg19 and Hg38.
 
 For more information on Hg19 (GRCh37) data, see the [GRCh37 report at NCBI](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/).
@@ -34,7 +36,7 @@ This dataset contains approximately 10 GB of data and is updated daily.
 
 ## Storage location
 
-This dataset is stored in the West US 2, West Central US and South Central US Azure regions. Allocating compute resources in West US 2 or West Central US or South Central US is recommended for affinity.
+This dataset is stored in the West US 2, West Central US, and South Central US Azure regions. For affinity, we recommend allocating compute resources in West US 2, West Central US, or South Central US.
 
 ## Data Access
 
@@ -63,11 +65,11 @@ For any questions or feedback about this dataset, contact the [Genome Reference
 
 ## Getting the Reference Genomes from Azure Open Datasets
 
-Several public genomics data has been uploaded as an Azure Open Dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service linked to this open dataset. You can find examples of data calling procedure from Azure Open Datasets for `Reference Genomes` dataset in below:
+Several public genomics datasets are available as Azure Open Datasets [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service linked to this open dataset. The following examples show how to call data from Azure Open Datasets for the `Reference Genomes` dataset:
 
 Users can call and download the following path with this notebook: 'https://datasetreferencegenomes.blob.core.windows.net/dataset/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/genomic_regions_definitions.txt'
 
-**Important note:** Users need to log in their Azure Account via Azure CLI for viewing the data with Azure ML SDK. On the other hand, they do not need do any actions for downloading the data.
+**Important note:** Users need to log in to their Azure account via the Azure CLI to view the data with the Azure ML SDK. However, they don't need to take any action to download the data.
 
 [Install the Azure CLI](/cli/azure/install-azure-cli).
 
diff --git a/articles/open-datasets/dataset-illumina-platinum-genomes.md b/articles/open-datasets/dataset-illumina-platinum-genomes.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # Illumina Platinum Genomes
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 Whole-genome sequencing is enabling researchers worldwide to characterize the human genome more fully and accurately. This effort requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes as a benchmark. Illumina generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree. Illumina called variants in each genome using a range of currently available algorithms.
 
 For more information on the data, see the official [Illumina site](https://www.illumina.com/platinumgenomes.html).
@@ -32,6 +34,8 @@ West US 2: 'https://datasetplatinumgenomes.blob.core.windows.net/dataset'
 
 West Central US: 'https://datasetplatinumgenomes-secondary.blob.core.windows.net/dataset'
 
+[SAS Token](/azure/storage/common/storage-sas-overview): sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=FFfZ0QaDcnEPQmWsshtpoYOjbzd4jtwIWeK%2Fc4i9MqM%3D
+
 ## Use Terms
 
 Data is available without restrictions. For more information and citation details, see the [official Illumina site](https://www.illumina.com/platinumgenomes.html).
@@ -51,7 +55,7 @@ For any questions or feedback about the dataset, contact platinumgenomes@illumin
 
 ## Getting the Illumina Platinum Genomes from Azure Open Datasets and Doing Initial Analysis 
 
-Use Jupyter notebooks, GATK, and Picard in analyses such as:
+Use Jupyter notebooks, GATK, and Picard to complete the following tasks:
 
 1. Annotate genotypes using VariantFiltration
 2. Select Specific Variants
@@ -73,7 +77,7 @@ This notebook requires the following libraries:
 
 ## Getting the Genomics data from Azure Open Datasets
 
-Several public genomics data has been uploaded as an Azure Open Dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service linked to this open dataset. You can find examples of data calling procedure from Azure Open Dataset for `Illumina Platinum Genomes` datasets as:
+Several public genomics datasets are available as Azure Open Datasets [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service linked to this open dataset. The following examples show how to call data from Azure Open Datasets for the `Illumina Platinum Genomes` dataset:
 
 ### Downloading the specific 'Illumina Platinum Genomes'
 
@@ -106,7 +110,7 @@ There are many different options for selecting subsets of variants from a larger
 Extract one or more samples from a call set based on either a complete sample name or a pattern match.
 Specify criteria for inclusion that place thresholds on annotation values, **for example "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25)**. These criteria are written as "JEXL expressions", which are documented in the article about using JEXL expressions.
 Provide concordance or discordance tracks in order to include or exclude variants that are also present in other given call sets.
-Select variants based on criteria like their type (for example, INDELs only), evidence of mendelian violation, filtering status, allelicity, etc.
+Select variants based on criteria like their type (for example, INDELs only), evidence of Mendelian violation, filtering status, allelicity, etc.
 There are also several options for recording the original values of certain annotations, which are recalculated when one subsets the new call set, trims alleles, etc.
 
 Input: A variant call set in VCF format from which a subset can be selected.
@@ -121,7 +125,7 @@ run gatk SelectVariants -R Homo_sapiens_assembly38.fasta -V outputannot.vcf --se
 
 Running SelectVariants with --set-filtered-gt-to-nocall will further transform the flagged genotypes with a null genotype call. 
 
-This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.
+This conversion is necessary because downstream tools don't parse the FORMAT-level filter field.
 
 How can we filter the variants with **'No call'**
 
@@ -160,7 +164,7 @@ Extract fields from a VCF file to a tab-delimited table. This tool extracts spec
 
 INFO/site-level fields:
 
-Use the `-F` argument to extract INFO fields; each field occupies a single column in the output file. The field can be any standard VCF column (for example, CHROM, ID, QUAL) or any annotation name in the INFO field (for example, AC, AF). The tool also supports the following fields:
+Use the `-F` argument to extract INFO fields; each field will occupy a single column in the output file. The field can be any standard VCF column (for example, CHROM, ID, QUAL) or any annotation name in the INFO field (for example, AC, AF). The tool also supports the following fields:
 
 EVENTLENGTH (length of the event)
 TRANSITION (1 for a bi-allelic transition (SNP), 0 for bi-allelic transversion (SNP), -1 for INDELs and multi-allelics)
diff --git a/articles/open-datasets/dataset-immunecode.md b/articles/open-datasets/dataset-immunecode.md
@@ -8,6 +8,8 @@ ms.date: 11/09/2023
 
 # ImmuneCODE database
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The ImmuneCODE™ database, which includes hundreds of millions of T-cell Receptor (TCR) sequences from over 1,400 subjects exposed to or infected with the SARS-CoV-2 virus, and over 160,000 high-confidence SARS-CoV-2-specific TCRs. 
 The database is accessible at no cost. Its data can be analyzed to aid global initiatives aimed at comprehending the immune response to the SARS-CoV-2 virus and crafting novel interventions. To learn more about the dataset refer the associated [publication.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7418738/)
 
diff --git a/articles/open-datasets/dataset-open-cravat.md b/articles/open-datasets/dataset-open-cravat.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # OpenCravat: Open Custom Ranked Analysis of Variants Toolkit
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 OpenCRAVAT is a Python package that performs genomic variant interpretation including variant impact, annotation, and scoring. OpenCRAVAT has a modular architecture with a wide variety of analysis modules and annotation resources that can be selected and installed/run based on the needs of a given study.
 
 For more information on the data, see the [OpenCravat](https://opencravat.org/).
diff --git a/articles/open-datasets/dataset-open-targets.md b/articles/open-datasets/dataset-open-targets.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # Open Targets
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The Open Targets Platform is a data resource to facilitate the systematic identification and prioritization of potential therapeutic drug targets. This resource integrates publicly available datasets, including those datasets that are generated by the Open Targets consortium, to build and score target-disease associations, aiding in the identification and prioritization of drug targets. Additionally, it incorporates pertinent annotation information about targets, diseases, phenotypes, drugs, and their key relationships.
 
 The Open Targets Genetics highlights variant-centric statistical evidence to allow both prioritization of candidate causal variants at trait-associated loci and identification of potential drug targets. It collects and combines genetic associations gathered from published literature as well as newly derived data from sources like UK Biobank and FinnGen. Additionally, it includes functional genomics information such as chromatin conformation and interactions, along with quantitative trait loci (eQTLs, pQTLs, and sQTLs). Large-scale pipelines apply statistical fine-mapping across thousands of trait-associated loci to resolve association signals and link each variant to its proximal and distal target genes using a 'Locus2Gene' assessment. Integrated cross-trait colocalisation analyses and linking to detailed pharmaceutical compounds extend the capacity of Open Targets Genetics to explore drug repositioning opportunities and shared genetic architecture.
diff --git a/articles/open-datasets/dataset-panancestry-uk-bio-bank.md b/articles/open-datasets/dataset-panancestry-uk-bio-bank.md
@@ -9,6 +9,8 @@ ms.date: 05/17/2023
 
 # Pan UK-Biobank: Pan-ancestry genetic analysis of the UK Biobank
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The [Pan-ancestry genetic analysis of the UK Biobank(Pan-UKBB)](https://pan.ukbb.broadinstitute.org) is a resource to researchers that promotes more inclusive research practices, accelerates scientific discoveries, and improves the health of all people equitably. In genetics research, it's statistically necessary to study groups of individuals together with similar ancestries. In practice, this method has meant that most previous research has excluded individuals with non-European ancestries. The Pan-ancestry of UK-biobank is a resource using one of the most widely accessed sources of genetic data, the UK Biobank, in a manner that is more inclusive than most previous efforts--namely studying groups of individuals with diverse ancestries. The results of this research have many important limitations, which should be carefully considered when researchers use this resource in their work and when they and others interpret subsequent findings.
 
 [!INCLUDE [Open Dataset usage notice](./includes/open-datasets-usage-note.md)]
diff --git a/articles/open-datasets/dataset-snpeff.md b/articles/open-datasets/dataset-snpeff.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021
 
 # SnpEff: Genomic variant annotations and functional effect prediction toolbox
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 [SnpEff](https://pcingola.github.io/SnpEff/) Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).
 
 For more information on the data, see the [User Manual](https://pcingola.github.io/SnpEff/snpeff/introduction/).
diff --git a/articles/open-datasets/dataset-the-cancer-genome-atlas.md b/articles/open-datasets/dataset-the-cancer-genome-atlas.md
@@ -12,6 +12,8 @@ ms.date: 09/22/2022
 
 # TCGA Open Data
 
+[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-change-notice.md)]
+
 The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types[[1]](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The TCGA cancer data made available publically are two tiers: open or controlled access. 
 
 - Open access [available on Azure]: This dataset contains deindentified clinical and biospecimen data or summarized data that doesn't contain any individually identifiable information. The data types included are Gene expression, methylation beta values and protein quantification. DNA level datatype includes gene level copy number and masked copy number segment.
diff --git a/articles/open-datasets/includes/open-datasets-change-notice.md b/articles/open-datasets/includes/open-datasets-change-notice.md
@@ -0,0 +1,14 @@
+---
+author: mamtagiri
+ms.service: azure-open-datasets
+ms.topic: include
+ms.date: 05/08/2025
+ms.author: mamtagiri
+---
+> [!NOTE]
+> Important Update May 2025:
+> Dear Community,
+> We’d like to inform you of an upcoming change regarding the Genomics open datasets currently available through Azure.
+> After careful consideration, we decided to shift our focus to new initiatives that will better serve our community and align with our long-term goals. As such, access to the Genomics open datasets on Azure will be deprecated in the coming months.
+> We understand these datasets were valuable for research, development, and learning, and we deeply appreciate the contributions and engagement from our community over time.
+> Thank you for your understanding and support.