awslabs
diff --git a/datasets/ai3.yaml b/datasets/ai3.yaml
Lines changed: 37 additions & 0 deletions
diff --git a/datasets/apex.yaml b/datasets/apex.yaml
Lines changed: 50 additions & 0 deletions
diff --git a/datasets/caladapt-wildfire-dataset.yaml b/datasets/caladapt-wildfire-dataset.yaml
Lines changed: 2 additions & 1 deletion
diff --git a/datasets/clinical-ultrasound-image-data.yaml b/datasets/clinical-ultrasound-image-data.yaml
Lines changed: 20 additions & 0 deletions
diff --git a/datasets/cmas-data-warehouse.yaml b/datasets/cmas-data-warehouse.yaml
Lines changed: 6 additions & 0 deletions
diff --git a/datasets/coralreef-image-classification-training.yaml b/datasets/coralreef-image-classification-training.yaml
Lines changed: 48 additions & 0 deletions
diff --git a/datasets/deepdrug-dpeb.yaml b/datasets/deepdrug-dpeb.yaml
Lines changed: 38 additions & 0 deletions
diff --git a/datasets/eai-essential-web-v1.yaml b/datasets/eai-essential-web-v1.yaml
Lines changed: 28 additions & 0 deletions
diff --git a/datasets/mosaic.yaml b/datasets/mosaic.yaml
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,37 @@
+Name: AI3 Protein-Ligand Binding Affinity Dataset
+Description: >
+  The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications. Generating experimental data on a large scale is an expensive and arduous process. In domains such as medical diagnostics, where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. We, teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine, have performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions. The file formats are (i) 3D coordinates of the protein-ligand complexes (pdb) in tar.gz files and (ii) CSV files containing the energy data. Intended uses include (i) ML scoring functions for predicting binding affinities of given protein-ligand complexes, (ii) classification models for predicting correct binding poses of ligands, (iii) identification of cryptic binding pockets, and (iv) optimization of binding features by exploiting the individual components of the energy (experimental data has only the total binding affinity). Existing AI/ML training datasets lack dynamic data and are inherently biased, and the binding affinity data in the literature are obtained from different experimental protocols. This dataset was therefore created using the same computational protocol throughout, with free energy calculations based on molecular dynamics (MD) simulations. The dynamic data-enriched protein-ligand coordinates can be used to effectively train convolutional neural network-based regression models for more accurate binding affinity prediction.
+Documentation: https://github.com/devalab/AI3
+Contact: [email protected]
+ManagedBy: International Institute of Information Technology Hyderabad
+UpdateFrequency: Not updated
+Tags:
+  - pharmaceutical
+  - simulations
+  - health
+  - life sciences
+  - machine learning
+  - protein
+  - molecular dynamics
+  - aws-pds
+License: https://devalab.in/AI3.html
+Resources:
+  - Description: The ai3data bucket includes coordinates and energetics for ~20,000 protein-ligand complexes. Its subfolders are Version 1, Version2 and Version 3. Version1 (10.4 GiB in total) contains the initial structure of each protein-ligand complex and the average binding affinities along with average energy components. Version2 (1.2 TiB in total) contains five trajectories of each protein-ligand complex (200 snapshots in all) with the closest two water molecules, and the time series of the binding affinities along with average energy components. Version3 (10.7 TiB in total) contains five trajectories of each completely solvated protein-ligand complex (200 snapshots in all), and the time series of binding affinities along with average energy components.
+    ARN: arn:aws:s3:::ai3data
+    Region: us-east-1
+    Type: S3 Bucket
+DataAtWork:
+  Tutorials:
+    - Title: "AI3: Protein-Ligand Binding Affinity Dataset"
+      URL: https://github.com/devalab/AI3
+      AuthorName: Deva Priyakumar Lab
+      AuthorURL: https://github.com/devalab  
+  Publications:
+    - Title: "PLAS-5k: Dataset of Protein-Ligand Affinities from Molecular Dynamics for Machine Learning Applications"
+      URL: https://www.nature.com/articles/s41597-022-01631-9
+      AuthorName: U. Deva Priyakumar
+      AuthorURL: https://devalab.in/
+    - Title: "PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications"
+      URL: https://www.nature.com/articles/s41597-023-02872-y
+      AuthorName: U. Deva Priyakumar
+      AuthorURL: https://devalab.in
@@ -0,0 +1,50 @@
+Name: APEX-CONNECTS
+Description: >
+  The BRAIN Initiative Connectivity Across Scales (CONNECTS) program is working to create detailed maps of brain 
+  wiring across different species and scales, using advanced imaging technologies. 
+  APEX supports this effort by serving as a central hub that brings together and coordinates data and tools 
+  from research focused on brain connectivity in humans and animals. Together, these efforts aim to improve our 
+  understanding of how the brain is structured and functions.
+Documentation: https://brainlife.io
+Contact: [email protected]
+ManagedBy: "[Brainlife Team](https://brainlife.io/team/)"
+UpdateFrequency: New datasets are added monthly
+Tags:
+  - neuroscience
+  - neuroimaging
+  - microscopy
+  - life sciences
+  - zarr
+  - metadata
+  - machine learning
+  - infrastructure
+  - json
+  - imaging
+  - brain images
+  - brain models
+  - analysis ready data
+  - nifti
+  - aws-pds
+License: '[CC BY](https://creativecommons.org/licenses/by/4.0)'
+Citation:
+Resources:
+  - Description: All APEX datasets are available for download
+    ARN: arn:aws:s3:::apex-connects
+    Region: us-east-2
+    Type: S3 Bucket
+DataAtWork:
+  Tutorials:
+    - Title: Brainlife AWS Tutorials
+      URL: https://brainlife.io/docs/tutorial/aws-brainlife
+      AuthorName: Brainlife
+      AuthorURL: https://brainlife.io
+  Tools & Applications:
+    - Title: Brainlife Web App
+      URL: https://brainlife.io
+      AuthorName: Brainlife
+      AuthorURL: https://brainlife.io
+    - Title: Brainlife CLI (Command Line Interface)
+      URL: https://github.com/brainlife/cli
+      AuthorName: Brainlife
+      AuthorURL: https://github.com/brainlife/cli
+  Publications:
@@ -12,6 +12,7 @@ Collabs:
     Tags:
       - climate
 Tags:
+  - aws-pds
   - climate
   - climate model
   - climate projections
@@ -61,4 +62,4 @@ DataAtWork:
       AuthorName: "Cal-Adapt: Analytics Engine Team"
       AuthorURL: https://github.com/cal-adapt
 ADXCategories:
-  - Environmental Data
+  - Environmental Data
@@ -0,0 +1,20 @@
+Name: Clinical Ultrasound Image Repository
+Description: Generic clinical ultrasound data from random subjects, acquired for clinical reasons, to be used for developing artificial intelligence applications. The dataset is complete, with 2000 studies from 2000 subjects (one third each from abdominal, cardiac, and OB/GYN cases).
+Documentation: https://clinical-ultrasound-image-repository.s3.amazonaws.com/index.html
+Contact: [email protected]
+ManagedBy: "[MONAI Development Team](https://github.com/Project-MONAI/MONAI)"
+UpdateFrequency: This is a static dataset; however, tutorials and resources will be updated as they are developed.
+Tags:
+  - medicine
+  - medical imaging
+  - machine learning
+  - life sciences
+  - aws-pds
+License: "[CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)"
+Resources:
+  - Description: Clinical Ultrasound Image Repository
+    ARN: arn:aws:s3:::clinical-ultrasound-image-repository
+    Region: us-west-2
+    Type: S3 Bucket
+    Explore:
+    - "[Browse Bucket](https://clinical-ultrasound-image-repository.s3.amazonaws.com/download.html)"
@@ -73,6 +73,12 @@ Resources:
     Type: S3 Bucket
     Explore:
     - '[Browse Bucket](https://cmaq-12us4-cracmm3-modeling-platform-2023.s3.amazonaws.com/index.html)'
+  - Description: CMAQ Model Versions 5.5 CRACMM2 Input Data (2022r1) -- 12/22/2021 - 12/31/2022 12km CONUS
+    ARN: arn:aws:s3:::cmaq-12us1-cracmm2-modeling-platform-2022
+    Region: us-east-1
+    Type: S3 Bucket
+    Explore:
+    - '[Browse Bucket](https://cmaq-12us1-cracmm2-modeling-platform-2022.s3.amazonaws.com/index.html)'
   - Description: EPA 2022 Modeling Platform
     ARN: arn:aws:s3:::epa-2022-modeling-platform
     Region: us-east-1
 
@@ -0,0 +1,48 @@
+Name: Community coral reef image classification training data
+Description: "Community-sourced repository of coral reef image classification training data, including continually updated confirmed annotations from [MERMAID](https://datamermaid.org/)"
+Documentation: https://github.com/data-mermaid/image-classification-open-data
+Contact: [email protected]
+ManagedBy: "[MERMAID](https://datamermaid.org/)"
+UpdateFrequency: Each partner organization updates on their own cadence. MERMAID updates once per day.
+Tags:
+  - aws-pds
+  - coastal
+  - conservation
+  - coral reef
+  - csv
+  - global
+  - machine learning
+  - marine
+  - parquet
+  - survey
+License: "[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
+Resources:
+  - Description: "The coral-reef-training AWS S3 bucket provides a single, open, well-structured, growing, community-sourced repository of coral reef image classification training data. Hosted at s3://coral-reef-training, this bucket supports global efforts in coral reef conservation through standardized, machine-learning-ready imagery and annotations.
+
+The bucket serves as the image storage backend for MERMAID’s image classification workflows and to distribute confirmed and scrubbed MERMAID coral reef image data, but it also provides a shared location where partners including CoralNet can contribute to and benefit from collective ML model development, each according to its own data structures and policies. Data in the bucket is free and open for public access; only contributing organizations have write access to their own data prefixes.
+
+By centralizing and standardizing coral reef image data, this initiative accelerates collaboration across scientific, conservation, and machine learning communities and facilitates the creation of a common, evolving image classification model for coral reefs worldwide."
+    ARN: arn:aws:s3:::coral-reef-training
+    Region: us-east-1
+    Type: S3 Bucket
+    Explore:
+    - "[Browse Bucket](https://coral-reef-training.s3.amazonaws.com/index.html)"
+DataAtWork:
+  Tutorials:
+    - Title: MERMAID Image Classification Open Data Tutorial - Python version
+      URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_Python.html
+      AuthorName: Domazetoski V, Caldwell I
+      AuthorURL: https://github.com/ViktorDomazetoski, https://github.com/ircaldwell
+    - Title: MERMAID Image Classification Open Data Tutorial - R version
+      URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_R.html
+      AuthorName: Caldwell I
+      AuthorURL: https://github.com/ircaldwell
+  Tools & Applications:
+    - Title: MERMAID Collect
+      URL: https://app.datamermaid.org/
+      AuthorName: MERMAID
+      AuthorURL: https://datamermaid.org/
+    - Title: MERMAID Explore
+      URL: https://explore.datamermaid.org/
+      AuthorName: MERMAID
+      AuthorURL: https://datamermaid.org/
@@ -0,0 +1,38 @@
+Name: DeepDrug Protein Embeddings Bank (DPEB)
+Description: DPEB is a multimodal database of human protein embeddings integrating four biologically complementary representations—AlphaFold2, BioEmbeddings, ESM-2, and ProtVec—designed for enhanced protein-protein interaction prediction and functional classification.
+Documentation: https://github.com/deepdrugai/DPEB
+Contact: https://github.com/deepdrugai/DPEB/issues
+ManagedBy: "Louisiana State University"
+UpdateFrequency: Initial release; maintained for at least 2 years with updates planned based on new embedding models and protein coverage.
+Tags:
+  - bioinformatics
+  - protein
+  - structural biology
+  - machine learning
+  - life sciences
+  - aws-pds
+License: MIT
+Citation: "Sajol MSI et al. DeepDrug Protein Embeddings Bank (DPEB) was accessed on [DATE] at https://registry.opendata.aws/dpeb"
+Resources:
+  - Description: Multimodal human protein embeddings (AlphaFold2, BioEmbeddings, ESM-2, ProtVec) with JSONL-formatted metadata containing FASTA, UniProt IDs, and embeddings.
+    ARN: arn:aws:s3:::deepdrug-dpeb-human-protein-embeddings
+    Region: us-east-1
+    Type: S3 Bucket
+DataAtWork:
+  Tutorials:
+    - Title: Aggregating and Clustering AlphaFold2 Embeddings from DPEB
+      URL: https://github.com/deepdrugai/DPEB/tree/main
+      AuthorName: Md. Saiful Islam Sajol
+      AuthorURL: https://github.com/deepdrugai
+  Tools & Applications:
+    - Title: DPEB Explorer Tool
+      URL: https://github.com/deepdrugai/DPEB
+      AuthorName: DeepDrug Lab
+      AuthorURL: https://github.com/deepdrugai
+  Publications:
+    - Title: A Multimodal Human Protein Embeddings Database - DeepDrug Protein Embeddings Bank (DPEB)
+      URL: https://doi.org/10.XXXX/nar.dpeb2025
+      AuthorName: Sajol MSI, Rajasekaran M, Bess A, Alvin C, Mukhopadhyay S
+      AuthorURL: https://github.com/deepdrugai/DPEB
+ADXCategories:
+  - Healthcare & Life Sciences Data
@@ -0,0 +1,28 @@
+Name: 'Essential-Web v1.0: 24T tokens of organized web data'
+Description: A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.
+Documentation: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
+Contact: [email protected]
+ManagedBy: '[EssentialAI](https://www.essential.ai)'
+UpdateFrequency: Not updated
+Tags:
+  - aws-pds
+  - machine learning
+  - natural language processing
+  - web archive
+  - text analysis
+License: 'Essential-Web-v1.0 contributions are made available under the [ODC attribution license](https://opendatacommons.org/licenses/by/odc_by_1.0_public_text.txt); however, users should also abide by the [Common Crawl - Terms of Use](https://commoncrawl.org/terms-of-use). We do not alter the license of any of the underlying data.'
+Resources:
+  - Description: 'Essential-Web v1.0: 24T tokens of organized web data'
+    ARN: arn:aws:s3:::essential-web-v1.0
+    Region: us-west-2
+    Type: S3 Bucket
+  - Description: Notifications for new Essential-Web v1.0 data
+    ARN: arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created
+    Region: us-west-2
+    Type: SNS Topic
+DataAtWork:
+  Publications:
+    - Title: 'Essential-Web v1.0: 24T tokens of organized web data'
+      URL: https://arxiv.org/abs/2506.14111
+      AuthorName: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar et al.
+      AuthorURL: https://arxiv.org/abs/2506.14111
@@ -0,0 +1,31 @@
+Name: Meta-Organized Stimuli And fMRI Imaging data for Computational modeling (MOSAIC)
+Description: This extensible dataset, MOSAIC, aggregates individual functional magnetic resonance imaging (fMRI) datasets by leveraging a shared preprocessing pipeline and stimulus curation procedure. This dataset aggregation procedure achieves the scale necessary for neural network training and the diversity needed for generalizable results.
+Documentation: https://github.com/blahner/mosaic-preprocessing
+Contact: [email protected]
+ManagedBy: Massachusetts Institute of Technology, Georgia Tech
+UpdateFrequency: New data is uploaded as researchers preprocess their fMRI data according to the MOSAIC format and submit it.
+Tags:
+  - aws-pds
+  - brain images
+  - brain models
+  - hdf5
+  - neuroimaging
+  - neuroscience
+  - machine learning
+License: CC BY 4.0
+Citation:
+Resources:
+  - Description: HDF5 files containing preprocessed fMRI data
+    ARN: arn:aws:s3:::mosaicfmri
+    Region: us-west-2
+    Type: S3 Bucket
+    Explore:
+    - '[Browse Bucket](https://mosaicfmri.s3.amazonaws.com/index.html)'
+DataAtWork:
+  Tutorials:
+    - Title: Load HDF5 file (Jupyter notebook)
+      URL: https://github.com/blahner/mosaic-preprocessing/blob/main/src/fmriDatasetPreparation/create_hdf5/load_hdf5.ipynb
+      NotebookURL: https://github.com/blahner/mosaic-preprocessing/blob/main/src/fmriDatasetPreparation/create_hdf5/load_hdf5.ipynb
+      AuthorName: Benjamin Lahner
+ADXCategories:
+  - Healthcare & Life Sciences Data
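
Review note: each `Resources` entry in the files above pairs an `ARN` with a `Type` (`S3 Bucket` or `SNS Topic`). The sketch below shows one way such pairs could be sanity-checked before merging. It is only an illustration under stated assumptions: `check_resource` is a hypothetical helper, not part of the registry's tooling, and the two patterns are simplified approximations of the S3 bucket and SNS topic ARN formats rather than a complete statement of AWS naming rules.

```python
import re

# Assumed, simplified patterns (not the registry's own validation rules).
# S3 bucket ARNs leave region and account empty: arn:aws:s3:::<bucket>
S3_BUCKET_ARN = re.compile(r"arn:aws:s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
# SNS topic ARNs carry a region and a 12-digit account id.
SNS_TOPIC_ARN = re.compile(r"arn:aws:sns:[a-z0-9-]+:\d{12}:[A-Za-z0-9_-]{1,256}")

PATTERNS = {"S3 Bucket": S3_BUCKET_ARN, "SNS Topic": SNS_TOPIC_ARN}


def check_resource(arn, resource_type):
    """Return a list of problems; an empty list means the ARN looks well formed."""
    pattern = PATTERNS.get(resource_type)
    if pattern is None:
        return [f"unhandled resource type: {resource_type}"]
    # Strip surrounding whitespace so stray spaces after 'ARN:' are tolerated.
    if pattern.fullmatch(arn.strip()) is None:
        return [f"malformed {resource_type} ARN: {arn!r}"]
    return []


if __name__ == "__main__":
    samples = [
        ("arn:aws:s3:::ai3data", "S3 Bucket"),
        # Hypothetical malformed value with too many colons.
        ("arn:aws:s3::::::example-bucket", "S3 Bucket"),
    ]
    for arn, resource_type in samples:
        print(arn, check_resource(arn, resource_type))
```

A check like this only catches structural slips such as extra colons or a missing account id; it does not confirm that the bucket or topic exists or is publicly readable.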