Commit 90a9cf7

Merge branch 'main' into main
2 parents 78de456 + 6789780 commit 90a9cf7

21 files changed (+437 -58 lines changed)

datasets/ai3.yaml

Lines changed: 37 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,37 @@
1+
Name: AI3 Protein-Ligand Binding Affinity Dataset
2+
Description: >
3+
The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications, but generating experimental data at scale is expensive and arduous. In domains such as medical diagnostics, where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. Teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions. The files consist of (i) 3D coordinates of the protein-ligand complexes (PDB format) in tar.gz archives and (ii) CSV files containing the energy data. Intended uses include (i) ML scoring functions for predicting binding affinities of given protein-ligand complexes, (ii) classification models for predicting correct binding poses of ligands, (iii) identification of cryptic binding pockets, and (iv) optimization of binding features by exploiting the individual energy components (experimental data provides only the total binding affinity). Existing AI/ML training datasets lack dynamic data and are inherently biased, and binding affinity data in the literature are obtained from different experimental protocols. This dataset was therefore generated with a single computational protocol: molecular dynamics (MD) simulations followed by free energy calculations.
The dynamic data-enriched protein-ligand coordinates can be used to train convolutional neural network-based regression models for more accurate binding affinity prediction.
4+
Documentation: https://github.com/devalab/AI3
5+
6+
ManagedBy: International Institute of Information Technology Hyderabad
7+
UpdateFrequency: Not updated
8+
Tags:
9+
- pharmaceutical
10+
- simulations
11+
- health
12+
- life sciences
13+
- machine learning
14+
- protein
15+
- molecular dynamics
16+
- aws-pds
17+
License: https://devalab.in/AI3.html
18+
Resources:
19+
- Description: The ai3data bucket includes coordinates and energetics for ~20,000 protein-ligand complexes. Its subfolders are Version1, Version2, and Version3. Version1 (10.4 GiB) contains the initial structure of each protein-ligand complex and the average binding affinities along with average energy components. Version2 (1.2 TiB) contains five trajectories of each protein-ligand complex (200 snapshots in all) with the two closest water molecules for each complex, and the time series of the binding affinities along with average energy components. Version3 (10.7 TiB) contains five trajectories of each completely solvated protein-ligand complex (200 snapshots in all), and the time series of binding affinities along with average energy components.
20+
ARN: arn:aws:s3:::ai3data
21+
Region: us-east-1
22+
Type: S3 Bucket
23+
DataAtWork:
24+
Tutorials:
25+
- Title: "AI3: Protein-Ligand Binding Affinity Dataset"
26+
URL: https://github.com/devalab/AI3
27+
AuthorName: Deva Priyakumar Lab
28+
AuthorURL: https://github.com/devalab
29+
Publications:
30+
- Title: "PLAS-5k: Dataset of Protein-Ligand Affinities from Molecular Dynamics for Machine Learning Applications"
31+
URL: https://www.nature.com/articles/s41597-022-01631-9
32+
AuthorName: U. Deva Priyakumar
33+
AuthorURL: https://devalab.in/
34+
- Title: "PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications"
35+
URL: https://www.nature.com/articles/s41597-023-02872-y
36+
AuthorName: U. Deva Priyakumar
37+
AuthorURL: https://devalab.in
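
The sizes quoted in the ai3data Resources description can be added up before planning a download. The following is only a rough sketch: it assumes the quoted figures are binary units (GiB/TiB) and that the 200 snapshots are split evenly across the five trajectories, which the entry does not state. The names `VERSION_SIZES_BYTES`, `total_tib`, and `snapshots_per_trajectory` are hypothetical helpers, not part of the dataset or its tooling.

```python
# Sizes as quoted in the Resources description (binary units assumed).
GIB = 1024 ** 3
TIB = 1024 ** 4

VERSION_SIZES_BYTES = {
    "Version1": 10.4 * GIB,
    "Version2": 1.2 * TIB,
    "Version3": 10.7 * TIB,
}


def total_tib(sizes):
    """Return the combined size of all versions in TiB."""
    return sum(sizes.values()) / TIB


def snapshots_per_trajectory(total_snapshots=200, trajectories=5):
    """Snapshots per trajectory, ASSUMING an even split (not stated in the entry)."""
    return total_snapshots // trajectories


if __name__ == "__main__":
    print(f"All versions: about {total_tib(VERSION_SIZES_BYTES):.2f} TiB")
    print(f"Snapshots per trajectory (even split assumed): {snapshots_per_trajectory()}")
```

Version3 dominates the total, so anyone who only needs the initial structures and average energies may prefer to fetch Version1 alone.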

datasets/aodn_radar_newcastle_velocity_hourly_averaged_delayed_qc.yaml

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Collabs:
2020
Tags:
2121
- oceans
2222
Tags:
23+
- aws-pds
2324
- oceans
2425
- ocean currents
2526
- ocean velocity

datasets/apex.yaml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
Name: APEX-CONNECTS
2+
Description: >
3+
The BRAIN Initiative Connectivity Across Scales (CONNECTS) program is working to create detailed maps of brain
4+
wiring across different species and scales, using advanced imaging technologies.
5+
APEX supports this effort by serving as a central hub that brings together and coordinates data and tools
6+
from research focused on brain connectivity in humans and animals. Together, these efforts aim to improve our
7+
understanding of how the brain is structured and functions.
8+
Documentation: https://brainlife.io
9+
10+
ManagedBy: "[Brainlife Team](https://brainlife.io/team/)"
11+
UpdateFrequency: New datasets are added monthly
12+
Tags:
13+
- neuroscience
14+
- neuroimaging
15+
- microscopy
16+
- life sciences
17+
- zarr
18+
- metadata
19+
- machine learning
20+
- infrastructure
21+
- json
22+
- imaging
23+
- brain images
24+
- brain models
25+
- analysis ready data
26+
- nifti
27+
- aws-pds
28+
License: '[CC BY](https://creativecommons.org/licenses/by/4.0)'
29+
Citation:
30+
Resources:
31+
- Description: All APEX datasets are available for download
32+
ARN: arn:aws:s3:::apex-connects
33+
Region: us-east-2
34+
Type: S3 Bucket
35+
DataAtWork:
36+
Tutorials:
37+
- Title: Brainlife AWS Tutorials
38+
URL: https://brainlife.io/docs/tutorial/aws-brainlife
39+
AuthorName: Brainlife
40+
AuthorURL: https://brainlife.io
41+
Tools & Applications:
42+
- Title: Brainlife Web App
43+
URL: https://brainlife.io
44+
AuthorName: Brainlife
45+
AuthorURL: https://brainlife.io
46+
- Title: Brainlife CLI (Command Line Interface)
47+
URL: https://github.com/brainlife/cli
48+
AuthorName: Brainlife
49+
AuthorURL: https://github.com/brainlife/cli
50+
Publications:

datasets/caladapt-wildfire-dataset.yaml

Lines changed: 2 additions & 1 deletion
@@ -12,6 +12,7 @@ Collabs:
1212
Tags:
1313
- climate
1414
Tags:
15+
- aws-pds
1516
- climate
1617
- climate model
1718
- climate projections
@@ -61,4 +62,4 @@ DataAtWork:
6162
AuthorName: "Cal-Adapt: Analytics Engine Team"
6263
AuthorURL: https://github.com/cal-adapt
6364
ADXCategories:
64-
- Environmental Data
65+
- Environmental Data
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
Name: Clinical Ultrasound Image Repository
2+
Description: Generic clinical ultrasound data from random subjects, acquired for clinical reasons, to be used for developing artificial intelligence applications. This dataset is complete, with 2000 studies from 2000 subjects (one third each from abdominal, cardiac, and OB/GYN cases).
3+
Documentation: https://clinical-ultrasound-image-repository.s3.amazonaws.com/index.html
4+
5+
ManagedBy: "[MONAI Development Team](https://github.com/Project-MONAI/MONAI)"
6+
UpdateFrequency: This is a static dataset; however, tutorials and resources will be updated as they are developed.
7+
Tags:
8+
- medicine
9+
- medical imaging
10+
- machine learning
11+
- life sciences
12+
- aws-pds
13+
License: "[CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)"
14+
Resources:
15+
- Description: Clinical Ultrasound Image Repository
16+
ARN: arn:aws:s3:::clinical-ultrasound-image-repository
17+
Region: us-west-2
18+
Type: S3 Bucket
19+
Explore:
20+
- "[Browse Bucket](https://clinical-ultrasound-image-repository.s3.amazonaws.com/download.html)"

datasets/cmas-data-warehouse.yaml

Lines changed: 6 additions & 0 deletions
@@ -73,6 +73,12 @@ Resources:
7373
Type: S3 Bucket
7474
Explore:
7575
- '[Browse Bucket](https://cmaq-12us4-cracmm3-modeling-platform-2023.s3.amazonaws.com/index.html)'
76+
- Description: CMAQ Model Versions 5.5 CRACMM2 Input Data (2022r1) -- 12/22/2021 - 12/31/2022 12km CONUS
77+
ARN: arn:aws:s3:::cmaq-12us1-cracmm2-modeling-platform-2022
78+
Region: us-east-1
79+
Type: S3 Bucket
80+
Explore:
81+
- '[Browse Bucket](https://cmaq-12us1-cracmm2-modeling-platform-2022.s3.amazonaws.com/index.html)'
7682
- Description: EPA 2022 Modeling Platform
7783
ARN: arn:aws:s3:::epa-2022-modeling-platform
7884
Region: us-east-1
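
S3 bucket ARNs in these entries take the form `arn:aws:s3:::bucket-name`, and the Browse Bucket links follow `https://bucket-name.s3.amazonaws.com/index.html`. A minimal sketch of a check that an ARN has this shape and agrees with its Explore link is shown below. The function names are hypothetical, and the bucket-name pattern is a simplified approximation of the S3 naming rules, not the registry's own validation.

```python
import re

# Simplified bucket-name pattern (assumption): 3-63 characters drawn from
# lowercase letters, digits, dots, and hyphens, starting and ending alphanumeric.
_S3_ARN = re.compile(r"^arn:aws:s3:::([a-z0-9][a-z0-9.-]{1,61}[a-z0-9])$")


def bucket_from_arn(arn):
    """Return the bucket name from an S3 bucket ARN, or raise ValueError."""
    match = _S3_ARN.match(arn.strip())
    if match is None:
        raise ValueError(f"not a valid S3 bucket ARN: {arn!r}")
    return match.group(1)


def browse_url(arn):
    """Build the index.html browse link in the style used by the Explore fields."""
    return f"https://{bucket_from_arn(arn)}.s3.amazonaws.com/index.html"


if __name__ == "__main__":
    print(browse_url("arn:aws:s3:::epa-2022-modeling-platform"))
```

An ARN with extra colons before the bucket name fails the pattern and raises `ValueError`, so a check like this could run over every Resources entry before a merge.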
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
Name: State of Colorado Elevation Data
2+
Description: The State of Colorado has gathered public historical elevation data.
3+
Documentation: https://docs.google.com/document/d/1HMO-d4cCrBvFa2F6-N3lhP6rkezlvBmSUFA5S8t_ekQ/edit?usp=sharing
4+
5+
ManagedBy: State of Colorado Governor's Office of Information Technology (OIT) GIS team
6+
UpdateFrequency: Periodically
7+
Tags:
8+
- aws-pds
9+
- geospatial
10+
- imaging
11+
- mapping
12+
License: https://creativecommons.org/publicdomain/zero/1.0/legalcode
13+
Resources:
14+
- Description: Colorado Elevation Data (LiDAR)
15+
ARN: arn:aws:s3:::colorado-public-elevation-data
16+
Region: us-west-2
17+
Type: S3 Bucket
18+
- Description: Notifications for new Colorado Elevation data
19+
ARN: arn:aws:sns:us-west-2:180294215083:colorado-public-elevation-data-object_created
20+
Region: us-west-2
21+
Type: SNS Topic
22+
DataAtWork:
23+
Tutorials:
24+
- Title: Colorado AWS Open Data Elevation Data Guide
25+
URL: https://docs.google.com/document/d/1pAHZB6SgSE4QTawEbSnIIHpxVCTBg-IjQ6X9KJP28BM/edit?usp=sharing
26+
AuthorName: State of Colorado OIT-GIS
27+
AuthorURL: https://geodata.colorado.gov/
28+
- Title: Colorado Public Elevation Data s3 Browser
29+
URL: https://colorado-public-elevation-data.s3.amazonaws.com/index.html
30+
AuthorName: State of Colorado OIT-GIS
31+
AuthorURL: https://geodata.colorado.gov/
32+
ADXCategories:
33+
- Public Sector Data
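
SNS topic resources carry the region inside the ARN (`arn:aws:sns:region:account-id:topic-name`) as well as in a separate Region field, so the two can drift apart. The sketch below splits such an ARN and compares its region with the declared one. `SnsTopic`, `parse_sns_arn`, and `region_matches` are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple


class SnsTopic(NamedTuple):
    region: str
    account_id: str
    name: str


def parse_sns_arn(arn):
    """Split an SNS topic ARN into region, account ID, and topic name."""
    parts = arn.strip().split(":")
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "sns"]:
        raise ValueError(f"not an SNS topic ARN: {arn!r}")
    region, account_id, name = parts[3], parts[4], parts[5]
    # AWS account IDs are 12-digit numbers.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"unexpected account ID in ARN: {arn!r}")
    if not region or not name:
        raise ValueError(f"missing region or topic name in ARN: {arn!r}")
    return SnsTopic(region, account_id, name)


def region_matches(arn, declared_region):
    """True when the region embedded in the ARN equals the entry's Region field."""
    return parse_sns_arn(arn).region == declared_region


if __name__ == "__main__":
    arn = "arn:aws:sns:us-west-2:180294215083:colorado-public-elevation-data-object_created"
    print(parse_sns_arn(arn), region_matches(arn, "us-west-2"))
```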
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
Name: Community coral reef image classification training data
2+
Description: "Community-sourced repository of coral reef image classification training data, including continually updated confirmed annotations from [MERMAID](https://datamermaid.org/)"
3+
Documentation: https://github.com/data-mermaid/image-classification-open-data
4+
5+
ManagedBy: "[MERMAID](https://datamermaid.org/)"
6+
UpdateFrequency: Each partner organization updates on its own cadence. MERMAID updates once per day.
7+
Tags:
8+
- aws-pds
9+
- coastal
10+
- conservation
11+
- coral reef
12+
- csv
13+
- global
14+
- machine learning
15+
- marine
16+
- parquet
17+
- survey
18+
License: "[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
19+
Resources:
20+
- Description: "The coral-reef-training AWS S3 bucket provides a single, open, well-structured, growing, community-sourced repository of coral reef image classification training data. Hosted at s3://coral-reef-training, this bucket supports global efforts in coral reef conservation through standardized, machine-learning-ready imagery and annotations.
21+
22+
The bucket serves as the image storage backend for MERMAID’s image classification workflows and to distribute confirmed and scrubbed MERMAID coral reef image data, but it also provides a shared location where partners including CoralNet can contribute to and benefit from collective ML model development, each according to its own data structures and policies. Data in the bucket is free and open for public access; only contributing organizations have write access to their own data prefixes.
23+
24+
By centralizing and standardizing coral reef image data, this initiative accelerates collaboration across scientific, conservation, and machine learning communities and facilitates the creation of a common, evolving image classification model for coral reefs worldwide."
25+
ARN: arn:aws:s3:::coral-reef-training
26+
Region: us-east-1
27+
Type: S3 Bucket
28+
Explore:
29+
- "[Browse Bucket](https://coral-reef-training.s3.amazonaws.com/index.html)"
30+
DataAtWork:
31+
Tutorials:
32+
- Title: MERMAID Image Classification Open Data Tutorial - Python version
33+
URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_Python.html
34+
AuthorName: Domazetoski V, Caldwell I
35+
AuthorURL: https://github.com/ViktorDomazetoski, https://github.com/ircaldwell
36+
- Title: MERMAID Image Classification Open Data Tutorial - R version
37+
URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_R.html
38+
AuthorName: Caldwell I
39+
AuthorURL: https://github.com/ircaldwell
40+
Tools & Applications:
41+
- Title: MERMAID Collect
42+
URL: https://app.datamermaid.org/
43+
AuthorName: MERMAID
44+
AuthorURL: https://datamermaid.org/
45+
- Title: MERMAID Explore
46+
URL: https://explore.datamermaid.org/
47+
AuthorName: MERMAID
48+
AuthorURL: https://datamermaid.org/

datasets/eai-essential-web-v1.yaml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
Name: 'Essential-Web v1.0: 24T tokens of organized web data'
2+
Description: A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.
3+
Documentation: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
4+
5+
ManagedBy: '[EssentialAI](https://www.essential.ai)'
6+
UpdateFrequency: Not updated
7+
Tags:
8+
- aws-pds
9+
- machine learning
10+
- natural language processing
11+
- web archive
12+
- text analysis
13+
License: 'Essential-Web-v1.0 contributions are made available under the [ODC attribution license](https://opendatacommons.org/licenses/by/odc_by_1.0_public_text.txt); however, users should also abide by the [Common Crawl - Terms of Use](https://commoncrawl.org/terms-of-use). We do not alter the license of any of the underlying data.'
14+
Resources:
15+
- Description: 'Essential-Web v1.0: 24T tokens of organized web data'
16+
ARN: arn:aws:s3:::essential-web-v1.0
17+
Region: us-west-2
18+
Type: S3 Bucket
19+
- Description: Notifications for new Essential-Web v1.0 data
20+
ARN: arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created
21+
Region: us-west-2
22+
Type: SNS Topic
23+
DataAtWork:
24+
Publications:
25+
- Title: 'Essential-Web v1.0: 24T tokens of organized web data'
26+
URL: https://arxiv.org/abs/2506.14111
27+
AuthorName: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar et al.
28+
AuthorURL: https://arxiv.org/abs/2506.14111

datasets/ibl-autism.yaml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
Name: IBL Neuropixels Brainwide Map on AWS
2+
Description: Electrophysiological recordings of mouse brain activity acquired during a decision-making task in multiple mouse models of autism.
3+
Documentation: https://docs.internationalbrainlab.org/notebooks_external/2025_data_release_autism_noel.html
4+
5+
ManagedBy: "[International Brain Laboratory](https://www.internationalbrainlab.com)"
6+
UpdateFrequency: TBD
7+
Tags:
8+
- aws-pds
9+
- life sciences
10+
- neuroscience
11+
- neurophysiology
12+
- open source software
13+
- Mus musculus
14+
- autism spectrum disorder
15+
License: CC-BY 4.0
16+
Resources:
17+
- Description: Project data in public bucket
18+
ARN: arn:aws:s3:::ibl-brain-wide-map-public
19+
Region: us-east-1
20+
Type: S3 Bucket
21+
DataAtWork:
22+
Tutorials:
23+
- Title: Intermediate Datasets and Analysis Code
24+
URL: https://osf.io/fap2s/ and https://osf.io/fap2s/wiki/home/
25+
AuthorName: Noel et al.
26+
- Title: Download the public data via ONE
27+
URL: https://docs.internationalbrainlab.org/notebooks_external/data_download.html
28+
AuthorName: IBL Data Architecture Working Group
29+
AuthorURL: https://github.com/orgs/int-brain-lab/teams/data-architecture-wg/members
30+
- Title: Find data associated with a release or publication
31+
URL: https://docs.internationalbrainlab.org/notebooks_external/data_download.html#Find-data-associated-with-a-release-or-publication
32+
AuthorName: IBL Data Architecture Working Group
33+
AuthorURL: https://github.com/orgs/int-brain-lab/teams/data-architecture-wg/members
34+
- Title: Loading Data
35+
URL: https://docs.internationalbrainlab.org/loading_examples.html
36+
AuthorName: IBL Data Architecture Working Group
37+
AuthorURL: https://github.com/orgs/int-brain-lab/teams/data-architecture-wg/members
38+
Publications:
39+
- Title: A common computational and neural anomaly across mouse models of autism
40+
URL: https://doi.org/10.1038/s41593-025-01965-8
41+
AuthorName: Noel et al.
