Commit ff43ff7

Merge branch 'main' into main
2 parents a787460 + 8eb570e commit ff43ff7

16 files changed

+403
-39
lines changed

datasets/ai3.yaml

Lines changed: 37 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,37 @@
1+
Name: AI3 Protein-Ligand Binding Affinity Dataset
2+
Description: >
3+
The rapid advancement of computing technologies, particularly artificial intelligence (AI), has transformed many domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications, but generating experimental data at scale is expensive and arduous. In domains such as medical diagnostics, where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. Teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine have performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions. The file formats are (i) 3D coordinates of the protein-ligand complexes (PDB) in tar.gz files, and (ii) CSV files containing the energy data. Intended uses include (i) ML scoring functions for predicting binding affinities of given protein-ligand complexes, (ii) classification models for predicting correct binding poses of ligands, (iii) identification of cryptic binding pockets, and (iv) optimization of binding features by exploiting the individual energy components (experimental data provides only the total binding affinity). Existing AI/ML training datasets lack dynamic data and are inherently biased, and binding affinity data in the literature are obtained from different experimental protocols. This dataset was therefore created using the same computational protocol throughout, with free energy calculations based on molecular dynamics (MD) simulations.
The dynamic data-enriched protein-ligand coordinates can be used to train convolutional neural network-based regression models for more accurate binding affinity prediction.
4+
Documentation: https://github.com/devalab/AI3
5+
6+
ManagedBy: International Institute of Information Technology Hyderabad
7+
UpdateFrequency: Not updated
8+
Tags:
9+
- pharmaceutical
10+
- simulations
11+
- health
12+
- life sciences
13+
- machine learning
14+
- protein
15+
- molecular dynamics
16+
- aws-pds
17+
License: https://devalab.in/AI3.html
18+
Resources:
19+
- Description: The ai3data bucket includes coordinates and energetics for ~20,000 protein-ligand complexes. Its subfolders are Version1, Version2, and Version3. Version1 (10.4 GiB total) contains the initial structure of each protein-ligand complex and the average binding affinities along with average energy components. Version2 (1.2 TiB total) contains five trajectories of each protein-ligand complex (200 snapshots in all) with the closest two water molecules, and the time series of the binding affinities along with average energy components. Version3 (10.7 TiB total) contains five trajectories of each completely solvated protein-ligand complex (200 snapshots in all), and the time series of binding affinities along with average energy components.
20+
ARN: arn:aws:s3:::ai3data
21+
Region: us-east-1
22+
Type: S3 Bucket
23+
DataAtWork:
24+
Tutorials:
25+
- Title: "AI3: Protein-Ligand Binding Affinity Dataset"
26+
URL: https://github.com/devalab/AI3
27+
AuthorName: Deva Priyakumar Lab
28+
AuthorURL: https://github.com/devalab
29+
Publications:
30+
- Title: "PLAS-5k: Dataset of Protein-Ligand Affinities from Molecular Dynamics for Machine Learning Applications"
31+
URL: https://www.nature.com/articles/s41597-022-01631-9
32+
AuthorName: U. Deva Priyakumar
33+
AuthorURL: https://devalab.in/
34+
- Title: "PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications"
35+
URL: https://www.nature.com/articles/s41597-023-02872-y
36+
AuthorName: U. Deva Priyakumar
37+
AuthorURL: https://devalab.in
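
The resource description quotes the three version sizes in mixed units (GiB and TiB). The sketch below adds them up in a single unit, which may help when estimating storage before downloading everything. It assumes binary units (1 TiB = 1024 GiB); the dictionary keys and the function name `total_size_tib` are illustrative and are not part of the dataset.

```python
# Sizes quoted in the ai3data resource description, converted to TiB.
# Assumption: binary units, so 1 TiB = 1024 GiB.
GIB_PER_TIB = 1024

version_sizes_tib = {
    "Version1": 10.4 / GIB_PER_TIB,  # quoted as 10.4 GiB
    "Version2": 1.2,                 # quoted as 1.2 TiB
    "Version3": 10.7,                # quoted as 10.7 TiB
}


def total_size_tib(sizes):
    """Sum the per-version sizes (all expressed in TiB)."""
    return sum(sizes.values())


total = total_size_tib(version_sizes_tib)
print(f"Approximate total: {total:.2f} TiB")
```

Version3 dominates the total at roughly 11.9 TiB overall, so fetching only Version1 or Version2 is far cheaper when fully solvated trajectories are not needed.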

datasets/apex.yaml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
Name: APEX-CONNECTS
2+
Description: >
3+
The BRAIN Initiative Connectivity Across Scales (CONNECTS) program is working to create detailed maps of brain
4+
wiring across different species and scales, using advanced imaging technologies.
5+
APEX supports this effort by serving as a central hub that brings together and coordinates data and tools
6+
from research focused on brain connectivity in humans and animals. Together, these efforts aim to improve our
7+
understanding of how the brain is structured and functions.
8+
Documentation: https://brainlife.io
9+
10+
ManagedBy: "[Brainlife Team](https://brainlife.io/team/)"
11+
UpdateFrequency: New datasets are added monthly
12+
Tags:
13+
- neuroscience
14+
- neuroimaging
15+
- microscopy
16+
- life sciences
17+
- zarr
18+
- metadata
19+
- machine learning
20+
- infrastructure
21+
- json
22+
- imaging
23+
- brain images
24+
- brain models
25+
- analysis ready data
26+
- nifti
27+
- aws-pds
28+
License: '[CC BY](https://creativecommons.org/licenses/by/4.0)'
29+
Citation:
30+
Resources:
31+
- Description: All APEX datasets are available for download
32+
ARN: arn:aws:s3:::apex-connects
33+
Region: us-east-2
34+
Type: S3 Bucket
35+
DataAtWork:
36+
Tutorials:
37+
- Title: Brainlife AWS Tutorials
38+
URL: https://brainlife.io/docs/tutorial/aws-brainlife
39+
AuthorName: Brainlife
40+
AuthorURL: https://brainlife.io
41+
Tools & Applications:
42+
- Title: Brainlife Web App
43+
URL: https://brainlife.io
44+
AuthorName: Brainlife
45+
AuthorURL: https://brainlife.io
46+
- Title: Brainlife CLI (Command Line Interface)
47+
URL: https://github.com/brainlife/cli
48+
AuthorName: Brainlife
49+
AuthorURL: https://github.com/brainlife/cli
50+
Publications:

datasets/caladapt-wildfire-dataset.yaml

Lines changed: 2 additions & 1 deletion
@@ -12,6 +12,7 @@ Collabs:
1212
Tags:
1313
- climate
1414
Tags:
15+
- aws-pds
1516
- climate
1617
- climate model
1718
- climate projections
@@ -61,4 +62,4 @@ DataAtWork:
6162
AuthorName: "Cal-Adapt: Analytics Engine Team"
6263
AuthorURL: https://github.com/cal-adapt
6364
ADXCategories:
64-
- Environmental Data
65+
- Environmental Data
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
Name: Clinical Ultrasound Image Repository
2+
Description: Generic clinical ultrasound data from random subjects, acquired for clinical reasons, to be used for developing artificial intelligence applications. The dataset is complete, with 2,000 studies from 2,000 subjects (one third each from abdominal, cardiac, and OB/GYN cases).
3+
Documentation: https://clinical-ultrasound-image-repository.s3.amazonaws.com/index.html
4+
5+
ManagedBy: "[MONAI Development Team](https://github.com/Project-MONAI/MONAI)"
6+
UpdateFrequency: This is a static dataset; however, tutorials and resources will be updated as they are developed.
7+
Tags:
8+
- medicine
9+
- medical imaging
10+
- machine learning
11+
- life sciences
12+
- aws-pds
13+
License: "[CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)"
14+
Resources:
15+
- Description: Clinical Ultrasound Image Repository
16+
ARN: arn:aws:s3:::clinical-ultrasound-image-repository
17+
Region: us-west-2
18+
Type: S3 Bucket
19+
Explore:
20+
- "[Browse Bucket](https://clinical-ultrasound-image-repository.s3.amazonaws.com/download.html)"

datasets/cmas-data-warehouse.yaml

Lines changed: 6 additions & 0 deletions
@@ -73,6 +73,12 @@ Resources:
7373
Type: S3 Bucket
7474
Explore:
7575
- '[Browse Bucket](https://cmaq-12us4-cracmm3-modeling-platform-2023.s3.amazonaws.com/index.html)'
76+
- Description: CMAQ Model Versions 5.5 CRACMM2 Input Data (2022r1) -- 12/22/2021 - 12/31/2022 12km CONUS
77+
ARN: arn:aws:s3:::cmaq-12us1-cracmm2-modeling-platform-2022
78+
Region: us-east-1
79+
Type: S3 Bucket
80+
Explore:
81+
- '[Browse Bucket](https://cmaq-12us1-cracmm2-modeling-platform-2022.s3.amazonaws.com/index.html)'
7682
- Description: EPA 2022 Modeling Platform
7783
ARN: arn:aws:s3:::epa-2022-modeling-platform
7884
Region: us-east-1
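
S3 bucket ARNs in these entries take the form `arn:aws:s3:::bucket-name`, with exactly three colons after `s3` because the region and account fields are empty. A minimal sketch of a check that could catch a malformed ARN before a registry entry is merged is shown below. The function name `bucket_from_s3_arn` is hypothetical, and the pattern encodes only the general S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens), not every restriction.

```python
import re

# Plain S3 bucket ARN: "arn:aws:s3:::" followed by a bucket name.
# Assumption: general-purpose bucket naming rules only (3-63 chars,
# lowercase letters, digits, dots, hyphens, alphanumeric at both ends).
S3_ARN_RE = re.compile(r"arn:aws:s3:::([a-z0-9][a-z0-9.-]{1,61}[a-z0-9])")


def bucket_from_s3_arn(arn):
    """Return the bucket name, or None if `arn` is not a plain S3 bucket ARN."""
    match = S3_ARN_RE.fullmatch(arn)
    return match.group(1) if match else None


# A well-formed ARN yields its bucket name; extra colons are rejected.
print(bucket_from_s3_arn("arn:aws:s3:::epa-2022-modeling-platform"))
print(bucket_from_s3_arn("arn:aws:s3::::::epa-2022-modeling-platform"))
```

Using `fullmatch` rather than `search` matters here: a stray colon anywhere in the bucket part causes the whole ARN to be rejected instead of silently matching a substring.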
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
Name: Community coral reef image classification training data
2+
Description: "Community-sourced repository of coral reef image classification training data, including continually updated confirmed annotations from [MERMAID](https://datamermaid.org/)"
3+
Documentation: https://github.com/data-mermaid/image-classification-open-data
4+
5+
ManagedBy: "[MERMAID](https://datamermaid.org/)"
6+
UpdateFrequency: Each partner organization updates on their own cadence. MERMAID updates once per day.
7+
Tags:
8+
- aws-pds
9+
- coastal
10+
- conservation
11+
- coral reef
12+
- csv
13+
- global
14+
- machine learning
15+
- marine
16+
- parquet
17+
- survey
18+
License: "[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
19+
Resources:
20+
- Description: "The coral-reef-training AWS S3 bucket provides a single, open, well-structured, growing, community-sourced repository of coral reef image classification training data. Hosted at s3://coral-reef-training, this bucket supports global efforts in coral reef conservation through standardized, machine-learning-ready imagery and annotations.
21+
22+
The bucket serves as the image storage backend for MERMAID’s image classification workflows and to distribute confirmed and scrubbed MERMAID coral reef image data, but it also provides a shared location where partners including CoralNet can contribute to and benefit from collective ML model development, each according to its own data structures and policies. Data in the bucket is free and open for public access; only contributing organizations have write access to their own data prefixes.
23+
24+
By centralizing and standardizing coral reef image data, this initiative accelerates collaboration across scientific, conservation, and machine learning communities and facilitates the creation of a common, evolving image classification model for coral reefs worldwide."
25+
ARN: arn:aws:s3:::coral-reef-training
26+
Region: us-east-1
27+
Type: S3 Bucket
28+
Explore:
29+
- "[Browse Bucket](https://coral-reef-training.s3.amazonaws.com/index.html)"
30+
DataAtWork:
31+
Tutorials:
32+
- Title: MERMAID Image Classification Open Data Tutorial - Python version
33+
URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_Python.html
34+
AuthorName: Domazetoski V, Caldwell I
35+
AuthorURL: https://github.com/ViktorDomazetoski, https://github.com/ircaldwell
36+
- Title: MERMAID Image Classification Open Data Tutorial - R version
37+
URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_R.html
38+
AuthorName: Caldwell I
39+
AuthorURL: https://github.com/ircaldwell
40+
Tools & Applications:
41+
- Title: MERMAID Collect
42+
URL: https://app.datamermaid.org/
43+
AuthorName: MERMAID
44+
AuthorURL: https://datamermaid.org/
45+
- Title: MERMAID Explore
46+
URL: https://explore.datamermaid.org/
47+
AuthorName: MERMAID
48+
AuthorURL: https://datamermaid.org/
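
The resource description gives the bucket location as `s3://coral-reef-training`, while the Explore link uses the virtual-hosted HTTPS form. The sketch below shows how both addresses relate to the bucket name. The helper names `s3_uri` and `browse_url` are hypothetical, and the `index.html` page exists only when the bucket owner publishes a listing page, as this entry's Explore link suggests.

```python
def s3_uri(bucket, key=""):
    """Build an s3:// URI for a bucket, optionally with an object key."""
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def browse_url(bucket, page="index.html"):
    """Build the virtual-hosted HTTPS URL of a page inside a public bucket.

    Assumption: the bucket owner has published `page`; not every bucket has one.
    """
    return f"https://{bucket}.s3.amazonaws.com/{page}"


bucket = "coral-reef-training"
print(s3_uri(bucket))
print(browse_url(bucket))
```

The `s3://` form is what command-line and SDK clients typically expect, whereas the HTTPS form is suited to browsers and plain HTTP downloads.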

datasets/deepdrug-dpeb.yaml

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
Name: DeepDrug Protein Embeddings Bank (DPEB)
2+
Description: DPEB is a multimodal database of human protein embeddings integrating four biologically complementary representations—AlphaFold2, BioEmbeddings, ESM-2, and ProtVec—designed for enhanced protein-protein interaction prediction and functional classification.
3+
Documentation: https://github.com/deepdrugai/DPEB
4+
Contact: https://github.com/deepdrugai/DPEB/issues
5+
ManagedBy: "Louisiana State University"
6+
UpdateFrequency: Initial release; maintained for at least 2 years with updates planned based on new embedding models and protein coverage.
7+
Tags:
8+
- bioinformatics
9+
- protein
10+
- structural biology
11+
- machine learning
12+
- life sciences
13+
- aws-pds
14+
License: MIT
15+
Citation: "Sajol MSI et al. DeepDrug Protein Embeddings Bank (DPEB) was accessed on [DATE] at https://registry.opendata.aws/dpeb"
16+
Resources:
17+
- Description: Multimodal human protein embeddings (AlphaFold2, BioEmbeddings, ESM-2, ProtVec) with JSONL-formatted metadata containing FASTA, UniProt IDs, and embeddings.
18+
ARN: arn:aws:s3:::deepdrug-dpeb-human-protein-embeddings
19+
Region: us-east-1
20+
Type: S3 Bucket
21+
DataAtWork:
22+
Tutorials:
23+
- Title: Aggregating and Clustering AlphaFold2 Embeddings from DPEB
24+
URL: https://github.com/deepdrugai/DPEB/tree/main
25+
AuthorName: Md. Saiful Islam Sajol
26+
AuthorURL: https://github.com/deepdrugai
27+
Tools & Applications:
28+
- Title: DPEB Explorer Tool
29+
URL: https://github.com/deepdrugai/DPEB
30+
AuthorName: DeepDrug Lab
31+
AuthorURL: https://github.com/deepdrugai
32+
Publications:
33+
- Title: A Multimodal Human Protein Embeddings Database - DeepDrug Protein Embeddings Bank (DPEB)
34+
URL: https://doi.org/10.XXXX/nar.dpeb2025
35+
AuthorName: Sajol MSI, Rajasekaran M, Bess A, Alvin C, Mukhopadhyay S
36+
AuthorURL: https://github.com/deepdrugai/DPEB
37+
ADXCategories:
38+
- Healthcare & Life Sciences Data

datasets/eai-essential-web-v1.yaml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
Name: 'Essential-Web v1.0: 24T tokens of organized web data'
2+
Description: A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.
3+
Documentation: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
4+
5+
ManagedBy: '[EssentialAI](https://www.essential.ai)'
6+
UpdateFrequency: Not updated
7+
Tags:
8+
- aws-pds
9+
- machine learning
10+
- natural language processing
11+
- web archive
12+
- text analysis
13+
License: 'Essential-Web-v1.0 contributions are made available under the [ODC attribution license](https://opendatacommons.org/licenses/by/odc_by_1.0_public_text.txt); however, users should also abide by the [Common Crawl - Terms of Use](https://commoncrawl.org/terms-of-use). We do not alter the license of any of the underlying data.'
14+
Resources:
15+
- Description: 'Essential-Web v1.0: 24T tokens of organized web data'
16+
ARN: arn:aws:s3:::essential-web-v1.0
17+
Region: us-west-2
18+
Type: S3 Bucket
19+
- Description: Notifications for new Essential-Web v1.0 data
20+
ARN: arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created
21+
Region: us-west-2
22+
Type: SNS Topic
23+
DataAtWork:
24+
Publications:
25+
- Title: 'Essential-Web v1.0: 24T tokens of organized web data'
26+
URL: https://arxiv.org/abs/2506.14111
27+
AuthorName: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar et al.
28+
AuthorURL: https://arxiv.org/abs/2506.14111
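
Unlike the S3 bucket ARN, the SNS topic ARN in this entry carries a region and an account ID, and its region should agree with the entry's `Region` field. The sketch below splits the ARN into its parts so that consistency can be checked. The names `SnsTopicArn` and `parse_sns_topic_arn` are hypothetical helpers, not part of any AWS SDK.

```python
from typing import NamedTuple


class SnsTopicArn(NamedTuple):
    partition: str
    region: str
    account_id: str
    topic: str


def parse_sns_topic_arn(arn):
    """Split arn:<partition>:sns:<region>:<account-id>:<topic> into fields."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {arn!r}")
    _, partition, _, region, account_id, topic = parts
    return SnsTopicArn(partition, region, account_id, topic)


parsed = parse_sns_topic_arn(
    "arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created"
)
# The ARN's region should match the Region field of the registry entry.
print(parsed.region == "us-west-2", parsed.topic)
```

The account ID is kept as a string because it can begin with a zero, as it does here.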

datasets/mosaic.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
Name: Meta-Organized Stimuli And fMRI Imaging data for Computational modeling (MOSAIC)
2+
Description: This extensible dataset, MOSAIC, aggregates individual functional magnetic resonance imaging (fMRI) datasets by leveraging a shared preprocessing pipeline and stimulus curation procedure. This dataset aggregation procedure achieves the scale necessary for neural network training and the diversity needed for generalizable results.
3+
Documentation: https://github.com/blahner/mosaic-preprocessing
4+
5+
ManagedBy: Massachusetts Institute of Technology, Georgia Tech
6+
UpdateFrequency: New data is uploaded as researchers preprocess their fMRI data according to the MOSAIC format and submit it.
7+
Tags:
8+
- aws-pds
9+
- brain images
10+
- brain models
11+
- hdf5
12+
- neuroimaging
13+
- neuroscience
14+
- machine learning
15+
License: CC BY 4.0
16+
Citation:
17+
Resources:
18+
- Description: HDF5 files containing preprocessed fMRI data
19+
ARN: arn:aws:s3:::mosaicfmri
20+
Region: us-west-2
21+
Type: S3 Bucket
22+
Explore:
23+
- '[Browse Bucket](https://mosaicfmri.s3.amazonaws.com/index.html)'
24+
DataAtWork:
25+
Tutorials:
26+
- Title: Load HDF5 file (Jupyter notebook)
27+
URL: https://github.com/blahner/mosaic-preprocessing/blob/main/src/fmriDatasetPreparation/create_hdf5/load_hdf5.ipynb
28+
NotebookURL: https://github.com/blahner/mosaic-preprocessing/blob/main/src/fmriDatasetPreparation/create_hdf5/load_hdf5.ipynb
29+
AuthorName: Benjamin Lahner
30+
ADXCategories:
31+
- Healthcare & Life Sciences Data
