Commit 8d1d412

Merge branch 'main' into main
2 parents 9901ffa + 582d6b6 commit 8d1d412

File tree

9 files changed

+238
-38
lines changed

datasets/ai3.yaml

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
Name: AI3 Protein-Ligand Binding Affinity Dataset
2+
Description: >
3+
The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications. Generating experimental data on a large scale is an expensive and arduous process. In domains such as medical diagnostics where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. We, teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine, have performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions. DatasetFileFormats essentially incorporate i. 3D coordinates of the protein-ligand complexes (pdb) in tar.gz files, and ii. CSV files containing the energy data. DatasetUsages are on i. ML scoring function for predicting binding affinities of given protein-ligand complexes, ii. Classification models for predicting correct binding poses of ligands, iii. identification of cryptic binding pockets, and iv. optimization of binding features by exploiting the individual components of the energy (experimental data has only the total binding affinity). Further, the novelty of the dataset highlights the fact that existing AI/ML training datasets lack dynamic data and are inherently biased. Further, binding affinity data existing in the literature are obtained from different experimental protocols. Therefore, this dataset has been uniquely created (from the same computational protocols) followed by free energy calculations with molecular dynamics (MD) simulations. 
The dynamic data-enriched protein-ligand coordinates can be used to effectively train convolutional neural network-based regression models for more accurate binding affinity prediction.
4+
Documentation: https://github.com/devalab/AI3
5+
6+
ManagedBy: International Institute of Information Technology Hyderabad
7+
UpdateFrequency: Not updated
8+
Tags:
9+
- pharmaceutical
10+
- simulations
11+
- health
12+
- life sciences
13+
- machine learning
14+
- protein
15+
- molecular dynamics
16+
- aws-pds
17+
License: https://devalab.in/AI3.html
18+
Resources:
19+
- Description: ai3data bucket includes coordinates and the energetics of ~20,000 protein-ligand binding affinity datasets. The subfolders of ai3data bucket consist of Version 1, Version2 and Version 3. Version1 contains the total Size of 10.4 GiB (Initial structure of the protein-ligand complex and the average binding affinities along with average energy components). Version2 contains the total Size of 1.2 TiB (Five trajectories of protein-ligand complex (200 snapshots in all) and the closest two water molecules for each of the protein-ligand complex, and the time series of the binding affinities along with average energy components). Version3 contains the total Size of 10.7 TiB (Five trajectories of completely solvated protein-ligand complex (200 snapshots in all), and the time series of binding affinities along with average energy components).
20+
ARN: arn:aws:s3:::ai3data
21+
Region: us-east-1
22+
Type: S3 Bucket
23+
DataAtWork:
24+
Tutorials:
25+
- Title: "AI3: Protein-Ligand Binding Affinity Dataset"
26+
URL: https://github.com/devalab/AI3
27+
AuthorName: Deva Priyakumar Lab
28+
AuthorURL: https://github.com/devalab
29+
Publications:
30+
- Title: "PLAS-5k: Dataset of Protein-Ligand Affinities from Molecular Dynamics for Machine Learning Applications"
31+
URL: https://www.nature.com/articles/s41597-022-01631-9
32+
AuthorName: U. Deva Priyakumar
33+
AuthorURL: https://devalab.in/
34+
- Title: "PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications"
35+
URL: https://www.nature.com/articles/s41597-023-02872-y
36+
AuthorName: U. Deva Priyakumar
37+
AuthorURL: https://devalab.in
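
Note on the Resources entry above: the ARN field follows the general layout arn:partition:service:region:account-id:resource, and for S3 buckets the region and account fields are empty. The following is a minimal sketch of turning such an ARN into an s3:// URI. The helper names parse_arn and s3_uri are hypothetical and are not part of the registry tooling.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN of the form arn:partition:service:region:account-id:resource."""
    # maxsplit=5 keeps any colons inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


def s3_uri(arn: str) -> str:
    """Return the s3:// URI for an S3 bucket ARN."""
    parsed = parse_arn(arn)
    if parsed.service != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    return f"s3://{parsed.resource}"


if __name__ == "__main__":
    # ARN copied from the Resources entry in this file.
    print(s3_uri("arn:aws:s3:::ai3data"))
```

Assuming the bucket allows anonymous reads, which the aws-pds tag suggests but this diff does not confirm, the resulting URI could then be listed with `aws s3 ls --no-sign-request s3://ai3data/`.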

datasets/caladapt-wildfire-dataset.yaml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ Collabs:
1212
Tags:
1313
- climate
1414
Tags:
15+
- aws-pds
1516
- climate
1617
- climate model
1718
- climate projections
@@ -61,4 +62,4 @@ DataAtWork:
6162
AuthorName: "Cal-Adapt: Analytics Engine Team"
6263
AuthorURL: https://github.com/cal-adapt
6364
ADXCategories:
64-
- Environmental Data
65+
- Environmental Data
Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
Name: Community coral reef image classification training data
2+
Description: "Community-sourced repository of coral reef image classification training data, including continually updated confirmed annotations from [MERMAID](https://datamermaid.org/)"
3+
Documentation: https://github.com/data-mermaid/image-classification-open-data
4+
5+
ManagedBy: "[MERMAID](https://datamermaid.org/)"
6+
UpdateFrequency: Each partner organization updates on their own cadence. MERMAID updates once per day.
7+
Tags:
8+
- aws-pds
9+
- coastal
10+
- conservation
11+
- coral reef
12+
- csv
13+
- global
14+
- machine learning
15+
- marine
16+
- parquet
17+
- survey
18+
License: "[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
19+
Resources:
20+
- Description: "The coral-reef-training AWS S3 bucket provides a single, open, well-structured, growing, community-sourced repository of coral reef image classification training data. Hosted at s3://coral-reef-training, this bucket supports global efforts in coral reef conservation through standardized, machine-learning-ready imagery and annotations.
21+
22+
The bucket serves as the image storage backend for MERMAID’s image classification workflows and to distribute confirmed and scrubbed MERMAID coral reef image data, but it also provides a shared location where partners including CoralNet can contribute to and benefit from collective ML model development, each according to its own data structures and policies. Data in the bucket is free and open for public access; only contributing organizations have write access to their own data prefixes.
23+
24+
By centralizing and standardizing coral reef image data, this initiative accelerates collaboration across scientific, conservation, and machine learning communities and facilitates the creation of a common, evolving image classification model for coral reefs worldwide."
25+
ARN: arn:aws:s3:::coral-reef-training
26+
Region: us-east-1
27+
Type: S3 Bucket
28+
Explore:
29+
- "[Browse Bucket](https://coral-reef-training.s3.amazonaws.com/index.html)"
30+
DataAtWork:
31+
Tutorials:
32+
- Title: MERMAID Image Classification Open Data Tutorial - Python version
33+
URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_Python.html
34+
AuthorName: Domazetoski V, Caldwell I
35+
AuthorURL: https://github.com/ViktorDomazetoski, https://github.com/ircaldwell
36+
- Title: MERMAID Image Classification Open Data Tutorial - R version
37+
URL: https://data-mermaid.github.io/image-classification-open-data/image-classification-open-data-tutorial_R.html
38+
AuthorName: Caldwell I
39+
AuthorURL: https://github.com/ircaldwell
40+
Tools & Applications:
41+
- Title: MERMAID Collect
42+
URL: https://app.datamermaid.org/
43+
AuthorName: MERMAID
44+
AuthorURL: https://datamermaid.org/
45+
- Title: MERMAID Explore
46+
URL: https://explore.datamermaid.org/
47+
AuthorName: MERMAID
48+
AuthorURL: https://datamermaid.org/

datasets/eai-essential-web-v1.yaml

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
Name: 'Essential-Web v1.0: 24T tokens of organized web data'
2+
Description: A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.
3+
Documentation: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
4+
5+
ManagedBy: '[EssentialAI](https://www.essential.ai)'
6+
UpdateFrequency: Not updated
7+
Tags:
8+
- aws-pds
9+
- machine learning
10+
- natural language processing
11+
- web archive
12+
- text analysis
13+
License: 'Essential-Web-v1.0 contributions are made available under the [ODC attribution license](https://opendatacommons.org/licenses/by/odc_by_1.0_public_text.txt); however, users should also abide by the [Common Crawl - Terms of Use](https://commoncrawl.org/terms-of-use). We do not alter the license of any of the underlying data.'
14+
Resources:
15+
- Description: 'Essential-Web v1.0: 24T tokens of organized web data'
16+
ARN: arn:aws:s3:::essential-web-v1.0
17+
Region: us-west-2
18+
Type: S3 Bucket
19+
- Description: Notifications for new Essential-Web v1.0 data
20+
ARN: arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created
21+
Region: us-west-2
22+
Type: SNS Topic
23+
DataAtWork:
24+
Publications:
25+
- Title: 'Essential-Web v1.0: 24T tokens of organized web data'
26+
URL: https://arxiv.org/abs/2506.14111
27+
AuthorName: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar et al.
28+
AuthorURL: https://arxiv.org/abs/2506.14111
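
Note on the SNS Topic resource above: a subscriber would typically receive each notification as a JSON envelope whose "Message" field holds the S3 event as a JSON string. The sketch below assumes the topic publishes standard S3 event notifications and that they arrive in the standard SNS envelope (for example via an SQS queue or HTTPS endpoint without raw message delivery). Neither is confirmed by this diff. The function name and the sample payload are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def object_keys_from_sns(sns_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an SNS-wrapped S3 event notification."""
    envelope = json.loads(sns_body)
    # SNS carries the original S3 event as a JSON string in "Message".
    event = json.loads(envelope["Message"])
    pairs = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-creation event.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


if __name__ == "__main__":
    # Hypothetical payload; the key below is made up for illustration.
    s3_event = {
        "Records": [
            {
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": "essential-web-v1.0"},
                    "object": {"key": "hypothetical/prefix/file%3D1.parquet"},
                },
            }
        ]
    }
    body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
    print(object_keys_from_sns(body))
```

Decoding the key before use matters because characters such as spaces and "=" arrive percent-encoded and would otherwise not match the object name in the bucket.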

datasets/mosaic.yaml

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
Name: Meta-Organized Stimuli And fMRI Imaging data for Computational modeling (MOSAIC)
2+
Description: This extensible dataset, MOSAIC, aggregates individual functional magnetic resonance imaging (fMRI) datasets by leveraging a shared preprocessing pipeline and stimulus curation procedure. This dataset aggregation procedure achieves the scale necessary for neural network training and the diversity needed for generalizable results.
3+
Documentation: https://github.com/blahner/mosaic-preprocessing
4+
5+
ManagedBy: Massachusetts Institute of Technology, Georgia Tech
6+
UpdateFrequency: New data is uploaded as researchers preprocess their fMRI data according to MOSAIC format and submit.
7+
Tags:
8+
- aws-pds
9+
- brain images
10+
- brain models
11+
- hdf5
12+
- neuroimaging
13+
- neuroscience
14+
- machine learning
15+
License: CC BY 4.0
16+
Citation:
17+
Resources:
18+
- Description: HDF5 files containing preprocessed fMRI data
19+
ARN: arn:aws:s3:::mosaicfmri
20+
Region: us-west-2
21+
Type: S3 Bucket
22+
Explore:
23+
- '[Browse Bucket](https://mosaicfmri.s3.amazonaws.com/index.html)'
24+
DataAtWork:
25+
Tutorials:
26+
- Title: Load HDF5 file (Jupyter notebook)
27+
URL: https://github.com/blahner/mosaic-preprocessing/blob/main/src/fmriDatasetPreparation/create_hdf5/load_hdf5.ipynb
28+
NotebookURL: https://github.com/blahner/mosaic-preprocessing/blob/main/src/fmriDatasetPreparation/create_hdf5/load_hdf5.ipynb
29+
AuthorName: Benjamin Lahner
30+
ADXCategories:
31+
- Healthcare & Life Sciences Data

datasets/noaa-nos-cora.yaml

Lines changed: 77 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -1,32 +1,61 @@
1-
Name: NOAA's Coastal Ocean Reanalysis (CORA) Dataset
1+
Name: "NOAA's Coastal Ocean Reanalysis (CORA) Dataset: 1979-2022"
2+
23
Description: |
3-
NOAA's Coastal Ocean Reanalysis (CORA) for the Gulf of Mexico and East Coast (GEC) is produced using verified hourly water levels from the Center of Operational Oceanographic Products & Services (CO-OPS), through hydrodynamic modeling from Advanced Circulation "[ADCIRC](https://adcirc.org/)" and Simulating WAves Nearshore "[SWAN](https://swanmodel.sourceforge.io/)" models. Data are assimilated, processed, corrected, and processed again before quality assurance and skill assessment with additional verified tide station-based observations.
4-
<br/>
5-
<br/>
6-
Details for CORA Dataset
7-
<br/>
8-
<br/>
9-
**Timeseries** - 1979 to 2022
10-
<br/>
11-
**Size** - Approx. 20.5TB
12-
<br/>
13-
**Domain** - Lat 5.8 to 45.8 ; Long -98.0 to -53.8
14-
<br/>
15-
**Nodes** - 1813443 centroids, 3564104 elements
16-
<br/>
17-
**Grid cells** - Currently apporximately 505
18-
<br/>
19-
**Spatial Resolution** - 500m, 1983 Contiguous USA Albers projection (EPSG:5070)
20-
<br/>
21-
Documentation: https://tidesandcurrents.noaa.gov/
4+
NOAA's [Coastal Ocean Reanalysis (CORA)](https://tidesandcurrents.noaa.gov/cora.html) for the Gulf, East Coast/Atlantic, and Caribbean (GEC) is produced using verified hourly water levels from the National Ocean Service’s [Center of Operational Oceanographic Products & Services](https://tidesandcurrents.noaa.gov/) (CO-OPS). [ADvanced CIRCulation Model (ADCIRC)](https://www.erdc.usace.army.mil/Media/Fact-Sheets/Fact-Sheet-Article-View/Article/476698/advanced-circulation-model/) and [Simulating WAves Nearshore (SWAN)](https://www.tudelft.nl/en/ceg/about-faculty/departments/hydraulic-engineering/sections/environmental-fluid-mechanics/research/swan) models are coupled to model coastal water levels and nearshore waves. Hourly water level observations are used for data assimilation and validation to improve the accuracy of modeled water levels and wave datasets.
5+
<br><br>
6+
<b>Additional Details:</b><br>
7+
Metadata associated with model domain and time span:
8+
- Timeseries - 1979 to 2022
9+
- Size - Approx. 44.6 TB
10+
- Domain - Lat 5.8 to 45.8 ; Long -98.0 to -53.8
11+
- Nodes - [CORA Metadata Library](https://www.fisheries.noaa.gov/inport/item/75048)
12+
- Grid cells - [CORA Metadata Library](https://www.fisheries.noaa.gov/inport/item/75048)
13+
- Spatial Resolution:
14+
- Centroids: 300-400 meters
15+
- Gridded: 500 meters
16+
- Projection: 1983 Contiguous USA Albers projection (EPSG:5070)
17+
<br><br>
18+
19+
<b>Datasets:</b><br>
20+
Water level and wave datasets resulting from the computation, assimilation, validation, and optimization reanalysis datasets. All products are available in NetCDF (.nc) format:
21+
- fort.63.nc - Water level elevation
22+
- fort.73.nc - Atmospheric pressure at sea level
23+
- fort.74.nc - Wind Velocity - 10 m elevation
24+
- maxele.63.nc - Maximum water elevation
25+
- swan_DIR.63.nc - Spectral mean wave direction
26+
- swan_TMM10.63.nc - Spectral mean wave period
27+
- swan_TPS.63.nc - Spectral peak wave period
28+
- swan_HS.63.nc - Spectral zeroth moment wave height
29+
- swan_HS_max.63.nc - Maximum spectral zeroth moment wave height
30+
<br><br>
31+
32+
<b>Derived Products:</b><br>
33+
Datasets resulting from the computation, modeling, or other processing using existing/collected data. All products are available in NetCDF (.nc) format:
34+
- CORA-V1.1-fort.63: Hourly water levels
35+
- CORA-V1.1-swan_DIR.63: Hourly mean wave direction
36+
- CORA-V1.1-swan_TPS.63: Hourly peak wave periods
37+
- CORA-V1.1-swan_HS.63: Hourly significant wave heights
38+
- CORA-V1.1-Grid: Hourly water levels interpolated from model nodes to uniform 500-meter resolution grid
39+
<br><br>
40+
41+
Documentation: |
42+
[NOAA Technical Report NOS CO-OPS 108: NOAA’s Coastal Ocean Reanalysis: Gulf of Mexico, Atlantic, and Caribbean (January 2025)](https://doi.org/10.25923/5ypp-4e84)
43+
44+
UpdateFrequency: Product dependent. At minimum, annually.
45+
46+
License: |
47+
NOAA data disseminated through NODD are open to the public and can be used as desired.
48+
49+
NOAA makes data openly available to ensure maximum use of our data, and to spur and encourage exploration and innovation throughout the industry. NOAA requests attribution for the use or dissemination of unaltered NOAA data. However, it is not permissible to state or imply endorsement by or affiliation with NOAA. If you modify NOAA data, you may not state or imply that it is original, unaltered NOAA data.
50+
51+
ManagedBy: |
52+
[NOAA’s National Ocean Service, The Center for Operational Oceanographic Products and Services (CO-OPS)](https://tidesandcurrents.noaa.gov/about_us.html)
53+
2254
Contact: |
2355
For questions regarding data content or quality, email [email protected]
24-
<br/>
2556
This data is made available to the public through the NOAA Open Data Dissemination (NODD) Program. For questions regarding this program, email [email protected].
26-
<br/>
27-
We also seek to identify case studies on how NOAA data is being used and will be featuring those stories in joint publications and in upcoming events. If you are interested in seeing your story highlighted, please share it with the NOAA NODD team at [email protected].
28-
ManagedBy: "[NOAA’s National Ocean Service, The Center for Operational Oceanographic Products and Services (CO-OPS)](https://tidesandcurrents.noaa.gov/about_us.html)"
29-
UpdateFrequency: Monthly, quarterly, and annually, depending on the dataset.
57+
We also seek to identify case studies on how NOAA data is being used and will be featuring those stories in joint publications and in upcoming events. If you are interested in seeing your story highlighted, please share it with the NOAA NODD team at [email protected].
58+
3059
Collabs:
3160
ASDI:
3261
Tags:
@@ -41,21 +70,37 @@ Tags:
4170
- agriculture
4271
- transportation
4372
- oceans
44-
License: NOAA data disseminated through NODD are open to the public and can be used as desired.<br/> <br/>NOAA makes data openly available to ensure maximum use of our data, and to spur and encourage exploration and innovation throughout the industry. NOAA requests attribution for the use or dissemination of unaltered NOAA data. However, it is not permissible to state or imply endorsement by or affiliation with NOAA. If you modify NOAA data, you may not state or imply that it is original, unaltered NOAA data.
73+
4574
Resources:
46-
- Description: NOAA’s Coastal Ocean Reanalysis (CORA) Dataset NetCDF
75+
- Description: "NOAA’s Coastal Ocean Reanalysis (CORA) Dataset NetCDF"
4776
ARN: arn:aws:s3:::noaa-nos-cora-pds
4877
Region: us-east-1
4978
Type: S3 Bucket
5079
Explore:
5180
- '[Browse Bucket](https://noaa-nos-cora-pds.s3.amazonaws.com/index.html)'
52-
- Description: NOAA’s Coastal Ocean Reanalysis (CORA) Dataset Notifications
81+
- Description: "NOAA’s Coastal Ocean Reanalysis (CORA) Dataset Notifications"
5382
ARN: arn:aws:sns:us-east-1:709902155096:NewNOSCORAObject
5483
Region: us-east-1
5584
Type: SNS Topic
85+
5686
DataAtWork:
5787
Tutorials:
58-
- Title: Notebooks for working with CORA Data
59-
URL: https://github.com/NOAA-CO-OPS/CORA-Coastal-Ocean-ReAnalysis-CORA
60-
AuthorName: John Ratcliff
61-
AuthorURL: https://www.linkedin.com/in/johndratcliff/
88+
- Title: "Using Python to Access Coastal Ocean Reanalysis (CORA) Data"
89+
URL: https://github.com/NOAA-CO-OPS/CORA-Coastal-Ocean-Reanalysis-CORA
90+
AuthorName: "NOAA's Center for Operational Oceanographic Products and Services"
91+
AuthorURL: https://tidesandcurrents.noaa.gov/
92+
93+
Tools & Applications:
94+
- Title: Coastal Ocean Reanalysis Use cases
95+
URL: https://tidesandcurrents.noaa.gov/cora.html#usecase
96+
AuthorName: "NOAA's Center for Operational Oceanographic Products and Services"
97+
AuthorURL: https://tidesandcurrents.noaa.gov/
98+
99+
Publications:
100+
- Title: "NOAA Technical Report NOS CO-OPS 108: NOAA’s Coastal Ocean Reanalysis: Gulf of Mexico, Atlantic, and Caribbean (January 2025)"
101+
URL: https://doi.org/10.25923/5ypp-4e84
102+
AuthorName: Keeney, Analise; Dusek, Gregory; Callahan, John; Ratcliff, John; Jima, Tigist; Brooks, William; Marcy, Doug; Blanton, Brian; Tilson, Jeffrey; Asher, Taylor G.; Leuttich, Richard A.; Widlansky, Matthew J.; Rose, Linta; Morse, Cheryl; Haddad, Jana; & Waring, Blake
103+
104+
- Title: "Assessment of water levels from 43 years of NOAA’s Coastal Ocean Reanalysis (CORA) for the Gulf of Mexico and East Coasts"
105+
URL: https://doi.org/10.3389/fmars.2024.1381228
106+
AuthorName: Rose, Linta; Widlansky, Matthew J.; Feng, Xue; Thompson, Thompson; Asher, Taylor G.; Dusek, Gregory; Blanton, Blanton; Luettich, Richard A. Jr.; Callahan, John; Brooks, William; Keeney, Analise; Haddad, Jana; Sweet, William; Genz, Ayesha; Hovenga, Paige; Marra, John & Tilson, Jeffrey
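
Note on the Domain line in the Description above (Lat 5.8 to 45.8 ; Long -98.0 to -53.8): before downloading multi-terabyte NetCDF products, a user may want a quick check that a site of interest falls inside the stated extent. The sketch below is only a rectangular pre-filter built from those four numbers. The class and constant names are hypothetical, and a point inside the box is not guaranteed to lie on the model mesh, which covers coastal waters rather than the whole rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point lies inside the box, edges included."""
        return (
            self.lat_min <= lat <= self.lat_max
            and self.lon_min <= lon <= self.lon_max
        )


# Bounds copied from the Description field in this file.
CORA_GEC_DOMAIN = BoundingBox(lat_min=5.8, lat_max=45.8, lon_min=-98.0, lon_max=-53.8)

if __name__ == "__main__":
    # Approximate coordinates, used for illustration only.
    print(CORA_GEC_DOMAIN.contains(32.78, -79.93))   # near Charleston, SC
    print(CORA_GEC_DOMAIN.contains(37.77, -122.42))  # near San Francisco, CA
```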
