The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications, but generating experimental data on a large scale is expensive and arduous. In domains such as medical diagnostics, where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. Teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions. File formats: (i) 3D coordinates of the protein-ligand complexes (PDB) in tar.gz files, and (ii) CSV files containing the energy data. Intended uses: (i) ML scoring functions for predicting the binding affinities of protein-ligand complexes, (ii) classification models for predicting correct ligand binding poses, (iii) identification of cryptic binding pockets, and (iv) optimization of binding features by exploiting the individual energy components (experimental data provide only the total binding affinity). Existing AI/ML training datasets lack dynamic data and are inherently biased, and the binding affinity data in the literature come from different experimental protocols. This dataset was therefore generated with the same computational protocol throughout: molecular dynamics (MD) simulations followed by free energy calculations.
The dynamic data-enriched protein-ligand coordinates can be used to effectively train convolutional neural network-based regression models for more accurate binding affinity prediction.
ManagedBy: International Institute of Information Technology Hyderabad
+UpdateFrequency: Not updated
+Tags:
+- pharmaceutical
+- simulations
+- health
+- life sciences
+- machine learning
+- protein
+- molecular dynamics
+- aws-pds
+License: https://devalab.in/AI3.html
+Resources:
+- Description: The ai3data bucket contains the coordinates and energetics of ~20,000 protein-ligand complexes. Its subfolders are Version1, Version2, and Version3. Version1 (10.4 GiB) contains the initial structure of each protein-ligand complex and the average binding affinities along with average energy components. Version2 (1.2 TiB) contains five trajectories of each protein-ligand complex (200 snapshots in all) with the two closest water molecules, and the time series of the binding affinities along with average energy components. Version3 (10.7 TiB) contains five trajectories of each completely solvated protein-ligand complex (200 snapshots in all), and the time series of binding affinities along with average energy components.
Name: Community coral reef image classification training data
+Description: "Community-sourced repository of coral reef image classification training data, including continually updated confirmed annotations from [MERMAID](https://datamermaid.org/)"
UpdateFrequency: Each partner organization updates on their own cadence. MERMAID updates once per day.
+Tags:
+- aws-pds
+- coastal
+- conservation
+- coral reef
+- csv
+- global
+- machine learning
+- marine
+- parquet
+- survey
+License: "[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
+Resources:
+- Description: "The coral-reef-training AWS S3 bucket provides a single, open, well-structured, growing, community-sourced repository of coral reef image classification training data. Hosted at s3://coral-reef-training, this bucket supports global efforts in coral reef conservation through standardized, machine-learning-ready imagery and annotations.
+
+The bucket serves as the image storage backend for MERMAID’s image classification workflows and distributes confirmed and scrubbed MERMAID coral reef image data. It also provides a shared location where partners, including CoralNet, can contribute to and benefit from collective ML model development, each according to its own data structures and policies. Data in the bucket is free and open for public access; each contributing organization has write access only to its own data prefixes.
+
+By centralizing and standardizing coral reef image data, this initiative accelerates collaboration across scientific, conservation, and machine learning communities and facilitates the creation of a common, evolving image classification model for coral reefs worldwide."
Name: 'Essential-Web v1.0: 24T tokens of organized web data'
+Description: A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.
License: 'Essential-Web-v1.0 contributions are made available under the [ODC attribution license](https://opendatacommons.org/licenses/by/odc_by_1.0_public_text.txt); however, users should also abide by the [Common Crawl - Terms of Use](https://commoncrawl.org/terms-of-use). We do not alter the license of any of the underlying data.'
+Resources:
+- Description: 'Essential-Web v1.0: 24T tokens of organized web data'
+ARN: arn:aws:s3:::essential-web-v1.0
+Region: us-west-2
+Type: S3 Bucket
+- Description: Notifications for new Essential-Web v1.0 data
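The resource entry above identifies the bucket by ARN and region. As a minimal sketch (not part of the registry entry), the ARN can be converted into the `s3://` URI that most tools expect. The helper names `bucket_from_arn` and `s3_uri` are hypothetical, and the object layout inside the bucket is not described in this diff.

```python
S3_ARN_PREFIX = "arn:aws:s3:::"


def bucket_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 bucket ARN (arn:aws:s3:::<bucket>)."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError(f"not an S3 bucket ARN: {arn!r}")
    bucket = arn[len(S3_ARN_PREFIX):]
    # A bucket ARN carries no object key, so a slash means this is an object ARN.
    if not bucket or "/" in bucket:
        raise ValueError(f"expected a bucket ARN without a key: {arn!r}")
    return bucket


def s3_uri(arn: str, key_prefix: str = "") -> str:
    """Build an s3:// URI for the bucket, optionally under a key prefix."""
    return f"s3://{bucket_from_arn(arn)}/{key_prefix.lstrip('/')}"


if __name__ == "__main__":
    # ARN taken from the registry entry; the region (us-west-2) is passed
    # separately to whichever client is used.
    print(s3_uri("arn:aws:s3:::essential-web-v1.0"))
```

Assuming the bucket permits anonymous reads (not stated in this entry), the resulting URI could then be listed with the AWS CLI using `--no-sign-request` and `--region us-west-2`.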
Name: Meta-Organized Stimuli And fMRI Imaging data for Computational modeling (MOSAIC)
+Description: This extensible dataset, MOSAIC, aggregates individual functional magnetic resonance imaging (fMRI) datasets by leveraging a shared preprocessing pipeline and stimulus curation procedure. This dataset aggregation procedure achieves the scale necessary for neural network training and the diversity needed for generalizable results.
NOAA's Coastal Ocean Reanalysis (CORA) for the Gulf of Mexico and East Coast (GEC) is produced using verified hourly water levels from the Center of Operational Oceanographic Products & Services (CO-OPS), through hydrodynamic modeling from Advanced Circulation "[ADCIRC](https://adcirc.org/)" and Simulating WAves Nearshore "[SWAN](https://swanmodel.sourceforge.io/)" models. Data are assimilated, processed, corrected, and processed again before quality assurance and skill assessment with additional verified tide station-based observations.
-<br/>
-<br/>
-Details for CORA Dataset
-<br/>
-<br/>
-**Timeseries** - 1979 to 2022
-<br/>
-**Size** - Approx. 20.5TB
-<br/>
-**Domain** - Lat 5.8 to 45.8 ; Long -98.0 to -53.8
-<br/>
-**Nodes** - 1813443 centroids, 3564104 elements
-<br/>
-**Grid cells** - Currently approximately 505
-<br/>
-**Spatial Resolution** - 500m, 1983 Contiguous USA Albers projection (EPSG:5070)
-<br/>
-Documentation: https://tidesandcurrents.noaa.gov/
+NOAA's [Coastal Ocean Reanalysis (CORA)](https://tidesandcurrents.noaa.gov/cora.html) for the Gulf, East Coast/Atlantic, and Caribbean (GEC) is produced using verified hourly water levels from the National Ocean Service’s [Center of Operational Oceanographic Products & Services](https://tidesandcurrents.noaa.gov/) (CO-OPS). [ADvanced CIRCulation Model (ADCIRC)](https://www.erdc.usace.army.mil/Media/Fact-Sheets/Fact-Sheet-Article-View/Article/476698/advanced-circulation-model/) and [Simulating WAves Nearshore (SWAN)](https://www.tudelft.nl/en/ceg/about-faculty/departments/hydraulic-engineering/sections/environmental-fluid-mechanics/research/swan) models are coupled to model coastal water levels and nearshore waves. Hourly water level observations are used for data assimilation and validation to improve the accuracy of modeled water levels and wave datasets.
+<br><br>
+<b>Additional Details:</b><br>
+Metadata associated with model domain and time span:
- Projection: 1983 Contiguous USA Albers projection (EPSG:5070)
+<br><br>
+
+<b>Datasets:</b><br>
+Water level and wave datasets resulting from the computation, assimilation, validation, and optimization reanalysis datasets. All products are available in NetCDF (.nc) format:
+- fort.63.nc - Water level elevation
+- fort.73.nc - Atmospheric pressure at sea level
+- fort.74.nc - Wind Velocity - 10 m elevation
+- maxele.63.nc - Maximum water elevation
+- swan_DIR.63.nc - Spectral mean wave direction
+- swan_TMM10.63.nc - Spectral mean wave period
+- swan_TPS.63.nc - Spectral peak wave period
+- swan_HS.63.nc - Spectral zeroth moment wave height
+- swan_HS_max.63.nc - Maximum spectral zeroth moment wave height
+<br><br>
+
+<b>Derived Products:</b><br>
+Datasets resulting from the computation, modeling, or other processing using existing/collected data. All products are available in NetCDF (.nc) format:
+- CORA-V1.1-fort.63: Hourly water levels
+- CORA-V1.1-swan_DIR.63: Hourly mean wave direction
- CORA-V1.1-Grid: Hourly water levels interpolated from model nodes to uniform 500-meter resolution grid
+<br><br>
+
+Documentation: |
+[NOAA Technical Report NOS CO-OPS 108: NOAA’s Coastal Ocean Reanalysis: Gulf of Mexico, Atlantic, and Caribbean (January 2025)](https://doi.org/10.25923/5ypp-4e84)
+
+UpdateFrequency: Product dependent. At minimum, annually.
+
+License: |
+NOAA data disseminated through NODD are open to the public and can be used as desired.
+
+NOAA makes data openly available to ensure maximum use of our data, and to spur and encourage exploration and innovation throughout the industry. NOAA requests attribution for the use or dissemination of unaltered NOAA data. However, it is not permissible to state or imply endorsement by or affiliation with NOAA. If you modify NOAA data, you may not state or imply that it is original, unaltered NOAA data.
+
+ManagedBy: |
+[NOAA’s National Ocean Service, The Center for Operational Oceanographic Products and Services (CO-OPS)](https://tidesandcurrents.noaa.gov/about_us.html)
+
 Contact: |
 For questions regarding data content or quality, email [email protected]
-<br/>
 This data is made available to the public through the NOAA Open Data Dissemination (NODD) Program. For questions regarding this program, email [email protected].
-<br/>
-We also seek to identify case studies on how NOAA data is being used and will be featuring those stories in joint publications and in upcoming events. If you are interested in seeing your story highlighted, please share it with the NOAA NODD team at [email protected].
-ManagedBy: "[NOAA’s National Ocean Service, The Center for Operational Oceanographic Products and Services (CO-OPS)](https://tidesandcurrents.noaa.gov/about_us.html)"
-UpdateFrequency: Monthly, quarterly, and annually, depending on the dataset.
+We also seek to identify case studies on how NOAA data is being used and will be featuring those stories in joint publications and in upcoming events. If you are interested in seeing your story highlighted, please share it with the NOAA NODD team at [email protected].
+
 Collabs:
 ASDI:
 Tags:
@@ -41,21 +70,37 @@ Tags:
 - agriculture
 - transportation
 - oceans
-License: NOAA data disseminated through NODD are open to the public and can be used as desired.<br/> <br/>NOAA makes data openly available to ensure maximum use of our data, and to spur and encourage exploration and innovation throughout the industry. NOAA requests attribution for the use or dissemination of unaltered NOAA data. However, it is not permissible to state or imply endorsement by or affiliation with NOAA. If you modify NOAA data, you may not state or imply that it is original, unaltered NOAA data.
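The NetCDF product names listed in the updated CORA description lend themselves to a small lookup table. The following is a minimal sketch, assuming object keys end with those file names; the bucket layout is not given in this diff, and `describe_cora_file` is a hypothetical helper rather than part of any NOAA tooling.

```python
import posixpath

# File names and descriptions copied from the "Datasets" list in the
# CORA description.
CORA_PRODUCTS = {
    "fort.63.nc": "Water level elevation",
    "fort.73.nc": "Atmospheric pressure at sea level",
    "fort.74.nc": "Wind Velocity - 10 m elevation",
    "maxele.63.nc": "Maximum water elevation",
    "swan_DIR.63.nc": "Spectral mean wave direction",
    "swan_TMM10.63.nc": "Spectral mean wave period",
    "swan_TPS.63.nc": "Spectral peak wave period",
    "swan_HS.63.nc": "Spectral zeroth moment wave height",
    "swan_HS_max.63.nc": "Maximum spectral zeroth moment wave height",
}


def describe_cora_file(key: str) -> str:
    """Return the product description for an object key ending in a known file name."""
    name = posixpath.basename(key)
    try:
        return CORA_PRODUCTS[name]
    except KeyError:
        raise ValueError(f"unrecognized CORA product file: {name!r}") from None


if __name__ == "__main__":
    # The prefix below is made up for illustration only.
    print(describe_cora_file("example/prefix/fort.63.nc"))
```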