Commit ee1cb46

Merge branch 'main' into main
2 parents a57a37b + fb69162 commit ee1cb46

55 files changed, +1423 -104 lines changed

datasets/ai3.yaml

Lines changed: 37 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,37 @@
1+
Name: AI3 Protein-Ligand Binding Affinity Dataset
2+
Description: >
3+
The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications, but generating experimental data on a large scale is expensive and arduous. In domains such as medical diagnostics, where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. Teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine have performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions. File formats: (i) 3D coordinates of the protein-ligand complexes (PDB) in tar.gz files, and (ii) CSV files containing the energy data. Intended uses: (i) ML scoring functions for predicting binding affinities of given protein-ligand complexes, (ii) classification models for predicting correct binding poses of ligands, (iii) identification of cryptic binding pockets, and (iv) optimization of binding features by exploiting the individual energy components (experimental data provides only the total binding affinity). Existing AI/ML training datasets lack dynamic data and are inherently biased, and binding affinity data in the literature are obtained from different experimental protocols. This dataset was therefore generated with the same computational protocol throughout: free energy calculations based on molecular dynamics (MD) simulations.
The dynamic data-enriched protein-ligand coordinates can be used to effectively train convolutional neural network-based regression models for more accurate binding affinity prediction.
4+
Documentation: https://github.com/devalab/AI3
5+
6+
ManagedBy: International Institute of Information Technology Hyderabad
7+
UpdateFrequency: Not updated
8+
Tags:
9+
- pharmaceutical
10+
- simulations
11+
- health
12+
- life sciences
13+
- machine learning
14+
- protein
15+
- molecular dynamics
16+
- aws-pds
17+
License: https://devalab.in/AI3.html
18+
Resources:
19+
- Description: The ai3data bucket contains coordinates and energetics for ~20,000 protein-ligand complexes, organized into three subfolders (Version1, Version2, and Version3). Version1 (10.4 GiB) holds the initial structure of each protein-ligand complex and the average binding affinities along with average energy components. Version2 (1.2 TiB) holds five trajectories of each protein-ligand complex (200 snapshots in all) with the closest two water molecules, and the time series of the binding affinities along with average energy components. Version3 (10.7 TiB) holds five trajectories of each completely solvated protein-ligand complex (200 snapshots in all), and the time series of binding affinities along with average energy components.
20+
ARN: arn:aws:s3:::ai3data
21+
Region: us-east-1
22+
Type: S3 Bucket
23+
DataAtWork:
24+
Tutorials:
25+
- Title: "AI3: Protein-Ligand Binding Affinity Dataset"
26+
URL: https://github.com/devalab/AI3
27+
AuthorName: Deva Priyakumar Lab
28+
AuthorURL: https://github.com/devalab
29+
Publications:
30+
- Title: "PLAS-5k: Dataset of Protein-Ligand Affinities from Molecular Dynamics for Machine Learning Applications"
31+
URL: https://www.nature.com/articles/s41597-022-01631-9
32+
AuthorName: U. Deva Priyakumar
33+
AuthorURL: https://devalab.in/
34+
- Title: "PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications"
35+
URL: https://www.nature.com/articles/s41597-023-02872-y
36+
AuthorName: U. Deva Priyakumar
37+
AuthorURL: https://devalab.in
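For download planning, the per-version sizes quoted in the ai3data resource description can be totalled. The following is a minimal Python sketch, not part of the commit. The dictionary keys are only labels taken from the description, and binary units (1 TiB = 1024 GiB) are assumed from the GiB/TiB suffixes.

```python
# Sizes as quoted in the ai3data resource description.
# Assumption: binary units, i.e. 1 TiB = 1024 GiB.
GIB_PER_TIB = 1024

# Everything is stored in GiB so the values can be summed directly.
sizes_gib = {
    "Version1": 10.4,                 # quoted as 10.4 GiB
    "Version2": 1.2 * GIB_PER_TIB,    # quoted as 1.2 TiB
    "Version3": 10.7 * GIB_PER_TIB,   # quoted as 10.7 TiB
}


def total_tib(sizes):
    """Sum per-version sizes given in GiB and return the total in TiB."""
    return sum(sizes.values()) / GIB_PER_TIB


if __name__ == "__main__":
    for name, gib in sizes_gib.items():
        print(f"{name}: {gib:,.1f} GiB")
    print(f"Total: {total_tib(sizes_gib):.2f} TiB")
```

With the quoted figures the total comes to roughly 11.9 TiB, almost all of it in Version3, so fetching only Version1 is a reasonable first step for anyone who needs just the average energies.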

datasets/aodn_radar_newcastle_velocity_hourly_averaged_delayed_qc.yaml

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Collabs:
2020
Tags:
2121
- oceans
2222
Tags:
23+
- aws-pds
2324
- oceans
2425
- ocean currents
2526
- ocean velocity

datasets/apex.yaml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
Name: APEX-CONNECTS
2+
Description: >
3+
The BRAIN Initiative Connectivity Across Scales (CONNECTS) program is working to create detailed maps of brain
4+
wiring across different species and scales, using advanced imaging technologies.
5+
APEX supports this effort by serving as a central hub that brings together and coordinates data and tools
6+
from research focused on brain connectivity in humans and animals. Together, these efforts aim to improve our
7+
understanding of how the brain is structured and functions.
8+
Documentation: https://brainlife.io
9+
10+
ManagedBy: "[Brainlife Team](https://brainlife.io/team/)"
11+
UpdateFrequency: New datasets are added monthly
12+
Tags:
13+
- neuroscience
14+
- neuroimaging
15+
- microscopy
16+
- life sciences
17+
- zarr
18+
- metadata
19+
- machine learning
20+
- infrastructure
21+
- json
22+
- imaging
23+
- brain images
24+
- brain models
25+
- analysis ready data
26+
- nifti
27+
- aws-pds
28+
License: '[CC BY](https://creativecommons.org/licenses/by/4.0)'
29+
Citation:
30+
Resources:
31+
- Description: All APEX datasets are available for download
32+
ARN: arn:aws:s3:::apex-connects
33+
Region: us-east-2
34+
Type: S3 Bucket
35+
DataAtWork:
36+
Tutorials:
37+
- Title: Brainlife AWS Tutorials
38+
URL: https://brainlife.io/docs/tutorial/aws-brainlife
39+
AuthorName: Brainlife
40+
AuthorURL: https://brainlife.io
41+
Tools & Applications:
42+
- Title: Brainlife Web App
43+
URL: https://brainlife.io
44+
AuthorName: Brainlife
45+
AuthorURL: https://brainlife.io
46+
- Title: Brainlife CLI (Command Line Interface)
47+
URL: https://github.com/brainlife/cli
48+
AuthorName: Brainlife
49+
AuthorURL: https://github.com/brainlife/cli
50+
Publications:

datasets/askap.yaml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
1+
Name: ASKAP Radio Telescope
2+
Description: |
3+
4+
ASKAP is the CSIRO’s newest radio telescope. It is situated at the Inyarrimanha Ilgari Bundara, the CSIRO Murchison Radio-astronomy Observatory on Wajarri Yamaji Country in the Murchison region of Western Australia, about 800 km north of Perth.
5+
6+
ASKAP consists of 36 dishes, each 12 m in diameter, spread out as far as 6 km apart. It uses a new technology called Phased Array Feeds (PAFs), which allows it to see more of the sky at once. This technology gives ASKAP an extremely high survey speed, making it one of the best instruments in the world for mapping the sky at radio wavelengths.
7+
8+
Initial dataset available - The Rapid ASKAP Continuum Survey (RACS)
9+
10+
RACS is the first large-area survey completed with ASKAP. This survey is revolutionary as the entire sky was observed in a matter of weeks, doing what previously took telescopes years to do. RACS initially covered the whole sky at 890 MHz (RACS-Low), and has since expanded to ASKAP’s other bands (1.4 and 1.7 GHz). RACS also covers the sky in multiple epochs, with a second epoch of RACS-Low and RACS-Mid obtained and processed.
11+
12+
RACS provides astronomers with a unique opportunity to study the radio sky and radio populations, in particular supermassive black holes (active galactic nuclei) and their role in galaxy evolution. The multi-epoch approach also allows study of the transient sky and testing and verification of calibration methods. The large area allows for cosmological studies, such as a search for anisotropy in the galaxy population, or the cosmic dipole.
13+
14+
Documentation: https://www.atnf.csiro.au/facilities/askap-radio-telescope/
15+
16+
ManagedBy: "[Australia Telescope National Facility, CSIRO](http://www.atnf.csiro.au/)"
17+
Citation: Please see the [ATNF acknowledgement page](https://www.atnf.csiro.au/resources/publications/atnf-publication-acknowledgement-statements/) for full citation instructions.
18+
UpdateFrequency: Roughly quarterly
19+
Tags:
20+
- aws-pds
21+
- astronomy
22+
- archives
23+
License: CC-BY-4.0. Attribution required for refereed scientific papers.
24+
Resources:
25+
- Description: The Rapid ASKAP Continuum Survey (RACS) Public Data Releases
26+
ARN: arn:aws:s3:::askap-odp/racs-low1/
27+
Region: ap-southeast-2
28+
Type: S3 Bucket
29+
RequesterPays: False
30+
- Description: Notifications for new ASKAP data
31+
ARN: arn:aws:sns:ap-southeast-2:336305517014:askap-odp-object_created
32+
Region: ap-southeast-2
33+
Type: SNS Topic
34+
DataAtWork:
35+
Tutorials:
36+
- Title: CSIRO ASKAP Science Data Archive User Guide
37+
URL: https://research.csiro.au/casda/casda-user-guide/
38+
AuthorName: CSIRO, ATNF
39+
- Title: Rapid ASKAP Continuum Survey (RACS) Home Page
40+
URL: https://research.csiro.au/racs/
41+
AuthorName: CSIRO, ATNF
42+
Tools & Applications:
43+
Publications:
44+
- Title: ASKAP Publication List
45+
URL: https://www.atnf.csiro.au/facilities/askap-radio-telescope/publications/
46+
AuthorName: various, list maintained by CSIRO, ATNF
47+
- Title: ASKAP System Description paper
48+
URL: https://doi.org/10.1017/pasa.2021.1
49+
AuthorName: Hotan, A. et al.
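The Resources block above pairs each ARN with a Region field, and the S3 ARN carries a key prefix after the bucket name. The sketch below, which is not part of the commit, splits an ARN into its standard colon-separated fields (`arn:partition:service:region:account-id:resource`) and checks the embedded region against the declared one. The helper names `parse_arn`, `region_matches`, and `s3_bucket_and_prefix` are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple, Tuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its fields; the resource part may itself contain ':' or '/'."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


def region_matches(arn: str, declared_region: str) -> bool:
    """True if the ARN has no region (as in S3 bucket ARNs) or it equals the declared Region."""
    region = parse_arn(arn).region
    return region == "" or region == declared_region


def s3_bucket_and_prefix(arn: str) -> Tuple[str, str]:
    """Return (bucket, key prefix) for an S3 ARN; the prefix is '' when absent."""
    parsed = parse_arn(arn)
    if parsed.service != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    bucket, _, prefix = parsed.resource.partition("/")
    return bucket, prefix


if __name__ == "__main__":
    # Values copied from the askap.yaml Resources entries.
    print(s3_bucket_and_prefix("arn:aws:s3:::askap-odp/racs-low1/"))
    print(region_matches(
        "arn:aws:sns:ap-southeast-2:336305517014:askap-odp-object_created",
        "ap-southeast-2",
    ))
```

A check of this kind only compares strings within the entry. It does not confirm that the bucket or topic actually exists.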

datasets/aws-public-blockchain.yaml

Lines changed: 4 additions & 2 deletions
@@ -13,6 +13,7 @@ Description: >
1313
- XRP Ledger - SonarX - <code>s3://aws-public-blockchain/v1.1/sonarx/xrp/</code><br>
1414
- Stellar(<a href="https://developers.stellar.org/docs/learn/fundamentals/data-format/xdr" rel="noopener noreferrer">XDR files</a>) - Stellar - <code>s3://aws-public-blockchain/v1.1/stellar/</code><br>
1515
- The Open Network (TON) - TON - <code>s3://aws-public-blockchain/v1.1/ton/</code><br>
16+
- Cronos - Cronos - <code>s3://aws-public-blockchain/v1.1/cronos/</code><br>
1617
</br>
1718
1819
<h4>Become a Data Provider</h4>
@@ -24,6 +25,7 @@ Contact: [email protected]
2425
ManagedBy: "[Amazon Web Services](https://aws.amazon.com/)"
2526
UpdateFrequency: New data is delivered daily to the current date folders Parquet files.
2627
Tags:
28+
- aws-pds
2729
- blockchain
2830
- web3
2931
License: https://github.com/aws-samples/digital-assets-examples/blob/main/LICENSE
@@ -39,10 +41,10 @@ DataAtWork:
3941
Publications:
4042
- Title: "Exploring Arbitrum Data: Analyze L2 Activity with AWS Public Blockchain Datasets"
4143
URL: https://repost.aws/articles/ARpnBONglsT2e6D-hZZmxVvA/exploring-arbitrum-data-analyze-l2-activity-with-aws-public-blockchain-datasets
42-
AuthorName: Simon Goldberd, Everton Fraga
44+
AuthorName: Simon Goldberg, Everton Fraga
4345
- Title: "Unlocking XRP Ledger Data: Comprehensive Analysis with AWS Public Blockchain Datasets"
4446
URL: https://repost.aws/articles/ARg_zMIXlhTG2hSDFZDfF6hQ/unlocking-xrp-ledger-data-comprehensive-analysis-with-aws-public-blockchain-datasets
45-
AuthorName: Simon Goldberd, Everton Fraga
47+
AuthorName: Simon Goldberg, Everton Fraga
4648
- Title: New datasets added to the AWS Public Blockchain Datasets — available for analytics and research
4749
URL: https://repost.aws/articles/AR3gztQGeSS8CfaKNNeyYwsQ
4850
AuthorName: Everton Fraga, Simon Goldberg

datasets/biolip.yaml

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
Name: BioLiP
2+
Description: BioLiP is a semi-manually curated database for high-quality, biologically relevant ligand-protein binding interactions. The structure data are collected primarily from the Protein Data Bank (PDB), with biological insights mined from literature and other specific databases. BioLiP aims to construct the most comprehensive and accurate database for serving the needs of ligand-protein docking, virtual ligand screening and protein function annotation.
3+
Documentation: https://zhanggroup.org/BioLiP
4+
5+
ManagedBy: "[Zhang Lab](https://zhanggroup.org/)"
6+
UpdateFrequency: No regular schedule; updated upon availability of major dataset revisions
7+
Tags:
8+
- protein
9+
- structural biology
10+
- molecular docking
11+
- bioinformatics
12+
- molecule
13+
- life sciences
14+
- chemistry
15+
License: No explicit license stated (publicly available for academic and research use).
16+
Citation: "Chengxin Zhang, Xi Zhang, Peter L Freddolino, and Yang Zhang. BioLiP2: an updated structure database for biologically relevant ligand-protein interactions, Nucleic Acids Research, gkad630 (2023)."
17+
Resources:
18+
- Description: BioLiP dataset
19+
ARN: arn:aws:s3:::biolip
20+
Region: ap-southeast-1
21+
Type: S3 Bucket
22+
- Description: BioLiP interaction structures
23+
ARN: arn:aws:s3:::biolip/weekly
24+
Region: ap-southeast-1
25+
Type: S3 Bucket
26+
DataAtWork:
27+
Tutorials:
28+
- Title: BioLiP API usage
29+
URL: https://zhanggroup.org/BioLiP/help.html
30+
AuthorName: Zhang Lab
31+
Publications:
32+
- Title: "BioLiP2: an updated structure database for biologically relevant ligand-protein interactions"
33+
URL: https://academic.oup.com/nar/article/52/D1/D404/7233921
34+
AuthorName: Chengxin Zhang, Xi Zhang, Peter L Freddolino, and Yang Zhang
35+
- Title: "BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions"
36+
URL: https://academic.oup.com/nar/article/41/D1/D1096/1074898
37+
AuthorName: Jianyi Yang, Ambrish Roy, and Yang Zhang
38+

datasets/busco-data.yaml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
Name: BUSCO Datasets
2+
Description: Lineage datasets for use with the BUSCO software package. Each dataset contains HMM profiles for clade-specific, universal, single-copy marker genes. Datasets are available across the archaea, bacteria, eukaryota and virus domains. The repository also includes the data files needed for phylogenetic placement of an input assembly.
3+
Documentation: https://busco.ezlab.org/busco_userguide.html#lineage-datasets
4+
Contact: https://gitlab.com/ezlab/busco/-/issues
5+
ManagedBy: Computational Evolutionary Genomics Group, University of Geneva
6+
UpdateFrequency: New datasets are released to correspond with updates in OrthoDB versions. Maintenance updates occur a few times a year if necessary to fix any bugs or update metadata.
7+
Tags:
8+
- assembly
9+
- bacteria
10+
- bioinformatics
11+
- genomic
12+
- life sciences
13+
- metagenomics
14+
- open source software
15+
- protein
16+
- virus
17+
- aws-pds
18+
License: The BUSCO datasets are licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. Any use of these datasets for analyses in a publication or product must include the citation of the corresponding paper - https://doi.org/10.1093/molbev/msab199
19+
Citation:
20+
Resources:
21+
- Description: BUSCO datasets and companion files for use with BUSCO pipeline
22+
ARN: arn:aws:s3:::busco-data
23+
Region: us-east-1
24+
Type: S3 Bucket
25+
- Description: Notifications for new BUSCO data
26+
ARN: arn:aws:sns:us-east-1:622022425660:my-dataset-object_created
27+
Region: us-east-1
28+
Type: SNS Topic
29+
DataAtWork:
30+
Tutorials:
31+
- Title: BUSCO - from QC to gene prediction and phylogenomics
32+
URL: https://www.youtube.com/watch?v=9SjVY3BT8JU
33+
AuthorName: Matthew Berkeley
34+
AuthorURL: https://github.com/berkelem
35+
Services:
36+
Publications:
37+
- Title: OrthoDB and BUSCO update - annotation of orthologs with wider sampling of genomes.
38+
URL: https://academic.oup.com/nar/article/53/D1/D516/7899526?login=true
39+
AuthorName: Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva
40+
- Title: BUSCO Update - Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.
41+
URL: https://academic.oup.com/mbe/article/38/10/4647/6329644?login=true
42+
AuthorName: Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, Evgeny M Zdobnov
43+
- Title: BUSCO - assessing genomic data quality and beyond.
44+
URL: https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpz1.323
45+
AuthorName: Mosè Manni, Matthew R. Berkeley, Mathieu Seppey, Evgeny M. Zdobnov
46+
DeprecatedNotice:
47+
ADXCategories:
48+
- Healthcare & Life Sciences Data

datasets/caladapt-wildfire-dataset.yaml

Lines changed: 2 additions & 1 deletion
@@ -12,6 +12,7 @@ Collabs:
1212
Tags:
1313
- climate
1414
Tags:
15+
- aws-pds
1516
- climate
1617
- climate model
1718
- climate projections
@@ -61,4 +62,4 @@ DataAtWork:
6162
AuthorName: "Cal-Adapt: Analytics Engine Team"
6263
AuthorURL: https://github.com/cal-adapt
6364
ADXCategories:
64-
- Environmental Data
65+
- Environmental Data
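Several hunks in this commit add the `aws-pds` tag to an entry's Tags list. Assuming (the diff does not state this) that the tag is expected on every entry, a small check could flag entries that lack it. The sketch below is not part of the commit. It works on plain dictionaries so it needs no YAML library, the helper name `entries_missing_tag` is hypothetical, and the two sample entries copy the Tags lists shown above for askap.yaml and biolip.yaml.

```python
def entries_missing_tag(entries, tag="aws-pds"):
    """Return the names of entries whose Tags list does not contain `tag`."""
    return [
        name
        for name, meta in entries.items()
        if tag not in (meta.get("Tags") or [])
    ]


# Tags copied from two of the files in this diff.
entries = {
    "datasets/askap.yaml": {
        "Tags": ["aws-pds", "astronomy", "archives"],
    },
    "datasets/biolip.yaml": {
        "Tags": [
            "protein",
            "structural biology",
            "molecular docking",
            "bioinformatics",
            "molecule",
            "life sciences",
            "chemistry",
        ],
    },
}

if __name__ == "__main__":
    for name in entries_missing_tag(entries):
        print(f"{name}: missing aws-pds tag")
```

Run on the two sample entries, the check reports only datasets/biolip.yaml, whose Tags list in this diff does not include `aws-pds`. Whether that omission is intentional cannot be determined from the diff alone.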
