This repository was archived by the owner on Mar 10, 2026. It is now read-only.
Merged
6 changes: 3 additions & 3 deletions README.md
@@ -57,17 +57,17 @@ Eventually, the scraper will produce two files: `zenodo_datasets.parquet` and `z

Note that "[false positives](docs/false_positives.md)" have been removed in the scraping process.

## Scrape FigShare
## Scrape Figshare

Have a look at the notes regarding [Figshare](docs/figshare.md) and how its API works.

Scrape FigShare for MD-related datasets and files:
Scrape Figshare for MD-related datasets and files:

```bash
uv run scrape-figshare --query-file params/query.yml --output-dir data
```

Scrape FigShare with a small query, for development or demo purposes:
Scrape Figshare with a small query, for development or demo purposes:

```bash
uv run scrape-figshare --query-file params/query_dev.yml --output-dir tmp
28 changes: 15 additions & 13 deletions docs/zenodo.md
@@ -8,7 +8,6 @@ According to Zenodo [FAQ](https://help.zenodo.org/):

So we don't expect many files to have an individual size above 50 GB.


## API

[REST API](https://developers.zenodo.org/)
@@ -21,17 +20,17 @@ List of [HTTP status codes](https://developers.zenodo.org/#http-status-codes)

Example of direct API link for a given dataset: <https://zenodo.org/api/records/8183728>


## Query

[Search guide](https://help.zenodo.org/guides/search/)

## Dataset examples
## Examples of datasets

### MD-related file types

Query:
```

```none
resource_type.type:"dataset" AND filetype:"tpr"
```
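
As a minimal sketch, the query above can be turned into a search URL for the Zenodo REST API. This assumes the search endpoint is `https://zenodo.org/api/records` with the `q`, `page` and `size` parameters; the function name `build_search_url` is hypothetical and is not taken from the scraper code.

```python
from urllib.parse import urlencode

# Assumed base URL of the Zenodo records search endpoint.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def build_search_url(query: str, page: int = 1, size: int = 25) -> str:
    """Return a search URL for a Zenodo query string (hypothetical helper)."""
    # urlencode percent-encodes quotes and colons and turns spaces into "+".
    params = {"q": query, "page": page, "size": size}
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


query = 'resource_type.type:"dataset" AND filetype:"tpr"'
url = build_search_url(query)
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client to inspect the JSON response.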

@@ -40,11 +39,13 @@ Datasets:
- [Simulations of a beta-2 adrenergic receptor monomer on a flat membrane](https://zenodo.org/record/4114422)
- [GROMACS simulations of unfolding of ubiqutin in a strong electric field](https://zenodo.org/record/4056037)

### Zip files

### zip files
Many MD simulation files are archived in zip files.

Query:
```

```none
resource_type.type:"dataset" AND filetype:"zip" AND (simulation* "molecular dynamics" gromacs charmm namd)
```

@@ -55,7 +56,6 @@ Datasets:
- [Exploring the interaction of a curcumin azobioisostere with Abeta42 dimers using replica exchange molecular dynamics simulations](https://zenodo.org/record/5573728)
- [Molecular dynamics simulation data of regulatory ACT domain dimer of human phenylalanine hydroxylase (PAH) (with unbound ligand)](https://zenodo.org/record/3814193) (with multiple zip files)


Some datasets cannot be found with keywords. For instance:

- [Ternary lipid composition in a curved geometry, more flat surface](https://zenodo.org/record/4644379)
@@ -64,28 +64,30 @@ Some dataset cannot be found with keywords. For instance:

Zip file content can be obtained through a preview page provided by Zenodo.

Example dataset: [All-atom molecular dynamics simulations of SARS-CoV-2 envelope protein E](https://zenodo.org/record/4743386)
From dataset [All-atom molecular dynamics simulations of SARS-CoV-2 envelope protein E](https://zenodo.org/record/4743386)

- preview for [NoPTM-2_Mix_CHARMM36m_0.1x3mks.zip](https://zenodo.org/record/4743386/preview/NoPTM-2_Mix_CHARMM36m_0.1x3mks.zip)
- preview for [NoPTM-4_POPC_CHARMM36m_0.1x3mks.zip](https://zenodo.org/record/4743386/preview/NoPTM-4_POPC_CHARMM36m_0.1x3mks.zip)

Note that the preview is available for the first 1000 files only.
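
The preview links above follow a regular pattern, so a preview URL can be sketched from a record ID and a zip file name. This is only an illustration based on the two example links; `build_preview_url` and `preview_may_be_truncated` are hypothetical helpers, not functions from the scraper, and the truncation check is a simple heuristic derived from the 1000-file limit.

```python
from urllib.parse import quote

# Maximum number of files listed in a zip preview, as noted above.
PREVIEW_FILE_LIMIT = 1000


def build_preview_url(record_id: int, file_name: str) -> str:
    """Return the preview page URL for a zip file of a Zenodo record."""
    # quote() escapes characters that are not safe in a URL path segment.
    return f"https://zenodo.org/record/{record_id}/preview/{quote(file_name)}"


def preview_may_be_truncated(listed_files: int) -> bool:
    """Heuristic: a listing that reaches the limit is probably incomplete."""
    return listed_files >= PREVIEW_FILE_LIMIT


print(build_preview_url(4743386, "NoPTM-2_Mix_CHARMM36m_0.1x3mks.zip"))
```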

### Zip files with tree-like structure
File name and file size are the only metadata available from the preview.

#### Zip files with tree-like structure

Some zip file contents are really dense, with many folders and sub-folders.
Some zip file contents are dense, with many folders and sub-folders.

Examples:

- In the dataset "[Input files and scripts for Hamiltonian replica-exchange molecular dynamics simulations of intrinsically disordered proteins using a software GROMACS patched with PLUMED](https://zenodo.org/record/4319228)", a preview of the file `hremd-idp.zip` is available [here](https://zenodo.org/record/4319228/preview/hremd-idp.zip).
- In the dataset "[2DUV Machine Learning Protocol Code](https://zenodo.org/record/4444751/)", a preview of the file `code.zip` is available [here](https://zenodo.org/record/4444751/preview/code.zip).

These complex zip files are handled by the current implementation of the Zenodo crawler.
These complex zip files are handled by the current implementation of the Zenodo scraper.

#### Issues with zip file content

Sometimes, zip file contents are not accessible.

Example dataset: [G-Protein Coupled Receptor-Ligand Dissociation Rates and Mechanisms from tauRAMD Simulations](https://zenodo.org/record/5151217)
For the dataset: [G-Protein Coupled Receptor-Ligand Dissociation Rates and Mechanisms from tauRAMD Simulations](https://zenodo.org/record/5151217)

- preview not available for [Example_b2AR-alprenolol.zip](https://zenodo.org/record/5151217/preview/Example_b2AR-alprenolol.zip) probably because file is too large (5.4 GB)
- The preview for [Example_b2AR-alprenolol.zip](https://zenodo.org/record/5151217/preview/Example_b2AR-alprenolol.zip) is not available, probably because the file is too large (5.4 GB).