This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Commit c5b11a6

Merge pull request #61 from MDverse/feature/add-mdposit-scraper
Feature/add mdposit scraper
2 parents 7a1426f + 6658973 commit c5b11a6

File tree

10 files changed

+1152
-25
lines changed

AGENTS.md

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -24,8 +24,9 @@ When writing code:
2424

2525
When writing functions, always:
2626

27-
- Add descriptive docstrings.
27+
- Add descriptive docstrings
2828
- Use early returns for error conditions
29+
- Limit size of try / except blocks to the strict minimum
2930

3031
Never import libraries by yourself. Always ask before adding dependencies.
3132

README.md

Lines changed: 18 additions & 0 deletions
@@ -170,6 +170,24 @@ This command will:
170170
4. Validate entries using Pydantic models
171171
5. Save the extracted metadata to Parquet files
172172

173+
## Scrape MDDB
174+
175+
See [MDDB](docs/mddb.md) to understand how we scrape metadata from MDDB.
176+
177+
Scrape MDDB to collect molecular dynamics (MD) datasets and files:
178+
179+
```bash
180+
uv run scrape-mddb --output-dir data
181+
```
182+
183+
This command will:
184+
185+
1. List all datasets and files through the main MDposit nodes.
186+
2. Parse metadata and validate them using the Pydantic models
187+
`DatasetMetadata` and `FileMetadata`.
188+
3. Save the validated dataset and file metadata.
189+
190+
The scraping process takes about 2 hours, depending on your network connection and hardware.
173191

174192
## Analyze Gromacs mdp and gro files
175193

docs/mddb.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
1+
# MDDB
2+
3+
> The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research.
4+
5+
The MDposit infrastructure is distributed across several MDposit nodes. All metadata are accessible through the global node:
6+
7+
MDposit MMB node:
8+
9+
- web site: <https://mdposit.mddbr.eu/>
10+
- documentation: <https://mdposit.mddbr.eu/#/help>
11+
- API: <https://mdposit.mddbr.eu/api/rest/docs/>
12+
- API base URL: <https://mdposit.mddbr.eu/api/rest/v1>
13+
14+
No account or token is needed to access the MDposit API.
15+
16+
## Getting metadata
17+
18+
### Datasets
19+
20+
In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)".
21+
22+
API endpoint to get the total number of projects:
23+
24+
- Endpoint: `/projects/summary`
25+
- HTTP method: GET
26+
- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)
27+
28+
A project can contain multiple replicas, each identified by `project_id`.`replica_id`.
29+
30+
For example, the project [MD-A003ZP](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview) contains ten replicas:
31+
32+
- `MD-A003ZP.1`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/overview
33+
- `MD-A003ZP.2`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.2/overview
34+
- `MD-A003ZP.3`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.3/overview
35+
- ...
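
The `project_id`.`replica_id` convention above can be sketched in Python as follows. The helper names `replica_accession` and `split_accession` are hypothetical and not part of the scraper, and the sketch assumes that replica numbers are integers, as the links above suggest.

```python
from __future__ import annotations


def replica_accession(project_id: str, replica_id: int) -> str:
    """Join a project accession and a replica number, e.g. 'MD-A003ZP.1'."""
    return f"{project_id}.{replica_id}"


def split_accession(accession: str) -> tuple[str, int | None]:
    """Split 'MD-A003ZP.1' into ('MD-A003ZP', 1).

    An accession without a replica suffix yields (project_id, None).
    """
    project_id, separator, replica = accession.partition(".")
    if not separator:
        return project_id, None
    return project_id, int(replica)


# Assumption: replicas of a project are numbered consecutively from 1.
first_three = [replica_accession("MD-A003ZP", i) for i in range(1, 4)]
```

Here `first_three` holds the three replica identifiers listed above.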
36+
37+
API endpoint to get all datasets at once:
38+
39+
- Endpoint: `/projects`
40+
- HTTP method: GET
41+
- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects)
42+
43+
### Files
44+
45+
API endpoint to get files for a given replica of a project:
46+
47+
- Endpoint: `/projects/{project_id.replica_id}/filenotes`
48+
- HTTP method: GET
49+
- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes)
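
As a minimal sketch, the three endpoints described above can be assembled from the API base URL listed at the top of this page. The function names are illustrative assumptions, not the scraper's actual code, and `fetch_json` is a convenience wrapper that assumes the API answers with JSON.

```python
from __future__ import annotations

import json
from urllib.request import urlopen

# API base URL as given in this document.
BASE_URL = "https://mdposit.mddbr.eu/api/rest/v1"


def projects_summary_url(base_url: str = BASE_URL) -> str:
    """URL of the endpoint that reports the total number of projects."""
    return f"{base_url}/projects/summary"


def projects_url(base_url: str = BASE_URL) -> str:
    """URL of the endpoint that lists the datasets (projects)."""
    return f"{base_url}/projects"


def filenotes_url(project_id: str, replica_id: int, base_url: str = BASE_URL) -> str:
    """URL of the endpoint that lists the files of one replica."""
    return f"{base_url}/projects/{project_id}.{replica_id}/filenotes"


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For instance, `fetch_json(filenotes_url("MD-A003ZP", 1))` would request the file list for the first replica of project `MD-A003ZP`; the shape of the returned payload is not described here and should be checked against the API documentation.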
50+
51+
## Examples
52+
53+
### Project `MD-A003ZP`
54+
55+
Title:
56+
57+
> MDBind 3x1k
58+
59+
Description:
60+
61+
> 10 ns simulation of 1ma4m pdb structure from MDBind dataset, a dynamic view of the PDBBind database
62+
63+
- [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview)
64+
- [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP)
65+
66+
Files for replica 1:
67+
68+
- [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/files)
69+
- [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP.1/filenotes)
70+
71+
### Project `MD-A001T1`
72+
73+
Title:
74+
75+
> All-atom molecular dynamics simulations of SARS-CoV-2 envelope protein E in the monomeric form, C4 popc
76+
77+
Description:
78+
79+
> The trajectories of all-atom MD simulations were obtained based on 4 starting representative conformations from the CG simulation. For each starting structure, there are six trajectories of the E protein: 3 with the protein embedded in the membrane containing POPC, and 3 with the membrane mimicking the natural ERGIC membrane (Mix: 50% POPC, 25% POPE, 10% POPI, 5% POPS, 10% cholesterol).
80+
81+
- [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1/overview)
82+
- [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1)
83+
84+
Files for replica 1:
85+
86+
- [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1.1/files)
87+
- [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1.1/filenotes)

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -73,3 +73,4 @@ scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
7373
scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
7474
scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
7575
scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"
76+
scrape-mddb = "mdverse_scrapers.scrapers.mddb:main"

src/mdverse_scrapers/models/dataset.py

Lines changed: 1 addition & 1 deletion
@@ -170,7 +170,7 @@ def format_dates(cls, value: datetime | str | None) -> str | None:
170170
171171
Parameters
172172
----------
173-
cls : type[BaseDataset]
173+
cls : type[DatasetMetadata]
174174
The Pydantic model class being validated.
175175
value : datetime | str | None
176176
The input value of the 'date' field to validate.

src/mdverse_scrapers/models/enums.py

Lines changed: 16 additions & 0 deletions
@@ -20,10 +20,26 @@ class DatasetSourceName(StrEnum):
2020
ATLAS = "atlas"
2121
GPCRMD = "gpcrmd"
2222
NMRLIPIDS = "nmrlipids"
23+
MDDB = "mddb"
24+
MDPOSIT_INRIA_NODE = "mdposit_inria_node"
25+
MDPOSIT_MMB_NODE = "mdposit_mmb_node"
26+
MDPOSIT_CINECA_NODE = "mdposit_cineca_node"
2327

2428

2529
class ExternalDatabaseName(StrEnum):
2630
"""External database names."""
2731

2832
PDB = "pdb"
2933
UNIPROT = "uniprot"
34+
35+
36+
class MoleculeType(StrEnum):
37+
"""Common molecular types found in molecular dynamics simulations."""
38+
39+
PROTEIN = "protein"
40+
NUCLEIC_ACID = "nucleic_acid"
41+
ION = "ion"
42+
LIPID = "lipid"
43+
CARBOHYDRATE = "carbohydrate"
44+
SOLVENT = "solvent"
45+
SMALL_MOLECULE = "small_molecule"

src/mdverse_scrapers/models/simulation.py

Lines changed: 51 additions & 11 deletions
@@ -3,9 +3,16 @@
33
import re
44
from typing import Annotated
55

6-
from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator
6+
from pydantic import (
7+
BaseModel,
8+
ConfigDict,
9+
Field,
10+
StringConstraints,
11+
field_validator,
12+
model_validator,
13+
)
714

8-
from .enums import ExternalDatabaseName
15+
from .enums import ExternalDatabaseName, MoleculeType
916

1017
DOI = Annotated[
1118
str,
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel):
3744
None, min_length=1, description="Direct URL to the identifier in the database"
3845
)
3946

47+
@model_validator(mode="after")
48+
def compute_url(self) -> "ExternalIdentifier":
49+
"""Compute the URL for the external identifier.
50+
51+
Parameters
52+
----------
53+
self : ExternalIdentifier
54+
The model instance being validated, with all fields already validated.
55+
56+
Returns
57+
-------
58+
ExternalIdentifier
59+
The model instance with the URL field computed if it was not provided.
60+
"""
61+
if self.url is not None:
62+
return self
63+
64+
if self.database_name == ExternalDatabaseName.PDB:
65+
self.url = f"https://www.rcsb.org/structure/{self.identifier}"
66+
elif self.database_name == ExternalDatabaseName.UNIPROT:
67+
self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}"
68+
69+
return self
70+
4071

4172
class Molecule(BaseModel):
4273
"""Molecule in a simulation."""
@@ -45,18 +76,25 @@ class Molecule(BaseModel):
4576
model_config = ConfigDict(extra="forbid")
4677

4778
name: str = Field(..., description="Name of the molecule.")
79+
type: MoleculeType | None = Field(
80+
None,
81+
description="Type of the molecule. "
82+
"Allowed values in the MoleculeType enum. "
83+
"Examples: PROTEIN, ION, LIPID...",
84+
)
85+
number_of_molecules: int | None = Field(
86+
None,
87+
ge=0,
88+
description="Number of molecules of this type in the simulation.",
89+
)
4890
number_of_atoms: int | None = Field(
4991
None, ge=0, description="Number of atoms in the molecule."
5092
)
5193
formula: str | None = Field(None, description="Chemical formula of the molecule.")
5294
sequence: str | None = Field(
5395
None, description="Sequence of the molecule for protein and nucleic acid."
5496
)
55-
number_of_molecules: int | None = Field(
56-
None,
57-
ge=0,
58-
description="Number of molecules of this type in the simulation.",
59-
)
97+
inchikey: str | None = Field(None, description="InChIKey of the molecule.")
6098
external_identifiers: list[ExternalIdentifier] | None = Field(
6199
None,
62100
description=("List of external database identifiers for this molecule."),
@@ -66,8 +104,9 @@ class Molecule(BaseModel):
66104
class ForceFieldModel(BaseModel):
67105
"""Forcefield or Model used in a simulation."""
68106

69-
# Ensure scraped metadata matches the expected schema exactly.
70-
model_config = ConfigDict(extra="forbid")
107+
# Ensure scraped metadata matches the expected schema exactly
108+
# and version is coerced to string when needed.
109+
model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)
71110

72111
name: str = Field(
73112
...,
@@ -81,8 +120,9 @@ class ForceFieldModel(BaseModel):
81120
class Software(BaseModel):
82121
"""Simulation software or tool used in a simulation."""
83122

84-
# Ensure scraped metadata matches the expected schema exactly.
85-
model_config = ConfigDict(extra="forbid")
123+
# Ensure scraped metadata matches the expected schema exactly
124+
# and version is coerced to string when needed.
125+
model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)
86126

87127
name: str = Field(
88128
...,
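
The URL fallback that this commit adds in `ExternalIdentifier.compute_url` (`src/mdverse_scrapers/models/simulation.py`) can be sketched without Pydantic as a plain function. This is only an illustration of the logic shown in the diff: `resolve_url` and `URL_TEMPLATES` are hypothetical names, and database names are passed as plain strings rather than `ExternalDatabaseName` members.

```python
from __future__ import annotations

# URL patterns taken from the diff; the dictionary itself is an assumption.
URL_TEMPLATES = {
    "pdb": "https://www.rcsb.org/structure/{identifier}",
    "uniprot": "https://www.uniprot.org/uniprotkb/{identifier}",
}


def resolve_url(
    database_name: str, identifier: str, url: str | None = None
) -> str | None:
    """Return the explicit URL if given, else derive one from the database name."""
    # Early return: an explicitly provided URL is never overwritten.
    if url is not None:
        return url
    template = URL_TEMPLATES.get(database_name)
    # Unknown databases keep url as None, mirroring the if/elif in the diff.
    if template is None:
        return None
    return template.format(identifier=identifier)
```

A lookup table keeps the function short as more databases are added, whereas the model validator in the diff uses an explicit `if` / `elif` chain; both leave the URL unset for databases they do not know.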
