Feature/add mdposit scraper #61
Changes from 16 commits
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -171,6 +171,26 @@ This command will: | |||||
| 5. Save the extracted metadata to Parquet files | ||||||
|
|
||||||
|
|
||||||
| ## Scrape MDDB | ||||||
|
|
||||||
| Have a look at the notes regarding [MDDB](docs/mddb.md) and its API. | ||||||
|
|
||||||
| Scrape MDDB (MDposit MMB node and MDposit Inria node) to collect molecular dynamics (MD) datasets and files: | ||||||
|
|
||||||
| ```bash | ||||||
| uv run scrape-mddb --output-dir data | ||||||
| ``` | ||||||
|
|
||||||
| This command will: | ||||||
|
|
||||||
| 1. Search for molecular dynamics datasets and files through the MDposit API nodes. | ||||||
| 2. Parse metadata and validate them using the Pydantic models | ||||||
| `DatasetMetadata` and `FileMetadata`. | ||||||
| 3. Save validated files and datasets metadata. | ||||||
|
|
||||||
| The scraping takes about 13 minutes. | ||||||
|
||||||
| The scraping takes about 13 minutes. | |
| The scraping may take several minutes, depending on your network connection and hardware. |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,77 @@ | ||||||||||||||||||||||||||||||
| # MDDB | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| > The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| The MDDB infrastructure is distributed across **two MDposit nodes**. Both nodes expose the same REST API entry points. The only difference is the base URL used to access the API. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ## MDposit MMB node | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - web site: <https://mmb-dev.mddbr.eu/#/browse> | ||||||||||||||||||||||||||||||
| - documentation: <https://mmb.mddbr.eu/#/help> | ||||||||||||||||||||||||||||||
| - API: <https://mmb.mddbr.eu/api/rest/docs/> | ||||||||||||||||||||||||||||||
| - API base URL: <https://mmb.mddbr.eu/api/rest/v1> | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ## MDposit INRIA node | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - web site: <https://dynarepo.inria.fr/#/browse> | ||||||||||||||||||||||||||||||
| - documentation: <https://dynarepo.inria.fr/#/help> | ||||||||||||||||||||||||||||||
| - API: <https://dynarepo.inria.fr/api/rest/docs/> | ||||||||||||||||||||||||||||||
| - API base URL: <https://inria.mddbr.eu/api/rest/v1> | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| No account / token is needed to access the MDposit API. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ## Finding molecular dynamics datasets and files | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ### Datasets | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mmb.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)". A project can contain multiple replicas, each identified by `project_id.replica_id`. | ||||||||||||||||||||||||||||||
Essmaw marked this conversation as resolved.
Outdated
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| For example, the project [A026F](https://mmb.mddbr.eu/#/id/A026F/overview) contains four replicas: | ||||||||||||||||||||||||||||||
Essmaw marked this conversation as resolved.
Outdated
|
||||||||||||||||||||||||||||||
| - `A026F.1`: https://mmb.mddbr.eu/#/id/A026F.1/overview | ||||||||||||||||||||||||||||||
| - `A026F.2`: https://mmb.mddbr.eu/#/id/A026F.2/overview | ||||||||||||||||||||||||||||||
| - `A026F.3`: https://mmb.mddbr.eu/#/id/A026F.3/overview | ||||||||||||||||||||||||||||||
| - `A026F.4`: https://mmb.mddbr.eu/#/id/A026F.4/overview | ||||||||||||||||||||||||||||||
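A minimal sketch of how the `project_id.replica_id` convention above could be split into its two parts. The names `ReplicaRef` and `parse_accession` are hypothetical and not part of this PR, and the sketch assumes that replica identifiers are positive integers, as in the `A026F.1` to `A026F.4` examples.

```python
from typing import NamedTuple, Optional


class ReplicaRef(NamedTuple):
    """Hypothetical container for a parsed MDposit accession."""

    project_id: str
    replica_id: Optional[int]  # None when no replica suffix is given


def parse_accession(accession: str) -> ReplicaRef:
    """Split an accession such as 'A026F.1' into project and replica parts.

    Assumption: the replica suffix, when present, is a plain integer.
    """
    project_id, separator, replica = accession.partition(".")
    if not project_id:
        raise ValueError(f"Missing project id in accession: {accession!r}")
    if not separator:
        # Bare project accession, e.g. 'A026F'.
        return ReplicaRef(project_id, None)
    if not replica.isdigit():
        raise ValueError(f"Invalid replica id in accession: {accession!r}")
    return ReplicaRef(project_id, int(replica))
```

For instance, `parse_accession("A026F.3")` yields a `ReplicaRef` with `project_id="A026F"` and `replica_id=3`, while a bare `"A026F"` yields `replica_id=None`.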
|
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| API endpoint to search for all datasets at once: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - Endpoint: `/projects` | ||||||||||||||||||||||||||||||
| - HTTP method: GET | ||||||||||||||||||||||||||||||
| - [documentation](https://mmb.mddbr.eu/api/rest/docs/#/projects/get_projects) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ### Files | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| API endpoint to get files for a given replica of a project: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - Endpoint: `/projects/{project_id.replica_id}/filenotes` | ||||||||||||||||||||||||||||||
| - HTTP method: GET | ||||||||||||||||||||||||||||||
| - [documentation](https://mmb.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes) | ||||||||||||||||||||||||||||||
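Since both nodes are described as exposing the same endpoints behind different base URLs, the two endpoints above can be assembled from a node name. This is only a sketch: the dictionary `API_BASE_URLS` and the helper names are hypothetical, the base URLs are the ones listed in these notes, and no HTTP request is made.

```python
# Base URLs as listed in these notes; the dictionary name is hypothetical.
API_BASE_URLS = {
    "mmb": "https://mmb.mddbr.eu/api/rest/v1",
    "inria": "https://inria.mddbr.eu/api/rest/v1",
}


def projects_url(node: str) -> str:
    """Return the URL of the `/projects` endpoint for a given node."""
    return f"{API_BASE_URLS[node]}/projects"


def filenotes_url(node: str, project_id: str, replica_id: int) -> str:
    """Return the URL of the `/projects/{project_id.replica_id}/filenotes` endpoint."""
    return f"{API_BASE_URLS[node]}/projects/{project_id}.{replica_id}/filenotes"


if __name__ == "__main__":
    print(projects_url("mmb"))
    print(filenotes_url("mmb", "A026F", 1))
```

Both URLs are meant to be queried with HTTP GET. Keeping the base URL as the only per-node setting mirrors the statement that the nodes differ only in that respect.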
|
|
||||||||||||||||||||||||||||||
| ## Examples | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ### Project `A026F` | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - Project id: `A026F.1` | ||||||||||||||||||||||||||||||
| - [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A026F.1/overview) | ||||||||||||||||||||||||||||||
| - [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A026F.1) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| Description: | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| > Multi-scale simulation approaches which couple the molecular and neuronal simulations to predict the variation in the membrane potential and the neural spikes. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A026F.1/files) | ||||||||||||||||||||||||||||||
| - [files on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A026F.1/filenotes) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| ### Project `A025U` | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - Project id: `A025U.1` | ||||||||||||||||||||||||||||||
| - [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview) | ||||||||||||||||||||||||||||||
| - [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| Remark: no description is provided for this dataset. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| - [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files) | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
| - Project id: `A025U.1` | |
| - [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview) | |
| - [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2) | |
| Remark: no description is provided for this dataset. | |
| - [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files) | |
| - Project id: `A025U.2` | |
| - [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/overview) | |
| - [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2) | |
| Remark: no description is provided for this dataset. | |
| - [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/files) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -41,6 +41,7 @@ extend-select = [ | |
| ignore = [ | ||
| "COM812", # Redundant with ruff formatter. See: https://docs.astral.sh/ruff/rules/missing-trailing-comma/ | ||
| "G004", # f-strings are allowed with the loguru module. See https://docs.astral.sh/ruff/rules/logging-f-string/ | ||
| "PERF401", # list.extend suggestion is not applicable when appending model instances. | ||
| ] | ||
|
Comment on lines 41 to 44
|
||
|
|
||
| # Force numpy-style for docstrings | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -3,9 +3,16 @@ | |||||||
| import re | ||||||||
| from typing import Annotated | ||||||||
|
|
||||||||
| from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator | ||||||||
| from pydantic import ( | ||||||||
| BaseModel, | ||||||||
| ConfigDict, | ||||||||
| Field, | ||||||||
| StringConstraints, | ||||||||
| field_validator, | ||||||||
| model_validator, | ||||||||
| ) | ||||||||
|
|
||||||||
| from .enums import ExternalDatabaseName | ||||||||
| from .enums import ExternalDatabaseName, MoleculeType | ||||||||
|
|
||||||||
| DOI = Annotated[ | ||||||||
| str, | ||||||||
|
|
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel): | |||||||
| None, min_length=1, description="Direct URL to the identifier into the database" | ||||||||
| ) | ||||||||
|
|
||||||||
| @model_validator(mode="after") | ||||||||
| def compute_url(self) -> "ExternalIdentifier": | ||||||||
| """Compute the URL for the external identifier. | ||||||||
|
|
||||||||
| Parameters | ||||||||
| ---------- | ||||||||
| self: ExternalIdentifier | ||||||||
| The model instance being validated, with all fields already validated. | ||||||||
|
|
||||||||
| Returns | ||||||||
| ------- | ||||||||
| ExternalIdentifier | ||||||||
| The model instance with the URL field computed if it was not provided. | ||||||||
| """ | ||||||||
| if self.url is not None: | ||||||||
| return self | ||||||||
|
|
||||||||
| if self.database_name == ExternalDatabaseName.PDB: | ||||||||
| self.url = f"https://www.rcsb.org/structure/{self.identifier}" | ||||||||
| elif self.database_name == ExternalDatabaseName.UNIPROT: | ||||||||
| self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}" | ||||||||
|
|
||||||||
| return self | ||||||||
|
|
||||||||
|
|
||||||||
| class Molecule(BaseModel): | ||||||||
| """Molecule in a simulation.""" | ||||||||
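The URL-derivation logic of `compute_url` above can be tried in isolation with the sketch below. It deliberately swaps Pydantic for a standard-library dataclass with `__post_init__`, so it is not the PR's implementation. The class names `DatabaseName` and `Identifier` and the enum values are hypothetical; only the two URL patterns are taken from the diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DatabaseName(str, Enum):
    """Hypothetical stand-in for ExternalDatabaseName (values are assumed)."""

    PDB = "pdb"
    UNIPROT = "uniprot"


# URL patterns taken from the diff above.
URL_TEMPLATES = {
    DatabaseName.PDB: "https://www.rcsb.org/structure/{identifier}",
    DatabaseName.UNIPROT: "https://www.uniprot.org/uniprotkb/{identifier}",
}


@dataclass
class Identifier:
    """Dataclass stand-in for the Pydantic ExternalIdentifier model."""

    database_name: DatabaseName
    identifier: str
    url: Optional[str] = None

    def __post_init__(self) -> None:
        # Keep an explicitly provided URL; otherwise derive one if a pattern is known.
        if self.url is None:
            template = URL_TEMPLATES.get(self.database_name)
            if template is not None:
                self.url = template.format(identifier=self.identifier)
```

As in the validator, an explicitly provided `url` is left untouched, and a database without a known pattern keeps `url` as `None`.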
|
|
@@ -45,18 +76,24 @@ class Molecule(BaseModel): | |||||||
| model_config = ConfigDict(extra="forbid") | ||||||||
|
|
||||||||
| name: str = Field(..., description="Name of the molecule.") | ||||||||
| type: MoleculeType | None = Field( | ||||||||
| None, | ||||||||
| description="Type of the molecule." | ||||||||
Essmaw marked this conversation as resolved.
|
||||||||
| "Allowed values in the MoleculeType enum. " | ||||||||
|
||||||||
| "Allowed values in the MoleculeType enum. " | |
| " Allowed values in the MoleculeType enum. " |
Essmaw marked this conversation as resolved.
Outdated
Please keep software as it is. Software with an S at the end does not exist.
Also this will break all previous scrapers. Avoid modifying data models in a PR without discussing it first.
Copilot AI, Feb 6, 2026
The SimulationMetadata field 'software' has been renamed to 'softwares' in the model, but the existing GPCRMD scraper (at src/mdverse_scrapers/scrapers/gpcrmd.py:400) still uses the old field name 'software'. This will cause validation errors when the scraper runs. The field name needs to be updated to 'softwares' to match the model change.
| None, | |
| None, | |
| alias="software", |
The text says the scraper targets only the MMB and INRIA nodes, but the implementation includes an additional CINECA node in `MDDB_NODES`. Update the README wording to match the actual supported nodes (or clarify which nodes are scraped).