MDverse · pierrepo · Feb 11, 2026 · Jan 19, 2026
diff --git a/README.md b/README.md
@@ -171,6 +171,26 @@ This command will:
 5. Save the extracted metadata to Parquet files
 
 
+## Scrape MDDB
+
+Have a look at the notes regarding [MDDB](docs/mddb.md) and its API.
+
+Scrape MDDB (MDposit MMB, INRIA, and CINECA nodes) to collect molecular dynamics (MD) datasets and files:
+
+```bash
+uv run scrape-mddb --output-dir data
+```
+
+This command will:
+
+1. Search for molecular dynamics datasets and files through the MDposit API nodes.
+2. Parse metadata and validate them using the Pydantic models
+   `DatasetMetadata` and `FileMetadata`.
+3. Save the validated dataset and file metadata.
+
+The scraping may take several minutes, depending on your network connection and hardware.
+
+
 ## Analyze Gromacs mdp and gro files
 
 ### Download files

diff --git a/docs/mddb.md b/docs/mddb.md
@@ -0,0 +1,77 @@
+# MDDB
+
+> The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research.
+
+The MDDB infrastructure is distributed across **two MDposit nodes**. Both nodes expose the same REST API entry points. The only difference is the base URL used to access the API.
+
+## MDposit MMB node
+
+- web site: <https://mmb-dev.mddbr.eu/#/browse>
+- documentation: <https://mmb.mddbr.eu/#/help>
+- API: <https://mmb.mddbr.eu/api/rest/docs/>
+- API base URL: <https://mmb.mddbr.eu/api/rest/v1>
+
+## MDposit INRIA node
+
+- web site: <https://dynarepo.inria.fr/#/browse>
+- documentation: <https://dynarepo.inria.fr/#/help>
+- API: <https://dynarepo.inria.fr/api/rest/docs/>
+- API base URL: <https://inria.mddbr.eu/api/rest/v1>
+
+
+No account or token is needed to access the MDposit API.
+
+## Finding molecular dynamics datasets and files
+
+### Datasets
+
+In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mmb.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)". A project can contain multiple replicas, each identified by `project_id.replica_id`.
+
+
+For example, the project [A026F](https://mmb.mddbr.eu/#/id/A026F/overview) contains four replicas:
+  - `A026F.1`: https://mmb.mddbr.eu/#/id/A026F.1/overview
+  - `A026F.2`: https://mmb.mddbr.eu/#/id/A026F.2/overview
+  - `A026F.3`: https://mmb.mddbr.eu/#/id/A026F.3/overview
+  - `A026F.4`: https://mmb.mddbr.eu/#/id/A026F.4/overview
+
+
+API endpoint to search for all datasets at once:
+
+- Endpoint: `/projects`
+- HTTP method: GET
+- [documentation](https://mmb.mddbr.eu/api/rest/docs/#/projects/get_projects)
+
+
+### Files
+
+API endpoint to get files for a given replica of a project:
+
+- Endpoint: `/projects/{project_id.replica_id}/filenotes`
+- HTTP method: GET
+- [documentation](https://mmb.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes)
+
+## Examples
+
+### Project `A026F`
+
+- Project id: `A026F.1`
+- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A026F.1/overview)
+- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A026F.1)
+
+Description:
+
+> Multi-scale simulation approaches which couple the molecular and neuronal simulations to predict the variation in the membrane potential and the neural spikes.
+
+- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A026F.1/files)
+- [files on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A026F.1/filenotes)
+
+### Project `A025U`
+
+- Project id: `A025U.2`
+- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/overview)
+- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
+
+Remark: no description is provided for this dataset.
+
+- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/files)
+- [files on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2/filenotes)
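
Reviewer note: the endpoint layout described in `docs/mddb.md` can be sketched as a few URL helpers. This is a minimal sketch, not the scraper's actual code. The function names `projects_url`, `filenotes_url`, and `split_accession` are hypothetical, the base URLs are the ones listed in the document, and the integer replica number is an assumption drawn from the `A026F.1` to `A026F.4` examples. No request is sent, so nothing is assumed about the shape of the API responses.

```python
from urllib.parse import quote

# Base URLs as listed in docs/mddb.md.
MMB_BASE_URL = "https://mmb.mddbr.eu/api/rest/v1"
INRIA_BASE_URL = "https://inria.mddbr.eu/api/rest/v1"


def projects_url(base_url: str) -> str:
    """Return the URL of the `/projects` endpoint for a given node."""
    return f"{base_url.rstrip('/')}/projects"


def filenotes_url(base_url: str, project_id: str, replica_id: int) -> str:
    """Return the URL of the `/projects/{project_id.replica_id}/filenotes` endpoint."""
    accession = f"{project_id}.{replica_id}"
    return f"{base_url.rstrip('/')}/projects/{quote(accession)}/filenotes"


def split_accession(accession: str) -> tuple[str, int]:
    """Split `project_id.replica_id` into its two parts.

    Assumption: the replica part is a plain integer, as in `A026F.1`.
    """
    project_id, _, replica = accession.partition(".")
    if not project_id or not replica.isdigit():
        raise ValueError(f"Not a project_id.replica_id accession: {accession!r}")
    return project_id, int(replica)
```

Because both nodes expose the same entry points, only the base URL changes between `MMB_BASE_URL` and `INRIA_BASE_URL`.
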
diff --git a/pyproject.toml b/pyproject.toml
@@ -73,3 +73,4 @@ scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
 scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
 scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
 scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"
+scrape-mddb = "mdverse_scrapers.scrapers.mddb:main"
diff --git a/ruff.toml b/ruff.toml
@@ -41,6 +41,7 @@ extend-select = [
 ignore = [
     "COM812", # Redundant with ruff formatter. See: https://docs.astral.sh/ruff/rules/missing-trailing-comma/
     "G004", # f-strings are allowed with the loguru module. See https://docs.astral.sh/ruff/rules/logging-f-string/
+    "PERF401", # list.extend suggestion is not applicable when appending model instances.
 ]
 
 # Force numpy-style for docstrings

diff --git a/src/mdverse_scrapers/models/dataset.py b/src/mdverse_scrapers/models/dataset.py
@@ -170,7 +170,7 @@ def format_dates(cls, value: datetime | str | None) -> str | None:
 
         Parameters
         ----------
-        cls : type[BaseDataset]
+        cls : type[DatasetMetadata]
             The Pydantic model class being validated.
         value : datetime | str | None
             The input value of the 'date' field to validate.

diff --git a/src/mdverse_scrapers/models/enums.py b/src/mdverse_scrapers/models/enums.py
@@ -20,10 +20,24 @@ class DatasetSourceName(StrEnum):
     ATLAS = "atlas"
     GPCRMD = "gpcrmd"
     NMRLIPIDS = "nmrlipids"
+    MDDB = "mddb"
+    MDPOSIT_INRIA_NODE = "mdposit_inria_node"
+    MDPOSIT_MMB_NODE = "mdposit_mmb_node"
 
 
 class ExternalDatabaseName(StrEnum):
     """External database names."""
 
     PDB = "pdb"
     UNIPROT = "uniprot"
+
+
+class MoleculeType(StrEnum):
+    """Common molecular types found in molecular dynamics simulations."""
+
+    PROTEIN = "protein"
+    NUCLEIC_ACID = "nucleic_acid"
+    ION = "ion"
+    LIPID = "lipid"
+    CARBOHYDRATE = "carbohydrate"
+    SOLVENT = "solvent"
diff --git a/src/mdverse_scrapers/models/simulation.py b/src/mdverse_scrapers/models/simulation.py
@@ -3,9 +3,16 @@
 import re
 from typing import Annotated
 
-from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator
+from pydantic import (
+    BaseModel,
+    ConfigDict,
+    Field,
+    StringConstraints,
+    field_validator,
+    model_validator,
+)
 
-from .enums import ExternalDatabaseName
+from .enums import ExternalDatabaseName, MoleculeType
 
 DOI = Annotated[
     str,
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel):
         None, min_length=1, description="Direct URL to the identifier into the database"
     )
 
+    @model_validator(mode="after")
+    def compute_url(self) -> "ExternalIdentifier":
+        """Compute the URL for the external identifier.
+
+        Parameters
+        ----------
+        self : ExternalIdentifier
+            The model instance being validated, with all fields already validated.
+
+        Returns
+        -------
+        ExternalIdentifier
+            The model instance with the URL field computed if it was not provided.
+        """
+        if self.url is not None:
+            return self
+
+        if self.database_name == ExternalDatabaseName.PDB:
+            self.url = f"https://www.rcsb.org/structure/{self.identifier}"
+        elif self.database_name == ExternalDatabaseName.UNIPROT:
+            self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}"
+
+        return self
+
 
 class Molecule(BaseModel):
     """Molecule in a simulation."""
@@ -45,18 +76,24 @@ class Molecule(BaseModel):
     model_config = ConfigDict(extra="forbid")
 
     name: str = Field(..., description="Name of the molecule.")
+    type: MoleculeType | None = Field(
+        None,
+        description="Type of the molecule."
+        " Allowed values in the MoleculeType enum. "
+        "Examples: PROTEIN, ION, LIPID...",
+    )
+    number_of_this_molecule_type_in_system: int | None = Field(
+        None,
+        ge=0,
+        description="Number of molecules of this type in the simulation.",
+    )
     number_of_atoms: int | None = Field(
         None, ge=0, description="Number of atoms in the molecule."
     )
     formula: str | None = Field(None, description="Chemical formula of the molecule.")
     sequence: str | None = Field(
         None, description="Sequence of the molecule for protein and nucleic acid."
     )
-    number_of_molecules: int | None = Field(
-        None,
-        ge=0,
-        description="Number of molecules of this type in the simulation.",
-    )
     external_identifiers: list[ExternalIdentifier] | None = Field(
         None,
         description=("List of external database identifiers for this molecule."),
@@ -103,7 +140,7 @@ class SimulationMetadata(BaseModel):
     # Ensure scraped metadata matches the expected schema exactly.
     model_config = ConfigDict(extra="forbid")
 
-    software: list[Software] | None = Field(
+    softwares: list[Software] | None = Field(
         None,
+        alias="software",
         description="List of molecular dynamics tool or software.",
     )
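
Reviewer note: the URL fallback performed by the `compute_url` validator added to `ExternalIdentifier` can be illustrated without Pydantic. This is a standard-library sketch of the same decision logic, not the code in the PR. The function name `default_url` and the `URL_TEMPLATES` mapping are hypothetical, and the keys `"pdb"` and `"uniprot"` are the string values of `ExternalDatabaseName`.

```python
from typing import Optional

# URL patterns copied from the compute_url validator in the diff above.
URL_TEMPLATES = {
    "pdb": "https://www.rcsb.org/structure/{identifier}",
    "uniprot": "https://www.uniprot.org/uniprotkb/{identifier}",
}


def default_url(
    database_name: str, identifier: str, url: Optional[str] = None
) -> Optional[str]:
    """Return the explicit URL if given, else a URL derived from the database name."""
    # An explicitly provided URL always wins, as in compute_url.
    if url is not None:
        return url
    template = URL_TEMPLATES.get(database_name)
    # Unknown databases keep url as None, mirroring the if/elif in the validator.
    if template is None:
        return None
    return template.format(identifier=identifier)
```

A table of templates keeps the fallback rule in one place, so supporting another database would mean adding one entry rather than another `elif` branch.
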