Changes from 16 commits
Commits
45 commits
0a37b89
feat: add MDposit dataset scraping script.
Jan 19, 2026
caf2865
feat(models): add MDPOSIT repository and MDDB project fields.
Jan 19, 2026
9147f32
feat(cli): add README command and scrape-mdposit entry point.
Jan 19, 2026
f809832
merge: sync main into update-mdposit-scraper
Jan 29, 2026
e1a4e9d
refactor(simulation-model): add molecule type field (protein, lipid, …
Jan 29, 2026
fb283e1
chore(ruff): disable PERF401 for model instance appends
Jan 29, 2026
064d94b
refactor(mdposit-scraper): update to scrape using both nodes of MDDB …
Jan 29, 2026
e150d24
docs: adding the mddb documentation + update the readme and command …
Feb 4, 2026
e3c5e38
feat: refactor the code and resolve AttributeError
Feb 4, 2026
cfe2622
merge: sync main into update-mdposit-scraper
Essmaw Feb 5, 2026
5b01789
feat: add URL computation for ExternalIdentifier based on database name
Essmaw Feb 5, 2026
5533d8b
Fix merging of new datasource names into DatasetSourceName instead of…
Essmaw Feb 5, 2026
9ebc838
feat: enhance molecule extraction to fit the new model and adding Un…
Essmaw Feb 5, 2026
96793e5
test(simulation): test URL computation for ExternalIdentifier
Essmaw Feb 5, 2026
f031e28
tests: refactor tests for ExternalIdentifier to account for automatic…
Essmaw Feb 5, 2026
6cb949d
refactor: rename number_of_molecules to number_of_this_molecule_type_…
Essmaw Feb 5, 2026
3871d22
refactor: rename number_of_this_molecule_type_in_system to number_of_…
Essmaw Feb 6, 2026
c9be76f
tests: refactor with `number_of_molecules` attribute and adding speci…
Essmaw Feb 6, 2026
542f54a
fixes(mddb scraper): correct spelling errors, improve parameter descr…
Essmaw Feb 6, 2026
21943fc
docs: correct spelling errors
Essmaw Feb 6, 2026
d826989
fix: Revert to 'software' field
pierrepo Feb 6, 2026
671008c
refactor: Reduce usage and scope of try/except blocks
pierrepo Feb 6, 2026
f987ea7
feat: Add default DatasetSourceName
pierrepo Feb 7, 2026
059d51f
feat: Coexerce verstion to str
pierrepo Feb 7, 2026
ebf4470
docs: Update MDDB documentation and examples
pierrepo Feb 7, 2026
63181fa
refactor: Remove more try/except
pierrepo Feb 7, 2026
7a5f580
refactor: Split log message
pierrepo Feb 7, 2026
d0324ee
fix: Fix error when forcefield metadata is undifiend
pierrepo Feb 7, 2026
8b57c76
fix: Handle case with no protein sequence nor Uniprot identifier
pierrepo Feb 7, 2026
024efa9
fix: Handle case when no software is available
pierrepo Feb 7, 2026
88b9955
feat: Add InChIKey field for Molecule model
pierrepo Feb 7, 2026
dd724a7
fix: Fix dataset_url_in_repository field
pierrepo Feb 7, 2026
9e0374f
docs: Print dataset URL in API
pierrepo Feb 7, 2026
6b959da
feat: Align uniprot identifiers with protein sequences
pierrepo Feb 7, 2026
e3a353c
feat: Add replicas logic in file metadata extraction
pierrepo Feb 7, 2026
7068584
feat: Add rules to avoid lengthy try / except blocks
pierrepo Feb 7, 2026
9cd0a88
fix: Add special case for 'inr' (INRIA) node name
pierrepo Feb 7, 2026
40ea3ca
feat: Add Cineca MDDB node
pierrepo Feb 8, 2026
a8ed77b
feat: Add another way to get protein name from Uniprot
pierrepo Feb 8, 2026
7884275
fix: Update logic to fetch protein name from Uniprot
pierrepo Feb 8, 2026
71f7c43
docs: Fix typos
pierrepo Feb 11, 2026
3d003b3
docs: Relax scraping time
pierrepo Feb 11, 2026
cf32a04
chore: Reallow PERF401 rules
pierrepo Feb 11, 2026
91595f1
docs: Remove MDDB node names
pierrepo Feb 11, 2026
6658973
refactor: Clean code
pierrepo Feb 11, 2026
20 changes: 20 additions & 0 deletions README.md
@@ -171,6 +171,26 @@ This command will:
5. Save the extracted metadata to Parquet files


## Scrape MDDB

Have a look at the notes regarding [MDDB](docs/mddb.md) and its API.

Scrape MDDB (MDposit MMB node and MDposit Inria node) to collect molecular dynamics (MD) datasets and files:
Copilot AI Feb 8, 2026

The text says the scraper targets only the MMB and INRIA nodes, but the implementation includes an additional CINECA node in MDDB_NODES. Update the README wording to match the actual supported nodes (or clarify which nodes are scraped).

Suggested change
Scrape MDDB (MDposit MMB node and MDposit Inria node) to collect molecular dynamics (MD) datasets and files:
Scrape MDDB (MDposit MMB, INRIA, and CINECA nodes) to collect molecular dynamics (MD) datasets and files:


```bash
uv run scrape-mddb --output-dir data
```

This command will:

1. Search for molecular dynamics datasets and files through the MDposit API nodes.
2. Parse the metadata and validate it using the Pydantic models
`DatasetMetadata` and `FileMetadata`.
3. Save the validated dataset and file metadata.

The scraping takes about 13 minutes.
Copilot AI Feb 8, 2026

The statement “The scraping takes about 13 minutes.” is environment-dependent and likely to become outdated. Consider rephrasing to something less specific (e.g., “may take ~X minutes depending on network/CPU”) or dropping the timing entirely.

Suggested change
The scraping takes about 13 minutes.
The scraping may take several minutes, depending on your network connection and hardware.



## Analyze Gromacs mdp and gro files

### Download files
77 changes: 77 additions & 0 deletions docs/mddb.md
@@ -0,0 +1,77 @@
# MDDB

> The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research.

The MDDB infrastructure is distributed across **two MDposit nodes**. Both nodes expose the same REST API entry points. The only difference is the base URL used to access the API.

## MDposit MMB node

- web site: <https://mmb-dev.mddbr.eu/#/browse>
- documentation: <https://mmb.mddbr.eu/#/help>
- API: <https://mmb.mddbr.eu/api/rest/docs/>
- API base URL: <https://mmb.mddbr.eu/api/rest/v1>

## MDposit INRIA node

- web site: <https://dynarepo.inria.fr/#/browse>
- documentation: <https://dynarepo.inria.fr/#/help>
- API: <https://dynarepo.inria.fr/api/rest/docs/>
- API base URL: <https://inria.mddbr.eu/api/rest/v1>


No account / token is needed to access the MDposit API.

## Finding molecular dynamics datasets and files

### Datasets

In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mmb.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)". A project can contain multiple replicas, each identified by `project_id.replica_id`.


For example, the project [A026F](https://mmb.mddbr.eu/#/id/A026F/overview) contains four replicas:
- `A026F.1`: https://mmb.mddbr.eu/#/id/A026F.1/overview
- `A026F.2`: https://mmb.mddbr.eu/#/id/A026F.2/overview
- `A026F.3`: https://mmb.mddbr.eu/#/id/A026F.3/overview
- `A026F.4`: https://mmb.mddbr.eu/#/id/A026F.4/overview
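
The `project_id.replica_id` convention can be sketched in Python as follows. This is a minimal illustration rather than code from the scraper: the helper names `replica_accession` and `gui_overview_url` are hypothetical, replica numbers are assumed to start at 1, and the URL pattern is simply the one visible in the links above.

```python
# GUI prefix taken from the replica links listed above.
MMB_GUI_URL = "https://mmb.mddbr.eu/#/id"


def replica_accession(project_id: str, replica_id: int) -> str:
    """Join a project id and a replica number, e.g. 'A026F' and 1 give 'A026F.1'."""
    if replica_id < 1:
        # Assumption: replicas are numbered from 1, as in the A026F example.
        raise ValueError("replica_id is assumed to start at 1")
    return f"{project_id}.{replica_id}"


def gui_overview_url(accession: str) -> str:
    """Build the overview page URL for a project or replica accession."""
    return f"{MMB_GUI_URL}/{accession}/overview"


# Print the overview URL of each of the four A026F replicas.
for replica in range(1, 5):
    print(gui_overview_url(replica_accession("A026F", replica)))
```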


API endpoint to search for all datasets at once:

- Endpoint: `/projects`
- HTTP method: GET
- [documentation](https://mmb.mddbr.eu/api/rest/docs/#/projects/get_projects)
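
A minimal sketch of calling this endpoint with the Python standard library, assuming the MMB API base URL listed earlier. The function names `projects_url` and `fetch_json` are hypothetical, and no query parameters or response fields are assumed, since they are not described in these notes (see the API documentation for those).

```python
import json
import urllib.request

# API base URL of the MMB node, as listed in the node description.
MMB_API_BASE_URL = "https://mmb.mddbr.eu/api/rest/v1"


def projects_url(base_url: str = MMB_API_BASE_URL) -> str:
    """Return the URL of the /projects endpoint for a given node."""
    return f"{base_url.rstrip('/')}/projects"


def fetch_json(url: str, timeout: float = 30.0):
    """Send a GET request and decode the JSON body (no account or token needed)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network request, so it is left commented out):
# payload = fetch_json(projects_url())
```

Since both nodes expose the same entry points, passing another node's base URL to `projects_url` should be the only change needed to query it.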


### Files

API endpoint to get files for a given replica of a project:

- Endpoint: `/projects/{project_id.replica_id}/filenotes`
- HTTP method: GET
- [documentation](https://mmb.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes)
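
The path parameter combines the project id and the replica number. A small sketch of building that URL, with a hypothetical helper name (`filenotes_url`) and the MMB API base URL listed earlier as an assumed default:

```python
from urllib.parse import quote

# API base URL of the MMB node, as listed in the node description.
MMB_API_BASE_URL = "https://mmb.mddbr.eu/api/rest/v1"


def filenotes_url(
    project_id: str, replica_id: int, base_url: str = MMB_API_BASE_URL
) -> str:
    """Return the /projects/{project_id.replica_id}/filenotes URL."""
    accession = f"{project_id}.{replica_id}"
    # Percent-encode the accession in case it contains unexpected characters.
    return f"{base_url.rstrip('/')}/projects/{quote(accession, safe='')}/filenotes"


print(filenotes_url("A026F", 1))
```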

## Examples

### Project `A026F`

- Project id: `A026F.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A026F.1/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A026F.1)

Description:

> Multi-scale simulation approaches which couple the molecular and neuronal simulations to predict the variation in the membrane potential and the neural spikes.

- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A026F.1/files)
- [files on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A026F.1/filenotes)

### Project `A025U`

- Project id: `A025U.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)

Remark: no description is provided for this dataset.

- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
Copilot AI Feb 6, 2026

In the A025U example, the text says the project id is A025U.1, but both the API link and the files link use A025U.2. Please make the example consistent so readers don't try to query the wrong replica.

Suggested change
- Project id: `A025U.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
Remark: no description is provided for this dataset.
- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
- Project id: `A025U.2`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
Remark: no description is provided for this dataset.
- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/files)

- [files on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2/filenotes)
1 change: 1 addition & 0 deletions pyproject.toml
@@ -73,3 +73,4 @@ scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"
scrape-mddb = "mdverse_scrapers.scrapers.mddb:main"
1 change: 1 addition & 0 deletions ruff.toml
@@ -41,6 +41,7 @@ extend-select = [
ignore = [
    "COM812", # Redundant with ruff formatter. See: https://docs.astral.sh/ruff/rules/missing-trailing-comma/
    "G004", # f-strings are allowed with the loguru module. See https://docs.astral.sh/ruff/rules/logging-f-string/
    "PERF401", # list.extend suggestion is not applicable when appending model instances.
]
Comment on lines 41 to 44
Copilot AI Feb 8, 2026

Ignoring PERF401 globally reduces signal across the whole repo. Since the codebase already suppresses this rule locally where needed (e.g., src/mdverse_scrapers/scrapers/nomad.py uses # noqa: PERF401), it would be better to keep the rule enabled globally and use targeted noqa/per-file ignores for the specific false positives.
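
To illustrate the targeted suppression suggested here, a minimal Python sketch: the rule stays enabled globally and only the specific line carries a `# noqa: PERF401` marker. The `FileRecord` class and `build_records` function are hypothetical stand-ins, not code from this repository.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a validated model instance."""

    name: str
    size: int


def build_records(raw_items: list[dict]) -> list[FileRecord]:
    """Convert raw dictionaries to records, skipping entries without a name."""
    records = []
    for item in raw_items:
        if item.get("name"):
            records.append(FileRecord(**item))  # noqa: PERF401
    return records
```

The marker silences the hint for that one line only, so the rule keeps reporting elsewhere in the codebase.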


# Force numpy-style for docstrings
2 changes: 1 addition & 1 deletion src/mdverse_scrapers/models/dataset.py
@@ -170,7 +170,7 @@ def format_dates(cls, value: datetime | str | None) -> str | None:

Parameters
----------
- cls : type[BaseDataset]
+ cls : type[DatasetMetadata]
The Pydantic model class being validated.
value : datetime | str | None
The input value of the 'date' field to validate.
14 changes: 14 additions & 0 deletions src/mdverse_scrapers/models/enums.py
@@ -20,10 +20,24 @@ class DatasetSourceName(StrEnum):
    ATLAS = "atlas"
    GPCRMD = "gpcrmd"
    NMRLIPIDS = "nmrlipids"
    MDDB = "mddb"
    MDPOSIT_INRIA_NODE = "mdposit_inria_node"
    MDPOSIT_MMB_NODE = "mdposit_mmb_node"


class ExternalDatabaseName(StrEnum):
    """External database names."""

    PDB = "pdb"
    UNIPROT = "uniprot"


class MoleculeType(StrEnum):
    """Common molecular types found in molecular dynamics simulations."""

    PROTEIN = "protein"
    NUCLEIC_ACID = "nucleic_acid"
    ION = "ion"
    LIPID = "lipid"
    CARBOHYDRATE = "carbohydrate"
    SOLVENT = "solvent"
53 changes: 45 additions & 8 deletions src/mdverse_scrapers/models/simulation.py
@@ -3,9 +3,16 @@
import re
from typing import Annotated

-from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator
+from pydantic import (
+    BaseModel,
+    ConfigDict,
+    Field,
+    StringConstraints,
+    field_validator,
+    model_validator,
+)

-from .enums import ExternalDatabaseName
+from .enums import ExternalDatabaseName, MoleculeType

DOI = Annotated[
    str,
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel):
        None, min_length=1, description="Direct URL to the identifier into the database"
    )

    @model_validator(mode="after")
    def compute_url(self) -> "ExternalIdentifier":
        """Compute the URL for the external identifier.

        Parameters
        ----------
        self: ExternalIdentifier
            The model instance being validated, with all fields already validated.

        Returns
        -------
        ExternalIdentifier
            The model instance with the URL field computed if it was not provided.
        """
        if self.url is not None:
            return self

        if self.database_name == ExternalDatabaseName.PDB:
            self.url = f"https://www.rcsb.org/structure/{self.identifier}"
        elif self.database_name == ExternalDatabaseName.UNIPROT:
            self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}"

        return self


class Molecule(BaseModel):
    """Molecule in a simulation."""
@@ -45,18 +76,24 @@ class Molecule(BaseModel):
    model_config = ConfigDict(extra="forbid")

    name: str = Field(..., description="Name of the molecule.")
    type: MoleculeType | None = Field(
        None,
        description="Type of the molecule."
        "Allowed values in the MoleculeType enum. "
Copilot AI Feb 6, 2026

Missing space in description: Line 81 starts with 'Type of the molecule.' but line 82 continues with 'Allowed values' without a space between sentences. There should be a space at the beginning of line 82: ' Allowed values in the MoleculeType enum. '

Suggested change
"Allowed values in the MoleculeType enum. "
" Allowed values in the MoleculeType enum. "

        "Examples: PROTEIN, ION, LIPID...",
    )
    number_of_this_molecule_type_in_system: int | None = Field(
        None,
        ge=0,
        description="Number of molecules of this type in the simulation.",
    )
    number_of_atoms: int | None = Field(
        None, ge=0, description="Number of atoms in the molecule."
    )
    formula: str | None = Field(None, description="Chemical formula of the molecule.")
    sequence: str | None = Field(
        None, description="Sequence of the molecule for protein and nucleic acid."
    )
    number_of_molecules: int | None = Field(
        None,
        ge=0,
        description="Number of molecules of this type in the simulation.",
    )
    external_identifiers: list[ExternalIdentifier] | None = Field(
        None,
        description=("List of external database identifiers for this molecule."),
@@ -103,7 +140,7 @@ class SimulationMetadata(BaseModel):
    # Ensure scraped metadata matches the expected schema exactly.
    model_config = ConfigDict(extra="forbid")

-    software: list[Software] | None = Field(
+    softwares: list[Software] | None = Field(
Member

Please keep `software` as it is. "Software" with an S at the end does not exist.

Member

Also this will break all previous scrapers. Avoid modifying data models in a PR without discussing it first.

None,
Copilot AI Feb 6, 2026

The SimulationMetadata field 'software' has been renamed to 'softwares' in the model, but the existing GPCRMD scraper (at src/mdverse_scrapers/scrapers/gpcrmd.py:400) still uses the old field name 'software'. This will cause validation errors when the scraper runs. The field name needs to be updated to 'softwares' to match the model change.

Suggested change
None,
None,
alias="software",

        description="List of molecular dynamics tool or software.",
    )