Merged
Changes from 7 commits
45 commits
0a37b89
feat: add MDposit dataset scraping script.
Jan 19, 2026
caf2865
feat(models): add MDPOSIT repository and MDDB project fields.
Jan 19, 2026
9147f32
feat(cli): add README command and scrape-mdposit entry point.
Jan 19, 2026
f809832
merge: sync main into update-mdposit-scraper
Jan 29, 2026
e1a4e9d
refactor(simulation-model): add molecule type field (protein, lipid, …
Jan 29, 2026
fb283e1
chore(ruff): disable PERF401 for model instance appends
Jan 29, 2026
064d94b
refactor(mdposit-scraper): update to scrape using both nodes of MDDB …
Jan 29, 2026
e150d24
docs: adding the mddb documentation + update the readme and command …
Feb 4, 2026
e3c5e38
feat: refactor the code and resolve AttributeError
Feb 4, 2026
cfe2622
merge: sync main into update-mdposit-scraper
Essmaw Feb 5, 2026
5b01789
feat: add URL computation for ExternalIdentifier based on database name
Essmaw Feb 5, 2026
5533d8b
Fix merging of new datasource names into DatasetSourceName instead of…
Essmaw Feb 5, 2026
9ebc838
feat: enhance molecule extraction to fit the new model and adding Un…
Essmaw Feb 5, 2026
96793e5
test(simulation): test URL computation for ExternalIdentifier
Essmaw Feb 5, 2026
f031e28
tests: refactor tests for ExternalIdentifier to account for automatic…
Essmaw Feb 5, 2026
6cb949d
refactor: rename number_of_molecules to number_of_this_molecule_type_…
Essmaw Feb 5, 2026
3871d22
refactor: rename number_of_this_molecule_type_in_system to number_of_…
Essmaw Feb 6, 2026
c9be76f
tests: refactor with `number_of_molecules` attribute and adding speci…
Essmaw Feb 6, 2026
542f54a
fixes(mddb scraper): correct spelling errors, improve parameter descr…
Essmaw Feb 6, 2026
21943fc
docs: correct spelling errors
Essmaw Feb 6, 2026
d826989
fix: Revert to 'software' field
pierrepo Feb 6, 2026
671008c
refactor: Reduce usage and scope of try/except blocks
pierrepo Feb 6, 2026
f987ea7
feat: Add default DatasetSourceName
pierrepo Feb 7, 2026
059d51f
feat: Coexerce verstion to str
pierrepo Feb 7, 2026
ebf4470
docs: Update MDDB documentation and examples
pierrepo Feb 7, 2026
63181fa
refactor: Remove more try/except
pierrepo Feb 7, 2026
7a5f580
refactor: Split log message
pierrepo Feb 7, 2026
d0324ee
fix: Fix error when forcefield metadata is undifiend
pierrepo Feb 7, 2026
8b57c76
fix: Handle case with no protein sequence nor Uniprot identifier
pierrepo Feb 7, 2026
024efa9
fix: Handle case when no software is available
pierrepo Feb 7, 2026
88b9955
feat: Add InChIKey field for Molecule model
pierrepo Feb 7, 2026
dd724a7
fix: Fix dataset_url_in_repository field
pierrepo Feb 7, 2026
9e0374f
docs: Print dataset URL in API
pierrepo Feb 7, 2026
6b959da
feat: Align uniprot identifiers with protein sequences
pierrepo Feb 7, 2026
e3a353c
feat: Add replicas logic in file metadata extraction
pierrepo Feb 7, 2026
7068584
feat: Add rules to avoid lengthy try / except blocks
pierrepo Feb 7, 2026
9cd0a88
fix: Add special case for 'inr' (INRIA) node name
pierrepo Feb 7, 2026
40ea3ca
feat: Add Cineca MDDB node
pierrepo Feb 8, 2026
a8ed77b
feat: Add another way to get protein name from Uniprot
pierrepo Feb 8, 2026
7884275
fix: Update logic to fetch protein name from Uniprot
pierrepo Feb 8, 2026
71f7c43
docs: Fix typos
pierrepo Feb 11, 2026
3d003b3
docs: Relax scraping time
pierrepo Feb 11, 2026
cf32a04
chore: Reallow PERF401 rules
pierrepo Feb 11, 2026
91595f1
docs: Remove MDDB node names
pierrepo Feb 11, 2026
6658973
refactor: Clean code
pierrepo Feb 11, 2026
20 changes: 20 additions & 0 deletions README.md
@@ -171,6 +171,26 @@ This command will:
5. Save the extracted metadata to Parquet files


## Scrape MDposit

Have a look at the notes regarding [MDposit](docs/mdposit.md) and its API.

Scrape MDposit to collect molecular dynamics (MD) datasets and files:

```bash
uv run scrape-mdposit --output-dir data
```

This command will:

1. Search for molecular dynamics entries and files through the MDposit API.
2. Parse metadata and validate them using the Pydantic models
`DatasetMetadata` and `FileMetadata`.
3. Save validated files and datasets metadata.

The scraping takes about 13 minutes.
Copilot AI Feb 8, 2026

The statement “The scraping takes about 13 minutes.” is environment-dependent and likely to become outdated. Consider rephrasing to something less specific (e.g., “may take ~X minutes depending on network/CPU”) or dropping the timing entirely.

Suggested change:
- The scraping takes about 13 minutes.
+ The scraping may take several minutes, depending on your network connection and hardware.



## Analyze Gromacs mdp and gro files

### Download files
1 change: 1 addition & 0 deletions pyproject.toml
@@ -72,3 +72,4 @@ scrape-zenodo = "mdverse_scrapers.scrapers.zenodo:main"
scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
scrape-mdposit = "mdverse_scrapers.scrapers.mdposit:main"
1 change: 1 addition & 0 deletions ruff.toml
@@ -41,6 +41,7 @@ extend-select = [
ignore = [
    "COM812", # Redundant with ruff formatter. See: https://docs.astral.sh/ruff/rules/missing-trailing-comma/
    "G004", # f-strings are allowed with the loguru module. See https://docs.astral.sh/ruff/rules/logging-f-string/
    "PERF401", # list.extend suggestion is not applicable when appending model instances.
]
Comment on lines 41 to 44
Copilot AI Feb 8, 2026

Ignoring PERF401 globally reduces signal across the whole repo. Since the codebase already suppresses this rule locally where needed (e.g., src/mdverse_scrapers/scrapers/nomad.py uses # noqa: PERF401), it would be better to keep the rule enabled globally and use targeted noqa/per-file ignores for the specific false positives.
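
To make the suggestion concrete, here is a minimal sketch of the loop pattern that PERF401 flags and of a line-level suppression. The `Record` class and both `build_records_*` functions are hypothetical stand-ins, not code from this repository.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a model instance."""

    name: str


def build_records_loop(names: list[str]) -> list[Record]:
    """Build records with an explicit loop, silencing PERF401 locally."""
    records = []
    for name in names:
        # Ruff's PERF401 flags this append-in-a-loop pattern; the trailing
        # comment suppresses the rule for this line only.
        records.append(Record(name=name))  # noqa: PERF401
    return records


def build_records_comprehension(names: list[str]) -> list[Record]:
    """Build the same list with the comprehension PERF401 suggests."""
    return [Record(name=name) for name in names]
```

Both functions return the same list, so the rule can stay enabled globally and be silenced only where the explicit loop is judged more readable.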


# Force numpy-style for docstrings
2 changes: 1 addition & 1 deletion src/mdverse_scrapers/models/dataset.py
@@ -163,7 +163,7 @@ def format_dates(cls, value: datetime | str | None) -> str | None:

        Parameters
        ----------
-        cls : type[BaseDataset]
+        cls : type[DatasetMetadata]
            The Pydantic model class being validated.
        value : datetime | str | None
            The input value of the 'date' field to validate.
14 changes: 14 additions & 0 deletions src/mdverse_scrapers/models/enums.py
@@ -20,3 +20,17 @@ class DatasetSourceName(StrEnum):
    ATLAS = "atlas"
    GPCRMD = "gpcrmd"
    NMRLIPIDS = "nmrlipids"
    MDDB = "mddb"
    MDPOSIT_INRIA_NODE = "mdposit_inria_node"
    MDPOSIT_MMB_NODE = "mdposit_mmb_node"


class MoleculeType(StrEnum):
    """Common molecular types found in molecular dynamics simulations."""

    PROTEIN = "protein"
    NUCLEIC_ACID = "nucleic_acid"
    ION = "ion"
    LIPID = "lipid"
    CARBOHYDRATE = "carbohydrate"
    SOLVENT = "solvent"
8 changes: 8 additions & 0 deletions src/mdverse_scrapers/models/simulation.py
@@ -5,6 +5,8 @@

from pydantic import BaseModel, Field, StringConstraints, field_validator

from .enums import MoleculeType

DOI = Annotated[
    str,
    StringConstraints(pattern=r"^10\.\d{4,9}/[\w\-.]+$"),
@@ -15,6 +17,12 @@ class Molecule(BaseModel):
    """Molecule in a simulation."""

    name: str = Field(..., description="Name of the molecule.")
    type: MoleculeType | None = Field(
        None,
        description="Type of the molecule."
        "Allowed values in the MoleculeType enum. "
Copilot AI Feb 6, 2026

Missing space in description: Line 81 starts with 'Type of the molecule.' but line 82 continues with 'Allowed values' without a space between sentences. There should be a space at the beginning of line 82: ' Allowed values in the MoleculeType enum. '

Suggested change:
- "Allowed values in the MoleculeType enum. "
+ " Allowed values in the MoleculeType enum. "

        "Examples: PROTEIN, ION, LIPID...",
    )
    number_of_atoms: int | None = Field(
        None, ge=0, description="Number of atoms in the molecule, if known."
    )
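
The missing space flagged in the review comment on the `type` field comes from Python's implicit concatenation of adjacent string literals, which joins the pieces with no separator. A minimal sketch using the strings from the diff (the variable names are illustrative):

```python
# Adjacent string literals are joined at compile time with no separator,
# so the first two sentences run together.
as_written = (
    "Type of the molecule."
    "Allowed values in the MoleculeType enum. "
    "Examples: PROTEIN, ION, LIPID..."
)

# Same literals with the leading space from the suggested change.
as_suggested = (
    "Type of the molecule."
    " Allowed values in the MoleculeType enum. "
    "Examples: PROTEIN, ION, LIPID..."
)

print(as_written)
print(as_suggested)
```

The first string contains `molecule.Allowed`, while the second reads `molecule. Allowed`, which is the text that would appear in the generated field description.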