Feature/add mdposit scraper #61
Changes from 7 commits
| Diff line change |
|---|
| @@ -171,6 +171,26 @@ This command will: |
| 5. Save the extracted metadata to Parquet files |
| |
| |
| ## Scrape MDposit |
| |
| Have a look to the notes regarding [MDposit](docs/mdposit.md) and its API. |

Essmaw marked this conversation as resolved (outdated).

| Diff line change |
|---|
| |
| Scrape MDposit to collect molecular dynamics (MD) datasets and files: |
| |
| ```bash |
| uv run scrape-mdposit --output-dir data |
| ``` |
| |
| This command will: |
| |
| 1. Search for molecular dynamics entries and files through the MDposit API. |
| 2. Parse metadata and validate them using the Pydantic models |
| `DatasetMetadata` and `FileMetadata`. |
| 3. Save validated files and datasets metadata. |
| |
| The scraping takes about 13 minutes. |

| Suggested change |
|---|
| - The scraping takes about 13 minutes. |
| + The scraping may take several minutes, depending on your network connection and hardware. |
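
Step 2 of the command description (parse metadata, then validate it with Pydantic models) could look roughly like the sketch below. This is only an illustration: the diff does not show the project's real `DatasetMetadata`, so the fields `dataset_id`, `title`, and `file_count`, the helper `validate_entries`, and the sample records are all hypothetical. It assumes Pydantic v2.

```python
from pydantic import BaseModel, Field, ValidationError


class DatasetMetadata(BaseModel):
    """Hypothetical, reduced stand-in for the project's model."""

    dataset_id: str = Field(..., min_length=1)
    title: str
    file_count: int = Field(0, ge=0)


def validate_entries(
    raw_entries: list[dict],
) -> tuple[list[DatasetMetadata], list[str]]:
    """Split raw records into validated models and error summaries."""
    valid: list[DatasetMetadata] = []
    errors: list[str] = []
    for raw in raw_entries:
        try:
            valid.append(DatasetMetadata.model_validate(raw))
        except ValidationError as exc:
            # Keep a short summary instead of aborting the whole run.
            label = raw.get("dataset_id", "?")
            errors.append(f"{label}: {exc.error_count()} error(s)")
    return valid, errors


# Made-up records, shaped like something an API response might contain.
raw_entries = [
    {"dataset_id": "A0001", "title": "Example trajectory", "file_count": 3},
    {"dataset_id": "A0002", "title": "Broken entry", "file_count": -1},
]
valid, errors = validate_entries(raw_entries)
print(len(valid), "valid;", errors)
```

Collecting failures rather than raising on the first one is a design choice that suits a long scrape, where one malformed record should not discard the rest.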
| Diff line change |
|---|
| `@@ -41,6 +41,7 @@ extend-select = [` |
| `ignore = [` |
| `"COM812", # Redundant with ruff formatter. See: https://docs.astral.sh/ruff/rules/missing-trailing-comma/` |
| `"G004", # f-strings are allowed with the loguru module. See https://docs.astral.sh/ruff/rules/logging-f-string/` |
| `"PERF401", # list.extend suggestion is not applicable when appending model instances.` |
| `]` |

Comment on lines 41 to 44

| Diff line change |
|---|
| |
| `# Force numpy-style for docstrings` |
| |
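
For context on the newly ignored rule, here is a small sketch of the loop shape that Ruff's PERF401 check is meant to report, next to the comprehension it would suggest. `FileRecord` is a hypothetical stand-in written with a standard-library dataclass; the project's own models are Pydantic classes that this diff does not show.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a metadata model."""

    name: str
    size: int


raw_files = [
    {"name": "traj.xtc", "size": 1024},
    {"name": "topol.tpr", "size": 256},
]

# Loop form: builds a list by appending one transformed item per iteration.
records_loop = []
for raw in raw_files:
    records_loop.append(FileRecord(**raw))

# Comprehension form: the rewrite such a lint rule typically proposes.
records_comp = [FileRecord(**raw) for raw in raw_files]

print(records_loop == records_comp)
```

Both forms build the same list, so ignoring the rule only affects style, not behavior.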
| Diff line change |
|---|
| `@@ -5,6 +5,8 @@` |
| |
| `from pydantic import BaseModel, Field, StringConstraints, field_validator` |
| |
| `from .enums import MoleculeType` |
| |
| `DOI = Annotated[` |
| `str,` |
| `StringConstraints(pattern=r"^10\.\d{4,9}/[\w\-.]+$"),` |
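
To see what the `DOI` pattern in this hunk accepts, the sketch below runs the same regular expression through Python's standard `re` module. This is an approximation for illustration: Pydantic may evaluate the pattern with a different regex engine, and the candidate strings are made up.

```python
import re

# Same pattern string as in the diff, compiled with the stdlib engine.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[\w\-.]+$")

candidates = [
    "10.5281/zenodo.1234567",      # 4-digit prefix, simple suffix
    "10.12/too-short-prefix",      # prefix has fewer than 4 digits
    "doi:10.5281/zenodo.1234567",  # leading "doi:" violates the ^ anchor
    "10.1000/suffix/with/slash",   # "/" is not allowed in the suffix class
]

results = {c: bool(DOI_PATTERN.match(c)) for c in candidates}
for candidate, ok in results.items():
    print(f"{candidate!r}: {ok}")
```

Only the first candidate matches. In particular, the pattern rejects suffixes that contain a further `/`.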
| Diff line change |
|---|
| `@@ -15,6 +17,12 @@ class Molecule(BaseModel):` |
| `"""Molecule in a simulation."""` |
| |
| `name: str = Field(..., description="Name of the molecule.")` |
| `type: MoleculeType \| None = Field(` |
| `None,` |
| `description="Type of the molecule."` |

Essmaw marked this conversation as resolved.

| Diff line change |
|---|
| `"Allowed values in the MoleculeType enum. "` |

| Suggested change |
|---|
| `- "Allowed values in the MoleculeType enum. "` |
| `+ " Allowed values in the MoleculeType enum. "` |
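
The leading space in the suggested change matters because Python joins adjacent string literals with no separator. A minimal sketch using the two literals from the diff (the variable names are only for illustration):

```python
# Adjacent string literals are concatenated at compile time, with nothing
# inserted between them.
without_space = (
    "Type of the molecule."
    "Allowed values in the MoleculeType enum. "
)

# The suggested version starts the second literal with a space.
with_space = (
    "Type of the molecule."
    " Allowed values in the MoleculeType enum. "
)

print(repr(without_space))
print(repr(with_space))
```

Without the space the two sentences run together as `molecule.Allowed`; with it the description reads as two separate sentences.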