This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Commit f1ae4e9

Merge pull request #51 from MDverse/update-gpcrmd-scraper
Update GPCRmd scraper
2 parents 499ca0a + f97a2cd commit f1ae4e9

9 files changed: +773 -9 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ The scraping takes about 2 h.
 Scrape GPCRmd to collect molecular dynamics (MD) datasets and files related to G-protein-coupled receptors (GPCRs), a major family of membrane proteins and common drug targets.

 ```bash
-uv run -m scripts.scrape_gpcrmd
+uv run scrape-gpcrmd --output-dir data
 ```

 This command will:

docs/gpcrmd.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+# GPCRmd
+
+> GPCRmd is an online platform for visualizing, analyzing, and sharing molecular dynamics simulations of G-protein-coupled receptors (GPCRs), a key family of membrane proteins and common drug targets.
+
+- website: <https://www.gpcrmd.org/>
+- publication: [GPCRmd uncovers the dynamics of the 3D-GPCRome](https://www.nature.com/articles/s41592-020-0884-y), Nature Methods, 2020.
+- [documentation](https://gpcrmd-docs.readthedocs.io/en/latest/index.html)
+
+## API
+
+### Base URL
+
+<https://www.gpcrmd.org/api/>
+
+### Documentation
+
+<https://gpcrmd-docs.readthedocs.io/en/latest/api.html#main-gpcrmd-api>
+
+### Token
+
+No token is needed to access the GPCRmd API in read mode.
+
+### Metadata of datasets and files
+
+Although GPCRmd provides a public API to discover molecular dynamics datasets, **some metadata fields and all file-level information are not exposed via the API**. For this reason, web scraping of the dataset page is required to retrieve complete dataset descriptions and file metadata.
+
+### Datasets
+
+In GPCRmd, a dataset (a simulation and its related files) is called a "dynamic".
+
+API endpoint to retrieve all datasets at once:
+
+- Endpoint: `/search_all/info/`
+- HTTP method: GET
+- [documentation](https://gpcrmd-docs.readthedocs.io/en/latest/api.html#main-gpcrmd-api)
+
+#### Dataset metadata retrieved via the API
+
+| Field | Description |
+| -------------------- | ----------------------------------- |
+| `dyn_id` | Unique dynamic (dataset) identifier |
+| `modelname` | Name of the simulated system |
+| `timestep` | MD integration time step in fs |
+| `atom_num` | Number of atoms |
+| `mysoftware` | MD engine used |
+| `software_version` | Version of the MD engine |
+| `forcefield` | Force field and model name |
+| `forcefield_version` | Force field and model version |
+| `creation_timestamp` | Dataset creation date |
+| `dataset_url` | URL of the dataset web page |
+
+#### Dataset metadata retrieved via scraping of the dataset HTML page
+
+| Field | Description |
+| -------------------- | ------------------------------------------ |
+| `description` | Textual description of the simulation |
+| `authors` | Authors |
+| `simulation_time` | Total simulation length |
+
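As a rough illustration of the API step documented above, the following Python sketch queries `/search_all/info/` and keeps only the fields listed in the first table. It relies on the standard library rather than the project's `httpx` helpers, and it assumes the endpoint returns a JSON array of objects keyed by those field names, which this page does not confirm. The names `select_api_fields` and `fetch_all_datasets` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://www.gpcrmd.org/api"  # base URL from the documentation above
SEARCH_ALL_ENDPOINT = "/search_all/info/"  # endpoint from the documentation above

# Field names copied from the "retrieved via the API" table.
API_FIELDS = (
    "dyn_id",
    "modelname",
    "timestep",
    "atom_num",
    "mysoftware",
    "software_version",
    "forcefield",
    "forcefield_version",
    "creation_timestamp",
    "dataset_url",
)


def select_api_fields(record: dict) -> dict:
    """Keep only the documented fields; fields absent from the record become None."""
    return {name: record.get(name) for name in API_FIELDS}


def fetch_all_datasets(timeout: float = 30.0) -> list:
    """Send a GET request to the search endpoint and decode the body as JSON.

    Assumption: the body is a JSON array of objects (not confirmed here).
    """
    request = Request(BASE_URL + SEARCH_ALL_ENDPOINT, method="GET")
    with urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [select_api_fields(item) for item in payload]
```

Filtering through a fixed tuple of field names keeps unexpected keys out of the result and makes missing fields explicit as `None`.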
+### Files
+
+The GPCRmd API does not provide any endpoint to access file-level metadata. File metadata is extracted from the dataset web page.
+
+For example:
+
+- Files associated with [dataset `7`](https://www.gpcrmd.org/dynadb/dynamics/id/7/) are:
+
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd>
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10167_dyn_7.psf>
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10168_dyn_7.pdb>
+
+- Files associated with [dataset `12`](https://www.gpcrmd.org/dynadb/dynamics/id/12/) are:
+
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10193_trj_12.xtc>
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10196_dyn_12.psf>
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10197_dyn_12.pdb>
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10191_oth_12.tar.gz>
+  - <https://www.gpcrmd.org/dynadb/files/Dynamics/10192_prm_12.tar.gz>
+
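The file URLs listed above all share the `/dynadb/files/Dynamics/` path, so one possible way to collect them from a dataset page is to scan its anchors for that marker. The sketch below uses Python's standard `html.parser`. It assumes the files are exposed as plain `<a href="...">` links, which this page does not confirm, and `FileLinkParser` and `extract_file_urls` are hypothetical names, not the scraper's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE_URL = "https://www.gpcrmd.org"
FILE_PATH_MARKER = "/dynadb/files/Dynamics/"  # common prefix of the URLs listed above


class FileLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a simulation file."""

    def __init__(self) -> None:
        super().__init__()
        self.file_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and FILE_PATH_MARKER in href:
            # Resolve relative links against the site root and skip duplicates.
            url = urljoin(SITE_URL, href)
            if url not in self.file_urls:
                self.file_urls.append(url)


def extract_file_urls(html: str) -> list:
    """Return the file URLs found in the HTML, in document order."""
    parser = FileLinkParser()
    parser.feed(html)
    return parser.file_urls
```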
+#### File metadata retrieved via scraping of the dataset HTML page
+
+| Field | Description |
+| ----------- | ---------------------- |
+| `file_name` | *Name of the file* |
+| `file_type` | *File extension* |
+| `file_path` | *Public download URL* |
+| `file_size` | *File size in bytes* |
+
+> 💡 File size is obtained using an HTTP `HEAD` request on the file URL, **avoiding file download**.
+
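Three of the four fields in the table above can be derived from the download URL alone. The sketch below shows one way to do that with the standard library. How the scraper treats compound extensions such as `.tar.gz` is not stated here, so reporting them whole is an assumption, and `describe_file` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: compound extensions are reported whole ("tar.gz", not "gz").
COMPOUND_SUFFIXES = (".tar.gz",)


def describe_file(url: str) -> dict:
    """Derive file_name, file_type and file_path from a download URL."""
    name = PurePosixPath(urlparse(url).path).name
    lowered = name.lower()
    for suffix in COMPOUND_SUFFIXES:
        if lowered.endswith(suffix):
            file_type = suffix.lstrip(".")
            break
    else:
        # No compound suffix matched: fall back to the last extension.
        file_type = PurePosixPath(name).suffix.lstrip(".")
    return {"file_name": name, "file_type": file_type, "file_path": url}
```

`file_size` is left out on purpose, since it requires the `HEAD` request mentioned in the note above.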
+## Example
+
+### Dataset ID 2316
+
+- [Dataset on GPCRmd GUI](https://www.gpcrmd.org/dynadb/dynamics/id/2316/)
+- [Dataset on GPCRmd API](https://www.gpcrmd.org/api/search_dyn/info/2316)
+
+#### Dataset metadata (API + scraping)
+
+| Field | Value |
+| -------------------- | ------------------------------------------------- |
+| `dyn_id` | 2316 |
+| `modelname` | FFA2_TUG1375_Gi1-TUG1375 |
+| `timestep` | 2 |
+| `atom_num` | 4829 |
+| `mysoftware` | AMBER PMEMD.CUDA |
+| `software_version` | 2020 |
+| `forcefield` | ff19SB/lipid21/GAFF2 |
+| `forcefield_version` | ff19SB/lipid21 |
+| `creation_timestamp` | 2025-05-13 |
+| `dataset_url` | <https://www.gpcrmd.org/dynadb/dynamics/id/2316/> |
+| `description` | Simulation aims to observe structural features of FFA2 without an orthosteric agonist and G-protein, which will be compared to docking-based simulations of allosteric activators... |
+| `authors` | Abdul-Akim Guseinov, University of Glasgow |
+| `simulation_time` | 3.0 µs |
+
+#### File metadata
+
+[Files on GPCRmd GUI](https://www.gpcrmd.org/api/search_dyn/info/2316) (accessible via the *Technical Information* section)
+
+| Field | Value |
+| ----------- | ------------------------------------------------------------------------- |
+| `file_name` | tmp_dyn_0_2667.pdb |
+| `file_path` | <https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/tmp_dyn_0_2667.pdb> |
+| `file_size` | 1024 |

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -72,3 +72,4 @@ scrape-zenodo = "mdverse_scrapers.scrapers.zenodo:main"
 scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
 scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
 scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
+scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"
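The new entry maps the `scrape-gpcrmd` command to a `main` callable, and the README change invokes it with `--output-dir data`. The body of `mdverse_scrapers.scrapers.gpcrmd:main` is not shown on this page, so the following is only a minimal sketch of how such an entry point could accept that option, written with the standard `argparse` module. The project may well use a different CLI library, and `parse_args` is a hypothetical helper.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line (hypothetical; mirrors the README invocation)."""
    parser = argparse.ArgumentParser(prog="scrape-gpcrmd")
    parser.add_argument(
        "--output-dir",
        type=Path,
        required=True,
        help="Directory where scraped metadata is written.",
    )
    return parser.parse_args(argv)


def main(argv=None) -> int:
    """Entry point referenced as 'module:main' by a console script."""
    args = parse_args(argv)
    # Create the output directory up front so later writes cannot fail on it.
    args.output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Writing results to {args.output_dir}")
    return 0
```

Keeping argument parsing in its own function makes the option handling easy to check without running the scraper.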

src/mdverse_scrapers/core/network.py

Lines changed: 37 additions & 0 deletions
@@ -191,6 +191,43 @@ def make_http_request_with_retries(
     return None


+def retrieve_file_size_from_http_head_request(
+    client: httpx.Client, url: str, logger: "loguru.Logger" = loguru.logger
+) -> int | None:
+    """Retrieve file size from HTTP HEAD request.
+
+    Parameters
+    ----------
+    client : httpx.Client
+        The HTTPX client to use for making requests.
+    url : str
+        The URL of the file.
+    logger : "loguru.Logger"
+        Logger for logging messages.
+
+    Returns
+    -------
+    int | None
+        File size in bytes if available, None otherwise.
+    """
+    logger.info("Retrieving file size from HTTP HEAD request")
+    response = make_http_request_with_retries(
+        client,
+        url,
+        method=HttpMethod.HEAD,
+        timeout=30,
+        delay_before_request=0.2,
+        logger=logger,
+    )
+    size = None
+    if response and response.headers:
+        size = int(response.headers.get("Content-Length", 0))
+        logger.info(f"File size: {size:,} bytes")
+    else:
+        logger.warning("Could not retrieve file size.")
+    return size
+
+
 def parse_response_headers(headers_bytes: bytes) -> dict[str, str]:
     """Parse HTTP response header from bytes to a dictionary.
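One detail of the added function is worth noting: when a response carries no `Content-Length` header, `headers.get("Content-Length", 0)` yields `0`, whereas the docstring describes `None` for an unavailable size. A possible variant of the header handling that separates "unknown" from "empty file" is sketched below. `parse_content_length` is a hypothetical helper and not part of this commit. It takes any mapping of header names to values, and with a plain `dict` the key lookup is case-sensitive.

```python
from collections.abc import Mapping
from typing import Optional


def parse_content_length(headers: Mapping) -> Optional[int]:
    """Return the Content-Length as an int, or None if absent or invalid."""
    raw = headers.get("Content-Length")
    if raw is None:
        # Header missing: the size is unknown, which differs from zero bytes.
        return None
    try:
        size = int(raw)
    except ValueError:
        # Header present but not a valid integer.
        return None
    return size if size >= 0 else None
```

With this variant an absent header leads to `None`, so downstream code can tell an unknown size apart from a file that really is empty.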

src/mdverse_scrapers/models/simulation.py

Lines changed: 8 additions & 5 deletions
@@ -24,6 +24,11 @@ class Molecule(BaseModel):
     formula: str | None = Field(
         None, description="Chemical formula of the molecule, if known."
     )
+    number_of_molecules: int | None = Field(
+        None,
+        ge=0,
+        description="Number of molecules of this type in the simulation, if known.",
+    )


 class ForceFieldModel(BaseModel):
@@ -73,15 +78,13 @@ class SimulationMetadata(BaseModel):
     total_number_of_atoms: int | None = Field(
         None,
         ge=0,  # equal or greater than zero
-        description="Total number of atoms in the simulated system.",
+        description="Total number of atoms in the system.",
     )
     molecules: list[Molecule] | None = Field(
         None,
-        description=(
-            "List of molecules in the system with their number of atoms if known."
-        ),
+        description=("List of simulated molecules in the system."),
     )
-    forcefields: list[ForceFieldModel] | None = Field(
+    forcefields_models: list[ForceFieldModel] | None = Field(
         None,
         description="List of forcefields and models used.",
     )
