This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Commit 87e985c

Merge pull request #64 from MDverse/fix-figshare-202-status-code
Fix figshare 202 status code
2 parents 11f10c3 + 12328f3 commit 87e985c

6 files changed: +393 additions, -53 deletions


.pre-commit-config.yaml

Lines changed: 3 additions & 5 deletions
```diff
@@ -1,6 +1,6 @@
 # Install pre-commit hooks with:
 # prek install
-exclude: "scripts/*|tmp/*|.*.mdp|"
+exclude: "scripts/*|tmp/*"
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v6.0.0
@@ -24,14 +24,12 @@ repos:
     hooks:
       # Run the linter.
      - id: ruff-check
-        types_or: [ python, pyi ]
+        types: [python]
        args: [ --fix ]
      # Run the formatter.
      - id: ruff-format
-        types_or: [ python, pyi ]
-
+        types: [python]
  - repo: https://github.com/PyCQA/bandit
    rev: '1.9.2'
    hooks:
    - id: bandit
-
```

docs/figshare.md

Lines changed: 57 additions & 32 deletions
````diff
@@ -1,61 +1,86 @@
-# FigShare documentation
+# Figshare documentation
 
 ## File size
 
-According to FigShare [FAQ](https://help.figshare.com/):
+According to Figshare [documentation](https://info.figshare.com/user-guide/file-size-limits-and-storage/):
 
-> Freely-available Figshare.com accounts have the following limits for sharing scholarly content:
-storage quota: 20GB
-max individual file size: 20GB
-max no of collections: 100
-max no of projects: 100
-max no of items: 500
-max no of files per item: 500
-max no of collaborators on project: 100
-max no of authors per item, collection: 100
-max no of item version: 50
-If you have more than 500 files that you need to include in an item, please create an archive (or archives) for the files (e.g. zip file).
-If an individual would like to publish outputs larger than 20GB (up to many TBs), please consider Figshare+, our Figshare repository for FAIR-ly sharing big datasets that allows for more storage, larger files, additional metadata and license options, and expert support. There is a one-time cost associated with Figshare+ to cover the cost of storing the data persistently ad infinitum. Find out more about Figshare+ or get in touch at review@figshare.com with the storage amount needed and we will find the best way to support your data sharing.
+> All figshare.com accounts are provided with 20GB of private storage and are able to upload individual files up to 20GB.
 
-> For those using an institutional version of Figshare, the number of collaboration spaces will be determined by your institution. Please contact your administrator.
-
-So we don't expect much files to have an individual size above 20 GB.
+So we don't expect files to have an individual size above 20 GB.
 
 ## API
 
-- [How to get a personal token](https://info.figshare.com/user-guide/how-to-get-a-personal-token/)
-- [REST API](https://docs.figshare.com/)
+### Documentation
+
+- [How to use the Figshare API](https://info.figshare.com/user-guide/how-to-use-the-figshare-api/)
+- [API documentation](https://docs.figshare.com/)
+
+### Token
+
+Figshare requires a token to access its API: [How to get a personal token](https://info.figshare.com/user-guide/how-to-get-a-personal-token/)
+
+### URL
 
-## Query
+https://api.figshare.com/v2/
 
-[Search guide](https://help.figshare.com/article/how-to-use-advanced-search-in-figshare)
+### Query
 
-## Rate limiting
+[Search guide](https://docs.figshare.com/#search)
 
-https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
+### Rate limiting
 
 > We do not have automatic rate limiting in place for API requests. However, we do carry out monitoring to detect and mitigate abuse and prevent the platform's resources from being overused. We recommend that clients use the API responsibly and do not make more than one request per second. We reserve the right to throttle or block requests if we detect abuse.
 
-## Dataset examples
+Source: https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
 
-### MD-related file types
+## Datasets
 
-Query:
+### Search for MD-related datasets
+
+- Endpoint: `/articles/search`
+- Documentation: <https://docs.figshare.com/#articles_search>
+
+We search MD-related datasets by searching for file types and keywords if necessary. Keywords are searched into `:title:`, `:description:` and `:keywords:` text fields. Example queries:
 
 ```none
 resource_type.type:"dataset" AND filetype:"tpr"
 ```
 
-Datasets:
+or
+
+```none
+:extension: mdp AND (:title: 'md simulation' OR :description: 'md simulation' OR :keyword: 'md simulation')
+:extension: mdp AND (:title: 'gromacs' OR :description: 'gromacs' OR :keyword: 'gromacs')
+```
+
+Example datasets:
 
 - [Molecular dynamics of DSB in nucleosome](https://figshare.com/articles/dataset/M1_gro/5840706)
 - [a-Synuclein short MD simulations:homo-A53T](https://figshare.com/articles/dataset/a-Synuclein_short_MD_simulations_homo-A53T/7007552)
 - [Molecular Dynamics Protocol with Gromacs 4.0.7](https://figshare.com/articles/dataset/Molecular_Dynamics_Protocol_with_Gromacs_4_0_7/104603)
 
-### zip files
+### Search strategy
+
+We search for all file types and keywords. Results are paginated by batch of 100 datasets.
+
+### Get metadata for a given dataset
+
+- Endpoint: `/articles/{dataset_id}`
+- Documentation: <https://docs.figshare.com/#public_article>
+
+Example dataset "[Molecular dynamics of DSB in nucleosome](https://figshare.com/articles/dataset/M1_gro/5840706)":
+
+- web view: <https://figshare.com/articles/dataset/M1_gro/5840706>
+- API view: <https://api.figshare.com/v2/articles/5840706>
+
+All metadata related to a given dataset is provided, as well as all files metadata.
+
+### Zip files
+
+Zip files content is available with a preview (similar to Zenodo). The only metadata available within this preview is the file name (no file size, no md5sum).
+
+Example dataset "[Molecular Dynamics Simulations](https://figshare.com/articles/dataset/Molecular_Dynamics_Simulations/30307108?file=58572346)":
 
-Zip files content is available, like for Zenodo, but individual file sizes are not available.
+- The content of the file "Molecular Dynamics Simulations.zip" is available at <https://figshare.com/ndownloader/files/58572346/preview/58572346/structure.json>
 
-Example:
-- For this dataset: [Molecular Dynamics Simulations](https://figshare.com/articles/dataset/Molecular_Dynamics_Simulations/30307108?file=58572346)
-- Content of the file: [Molecular Dynamics Simulations.zip](https://figshare.com/ndownloader/files/58572346/preview/58572346/structure.json)
+We need to emulate a web browser to access the URLs linking to the contents of zip files. Otherwise, we get a 202 code.
````

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -25,6 +25,7 @@ dependencies = [
     "pyyaml>=6.0.2",
     "requests>=2.32.3",
     "scipy>=1.15.2",
+    "selenium>=4.40.0",
 ]
 
 [dependency-groups]
```

src/mdverse_scrapers/core/network.py

Lines changed: 159 additions & 0 deletions
```diff
@@ -1,10 +1,20 @@
 """Common functions and network utilities."""
 
+import json
 import time
 from enum import StrEnum
+from io import BytesIO
 
+import certifi
 import httpx
 import loguru
+import pycurl
+from selenium import webdriver
+from selenium.common.exceptions import WebDriverException
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support import expected_conditions as ec
+from selenium.webdriver.support.ui import WebDriverWait
 
 
 class HttpMethod(StrEnum):
@@ -148,3 +158,152 @@ def make_http_request_with_retries(
         else:
             logger.info("Retrying...")
     return None
+
+
+def parse_response_headers(headers_bytes: bytes) -> dict[str, str]:
+    """Parse HTTP response header from bytes to a dictionary.
+
+    Returns
+    -------
+    dict
+        A dictionary of HTTP response headers.
+    """
+    headers = {}
+    headers_text = headers_bytes.decode("utf-8")
+    for line in headers_text.split("\r\n"):
+        if ": " in line:
+            key, value = line.split(": ", maxsplit=1)
+            headers[key] = value
+    return headers
+
+
+def send_http_request_with_retries_pycurl(
+    url: str,
+    data: dict | None = None,
+    delay_before_request: float = 1.0,
+    logger: "loguru.Logger" = loguru.logger,
+) -> dict:
+    """Query the Figshare API and return the JSON response.
+
+    Parameters
+    ----------
+    url : str
+        URL to send the request to.
+    data : dict, optional
+        Data to send in the request body (for POST requests).
+    delay_before_request : float, optional
+        Time to wait before sending the request, in seconds.
+
+    Returns
+    -------
+    dict
+        A dictionary with the following keys:
+        - status_code: HTTP status code of the response.
+        - elapsed_time: Time taken to perform the request.
+        - headers: Dictionary of response headers.
+        - response: JSON response from the API.
+    """
+    # First, we wait.
+    # https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
+    # "We recommend that clients use the API responsibly
+    # and do not make more than one request per second."
+    headers = {
+        "User-Agent": (
+            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36"
+        ),
+        "Content-Type": "application/json",
+    }
+    time.sleep(delay_before_request)
+    results = {}
+    # Initialize a Curl object.
+    curl = pycurl.Curl()
+    # Set the URL to send the request to.
+    curl.setopt(curl.URL, url)
+    # Add headers as a list of strings.
+    headers_lst = [f"{key}: {value}" for key, value in headers.items()]
+    curl.setopt(curl.HTTPHEADER, headers_lst)
+    # Handle SSL certificates.
+    curl.setopt(curl.CAINFO, certifi.where())
+    # Follow redirect.
+    curl.setopt(curl.FOLLOWLOCATION, True)  # noqa: FBT003
+    # If data is provided, set the request to POST and add the data.
+    if data is not None:
+        curl.setopt(curl.POST, True)  # noqa: FBT003
+        data_json = json.dumps(data)
+        curl.setopt(curl.POSTFIELDS, data_json)
+    # Capture the response body in a buffer.
+    body_buffer = BytesIO()
+    curl.setopt(curl.WRITEFUNCTION, body_buffer.write)
+    # Capture the response headers in a buffer.
+    header_buffer = BytesIO()
+    curl.setopt(curl.HEADERFUNCTION, header_buffer.write)
+    # Perform the request.
+    curl.perform()
+    # Get the HTTP status code.
+    status_code = curl.getinfo(curl.RESPONSE_CODE)
+    results["status_code"] = status_code
+    # Get elapsed time.
+    elapsed_time = curl.getinfo(curl.TOTAL_TIME)
+    results["elapsed_time"] = elapsed_time
+    # Close the Curl object.
+    curl.close()
+    # Get the response headers from the buffer.
+    response_headers = parse_response_headers(header_buffer.getvalue())
+    results["headers"] = response_headers
+    # Get the response body from the buffer.
+    response = body_buffer.getvalue()
+    # Convert the response body from bytes to a string.
+    response = response.decode("utf-8")
+    # Convert the response string to a JSON object.
+    try:
+        response = json.loads(response)
+    except json.JSONDecodeError:
+        logger.error("Error decoding JSON response:")
+        logger.error(response[:100])
+        response = None
+    results["response"] = response
+    return results
+
+
+def get_html_page_with_selenium(
+    url: str, tag: str = "body", logger: "loguru.Logger" = loguru.logger
+) -> str | None:
+    """Get HTML page content using Selenium.
+
+    Parameters
+    ----------
+    url : str
+        URL of the web page to retrieve.
+    tag : str, optional
+        HTML tag to wait for before retrieving the page content (default is "body").
+
+    Returns
+    -------
+    str | None
+        HTML content of the page, or None if an error occurs.
+    """
+    options = Options()
+    options.add_argument("--headless")
+    options.add_argument("--enable-javascript")
+    page_content = ""
+    logger.info("Retrieving page with Selenium:")
+    logger.info(url)
+    try:
+        driver = webdriver.Chrome(options=options)
+        driver.get(url)
+        page_content = (
+            WebDriverWait(driver, 10)
+            .until(ec.visibility_of_element_located((By.CSS_SELECTOR, tag)))
+            .text
+        )
+        driver.quit()
+    except WebDriverException as e:
+        logger.error("Cannot retrieve page:")
+        logger.error(url)
+        logger.error(f"Selenium error: {e}")
+        return None
+    if not page_content:
+        logger.error("Retrieved page content is empty.")
+        return None
+    return page_content
```
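Of the functions added in this file, `parse_response_headers` is pure and easy to sanity-check in isolation. A standalone copy of the same logic, run against made-up sample header bytes (the sample values are for illustration only):

```python
def parse_response_headers(headers_bytes: bytes) -> dict[str, str]:
    """Parse raw HTTP response headers into a dict (same logic as the diff)."""
    headers = {}
    for line in headers_bytes.decode("utf-8").split("\r\n"):
        # Only "Key: Value" lines are kept; the status line is skipped.
        if ": " in line:
            key, value = line.split(": ", maxsplit=1)
            headers[key] = value
    return headers


raw = b"HTTP/2 200\r\ncontent-type: application/json\r\nretry-after: 1\r\n\r\n"
print(parse_response_headers(raw))
# {'content-type': 'application/json', 'retry-after': '1'}
```

Using `maxsplit=1` matters here: header values such as dates contain `": "`-free colons rarely, but splitting only once keeps any `": "` inside the value intact.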

src/mdverse_scrapers/scrapers/figshare.py

Lines changed: 11 additions & 16 deletions
```diff
@@ -1,4 +1,5 @@
 """Scrape molecular dynamics datasets and files from Figshare."""
+from arrow import get
 
 import json
 import os
@@ -14,6 +15,7 @@
 
 from ..core.figshare_api import FigshareAPI
 from ..core.logger import create_logger
+from ..core.network import get_html_page_with_selenium
 from ..core.toolbox import (
     ContextManager,
     DataType,
@@ -61,12 +63,13 @@ def extract_files_from_json_response(
 
 
 def extract_files_from_zip_file(
-    file_id: str, logger: "loguru.Logger" = loguru.logger, max_attempts: int = 3
-) -> list[str]:
+    file_id: str, logger: "loguru.Logger" = loguru.logger) -> list[str]:
     """Extract files from a zip file content.
 
     No endpoint is available in the Figshare API.
     We perform a direct HTTP GET request to the zip file content url.
+    We need to use the Selenium library to emulate a browser request
+    as direct requests fail with a 202 status code.
 
     Known issue with:
     https://figshare.com/ndownloader/files/31660220/preview/31660220/structure.json
@@ -75,10 +78,8 @@ def extract_files_from_zip_file(
     ----------
     file_id : str
         ID of the zip file to get content from.
-    logger : loguru.Logger
+    logger : "loguru.Logger"
         Logger object.
-    max_attempts : int
-        Maximum number of attempts to fetch the zip file content.
 
     Returns
     -------
@@ -90,23 +91,17 @@ def extract_files_from_zip_file(
         f"https://figshare.com/ndownloader/files/{file_id}"
         f"/preview/{file_id}/structure.json"
     )
-    response = make_http_get_request_with_retries(
-        url=url,
-        logger=logger,
-        max_attempts=max_attempts,
-        timeout=30,
-        delay_before_request=2,
-    )
+    response = get_html_page_with_selenium(url, tag="pre", logger=logger)
     if response is None:
         logger.warning("Cannot get zip file content.")
        return file_names
     # Extract file names from JSON response.
     try:
-        file_names = extract_files_from_json_response(response.json())
+        file_names = extract_files_from_json_response(json.loads(response))
     except (json.decoder.JSONDecodeError, ValueError) as exc:
-        logger.warning(f"Cannot extract files from JSON response: {exc}")
-        logger.debug(f"Status code: {response.status_code}")
-        logger.debug(response.text)
+        logger.warning(f"Cannot extract files from HTML response: {exc}")
+        logger.debug("Response content:")
+        logger.debug(response)
     logger.success(f"Found {len(file_names)} files.")
     return file_names
 
```
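The diff does not show `extract_files_from_json_response`, so the exact schema of the preview `structure.json` is not visible here. A hedged sketch of that parsing step, assuming a nested `files`/`dirs` tree — this shape and the `extract_file_names` helper are assumptions for illustration, not taken from the repository:

```python
import json


def extract_file_names(node: dict) -> list[str]:
    """Recursively collect file names from a zip-preview tree.

    The {"files": [...], "dirs": [...]} layout is assumed here;
    the real structure.json schema may differ.
    """
    names = [entry["name"] for entry in node.get("files", [])]
    for subdir in node.get("dirs", []):
        names.extend(extract_file_names(subdir))
    return names


# Fabricated sample standing in for a downloaded structure.json.
sample = json.loads(
    '{"files": [{"name": "topol.tpr"}], '
    '"dirs": [{"files": [{"name": "traj.xtc"}], "dirs": []}]}'
)
print(extract_file_names(sample))  # ['topol.tpr', 'traj.xtc']
```

Whatever the real schema, only names can be recovered this way: as noted in `docs/figshare.md`, the preview carries no file sizes or checksums.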
