This repository was archived by the owner on Mar 10, 2026. It is now read-only.
Merged
8 changes: 3 additions & 5 deletions .pre-commit-config.yaml
@@ -1,6 +1,6 @@
# Install pre-commit hooks with:
# prek install
exclude: "scripts/*|tmp/*|.*.mdp|"
exclude: "scripts/*|tmp/*"
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
@@ -24,14 +24,12 @@ repos:
hooks:
# Run the linter.
- id: ruff-check
types_or: [ python, pyi ]
types: [python]
args: [ --fix ]
# Run the formatter.
- id: ruff-format
types_or: [ python, pyi ]

types: [python]
- repo: https://github.com/PyCQA/bandit
rev: '1.9.2'
hooks:
- id: bandit

89 changes: 57 additions & 32 deletions docs/figshare.md
@@ -1,61 +1,86 @@
# FigShare documentation
# Figshare documentation

## File size

According to FigShare [FAQ](https://help.figshare.com/):
According to Figshare [documentation](https://info.figshare.com/user-guide/file-size-limits-and-storage/):

> Freely-available Figshare.com accounts have the following limits for sharing scholarly content:
storage quota: 20GB
max individual file size: 20GB
max no of collections: 100
max no of projects: 100
max no of items: 500
max no of files per item: 500
max no of collaborators on project: 100
max no of authors per item, collection: 100
max no of item version: 50
If you have more than 500 files that you need to include in an item, please create an archive (or archives) for the files (e.g. zip file).
If an individual would like to publish outputs larger than 20GB (up to many TBs), please consider Figshare+, our Figshare repository for FAIR-ly sharing big datasets that allows for more storage, larger files, additional metadata and license options, and expert support. There is a one-time cost associated with Figshare+ to cover the cost of storing the data persistently ad infinitum. Find out more about Figshare+ or get in touch at review@figshare.com with the storage amount needed and we will find the best way to support your data sharing.
> All figshare.com accounts are provided with 20GB of private storage and are able to upload individual files up to 20GB.

> For those using an institutional version of Figshare, the number of collaboration spaces will be determined by your institution. Please contact your administrator.

So we don't expect much files to have an individual size above 20 GB.
So we don't expect files to have an individual size above 20 GB.

## API

- [How to get a personnal token](https://info.figshare.com/user-guide/how-to-get-a-personal-token/)
- [REST API](https://docs.figshare.com/)
### Documentation

- [How to use the Figshare API](https://info.figshare.com/user-guide/how-to-use-the-figshare-api/)
- [API documentation](https://docs.figshare.com/)

### Token

Figshare requires a token to access its API: [How to get a personal token](https://info.figshare.com/user-guide/how-to-get-a-personal-token/)

### URL

## Query
https://api.figshare.com/v2/

[Search guide](https://help.figshare.com/article/how-to-use-advanced-search-in-figshare)
### Query

## Rate limiting
[Search guide](https://docs.figshare.com/#search)

https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
### Rate limiting

> We do not have automatic rate limiting in place for API requests. However, we do carry out monitoring to detect and mitigate abuse and prevent the platform's resources from being overused. We recommend that clients use the API responsibly and do not make more than one request per second. We reserve the right to throttle or block requests if we detect abuse.

## Dataset examples
Source: https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting

### MD-related file types
## Datasets

Query:
### Search for MD-related datasets

- Endpoint: `/articles/search`
- Documentation: <https://docs.figshare.com/#articles_search>

We search for MD-related datasets by file type, adding keywords when necessary. Keywords are searched in the `:title:`, `:description:` and `:keyword:` text fields. Example queries:

```none
resource_type.type:"dataset" AND filetype:"tpr"
```

Datasets:
or

```none
:extension: mdp AND (:title: 'md simulation' OR :description: 'md simulation' OR :keyword: 'md simulation')
:extension: mdp AND (:title: 'gromacs' OR :description: 'gromacs' OR :keyword: 'gromacs')
```
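Under the hood, such queries can be sent as a POST request to the search endpoint. A minimal sketch (the payload helper and its defaults are illustrative, not part of the scraper; the live call is commented out since it needs network access):

```python
API_URL = "https://api.figshare.com/v2/articles/search"


def build_search_payload(query: str, page: int = 1, page_size: int = 100) -> dict:
    """Build the JSON body for a POST request to /articles/search."""
    return {"search_for": query, "page": page, "page_size": page_size}


payload = build_search_payload('resource_type.type:"dataset" AND filetype:"tpr"')
print(payload["search_for"])
# import httpx  # already a project dependency
# articles = httpx.post(API_URL, json=payload, timeout=30).json()
```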

Example datasets:

- [Molecular dynamics of DSB in nucleosome](https://figshare.com/articles/dataset/M1_gro/5840706)
- [a-Synuclein short MD simulations:homo-A53T](https://figshare.com/articles/dataset/a-Synuclein_short_MD_simulations_homo-A53T/7007552)
- [Molecular Dynamics Protocol with Gromacs 4.0.7](https://figshare.com/articles/dataset/Molecular_Dynamics_Protocol_with_Gromacs_4_0_7/104603)

### zip files
### Search strategy

We search for all file types and keywords. Results are paginated in batches of 100 datasets.
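The batching described above can be sketched as a generic pagination loop; `fetch_page` is a hypothetical stand-in for a function that queries `/articles/search` and returns the decoded JSON list for one page:

```python
from collections.abc import Callable, Iterator


def paginate(
    fetch_page: Callable[[int, int], list], page_size: int = 100
) -> Iterator:
    """Yield results page by page until a short (or empty) page is returned."""
    page = 1
    while True:
        results = fetch_page(page, page_size)
        yield from results
        if len(results) < page_size:
            break
        page += 1


# Demo with a fake fetcher that serves 250 items in pages of 100.
def fake_fetch(page: int, size: int) -> list[int]:
    return list(range((page - 1) * size, min(page * size, 250)))


print(len(list(paginate(fake_fetch))))  # → 250
```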

### Get metadata for a given dataset

- Endpoint: `/articles/{dataset_id}`
- Documentation: <https://docs.figshare.com/#public_article>

Example dataset "[Molecular dynamics of DSB in nucleosome](https://figshare.com/articles/dataset/M1_gro/5840706)":

- web view: <https://figshare.com/articles/dataset/M1_gro/5840706>
- API view: <https://api.figshare.com/v2/articles/5840706>

All metadata related to a given dataset is provided, as well as metadata for all its files.
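A minimal sketch of fetching these metadata (the field names in the comments are assumptions based on typical Figshare responses; the live call is commented out):

```python
def article_url(dataset_id: int) -> str:
    """Build the public article endpoint URL for a given dataset id."""
    return f"https://api.figshare.com/v2/articles/{dataset_id}"


print(article_url(5840706))  # → https://api.figshare.com/v2/articles/5840706
# import httpx  # already a project dependency
# metadata = httpx.get(article_url(5840706), timeout=30).json()
# for file_entry in metadata["files"]:  # file metadata is embedded
#     print(file_entry["name"], file_entry["size"])
```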

### Zip files

The content of zip files is available through a preview (similar to Zenodo). The only metadata available in this preview is the file name (no file size, no md5sum).

Example dataset "[Molecular Dynamics Simulations](https://figshare.com/articles/dataset/Molecular_Dynamics_Simulations/30307108?file=58572346)":

Zip files content is available, like for Zenodo, but individual file sizes are not available.
- The content of the file "Molecular Dynamics Simulations.zip" is available at <https://figshare.com/ndownloader/files/58572346/preview/58572346/structure.json>

Example:
- For this dataset: [Molecular Dynamics Simulations](https://figshare.com/articles/dataset/Molecular_Dynamics_Simulations/30307108?file=58572346)
- Content of the file: [Molecular Dynamics Simulations.zip](https://figshare.com/ndownloader/files/58572346/preview/58572346/structure.json)
We need to emulate a web browser to access the URLs that link to the contents of zip files; otherwise, we get a 202 status code.
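A sketch of how such a preview URL is built (the live fetch with Selenium, as added in this PR, is commented out since it needs a headless Chrome):

```python
def zip_preview_url(file_id: int) -> str:
    """Build the structure.json preview URL for a given zip file id."""
    return (
        f"https://figshare.com/ndownloader/files/{file_id}"
        f"/preview/{file_id}/structure.json"
    )


print(zip_preview_url(58572346))
# A plain HTTP GET on this URL returns a 202 status code; rendering the page
# in a headless browser (Selenium + Chrome) exposes the JSON in a <pre> tag.
```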
1 change: 1 addition & 0 deletions pyproject.toml
@@ -25,6 +25,7 @@ dependencies = [
"pyyaml>=6.0.2",
"requests>=2.32.3",
"scipy>=1.15.2",
"selenium>=4.40.0",
]

[dependency-groups]
159 changes: 159 additions & 0 deletions src/mdverse_scrapers/core/network.py
@@ -1,10 +1,20 @@
"""Common functions and network utilities."""

import json
import time
from enum import StrEnum
from io import BytesIO

import certifi
import httpx
import loguru
import pycurl
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait


class HttpMethod(StrEnum):
@@ -148,3 +158,152 @@ def make_http_request_with_retries(
else:
logger.info("Retrying...")
return None


def parse_response_headers(headers_bytes: bytes) -> dict[str, str]:
    """Parse HTTP response headers from bytes into a dictionary.

    Parameters
    ----------
    headers_bytes : bytes
        Raw HTTP response headers.

    Returns
    -------
    dict
        A dictionary of HTTP response headers.
    """
    headers = {}
    headers_text = headers_bytes.decode("utf-8")
    for line in headers_text.split("\r\n"):
        if ": " in line:
            key, value = line.split(": ", maxsplit=1)
            headers[key] = value
    return headers


def send_http_request_with_retries_pycurl(
    url: str,
    data: dict | None = None,
    delay_before_request: float = 1.0,
    logger: "loguru.Logger" = loguru.logger,
) -> dict:
    """Query the Figshare API and return the JSON response.

    Parameters
    ----------
    url : str
        URL to send the request to.
    data : dict, optional
        Data to send in the request body (for POST requests).
    delay_before_request : float, optional
        Time to wait before sending the request, in seconds.
    logger : loguru.Logger, optional
        Logger object.

    Returns
    -------
    dict
        A dictionary with the following keys:
        - status_code: HTTP status code of the response.
        - elapsed_time: time taken to perform the request.
        - headers: dictionary of response headers.
        - response: JSON response from the API.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36"
        ),
        "Content-Type": "application/json",
    }
    # First, we wait.
    # https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
    # "We recommend that clients use the API responsibly
    # and do not make more than one request per second."
    time.sleep(delay_before_request)
    results = {}
    # Initialize a Curl object.
    curl = pycurl.Curl()
    # Set the URL to send the request to.
    curl.setopt(curl.URL, url)
    # Add headers as a list of strings.
    headers_lst = [f"{key}: {value}" for key, value in headers.items()]
    curl.setopt(curl.HTTPHEADER, headers_lst)
    # Handle SSL certificates.
    curl.setopt(curl.CAINFO, certifi.where())
    # Follow redirects.
    curl.setopt(curl.FOLLOWLOCATION, True)  # noqa: FBT003
    # If data is provided, set the request to POST and add the data.
    if data is not None:
        curl.setopt(curl.POST, True)  # noqa: FBT003
        data_json = json.dumps(data)
        curl.setopt(curl.POSTFIELDS, data_json)
    # Capture the response body in a buffer.
    body_buffer = BytesIO()
    curl.setopt(curl.WRITEFUNCTION, body_buffer.write)
    # Capture the response headers in a buffer.
    header_buffer = BytesIO()
    curl.setopt(curl.HEADERFUNCTION, header_buffer.write)
    # Perform the request.
    curl.perform()
    # Get the HTTP status code.
    status_code = curl.getinfo(curl.RESPONSE_CODE)
    results["status_code"] = status_code
    # Get the elapsed time.
    elapsed_time = curl.getinfo(curl.TOTAL_TIME)
    results["elapsed_time"] = elapsed_time
    # Close the Curl object.
    curl.close()
    # Parse the response headers from the buffer.
    response_headers = parse_response_headers(header_buffer.getvalue())
    results["headers"] = response_headers
    # Get the response body from the buffer.
    response = body_buffer.getvalue()
    # Convert the response body from bytes to a string.
    response = response.decode("utf-8")
    # Convert the response string to a JSON object.
    try:
        response = json.loads(response)
    except json.JSONDecodeError:
        logger.error("Error decoding JSON response:")
        logger.error(response[:100])
        response = None
    results["response"] = response
    return results


def get_html_page_with_selenium(
    url: str, tag: str = "body", logger: "loguru.Logger" = loguru.logger
) -> str | None:
    """Get HTML page content using Selenium.

    Parameters
    ----------
    url : str
        URL of the web page to retrieve.
    tag : str, optional
        CSS selector to wait for before retrieving the page content
        (default is "body").
    logger : loguru.Logger, optional
        Logger object.

    Returns
    -------
    str | None
        Text content of the page, or None if an error occurs.
    """
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--enable-javascript")
    page_content = ""
    logger.info("Retrieving page with Selenium:")
    logger.info(url)
    try:
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        page_content = (
            WebDriverWait(driver, 10)
            .until(ec.visibility_of_element_located((By.CSS_SELECTOR, tag)))
            .text
        )
        driver.quit()
    except WebDriverException as e:
        logger.error("Cannot retrieve page:")
        logger.error(url)
        logger.error(f"Selenium error: {e}")
        return None
    if not page_content:
        logger.error("Retrieved page content is empty.")
        return None
    return page_content
27 changes: 11 additions & 16 deletions src/mdverse_scrapers/scrapers/figshare.py
@@ -1,4 +1,5 @@
"""Scrape molecular dynamics datasets and files from Figshare."""

import json
import os
@@ -14,6 +15,7 @@

from ..core.figshare_api import FigshareAPI
from ..core.logger import create_logger
from ..core.network import get_html_page_with_selenium
from ..core.toolbox import (
ContextManager,
DataType,
@@ -61,12 +63,13 @@ def extract_files_from_json_response(


def extract_files_from_zip_file(
file_id: str, logger: "loguru.Logger" = loguru.logger, max_attempts: int = 3
) -> list[str]:
file_id: str, logger: "loguru.Logger" = loguru.logger) -> list[str]:
"""Extract files from a zip file content.

No endpoint is available in the Figshare API.
We perform a direct HTTP GET request to the zip file content url.
We need to use the Selenium library to emulate a browser request
as direct requests fail with a 202 status code.

Known issue with:
https://figshare.com/ndownloader/files/31660220/preview/31660220/structure.json
@@ -75,10 +78,8 @@ def extract_files_from_zip_file(
----------
file_id : str
ID of the zip file to get content from.
logger : loguru.Logger
logger : "loguru.Logger"
Logger object.
max_attempts : int
Maximum number of attempts to fetch the zip file content.

Returns
-------
@@ -90,23 +91,17 @@
f"https://figshare.com/ndownloader/files/{file_id}"
f"/preview/{file_id}/structure.json"
)
response = make_http_get_request_with_retries(
url=url,
logger=logger,
max_attempts=max_attempts,
timeout=30,
delay_before_request=2,
)
response = get_html_page_with_selenium(url, tag="pre", logger=logger)
if response is None:
logger.warning("Cannot get zip file content.")
return file_names
# Extract file names from JSON response.
try:
file_names = extract_files_from_json_response(response.json())
file_names = extract_files_from_json_response(json.loads(response))
except (json.decoder.JSONDecodeError, ValueError) as exc:
logger.warning(f"Cannot extract files from JSON response: {exc}")
logger.debug(f"Status code: {response.status_code}")
logger.debug(response.text)
logger.warning(f"Cannot extract files from HTML response: {exc}")
logger.debug("Response content:")
logger.debug(response)
logger.success(f"Found {len(file_names)} files.")
return file_names
