This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Commit 87e985c

Merge pull request #64 from MDverse/fix-figshare-202-status-code
Fix figshare 202 status code
2 parents 11f10c3 + 12328f3 commit 87e985c

6 files changed: +393 additions, -53 deletions


.pre-commit-config.yaml

Lines changed: 3 additions & 5 deletions
```diff
@@ -1,6 +1,6 @@
 # Install pre-commit hooks with:
 # prek install
-exclude: "scripts/*|tmp/*|.*.mdp|"
+exclude: "scripts/*|tmp/*"
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v6.0.0
@@ -24,14 +24,12 @@ repos:
     hooks:
       # Run the linter.
      - id: ruff-check
-        types_or: [ python, pyi ]
+        types: [python]
        args: [ --fix ]
      # Run the formatter.
      - id: ruff-format
-        types_or: [ python, pyi ]
-
+        types: [python]
  - repo: https://github.com/PyCQA/bandit
    rev: '1.9.2'
    hooks:
    - id: bandit
-
```

docs/figshare.md

Lines changed: 57 additions & 32 deletions
````diff
@@ -1,61 +1,86 @@
-# FigShare documentation
+# Figshare documentation
 
 ## File size
 
-According to FigShare [FAQ](https://help.figshare.com/):
+According to Figshare [documentation](https://info.figshare.com/user-guide/file-size-limits-and-storage/):
 
-> Freely-available Figshare.com accounts have the following limits for sharing scholarly content:
-storage quota: 20GB
-max individual file size: 20GB
-max no of collections: 100
-max no of projects: 100
-max no of items: 500
-max no of files per item: 500
-max no of collaborators on project: 100
-max no of authors per item, collection: 100
-max no of item version: 50
-If you have more than 500 files that you need to include in an item, please create an archive (or archives) for the files (e.g. zip file).
-If an individual would like to publish outputs larger than 20GB (up to many TBs), please consider Figshare+, our Figshare repository for FAIR-ly sharing big datasets that allows for more storage, larger files, additional metadata and license options, and expert support. There is a one-time cost associated with Figshare+ to cover the cost of storing the data persistently ad infinitum. Find out more about Figshare+ or get in touch at review@figshare.com with the storage amount needed and we will find the best way to support your data sharing.
+> All figshare.com accounts are provided with 20GB of private storage and are able to upload individual files up to 20GB.
 
-> For those using an institutional version of Figshare, the number of collaboration spaces will be determined by your institution. Please contact your administrator.
-
-So we don't expect much files to have an individual size above 20 GB.
+So we don't expect files to have an individual size above 20 GB.
 
 ## API
 
-- [How to get a personal token](https://info.figshare.com/user-guide/how-to-get-a-personal-token/)
-- [REST API](https://docs.figshare.com/)
+### Documentation
+
+- [How to use the Figshare API](https://info.figshare.com/user-guide/how-to-use-the-figshare-api/)
+- [API documentation](https://docs.figshare.com/)
+
+### Token
+
+Figshare requires a token to access its API: [How to get a personal token](https://info.figshare.com/user-guide/how-to-get-a-personal-token/)
+
+### URL
 
-## Query
+https://api.figshare.com/v2/
 
-[Search guide](https://help.figshare.com/article/how-to-use-advanced-search-in-figshare)
+### Query
 
-## Rate limiting
+[Search guide](https://docs.figshare.com/#search)
 
-https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
+### Rate limiting
 
 > We do not have automatic rate limiting in place for API requests. However, we do carry out monitoring to detect and mitigate abuse and prevent the platform's resources from being overused. We recommend that clients use the API responsibly and do not make more than one request per second. We reserve the right to throttle or block requests if we detect abuse.
 
-## Dataset examples
+Source: https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
 
-### MD-related file types
+## Datasets
 
-Query:
+### Search for MD-related datasets
+
+- Endpoint: `/articles/search`
+- Documentation: <https://docs.figshare.com/#articles_search>
+
+We search MD-related datasets by searching for file types and keywords if necessary. Keywords are searched into `:title:`, `:description:` and `:keywords:` text fields. Example queries:
 
 ```none
 resource_type.type:"dataset" AND filetype:"tpr"
 ```
 
-Datasets:
+or
+
+```none
+:extension: mdp AND (:title: 'md simulation' OR :description: 'md simulation' OR :keyword: 'md simulation')
+:extension: mdp AND (:title: 'gromacs' OR :description: 'gromacs' OR :keyword: 'gromacs')
+```
+
+Example datasets:
 
 - [Molecular dynamics of DSB in nucleosome](https://figshare.com/articles/dataset/M1_gro/5840706)
 - [a-Synuclein short MD simulations:homo-A53T](https://figshare.com/articles/dataset/a-Synuclein_short_MD_simulations_homo-A53T/7007552)
 - [Molecular Dynamics Protocol with Gromacs 4.0.7](https://figshare.com/articles/dataset/Molecular_Dynamics_Protocol_with_Gromacs_4_0_7/104603)
 
-### zip files
+### Search strategy
+
+We search for all file types and keywords. Results are paginated by batch of 100 datasets.
+
+### Get metadata for a given dataset
+
+- Endpoint: `/articles/{dataset_id}`
+- Documentation: <https://docs.figshare.com/#public_article>
+
+Example dataset "[Molecular dynamics of DSB in nucleosome](https://figshare.com/articles/dataset/M1_gro/5840706)":
+
+- web view: <https://figshare.com/articles/dataset/M1_gro/5840706>
+- API view: <https://api.figshare.com/v2/articles/5840706>
+
+All metadata related to a given dataset is provided, as well as all files metadata.
+
+### Zip files
+
+Zip files content is available with a preview (similar to Zenodo). The only metadata available within this preview is the file name (no file size, no md5sum).
+
+Example dataset "[Molecular Dynamics Simulations](https://figshare.com/articles/dataset/Molecular_Dynamics_Simulations/30307108?file=58572346)":
 
-Zip files content is available, like for Zenodo, but individual file sizes are not available.
+- The content of the file "Molecular Dynamics Simulations.zip" is available at <https://figshare.com/ndownloader/files/58572346/preview/58572346/structure.json>
 
-Example:
-- For this dataset: [Molecular Dynamics Simulations](https://figshare.com/articles/dataset/Molecular_Dynamics_Simulations/30307108?file=58572346)
-- Content of the file: [Molecular Dynamics Simulations.zip](https://figshare.com/ndownloader/files/58572346/preview/58572346/structure.json)
+We need to emulate a web browser to access the URLs linking to the contents of zip files. Otherwise, we get a 202 code.
````

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -25,6 +25,7 @@ dependencies = [
     "pyyaml>=6.0.2",
     "requests>=2.32.3",
     "scipy>=1.15.2",
+    "selenium>=4.40.0",
 ]
 
 [dependency-groups]
```

src/mdverse_scrapers/core/network.py

Lines changed: 159 additions & 0 deletions
```diff
@@ -1,10 +1,20 @@
 """Common functions and network utilities."""
 
+import json
 import time
 from enum import StrEnum
+from io import BytesIO
 
+import certifi
 import httpx
 import loguru
+import pycurl
+from selenium import webdriver
+from selenium.common.exceptions import WebDriverException
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support import expected_conditions as ec
+from selenium.webdriver.support.ui import WebDriverWait
 
 
 class HttpMethod(StrEnum):
@@ -148,3 +158,152 @@ def make_http_request_with_retries(
         else:
             logger.info("Retrying...")
     return None
+
+
+def parse_response_headers(headers_bytes: bytes) -> dict[str, str]:
+    """Parse HTTP response header from bytes to a dictionary.
+
+    Returns
+    -------
+    dict
+        A dictionary of HTTP response headers.
+    """
+    headers = {}
+    headers_text = headers_bytes.decode("utf-8")
+    for line in headers_text.split("\r\n"):
+        if ": " in line:
+            key, value = line.split(": ", maxsplit=1)
+            headers[key] = value
+    return headers
+
+
+def send_http_request_with_retries_pycurl(
+    url: str,
+    data: dict | None = None,
+    delay_before_request: float = 1.0,
+    logger: "loguru.Logger" = loguru.logger,
+) -> dict:
+    """Query the Figshare API and return the JSON response.
+
+    Parameters
+    ----------
+    url : str
+        URL to send the request to.
+    data : dict, optional
+        Data to send in the request body (for POST requests).
+    delay_before_request : float, optional
+        Time to wait before sending the request, in seconds.
+
+    Returns
+    -------
+    dict
+        A dictionary with the following keys:
+        - status_code: HTTP status code of the response.
+        - elapsed_time: Time taken to perform the request.
+        - headers: Dictionary of response headers.
+        - response: JSON response from the API.
+    """
+    # First, we wait.
+    # https://docs.figshare.com/#figshare_documentation_api_description_rate_limiting
+    # "We recommend that clients use the API responsibly
+    # and do not make more than one request per second."
+    headers = {
+        "User-Agent": (
+            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36"
+        ),
+        "Content-Type": "application/json",
+    }
+    time.sleep(delay_before_request)
+    results = {}
+    # Initialize a Curl object.
+    curl = pycurl.Curl()
+    # Set the URL to send the request to.
+    curl.setopt(curl.URL, url)
+    # Add headers as a list of strings.
+    headers_lst = [f"{key}: {value}" for key, value in headers.items()]
+    curl.setopt(curl.HTTPHEADER, headers_lst)
+    # Handle SSL certificates.
+    curl.setopt(curl.CAINFO, certifi.where())
+    # Follow redirect.
+    curl.setopt(curl.FOLLOWLOCATION, True)  # noqa: FBT003
+    # If data is provided, set the request to POST and add the data.
+    if data is not None:
+        curl.setopt(curl.POST, True)  # noqa: FBT003
+        data_json = json.dumps(data)
+        curl.setopt(curl.POSTFIELDS, data_json)
+    # Capture the response body in a buffer.
+    body_buffer = BytesIO()
+    curl.setopt(curl.WRITEFUNCTION, body_buffer.write)
+    # Capture the response headers in a buffer.
+    header_buffer = BytesIO()
+    curl.setopt(curl.HEADERFUNCTION, header_buffer.write)
+    # Perform the request.
+    curl.perform()
+    # Get the HTTP status code.
+    status_code = curl.getinfo(curl.RESPONSE_CODE)
+    results["status_code"] = status_code
+    # Get elapsed time.
+    elapsed_time = curl.getinfo(curl.TOTAL_TIME)
+    results["elapsed_time"] = elapsed_time
+    # Close the Curl object.
+    curl.close()
+    # Get the response headers from the buffer.
+    response_headers = parse_response_headers(header_buffer.getvalue())
+    results["headers"] = response_headers
+    # Get the response body from the buffer.
+    response = body_buffer.getvalue()
+    # Convert the response body from bytes to a string.
+    response = response.decode("utf-8")
+    # Convert the response string to a JSON object.
+    try:
+        response = json.loads(response)
+    except json.JSONDecodeError:
+        logger.error("Error decoding JSON response:")
+        logger.error(response[:100])
+        response = None
+    results["response"] = response
+    return results
+
+
+def get_html_page_with_selenium(
+    url: str, tag: str = "body", logger: "loguru.Logger" = loguru.logger
+) -> str | None:
+    """Get HTML page content using Selenium.
+
+    Parameters
+    ----------
+    url : str
+        URL of the web page to retrieve.
+    tag : str, optional
+        HTML tag to wait for before retrieving the page content (default is "body").
+
+    Returns
+    -------
+    str | None
+        HTML content of the page, or None if an error occurs.
+    """
+    options = Options()
+    options.add_argument("--headless")
+    options.add_argument("--enable-javascript")
+    page_content = ""
+    logger.info("Retrieving page with Selenium:")
+    logger.info(url)
+    try:
+        driver = webdriver.Chrome(options=options)
+        driver.get(url)
+        page_content = (
+            WebDriverWait(driver, 10)
+            .until(ec.visibility_of_element_located((By.CSS_SELECTOR, tag)))
+            .text
+        )
+        driver.quit()
+    except WebDriverException as e:
+        logger.error("Cannot retrieve page:")
+        logger.error(url)
+        logger.error(f"Selenium error: {e}")
+        return None
+    if not page_content:
+        logger.error("Retrieved page content is empty.")
+        return None
+    return page_content
```
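Of the functions added in this file, `parse_response_headers` is pure and easy to sanity-check in isolation. A standalone copy of the same logic, run against made-up sample header bytes (the sample values are for illustration only):

```python
def parse_response_headers(headers_bytes: bytes) -> dict[str, str]:
    """Parse raw HTTP response headers into a dict (same logic as the diff)."""
    headers = {}
    for line in headers_bytes.decode("utf-8").split("\r\n"):
        # Only "Key: Value" lines are kept; the status line is skipped.
        if ": " in line:
            key, value = line.split(": ", maxsplit=1)
            headers[key] = value
    return headers


raw = b"HTTP/2 200\r\ncontent-type: application/json\r\nretry-after: 1\r\n\r\n"
print(parse_response_headers(raw))
# {'content-type': 'application/json', 'retry-after': '1'}
```

Using `maxsplit=1` matters here: header values such as dates contain `": "`-free colons rarely, but splitting only once keeps any `": "` inside the value intact.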

src/mdverse_scrapers/scrapers/figshare.py

Lines changed: 11 additions & 16 deletions
```diff
@@ -1,4 +1,5 @@
 """Scrape molecular dynamics datasets and files from Figshare."""
+from arrow import get
 
 import json
 import os
@@ -14,6 +15,7 @@
 
 from ..core.figshare_api import FigshareAPI
 from ..core.logger import create_logger
+from ..core.network import get_html_page_with_selenium
 from ..core.toolbox import (
     ContextManager,
     DataType,
@@ -61,12 +63,13 @@ def extract_files_from_json_response(
 
 
 def extract_files_from_zip_file(
-    file_id: str, logger: "loguru.Logger" = loguru.logger, max_attempts: int = 3
-) -> list[str]:
+    file_id: str, logger: "loguru.Logger" = loguru.logger) -> list[str]:
     """Extract files from a zip file content.
 
     No endpoint is available in the Figshare API.
     We perform a direct HTTP GET request to the zip file content url.
+    We need to use the Selenium library to emulate a browser request
+    as direct requests fail with a 202 status code.
 
     Known issue with:
     https://figshare.com/ndownloader/files/31660220/preview/31660220/structure.json
@@ -75,10 +78,8 @@ def extract_files_from_zip_file(
     ----------
     file_id : str
         ID of the zip file to get content from.
-    logger : loguru.Logger
+    logger : "loguru.Logger"
         Logger object.
-    max_attempts : int
-        Maximum number of attempts to fetch the zip file content.
 
     Returns
     -------
@@ -90,23 +91,17 @@ def extract_files_from_zip_file(
         f"https://figshare.com/ndownloader/files/{file_id}"
         f"/preview/{file_id}/structure.json"
     )
-    response = make_http_get_request_with_retries(
-        url=url,
-        logger=logger,
-        max_attempts=max_attempts,
-        timeout=30,
-        delay_before_request=2,
-    )
+    response = get_html_page_with_selenium(url, tag="pre", logger=logger)
     if response is None:
         logger.warning("Cannot get zip file content.")
        return file_names
     # Extract file names from JSON response.
     try:
-        file_names = extract_files_from_json_response(response.json())
+        file_names = extract_files_from_json_response(json.loads(response))
     except (json.decoder.JSONDecodeError, ValueError) as exc:
-        logger.warning(f"Cannot extract files from JSON response: {exc}")
-        logger.debug(f"Status code: {response.status_code}")
-        logger.debug(response.text)
+        logger.warning(f"Cannot extract files from HTML response: {exc}")
+        logger.debug("Response content:")
+        logger.debug(response)
     logger.success(f"Found {len(file_names)} files.")
     return file_names
 
```
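The diff does not show `extract_files_from_json_response`, so the exact schema of the preview `structure.json` is not visible here. A hedged sketch of that parsing step, assuming a nested `files`/`dirs` tree — this shape and the `extract_file_names` helper are assumptions for illustration, not taken from the repository:

```python
import json


def extract_file_names(node: dict) -> list[str]:
    """Recursively collect file names from a zip-preview tree.

    The {"files": [...], "dirs": [...]} layout is assumed here;
    the real structure.json schema may differ.
    """
    names = [entry["name"] for entry in node.get("files", [])]
    for subdir in node.get("dirs", []):
        names.extend(extract_file_names(subdir))
    return names


# Fabricated sample standing in for a downloaded structure.json.
sample = json.loads(
    '{"files": [{"name": "topol.tpr"}], '
    '"dirs": [{"files": [{"name": "traj.xtc"}], "dirs": []}]}'
)
print(extract_file_names(sample))  # ['topol.tpr', 'traj.xtc']
```

Whatever the real schema, only names can be recovered this way: as noted in `docs/figshare.md`, the preview carries no file sizes or checksums.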
