Conversation
Pull request overview
This PR refactors the Figshare scraper (and related scraping pipeline) to use the new ScraperContext and Pydantic-based metadata normalization/export, while also aligning Zenodo and NOMAD scrapers with the same metadata workflow and adding a debug mode flag.
Changes:
- Refactor Figshare/Zenodo/NOMAD scrapers to produce raw metadata dicts, normalize them into Pydantic models, and export to parquet via shared utilities.
- Replace the legacy ContextManager/dataframe flow with ScraperContext, list-based processing, and new helper utilities (normalization, deduplication, exclusion filtering, stats printing).
- Update Figshare docs and simplify the Figshare scraper tests.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/mdverse_scrapers/scrapers/figshare.py | Refactors Figshare scraping pipeline to ScraperContext + Pydantic normalization/export; adds debug mode and zip-content scraping. |
| src/mdverse_scrapers/scrapers/zenodo.py | Refactors Zenodo scraping pipeline to list-based metadata + Pydantic normalization/export; adds debug mode. |
| src/mdverse_scrapers/scrapers/nomad.py | Updates NOMAD scraper to use shared normalization/export utilities and adds debug mode early-stop. |
| src/mdverse_scrapers/models/utils.py | Introduces shared normalization helpers and parquet export for lists of Pydantic models. |
| src/mdverse_scrapers/core/toolbox.py | Removes legacy context/dataclass usage; adds list deduplication; refactors exclusion + false-positive filtering to operate on Pydantic models. |
| src/mdverse_scrapers/models/scraper.py | Adds is_in_debug_mode to the scraper context. |
| src/mdverse_scrapers/models/file.py | Reorders fields (no functional changes intended) while keeping required file metadata. |
| src/mdverse_scrapers/models/dataset.py | Reorders dataset statistic fields (no functional changes intended). |
| tests/scrapers/test_figshare.py | Simplifies Figshare tests after context removal. |
| docs/figshare.md | Fixes wording and improves formatting in Figshare documentation. |
```diff
 def scrap_zip_files_content(
+    all_files_metadata, logger: "loguru.Logger" = loguru.logger
+) -> list[dict]:
     """Scrap information from files contained in zip archives.

     Uncertain how many files can be fetched from the preview.
     Only get file name and file type.
     File size and MD5 checksum are not available.

     Arguments
     ---------
-    files_df: Pandas dataframe
-        Dataframe with information about files.
-    logger: loguru.Logger
-        Logger object.
+    all_files_metadata: list[dict]
+        List of dictionaries with files metadata.
+    logger: "loguru.Logger"
+        Logger for logging messages.

     Returns
     -------
-    zip_df: Pandas dataframe
-        Dataframe with information about files found in zip archive.
+    list[dict]
+        List of dictionaries with files metadata found in zip archive.
     """
-    files_in_zip_lst = []
-    zip_files_counter = 0
-    zip_files_df = files_df[files_df["file_type"] == "zip"]
-    logger.info(f"Number of zip files to scrap content from: {zip_files_df.shape[0]}")
-    for zip_files_counter, zip_idx in enumerate(zip_files_df.index, start=1):
-        zip_file = zip_files_df.loc[zip_idx]
-        file_id = zip_file["file_url"].split("/")[-1]
+    files_in_zip_metadata = []
+    # Select zip files only.
+    zip_files = [f_meta for f_meta in all_files_metadata if f_meta.file_type == "zip"]
+    logger.info(f"Number of zip files to scrap content from: {len(zip_files):,}")
```
scrap_zip_files_content iterates all_files_metadata as FileMetadata objects (uses .file_type, .file_url_in_repository, etc.), but the signature/docstring say list[dict]. Update the type hint and docstring to list[FileMetadata] to match actual usage and help callers pass the correct type.
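A minimal sketch of the suggested fix, assuming FileMetadata is importable from the models module listed in the file summary (the import path, docstring wording, and truncated body are illustrative, not the PR's actual code):

```python
import loguru

# Assumed import path, based on the PR's file summary (src/mdverse_scrapers/models/file.py).
from mdverse_scrapers.models.file import FileMetadata


def scrap_zip_files_content(
    all_files_metadata: list[FileMetadata],
    logger: "loguru.Logger" = loguru.logger,
) -> list[dict]:
    """Scrap information from files contained in zip archives.

    Arguments
    ---------
    all_files_metadata: list[FileMetadata]
        File metadata objects; only entries with file_type == "zip" are processed.
    logger: "loguru.Logger"
        Logger for logging messages.
    """
    # Attribute access (f_meta.file_type) only works with FileMetadata objects, not dicts.
    zip_files = [f_meta for f_meta in all_files_metadata if f_meta.file_type == "zip"]
    logger.info(f"Number of zip files to scrap content from: {len(zip_files):,}")
    return []  # rest of the zip-content scraping unchanged
```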
```python
        file_lst.append(
            {
                "file_name": path,
                "file_size": size,
```
extract_data_from_zip_file builds entries with the key file_size, but FileMetadata uses file_size_in_bytes. As a result, zip-extracted file sizes will be ignored during normalization. Use file_size_in_bytes here (ByteSize can parse strings like "4.6 kB" after the model's validator).
| "file_size": size, | |
| "file_size_in_bytes": size, |
```python
    try:
        _ = response_json["hits"]["hits"]
    except KeyError:
        logger.warning("Cannot extract hits the response JSON.")
```
Spelling/grammar: log message reads "Cannot extract hits the response JSON." It should say "Cannot extract hits from the response JSON."
| logger.warning("Cannot extract hits the response JSON.") | |
| logger.warning("Cannot extract hits from the response JSON.") |
```python
    for file_meta in files_list:
        dataset_id = file_meta["dataset_id_in_repository"]
        # Print info only when changing dataset.
        if dataset_id != previous_dataset_id:
            logger.info(f"Normalizing metadata for files in dataset: {dataset_id}")
        normalized_metadata = validate_metadata_against_model(
```
normalize_files_metadata directly indexes file_meta["dataset_id_in_repository"] and file_meta["file_name"] before validation. A single malformed entry will raise KeyError and stop the whole normalization. Use .get(...) (or wrap per-item handling in try/except) and skip/log malformed entries so normalization is resilient.
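A possible shape for that per-item guard, sketched against the loop shown above; validate_metadata_against_model and FileMetadata are the project's own helpers and are assumed to be importable as in the PR, so this is a sketch, not the actual implementation:

```python
# Sketch only: skip and log malformed entries instead of letting KeyError abort the loop.
previous_dataset_id = None
normalized_files = []
for file_meta in files_list:
    if not isinstance(file_meta, dict):
        logger.warning(f"Skipping non-dict file metadata entry: {file_meta!r}")
        continue
    dataset_id = file_meta.get("dataset_id_in_repository")
    file_name = file_meta.get("file_name")
    if dataset_id is None or file_name is None:
        logger.warning(f"Skipping malformed file metadata entry: {file_meta!r}")
        continue
    # Print info only when changing dataset.
    if dataset_id != previous_dataset_id:
        logger.info(f"Normalizing metadata for files in dataset: {dataset_id}")
        previous_dataset_id = dataset_id
    normalized_metadata = validate_metadata_against_model(
        file_meta, FileMetadata, logger=logger
    )
    if normalized_metadata:
        normalized_files.append(normalized_metadata)
```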
src/mdverse_scrapers/models/utils.py (Outdated)
| f"Metadata normalization failed for file: {file_meta['file_name']}" | ||
| ) | ||
| logger.info( | ||
| f"In dataset: {dataset_id}from {file_meta['dataset_repository_name']}" |
Spelling/formatting in log message: f"In dataset: {dataset_id}from ..." is missing a space after the dataset id, which makes logs hard to read.
| f"In dataset: {dataset_id}from {file_meta['dataset_repository_name']}" | |
| f"In dataset: {dataset_id} from {file_meta['dataset_repository_name']}" |
```python
    files_normalized_metadata += zip_normalized_metadata
    logger.info(f"Total number of files found: {len(files_normalized_metadata)}")
    files_normalized_metadata = remove_excluded_files(
        files_normalized_metadata, excluded_files, excluded_paths
```
remove_excluded_files now accepts a logger argument, but this call doesn't pass it, so exclusion statistics will go to the default loguru.logger instead of the scraper's configured logger/file. Pass logger=logger here for consistent logging.
Suggested change:

```diff
-        files_normalized_metadata, excluded_files, excluded_paths
+        files_normalized_metadata, excluded_files, excluded_paths, logger=logger
```
| "date_created": record_json["created_date"], | ||
| "date_last_updated": record_json["modified_date"], | ||
| "title": clean_text(record_json["title"]), | ||
| "author": clean_text(record_json["authors"][0]["full_name"]), |
In extract_metadata_from_single_dataset_record, the dataset author is stored under the key author, but the DatasetMetadata model expects author_names (list of strings). With the current key, the author information will be dropped during normalization (and could fail if extra fields are forbidden). Populate author_names (e.g., a single-element list) instead of author.
| "author": clean_text(record_json["authors"][0]["full_name"]), | |
| "author_names": [clean_text(record_json["authors"][0]["full_name"])], |
```python
        datasets_lst.append(dataset_info)
        files_lst += files_info
```
extract_metadata_from_single_dataset_record returns an empty dict for embargoed records, but get_metadata_for_datasets_and_files appends dataset_info unconditionally. This will later raise KeyError in normalize_datasets_metadata when it indexes dataset['dataset_id_in_repository']. Skip appending when dataset_info is empty (and similarly avoid merging empty files_info).
Suggested change:

```diff
-        datasets_lst.append(dataset_info)
-        files_lst += files_info
+        # Skip embargoed or otherwise invalid records that return empty metadata.
+        if dataset_info:
+            datasets_lst.append(dataset_info)
+        if files_info:
+            files_lst += files_info
```
src/mdverse_scrapers/models/utils.py (Outdated)
```python
        logger.info(
            f"Normalizing metadata for dataset: {dataset['dataset_id_in_repository']}"
        )
        normalized_metadata = validate_metadata_against_model(
            dataset, DatasetMetadata, logger=logger
        )
        if not normalized_metadata:
            logger.error(
                f"Metadata normalization failed for dataset "
                f"{dataset['dataset_id_in_repository']} "
                f"from {dataset['dataset_repository_name']}"
```
normalize_datasets_metadata indexes dataset['dataset_id_in_repository'] (and other keys) before validation. If any upstream scraper returns an empty/partial dict (e.g., Figshare embargoed records currently return {}), this will raise KeyError and abort normalization. Use .get(...) and skip/log malformed entries rather than crashing.
Suggested change:

```diff
-        logger.info(
-            f"Normalizing metadata for dataset: {dataset['dataset_id_in_repository']}"
-        )
-        normalized_metadata = validate_metadata_against_model(
-            dataset, DatasetMetadata, logger=logger
-        )
-        if not normalized_metadata:
-            logger.error(
-                f"Metadata normalization failed for dataset "
-                f"{dataset['dataset_id_in_repository']} "
-                f"from {dataset['dataset_repository_name']}"
+        if not isinstance(dataset, dict):
+            logger.warning(
+                "Skipping dataset metadata entry because it is not a dict: "
+                f"{dataset!r}"
+            )
+            continue
+        dataset_id = dataset.get("dataset_id_in_repository")
+        repo_name = dataset.get("dataset_repository_name")
+        dataset_id_for_log = (
+            dataset_id if dataset_id is not None else "<missing dataset_id_in_repository>"
+        )
+        repo_name_for_log = (
+            repo_name if repo_name is not None else "<missing dataset_repository_name>"
+        )
+        logger.info(
+            f"Normalizing metadata for dataset: {dataset_id_for_log}"
+        )
+        normalized_metadata = validate_metadata_against_model(
+            dataset, DatasetMetadata, logger=logger
+        )
+        if not normalized_metadata:
+            logger.error(
+                "Metadata normalization failed for dataset "
+                f"{dataset_id_for_log} from {repo_name_for_log}"
```
```diff
+def remove_duplicates_in_list_of_dicts(input_list: list[dict]) -> list[dict]:
+    """Remove duplicates in a list while preserving the original order.

     Parameters
     ----------
-    filename : str
-        Name of file to verify existence
-
-    Raises
-    ------
-    FileNotFoundError
-        If file does not exist or is not a file.
-    """
-    file_in = pathlib.Path(filename)
-    if not file_in.exists():
-        msg = f"File {filename} not found"
-        raise FileNotFoundError(msg)
-    if not file_in.is_file():
-        msg = f"{filename} is not a file"
-        raise FileNotFoundError(msg)
-
-
-def verify_output_directory(directory, logger: "loguru.Logger" = loguru.logger):
-    """Verify output directory exists.
-
-    Create it if necessary.
+    input_list : list
+        List with possible duplicate entries.

-    Parameters
-    ----------
-    directory : str
-        Path to directory to store results
-    logger : "loguru.Logger"
-        Logger for logging messages.
-
-    Raises
-    ------
-    FileNotFoundError
-        If directory path is an existing file.
+    Returns
+    -------
+    list
+        List without duplicates.
     """
-    directory_path = pathlib.Path(directory)
-    if directory_path.is_file():
-        msg = f"{directory} is an existing file."
-        raise FileNotFoundError(msg)
-    if directory_path.is_dir():
-        logger.info(f"Output directory {directory} already exists.")
-    else:
-        directory_path.mkdir(parents=True, exist_ok=True)
-        logger.info(f"Created output directory {directory}")
+    output_list = []
+    for dict_item in input_list:
+        if dict_item not in output_list:
+            output_list.append(dict_item)
+    return output_list
```
remove_duplicates_in_list_of_dicts is O(n^2) because it does a linear dict_item not in output_list check for every element. This can become a major bottleneck for large Zenodo result sets. Consider tracking a set of hashable keys (e.g., tuple(sorted(d.items())) or a domain-specific key like (dataset_id_in_repository, file_name)), while still preserving insertion order.
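A sketch of the suggested linear-time variant, assuming every value in the dicts is hashable (the domain-specific fallback key mentioned above is noted in the docstring):

```python
def remove_duplicates_in_list_of_dicts(input_list: list[dict]) -> list[dict]:
    """Remove duplicates in O(n) while preserving the original order.

    Assumes all dict values are hashable; for nested or unhashable values,
    switch the key to something domain-specific such as
    (d.get("dataset_id_in_repository"), d.get("file_name")).
    """
    seen: set[tuple] = set()
    output_list: list[dict] = []
    for dict_item in input_list:
        key = tuple(sorted(dict_item.items()))
        if key not in seen:
            seen.add(key)
            output_list.append(dict_item)
    return output_list
```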