Conversation
Pull request overview
This PR refactors the Figshare scraper (and related scraping pipeline) to use the new ScraperContext and Pydantic-based metadata normalization/export, while also aligning Zenodo and NOMAD scrapers with the same metadata workflow and adding a debug mode flag.
Changes:
- Refactor Figshare/Zenodo/NOMAD scrapers to produce raw metadata dicts, normalize them into Pydantic models, and export to parquet via shared utilities.
- Replace the legacy ContextManager/dataframe flow with ScraperContext, list-based processing, and new helper utilities (normalization, deduplication, exclusion filtering, stats printing).
- Update Figshare docs and simplify the Figshare scraper tests.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/mdverse_scrapers/scrapers/figshare.py | Refactors Figshare scraping pipeline to ScraperContext + Pydantic normalization/export; adds debug mode and zip-content scraping. |
| src/mdverse_scrapers/scrapers/zenodo.py | Refactors Zenodo scraping pipeline to list-based metadata + Pydantic normalization/export; adds debug mode. |
| src/mdverse_scrapers/scrapers/nomad.py | Updates NOMAD scraper to use shared normalization/export utilities and adds debug mode early-stop. |
| src/mdverse_scrapers/models/utils.py | Introduces shared normalization helpers and parquet export for lists of Pydantic models. |
| src/mdverse_scrapers/core/toolbox.py | Removes legacy context/dataclass usage; adds list deduplication; refactors exclusion + false-positive filtering to operate on Pydantic models. |
| src/mdverse_scrapers/models/scraper.py | Adds is_in_debug_mode to the scraper context. |
| src/mdverse_scrapers/models/file.py | Reorders fields (no functional changes intended) while keeping required file metadata. |
| src/mdverse_scrapers/models/dataset.py | Reorders dataset statistic fields (no functional changes intended). |
| tests/scrapers/test_figshare.py | Simplifies Figshare tests after context removal. |
| docs/figshare.md | Fixes wording and improves formatting in Figshare documentation. |
```diff
 def scrap_zip_files_content(
+    all_files_metadata, logger: "loguru.Logger" = loguru.logger
+) -> list[dict]:
     """Scrap information from files contained in zip archives.

     Uncertain how many files can be fetched from the preview.
     Only get file name and file type.
     File size and MD5 checksum are not available.

     Arguments
     ---------
-    files_df: Pandas dataframe
-        Dataframe with information about files.
-    logger: loguru.Logger
-        Logger object.
+    all_files_metadata: list[dict]
+        List of dictionaries with files metadata.
+    logger: "loguru.Logger"
+        Logger for logging messages.

     Returns
     -------
-    zip_df: Pandas dataframe
-        Dataframe with information about files found in zip archive.
+    list[dict]
+        List of dictionaries with files metadata found in zip archive.
     """
-    files_in_zip_lst = []
-    zip_files_counter = 0
-    zip_files_df = files_df[files_df["file_type"] == "zip"]
-    logger.info(f"Number of zip files to scrap content from: {zip_files_df.shape[0]}")
-    for zip_files_counter, zip_idx in enumerate(zip_files_df.index, start=1):
-        zip_file = zip_files_df.loc[zip_idx]
-        file_id = zip_file["file_url"].split("/")[-1]
+    files_in_zip_metadata = []
+    # Select zip files only.
+    zip_files = [f_meta for f_meta in all_files_metadata if f_meta.file_type == "zip"]
+    logger.info(f"Number of zip files to scrap content from: {len(zip_files):,}")
```
scrap_zip_files_content iterates all_files_metadata as FileMetadata objects (uses .file_type, .file_url_in_repository, etc.), but the signature/docstring say list[dict]. Update the type hint and docstring to list[FileMetadata] to match actual usage and help callers pass the correct type.
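A minimal sketch of the suggested fix, assuming FileMetadata is importable from the models module listed in the file summary (the import path, docstring wording, and truncated body are illustrative, not the PR's actual code):

```python
import loguru

# Assumed import path, based on the PR's file summary (src/mdverse_scrapers/models/file.py).
from mdverse_scrapers.models.file import FileMetadata


def scrap_zip_files_content(
    all_files_metadata: list[FileMetadata],
    logger: "loguru.Logger" = loguru.logger,
) -> list[dict]:
    """Scrap information from files contained in zip archives.

    Arguments
    ---------
    all_files_metadata: list[FileMetadata]
        File metadata objects; only entries with file_type == "zip" are processed.
    logger: "loguru.Logger"
        Logger for logging messages.
    """
    # Attribute access (f_meta.file_type) only works with FileMetadata objects, not dicts.
    zip_files = [f_meta for f_meta in all_files_metadata if f_meta.file_type == "zip"]
    logger.info(f"Number of zip files to scrap content from: {len(zip_files):,}")
    return []  # rest of the zip-content scraping unchanged
```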
```python
        file_lst.append(
            {
                "file_name": path,
                "file_size": size,
```
extract_data_from_zip_file builds entries with the key file_size, but FileMetadata uses file_size_in_bytes. As a result, zip-extracted file sizes will be ignored during normalization. Use file_size_in_bytes here (ByteSize can parse strings like "4.6 kB" after the model's validator).
| "file_size": size, | |
| "file_size_in_bytes": size, |
```python
    try:
        _ = response_json["hits"]["hits"]
    except KeyError:
        logger.warning("Cannot extract hits the response JSON.")
```
Spelling/grammar: log message reads "Cannot extract hits the response JSON." It should say "Cannot extract hits from the response JSON."
| logger.warning("Cannot extract hits the response JSON.") | |
| logger.warning("Cannot extract hits from the response JSON.") |
```python
    for file_meta in files_list:
        dataset_id = file_meta["dataset_id_in_repository"]
        # Print info only when changing dataset.
        if dataset_id != previous_dataset_id:
            logger.info(f"Normalizing metadata for files in dataset: {dataset_id}")
        normalized_metadata = validate_metadata_against_model(
```
normalize_files_metadata directly indexes file_meta["dataset_id_in_repository"] and file_meta["file_name"] before validation. A single malformed entry will raise KeyError and stop the whole normalization. Use .get(...) (or wrap per-item handling in try/except) and skip/log malformed entries so normalization is resilient.
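A possible shape for that per-item guard, sketched against the loop shown above; validate_metadata_against_model and FileMetadata are the project's own helpers and are assumed to be importable as in the PR, so this is a sketch, not the actual implementation:

```python
# Sketch only: skip and log malformed entries instead of letting KeyError abort the loop.
previous_dataset_id = None
normalized_files = []
for file_meta in files_list:
    if not isinstance(file_meta, dict):
        logger.warning(f"Skipping non-dict file metadata entry: {file_meta!r}")
        continue
    dataset_id = file_meta.get("dataset_id_in_repository")
    file_name = file_meta.get("file_name")
    if dataset_id is None or file_name is None:
        logger.warning(f"Skipping malformed file metadata entry: {file_meta!r}")
        continue
    # Print info only when changing dataset.
    if dataset_id != previous_dataset_id:
        logger.info(f"Normalizing metadata for files in dataset: {dataset_id}")
        previous_dataset_id = dataset_id
    normalized_metadata = validate_metadata_against_model(
        file_meta, FileMetadata, logger=logger
    )
    if normalized_metadata:
        normalized_files.append(normalized_metadata)
```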
src/mdverse_scrapers/models/utils.py (Outdated)
| f"Metadata normalization failed for file: {file_meta['file_name']}" | ||
| ) | ||
| logger.info( | ||
| f"In dataset: {dataset_id}from {file_meta['dataset_repository_name']}" |
Spelling/formatting in log message: f"In dataset: {dataset_id}from ..." is missing a space after the dataset id, which makes logs hard to read.
| f"In dataset: {dataset_id}from {file_meta['dataset_repository_name']}" | |
| f"In dataset: {dataset_id} from {file_meta['dataset_repository_name']}" |
```python
    files_normalized_metadata += zip_normalized_metadata
    logger.info(f"Total number of files found: {len(files_normalized_metadata)}")
    files_normalized_metadata = remove_excluded_files(
        files_normalized_metadata, excluded_files, excluded_paths
```
remove_excluded_files now accepts a logger argument, but this call doesn't pass it, so exclusion statistics will go to the default loguru.logger instead of the scraper's configured logger/file. Pass logger=logger here for consistent logging.
Suggested change:

```diff
-        files_normalized_metadata, excluded_files, excluded_paths
+        files_normalized_metadata, excluded_files, excluded_paths, logger=logger
```
| "date_created": record_json["created_date"], | ||
| "date_last_updated": record_json["modified_date"], | ||
| "title": clean_text(record_json["title"]), | ||
| "author": clean_text(record_json["authors"][0]["full_name"]), |
In extract_metadata_from_single_dataset_record, the dataset author is stored under the key author, but the DatasetMetadata model expects author_names (list of strings). With the current key, the author information will be dropped during normalization (and could fail if extra fields are forbidden). Populate author_names (e.g., a single-element list) instead of author.
| "author": clean_text(record_json["authors"][0]["full_name"]), | |
| "author_names": [clean_text(record_json["authors"][0]["full_name"])], |
```python
        datasets_lst.append(dataset_info)
        files_lst += files_info
```
extract_metadata_from_single_dataset_record returns an empty dict for embargoed records, but get_metadata_for_datasets_and_files appends dataset_info unconditionally. This will later raise KeyError in normalize_datasets_metadata when it indexes dataset['dataset_id_in_repository']. Skip appending when dataset_info is empty (and similarly avoid merging empty files_info).
Suggested change:

```diff
-        datasets_lst.append(dataset_info)
-        files_lst += files_info
+        # Skip embargoed or otherwise invalid records that return empty metadata.
+        if dataset_info:
+            datasets_lst.append(dataset_info)
+        if files_info:
+            files_lst += files_info
```
src/mdverse_scrapers/models/utils.py (Outdated)
```python
        logger.info(
            f"Normalizing metadata for dataset: {dataset['dataset_id_in_repository']}"
        )
        normalized_metadata = validate_metadata_against_model(
            dataset, DatasetMetadata, logger=logger
        )
        if not normalized_metadata:
            logger.error(
                f"Metadata normalization failed for dataset "
                f"{dataset['dataset_id_in_repository']} "
                f"from {dataset['dataset_repository_name']}"
```
normalize_datasets_metadata indexes dataset['dataset_id_in_repository'] (and other keys) before validation. If any upstream scraper returns an empty/partial dict (e.g., Figshare embargoed records currently return {}), this will raise KeyError and abort normalization. Use .get(...) and skip/log malformed entries rather than crashing.
Suggested change:

```diff
-        logger.info(
-            f"Normalizing metadata for dataset: {dataset['dataset_id_in_repository']}"
-        )
-        normalized_metadata = validate_metadata_against_model(
-            dataset, DatasetMetadata, logger=logger
-        )
-        if not normalized_metadata:
-            logger.error(
-                f"Metadata normalization failed for dataset "
-                f"{dataset['dataset_id_in_repository']} "
-                f"from {dataset['dataset_repository_name']}"
+        if not isinstance(dataset, dict):
+            logger.warning(
+                "Skipping dataset metadata entry because it is not a dict: "
+                f"{dataset!r}"
+            )
+            continue
+        dataset_id = dataset.get("dataset_id_in_repository")
+        repo_name = dataset.get("dataset_repository_name")
+        dataset_id_for_log = (
+            dataset_id if dataset_id is not None else "<missing dataset_id_in_repository>"
+        )
+        repo_name_for_log = (
+            repo_name if repo_name is not None else "<missing dataset_repository_name>"
+        )
+        logger.info(
+            f"Normalizing metadata for dataset: {dataset_id_for_log}"
+        )
+        normalized_metadata = validate_metadata_against_model(
+            dataset, DatasetMetadata, logger=logger
+        )
+        if not normalized_metadata:
+            logger.error(
+                "Metadata normalization failed for dataset "
+                f"{dataset_id_for_log} from {repo_name_for_log}"
```
```diff
+def remove_duplicates_in_list_of_dicts(input_list: list[dict]) -> list[dict]:
+    """Remove duplicates in a list while preserving the original order.

     Parameters
     ----------
-    filename : str
-        Name of file to verify existence
-
-    Raises
-    ------
-    FileNotFoundError
-        If file does not exist or is not a file.
-    """
-    file_in = pathlib.Path(filename)
-    if not file_in.exists():
-        msg = f"File {filename} not found"
-        raise FileNotFoundError(msg)
-    if not file_in.is_file():
-        msg = f"{filename} is not a file"
-        raise FileNotFoundError(msg)
-
-
-def verify_output_directory(directory, logger: "loguru.Logger" = loguru.logger):
-    """Verify output directory exists.
-
-    Create it if necessary.
+    input_list : list
+        List with possible duplicate entries.

-    Parameters
-    ----------
-    directory : str
-        Path to directory to store results
-    logger : "loguru.Logger"
-        Logger for logging messages.
-
-    Raises
-    ------
-    FileNotFoundError
-        If directory path is an existing file.
+    Returns
+    -------
+    list
+        List without duplicates.
     """
-    directory_path = pathlib.Path(directory)
-    if directory_path.is_file():
-        msg = f"{directory} is an existing file."
-        raise FileNotFoundError(msg)
-    if directory_path.is_dir():
-        logger.info(f"Output directory {directory} already exists.")
-    else:
-        directory_path.mkdir(parents=True, exist_ok=True)
-        logger.info(f"Created output directory {directory}")
+    output_list = []
+    for dict_item in input_list:
+        if dict_item not in output_list:
+            output_list.append(dict_item)
+    return output_list
```
remove_duplicates_in_list_of_dicts is O(n^2) because it does a linear dict_item not in output_list check for every element. This can become a major bottleneck for large Zenodo result sets. Consider tracking a set of hashable keys (e.g., tuple(sorted(d.items())) or a domain-specific key like (dataset_id_in_repository, file_name)), while still preserving insertion order.
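A sketch of the suggested linear-time variant, assuming every value in the dicts is hashable (the domain-specific fallback key mentioned above is noted in the docstring):

```python
def remove_duplicates_in_list_of_dicts(input_list: list[dict]) -> list[dict]:
    """Remove duplicates in O(n) while preserving the original order.

    Assumes all dict values are hashable; for nested or unhashable values,
    switch the key to something domain-specific such as
    (d.get("dataset_id_in_repository"), d.get("file_name")).
    """
    seen: set[tuple] = set()
    output_list: list[dict] = []
    for dict_item in input_list:
        key = tuple(sorted(dict_item.items()))
        if key not in seen:
            seen.add(key)
            output_list.append(dict_item)
    return output_list
```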