Skip to content
This repository was archived by the owner on Mar 10, 2026. It is now read-only.
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 28 additions & 2 deletions src/mdverse_scrapers/core/toolbox.py
Original file line number Diff line number Diff line change
Expand Up @@ -511,6 +511,26 @@ def format_date(date: datetime | str) -> str:
raise TypeError(msg)


def convert_file_size_in_bytes_to_human_readable_format(size_in_bytes: float) -> str:
"""Convert file size in bytes to a human-readable format.

Parameters
----------
size_in_bytes : float
File size in bytes.

Returns
-------
str
File size in a human-readable format (e.g., '10.5 MB').
"""
for unit in ["B", "KB", "MB", "GB", "TB"]:
if size_in_bytes < 1024.0:
return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1024.0
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function uses binary units (1024-based) for file size conversion (1 KB = 1024 B), while the FileMetadata.file_size_with_human_readable_unit property uses decimal units (1000-based) via ByteSize.human_readable(decimal=True). This inconsistency could confuse users seeing different unit conventions in the same application. Consider using the same unit convention throughout, or clearly documenting why different conventions are used. The industry standard for file sizes in operating systems is typically binary (1024-based), while storage devices typically use decimal (1000-based).

Suggested change
if size_in_bytes < 1024.0:
return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1024.0
if size_in_bytes < 1000.0:
return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1000.0

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function does not handle negative input values. While file sizes are typically non-negative, passing a negative value would cause the loop to never exit the first iteration and return a negative formatted value like "-100.00 B". Consider adding validation to handle this edge case, either by raising a ValueError or returning a special message for invalid inputs.

Copilot uses AI. Check for mistakes.
return "File too big!"


def print_statistics(
scraper: ScraperContext, logger: "loguru.Logger" = loguru.logger
) -> None:
Expand All @@ -527,9 +547,15 @@ def print_statistics(
logger.success(
f"Number of datasets scraped: {scraper.number_of_datasets_scraped:,}"
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path}")
file_size = convert_file_size_in_bytes_to_human_readable_format(
scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
scraper.files_parquet_file_path.stat().st_size
)
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name 'file_size' is reused for both datasets and files parquet file sizes. While this works functionally, it reduces code clarity. Consider using more specific names like 'datasets_file_size' and 'files_file_size', or extracting this logic into a helper method to avoid duplication of the stat().st_size call and conversion pattern.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.

Suggested change
file_size = convert_file_size_in_bytes_to_human_readable_format(
scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
scraper.files_parquet_file_path.stat().st_size
)
if scraper.datasets_parquet_file_path.is_file():
file_size = convert_file_size_in_bytes_to_human_readable_format(
scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(
f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})"
)
else:
logger.warning(
f"Datasets parquet file not found at: {scraper.datasets_parquet_file_path}"
)
logger.info(
f"Saved in: {scraper.datasets_parquet_file_path} (size unknown)"
)
if scraper.files_parquet_file_path.is_file():
file_size = convert_file_size_in_bytes_to_human_readable_format(
scraper.files_parquet_file_path.stat().st_size
)
else:
logger.warning(
f"Files parquet file not found at: {scraper.files_parquet_file_path}"
)
file_size = "size unknown"

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.

Copilot uses AI. Check for mistakes.
logger.success(f"Number of files scraped: {scraper.number_of_files_scraped:,}")
logger.info(f"Saved in: {scraper.files_parquet_file_path}")
logger.info(f"Saved in: {scraper.files_parquet_file_path} ({file_size})")
elapsed_time = int((datetime.now() - scraper.start_time).total_seconds())
logger.success(
f"Scraped {scraper.data_source_name} in: {timedelta(seconds=elapsed_time)} 🎉"
Expand Down
82 changes: 82 additions & 0 deletions tests/core/test_toolbox.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
"""Tests for the toolbox module."""

from mdverse_scrapers.core.toolbox import (
convert_file_size_in_bytes_to_human_readable_format,
)


class TestConvertFileSizeInBytesToHumanReadableFormat:
"""Tests for convert_file_size_in_bytes_to_human_readable_format function."""
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test class name TestConvertFileSizeInBytesToHumanReadableFormat doesn't match the actual function name convert_file_size_to_human_readable. The class name appears to reference a non-existent function name. For consistency and clarity, the class name should match the function being tested.

Suggested change
class TestConvertFileSizeInBytesToHumanReadableFormat:
"""Tests for convert_file_size_in_bytes_to_human_readable_format function."""
class TestConvertFileSizeToHumanReadable:
"""Tests for convert_file_size_to_human_readable function."""

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring references convert_file_size_in_bytes_to_human_readable_format function, but the actual function name is convert_file_size_to_human_readable. The docstring should accurately reflect the function being tested.

Suggested change
"""Tests for convert_file_size_in_bytes_to_human_readable_format function."""
"""Tests for convert_file_size_to_human_readable function."""

Copilot uses AI. Check for mistakes.

def test_bytes(self):
"""Test conversion for values in bytes range."""
assert convert_file_size_in_bytes_to_human_readable_format(0) == "0.00 B"
assert convert_file_size_in_bytes_to_human_readable_format(1) == "1.00 B"
assert convert_file_size_in_bytes_to_human_readable_format(512) == "512.00 B"
assert convert_file_size_in_bytes_to_human_readable_format(1023) == "1023.00 B"

def test_kilobytes(self):
"""Test conversion for values in kilobytes range."""
assert convert_file_size_in_bytes_to_human_readable_format(1024) == "1.00 KB"
assert convert_file_size_in_bytes_to_human_readable_format(1536) == "1.50 KB"
assert convert_file_size_in_bytes_to_human_readable_format(10240) == "10.00 KB"
assert (
convert_file_size_in_bytes_to_human_readable_format(127560) == "124.57 KB"
)

def test_megabytes(self):
"""Test conversion for values in megabytes range."""
assert convert_file_size_in_bytes_to_human_readable_format(1048576) == "1.00 MB"
assert convert_file_size_in_bytes_to_human_readable_format(1289748) == "1.23 MB"
assert (
convert_file_size_in_bytes_to_human_readable_format(10485760) == "10.00 MB"
)
assert (
convert_file_size_in_bytes_to_human_readable_format(104857600)
== "100.00 MB"
)

def test_gigabytes(self):
"""Test conversion for values in gigabytes range."""
assert (
convert_file_size_in_bytes_to_human_readable_format(1073741824) == "1.00 GB"
)
assert (
convert_file_size_in_bytes_to_human_readable_format(2147483648) == "2.00 GB"
)
assert (
convert_file_size_in_bytes_to_human_readable_format(132553428173)
== "123.45 GB"
)

def test_terabytes(self):
"""Test conversion for values in terabytes range."""
assert (
convert_file_size_in_bytes_to_human_readable_format(1099511627776)
== "1.00 TB"
) # 1 TB
assert (
convert_file_size_in_bytes_to_human_readable_format(5497558138880)
== "5.00 TB"
) # 5 TB

def test_very_large_file(self):
"""Test conversion for files larger than terabytes."""
# 1 PB (petabyte)
assert (
convert_file_size_in_bytes_to_human_readable_format(1125899906842624)
== "File too big!"
)

def test_edge_cases(self):
"""Test edge cases between unit boundaries."""
# Just at the boundary
assert (
convert_file_size_in_bytes_to_human_readable_format(1024**2) == "1.00 MB"
) # Exactly 1 MB
assert (
convert_file_size_in_bytes_to_human_readable_format(1024**3) == "1.00 GB"
) # Exactly 1 GB
assert (
convert_file_size_in_bytes_to_human_readable_format(1024**4) == "1.00 TB"
) # Exactly 1 TB
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test suite is missing test cases for invalid inputs, particularly negative values. Since the function accepts a float parameter, it should handle negative values appropriately (e.g., by raising a ValueError or returning an error message). Add a test case to verify the behavior when a negative value is passed.

Copilot uses AI. Check for mistakes.
Loading