
feat: Add parquet file size in final statistics#75

Merged
pierrepo merged 7 commits into main from print-file-size-in-statistics
Jan 27, 2026

Conversation

@pierrepo
Member

No description provided.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds file size information to the final scraping statistics by implementing a utility function to convert file sizes from bytes to human-readable format and integrating it into the statistics output.

Changes:

  • Added convert_file_size_in_bytes_to_human_readable_format() function to convert file sizes using binary units (1024-based)
  • Modified print_statistics() to display parquet file sizes alongside file paths
  • Added comprehensive test suite for the new conversion function
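For context, the first iteration of the conversion helper can be reconstructed from the hunks quoted later in this conversation. This is a sketch: the post-loop fallback for sizes beyond the TB range is assumed, not shown in the quoted diff.

```python
def convert_file_size_in_bytes_to_human_readable_format(size_in_bytes: float) -> str:
    """Convert a size in bytes to a human-readable string using binary (1024-based) units."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1024.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024.0
    # Fallback for sizes beyond the TB range (assumed; not part of the quoted diff).
    return f"{size_in_bytes:.2f} PB"

print(convert_file_size_in_bytes_to_human_readable_format(1024**2))  # 1.00 MB
```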

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Reviewed files:

  • src/mdverse_scrapers/core/toolbox.py: Added file size conversion utility function and integrated it into the statistics printing logic
  • tests/core/test_toolbox.py: New test file with comprehensive test coverage for various file size ranges (bytes to terabytes) and edge cases


Comment on lines +550 to +556
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)

Copilot AI Jan 27, 2026


The variable name 'file_size' is reused for both datasets and files parquet file sizes. While this works functionally, it reduces code clarity. Consider using more specific names like 'datasets_file_size' and 'files_file_size', or extracting this logic into a helper method to avoid duplication of the stat().st_size call and conversion pattern.

Comment on lines +71 to +82
def test_edge_cases(self):
    """Test edge cases between unit boundaries."""
    # Just at the boundary
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**2) == "1.00 MB"
    )  # Exactly 1 MB
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**3) == "1.00 GB"
    )  # Exactly 1 GB
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**4) == "1.00 TB"
    )  # Exactly 1 TB

Copilot AI Jan 27, 2026


The test suite is missing test cases for invalid inputs, particularly negative values. Since the function accepts a float parameter, it should handle negative values appropriately (e.g., by raising a ValueError or returning an error message). Add a test case to verify the behavior when a negative value is passed.
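A hedged sketch of the missing test, assuming the guard the function eventually gained in a later round of this review (returning a sentinel string; raising ValueError would be the stricter alternative). The function is redefined here so the snippet runs standalone.

```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    # Guarded version, matching the final diff shown later in this review.
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"


def test_negative_size():
    """Negative byte counts should be rejected, not formatted."""
    assert convert_file_size_to_human_readable(-100) == "Negative size!"
    assert convert_file_size_to_human_readable(-1) == "Negative size!"


test_negative_size()
```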

Comment on lines +550 to +556
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)

Copilot AI Jan 27, 2026


The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.

Suggested change

    file_size = convert_file_size_in_bytes_to_human_readable_format(
        scraper.datasets_parquet_file_path.stat().st_size
    )
    logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
    file_size = convert_file_size_in_bytes_to_human_readable_format(
        scraper.files_parquet_file_path.stat().st_size
    )

becomes:

    if scraper.datasets_parquet_file_path.is_file():
        file_size = convert_file_size_in_bytes_to_human_readable_format(
            scraper.datasets_parquet_file_path.stat().st_size
        )
        logger.info(
            f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})"
        )
    else:
        logger.warning(
            f"Datasets parquet file not found at: {scraper.datasets_parquet_file_path}"
        )
        logger.info(
            f"Saved in: {scraper.datasets_parquet_file_path} (size unknown)"
        )
    if scraper.files_parquet_file_path.is_file():
        file_size = convert_file_size_in_bytes_to_human_readable_format(
            scraper.files_parquet_file_path.stat().st_size
        )
    else:
        logger.warning(
            f"Files parquet file not found at: {scraper.files_parquet_file_path}"
        )
        file_size = "size unknown"

Comment on lines +554 to +556
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)

Copilot AI Jan 27, 2026


The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.

Comment on lines +528 to +530
if size_in_bytes < 1024.0:
    return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1024.0

Copilot AI Jan 27, 2026


This function uses binary units (1024-based) for file size conversion (1 KB = 1024 B), while the FileMetadata.file_size_with_human_readable_unit property uses decimal units (1000-based) via ByteSize.human_readable(decimal=True). This inconsistency could confuse users seeing different unit conventions in the same application. Consider using the same unit convention throughout, or clearly documenting why different conventions are used. The industry standard for file sizes in operating systems is typically binary (1024-based), while storage devices typically use decimal (1000-based).

Suggested change

    if size_in_bytes < 1024.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1024.0

becomes:

    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0

Comment on lines +527 to +530
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1024.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1024.0

Copilot AI Jan 27, 2026


The function does not handle negative input values. While file sizes are typically non-negative, a negative value already satisfies the `size_in_bytes < 1024.0` check on the first iteration, so the function silently returns a nonsensical formatted value like "-100.00 B". Consider adding validation to handle this edge case, either by raising a ValueError or returning a special message for invalid inputs.
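A minimal demonstration of the unguarded behavior, with the function redefined from the hunk quoted above (the post-loop fallback is assumed, not part of the quoted diff):

```python
def convert_file_size_in_bytes_to_human_readable_format(size_in_bytes: float) -> str:
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1024.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024.0
    return f"{size_in_bytes:.2f} PB"  # assumed fallback

# -100 is already < 1024.0, so the first iteration returns a nonsensical size.
print(convert_file_size_in_bytes_to_human_readable_format(-100))  # -100.00 B
```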

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



Comment on lines +8 to +9
class TestConvertFileSizeInBytesToHumanReadableFormat:
    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""

Copilot AI Jan 27, 2026


The test class name TestConvertFileSizeInBytesToHumanReadableFormat doesn't match the actual function name convert_file_size_to_human_readable. The class name appears to reference a non-existent function name. For consistency and clarity, the class name should match the function being tested.

Suggested change

    class TestConvertFileSizeInBytesToHumanReadableFormat:
        """Tests for convert_file_size_in_bytes_to_human_readable_format function."""

becomes:

    class TestConvertFileSizeToHumanReadable:
        """Tests for convert_file_size_to_human_readable function."""
Comment on lines +565 to +568
if not scraper.files_parquet_file_path.is_file():
    logger.error("Files parquet file not found!")
    logger.error(f"{scraper.files_parquet_file_path} is missing.")
    return

Copilot AI Jan 27, 2026


Returning early when the files parquet file is missing prevents the remaining statistics from being printed, including elapsed time, log file path, and debug mode warning. This means if the files file is missing, the user won't see the elapsed time for the scraping operation or where the log file is saved. Consider logging the error but continuing to print the remaining statistics, or at least ensure the elapsed time and log file location are always printed before any early returns.
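One way to restructure along these lines, keeping the reviewer's point that elapsed time and the log file path should still be printed. The helper name and logger setup are hypothetical; this is a sketch, not the project's actual code.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def report_parquet_file(path: Path, label: str) -> None:
    """Log the saved file and its size if present; warn without aborting otherwise."""
    if path.is_file():
        logger.info(f"{label} saved in: {path} ({path.stat().st_size} bytes)")
    else:
        logger.error(f"{label} parquet file not found: {path}")
    # No early return here: the caller can go on to print elapsed time,
    # the log file path, and the debug-mode warning unconditionally.
```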



class TestConvertFileSizeInBytesToHumanReadableFormat:
    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""

Copilot AI Jan 27, 2026


The docstring references convert_file_size_in_bytes_to_human_readable_format function, but the actual function name is convert_file_size_to_human_readable. The docstring should accurately reflect the function being tested.

Suggested change

    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""

becomes:

    """Tests for convert_file_size_to_human_readable function."""
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.



Comment on lines +18 to +22
def test_kilobytes(self):
    """Test conversion for values in kilobytes range."""
    assert convert_file_size_to_human_readable(1_024) == "1.02 KB"
    assert convert_file_size_to_human_readable(1_234) == "1.23 KB"
    assert convert_file_size_to_human_readable(123_456) == "123.46 KB"

Copilot AI Jan 27, 2026


The test values use binary-based byte counts (e.g., 1_024, 1_048_576) but the implementation uses decimal (1000-based) conversion. While the expected results are correctly calculated, this creates confusion. For example, 1_024 bytes becomes "1.02 KB" rather than "1.00 KB" or "1.00 KiB". Consider either:

  1. Using decimal-based test values (1000, 1_000_000) for consistency with the decimal implementation
  2. Or renaming the tests to clarify they're testing decimal conversion of binary-sized inputs

The current approach works but may confuse readers about whether the function uses decimal (SI) or binary (IEC) units.
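The arithmetic behind those expectations, spelled out (decimal divisor applied to binary-sized inputs):

```python
# 1_024 bytes under a 1000-based divisor is 1.024 KB, which rounds to "1.02 KB".
assert f"{1_024 / 1000:.2f} KB" == "1.02 KB"
assert f"{1_234 / 1000:.2f} KB" == "1.23 KB"
assert f"{123_456 / 1000:.2f} KB" == "123.46 KB"
# The binary (IEC) reading of the same input would instead be exactly 1 KiB:
assert f"{1_024 / 1024:.2f} KiB" == "1.00 KiB"
```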

Comment on lines +529 to +532
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0

Copilot AI Jan 27, 2026


The function mutates the input parameter size_in_bytes during iteration by dividing it. While this works correctly (Python passes numbers by value), it makes the code less clear. Consider using a separate variable for the conversion loop to preserve the semantic meaning of the parameter name.

Suggested change

    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0

becomes:

    size = size_in_bytes
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size < 1000.0:
            return f"{size:.2f} {unit}"
        size /= 1000.0
Comment on lines +528 to +533
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"

Copilot AI Jan 27, 2026


The error messages "Negative size!" and "File too big!" are informal and not very descriptive. Consider using more informative messages such as "Invalid file size: negative value" and "File size exceeds maximum displayable range (>= 1000 TB)" to better communicate the issue to users.

Suggested change

        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"

becomes:

        return "Invalid file size: negative value"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File size exceeds maximum displayable range (>= 1000 TB)"

Returns
-------
str
    File size in a human-readable format (e.g., '10.52 MB').

Copilot AI Jan 27, 2026


The docstring should clarify that this function uses decimal (SI) units (1 KB = 1000 bytes) rather than binary (IEC) units (1 KiB = 1024 bytes). This is important because file sizes are often displayed using binary units in operating systems and tools. Consider adding a note in the docstring explaining the unit system, such as: "Uses decimal (SI) units where 1 KB = 1000 bytes, 1 MB = 1000 KB, etc."

Suggested change

    File size in a human-readable format (e.g., '10.52 MB').

becomes:

    File size in a human-readable format (e.g., '10.52 MB').

    Notes
    -----
    Uses decimal (SI) units where 1 KB = 1000 bytes, 1 MB = 1000 KB, etc.,
    rather than binary (IEC) units (1 KiB = 1024 bytes).
Comment on lines +514 to +533
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Convert file size in bytes to a human-readable format.

    Parameters
    ----------
    size_in_bytes : float
        File size in bytes.

    Returns
    -------
    str
        File size in a human-readable format (e.g., '10.52 MB').
    """
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"

Copilot AI Jan 27, 2026


Consider using pydantic's ByteSize for consistency with the existing file_size_with_human_readable_unit property in FileMetadata (models/file.py:81-94). You could convert the raw byte count to a ByteSize object and use its human_readable method. For example: ByteSize(size_in_bytes).human_readable(decimal=True, separator=" "). This would ensure consistent file size formatting across the codebase and leverage the well-tested pydantic implementation.
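A sketch of that suggestion. It assumes pydantic v2 is installed (as implied by the existing ByteSize usage in FileMetadata) and that `human_readable` accepts the `decimal` and `separator` keywords, which arrived in later pydantic 2.x releases; the try/except guards both assumptions.

```python
try:
    from pydantic import ByteSize

    def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
        """Delegate formatting to pydantic's ByteSize for codebase-wide consistency."""
        # decimal=True selects SI (1000-based) units, matching FileMetadata.
        return ByteSize(int(size_in_bytes)).human_readable(decimal=True, separator=" ")

    print(convert_file_size_to_human_readable(10_520_000))
except (ImportError, TypeError):
    print("pydantic (or this human_readable signature) unavailable; sketch only")
```

Note that ByteSize formats with one decimal place (e.g. "10.5 MB" rather than "10.52 MB"), so the two-decimal expectations in the tests quoted above would need adjusting.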

@pierrepo pierrepo merged commit cbf69fb into main Jan 27, 2026
1 check passed
@pierrepo pierrepo deleted the print-file-size-in-statistics branch January 27, 2026 17:46