feat: Add parquet file size in final statistics #75
Conversation
Pull request overview
This PR adds file size information to the final scraping statistics by implementing a utility function to convert file sizes from bytes to human-readable format and integrating it into the statistics output.
Changes:
- Added `convert_file_size_in_bytes_to_human_readable_format()` to convert file sizes using binary units (1024-based)
- Modified `print_statistics()` to display parquet file sizes alongside file paths
- Added a comprehensive test suite for the new conversion function
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/mdverse_scrapers/core/toolbox.py | Added file size conversion utility function and integrated it into the statistics printing logic |
| tests/core/test_toolbox.py | New test file with comprehensive test coverage for various file size ranges (bytes to terabytes) and edge cases |
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)
```
The variable name 'file_size' is reused for both datasets and files parquet file sizes. While this works functionally, it reduces code clarity. Consider using more specific names like 'datasets_file_size' and 'files_file_size', or extracting this logic into a helper method to avoid duplication of the stat().st_size call and conversion pattern.
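A sketch of the suggested extraction (the helper name `format_path_with_size` is hypothetical, not from the PR; the conversion function is reproduced from this revision's binary-based version):

```python
from pathlib import Path


def convert_file_size_in_bytes_to_human_readable_format(size_in_bytes: float) -> str:
    """Binary (1024-based) conversion, as in this revision of the PR."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1024.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024.0
    return f"{size_in_bytes:.2f} PB"


def format_path_with_size(path: Path) -> str:
    """Build the 'path (size)' string once, avoiding the duplicated
    stat().st_size + conversion pattern for each parquet file."""
    file_size = convert_file_size_in_bytes_to_human_readable_format(path.stat().st_size)
    return f"{path} ({file_size})"
```

`print_statistics()` could then log `f"Saved in: {format_path_with_size(...)}"` once per parquet file, with no reused `file_size` variable.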
tests/core/test_toolbox.py
Outdated
```python
def test_edge_cases(self):
    """Test edge cases between unit boundaries."""
    # Just at the boundary
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**2) == "1.00 MB"
    )  # Exactly 1 MB
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**3) == "1.00 GB"
    )  # Exactly 1 GB
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**4) == "1.00 TB"
    )  # Exactly 1 TB
```
The test suite is missing test cases for invalid inputs, particularly negative values. Since the function accepts a float parameter, it should handle negative values appropriately (e.g., by raising a ValueError or returning an error message). Add a test case to verify the behavior when a negative value is passed.
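Such a test might look like the following sketch. It assumes the function signals invalid input by returning an error string (the behavior the PR ultimately adopted); a design that raises `ValueError` would use `pytest.raises` instead:

```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Decimal conversion with explicit handling of negative input
    (mirrors the PR's final revision)."""
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"


def test_negative_value():
    """A negative size is invalid and should yield an error result."""
    assert convert_file_size_to_human_readable(-100) == "Negative size!"
    assert convert_file_size_to_human_readable(-0.5) == "Negative size!"
```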
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)
```
The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.
Suggested change:

```python
if scraper.datasets_parquet_file_path.is_file():
    file_size = convert_file_size_in_bytes_to_human_readable_format(
        scraper.datasets_parquet_file_path.stat().st_size
    )
    logger.info(
        f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})"
    )
else:
    logger.warning(
        f"Datasets parquet file not found at: {scraper.datasets_parquet_file_path}"
    )
    logger.info(
        f"Saved in: {scraper.datasets_parquet_file_path} (size unknown)"
    )
if scraper.files_parquet_file_path.is_file():
    file_size = convert_file_size_in_bytes_to_human_readable_format(
        scraper.files_parquet_file_path.stat().st_size
    )
else:
    logger.warning(
        f"Files parquet file not found at: {scraper.files_parquet_file_path}"
    )
    file_size = "size unknown"
```
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)
```
The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
if size_in_bytes < 1024.0:
    return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1024.0
```
This function uses binary units (1024-based) for file size conversion (1 KB = 1024 B), while the FileMetadata.file_size_with_human_readable_unit property uses decimal units (1000-based) via ByteSize.human_readable(decimal=True). This inconsistency could confuse users seeing different unit conventions in the same application. Consider using the same unit convention throughout, or clearly documenting why different conventions are used. The industry standard for file sizes in operating systems is typically binary (1024-based), while storage devices typically use decimal (1000-based).
Suggested change:

```python
if size_in_bytes < 1000.0:
    return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1000.0
```
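To make the inconsistency concrete, here is a minimal side-by-side sketch (both helpers are illustrative, not code from this repository):

```python
def human_readable_binary(size_in_bytes: float) -> str:
    """1 KB = 1024 B, like the function in this revision of the PR."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1024.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024.0
    return f"{size_in_bytes:.2f} PB"


def human_readable_decimal(size_in_bytes: float) -> str:
    """1 KB = 1000 B, like ByteSize.human_readable(decimal=True)."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return f"{size_in_bytes:.2f} PB"


# The same file is reported differently depending on the convention:
# 1_500_000 bytes -> "1.43 MB" (binary) vs "1.50 MB" (decimal)
```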
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1024.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1024.0
```
The function does not handle negative input values. While file sizes are typically non-negative, passing a negative value would cause the function to return from the first loop iteration with a negative formatted value like "-100.00 B". Consider adding validation to handle this edge case, either by raising a ValueError or returning a special message for invalid inputs.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
tests/core/test_toolbox.py
Outdated
```python
class TestConvertFileSizeInBytesToHumanReadableFormat:
    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""
```
The test class name TestConvertFileSizeInBytesToHumanReadableFormat doesn't match the actual function name convert_file_size_to_human_readable. The class name appears to reference a non-existent function name. For consistency and clarity, the class name should match the function being tested.
Suggested change:

```python
class TestConvertFileSizeToHumanReadable:
    """Tests for convert_file_size_to_human_readable function."""
```
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
if not scraper.files_parquet_file_path.is_file():
    logger.error("Files parquet file not found!")
    logger.error(f"{scraper.files_parquet_file_path} is missing.")
    return
```
Returning early when the files parquet file is missing prevents the remaining statistics from being printed, including elapsed time, log file path, and debug mode warning. This means if the files file is missing, the user won't see the elapsed time for the scraping operation or where the log file is saved. Consider logging the error but continuing to print the remaining statistics, or at least ensure the elapsed time and log file location are always printed before any early returns.
tests/core/test_toolbox.py
Outdated
```python
class TestConvertFileSizeInBytesToHumanReadableFormat:
    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""
```
The docstring references convert_file_size_in_bytes_to_human_readable_format function, but the actual function name is convert_file_size_to_human_readable. The docstring should accurately reflect the function being tested.
Suggested change:

```python
"""Tests for convert_file_size_to_human_readable function."""
```
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
```python
def test_kilobytes(self):
    """Test conversion for values in kilobytes range."""
    assert convert_file_size_to_human_readable(1_024) == "1.02 KB"
    assert convert_file_size_to_human_readable(1_234) == "1.23 KB"
    assert convert_file_size_to_human_readable(123_456) == "123.46 KB"
```
The test values use binary-based byte counts (e.g., 1_024, 1_048_576) but the implementation uses decimal (1000-based) conversion. While the expected results are correctly calculated, this creates confusion. For example, 1_024 bytes becomes "1.02 KB" rather than "1.00 KB" or "1.00 KiB". Consider either:
- Using decimal-based test values (1000, 1_000_000) for consistency with the decimal implementation
- Or renaming the tests to clarify they're testing decimal conversion of binary-sized inputs
The current approach works but may confuse readers about whether the function uses decimal (SI) or binary (IEC) units.
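A decimal-based version of the test could look like this sketch (the test name `test_kilobytes_decimal` is illustrative; the implementation mirrors the PR's decimal conversion):

```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Decimal (SI) conversion: 1 KB = 1000 B, as in the PR."""
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"


def test_kilobytes_decimal():
    """Use decimal byte counts so expected values read naturally."""
    assert convert_file_size_to_human_readable(1_000) == "1.00 KB"
    assert convert_file_size_to_human_readable(1_234) == "1.23 KB"
    assert convert_file_size_to_human_readable(123_456) == "123.46 KB"
```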
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0
```
The function rebinds the parameter size_in_bytes during iteration by dividing it. This works correctly (floats are immutable, so the caller's value is unaffected), but it makes the code less clear. Consider using a separate variable for the conversion loop to preserve the semantic meaning of the parameter name.
Suggested change:

```python
size = size_in_bytes
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size < 1000.0:
        return f"{size:.2f} {unit}"
    size /= 1000.0
```
```python
    return "Negative size!"
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0
return "File too big!"
```
The error messages "Negative size!" and "File too big!" are informal and not very descriptive. Consider using more informative messages such as "Invalid file size: negative value" and "File size exceeds maximum displayable range (>= 1000 TB)" to better communicate the issue to users.
Suggested change:

```python
    return "Invalid file size: negative value"
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0
return "File size exceeds maximum displayable range (>= 1000 TB)"
```
```python
Returns
-------
str
    File size in a human-readable format (e.g., '10.52 MB').
```
The docstring should clarify that this function uses decimal (SI) units (1 KB = 1000 bytes) rather than binary (IEC) units (1 KiB = 1024 bytes). This is important because file sizes are often displayed using binary units in operating systems and tools. Consider adding a note in the docstring explaining the unit system, such as: "Uses decimal (SI) units where 1 KB = 1000 bytes, 1 MB = 1000 KB, etc."
Suggested change:

```python
    File size in a human-readable format (e.g., '10.52 MB').

Notes
-----
Uses decimal (SI) units where 1 KB = 1000 bytes, 1 MB = 1000 KB, etc.,
rather than binary (IEC) units (1 KiB = 1024 bytes).
```
```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Convert file size in bytes to a human-readable format.

    Parameters
    ----------
    size_in_bytes : float
        File size in bytes.

    Returns
    -------
    str
        File size in a human-readable format (e.g., '10.52 MB').
    """
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"
```
Consider using pydantic's ByteSize for consistency with the existing file_size_with_human_readable_unit property in FileMetadata (models/file.py:81-94). You could convert the raw byte count to a ByteSize object and use its human_readable method. For example: ByteSize(size_in_bytes).human_readable(decimal=True, separator=" "). This would ensure consistent file size formatting across the codebase and leverage the well-tested pydantic implementation.
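A sketch of that approach, assuming pydantic v2 is installed (the `separator=" "` argument mentioned above requires a recent pydantic release, so it is omitted here):

```python
from pydantic import ByteSize


def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Format a byte count via pydantic's ByteSize, using decimal (SI)
    units for consistency with FileMetadata.file_size_with_human_readable_unit."""
    return ByteSize(int(size_in_bytes)).human_readable(decimal=True)
```

Note that `human_readable` formats with one decimal place and, by default, no space before the unit, so the output style differs slightly from the PR's '10.52 MB' format.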
No description provided.