feat: Add parquet file size in final statistics #75
Conversation
Pull request overview
This PR adds file size information to the final scraping statistics by implementing a utility function to convert file sizes from bytes to human-readable format and integrating it into the statistics output.
Changes:
- Added `convert_file_size_in_bytes_to_human_readable_format()` to convert file sizes using binary units (1024-based)
- Modified `print_statistics()` to display parquet file sizes alongside file paths
- Added a comprehensive test suite for the new conversion function
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/mdverse_scrapers/core/toolbox.py | Added file size conversion utility function and integrated it into the statistics printing logic |
| tests/core/test_toolbox.py | New test file with comprehensive test coverage for various file size ranges (bytes to terabytes) and edge cases |
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)
```
The variable name 'file_size' is reused for both datasets and files parquet file sizes. While this works functionally, it reduces code clarity. Consider using more specific names like 'datasets_file_size' and 'files_file_size', or extracting this logic into a helper method to avoid duplication of the stat().st_size call and conversion pattern.
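A sketch of the suggested extraction (the helper name `format_path_with_size` is hypothetical, not from the PR; the conversion function is reproduced from this revision's binary-based version):

```python
from pathlib import Path


def convert_file_size_in_bytes_to_human_readable_format(size_in_bytes: float) -> str:
    """Binary (1024-based) conversion, as in this revision of the PR."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1024.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024.0
    return f"{size_in_bytes:.2f} PB"


def format_path_with_size(path: Path) -> str:
    """Build the 'path (size)' string once, avoiding the duplicated
    stat().st_size + conversion pattern for each parquet file."""
    file_size = convert_file_size_in_bytes_to_human_readable_format(path.stat().st_size)
    return f"{path} ({file_size})"
```

`print_statistics()` could then log `f"Saved in: {format_path_with_size(...)}"` once per parquet file, with no reused `file_size` variable.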
tests/core/test_toolbox.py
Outdated
```python
def test_edge_cases(self):
    """Test edge cases between unit boundaries."""
    # Just at the boundary
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**2) == "1.00 MB"
    )  # Exactly 1 MB
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**3) == "1.00 GB"
    )  # Exactly 1 GB
    assert (
        convert_file_size_in_bytes_to_human_readable_format(1024**4) == "1.00 TB"
    )  # Exactly 1 TB
```
The test suite is missing test cases for invalid inputs, particularly negative values. Since the function accepts a float parameter, it should handle negative values appropriately (e.g., by raising a ValueError or returning an error message). Add a test case to verify the behavior when a negative value is passed.
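Such a test might look like the following sketch. It assumes the function signals invalid input by returning an error string (the behavior the PR ultimately adopted); a design that raises `ValueError` would use `pytest.raises` instead:

```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Decimal conversion with explicit handling of negative input
    (mirrors the PR's final revision)."""
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"


def test_negative_value():
    """A negative size is invalid and should yield an error result."""
    assert convert_file_size_to_human_readable(-100) == "Negative size!"
    assert convert_file_size_to_human_readable(-0.5) == "Negative size!"
```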
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.datasets_parquet_file_path.stat().st_size
)
logger.info(f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})")
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)
```
The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.
Suggested change:

```python
if scraper.datasets_parquet_file_path.is_file():
    file_size = convert_file_size_in_bytes_to_human_readable_format(
        scraper.datasets_parquet_file_path.stat().st_size
    )
    logger.info(
        f"Saved in: {scraper.datasets_parquet_file_path} ({file_size})"
    )
else:
    logger.warning(
        f"Datasets parquet file not found at: {scraper.datasets_parquet_file_path}"
    )
    logger.info(
        f"Saved in: {scraper.datasets_parquet_file_path} (size unknown)"
    )
if scraper.files_parquet_file_path.is_file():
    file_size = convert_file_size_in_bytes_to_human_readable_format(
        scraper.files_parquet_file_path.stat().st_size
    )
else:
    logger.warning(
        f"Files parquet file not found at: {scraper.files_parquet_file_path}"
    )
    file_size = "size unknown"
```
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
file_size = convert_file_size_in_bytes_to_human_readable_format(
    scraper.files_parquet_file_path.stat().st_size
)
```
The code calls .stat().st_size on the parquet file paths without error handling. If the export_list_of_models_to_parquet function fails (which can happen based on its exception handling in src/mdverse_scrapers/models/utils.py:172-175), the parquet files may not exist, causing a FileNotFoundError when .stat() is called. Consider adding a try-except block or checking if the file exists before attempting to get its size.
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
if size_in_bytes < 1024.0:
    return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1024.0
```
This function uses binary units (1024-based) for file size conversion (1 KB = 1024 B), while the FileMetadata.file_size_with_human_readable_unit property uses decimal units (1000-based) via ByteSize.human_readable(decimal=True). This inconsistency could confuse users seeing different unit conventions in the same application. Consider using the same unit convention throughout, or clearly documenting why different conventions are used. The industry standard for file sizes in operating systems is typically binary (1024-based), while storage devices typically use decimal (1000-based).
Suggested change:

```python
if size_in_bytes < 1000.0:
    return f"{size_in_bytes:.2f} {unit}"
size_in_bytes /= 1000.0
```
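To make the inconsistency concrete, here is a minimal side-by-side sketch (both helpers are illustrative, not code from this repository):

```python
def human_readable_binary(size_in_bytes: float) -> str:
    """1 KB = 1024 B, like the function in this revision of the PR."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1024.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024.0
    return f"{size_in_bytes:.2f} PB"


def human_readable_decimal(size_in_bytes: float) -> str:
    """1 KB = 1000 B, like ByteSize.human_readable(decimal=True)."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return f"{size_in_bytes:.2f} PB"


# The same file is reported differently depending on the convention:
# 1_500_000 bytes -> "1.43 MB" (binary) vs "1.50 MB" (decimal)
```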
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1024.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1024.0
```
The function does not handle negative input values. While file sizes are typically non-negative, passing a negative value would cause the function to return from the first loop iteration with a negative formatted value like "-100.00 B". Consider adding validation to handle this edge case, either by raising a ValueError or returning a special message for invalid inputs.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
tests/core/test_toolbox.py
Outdated
```python
class TestConvertFileSizeInBytesToHumanReadableFormat:
    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""
```
The test class name TestConvertFileSizeInBytesToHumanReadableFormat doesn't match the actual function name convert_file_size_to_human_readable. The class name appears to reference a non-existent function name. For consistency and clarity, the class name should match the function being tested.
Suggested change:

```python
class TestConvertFileSizeToHumanReadable:
    """Tests for convert_file_size_to_human_readable function."""
```
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
if not scraper.files_parquet_file_path.is_file():
    logger.error("Files parquet file not found!")
    logger.error(f"{scraper.files_parquet_file_path} is missing.")
    return
```
Returning early when the files parquet file is missing prevents the remaining statistics from being printed, including elapsed time, log file path, and debug mode warning. This means if the files file is missing, the user won't see the elapsed time for the scraping operation or where the log file is saved. Consider logging the error but continuing to print the remaining statistics, or at least ensure the elapsed time and log file location are always printed before any early returns.
tests/core/test_toolbox.py
Outdated
```python
class TestConvertFileSizeInBytesToHumanReadableFormat:
    """Tests for convert_file_size_in_bytes_to_human_readable_format function."""
```
The docstring references convert_file_size_in_bytes_to_human_readable_format function, but the actual function name is convert_file_size_to_human_readable. The docstring should accurately reflect the function being tested.
Suggested change:

```python
"""Tests for convert_file_size_to_human_readable function."""
```
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
```python
def test_kilobytes(self):
    """Test conversion for values in kilobytes range."""
    assert convert_file_size_to_human_readable(1_024) == "1.02 KB"
    assert convert_file_size_to_human_readable(1_234) == "1.23 KB"
    assert convert_file_size_to_human_readable(123_456) == "123.46 KB"
```
The test values use binary-based byte counts (e.g., 1_024, 1_048_576) but the implementation uses decimal (1000-based) conversion. While the expected results are correctly calculated, this creates confusion. For example, 1_024 bytes becomes "1.02 KB" rather than "1.00 KB" or "1.00 KiB". Consider either:
- Using decimal-based test values (1000, 1_000_000) for consistency with the decimal implementation
- Or renaming the tests to clarify they're testing decimal conversion of binary-sized inputs
The current approach works but may confuse readers about whether the function uses decimal (SI) or binary (IEC) units.
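A decimal-based version of the test could look like this sketch (the test name `test_kilobytes_decimal` is illustrative; the implementation mirrors the PR's decimal conversion):

```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Decimal (SI) conversion: 1 KB = 1000 B, as in the PR."""
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"


def test_kilobytes_decimal():
    """Use decimal byte counts so expected values read naturally."""
    assert convert_file_size_to_human_readable(1_000) == "1.00 KB"
    assert convert_file_size_to_human_readable(1_234) == "1.23 KB"
    assert convert_file_size_to_human_readable(123_456) == "123.46 KB"
```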
src/mdverse_scrapers/core/toolbox.py
Outdated
```python
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0
```
The function rebinds the parameter size_in_bytes during iteration by dividing it. This works correctly (floats are immutable, so the caller's value is unaffected), but it makes the code less clear. Consider using a separate variable for the conversion loop to preserve the semantic meaning of the parameter name.
Suggested change:

```python
size = size_in_bytes
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size < 1000.0:
        return f"{size:.2f} {unit}"
    size /= 1000.0
```
```python
    return "Negative size!"
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0
return "File too big!"
```
The error messages "Negative size!" and "File too big!" are informal and not very descriptive. Consider using more informative messages such as "Invalid file size: negative value" and "File size exceeds maximum displayable range (>= 1000 TB)" to better communicate the issue to users.
Suggested change:

```python
    return "Invalid file size: negative value"
for unit in ["B", "KB", "MB", "GB", "TB"]:
    if size_in_bytes < 1000.0:
        return f"{size_in_bytes:.2f} {unit}"
    size_in_bytes /= 1000.0
return "File size exceeds maximum displayable range (>= 1000 TB)"
```
```python
Returns
-------
str
    File size in a human-readable format (e.g., '10.52 MB').
```
The docstring should clarify that this function uses decimal (SI) units (1 KB = 1000 bytes) rather than binary (IEC) units (1 KiB = 1024 bytes). This is important because file sizes are often displayed using binary units in operating systems and tools. Consider adding a note in the docstring explaining the unit system, such as: "Uses decimal (SI) units where 1 KB = 1000 bytes, 1 MB = 1000 KB, etc."
Suggested change:

```python
    File size in a human-readable format (e.g., '10.52 MB').

Notes
-----
Uses decimal (SI) units where 1 KB = 1000 bytes, 1 MB = 1000 KB, etc.,
rather than binary (IEC) units (1 KiB = 1024 bytes).
```
```python
def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Convert file size in bytes to a human-readable format.

    Parameters
    ----------
    size_in_bytes : float
        File size in bytes.

    Returns
    -------
    str
        File size in a human-readable format (e.g., '10.52 MB').
    """
    if size_in_bytes < 0:
        return "Negative size!"
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_in_bytes < 1000.0:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1000.0
    return "File too big!"
```
Consider using pydantic's ByteSize for consistency with the existing file_size_with_human_readable_unit property in FileMetadata (models/file.py:81-94). You could convert the raw byte count to a ByteSize object and use its human_readable method. For example: ByteSize(size_in_bytes).human_readable(decimal=True, separator=" "). This would ensure consistent file size formatting across the codebase and leverage the well-tested pydantic implementation.
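A sketch of that approach, assuming pydantic v2 is installed (the `separator=" "` argument mentioned above requires a recent pydantic release, so it is omitted here):

```python
from pydantic import ByteSize


def convert_file_size_to_human_readable(size_in_bytes: float) -> str:
    """Format a byte count via pydantic's ByteSize, using decimal (SI)
    units for consistency with FileMetadata.file_size_with_human_readable_unit."""
    return ByteSize(int(size_in_bytes)).human_readable(decimal=True)
```

Note that `human_readable` formats with one decimal place and, by default, no space before the unit, so the output style differs slightly from the PR's '10.52 MB' format.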
No description provided.