-
Notifications
You must be signed in to change notification settings - Fork 12
Add on-the-fly compression conversion during download (Issue #18) #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add on-the-fly compression conversion during download (Issue #18) #45
Conversation
📝 WalkthroughWalkthroughImplements on-the-fly compression format conversion during downloads. Adds Changes
Sequence DiagramsequenceDiagram
participant User as User / CLI
participant CLI as cli.download()
participant API as api.download()
participant DL as _download_file()
participant Conv as Conversion Logic
participant FS as File System
User->>CLI: invoke --convert-to gz --convert-from bz2
CLI->>API: download(..., convert_to='gz', convert_from='bz2')
API->>DL: _download_file(..., convert_to, convert_from)
DL->>FS: fetch original file
FS-->>DL: file.bz2
DL->>Conv: detect_format('file.bz2')
Conv-->>DL: 'bz2'
DL->>Conv: should_convert('bz2', 'gz', convert_from='bz2')
Conv-->>DL: true
DL->>Conv: convert_compression_format('file.bz2', 'bz2'→'gz')
Conv->>Conv: decompress bz2
Conv->>Conv: compress to gz
Conv-->>DL: success
DL->>FS: remove file.bz2
DL->>FS: write file.gz
DL-->>API: conversion complete
API-->>CLI: success
CLI-->>User: download finished
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_compression_conversion.py (1)
66-135: Comprehensive happy-path coverage.The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.
Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.
💡 Optional: Add error condition tests
Consider adding tests like:
def test_convert_with_corrupted_file(): """Test conversion fails gracefully with corrupted input""" with tempfile.TemporaryDirectory() as tmpdir: # Create invalid compressed file bad_file = os.path.join(tmpdir, "bad.txt.bz2") with open(bad_file, 'wb') as f: f.write(b"not actually compressed data") target_file = os.path.join(tmpdir, "bad.txt.gz") # Should raise RuntimeError and not create partial target with pytest.raises(RuntimeError): _convert_compression_format(bad_file, target_file, "bz2", "gz") assert os.path.exists(bad_file) # Original preserved on failure assert not os.path.exists(target_file) # No partial outputThis would validate the error handling mentioned in the conversion implementation.
databusclient/api/download.py (1)
134-138: Consider preserving original exception details.Wrapping the exception in
RuntimeErroron Line 138 may obscure the root cause. Consider chaining the exception or providing more context.🔎 Suggested improvement
except Exception as e: # If conversion fails, ensure the partial target file is removed if os.path.exists(target_file): os.remove(target_file) - raise RuntimeError(f"Compression conversion failed: {e}") + raise RuntimeError(f"Compression conversion failed: {e}") from eThis preserves the full exception chain for debugging.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
.gitignoreREADME.mddatabusclient/api/download.pydatabusclient/cli.pytests/test_compression_conversion.pytests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
download(700-836)_detect_compression_format(31-45)_should_convert_file(48-79)_get_converted_filename(82-99)_convert_compression_format(102-138)
🔇 Additional comments (18)
.gitignore (1)
3-4: LGTM!The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.
tests/test_download.py (2)
3-3: LGTM!The pytest import is correctly added to support the skip decorator.
30-30: LGTM!Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.
tests/test_compression_conversion.py (4)
17-24: LGTM!The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.
26-57: LGTM!The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.
59-64: LGTM!The filename conversion tests validate all format pair transformations concisely.
137-138: LGTM!Direct pytest execution support is convenient for development and debugging.
README.md (3)
169-173: LGTM!The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.
185-211: LGTM!The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.
264-274: LGTM!The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.
databusclient/cli.py (2)
161-170: LGTM!The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.
180-197: LGTM!The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the
api_downloadcall.databusclient/api/download.py (6)
3-28: LGTM! Clean compression module setup.The imports and constant mappings are well-structured and use standard library modules for compression handling.
31-45: LGTM! Robust extension detection.The case-insensitive matching ensures reliable format detection across different filename conventions.
48-79: LGTM! Conversion decision logic is sound.The function correctly handles all conversion scenarios including filtering and format matching.
141-263: LGTM! Clean integration of conversion into download flow.The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.
265-299: LGTM! Consistent parameter propagation across all download functions.All internal download functions properly accept and forward
convert_toandconvert_fromparameters through the call chain, with clear documentation.Also applies to: 437-476, 478-512, 515-557, 624-663
700-836: The CLI already validates compression format parameters using Click's built-inChoicevalidator, which automatically rejects invalid values before they reach the API function. No additional validation is needed.Likely an incorrect or invalid review comment.
| def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str: | ||
| """ | ||
| Generate the new filename after compression format conversion. | ||
| Parameters: | ||
| - filename: Original filename | ||
| - source_format: Source compression format ('bz2', 'gz', 'xz') | ||
| - target_format: Target compression format ('bz2', 'gz', 'xz') | ||
| Returns: | ||
| - New filename with updated extension | ||
| """ | ||
| source_ext = COMPRESSION_EXTENSIONS[source_format] | ||
| target_ext = COMPRESSION_EXTENSIONS[target_format] | ||
|
|
||
| if filename.endswith(source_ext): | ||
| return filename[:-len(source_ext)] + target_ext | ||
| return filename + target_ext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fallback logic may create malformed filenames.
The fallback on Line 98 appends the target extension without removing the source extension, potentially creating filenames like file.bz2.gz if the source extension check fails. Since _detect_compression_format uses case-insensitive matching but filename.endswith(source_ext) on Line 97 is case-sensitive, this mismatch could trigger the fallback.
🔎 Proposed fix using case-insensitive matching
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
"""
Generate the new filename after compression format conversion.
Parameters:
- filename: Original filename
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
Returns:
- New filename with updated extension
"""
source_ext = COMPRESSION_EXTENSIONS[source_format]
target_ext = COMPRESSION_EXTENSIONS[target_format]
- if filename.endswith(source_ext):
+ if filename.lower().endswith(source_ext):
return filename[:-len(source_ext)] + target_ext
return filename + target_ext🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 82 to 99, the fallback appends the
target extension without reliably removing the source extension because
endswith() is case-sensitive while detection is case-insensitive; update the
logic to perform a case-insensitive check (e.g., compare
filename.lower().endswith(source_ext.lower())) and if it matches remove the
trailing source extension by length before appending the target extension; if no
match, simply append the target extension as before, ensuring you strip only the
exact suffix length rather than any substring to avoid creating names like
file.bz2.gz.
| def _convert_compression_format( | ||
| source_file: str, target_file: str, source_format: str, target_format: str | ||
| ) -> None: | ||
| """ | ||
| Convert a compressed file from one format to another. | ||
| Parameters: | ||
| - source_file: Path to source compressed file | ||
| - target_file: Path to target compressed file | ||
| - source_format: Source compression format ('bz2', 'gz', 'xz') | ||
| - target_format: Target compression format ('bz2', 'gz', 'xz') | ||
| """ | ||
| source_module = COMPRESSION_MODULES[source_format] | ||
| target_module = COMPRESSION_MODULES[target_format] | ||
|
|
||
| print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}") | ||
|
|
||
| # Decompress and recompress with progress indication | ||
| chunk_size = 8192 | ||
|
|
||
| try: | ||
| with source_module.open(source_file, 'rb') as sf: | ||
| with target_module.open(target_file, 'wb') as tf: | ||
| while True: | ||
| chunk = sf.read(chunk_size) | ||
| if not chunk: | ||
| break | ||
| tf.write(chunk) | ||
|
|
||
| # Remove the original file after successful conversion | ||
| os.remove(source_file) | ||
| print(f"Conversion complete: {os.path.basename(target_file)}") | ||
| except Exception as e: | ||
| # If conversion fails, ensure the partial target file is removed | ||
| if os.path.exists(target_file): | ||
| os.remove(target_file) | ||
| raise RuntimeError(f"Compression conversion failed: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add input validation for compression formats.
Lines 114-115 access COMPRESSION_MODULES using source_format and target_format without validation. If invalid format strings are passed (e.g., from external input or typos), this will raise KeyError.
🔎 Proposed fix with validation
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
+ # Validate compression formats
+ if source_format not in COMPRESSION_MODULES:
+ raise ValueError(f"Unsupported source compression format: {source_format}")
+ if target_format not in COMPRESSION_MODULES:
+ raise ValueError(f"Unsupported target compression format: {target_format}")
+
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _convert_compression_format( | |
| source_file: str, target_file: str, source_format: str, target_format: str | |
| ) -> None: | |
| """ | |
| Convert a compressed file from one format to another. | |
| Parameters: | |
| - source_file: Path to source compressed file | |
| - target_file: Path to target compressed file | |
| - source_format: Source compression format ('bz2', 'gz', 'xz') | |
| - target_format: Target compression format ('bz2', 'gz', 'xz') | |
| """ | |
| source_module = COMPRESSION_MODULES[source_format] | |
| target_module = COMPRESSION_MODULES[target_format] | |
| print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}") | |
| # Decompress and recompress with progress indication | |
| chunk_size = 8192 | |
| try: | |
| with source_module.open(source_file, 'rb') as sf: | |
| with target_module.open(target_file, 'wb') as tf: | |
| while True: | |
| chunk = sf.read(chunk_size) | |
| if not chunk: | |
| break | |
| tf.write(chunk) | |
| # Remove the original file after successful conversion | |
| os.remove(source_file) | |
| print(f"Conversion complete: {os.path.basename(target_file)}") | |
| except Exception as e: | |
| # If conversion fails, ensure the partial target file is removed | |
| if os.path.exists(target_file): | |
| os.remove(target_file) | |
| raise RuntimeError(f"Compression conversion failed: {e}") | |
| def _convert_compression_format( | |
| source_file: str, target_file: str, source_format: str, target_format: str | |
| ) -> None: | |
| """ | |
| Convert a compressed file from one format to another. | |
| Parameters: | |
| - source_file: Path to source compressed file | |
| - target_file: Path to target compressed file | |
| - source_format: Source compression format ('bz2', 'gz', 'xz') | |
| - target_format: Target compression format ('bz2', 'gz', 'xz') | |
| """ | |
| # Validate compression formats | |
| if source_format not in COMPRESSION_MODULES: | |
| raise ValueError(f"Unsupported source compression format: {source_format}") | |
| if target_format not in COMPRESSION_MODULES: | |
| raise ValueError(f"Unsupported target compression format: {target_format}") | |
| source_module = COMPRESSION_MODULES[source_format] | |
| target_module = COMPRESSION_MODULES[target_format] | |
| print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}") | |
| # Decompress and recompress with progress indication | |
| chunk_size = 8192 | |
| try: | |
| with source_module.open(source_file, 'rb') as sf: | |
| with target_module.open(target_file, 'wb') as tf: | |
| while True: | |
| chunk = sf.read(chunk_size) | |
| if not chunk: | |
| break | |
| tf.write(chunk) | |
| # Remove the original file after successful conversion | |
| os.remove(source_file) | |
| print(f"Conversion complete: {os.path.basename(target_file)}") | |
| except Exception as e: | |
| # If conversion fails, ensure the partial target file is removed | |
| if os.path.exists(target_file): | |
| os.remove(target_file) | |
| raise RuntimeError(f"Compression conversion failed: {e}") |
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 102 to 138, the function
_convert_compression_format accesses COMPRESSION_MODULES[source_format] and
COMPRESSION_MODULES[target_format] without validating the inputs; add explicit
validation at the top of the function to check that source_format and
target_format are strings and keys present in COMPRESSION_MODULES, and if not
raise a clear ValueError listing allowed formats; after validation, safely
retrieve the modules (or use dict access knowing keys exist) and proceed as
before; ensure error messages are descriptive (e.g., "Unsupported compression
format: <format>. Supported: <list>") so callers can correct invalid inputs.
| source_module = COMPRESSION_MODULES[source_format] | ||
| target_module = COMPRESSION_MODULES[target_format] | ||
|
|
||
| print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove misleading comment about progress indication.
The comment claims "with progress indication" but no progress bar is implemented for the conversion process. Unlike the download function which uses tqdm, this conversion runs silently except for the start/end print statements.
🔎 Proposed fix
- # Decompress and recompress with progress indication
+ # Decompress and recompress in chunks
chunk_size = 8192Alternatively, consider adding actual progress indication for large file conversions to match the download UX.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In databusclient/api/download.py around line 117, the printed message claims
"with progress indication" while no progress bar is shown; either remove the
misleading reference from the print/comment or implement a real progress
indicator (e.g., wrap conversion steps or byte streaming with tqdm) so the UX
matches the download function—update the print/message accordingly and, if
adding tqdm, ensure tqdm is imported and used only for large files to avoid
unnecessary output.
Pull Request
Description
This PR implements on-the-fly compression format conversion during download, allowing users to convert downloaded files between
bz2,gz, andxzformats automatically during the download process. This feature makes it easier to unify datasets with consistent compression formats, save disk space, or integrate data into pipelines that expect specific formats.Key Features:
--convert-toCLI option to specify target compression format (bz2, gz, xz)--convert-fromCLI option to filter which source formats to convertRelated Issues
Resolves #18
Type of change
Checklist:
poetry run pytest- all tests passedpoetry run ruff check- no linting errorsSummary by CodeRabbit
New Features
--convert-toand--convert-fromCLI options (supports bz2, gz, xz formats). Files are automatically decompressed and recompressed to your target format during download.Documentation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.