Skip to content

Conversation

@DhanashreePetare
Copy link
Contributor

@DhanashreePetare DhanashreePetare commented Jan 2, 2026

Pull Request

Description

This PR implements on-the-fly compression format conversion during download, allowing users to convert downloaded files between bz2, gz, and xz formats automatically during the download process. This feature makes it easier to unify datasets with consistent compression formats, save disk space, or integrate data into pipelines that expect specific formats.

Key Features:

  • New --convert-to CLI option to specify target compression format (bz2, gz, xz)
  • Optional --convert-from CLI option to filter which source formats to convert
  • Automatic compression format detection based on file extensions
  • Smart conversion logic that skips when source equals target format
  • Decompress → Recompress pipeline using Python's built-in compression modules
  • Comprehensive test coverage for all conversion scenarios

Related Issues
Resolves #18

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • This change requires a documentation update
  • Housekeeping

Checklist:

  • My code follows the ruff code style of this project.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
    • poetry run pytest - all tests passed
    • poetry run ruff check - no linting errors

Summary by CodeRabbit

  • New Features

    • Added on-the-fly compression format conversion for downloads using --convert-to and --convert-from CLI options (supports bz2, gz, xz formats). Files are automatically decompressed and recompressed to your target format during download.
  • Documentation

    • Updated download command documentation with compression conversion examples and usage instructions.
  • Tests

    • Added compression conversion tests.
  • Chores

    • Updated project ignore rules.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 2, 2026

📝 Walkthrough

Walkthrough

Implements on-the-fly compression format conversion during downloads. Adds --convert-to and --convert-from CLI options (supporting bz2, gz, xz), includes detection and conversion utilities, propagates parameters through the download API call chain, and provides comprehensive tests and documentation updates.

Changes

Cohort / File(s) Summary
Configuration & Documentation
.gitignore, README.md
Ignores test artifacts (test-download/, vault-token.dat); documents new --convert-to and --convert-from options with usage examples demonstrating format conversions (bz2→gz, collection→bz2, etc.).
Core Download API
databusclient/api/download.py
Adds compression format mappings, detection logic, and conversion utilities (_detect_compression_format, _should_convert_file, _get_converted_filename, _convert_compression_format). Extends all download functions (_download_file, _download_files, _download_collection, _download_version, _download_artifact, _download_group, download) with convert_to and convert_from parameters; implements post-download conversion flow with error cleanup.
CLI Layer
databusclient/cli.py
Adds two new click options (--convert-to, --convert-from) to download command with choices (bz2, gz, xz); propagates parameters to api_download call.
Test Coverage
tests/test_compression_conversion.py, tests/test_download.py
New module tests format detection, conversion decisions, filename generation, and actual compression workflows (bz2↔gz↔xz) with data integrity validation. Marks long-running collection test with skip marker.

Sequence Diagram

sequenceDiagram
    participant User as User / CLI
    participant CLI as cli.download()
    participant API as api.download()
    participant DL as _download_file()
    participant Conv as Conversion Logic
    participant FS as File System

    User->>CLI: invoke --convert-to gz --convert-from bz2
    CLI->>API: download(..., convert_to='gz', convert_from='bz2')
    API->>DL: _download_file(..., convert_to, convert_from)
    DL->>FS: fetch original file
    FS-->>DL: file.bz2
    DL->>Conv: detect_format('file.bz2')
    Conv-->>DL: 'bz2'
    DL->>Conv: should_convert('bz2', 'gz', convert_from='bz2')
    Conv-->>DL: true
    DL->>Conv: convert_compression_format('file.bz2', 'bz2'→'gz')
    Conv->>Conv: decompress bz2
    Conv->>Conv: compress to gz
    Conv-->>DL: success
    DL->>FS: remove file.bz2
    DL->>FS: write file.gz
    DL-->>API: conversion complete
    API-->>CLI: success
    CLI-->>User: download finished
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main feature being added: on-the-fly compression conversion during download, with the issue reference.
Description check ✅ Passed The PR description is comprehensive and follows the template with all key sections filled out including description, type of change, and checklist items marked appropriately.
Linked Issues check ✅ Passed The PR fully implements the requirements from issue #18: adds --convert-to and --convert-from CLI options, supports bz2/gz/xz formats, detects compression formats, skips redundant conversions, and includes comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing compression conversion: CLI options, API signatures, conversion utilities, tests, documentation, and .gitignore updates for test artifacts—no unrelated changes detected.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/test_compression_conversion.py (1)

66-135: Comprehensive happy-path coverage.

The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.

Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.

💡 Optional: Add error condition tests

Consider adding tests like:

def test_convert_with_corrupted_file():
    """Test conversion fails gracefully with corrupted input"""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create invalid compressed file
        bad_file = os.path.join(tmpdir, "bad.txt.bz2")
        with open(bad_file, 'wb') as f:
            f.write(b"not actually compressed data")
        
        target_file = os.path.join(tmpdir, "bad.txt.gz")
        
        # Should raise RuntimeError and not create partial target
        with pytest.raises(RuntimeError):
            _convert_compression_format(bad_file, target_file, "bz2", "gz")
        
        assert os.path.exists(bad_file)  # Original preserved on failure
        assert not os.path.exists(target_file)  # No partial output

This would validate the error handling mentioned in the conversion implementation.

databusclient/api/download.py (1)

134-138: Consider preserving original exception details.

Wrapping the exception in RuntimeError on Line 138 may obscure the root cause. Consider chaining the exception or providing more context.

🔎 Suggested improvement
     except Exception as e:
         # If conversion fails, ensure the partial target file is removed
         if os.path.exists(target_file):
             os.remove(target_file)
-        raise RuntimeError(f"Compression conversion failed: {e}")
+        raise RuntimeError(f"Compression conversion failed: {e}") from e

This preserves the full exception chain for debugging.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 46df914.

📒 Files selected for processing (6)
  • .gitignore
  • README.md
  • databusclient/api/download.py
  • databusclient/cli.py
  • tests/test_compression_conversion.py
  • tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
  • download (700-836)
  • _detect_compression_format (31-45)
  • _should_convert_file (48-79)
  • _get_converted_filename (82-99)
  • _convert_compression_format (102-138)
🔇 Additional comments (18)
.gitignore (1)

3-4: LGTM!

The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.

tests/test_download.py (2)

3-3: LGTM!

The pytest import is correctly added to support the skip decorator.


30-30: LGTM!

Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.

tests/test_compression_conversion.py (4)

17-24: LGTM!

The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.


26-57: LGTM!

The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.


59-64: LGTM!

The filename conversion tests validate all format pair transformations concisely.


137-138: LGTM!

Direct pytest execution support is convenient for development and debugging.

README.md (3)

169-173: LGTM!

The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.


185-211: LGTM!

The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.


264-274: LGTM!

The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.

databusclient/cli.py (2)

161-170: LGTM!

The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.


180-197: LGTM!

The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the api_download call.

databusclient/api/download.py (6)

3-28: LGTM! Clean compression module setup.

The imports and constant mappings are well-structured and use standard library modules for compression handling.


31-45: LGTM! Robust extension detection.

The case-insensitive matching ensures reliable format detection across different filename conventions.


48-79: LGTM! Conversion decision logic is sound.

The function correctly handles all conversion scenarios including filtering and format matching.


141-263: LGTM! Clean integration of conversion into download flow.

The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.


265-299: LGTM! Consistent parameter propagation across all download functions.

All internal download functions properly accept and forward convert_to and convert_from parameters through the call chain, with clear documentation.

Also applies to: 437-476, 478-512, 515-557, 624-663


700-836: The CLI already validates compression format parameters using Click's built-in Choice validator, which automatically rejects invalid values before they reach the API function. No additional validation is needed.

Likely an incorrect or invalid review comment.

Comment on lines +82 to +99
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
"""
Generate the new filename after compression format conversion.
Parameters:
- filename: Original filename
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
Returns:
- New filename with updated extension
"""
source_ext = COMPRESSION_EXTENSIONS[source_format]
target_ext = COMPRESSION_EXTENSIONS[target_format]

if filename.endswith(source_ext):
return filename[:-len(source_ext)] + target_ext
return filename + target_ext
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fallback logic may create malformed filenames.

The fallback on Line 98 appends the target extension without removing the source extension, potentially creating filenames like file.bz2.gz if the source extension check fails. Since _detect_compression_format uses case-insensitive matching but filename.endswith(source_ext) on Line 97 is case-sensitive, this mismatch could trigger the fallback.

🔎 Proposed fix using case-insensitive matching
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     """
     Generate the new filename after compression format conversion.
     
     Parameters:
     - filename: Original filename
     - source_format: Source compression format ('bz2', 'gz', 'xz')
     - target_format: Target compression format ('bz2', 'gz', 'xz')
     
     Returns:
     - New filename with updated extension
     """
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
     
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 82 to 99, the fallback appends the
target extension without reliably removing the source extension because
endswith() is case-sensitive while detection is case-insensitive; update the
logic to perform a case-insensitive check (e.g., compare
filename.lower().endswith(source_ext.lower())) and if it matches remove the
trailing source extension by length before appending the target extension; if no
match, simply append the target extension as before, ensuring you strip only the
exact suffix length rather than any substring to avoid creating names like
file.bz2.gz.

Comment on lines +102 to +138
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]

print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")

# Decompress and recompress with progress indication
chunk_size = 8192

try:
with source_module.open(source_file, 'rb') as sf:
with target_module.open(target_file, 'wb') as tf:
while True:
chunk = sf.read(chunk_size)
if not chunk:
break
tf.write(chunk)

# Remove the original file after successful conversion
os.remove(source_file)
print(f"Conversion complete: {os.path.basename(target_file)}")
except Exception as e:
# If conversion fails, ensure the partial target file is removed
if os.path.exists(target_file):
os.remove(target_file)
raise RuntimeError(f"Compression conversion failed: {e}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add input validation for compression formats.

Lines 114-115 access COMPRESSION_MODULES using source_format and target_format without validation. If invalid format strings are passed (e.g., from external input or typos), this will raise KeyError.

🔎 Proposed fix with validation
 def _convert_compression_format(
     source_file: str, target_file: str, source_format: str, target_format: str
 ) -> None:
     """
     Convert a compressed file from one format to another.
     
     Parameters:
     - source_file: Path to source compressed file
     - target_file: Path to target compressed file
     - source_format: Source compression format ('bz2', 'gz', 'xz')
     - target_format: Target compression format ('bz2', 'gz', 'xz')
     """
+    # Validate compression formats
+    if source_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported source compression format: {source_format}")
+    if target_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported target compression format: {target_format}")
+    
     source_module = COMPRESSION_MODULES[source_format]
     target_module = COMPRESSION_MODULES[target_format]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]
print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")
# Decompress and recompress with progress indication
chunk_size = 8192
try:
with source_module.open(source_file, 'rb') as sf:
with target_module.open(target_file, 'wb') as tf:
while True:
chunk = sf.read(chunk_size)
if not chunk:
break
tf.write(chunk)
# Remove the original file after successful conversion
os.remove(source_file)
print(f"Conversion complete: {os.path.basename(target_file)}")
except Exception as e:
# If conversion fails, ensure the partial target file is removed
if os.path.exists(target_file):
os.remove(target_file)
raise RuntimeError(f"Compression conversion failed: {e}")
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
# Validate compression formats
if source_format not in COMPRESSION_MODULES:
raise ValueError(f"Unsupported source compression format: {source_format}")
if target_format not in COMPRESSION_MODULES:
raise ValueError(f"Unsupported target compression format: {target_format}")
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]
print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")
# Decompress and recompress with progress indication
chunk_size = 8192
try:
with source_module.open(source_file, 'rb') as sf:
with target_module.open(target_file, 'wb') as tf:
while True:
chunk = sf.read(chunk_size)
if not chunk:
break
tf.write(chunk)
# Remove the original file after successful conversion
os.remove(source_file)
print(f"Conversion complete: {os.path.basename(target_file)}")
except Exception as e:
# If conversion fails, ensure the partial target file is removed
if os.path.exists(target_file):
os.remove(target_file)
raise RuntimeError(f"Compression conversion failed: {e}")
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 102 to 138, the function
_convert_compression_format accesses COMPRESSION_MODULES[source_format] and
COMPRESSION_MODULES[target_format] without validating the inputs; add explicit
validation at the top of the function to check that source_format and
target_format are strings and keys present in COMPRESSION_MODULES, and if not
raise a clear ValueError listing allowed formats; after validation, safely
retrieve the modules (or use dict access knowing keys exist) and proceed as
before; ensure error messages are descriptive (e.g., "Unsupported compression
format: <format>. Supported: <list>") so callers can correct invalid inputs.

source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]

print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove misleading comment about progress indication.

The comment claims "with progress indication" but no progress bar is implemented for the conversion process. Unlike the download function which uses tqdm, this conversion runs silently except for the start/end print statements.

🔎 Proposed fix
-    # Decompress and recompress with progress indication
+    # Decompress and recompress in chunks
     chunk_size = 8192

Alternatively, consider adding actual progress indication for large file conversions to match the download UX.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In databusclient/api/download.py around line 117, the printed message claims
"with progress indication" while no progress bar is shown; either remove the
misleading reference from the print/comment or implement a real progress
indicator (e.g., wrap conversion steps or byte streaming with tqdm) so the UX
matches the download function—update the print/message accordingly and, if
adding tqdm, ensure tqdm is imported and used only for large files to avoid
unnecessary output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

On-the-fly Conversion During Download

1 participant