Add on-the-fly compression conversion during download (Issue #18) #45

DhanashreePetare · 2026-01-02T15:07:32Z

Pull Request

Description

This PR adds compression format conversion to the download command. Files can be converted between bz2, gz, and xz as part of the download: each file is decompressed and recompressed into the target format once it has been fetched. This makes it easier to unify datasets under one compression format, save disk space, or feed data into pipelines that expect a specific format.

Key Features:

- New --convert-to CLI option to specify target compression format (bz2, gz, xz)
- Optional --convert-from CLI option to filter which source formats to convert
- Automatic compression format detection based on file extensions
- Smart conversion logic that skips when source equals target format
- Decompress → Recompress pipeline using Python's built-in compression modules
- Comprehensive test coverage for all conversion scenarios

Related Issues
Resolves #18

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
This change requires a documentation update
Housekeeping

Checklist:

My code follows the ruff code style of this project.
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (if applicable)
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
- poetry run pytest - all tests passed
- poetry run ruff check - no linting errors

Summary by CodeRabbit

New Features
- Added on-the-fly compression format conversion for downloads using --convert-to and --convert-from CLI options (supports bz2, gz, xz formats). Files are automatically decompressed and recompressed to your target format during download.
Documentation
- Updated download command documentation with compression conversion examples and usage instructions.
Tests
- Added compression conversion tests.
Chores
- Updated project ignore rules.


coderabbitai · 2026-01-02T15:07:43Z

📝 Walkthrough

Implements on-the-fly compression format conversion during downloads. Adds --convert-to and --convert-from CLI options (supporting bz2, gz, xz), includes detection and conversion utilities, propagates parameters through the download API call chain, and provides comprehensive tests and documentation updates.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration & Documentation: `.gitignore`, `README.md` | Ignores test artifacts (`test-download/`, `vault-token.dat`); documents new `--convert-to` and `--convert-from` options with usage examples demonstrating format conversions (bz2→gz, collection→bz2, etc.). |
| Core Download API: `databusclient/api/download.py` | Adds compression format mappings, detection logic, and conversion utilities (`_detect_compression_format`, `_should_convert_file`, `_get_converted_filename`, `_convert_compression_format`). Extends all download functions (`_download_file`, `_download_files`, `_download_collection`, `_download_version`, `_download_artifact`, `_download_group`, `download`) with `convert_to` and `convert_from` parameters; implements post-download conversion flow with error cleanup. |
| CLI Layer: `databusclient/cli.py` | Adds two new click options (`--convert-to`, `--convert-from`) to download command with choices (bz2, gz, xz); propagates parameters to `api_download` call. |
| Test Coverage: `tests/test_compression_conversion.py`, `tests/test_download.py` | New module tests format detection, conversion decisions, filename generation, and actual compression workflows (bz2↔gz↔xz) with data integrity validation. Marks long-running collection test with skip marker. |

Sequence Diagram

sequenceDiagram
    participant User as User / CLI
    participant CLI as cli.download()
    participant API as api.download()
    participant DL as _download_file()
    participant Conv as Conversion Logic
    participant FS as File System

    User->>CLI: invoke --convert-to gz --convert-from bz2
    CLI->>API: download(..., convert_to='gz', convert_from='bz2')
    API->>DL: _download_file(..., convert_to, convert_from)
    DL->>FS: fetch original file
    FS-->>DL: file.bz2
    DL->>Conv: detect_format('file.bz2')
    Conv-->>DL: 'bz2'
    DL->>Conv: should_convert('bz2', 'gz', convert_from='bz2')
    Conv-->>DL: true
    DL->>Conv: convert_compression_format('file.bz2', 'bz2'→'gz')
    Conv->>Conv: decompress bz2
    Conv->>Conv: compress to gz
    Conv-->>DL: success
    DL->>FS: remove file.bz2
    DL->>FS: write file.gz
    DL-->>API: conversion complete
    API-->>CLI: success
    CLI-->>User: download finished

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Download capabilities and docker image #12 — Modifies databusclient/cli.py's download command signature and options; potential overlap with new CLI parameter additions.
Restrict Vault token exchange to specific hosts; improve auth errors; (Issue #19) #40 — Alters download flow in databusclient/api/download.py and CLI wiring; overlaps with download function signature extensions and parameter propagation.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main feature being added: on-the-fly compression conversion during download, with the issue reference. |
| Description check | ✅ Passed | The PR description is comprehensive and follows the template with all key sections filled out including description, type of change, and checklist items marked appropriately. |
| Linked Issues check | ✅ Passed | The PR fully implements the requirements from issue #18: adds --convert-to and --convert-from CLI options, supports bz2/gz/xz formats, detects compression formats, skips redundant conversions, and includes comprehensive test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing compression conversion: CLI options, API signatures, conversion utilities, tests, documentation, and .gitignore updates for test artifacts; no unrelated changes detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/test_compression_conversion.py (1)
66-135: Comprehensive happy-path coverage.

The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.

Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.
💡 Optional: Add error condition tests

Consider adding tests like:
import os
import tempfile

import pytest

from databusclient.api.download import _convert_compression_format


def test_convert_with_corrupted_file():
    """Test conversion fails gracefully with corrupted input"""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create invalid compressed file
        bad_file = os.path.join(tmpdir, "bad.txt.bz2")
        with open(bad_file, 'wb') as f:
            f.write(b"not actually compressed data")

        target_file = os.path.join(tmpdir, "bad.txt.gz")

        # Should raise RuntimeError and not create partial target
        with pytest.raises(RuntimeError):
            _convert_compression_format(bad_file, target_file, "bz2", "gz")

        assert os.path.exists(bad_file)  # Original preserved on failure
        assert not os.path.exists(target_file)  # No partial output
This would validate the error handling mentioned in the conversion implementation.
databusclient/api/download.py (1)
134-138: Consider preserving original exception details.

Wrapping the exception in RuntimeError on Line 138 may obscure the root cause. Consider chaining the exception or providing more context.
🔎 Suggested improvement
     except Exception as e:
         # If conversion fails, ensure the partial target file is removed
         if os.path.exists(target_file):
             os.remove(target_file)
-        raise RuntimeError(f"Compression conversion failed: {e}")
+        raise RuntimeError(f"Compression conversion failed: {e}") from e
This preserves the full exception chain for debugging.
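
To illustrate what the `from e` form buys, here is a minimal standalone sketch. The helper name `decompress_or_fail` is hypothetical and not part of the PR; it only mirrors the wrapping pattern from the suggestion above so the chained cause can be inspected.

```python
import bz2


def decompress_or_fail(payload: bytes) -> bytes:
    """Hypothetical helper: wrap any decompression error in RuntimeError."""
    try:
        return bz2.decompress(payload)
    except Exception as e:
        # "from e" stores the original exception on __cause__,
        # so the traceback shows both errors.
        raise RuntimeError(f"Compression conversion failed: {e}") from e


try:
    decompress_or_fail(b"not actually compressed data")
except RuntimeError as err:
    root_cause = err.__cause__  # the original decompression error
```

With explicit chaining, a caller or a log handler can inspect `err.__cause__` instead of parsing the message string.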

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 46df914.

📒 Files selected for processing (6)

.gitignore
README.md
databusclient/api/download.py
databusclient/cli.py
tests/test_compression_conversion.py
tests/test_download.py

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_compression_conversion.py (1)

databusclient/api/download.py (5)

download (700-836)

_detect_compression_format (31-45)

_should_convert_file (48-79)

_get_converted_filename (82-99)

_convert_compression_format (102-138)

🔇 Additional comments (18)

.gitignore (1)

3-4: LGTM!

The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.

tests/test_download.py (2)

3-3: LGTM!

The pytest import is correctly added to support the skip decorator.

30-30: LGTM!

Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.

tests/test_compression_conversion.py (4)

17-24: LGTM!

The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.

26-57: LGTM!

The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.
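
The scenarios listed above can be sketched as a single decision function. This is an assumption-laden illustration: the name `should_convert` and its signature are hypothetical, and the real `_should_convert_file` in the PR is not shown in this thread and may differ.

```python
from typing import Optional


def should_convert(
    source_format: Optional[str],
    convert_to: Optional[str],
    convert_from: Optional[str] = None,
) -> bool:
    """Hypothetical sketch of the conversion decision described in the review."""
    if convert_to is None:
        return False  # no target specified
    if source_format is None:
        return False  # uncompressed file, nothing to convert
    if source_format == convert_to:
        return False  # already in the target format
    if convert_from is not None and source_format != convert_from:
        return False  # filtered out by --convert-from
    return True
```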

59-64: LGTM!

The filename conversion tests validate all format pair transformations concisely.

137-138: LGTM!

Direct pytest execution support is convenient for development and debugging.

README.md (3)

169-173: LGTM!

The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.

185-211: LGTM!

The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.

264-274: LGTM!

The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.

databusclient/cli.py (2)

161-170: LGTM!

The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.

180-197: LGTM!

The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the api_download call.

databusclient/api/download.py (6)

3-28: LGTM! Clean compression module setup.

The imports and constant mappings are well-structured and use standard library modules for compression handling.

31-45: LGTM! Robust extension detection.

The case-insensitive matching ensures reliable format detection across different filename conventions.
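
For readers without the diff at hand, extension-based detection of this kind could look roughly like the sketch below. The mapping contents and the function name `detect_compression_format` are assumptions; the PR's actual `COMPRESSION_EXTENSIONS` constant and `_detect_compression_format` are not reproduced in this thread.

```python
from typing import Optional

# Assumed mapping from format name to file extension.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}


def detect_compression_format(filename: str) -> Optional[str]:
    """Return 'bz2', 'gz' or 'xz' based on the extension, or None if uncompressed."""
    lowered = filename.lower()  # case-insensitive match, so 'FILE.BZ2' also counts
    for fmt, ext in COMPRESSION_EXTENSIONS.items():
        if lowered.endswith(ext):
            return fmt
    return None
```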

48-79: LGTM! Conversion decision logic is sound.

The function correctly handles all conversion scenarios including filtering and format matching.

141-263: LGTM! Clean integration of conversion into download flow.

The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.

265-299: LGTM! Consistent parameter propagation across all download functions.

All internal download functions properly accept and forward convert_to and convert_from parameters through the call chain, with clear documentation.

Also applies to: 437-476, 478-512, 515-557, 624-663

700-836: The CLI already validates compression format parameters using Click's built-in Choice validator, which automatically rejects invalid values before they reach the API function. No additional validation is needed.

Likely an incorrect or invalid review comment.

coderabbitai · 2026-01-02T15:10:42Z

databusclient/api/download.py

+def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
+    """
+    Generate the new filename after compression format conversion.
+    
+    Parameters:
+    - filename: Original filename
+    - source_format: Source compression format ('bz2', 'gz', 'xz')
+    - target_format: Target compression format ('bz2', 'gz', 'xz')
+    
+    Returns:
+    - New filename with updated extension
+    """
+    source_ext = COMPRESSION_EXTENSIONS[source_format]
+    target_ext = COMPRESSION_EXTENSIONS[target_format]
+
+    if filename.endswith(source_ext):
+        return filename[:-len(source_ext)] + target_ext
+    return filename + target_ext


⚠️ Potential issue | 🟠 Major

Fallback logic may create malformed filenames.

The fallback on Line 98 appends the target extension without removing the source extension, potentially creating filenames like file.bz2.gz if the source extension check fails. Since _detect_compression_format uses case-insensitive matching but filename.endswith(source_ext) on Line 97 is case-sensitive, this mismatch could trigger the fallback.

🔎 Proposed fix using case-insensitive matching

 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     """
     Generate the new filename after compression format conversion.

     Parameters:
     - filename: Original filename
     - source_format: Source compression format ('bz2', 'gz', 'xz')
     - target_format: Target compression format ('bz2', 'gz', 'xz')

     Returns:
     - New filename with updated extension
     """
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]

-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext

🤖 Prompt for AI Agents

In databusclient/api/download.py around lines 82 to 99, the fallback appends the target extension without reliably removing the source extension because endswith() is case-sensitive while detection is case-insensitive; update the logic to perform a case-insensitive check (e.g., compare filename.lower().endswith(source_ext.lower())) and if it matches remove the trailing source extension by length before appending the target extension; if no match, simply append the target extension as before, ensuring you strip only the exact suffix length rather than any substring to avoid creating names like file.bz2.gz.
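
A small self-contained sketch of the case-insensitive variant shows the effect on an upper-case extension. The mapping values are assumed (lower-case extensions with a leading dot), and `get_converted_filename` is a stand-in name for the PR's private helper.

```python
# Assumed mapping from format name to lower-case file extension.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}


def get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
    """Swap the compression extension, matching the source extension case-insensitively."""
    source_ext = COMPRESSION_EXTENSIONS[source_format]
    target_ext = COMPRESSION_EXTENSIONS[target_format]
    if filename.lower().endswith(source_ext):
        # Strip exactly the suffix length, keeping the rest of the name untouched.
        return filename[: -len(source_ext)] + target_ext
    return filename + target_ext


upper = get_converted_filename("DATA.TTL.BZ2", "bz2", "gz")
lower = get_converted_filename("data.ttl.bz2", "bz2", "gz")
```

With the case-sensitive check, the first call would instead fall through to the fallback and produce `DATA.TTL.BZ2.gz`.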

coderabbitai · 2026-01-02T15:10:42Z

databusclient/api/download.py

+def _convert_compression_format(
+    source_file: str, target_file: str, source_format: str, target_format: str
+) -> None:
+    """
+    Convert a compressed file from one format to another.
+    
+    Parameters:
+    - source_file: Path to source compressed file
+    - target_file: Path to target compressed file
+    - source_format: Source compression format ('bz2', 'gz', 'xz')
+    - target_format: Target compression format ('bz2', 'gz', 'xz')
+    """
+    source_module = COMPRESSION_MODULES[source_format]
+    target_module = COMPRESSION_MODULES[target_format]
+
+    print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}")
+
+    # Decompress and recompress with progress indication
+    chunk_size = 8192
+
+    try:
+        with source_module.open(source_file, 'rb') as sf:
+            with target_module.open(target_file, 'wb') as tf:
+                while True:
+                    chunk = sf.read(chunk_size)
+                    if not chunk:
+                        break
+                    tf.write(chunk)
+
+        # Remove the original file after successful conversion
+        os.remove(source_file)
+        print(f"Conversion complete: {os.path.basename(target_file)}")
+    except Exception as e:
+        # If conversion fails, ensure the partial target file is removed
+        if os.path.exists(target_file):
+            os.remove(target_file)
+        raise RuntimeError(f"Compression conversion failed: {e}")


⚠️ Potential issue | 🔴 Critical

Add input validation for compression formats.

Lines 114-115 access COMPRESSION_MODULES using source_format and target_format without validation. If invalid format strings are passed (e.g., from external input or typos), this will raise KeyError.

🔎 Proposed fix with validation

 def _convert_compression_format(
     source_file: str, target_file: str, source_format: str, target_format: str
 ) -> None:
     """
     Convert a compressed file from one format to another.

     Parameters:
     - source_file: Path to source compressed file
     - target_file: Path to target compressed file
     - source_format: Source compression format ('bz2', 'gz', 'xz')
     - target_format: Target compression format ('bz2', 'gz', 'xz')
     """
+    # Validate compression formats
+    if source_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported source compression format: {source_format}")
+    if target_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported target compression format: {target_format}")
+
     source_module = COMPRESSION_MODULES[source_format]
     target_module = COMPRESSION_MODULES[target_format]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _convert_compression_format(
    source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
    """
    Convert a compressed file from one format to another.

    Parameters:
    - source_file: Path to source compressed file
    - target_file: Path to target compressed file
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')
    """
    # Validate compression formats
    if source_format not in COMPRESSION_MODULES:
        raise ValueError(f"Unsupported source compression format: {source_format}")
    if target_format not in COMPRESSION_MODULES:
        raise ValueError(f"Unsupported target compression format: {target_format}")

    source_module = COMPRESSION_MODULES[source_format]
    target_module = COMPRESSION_MODULES[target_format]

    print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}")

    # Decompress and recompress with progress indication
    chunk_size = 8192

    try:
        with source_module.open(source_file, 'rb') as sf:
            with target_module.open(target_file, 'wb') as tf:
                while True:
                    chunk = sf.read(chunk_size)
                    if not chunk:
                        break
                    tf.write(chunk)

        # Remove the original file after successful conversion
        os.remove(source_file)
        print(f"Conversion complete: {os.path.basename(target_file)}")
    except Exception as e:
        # If conversion fails, ensure the partial target file is removed
        if os.path.exists(target_file):
            os.remove(target_file)
        raise RuntimeError(f"Compression conversion failed: {e}")

🤖 Prompt for AI Agents

In databusclient/api/download.py around lines 102 to 138, the function _convert_compression_format accesses COMPRESSION_MODULES[source_format] and COMPRESSION_MODULES[target_format] without validating the inputs; add explicit validation at the top of the function to check that source_format and target_format are strings and keys present in COMPRESSION_MODULES, and if not raise a clear ValueError listing allowed formats; after validation, safely retrieve the modules (or use dict access knowing keys exist) and proceed as before; ensure error messages are descriptive (e.g., "Unsupported compression format: <format>. Supported: <list>") so callers can correct invalid inputs.
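
The validation described in this prompt, including the list of supported formats in the error message, might be factored out as in the sketch below. The helper name `validate_formats` is hypothetical, and the `COMPRESSION_MODULES` contents are an assumption based on the PR's statement that it uses Python's built-in compression modules.

```python
import bz2
import gzip
import lzma

# Assumed mapping from format name to standard-library module.
COMPRESSION_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}


def validate_formats(source_format: object, target_format: object) -> None:
    """Raise ValueError with the supported formats if either argument is invalid."""
    supported = ", ".join(sorted(COMPRESSION_MODULES))
    for role, fmt in (("source", source_format), ("target", target_format)):
        # The isinstance check runs first so unhashable inputs never reach the dict lookup.
        if not isinstance(fmt, str) or fmt not in COMPRESSION_MODULES:
            raise ValueError(
                f"Unsupported {role} compression format: {fmt!r}. Supported: {supported}"
            )
```

Calling such a helper at the top of the conversion function would turn a bare `KeyError` into a message that tells the caller which values are accepted.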

coderabbitai · 2026-01-02T15:10:42Z

databusclient/api/download.py

+    source_module = COMPRESSION_MODULES[source_format]
+    target_module = COMPRESSION_MODULES[target_format]
+
+    print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}")


⚠️ Potential issue | 🟡 Minor

Remove misleading comment about progress indication.

The comment claims "with progress indication" but no progress bar is implemented for the conversion process. Unlike the download function which uses tqdm, this conversion runs silently except for the start/end print statements.

🔎 Proposed fix

-    # Decompress and recompress with progress indication
+    # Decompress and recompress in chunks
     chunk_size = 8192

Alternatively, consider adding actual progress indication for large file conversions to match the download UX.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In databusclient/api/download.py around line 117, the code comment claims "with progress indication" while no progress bar is shown; either reword the comment or implement a real progress indicator (e.g., wrap the chunked streaming loop with tqdm) so the UX matches the download function. If adding tqdm, ensure it is imported and used only for large files to avoid unnecessary output.
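
If real progress reporting is preferred, one possible shape is a per-chunk callback, sketched here with the standard library only. The function name `convert_with_progress`, the `on_chunk` parameter, and the module mapping are all assumptions for illustration; cleanup on failure and removal of the source file are left out to keep the sketch short.

```python
import bz2
import gzip
import lzma
from typing import Callable, Optional

# Assumed mapping from format name to standard-library module.
COMPRESSION_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}


def convert_with_progress(
    source_file: str,
    target_file: str,
    source_format: str,
    target_format: str,
    on_chunk: Optional[Callable[[int], None]] = None,
    chunk_size: int = 8192,
) -> None:
    """Stream-decompress source_file and recompress into target_file.

    on_chunk, if given, receives the number of decompressed bytes
    written for each chunk.
    """
    source_module = COMPRESSION_MODULES[source_format]
    target_module = COMPRESSION_MODULES[target_format]
    with source_module.open(source_file, "rb") as sf, target_module.open(target_file, "wb") as tf:
        while True:
            chunk = sf.read(chunk_size)
            if not chunk:
                break
            tf.write(chunk)
            if on_chunk is not None:
                on_chunk(len(chunk))
```

Since the download path already uses tqdm, a bar's `update` method could plausibly be passed as `on_chunk`. The decompressed total is not known up front, so such a bar would show a running byte count rather than a percentage.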

Add on-the-fly compression conversion during download (Issue dbpedia#18)

46df914

coderabbitai bot reviewed Jan 2, 2026
