Add support for jsonl.zst files #187

vinci1it2000 · 2025-11-13T09:56:42Z

Description

This PR adds support for .jsonl.zst (JSON Lines with Zstandard compression) file format as an alternative to the existing .tar.bz2 format for Photon index data.

Changes

Added zstd package to Docker image for decompression support
Implemented automatic file format detection based on download URL extension
Added separate extraction logic for .jsonl.zst files using Photon's nominatim-import feature
Introduced new BUILD_PHOTON_PARAMS environment variable to configure import parameters (defaults to -languages en,de,fr,es,it)
Simplified MD5 download logic to derive the checksum URL directly from the data file URL
Improved MD5 file detection to handle both .tar.bz2 and .jsonl.zst formats
Updated documentation to reflect support for both file formats

Benefits

Provides flexibility in choosing data source formats
Enables users to import data from JSON dumps with customizable language settings
Maintains backward compatibility with existing .tar.bz2 format
Cleaner, more maintainable code for handling different file formats

Testing

Verified .jsonl.zst file download and extraction
Verified existing .tar.bz2 functionality remains intact
Confirmed BUILD_PHOTON_PARAMS configuration works as expected
Verified MD5 checksum validation works for both file formats

Fixes #122

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

The new download_md5 logic drops the previous region‐based MD5 URL derivation—please verify that custom or region-specific tar.bz2 URLs still resolve their checksums correctly.
Using endswith on the download URL to detect file types can be brittle (e.g., uppercase extensions or query params); consider parsing the path or MIME type for more robust format detection.
You might extract the shell‐command construction for different formats into a small helper or strategy class to simplify adding future formats and make the code more maintainable.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The new download_md5 logic drops the previous region‐based MD5 URL derivation—please verify that custom or region-specific tar.bz2 URLs still resolve their checksums correctly.
- Using `endswith` on the download URL to detect file types can be brittle (e.g., uppercase extensions or query params); consider parsing the path or MIME type for more robust format detection.
- You might extract the shell‐command construction for different formats into a small helper or strategy class to simplify adding future formats and make the code more maintainable.

## Individual Comments

### Comment 1
<location> `src/downloader.py:284-291` </location>
<code_context>

 def download_index() -> str:
-    output_file = "photon-db-latest.tar.bz2"
     download_url = get_download_url()
+    if download_url.endswith(".jsonl.zst"):
+        output_file = "photon-data-dump.jsonl.zst"
</code_context>

<issue_to_address>
**suggestion:** The new download_md5 logic assumes all download URLs can be suffixed with '.md5'.

If the server does not support '.md5' suffixes for all URLs, this logic may fail. Consider adding error handling or validating the URL format before appending the suffix.

Suggested implementation:

```python
def download_md5(download_url: str) -> str:
    """
    Attempts to download the MD5 file for the given download_url.
    Handles cases where the .md5 file is not available.
    Returns the path to the downloaded MD5 file, or raises an exception if not found.
    """
    import requests

    md5_url = download_url + ".md5"
    try:
        response = requests.get(md5_url, timeout=10)
        response.raise_for_status()
        md5_file = os.path.join(config.TEMP_DIR, os.path.basename(md5_url))
        with open(md5_file, "wb") as f:
            f.write(response.content)
        return md5_file
    except requests.exceptions.HTTPError as e:
        if response.status_code == 404:
            raise FileNotFoundError(f"MD5 file not found for URL: {md5_url}")
        else:
            raise
    except Exception as e:
        raise RuntimeError(f"Failed to download MD5 file: {e}")

```

You will need to update any calls to `download_md5()` to pass the `download_url` as an argument.  
If you want to validate the URL format before appending `.md5`, you can add a check to ensure the URL is a valid file download URL (e.g., ends with a known extension).
</issue_to_address>

### Comment 2
<location> `src/filesystem.py:24-27` </location>
<code_context>
     if not os.path.exists(config.TEMP_DIR):
         logging.debug(f"Creating temp directory: {config.TEMP_DIR}")
         os.makedirs(config.TEMP_DIR, exist_ok=True)
+    if index_file.endswith(".jsonl.zst"):
+        install_command = f"zstd --stdout -d {index_file} | java -jar /photon/photon.jar -nominatim-import -import-file - -data-dir {config.TEMP_DIR} {config.BUILD_PHOTON_PARAMS}"
+    else:
+        install_command = f"lbzip2 -d -c {index_file} | tar x -C {config.TEMP_DIR}"

</code_context>

<issue_to_address>
**suggestion:** Extraction logic for '.jsonl.zst' files assumes zstd and Java are available and compatible.

Consider checking for zstd and Java availability at startup, and enhance error messages to guide users on resolving missing or incompatible dependencies.

```suggestion
    def check_executable_available(executable_name):
        from shutil import which
        if which(executable_name) is None:
            raise RuntimeError(
                f"Required executable '{executable_name}' not found in PATH. "
                f"Please install '{executable_name}' and ensure it is available in your system PATH."
            )

    if index_file.endswith(".jsonl.zst"):
        check_executable_available("zstd")
        check_executable_available("java")
        install_command = f"zstd --stdout -d {index_file} | java -jar /photon/photon.jar -nominatim-import -import-file - -data-dir {config.TEMP_DIR} {config.BUILD_PHOTON_PARAMS}"
    else:
        install_command = f"lbzip2 -d -c {index_file} | tar x -C {config.TEMP_DIR}"
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

src/downloader.py

src/filesystem.py

rtuszik · 2025-11-18T20:23:07Z

Thanks for this PR!

Before having done proper testing on this, some questions:

How is the storage space check affected as it was initially based on the expansion ratio of the .tar.bz2 files?
Is there any sort of update mechanism that would allow for incremental updates with .jsonl.zst imports? Or do we still have to delete the entire old index?

vinci1it2000 added 2 commits November 13, 2025 10:16

feat, rtuszik#122: Add support for jsonl.zst files.

e832da9

fix: Simplify MD5 download logic and support jsonl.zst file detection.

db98494

sourcery-ai bot reviewed Nov 13, 2025

View reviewed changes

src/downloader.py Show resolved Hide resolved

src/filesystem.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for jsonl.zst files #187

Add support for jsonl.zst files #187

Uh oh!

vinci1it2000 commented Nov 13, 2025

Uh oh!

sourcery-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

rtuszik commented Nov 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add support for jsonl.zst files #187

Are you sure you want to change the base?

Add support for jsonl.zst files #187

Uh oh!

Conversation

vinci1it2000 commented Nov 13, 2025

Description

Changes

Benefits

Testing

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rtuszik commented Nov 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants