Skip to content

Conversation

@vinci1it2000
Copy link

Description

This PR adds support for .jsonl.zst (JSON Lines with Zstandard compression) file format as an alternative to the existing .tar.bz2 format for Photon index data.

Changes

  • Added zstd package to Docker image for decompression support
  • Implemented automatic file format detection based on download URL extension
  • Added separate extraction logic for .jsonl.zst files using Photon's nominatim-import feature
  • Introduced new BUILD_PHOTON_PARAMS environment variable to configure import parameters (defaults to -languages en,de,fr,es,it)
  • Simplified MD5 download logic to derive the checksum URL directly from the data file URL
  • Improved MD5 file detection to handle both .tar.bz2 and .jsonl.zst formats
  • Updated documentation to reflect support for both file formats

Benefits

  • Provides flexibility in choosing data source formats
  • Enables users to import data from JSON dumps with customizable language settings
  • Maintains backward compatibility with existing .tar.bz2 format
  • Cleaner, more maintainable code for handling different file formats

Testing

  • Verified .jsonl.zst file download and extraction
  • Verified existing .tar.bz2 functionality remains intact
  • Confirmed BUILD_PHOTON_PARAMS configuration works as expected
  • Verified MD5 checksum validation works for both file formats

Fixes #122

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • The new download_md5 logic drops the previous region‐based MD5 URL derivation—please verify that custom or region-specific tar.bz2 URLs still resolve their checksums correctly.
  • Using endswith on the download URL to detect file types can be brittle (e.g., uppercase extensions or query params); consider parsing the path or MIME type for more robust format detection.
  • You might extract the shell‐command construction for different formats into a small helper or strategy class to simplify adding future formats and make the code more maintainable.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new download_md5 logic drops the previous region‐based MD5 URL derivation—please verify that custom or region-specific tar.bz2 URLs still resolve their checksums correctly.
- Using `endswith` on the download URL to detect file types can be brittle (e.g., uppercase extensions or query params); consider parsing the path or MIME type for more robust format detection.
- You might extract the shell‐command construction for different formats into a small helper or strategy class to simplify adding future formats and make the code more maintainable.

## Individual Comments

### Comment 1
<location> `src/downloader.py:284-291` </location>
<code_context>

 def download_index() -> str:
-    output_file = "photon-db-latest.tar.bz2"
     download_url = get_download_url()
+    if download_url.endswith(".jsonl.zst"):
+        output_file = "photon-data-dump.jsonl.zst"
</code_context>

<issue_to_address>
**suggestion:** The new download_md5 logic assumes all download URLs can be suffixed with '.md5'.

If the server does not support '.md5' suffixes for all URLs, this logic may fail. Consider adding error handling or validating the URL format before appending the suffix.

Suggested implementation:

```python
def download_md5(download_url: str) -> str:
    """
    Attempts to download the MD5 file for the given download_url.
    Handles cases where the .md5 file is not available.
    Returns the path to the downloaded MD5 file, or raises an exception if not found.
    """
    import requests

    md5_url = download_url + ".md5"
    try:
        response = requests.get(md5_url, timeout=10)
        response.raise_for_status()
        md5_file = os.path.join(config.TEMP_DIR, os.path.basename(md5_url))
        with open(md5_file, "wb") as f:
            f.write(response.content)
        return md5_file
    except requests.exceptions.HTTPError as e:
        if response.status_code == 404:
            raise FileNotFoundError(f"MD5 file not found for URL: {md5_url}")
        else:
            raise
    except Exception as e:
        raise RuntimeError(f"Failed to download MD5 file: {e}")

```

You will need to update any calls to `download_md5()` to pass the `download_url` as an argument.  
If you want to validate the URL format before appending `.md5`, you can add a check to ensure the URL is a valid file download URL (e.g., ends with a known extension).
</issue_to_address>

### Comment 2
<location> `src/filesystem.py:24-27` </location>
<code_context>
     if not os.path.exists(config.TEMP_DIR):
         logging.debug(f"Creating temp directory: {config.TEMP_DIR}")
         os.makedirs(config.TEMP_DIR, exist_ok=True)
+    if index_file.endswith(".jsonl.zst"):
+        install_command = f"zstd --stdout -d {index_file} | java -jar /photon/photon.jar -nominatim-import -import-file - -data-dir {config.TEMP_DIR} {config.BUILD_PHOTON_PARAMS}"
+    else:
+        install_command = f"lbzip2 -d -c {index_file} | tar x -C {config.TEMP_DIR}"

</code_context>

<issue_to_address>
**suggestion:** Extraction logic for '.jsonl.zst' files assumes zstd and Java are available and compatible.

Consider checking for zstd and Java availability at startup, and enhance error messages to guide users on resolving missing or incompatible dependencies.

```suggestion
    def check_executable_available(executable_name):
        from shutil import which
        if which(executable_name) is None:
            raise RuntimeError(
                f"Required executable '{executable_name}' not found in PATH. "
                f"Please install '{executable_name}' and ensure it is available in your system PATH."
            )

    if index_file.endswith(".jsonl.zst"):
        check_executable_available("zstd")
        check_executable_available("java")
        install_command = f"zstd --stdout -d {index_file} | java -jar /photon/photon.jar -nominatim-import -import-file - -data-dir {config.TEMP_DIR} {config.BUILD_PHOTON_PARAMS}"
    else:
        install_command = f"lbzip2 -d -c {index_file} | tar x -C {config.TEMP_DIR}"
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@rtuszik
Copy link
Owner

rtuszik commented Nov 18, 2025

Thanks for this PR!

Before having done proper testing on this, some questions:

  • How is the storage space check affected as it was initially based on the expansion ratio of the .tar.bz2 files?
  • Is there any sort of update mechanism that would allow for incremental updates with .jsonl.zst imports? Or do we still have to delete the entire old index?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Add support for building Photon databases from JSONL dumps

2 participants