
Conversation


@anlowee anlowee commented Aug 18, 2025

Description

This PR adds S3 config support for Presto. It scans clp-config.yaml and generates the S3 config when archive_output.storage.type is s3.

Since there are no usage changes from the user's end, no doc updates are needed.

Note that this PR should not be merged before Presto updates the worker image dependency (y-scope/presto#55); the worker image needs the ca-certificates package installed.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Tested with an S3-configured clp-package; it works.

Summary by CodeRabbit

  • New Features

    • Added a CLI to generate CLP environment and worker configuration from package configs, supporting filesystem and S3 storage; introduced coordinator split-filter setting and switched coordinator config to use split-filter.json.
  • Chores

    • Setup script now invokes the new CLI; updated .gitignore and removed the default connector name from the worker template.
  • Documentation

    • Bumped CLP JSON requirement, added python3-venv, added S3/object-storage guidance, changed filter config format and dataset filter structure, and updated query examples including default-dataset guidance.


coderabbitai bot commented Aug 18, 2025

Walkthrough

Adds a new CLI tool that generates CLP environment variables and worker clp.properties from CLP YAML, credentials and coordinator env files; updates the setup script to call it; switches coordinator/worker templates to split-filter config; adds a coordinator env var; updates docs for S3/object-storage and split-filter usage; updates .gitignore.

Changes

Cohort / File(s) Summary
Init CLI & setup script
tools/deployment/presto-clp/scripts/init.py, tools/deployment/presto-clp/scripts/set-up-config.sh
New CLI init.py (adds main(argv=None) -> int) that loads etc/clp-config.yml, credentials.yml, and coordinator-common.env, validates DB and archive storage types (mariadb/mysql, fs/s3), extracts S3/AWS creds when needed, builds env vars (DB URL/name/user/password, CLP_ARCHIVES_DIR, discovery URI), generates worker clp.properties (including S3 entries), and writes KEY=VALUE output; set-up-config.sh now invokes init.py.
Coordinator env & template
tools/deployment/presto-clp/coordinator.env, tools/deployment/presto-clp/coordinator/config-template/clp.properties
Added PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_FILTER_PROVIDER_TYPE in coordinator.env; replaced clp.metadata-filter-config with clp.split-filter-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_FILTER_PROVIDER_TYPE} and clp.split-filter-config=/opt/presto-server/etc/split-filter.json in coordinator config template.
Worker template & gitignore
tools/deployment/presto-clp/worker/config-template/clp.properties, tools/deployment/presto-clp/.gitignore
Removed connector.name=clp from worker clp.properties template; added .gitignore rule to ignore worker/config-template/clp.properties.
Documentation
docs/src/user-docs/guides-using-presto.md
Bumped CLP JSON version to v0.5.0, added python3-venv, added S3 object-storage guidance (credentials-only), switched examples to split-filter.json, nested rangeMapping under customOptions, updated DESCRIBE/query examples and wording to reflect object storage and default dataset behavior.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Setup as set-up-config.sh
  participant Init as init.py
  participant CLPPkg as CLP Package (etc/, credentials.yml, coordinator-common.env)
  participant Out as Output (.env / worker clp.properties)

  User->>Setup: run set-up-config.sh
  Setup->>Init: invoke init.py --clp-package-dir --output-file
  Init->>CLPPkg: load etc/clp-config.yml
  alt missing clp-config.yml
    Init-->>Setup: log error & exit 1
  else
    Init->>CLPPkg: read credentials.yml and coordinator-common.env
    Init->>Init: validate DB type and archive storage type
    alt storage.type == s3
      Init->>CLPPkg: extract S3 config and AWS creds
      Init->>Init: add S3-specific worker properties
    else storage.type == fs
      Init->>Init: resolve filesystem archive paths
    end
    Init->>Init: build env_vars and worker clp.properties
    par write outputs
      Init->>Out: write .env (KEY=VALUE)
      Init->>Out: write worker clp.properties
    end
    Init-->>Setup: exit 0
  end
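
The fail-fast branch of the diagram can be stubbed out as below; the paths and the flag names mentioned in the comment (--clp-package-dir, --output-file) come from the diagram and may differ from the real script's interface.

```shell
# Stub of the "missing clp-config.yml" fail-fast branch from the diagram.
set -euo pipefail

clp_package_dir="$(mktemp -d)"
mkdir -p "${clp_package_dir}/etc"
printf 'archive_output:\n  storage:\n    type: fs\n' \
    > "${clp_package_dir}/etc/clp-config.yml"

config_file="${clp_package_dir}/etc/clp-config.yml"
if [ ! -f "${config_file}" ]; then
    echo "error: ${config_file} not found" >&2
    exit 1
fi

# Real flow (per the diagram):
#   python3 scripts/init.py --clp-package-dir "${clp_package_dir}" --output-file .env
echo "loaded ${config_file}"
```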

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • kirkrodrigues
  • junhaoliao

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@anlowee anlowee marked this pull request as ready for review August 19, 2025 14:14
@anlowee anlowee requested a review from a team as a code owner August 19, 2025 14:14

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/deployment/presto-clp/scripts/generate-configs.py (1)

66-69: Write the .env with restrictive permissions (secrets included).

The generated file includes database credentials and S3 secrets. Ensure the file is not world-readable.

Outside the selected range, adjust the write to enforce 0600:

import os  # at top

# replace the existing open/write block:
fd = os.open(output_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w") as output_file_handle:
    for key, value in env_vars.items():
        output_file_handle.write(f"{key}={value}\n")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c04e093 and 5e13836.

📒 Files selected for processing (3)
  • tools/deployment/presto-clp/scripts/generate-configs.py (5 hunks)
  • tools/deployment/presto-clp/scripts/set-up-config.sh (1 hunks)
  • tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1 hunks)
🔇 Additional comments (9)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (2)

77-86: Lock down clp.properties permissions after writing AWS secrets

In tools/deployment/presto-clp/worker/scripts/generate-configs.sh (lines 77–86), you’re persisting AWS credentials to disk. To prevent unintended access, restrict ownership and mode immediately after the update_config_file calls:

# Restrict access to the Presto process account
# Replace <user>:<group> with the user/group that runs Presto in your container
chown <user>:<group> "$CLP_PROPERTIES_FILE" || true
# Only the owner needs read/write; remove all other access
chmod 0600 "$CLP_PROPERTIES_FILE" || true

If your deployment requires group read, use chmod 0640 instead. Please confirm which user and group the Presto service runs under in your container setup to avoid permission errors.


77-86: Verify CLP connector S3 endpoint property name
We only found clp.s3-end-point in your scripts—no occurrences of clp.s3-endpoint. The upstream YScope CLP docs configure S3 via clp-config.yml (under s3_config) and don’t list Presto-style properties, and Presto connectors each define their own keys in code or README.

Please confirm the exact S3 endpoint property your CLP → Presto connector expects (e.g., by checking its README or source) and update accordingly:

  • tools/deployment/presto-clp/worker/scripts/generate-configs.sh (line 84)
  • tools/deployment/presto-clp/scripts/generate-configs.py (line 146)

Once you’ve verified the correct key (likely clp.s3-endpoint if it follows common naming), align both the property name and the PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT env var.

tools/deployment/presto-clp/scripts/generate-configs.py (6)

5-5: Type import update looks good.

Importing Any is appropriate for the parsed YAML config structure.


31-33: ArgumentParser description change is fine.

No behavioural change; improves readability.


59-59: Signature change usage LGTM.

Passing clp_config into _add_clp_env_vars matches the refactor and reduces I/O duplication.


73-75: Signature refactor is appropriate.

Accepting clp_config explicitly makes the helper more testable and side‑effect free.


150-152: Error message is precise and user-facing. LGTM.

Clear guidance for supported storage types.


116-148: Ensure S3 config validation & clean endpoint

Please verify and update the S3‐related logic in tools/deployment/presto-clp/scripts/generate-configs.py:

  • _get_config_value returns None when a key is missing, which today gets written into the Presto .env and then into clp.properties as the literal “None”. We should validate that the following values are present before setting any environment variables:

    • <s3_credentials_key_prefix>.access_key_id
    • <s3_credentials_key_prefix>.secret_access_key
    • <s3_config_key_prefix>.bucket
    • <s3_config_key_prefix>.region_code
      If any are missing, log an error listing the missing keys and exit (return False).
  • Remove the trailing slash from the endpoint so it reads
    https://{bucket}.s3.{region}.amazonaws.com
    instead of ending with /.

  • Confirm whether the Presto CLP connector expects:

    1. A bucket-scoped endpoint ({bucket}.s3.{region}.amazonaws.com), or
    2. The service endpoint (s3.{region}.amazonaws.com) with the bucket supplied separately in metadata/URLs.

Suggested diff in generate-configs.py:

     elif "s3" == clp_archive_output_storage_type:
@@
-        s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+        # Build endpoint without trailing slash
+        s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com"

+        # Validate required S3 config values
+        missing = []
+        if not s3_access_key_id:
+            missing.append(f"{s3_credentials_key_prefix}.access_key_id")
+        if not s3_secret_access_key:
+            missing.append(f"{s3_credentials_key_prefix}.secret_access_key")
+        if not s3_bucket:
+            missing.append(f"{s3_config_key_prefix}.bucket")
+        if not s3_region_code:
+            missing.append(f"{s3_config_key_prefix}.region_code")
+        if missing:
+            logger.error(
+                "Missing required S3 config value(s): %s", ", ".join(missing)
+            )
+            return False

And note in tools/deployment/presto-clp/worker/scripts/generate-configs.sh at line 84 we write clp.s3-end-point from this value.

tools/deployment/presto-clp/scripts/set-up-config.sh (1)

26-26: Switch to generate-configs.py is aligned with the refactor.

The entry-point change matches the new flow that loads clp-config.yml in Python. No further changes needed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)

42-64: Make key matching literal in update_config_file to avoid regex collisions

Keys like presto.version and clp.s3-auth-provider contain dots which are regex metacharacters; your grep/sed patterns can match unintended lines. Use fixed-string grep and escape the key for sed replacement.

Apply this diff within update_config_file:

 update_config_file() {
     local file_path=$1
     local key=$2
     local value=$3
     local sensitive=${4:-false}
+    # Escape regex metacharacters in key for sed
+    local escaped_key
+    escaped_key=$(printf '%s' "$key" | sed -e 's/[][\\.^$*+?{}()|]/\\&/g')
 
-    if grep --quiet "^${key}=.*$" "$file_path"; then
-        sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
+    if grep --quiet "^${escaped_key}=" "$file_path"; then
+        sed --in-place "s|^${escaped_key}=.*|${key}=${value}|" "$file_path"
     else
         echo "${key}=${value}" >>"$file_path"
     fi
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5e13836 and d7478f8.

📒 Files selected for processing (2)
  • tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (6 hunks)
  • tools/deployment/presto-clp/worker/scripts/generate-configs.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)

85-101: S3 gating, fail-fast validation, and secret masking look solid

  • Correctly gated on PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE == "s3".
  • Fail-fast on missing inputs.
  • Sensitive values masked via update_config_file(..., true).

This addresses prior feedback around gating and secret leakage.
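
The masking behaviour referenced above can be sketched as follows; this is a minimal illustration of the `update_config_file(..., true)` pattern, assuming the real script behaves similarly: the value always lands in the file, and masking applies only to log output.

```shell
# Hedged sketch of update_config_file with a "sensitive" flag (illustrative).
set -euo pipefail

update_config_file() {
    local file_path=$1
    local key=$2
    local value=$3
    local sensitive=${4:-false}

    if grep -q "^${key}=" "$file_path"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file_path"
    else
        echo "${key}=${value}" >>"$file_path"
    fi

    # Mask only what is logged, never what is written.
    local shown="$value"
    if [ "$sensitive" = "true" ]; then
        shown="****"
    fi
    echo "Set ${key}=${shown}"
}

props_file="$(mktemp)"
update_config_file "$props_file" "clp.storage-type" "s3"
update_config_file "$props_file" "clp.s3-secret-access-key" "EXAMPLE-SECRET" true
```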

tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (3)

46-57: Early load and validation of clp-config.yml is good

Reading clp-config.yml up-front and failing with a clear error message if missing improves debuggability and keeps helper logic focused.


106-114: CLP_ARCHIVES_DIR resolution logic is sound

Resolving relative paths against the CLP package dir and switching default between archives vs. staged-archives based on storage type is appropriate.


73-76: Signature change verified: all call sites updated

The rg and ast-grep searches show only the one three-argument call to _add_clp_env_vars in generate-user-env-vars-file.py (line 59). No stale two-arg invocations remain.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)

42-47: Harden update_config_file: escape regex metacharacters and mask-safe logging docs

Good addition of the sensitive flag. Two follow-ups:

  • The function comment still lists only 3 params; add $4 docs.
  • Key and value are interpolated into grep/sed regex/replacement without escaping. Dots and other metacharacters in keys (e.g., clp.s3-end-point) are treated as regex and may match unintended lines; unescaped ampersands in values can be expanded by sed.

Apply this diff to make the operation regex-safe and replacement-safe:

 update_config_file() {
     local file_path=$1
     local key=$2
     local value=$3
-    local sensitive=${4:-false}
+    local sensitive=${4:-false}
+
+    # Escape key for use in regex contexts
+    local key_re
+    key_re=$(printf '%s' "$key" | sed -e 's/[][\\.^$*+?{}()|]/\\&/g')
+    # Escape '&' in the replacement to avoid sed expansion
+    local value_repl=${value//&/\\&}

-    if grep --quiet "^${key}=.*$" "$file_path"; then
-        sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
+    if grep --quiet -E "^${key_re}=.*$" "$file_path"; then
+        sed --in-place -E "s|^${key_re}=.*|${key}=${value_repl}|" "$file_path"
     else
         echo "${key}=${value}" >>"$file_path"
     fi

Also, please update the function header to document the optional $4 parameter (masking) for clarity.

Also applies to: 51-64

tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (1)

66-69: Write env-vars file with restrictive permissions (contains secrets)

The output contains AWS creds; use 0600 perms to reduce exposure.

Example change (outside the selected range):

import os

# ...
fd = os.open(output_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w") as output_file_handle:
    for key, value in env_vars.items():
        output_file_handle.write(f"{key}={value}\n")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d7478f8 and 1bdb012.

📒 Files selected for processing (2)
  • tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (6 hunks)
  • tools/deployment/presto-clp/worker/scripts/generate-configs.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (6)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)

87-99: Nice UX improvement: aggregated missing-variable report

Collecting all missing S3 env vars and failing fast improves operator feedback and reduces iteration time. LGTM.

tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (5)

5-5: Type import update LGTM


31-33: CLI description update LGTM


73-75: Signature change LGTM; helps testability and reuse


106-113: Archive dir resolver LGTM

Dynamic default based on storage type and relative-to-package resolution is a good improvement.


132-147: S3 field validation LGTM

Good parity with the shell script; this catches misconfig early with a single clear error.

@anlowee anlowee force-pushed the xwei/s3-support-config branch from 1bdb012 to 14b3a80 on August 20, 2025 17:56
@anlowee anlowee changed the title tools: Update the docker compose config generator to support S3 configuration for Presto. chore: Update the docker compose config generator to support S3 configuration for Presto. Aug 20, 2025
@anlowee anlowee requested a review from kirkrodrigues August 20, 2025 18:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (1)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)

58-63: Masking behaviour looks good; previous concern addressed.

Switching to a fixed mask ("****") avoids the short-value edge case. No further action needed here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 80959fc and 9d92fc9.

📒 Files selected for processing (3)
  • tools/deployment/presto-clp/scripts/init.py (6 hunks)
  • tools/deployment/presto-clp/scripts/set-up-config.sh (1 hunks)
  • tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (3)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)

66-66: Bake jq, wget, and ca-certificates into the Presto worker image

Runtime apt-get install in tools/deployment/presto-clp/worker/scripts/generate-configs.sh (line 66) slows startup and is fragile. Instead:

  • Remove the apt-get update && apt-get install --assume-yes --no-install-recommends jq wget from the script.
  • Extend the Presto worker Dockerfile to include jq, wget, and ca-certificates so that no runtime installs are needed.
  • Confirm the dependency PR (y-scope/presto#55) has been merged and identify which image tag now includes ca-certificates.
    • If it hasn’t merged, this PR should be held until that change lands.
    • If it has, update the documentation or CI to reference the new, compliant image tag.
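
A minimal sketch of baking the packages in at build time, assuming a Debian/Ubuntu-based worker image; the base image reference is a placeholder, not the actual Presto worker image:

```dockerfile
# BASE_IMAGE is a placeholder; point it at the Presto worker image your
# deployment actually uses.
ARG BASE_IMAGE=prestodb/presto:latest
FROM ${BASE_IMAGE}

USER root
# Install the tools the startup scripts need so no runtime apt-get is required.
RUN apt-get update \
    && apt-get install --assume-yes --no-install-recommends \
        ca-certificates \
        jq \
        wget \
    && rm -rf /var/lib/apt/lists/*
```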
tools/deployment/presto-clp/scripts/init.py (1)

109-116: Relative-to-package resolution for CLP_ARCHIVES_DIR is a nice quality-of-life improvement.

Good defaulting and relative-path handling via _resolve_archives_dir.

tools/deployment/presto-clp/scripts/set-up-config.sh (1)

26-28: Switch to init.py for env generation looks correct.

The CLI and output path remain consistent. This aligns with the new logic in init.py.

Comment on lines 117 to 156
    if "fs" == clp_archive_output_storage_type:
        pass
    elif "s3" == clp_archive_output_storage_type:
        s3_config_key_prefix = f"archive_output.storage.s3_config"
        s3_credentials_key_prefix = f"{s3_config_key_prefix}.aws_authentication.credentials"

        s3_access_key_id = _get_config_value(
            clp_config, f"{s3_credentials_key_prefix}.access_key_id"
        )

        s3_bucket = _get_config_value(clp_config, f"{s3_config_key_prefix}.bucket")
        s3_region_code = _get_config_value(clp_config, f"{s3_config_key_prefix}.region_code")

        s3_secret_access_key = _get_config_value(
            clp_config, f"{s3_credentials_key_prefix}.secret_access_key"
        )

        # Validate required S3 fields
        missing = []

        for k, v in {
            f"{s3_credentials_key_prefix}.access_key_id": s3_access_key_id,
            f"{s3_credentials_key_prefix}.secret_access_key": s3_secret_access_key,
            f"{s3_config_key_prefix}.bucket": s3_bucket,
            f"{s3_config_key_prefix}.region_code": s3_region_code,
        }.items():
            if not v:
                missing.append(k)

        if missing:
            logger.error("Missing required S3 config key(s): %s", ", ".join(missing))
            return False

        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "s3"
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package"
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id
        s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key
    else:

💡 Verification agent

🧩 Analysis chain

S3 env-var emission looks correct; double-check endpoint construction for buckets with dots and non-standard partitions.

Using https://<bucket>.s3.<region>.amazonaws.com/ can break TLS for bucket names containing dots (multi-label hostnames) and won’t cover GovCloud/China partitions. Consider using a region endpoint without the bucket (e.g., https://s3.<region>.amazonaws.com) and rely on a separate bucket parameter if the Presto CLP connector supports it, or add an override knob for custom endpoints.

Would you confirm the CLP Presto connector expects clp.s3-end-point to include the bucket? If not required, switch to region endpoint only to avoid TLS pitfalls for dotted bucket names.


🌐 Web query:

What is the expected format of clp.s3-end-point for the CLP Presto connector? Does it require a bucket-qualified host or a service endpoint (e.g., https://s3.<region>.amazonaws.com)?

💡 Result:

Use a service endpoint (not a bucket-qualified host). clp.s3-end-point should point to the S3 service endpoint (for example https://s3.us-west-2.amazonaws.com or a custom S3-compatible service URL), not to a hostname that embeds a bucket (bucketname.s3...). See CLP/Presto S3 connector docs and AWS S3 endpoint guidance for examples [1][2].

References
[1] Cloudera documentation — Presto/CLP S3 connector configuration (see S3 endpoint/URL settings). https://docs.cloudera.com/
[2] AWS S3 endpoints and quotas — service endpoint examples (e.g., https://s3.<region>.amazonaws.com). https://docs.aws.amazon.com/general/latest/gr/s3.html


Update S3 endpoint to use service endpoint instead of bucket-qualified host

The Presto CLP connector expects clp.s3-end-point to point at the S3 service endpoint (e.g. https://s3.<region>.amazonaws.com), not a hostname embedding the bucket. Constructing the endpoint as https://<bucket>.s3.<region>.amazonaws.com/ can break TLS for dotted bucket names and won’t work in non-standard AWS partitions.

• In tools/deployment/presto-clp/scripts/init.py (lines 128–130), replace:

-        s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+        s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"

• Ensure the bucket name is passed separately via the existing PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET (or equivalent) environment variable.
• Optionally, expose a config override (e.g. clp.s3-end-point-override) for custom S3-compatible services or GovCloud/China endpoints instead of hardcoding the AWS pattern.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    if "fs" == clp_archive_output_storage_type:
        pass
    elif "s3" == clp_archive_output_storage_type:
        s3_config_key_prefix = f"archive_output.storage.s3_config"
        s3_credentials_key_prefix = f"{s3_config_key_prefix}.aws_authentication.credentials"
        s3_access_key_id = _get_config_value(
            clp_config, f"{s3_credentials_key_prefix}.access_key_id"
        )
        s3_bucket = _get_config_value(clp_config, f"{s3_config_key_prefix}.bucket")
        s3_region_code = _get_config_value(clp_config, f"{s3_config_key_prefix}.region_code")
        s3_secret_access_key = _get_config_value(
            clp_config, f"{s3_credentials_key_prefix}.secret_access_key"
        )
        # Validate required S3 fields
        missing = []
        for k, v in {
            f"{s3_credentials_key_prefix}.access_key_id": s3_access_key_id,
            f"{s3_credentials_key_prefix}.secret_access_key": s3_secret_access_key,
            f"{s3_config_key_prefix}.bucket": s3_bucket,
            f"{s3_config_key_prefix}.region_code": s3_region_code,
        }.items():
            if not v:
                missing.append(k)
        if missing:
            logger.error("Missing required S3 config key(s): %s", ", ".join(missing))
            return False
        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "s3"
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package"
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id
        s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point
        env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key
    else:

Comment on lines 216 to 232
def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
    """
    Generates a clp.properties for worker config.
    :param worker_config_template_path:
    """
    clp_properties_path = worker_config_template_path / "clp.properties"
    config_options = ["connector.name=clp"]
    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
        config_options.append("clp.storage-type=s3")
        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
    with clp_properties_path.open("w", encoding="utf-8") as f:
        f.write("\n".join(config_options) + "\n")


💡 Verification agent

🧩 Analysis chain

Do not write plaintext AWS credentials to disk in the repo; generate a template with env placeholders instead.

This writes clp.s3-access-key-id and clp.s3-secret-access-key values into tools/deployment/presto-clp/worker/config-template/clp.properties on the host filesystem. That increases the blast radius (accidental commits, backups, local-machine compromise). Keep secrets in the .env and expand them inside the container during startup.

Apply this minimal change to avoid persisting secrets and to prevent KeyError when storage type is fs:

 def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
@@
-    clp_properties_path = worker_config_template_path / "clp.properties"
-    config_options = ["connector.name=clp"]
-    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
-        config_options.append("clp.storage-type=s3")
-        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
-        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
-        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
-        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
-    with clp_properties_path.open("w", encoding="utf-8") as f:
+    clp_properties_path = worker_config_template_path / "clp.properties"
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    config_options = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
+        # Intentionally reference env vars; secrets are expanded inside the container.
+        config_options.append("clp.storage-type=s3")
+        config_options.append("clp.s3-auth-provider=${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}")
+        config_options.append("clp.s3-access-key-id=${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}")
+        config_options.append("clp.s3-end-point=${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}")
+        config_options.append("clp.s3-secret-access-key=${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}")
+    with clp_properties_path.open("w", encoding="utf-8") as f:
         f.write("\n".join(config_options) + "\n")
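Inside the container, those `${…}` placeholders can then be expanded from the environment at startup. A minimal bash sketch of that expansion step (the `expand_placeholders` helper and the variable name are illustrative, not part of the PR):

```shell
#!/usr/bin/env bash
set -euo pipefail

# In the container, the secret comes from the env file; never from the repo.
export S3_ACCESS_KEY_ID="AKIAEXAMPLE"

# Expand ${VAR}-style placeholders in a single properties line using the
# process environment. Does not handle values that themselves contain ${…}.
expand_placeholders() {
    local out=$1 var
    while [[ $out =~ \$\{([A-Za-z_][A-Za-z_0-9]*)\} ]]; do
        var=${BASH_REMATCH[1]}
        out=${out//"\${${var}}"/${!var}}
    done
    printf '%s\n' "$out"
}

expand_placeholders 'clp.s3-access-key-id=${S3_ACCESS_KEY_ID}'
```

In practice this would run over each line of the generated template before Presto reads it, so only the in-container rendered file ever holds the real value.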

To ensure we’re not already tracking this file, please run:


🏁 Script executed:

#!/bin/bash
# Verify the generated template path is ignored and not tracked
rg -n '^tools/deployment/presto-clp/worker/config-template/clp\.properties$' .gitignore || true
git ls-files --error-unmatch tools/deployment/presto-clp/worker/config-template/clp.properties && echo "TRACKED" || echo "NOT_TRACKED"

Length of output: 282


Untrack clp.properties and use placeholder template for secrets

The file tools/deployment/presto-clp/worker/config-template/clp.properties is currently tracked in Git, which risks committing real AWS credentials. You must remove it from version control, add it to .gitignore, and update the generation logic to emit placeholders instead of actual secret values.

• tools/deployment/presto-clp/worker/config-template/clp.properties
– Run git rm --cached tools/deployment/presto-clp/worker/config-template/clp.properties
– Add tools/deployment/presto-clp/worker/config-template/clp.properties to .gitignore
• tools/deployment/presto-clp/scripts/init.py (around lines 216–232)
– Default PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE to "fs" via env_vars.get(...) to prevent KeyError
– In the "s3" branch, write literal ${…} placeholders instead of expanding AWS secret env vars
– Ensure the target directory exists with mkdir(parents=True, exist_ok=True)

Minimal diff to apply:

 def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
-    clp_properties_path = worker_config_template_path / "clp.properties"
-    config_options = ["connector.name=clp"]
-    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
-        config_options.append("clp.storage-type=s3")
-        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
-        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
-        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
-        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
-    with clp_properties_path.open("w", encoding="utf-8") as f:
+    clp_properties_path = worker_config_template_path / "clp.properties"
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    config_options = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
+        # Use placeholders for secrets; expand them at container startup
+        config_options.append("clp.storage-type=s3")
+        config_options.append("clp.s3-auth-provider=${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}")
+        config_options.append("clp.s3-access-key-id=${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}")
+        config_options.append("clp.s3-end-point=${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}")
+        config_options.append("clp.s3-secret-access-key=${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}")
+    with clp_properties_path.open("w", encoding="utf-8") as f:
         f.write("\n".join(config_options) + "\n")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
    """
    Generates a clp.properties for worker config.

    :param worker_config_template_path:
    """
    clp_properties_path = worker_config_template_path / "clp.properties"
    config_options = ["connector.name=clp"]
    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
        config_options.append("clp.storage-type=s3")
        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
    with clp_properties_path.open("w", encoding="utf-8") as f:
        f.write("\n".join(config_options) + "\n")
def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
    """
    Generates a clp.properties for worker config.

    :param worker_config_template_path:
    """
    clp_properties_path = worker_config_template_path / "clp.properties"
    worker_config_template_path.mkdir(parents=True, exist_ok=True)
    config_options = ["connector.name=clp"]
    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
    if storage_type == "s3":
        # Use placeholders for secrets; expand them at container startup
        config_options.append("clp.storage-type=s3")
        config_options.append("clp.s3-auth-provider=${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}")
        config_options.append("clp.s3-access-key-id=${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}")
        config_options.append("clp.s3-end-point=${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}")
        config_options.append("clp.s3-secret-access-key=${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}")
    with clp_properties_path.open("w", encoding="utf-8") as f:
        f.write("\n".join(config_options) + "\n")
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 216–232, the
function generating clp.properties must avoid writing real secrets and handle
missing env keys and missing directories: change lookups to use
env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs") to default to
"fs" and avoid KeyError; when storage type is "s3" write literal placeholder
tokens (e.g. '${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}') rather than
interpolating env_vars values; ensure the parent directory for
clp_properties_path exists before opening the file by calling
mkdir(parents=True, exist_ok=True); additionally untrack the existing
tools/deployment/presto-clp/worker/config-template/clp.properties with git rm
--cached and add that path to .gitignore so real credentials are not committed.

local file_path=$1
local key=$2
local value=$3
local sensitive=${4:-false}

🧹 Nitpick (assertive)

Document the new "sensitive" parameter in update_config_file.

The function signature gained a fourth parameter, but the header comment still lists only three. Add the param to avoid misuse by future callers.

 # @param $1 Path to the properties file.
 # @param $2 The key to set.
 # @param $3 The value to set.
+# @param $4 (optional) "true" to mask the value when logging
📝 Committable suggestion


Suggested change
local sensitive=${4:-false}
# Sets/updates the given kv-pair in the given properties file.
#
# @param $1 Path to the properties file.
# @param $2 The key to set.
# @param $3 The value to set.
# @param $4 (optional) "true" to mask the value when logging
update_config_file() {
    local file_path=$1
    local key=$2
    local value=$3
    local sensitive=${4:-false}
}
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker/scripts/generate-configs.sh around line
51, the function update_config_file now accepts a fourth parameter "sensitive"
but the header comment only documents three parameters; update that header to
include the new "sensitive" parameter (name: sensitive, purpose: whether the
value is sensitive and should be handled/hidden, type: boolean/string, default:
false) and describe its default behavior so future callers know how to pass and
interpret it. Ensure the param description matches existing comment style and
mentions the default value and any special handling (e.g., masking or skipping
echoing) used by the function.
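For illustration, a masking pattern along the lines the comment asks to document might look like this (the `log_kv` helper and its message format are hypothetical, not taken from generate-configs.sh):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Log a key=value pair, masking the value when the caller marks it sensitive.
#
# @param $1 The key being set.
# @param $2 The value being set.
# @param $3 (optional) "true" to mask the value when logging; defaults to false.
log_kv() {
    local key=$1 value=$2 sensitive=${3:-false}
    if [[ "$sensitive" == "true" ]]; then
        value="********"
    fi
    echo "Set ${key}=${value}"
}

log_kv "node.internal-address" "worker-0"
log_kv "clp.s3-secret-access-key" "supersecret" true
```

Callers then pass `true` only for secret-bearing keys, so logs stay useful without leaking credentials.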

Comment on lines 53 to 57
if grep --quiet "^${key}=.*$" "$file_path"; then
sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
else
echo "${key}=${value}" >>"$file_path"
fi

🛠️ Refactor suggestion

Make key/value matching and replacement regex-safe; avoid failures on missing files.

Keys like node.internal-address contain dots, which are regex metacharacters, so grep/sed with unescaped keys can produce false positives. Also, grep on a non-existent file will fail under set -e. Guard the file, escape the key so it matches literally while keeping the ^ anchor, and escape replacement metacharacters in the value.

-    if grep --quiet "^${key}=.*$" "$file_path"; then
-        sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
-    else
-        echo "${key}=${value}" >>"$file_path"
-    fi
+    # Ensure the file exists to avoid grep/sed failures under `set -e`
+    [[ -f "$file_path" ]] || : >"$file_path"
+    # Escape BRE metacharacters in the key so dots, brackets, etc. match
+    # literally (`grep -F` won't do here: the `^` anchor must stay a regex).
+    local escaped_key
+    escaped_key=$(printf '%s' "$key" | sed -e 's/[][\.*^$]/\\&/g')
+    if grep --quiet "^${escaped_key}=" "$file_path"; then
+        # Escape replacement metacharacters in value for sed
+        local escaped_value=${value//\\/\\\\}
+        escaped_value=${escaped_value//|/\\|}
+        escaped_value=${escaped_value//&/\\&}
+        sed --in-place "s|^${escaped_key}=.*|${key}=${escaped_value}|" "$file_path"
+    else
+        echo "${key}=${value}" >>"$file_path"
+    fi

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker/scripts/generate-configs.sh around lines
53-57, the current grep/sed logic treats the key as a regex and will fail when
keys contain regex metacharacters and when the target file is missing; also the
replacement value can break sed if it contains &, \ or /. Fix by first ensuring
the file exists (touch the file_path if missing), use grep --quiet
--fixed-strings (or grep -qF) to check for the literal key, and when calling sed
escape the key so it is matched literally (escape regex metacharacters such as
.[]*^$) or build the sed command from a safely escaped key; additionally
escape/quote the replacement value for sed (escape &, \ and the chosen
delimiter) before passing it to sed --in-place so the substitution is robust.
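The false positive being warned about is easy to demonstrate: in a regex, the unescaped dot in `node.internal-address` matches any character, so an unrelated key matches too. A small shell demo (the file path is arbitrary):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A properties file containing a key that is NOT node.internal-address.
printf 'nodeXinternal-address=10.0.0.1\n' > /tmp/demo.properties

# Unescaped dot matches the 'X': false positive (prints 1).
grep -c '^node.internal-address=' /tmp/demo.properties

# Escaped dot matches the key literally: no hit (prints 0).
grep -c '^node\.internal-address=' /tmp/demo.properties || true
```

The `|| true` is needed because `grep` exits non-zero when nothing matches, which would abort the script under `set -e`.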

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/src/user-guide/guides-using-presto.md (1)

116-127: Tiny grammar nit.

-    * To use more than Presto worker, you can use the `--scale` option as follows:
+    * To use more than one Presto worker, use the `--scale` option:
♻️ Duplicate comments (12)
docs/src/user-guide/guides-using-presto.md (4)

96-101: Confirm split-filter keys and briefly explain lower/upperBound mapping.

Double-check the connector expects customOptions.rangeMapping.lowerBound/upperBound with values begin_timestamp/end_timestamp. If correct, add one clarifying sentence after the snippet.

       }
       ```
+      The lowerBound and upperBound map your dataset’s timestamp fields to CLP’s begin and end columns respectively.

Also verify the reference link label below still points to the split-filter section (it currently ends with “#metadata-filter-config-file”). If not, update it to the split-filter anchor.


88-88: Use a colon to introduce the following instruction; minor readability fix.

Keeps the bullets as a single instruction flow.

-    * Open and edit `coordinator/config-template/split-filter.json`.
+    * Open and edit `coordinator/config-template/split-filter.json`:

168-173: Avoid reserved/non-existent identifier in example query.

user may not exist and is reserved in some SQL dialects. Provide a safe, copy‑pasteable example.

-To query the logs in this dataset:
+To query the logs in this dataset:

-```sql
-SELECT user FROM default LIMIT 1;
-```
+```sql
+-- Replace <columnName> with one from the DESCRIBE output:
+SELECT "<columnName>" FROM default LIMIT 1;
+```

179-179: Remove extra blank line to satisfy markdownlint MD012.

-
-
+
tools/deployment/presto-clp/scripts/init.py (8)

10-21: Avoid double-configuring the root logger; use basicConfig with guard.

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )

55-57: Handle YAML parse errors and read as UTF‑8.

Avoid unhandled stack traces on malformed clp-config.yml.

-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1

71-75: Create parent dir, write with UTF‑8, and restrict permissions (secrets in env).

-    with open(output_file, "w") as output_file_handle:
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
         for key, value in env_vars.items():
             output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)
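If the brief window between writing the file and tightening its permissions matters, the file can instead be created with mode 0600 from the start. A sketch under that assumption (the path and env-var contents are illustrative):

```python
import os
from pathlib import Path
from typing import Dict


def write_env_file(path: Path, env_vars: Dict[str, str]) -> None:
    """Create the env file owner-readable only, from the moment it exists,
    avoiding the window left by writing first and chmod-ing afterwards."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # O_CREAT applies the 0o600 mode at creation time (subject to umask).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for key, value in env_vars.items():
            f.write(f"{key}={value}\n")


env_path = Path("/tmp/presto-clp.env")
env_path.unlink(missing_ok=True)  # ensure the 0o600 mode is actually applied
write_env_file(env_path, {"CLP_ARCHIVES_DIR": "/var/data/archives"})
```

Note that the mode only applies when the file is created; an existing file keeps its permissions, hence the `unlink` in the sketch.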

81-88: Docstring: explicitly document params and return.

     """
     Adds environment variables for CLP config values to `env_vars`.
 
-    :param clp_config:
-    :param clp_package_dir:
-    :param env_vars:
-    :return: Whether the environment variables were successfully added.
+    :param clp_config: Parsed contents of etc/clp-config.yml.
+    :param clp_package_dir: Root directory of the CLP package.
+    :param env_vars: Output map mutated in-place with env vars.
+    :return: True on success; False if validation fails.

112-121: Set storage-type env for fs to keep invariants consistent.

     if "fs" == clp_archive_output_storage_type:
+        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "fs"
         env_vars["CLP_ARCHIVES_DIR"] = str(

141-148: Handle credentials.yml parse errors and read as UTF‑8.

-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False

193-199: Use S3 service endpoint (not bucket-host) and avoid TLS pitfalls with dotted buckets.

-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    # Use region service endpoint; bucket is configured separately by the connector.
+    s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"

Optional: add an override knob (e.g., PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT_OVERRIDE) for GovCloud/China or S3‑compatible services.
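To make the two endpoint styles concrete, a small sketch (region and bucket names are made up; the real connector property is `clp.s3-end-point`):

```python
def service_endpoint(region: str) -> str:
    # Region service endpoint: the bucket is not part of the hostname, so
    # TLS validation against *.s3.<region>.amazonaws.com is unaffected by
    # dots in the bucket name.
    return f"https://s3.{region}.amazonaws.com/"


def virtual_hosted_endpoint(bucket: str, region: str) -> str:
    # Bucket-qualified host: a dotted bucket such as "my.logs" produces an
    # extra subdomain level, which fails wildcard TLS certificate validation.
    return f"https://{bucket}.s3.{region}.amazonaws.com/"


print(service_endpoint("us-east-2"))
print(virtual_hosted_endpoint("my.logs", "us-east-2"))
```

With the service endpoint, the bucket is supplied separately through the connector's own configuration rather than baked into the host.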


239-256: Don’t write plaintext AWS secrets to clp.properties; default fs storage; ensure dir exists.

Writing secrets to disk expands blast radius and risks accidental commits. Emit env placeholders and expand in‑container; also avoid KeyError on fs and create the target directory.

-    properties = ["connector.name=clp"]
-    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
+    properties = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
         property_name_to_env_var_name = {
             "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
-            "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
-            "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
-            "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
-            "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
+            # Expand inside container; keep secrets out of the repo and host FS.
+            "clp.s3-auth-provider": "${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}",
+            "clp.s3-access-key-id": "${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}",
+            "clp.s3-end-point": "${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}",
+            "clp.s3-secret-access-key": "${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}",
         }
 
-        for property_name, env_var_name in property_name_to_env_var_name.items():
-            env_var_value = env_vars.get(env_var_name)
-            if not env_var_value:
-                logger.error("Internal error: Missing required env var '%s'", env_var_name)
-                return False
-            properties.append(f"{property_name}={env_var_value}")
+        for property_name, env_template in property_name_to_env_var_name.items():
+            properties.append(f"{property_name}={env_template}")
+    else:
+        properties.append("clp.storage-type=fs")
 
-    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
+    (worker_config_template_path).mkdir(parents=True, exist_ok=True)
+    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
         f.write("\n".join(properties) + "\n")

Follow‑up: ensure tools/deployment/presto-clp/worker/config-template/clp.properties is .gitignored and not tracked.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9842ef6 and 72ae05b.

📒 Files selected for processing (2)
  • docs/src/user-guide/guides-using-presto.md (5 hunks)
  • tools/deployment/presto-clp/scripts/init.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • docs/src/user-guide/guides-using-presto.md
🪛 LanguageTool
docs/src/user-guide/guides-using-presto.md

[typographical] ~89-~89: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...

(DURING_THAT_TIME_COMMA)

🔇 Additional comments (6)
docs/src/user-guide/guides-using-presto.md (6)

58-66: S3 guidance is good; add merge-blocker note for worker image dependency.

Given this feature requires the Presto worker image with ca-certificates (y-scope/presto#55), add a short callout so users aren’t blocked if they follow this guide before that image is available. Example: “Ensure you’re using a worker image that includes ca-certificates.”

Would you like me to propose the exact wording once the image tag is known?


155-159: Good note.

Clearly explains the default dataset behaviour.


160-167: LGTM on DESCRIBE example.


174-177: Known issue note is helpful.

Clear warning about SELECT *.


197-197: Link likely points to old metadata-filter section; update to split-filter anchor.

Ensure [clp-connector-docs] targets split-filter config (not metadata-filter).

If needed, I can update the anchor once you confirm the canonical URL.


204-204: Reference looks good.

Links to the Velox issue for the SELECT * crash.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (6)
tools/deployment/presto-clp/scripts/init.py (6)

10-21: Avoid double-configuring the root logger when imported.

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )

55-57: Handle YAML parse errors and read clp-config.yml as UTF‑8.

-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1

71-74: Create parent dir, write with UTF‑8, and chmod 0600 for the env file (contains secrets).

-    with open(output_file, "w") as output_file_handle:
-        for key, value in env_vars.items():
-            output_file_handle.write(f"{key}={value}\n")
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
+        for key, value in env_vars.items():
+            output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)

112-121: Set storage-type env for fs to preserve invariants and avoid downstream KeyError.

     if "fs" == clp_archive_output_storage_type:
-        env_vars["CLP_ARCHIVES_DIR"] = str(
+        env_vars["CLP_ARCHIVES_DIR"] = str(
             _get_path_clp_config_value(
                 clp_config,
                 f"{archive_output_storage_key}.directory",
                 Path("var") / "data" / "archives",
                 clp_package_dir,
             )
         )
+        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "fs"

141-148: Handle credentials.yml parse errors and read as UTF‑8.

-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False

170-191: S3: log missing keys and use the service endpoint (not bucket-qualified) to avoid TLS issues with dotted bucket names.

-    except KeyError:
-        return False
+    except KeyError as e:
+        logger.error("Missing required S3 config key: %s", e)
+        return False
@@
-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    # Prefer service endpoint; avoids TLS problems with dotted bucket names and works across partitions.
+    s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"

Please confirm the CLP Presto connector expects a service endpoint for clp.s3-end-point and does not require a bucket-qualified host.

Also applies to: 196-198

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 72ae05b and 34617b9.

📒 Files selected for processing (2)
  • docs/src/user-docs/guides-using-presto.md (5 hunks)
  • tools/deployment/presto-clp/scripts/init.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T13:35:33.366Z
Learnt from: anlowee
PR: y-scope/clp#1228
File: tools/deployment/presto-clp/scripts/init.py:298-306
Timestamp: 2025-09-02T13:35:33.366Z
Learning: In the CLP codebase, all configuration values loaded from YAML files are string values, so functions that retrieve config values should use Optional[str] rather than Optional[Any] for type hints.

Applied to files:

  • tools/deployment/presto-clp/scripts/init.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • docs/src/user-docs/guides-using-presto.md
🪛 LanguageTool
docs/src/user-docs/guides-using-presto.md

[typographical] ~89-~89: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...

(DURING_THAT_TIME_COMMA)

🔇 Additional comments (4)
docs/src/user-docs/guides-using-presto.md (3)

58-66: S3 setup guidance reads well and scopes auth support clearly — LGTM.


155-158: Querying section improvements are solid; the SELECT * warning is especially helpful.

Also applies to: 160-167, 168-177


88-101: Keep existing anchor
The Presto connector docs only provide a “Metadata Filter Config File” section (anchor #metadata-filter-config-file); there is no “split-filter-config” anchor. Leave the link pointing to #metadata-filter-config-file as is.

Likely an incorrect or invalid review comment.

tools/deployment/presto-clp/scripts/init.py (1)

298-306: Type hint Optional[str] for _get_config_value matches the codebase convention — LGTM.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (8)
tools/deployment/presto-clp/scripts/init.py (8)

10-20: Guard logging setup to prevent duplicate handlers when imported

Use basicConfig only if no handlers are present.

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )

55-57: Handle YAML parse errors and read clp-config.yml as UTF‑8

Avoids noisy stack traces on malformed YAML.

-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1

71-74: Ensure parent dir exists, write as UTF‑8, and chmod 0600 (env contains secrets)

Prevents failures on missing dirs and reduces secret exposure.

-    with open(output_file, "w") as output_file_handle:
-        for key, value in env_vars.items():
-            output_file_handle.write(f"{key}={value}\n")
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
+        for key, value in env_vars.items():
+            output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)

81-88: Docstring: document params and return contract

Clarify usage for maintainers.

     """
-    Adds environment variables for CLP config values to `env_vars`.
-
-    :param clp_config:
-    :param clp_package_dir:
-    :param env_vars:
-    :return: Whether the environment variables were successfully added.
+    Adds environment variables for CLP config values to env_vars.
+
+    :param clp_config: Parsed contents of etc/clp-config.yml.
+    :param clp_package_dir: Root of the CLP package; used to resolve relative paths.
+    :param env_vars: Output map mutated in-place.
+    :return: True on success; False if validation fails.
     """

112-121: Set storage-type env for fs to keep invariants consistent

Downstream code can rely on this always being present.

-    if "fs" == clp_archive_output_storage_type:
-        env_vars["CLP_ARCHIVES_DIR"] = str(
+    if "fs" == clp_archive_output_storage_type:
+        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "fs"
+        env_vars["CLP_ARCHIVES_DIR"] = str(

146-148: Handle YAML parse errors for credentials.yml and read as UTF‑8

Fail gracefully with a clear message.

-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False

196-198: Use S3 service endpoint (avoid bucket‑qualified host to prevent TLS issues with dotted bucket names)

Prefer region service endpoint; bucket should not be embedded in the host.

-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com"
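A minimal sketch of the endpoint rule being suggested (the helper name is hypothetical): the regional service endpoint carries no bucket, so a dotted bucket name can never end up in the TLS hostname, where it would fail wildcard certificate matching against `*.s3.<region>.amazonaws.com`.

```python
def build_s3_service_endpoint(region_code: str) -> str:
    # Regional service endpoint only; the bucket is supplied separately so a
    # dotted bucket name (e.g. "logs.example.com") never appears in the host.
    return f"https://s3.{region_code}.amazonaws.com"
```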

239-258: Do not write plaintext AWS secrets to clp.properties; write ${ENV} placeholders, default storage-type, and ensure dir exists

Reduces secret exposure and fixes KeyError on fs.

-    properties = ["connector.name=clp"]
-    if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
+    properties = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
         property_name_to_env_var_name = {
             "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
             "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
             "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
             "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
             "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
         }
 
         for property_name, env_var_name in property_name_to_env_var_name.items():
             env_var_value = env_vars.get(env_var_name)
             if not env_var_value:
                 logger.error("Internal error: Missing required env var '%s'", env_var_name)
                 return False
-            properties.append(f"{property_name}={env_var_value}")
+            properties.append(f"{property_name}=${{{env_var_name}}}")
+    else:
+        properties.append("clp.storage-type=fs")
 
-    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
         f.write("\n".join(properties) + "\n")
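The placeholder approach can be sketched as follows (the function name and the reduced property map are illustrative; this assumes the worker's config loader expands `${ENV_VAR}` references at startup):

```python
from typing import Dict


def render_worker_properties(env_vars: Dict[str, str]) -> str:
    # Emit ${ENV_VAR} placeholders instead of plaintext values so secrets are
    # resolved inside the container rather than persisted in clp.properties.
    properties = ["connector.name=clp"]
    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
    if storage_type == "s3":
        for prop, env_name in {
            "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
            "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
            "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
        }.items():
            properties.append(f"{prop}=${{{env_name}}}")
    else:
        properties.append("clp.storage-type=fs")
    return "\n".join(properties) + "\n"
```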
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 34617b9 and b3694b7.

📒 Files selected for processing (1)
  • tools/deployment/presto-clp/scripts/init.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-02T13:35:33.366Z
Learnt from: anlowee
PR: y-scope/clp#1228
File: tools/deployment/presto-clp/scripts/init.py:298-306
Timestamp: 2025-09-02T13:35:33.366Z
Learning: In the CLP codebase, all configuration values loaded from YAML files are string values, so functions that retrieve config values should use Optional[str] rather than Optional[Any] for type hints.

Applied to files:

  • tools/deployment/presto-clp/scripts/init.py
🔇 Additional comments (2)
tools/deployment/presto-clp/scripts/init.py (2)

183-189: Good: both access_key_id and secret_access_key are now required

Prevents late failures due to missing secrets.


298-306: Good: _get_config_value() returns Optional[str] per CLP config conventions

Matches the project learning that YAML values are strings.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/src/user-docs/guides-using-presto.md (2)

114-121: Add a short note about S3 TLS prereq (CA certs in worker image).

Users on S3 may hit TLS failures without CA certificates in the Presto worker image. Add a warning to prevent confusion until the dependent image is available.

Apply after the compose command block:

     docker compose up
     ```
 
+    :::{warning}
+    If you are using S3 archives over HTTPS, ensure your Presto worker image includes CA certificates; otherwise S3 requests may fail with TLS errors.
+    :::

197-197: Update docs anchor to split-filter section.

The link still targets #metadata-filter-config-file. With the move to split filters, point to the split-filter section.

Apply:

-[clp-connector-docs]: https://docs.yscope.com/presto/connector/clp.html#metadata-filter-config-file
+[clp-connector-docs]: https://docs.yscope.com/presto/connector/clp.html#split-filter-config-file
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b3694b7 and 6bcacfb.

📒 Files selected for processing (1)
  • docs/src/user-docs/guides-using-presto.md (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • docs/src/user-docs/guides-using-presto.md
🪛 LanguageTool
docs/src/user-docs/guides-using-presto.md

[typographical] ~89-~89: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...

(DURING_THAT_TIME_COMMA)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-docs/guides-using-presto.md

190-190: Reference links and images should use a label that is defined
Missing link or image reference definition: "y-scope/velox#27"

(MD052, reference-links-images)

🔇 Additional comments (4)
docs/src/user-docs/guides-using-presto.md (4)

88-101: Confirm split-filter JSON shape (customOptions.rangeMapping).

Double-check that customOptions.rangeMapping.lowerBound|upperBound is the expected nesting for the current CLP connector; older examples placed rangeMapping at the top level.

Would you like me to verify against the latest connector docs and propose an adjusted example if needed?
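To make the ambiguity concrete, here are the two candidate shapes side by side (field names are assumptions drawn from the doc example, not the connector's confirmed schema):

```python
import json

# Shape used in the updated doc: rangeMapping nested under customOptions.
nested = {
    "default": [{
        "columnName": "timestamp",
        "customOptions": {
            "rangeMapping": {"lowerBound": "begin_timestamp", "upperBound": "end_timestamp"}
        },
    }]
}

# Shape used in older examples: rangeMapping at the top level of the filter.
top_level = {
    "default": [{
        "columnName": "timestamp",
        "rangeMapping": {"lowerBound": "begin_timestamp", "upperBound": "end_timestamp"},
    }]
}

print(json.dumps(nested, indent=2))
```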


155-158: Good addition: default dataset note is clear.


160-167: LGTM: switch to DESCRIBE and clarify schema.


168-177: LGTM: warn about SELECT * crash with reference.


The Presto CLP integration has the following limitations at present:

* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#27].
Contributor

⚠️ Potential issue

Fix broken Velox reference (build failure under MD052).

Text cites [y-scope/velox#27] but only #28 is defined below; also your warning above references #28. Unify to #28.

Apply:

-* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#27].
+* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#28].
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#27].
* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#28].
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

190-190: Reference links and images should use a label that is defined
Missing link or image reference definition: "y-scope/velox#27"

(MD052, reference-links-images)

🤖 Prompt for AI Agents
In docs/src/user-docs/guides-using-presto.md around line 190, the markdown
reference uses [y-scope/velox#27] but the defined reference and the warning
above use #28; update the inline citation to [y-scope/velox#28] so it matches
the defined reference and resolves the MD052 build failure.

kirkrodrigues
kirkrodrigues previously approved these changes Sep 3, 2025
Member

@kirkrodrigues kirkrodrigues left a comment

For the PR title, how about:

feat(presto-clp): Update docs and config generation to support reading archives from S3 and Presto split-filtering config.

[credentials](guides-using-object-storage/clp-config.md#credentials) authentication type.
:::

4. Continue following the [quick-start](./quick-start/index.md#using-clp) guide to start CLP and
Contributor Author

can we use quick-start/index.md#using-clp? Because at L63 it is guides-using-object-storage/clp-config.md#credentials

Member

Sure

To query the logs in this dataset:

```sql
SELECT * FROM default LIMIT 1;
Contributor Author

Do we need to mention the limitation that it doesn't work for postgresql (or potentially other datasets)?

Member

It does work for postgresql. The problem was that we needed to update the worker image to work with the latest clp. For elasticsearch, it works, but @timestamp shows up as NULL. That you'll need to verify on your end. I haven't tested other datasets. Does it not work for them?

Contributor Author

For mongodb (the original dataset, without preprocessing), the query failed immediately (`Query is gone?` in presto-cli).
For the cockroachdb and spark datasets:

presto:default> select * from datacockroachdb limit 1;
Query 20250903_152806_00005_35wkg failed:  Failed to parse type [hb]. Type not registered.

presto:default> select * from dataspark limit 1;
Query is gone (server restarted?)

Member

Do we have issues for all of the bugs causing those failures?


@anlowee anlowee changed the title chore: Update the docker compose config generator to support S3 configuration and latest split filter configuration for Presto. feat(presto-clp): Update docs and config generation to support reading archives from S3 and Presto split-filtering config. Sep 3, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/src/user-docs/guides-using-presto.md (2)

89-107: Clarify JSON example and point to full schema.

Consider adding a short sentence clarifying that the configured key must match the CLP timestamp key for that dataset, and that lowerBound/upperBound should be the CLP column names begin_timestamp/end_timestamp as shown. The rest looks correct.


135-137: Prefer docker compose down --volumes for clean-up.

rm won’t remove networks/volumes; down --volumes fully tears down the stack.

Apply:

-docker compose rm
+docker compose down --volumes
♻️ Duplicate comments (6)
tools/deployment/presto-clp/scripts/init.py (6)

146-153: Handle YAML parse errors and read as UTF‑8 (credentials.yml).

Fail gracefully on malformed credentials.

Apply:

-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False

55-57: Handle YAML parse errors and read as UTF‑8 (clp-config.yml).

Avoids stack traces on malformed YAML.

Apply:

-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1

10-21: Avoid double-configuring the root logger (use basicConfig with a guard).

Prevents duplicate handlers when imported.

Apply:

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )

71-75: Create parent dir, write with UTF‑8, and restrict permissions (env file contains secrets).

Sets sane defaults for secret material.

Apply:

-    with open(output_file, "w") as output_file_handle:
-        for key, value in env_vars.items():
-            output_file_handle.write(f"{key}={value}\n")
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
+        for key, value in env_vars.items():
+            output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)

314-317: Type hints: return Optional[str] for config values.

CLP YAML values are strings; tighten types for clarity.

Apply:

-def _get_config_value(
-    config: Dict[str, Any], key: str, default_value: Optional[Any] = None
-) -> Optional[Any]:
+def _get_config_value(
+    config: Dict[str, Any], key: str, default_value: Optional[str] = None
+) -> Optional[str]:

209-214: Use S3 service endpoint (not bucket-qualified) and emit the bucket separately.

Avoids TLS issues with dotted bucket names and supports non-standard partitions.

Apply:

     env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package"
     env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id
-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
-    env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point
     env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key
+    env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET"] = s3_bucket
+    # Use the regional S3 service endpoint.
+    env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = f"https://s3.{s3_region_code}.amazonaws.com"
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between bc90d7f and 2306ba1.

📒 Files selected for processing (2)
  • docs/src/user-docs/guides-using-presto.md (7 hunks)
  • tools/deployment/presto-clp/scripts/init.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T13:35:33.396Z
Learnt from: anlowee
PR: y-scope/clp#1228
File: tools/deployment/presto-clp/scripts/init.py:298-306
Timestamp: 2025-09-02T13:35:33.396Z
Learning: In the CLP codebase, all configuration values loaded from YAML files are string values, so functions that retrieve config values should use Optional[str] rather than Optional[Any] for type hints.

Applied to files:

  • tools/deployment/presto-clp/scripts/init.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • docs/src/user-docs/guides-using-presto.md
🪛 LanguageTool
docs/src/user-docs/guides-using-presto.md

[typographical] ~90-~90: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...

(DURING_THAT_TIME_COMMA)

🔇 Additional comments (6)
docs/src/user-docs/guides-using-presto.md (6)

19-19: Confirm minimum CLP version aligns with split-filter support.

Please verify v0.5.0 is indeed the minimum that includes split-filter support required by this doc and the new templates. If not, bump accordingly.


58-65: LGTM: S3/object-storage guidance and closed note block.

The guidance reads clearly and the admonition is properly closed; rendering/numbering should be stable.


66-66: LGTM: Quick-start anchor fix.

The relative anchor to quick-start usage looks correct.


121-128: LGTM: Scaling workers wording.

Clearer phrasing and example.


156-160: LGTM: Default dataset note is accurate and closed.


161-168: LGTM: Describe guidance for default and non-default datasets.

Comment on lines +104 to +110
    database_host = _get_config_value(clp_config, "database.host", "localhost")
    database_port = _get_config_value(clp_config, "database.port", 3306)
    database_name = _get_config_value(clp_config, "database.name", "clp-db")
    env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL"] = (
        f"jdbc:mysql://{database_host}:{database_port}"
    )
    env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME"] = database_name
Contributor

🧹 Nitpick (assertive)

Keep config values as strings; adjust default port type.

Config values are strings in CLP; pass the default port as a string for consistency.

Apply:

-database_port = _get_config_value(clp_config, "database.port", 3306)
+database_port = _get_config_value(clp_config, "database.port", "3306")
📝 Committable suggestion


Suggested change
    database_host = _get_config_value(clp_config, "database.host", "localhost")
    database_port = _get_config_value(clp_config, "database.port", 3306)
    database_name = _get_config_value(clp_config, "database.name", "clp-db")
    env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL"] = (
        f"jdbc:mysql://{database_host}:{database_port}"
    )
    env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME"] = database_name
    database_host = _get_config_value(clp_config, "database.host", "localhost")
    database_port = _get_config_value(clp_config, "database.port", "3306")
    database_name = _get_config_value(clp_config, "database.name", "clp-db")
    env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL"] = (
        f"jdbc:mysql://{database_host}:{database_port}"
    )
    env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME"] = database_name
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 104 to 110, the CLP
config values are treated as strings so the default database port should be a
string as well; change the call _get_config_value(clp_config, "database.port",
3306) to use "3306" as the default string (so database_port is always a string)
and ensure any downstream concatenation or usage treats database_port as a
string.

Comment on lines +233 to +239
    except KeyError as e:
        logger.error(
            "Missing required key %s in '%s'",
            e,
            coordinator_common_env_file_path,
        )
        return False
Contributor

🧹 Nitpick (assertive)

Log the missing key name without extra quotes.

Improves readability of error logs.

Apply:

-except KeyError as e:
-    logger.error(
-        "Missing required key %s in '%s'",
-        e,
-        coordinator_common_env_file_path,
-    )
+except KeyError as e:
+    missing_key = e.args[0] if e.args else str(e)
+    logger.error(
+        "Missing required key %s in '%s'",
+        missing_key,
+        coordinator_common_env_file_path,
+    )
📝 Committable suggestion


Suggested change
    except KeyError as e:
        logger.error(
            "Missing required key %s in '%s'",
            e,
            coordinator_common_env_file_path,
        )
        return False
    except KeyError as e:
        missing_key = e.args[0] if e.args else str(e)
        logger.error(
            "Missing required key %s in '%s'",
            missing_key,
            coordinator_common_env_file_path,
        )
        return False
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 233 to 239, the
logger currently logs the KeyError object which prints the missing key with
extra quotes; replace the logged argument with the raw key name (use e.args[0])
so the message reads the key without surrounding quotes, e.g. pass e.args[0] to
logger.error instead of e.

Comment on lines +254 to +272
    properties = ["connector.name=clp"]
    if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
        property_name_to_env_var_name = {
            "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
            "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
            "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
            "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
            "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
        }

        for property_name, env_var_name in property_name_to_env_var_name.items():
            env_var_value = env_vars.get(env_var_name)
            if not env_var_value:
                logger.error("Internal error: Missing required env var '%s'", env_var_name)
                return False
            properties.append(f"{property_name}={env_var_value}")

    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
        f.write("\n".join(properties) + "\n")
Contributor

⚠️ Potential issue

Do not write plaintext secrets to clp.properties; write env placeholders and ensure dir exists.

Prevents secret sprawl; also adds fs default and bucket property.

Apply:

-properties = ["connector.name=clp"]
-if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
-    property_name_to_env_var_name = {
-        "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
-        "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
-        "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
-        "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
-        "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
-    }
-
-    for property_name, env_var_name in property_name_to_env_var_name.items():
-        env_var_value = env_vars.get(env_var_name)
-        if not env_var_value:
-            logger.error("Internal error: Missing required env var '%s'", env_var_name)
-            return False
-        properties.append(f"{property_name}={env_var_value}")
-
-with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
+properties = ["connector.name=clp"]
+storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+if storage_type == "s3":
+    property_name_to_env_var_name = {
+        "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
+        "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
+        "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
+        "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
+        "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
+        "clp.s3-bucket": "PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET",
+    }
+
+    for property_name, env_var_name in property_name_to_env_var_name.items():
+        if not env_vars.get(env_var_name):
+            logger.error("Internal error: Missing required env var '%s'", env_var_name)
+            return False
+        # Expand inside the container; avoid persisting secrets on disk.
+        properties.append(f"{property_name}=${{{env_var_name}}}")
+else:
+    properties.append("clp.storage-type=fs")
+
+worker_config_template_path.mkdir(parents=True, exist_ok=True)
+with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
     f.write("\n".join(properties) + "\n")
📝 Committable suggestion


Suggested change
    properties = ["connector.name=clp"]
    if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
        property_name_to_env_var_name = {
            "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
            "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
            "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
            "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
            "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
        }
        for property_name, env_var_name in property_name_to_env_var_name.items():
            env_var_value = env_vars.get(env_var_name)
            if not env_var_value:
                logger.error("Internal error: Missing required env var '%s'", env_var_name)
                return False
            properties.append(f"{property_name}={env_var_value}")
    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
        f.write("\n".join(properties) + "\n")
    properties = ["connector.name=clp"]
    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
    if storage_type == "s3":
        property_name_to_env_var_name = {
            "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
            "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
            "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
            "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
            "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
            "clp.s3-bucket": "PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET",
        }
        for property_name, env_var_name in property_name_to_env_var_name.items():
            if not env_vars.get(env_var_name):
                logger.error("Internal error: Missing required env var '%s'", env_var_name)
                return False
            # Expand inside the container; avoid persisting secrets on disk.
            properties.append(f"{property_name}=${{{env_var_name}}}")
    else:
        properties.append("clp.storage-type=fs")
    # Ensure the template directory exists before writing
    worker_config_template_path.mkdir(parents=True, exist_ok=True)
    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
        f.write("\n".join(properties) + "\n")

@anlowee anlowee merged commit de378af into y-scope:main Sep 3, 2025
7 checks passed
@anlowee anlowee deleted the xwei/s3-support-config branch September 3, 2025 18:01