feat(presto-clp): Update docs and config generation to support reading archives from S3 and Presto split-filtering config. #1228
…e the worker's catalog.
This reverts commit aa3e18d.
Walkthrough

Adds a new CLI tool that generates CLP environment variables and a worker clp.properties file from the CLP YAML config, credentials, and coordinator env files; updates the setup script to call it; switches coordinator/worker templates to split-filter config; adds a coordinator env var; updates docs for S3/object-storage and split-filter usage; and updates .gitignore.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Setup as set-up-config.sh
    participant Init as init.py
    participant CLPPkg as CLP Package (etc/, credentials.yml, coordinator-common.env)
    participant Out as Output (.env / worker clp.properties)
    User->>Setup: run set-up-config.sh
    Setup->>Init: invoke init.py --clp-package-dir --output-file
    Init->>CLPPkg: load etc/clp-config.yml
    alt missing clp-config.yml
        Init-->>Setup: log error & exit 1
    else
        Init->>CLPPkg: read credentials.yml and coordinator-common.env
        Init->>Init: validate DB type and archive storage type
        alt storage.type == s3
            Init->>CLPPkg: extract S3 config and AWS creds
            Init->>Init: add S3-specific worker properties
        else storage.type == fs
            Init->>Init: resolve filesystem archive paths
        end
        Init->>Init: build env_vars and worker clp.properties
        par write outputs
            Init->>Out: write .env (KEY=VALUE)
            Init->>Out: write worker clp.properties
        end
        Init-->>Setup: exit 0
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/deployment/presto-clp/scripts/generate-configs.py (1)
66-69: Write the .env with restrictive permissions (secrets included).

The generated file includes database credentials and S3 secrets. Ensure the file is not world-readable. Outside the selected range, adjust the write to enforce 0600:

```python
import os  # at top

# Replace the existing open/write block:
fd = os.open(output_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w") as output_file_handle:
    for key, value in env_vars.items():
        output_file_handle.write(f"{key}={value}\n")
```
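As a sanity check, the 0600 write can be exercised in isolation. This standalone sketch (not the PR's actual script) writes a KEY=VALUE file with owner-only permissions and reads back the mode; `os.open`'s mode argument is filtered by the process umask, but 0o600 has no group/other bits, so the result is stable on POSIX:

```python
import os
import stat
import tempfile

def write_env_file(path: str, env_vars: dict) -> None:
    """Write KEY=VALUE lines, creating the file with owner-only (0600) permissions."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        for key, value in env_vars.items():
            handle.write(f"{key}={value}\n")

path = os.path.join(tempfile.mkdtemp(), ".env")
write_env_file(path, {"EXAMPLE_KEY": "example-value"})
print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o600
```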
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tools/deployment/presto-clp/scripts/generate-configs.py (5 hunks)
- tools/deployment/presto-clp/scripts/set-up-config.sh (1 hunks)
- tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1 hunks)
🔇 Additional comments (9)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (2)
77-86: Lock down clp.properties permissions after writing AWS secrets

In tools/deployment/presto-clp/worker/scripts/generate-configs.sh (lines 77–86), you're persisting AWS credentials to disk. To prevent unintended access, restrict ownership and mode immediately after the update_config_file calls:

```shell
# Restrict access to the Presto process account
# Replace <user>:<group> with the user/group that runs Presto in your container
chown <user>:<group> "$CLP_PROPERTIES_FILE" || true
# Only the owner needs read/write; remove all other access
chmod 0600 "$CLP_PROPERTIES_FILE" || true
```

If your deployment requires group read, use chmod 0640 instead. Please confirm which user and group the Presto service runs under in your container setup to avoid permission errors.
77-86: Verify CLP connector S3 endpoint property name

We only found `clp.s3-end-point` in your scripts; there are no occurrences of `clp.s3-endpoint`. The upstream YScope CLP docs configure S3 via `clp-config.yml` (under `s3_config`) and don't list Presto-style properties, and Presto connectors each define their own keys in code or README. Please confirm the exact S3 endpoint property your CLP → Presto connector expects (e.g., by checking its README or source) and update accordingly:

- tools/deployment/presto-clp/worker/scripts/generate-configs.sh (line 84)
- tools/deployment/presto-clp/scripts/generate-configs.py (line 146)

Once you've verified the correct key (likely `clp.s3-endpoint` if it follows common naming), align both the property name and the `PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT` env var.

tools/deployment/presto-clp/scripts/generate-configs.py (6)
5-5: Type import update looks good. Importing Any is appropriate for the parsed YAML config structure.

31-33: ArgumentParser description change is fine. No behavioural change; improves readability.

59-59: Signature change usage LGTM. Passing clp_config into _add_clp_env_vars matches the refactor and reduces I/O duplication.

73-75: Signature refactor is appropriate. Accepting clp_config explicitly makes the helper more testable and side-effect free.

150-152: Error message is precise and user-facing. LGTM. Clear guidance for supported storage types.
116-148: Ensure S3 config validation & clean endpoint

Please verify and update the S3-related logic in tools/deployment/presto-clp/scripts/generate-configs.py:

`_get_config_value` returns `None` when a key is missing, which today gets written into the Presto .env and then into clp.properties as the literal "None". We should validate that the following values are present before setting any environment variables:

- `<s3_credentials_key_prefix>.access_key_id`
- `<s3_credentials_key_prefix>.secret_access_key`
- `<s3_config_key_prefix>.bucket`
- `<s3_config_key_prefix>.region_code`

If any are missing, log an error listing the missing keys and exit (return False).

Remove the trailing slash from the endpoint so it reads `https://{bucket}.s3.{region}.amazonaws.com` instead of ending with `/`.

Confirm whether the Presto CLP connector expects:

- A bucket-scoped endpoint (`{bucket}.s3.{region}.amazonaws.com`), or
- The service endpoint (`s3.{region}.amazonaws.com`) with the bucket supplied separately in metadata/URLs.

Suggested diff in generate-configs.py:

```diff
 elif "s3" == clp_archive_output_storage_type:
@@
-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    # Build endpoint without trailing slash
+    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com"
+    # Validate required S3 config values
+    missing = []
+    if not s3_access_key_id:
+        missing.append(f"{s3_credentials_key_prefix}.access_key_id")
+    if not s3_secret_access_key:
+        missing.append(f"{s3_credentials_key_prefix}.secret_access_key")
+    if not s3_bucket:
+        missing.append(f"{s3_config_key_prefix}.bucket")
+    if not s3_region_code:
+        missing.append(f"{s3_config_key_prefix}.region_code")
+    if missing:
+        logger.error(
+            "Missing required S3 config value(s): %s", ", ".join(missing)
+        )
+        return False
```

And note that tools/deployment/presto-clp/worker/scripts/generate-configs.sh at line 84 writes `clp.s3-end-point` from this value.

tools/deployment/presto-clp/scripts/set-up-config.sh (1)
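The `None`-leak described above is easy to reproduce with a minimal dotted-key lookup (a hypothetical stand-in for `_get_config_value`, whose real implementation isn't shown in this diff): a missing key yields `None`, which f-strings happily render as the literal string "None":

```python
from typing import Any, Optional

def get_config_value(config: dict, dotted_key: str) -> Optional[Any]:
    """Walk a nested dict by a dotted key; return None if any segment is absent."""
    node: Any = config
    for segment in dotted_key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node

config = {"archive_output": {"storage": {"s3_config": {"bucket": "my-logs"}}}}
bucket = get_config_value(config, "archive_output.storage.s3_config.bucket")
region = get_config_value(config, "archive_output.storage.s3_config.region_code")
print(bucket)              # my-logs
print(f"region={region}")  # region=None -- the literal that would leak into .env
```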
26-26: Switch to generate-configs.py is aligned with the refactor. The entry-point change matches the new flow that loads clp-config.yml in Python. No further changes needed.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)
42-64: Make key matching literal in update_config_file to avoid regex collisions

Keys like presto.version and clp.s3-auth-provider contain dots, which are regex metacharacters; your grep/sed patterns can match unintended lines. Escape the key before using it in both the grep and sed patterns.

Apply this diff within update_config_file:

```diff
 update_config_file() {
   local file_path=$1
   local key=$2
   local value=$3
   local sensitive=${4:-false}
+  # Escape regex metacharacters in key for grep/sed
+  local escaped_key
+  escaped_key=$(printf '%s' "$key" | sed -e 's/[][\.^$*+?{}()|]/\\&/g')

-  if grep --quiet "^${key}=.*$" "$file_path"; then
-    sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
+  if grep --quiet "^${escaped_key}=" "$file_path"; then
+    sed --in-place "s|^${escaped_key}=.*|${key}=${value}|" "$file_path"
   else
     echo "${key}=${value}" >>"$file_path"
   fi
```
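The collision is easy to demonstrate; the same dot-matches-anything behaviour that grep and sed exhibit can be shown with Python's `re` module (the key name here mirrors the scripts; the colliding line is contrived for illustration):

```python
import re

key = "clp.s3-auth-provider"

# Unescaped: '.' matches any character, so an unrelated key collides.
collides = re.match(f"^{key}=", "clpXs3-auth-provider=oops")
print(bool(collides))  # True

# Escaped: only the literal key matches.
print(bool(re.match(f"^{re.escape(key)}=", "clp.s3-auth-provider=clp_package")))  # True
print(bool(re.match(f"^{re.escape(key)}=", "clpXs3-auth-provider=oops")))         # False
```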
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (6 hunks)
- tools/deployment/presto-clp/worker/scripts/generate-configs.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)
85-101: S3 gating, fail-fast validation, and secret masking look solid
- Correctly gated on PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE == "s3".
- Fail-fast on missing inputs.
- Sensitive values masked via update_config_file(..., true).
This addresses prior feedback around gating and secret leakage.
tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (3)
46-57: Early load and validation of clp-config.yml is good. Reading clp-config.yml up-front and failing with a clear error message if missing improves debuggability and keeps helper logic focused.
106-114: CLP_ARCHIVES_DIR resolution logic is sound. Resolving relative paths against the CLP package dir and switching the default between archives vs. staged-archives based on storage type is appropriate.
73-76: Signature change verified: all call sites updated

The `rg` and `ast-grep` searches show only the one three-argument call to `_add_clp_env_vars` in generate-user-env-vars-file.py (line 59). No stale two-arg invocations remain.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)
42-47: Harden update_config_file: escape regex metacharacters and document the masking parameter

Good addition of the sensitive flag. Two follow-ups:

- The function comment still lists only 3 params; add docs for $4.
- Key and value are interpolated into grep/sed regex/replacement without escaping. Dots and other metacharacters in keys (e.g., clp.s3-end-point) are treated as regex and may match unintended lines; unescaped ampersands in values can be expanded by sed.

Apply this diff to make the operation regex-safe and replacement-safe:

```diff
 update_config_file() {
   local file_path=$1
   local key=$2
   local value=$3
   local sensitive=${4:-false}
+
+  # Escape key for use in regex contexts
+  local key_re
+  key_re=$(printf '%s' "$key" | sed -e 's/[][\.^$*+?{}()|]/\\&/g')
+  # Escape '&' in the replacement to avoid sed expansion
+  local value_repl=${value//&/\\&}

-  if grep --quiet "^${key}=.*$" "$file_path"; then
-    sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
+  if grep --quiet "^${key_re}=.*$" "$file_path"; then
+    sed --in-place "s|^${key_re}=.*|${key}=${value_repl}|" "$file_path"
   else
     echo "${key}=${value}" >>"$file_path"
   fi
```

Also, please update the function header to document the optional $4 parameter (masking) for clarity.
Also applies to: 51-64
tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (1)
66-69: Write env-vars file with restrictive permissions (contains secrets)

The output contains AWS creds; use 0600 perms to reduce exposure. Example change (outside the selected range):

```python
import os

# ...
fd = os.open(output_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w") as output_file_handle:
    for key, value in env_vars.items():
        output_file_handle.write(f"{key}={value}\n")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (6 hunks)
- tools/deployment/presto-clp/worker/scripts/generate-configs.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (6)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)
87-99: Nice UX improvement: aggregated missing-variable reportCollecting all missing S3 env vars and failing fast improves operator feedback and reduces iteration time. LGTM.
tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py (5)
5-5: Type import update LGTM
31-33: CLI description update LGTM
73-75: Signature change LGTM; helps testability and reuse
106-113: Archive dir resolver LGTM. Dynamic default based on storage type and relative-to-package resolution is a good improvement.

132-147: S3 field validation LGTM. Good parity with the shell script; this catches misconfig early with a single clear error.
1bdb012 to 14b3a80
Actionable comments posted: 6
♻️ Duplicate comments (1)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)
58-63: Masking behaviour looks good; previous concern addressed. Switching to a fixed mask ("****") avoids the short-value edge case. No further action needed here.
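A fixed mask avoids leaking the secret's length as well as the short-value edge case; a minimal Python sketch of the same pattern (function and key names are illustrative, not the shell script's):

```python
def render_for_log(key: str, value: str, sensitive: bool = False) -> str:
    """Render 'key=value' for logging, replacing the value with a fixed mask when sensitive."""
    shown = "****" if sensitive else value
    return f"{key}={shown}"

print(render_for_log("clp.s3-end-point", "https://s3.us-east-1.amazonaws.com/"))
# The mask length is constant regardless of the secret's length:
print(render_for_log("clp.s3-secret-access-key", "wJalrXUtnFEMI", sensitive=True))
# clp.s3-secret-access-key=****
```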
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- tools/deployment/presto-clp/scripts/init.py (6 hunks)
- tools/deployment/presto-clp/scripts/set-up-config.sh (1 hunks)
- tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (3)
tools/deployment/presto-clp/worker/scripts/generate-configs.sh (1)
66-66: Bake jq, wget, and ca-certificates into the Presto worker image

Runtime `apt-get install` in tools/deployment/presto-clp/worker/scripts/generate-configs.sh (line 66) slows startup and is fragile. Instead:

- Remove the `apt-get update && apt-get install --assume-yes --no-install-recommends jq wget` from the script.
- Extend the Presto worker Dockerfile to include `jq`, `wget`, and `ca-certificates` so that no runtime installs are needed.
- Confirm the dependency PR (y-scope/presto#55) has been merged and identify which image tag now includes `ca-certificates`.
  - If it hasn't merged, this PR should be held until that change lands.
  - If it has, update the documentation or CI to reference the new, compliant image tag.

tools/deployment/presto-clp/scripts/init.py (1)
109-116: Relative-to-package resolution for CLP_ARCHIVES_DIR is a nice quality-of-life improvement. Good defaulting and relative-path handling via _resolve_archives_dir.
tools/deployment/presto-clp/scripts/set-up-config.sh (1)
26-28: Switch to init.py for env generation looks correct. The CLI and output path remain consistent. This aligns with the new logic in init.py.
| if "fs" == clp_archive_output_storage_type: | ||
| pass | ||
| elif "s3" == clp_archive_output_storage_type: | ||
| s3_config_key_prefix = f"archive_output.storage.s3_config" | ||
| s3_credentials_key_prefix = f"{s3_config_key_prefix}.aws_authentication.credentials" | ||
|
|
||
| s3_access_key_id = _get_config_value( | ||
| clp_config, f"{s3_credentials_key_prefix}.access_key_id" | ||
| ) | ||
|
|
||
| s3_bucket = _get_config_value(clp_config, f"{s3_config_key_prefix}.bucket") | ||
| s3_region_code = _get_config_value(clp_config, f"{s3_config_key_prefix}.region_code") | ||
|
|
||
| s3_secret_access_key = _get_config_value( | ||
| clp_config, f"{s3_credentials_key_prefix}.secret_access_key" | ||
| ) | ||
|
|
||
| # Validate required S3 fields | ||
| missing = [] | ||
|
|
||
| for k, v in { | ||
| f"{s3_credentials_key_prefix}.access_key_id": s3_access_key_id, | ||
| f"{s3_credentials_key_prefix}.secret_access_key": s3_secret_access_key, | ||
| f"{s3_config_key_prefix}.bucket": s3_bucket, | ||
| f"{s3_config_key_prefix}.region_code": s3_region_code, | ||
| }.items(): | ||
| if not v: | ||
| missing.append(k) | ||
|
|
||
| if missing: | ||
| logger.error("Missing required S3 config key(s): %s", ", ".join(missing)) | ||
| return False | ||
|
|
||
| env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "s3" | ||
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package" | ||
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id | ||
| s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/" | ||
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point | ||
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key | ||
| else: |
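The fail-fast validation in the code above reduces to a small reusable pattern; a standalone sketch (the helper name is illustrative, not part of the PR):

```python
import logging

logger = logging.getLogger(__name__)

def find_missing(required: dict) -> list:
    """Return the keys whose values are falsy (i.e., absent in the parsed config)."""
    return [key for key, value in required.items() if not value]

required = {
    "archive_output.storage.s3_config.bucket": "my-logs",
    "archive_output.storage.s3_config.region_code": None,  # missing in clp-config.yml
}
missing = find_missing(required)
if missing:
    logger.error("Missing required S3 config key(s): %s", ", ".join(missing))
print(missing)  # ['archive_output.storage.s3_config.region_code']
```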
💡 Verification agent
🧩 Analysis chain
S3 env-var emission looks correct; double-check endpoint construction for buckets with dots and non-standard partitions.
Using https://&lt;bucket&gt;.s3.&lt;region&gt;.amazonaws.com/ can break TLS for bucket names containing dots (multi-label hostnames) and won't cover GovCloud/China partitions. Consider using a region endpoint without the bucket (e.g., https://s3.&lt;region&gt;.amazonaws.com) and rely on a separate bucket parameter if the Presto CLP connector supports it, or add an override knob for custom endpoints.
Would you confirm the CLP Presto connector expects clp.s3-end-point to include the bucket? If not required, switch to region endpoint only to avoid TLS pitfalls for dotted bucket names.
🌐 Web query:
What is the expected format of clp.s3-end-point for the CLP Presto connector? Does it require a bucket-qualified host or a service endpoint (e.g., https://s3.<region>.amazonaws.com)?
💡 Result:
Use a service endpoint (not a bucket-qualified host). clp.s3-end-point should point to the S3 service endpoint (for example https://s3.us-west-2.amazonaws.com or a custom S3-compatible service URL), not to a hostname that embeds a bucket (bucketname.s3...). See CLP/Presto S3 connector docs and AWS S3 endpoint guidance for examples [1][2].
References
[1] Cloudera documentation — Presto/CLP S3 connector configuration (see S3 endpoint/URL settings). https://docs.cloudera.com/
[2] AWS S3 endpoints and quotas — service endpoint examples (e.g., https://s3.&lt;region&gt;.amazonaws.com). https://docs.aws.amazon.com/general/latest/gr/s3.html
Update S3 endpoint to use service endpoint instead of bucket-qualified host
The Presto CLP connector expects clp.s3-end-point to point at the S3 service endpoint (e.g. https://s3.<region>.amazonaws.com), not a hostname embedding the bucket. Constructing the endpoint as https://<bucket>.s3.<region>.amazonaws.com/ can break TLS for dotted bucket names and won’t work in non-standard AWS partitions.
- In tools/deployment/presto-clp/scripts/init.py (lines 128–130), replace:

  ```diff
  - s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
  + s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"
  ```

- Ensure the bucket name is passed separately via the existing PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET (or equivalent) environment variable.
- Optionally, expose a config override (e.g. clp.s3-end-point-override) for custom S3-compatible services or GovCloud/China endpoints instead of hardcoding the AWS pattern.
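The TLS pitfall can be seen just by constructing the two hostname styles. Certificates for S3 virtual-hosted endpoints use a single-label wildcard (`*.s3.<region>.amazonaws.com`), which cannot match the extra labels a dotted bucket introduces; a small sketch (function names are illustrative):

```python
def bucket_qualified_endpoint(bucket: str, region: str) -> str:
    # Embeds the bucket in the hostname -- a dotted bucket adds extra DNS labels.
    return f"https://{bucket}.s3.{region}.amazonaws.com"

def service_endpoint(region: str) -> str:
    # Region-scoped service endpoint; the bucket travels separately.
    return f"https://s3.{region}.amazonaws.com"

host = bucket_qualified_endpoint("logs.prod.example", "us-east-1")
# 'logs.prod.example' contributes three labels before 's3', so a single-label
# wildcard certificate (*.s3.us-east-1.amazonaws.com) cannot cover this host.
print(host)
print(service_endpoint("us-east-1"))  # https://s3.us-east-1.amazonaws.com
```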
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if "fs" == clp_archive_output_storage_type: | |
| pass | |
| elif "s3" == clp_archive_output_storage_type: | |
| s3_config_key_prefix = f"archive_output.storage.s3_config" | |
| s3_credentials_key_prefix = f"{s3_config_key_prefix}.aws_authentication.credentials" | |
| s3_access_key_id = _get_config_value( | |
| clp_config, f"{s3_credentials_key_prefix}.access_key_id" | |
| ) | |
| s3_bucket = _get_config_value(clp_config, f"{s3_config_key_prefix}.bucket") | |
| s3_region_code = _get_config_value(clp_config, f"{s3_config_key_prefix}.region_code") | |
| s3_secret_access_key = _get_config_value( | |
| clp_config, f"{s3_credentials_key_prefix}.secret_access_key" | |
| ) | |
| # Validate required S3 fields | |
| missing = [] | |
| for k, v in { | |
| f"{s3_credentials_key_prefix}.access_key_id": s3_access_key_id, | |
| f"{s3_credentials_key_prefix}.secret_access_key": s3_secret_access_key, | |
| f"{s3_config_key_prefix}.bucket": s3_bucket, | |
| f"{s3_config_key_prefix}.region_code": s3_region_code, | |
| }.items(): | |
| if not v: | |
| missing.append(k) | |
| if missing: | |
| logger.error("Missing required S3 config key(s): %s", ", ".join(missing)) | |
| return False | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "s3" | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package" | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id | |
| s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/" | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key | |
| else: | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "s3" | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package" | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id | |
| s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/" | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point | |
| env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key |
```python
def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
    """
    Generates a clp.properties for worker config.

    :param worker_config_template_path:
    """
    clp_properties_path = worker_config_template_path / "clp.properties"
    config_options = ["connector.name=clp"]
    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
        config_options.append("clp.storage-type=s3")
        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
    with clp_properties_path.open("w", encoding="utf-8") as f:
        f.write("\n".join(config_options) + "\n")
```
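For reference, the key=value lines the function emits round-trip cleanly; a standalone sketch of the format (helper names here are illustrative, not part of the PR):

```python
from pathlib import Path
import tempfile

def write_properties(path: Path, options: list) -> None:
    """Write one key=value option per line, trailing newline included."""
    path.write_text("\n".join(options) + "\n", encoding="utf-8")

def read_properties(path: Path) -> dict:
    """Parse key=value lines back into a dict (split on the first '=' only)."""
    lines = [line for line in path.read_text(encoding="utf-8").splitlines() if line]
    return {k: v for k, v in (line.split("=", 1) for line in lines)}

path = Path(tempfile.mkdtemp()) / "clp.properties"
write_properties(path, ["connector.name=clp", "clp.storage-type=s3"])
print(read_properties(path))  # {'connector.name': 'clp', 'clp.storage-type': 's3'}
```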
💡 Verification agent
🧩 Analysis chain
Do not write plaintext AWS credentials to disk in the repo; generate a template with env placeholders instead.
This writes clp.s3-access-key-id and clp.s3-secret-access-key values into tools/deployment/presto-clp/worker/config-template/clp.properties on the host filesystem. That increases the blast radius (accidental commits, backups, local-machine compromise). Keep secrets in the .env and expand them inside the container during startup.
Apply this minimal change to avoid persisting secrets and to prevent KeyError when storage type is fs:
```diff
 def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
@@
-    clp_properties_path = worker_config_template_path / "clp.properties"
-    config_options = ["connector.name=clp"]
-    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
-        config_options.append("clp.storage-type=s3")
-        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
-        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
-        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
-        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
-    with clp_properties_path.open("w", encoding="utf-8") as f:
+    clp_properties_path = worker_config_template_path / "clp.properties"
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    config_options = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
+        # Intentionally reference env vars; secrets are expanded inside the container.
+        config_options.append("clp.storage-type=s3")
+        config_options.append("clp.s3-auth-provider=${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}")
+        config_options.append("clp.s3-access-key-id=${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}")
+        config_options.append("clp.s3-end-point=${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}")
+        config_options.append("clp.s3-secret-access-key=${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}")
+    with clp_properties_path.open("w", encoding="utf-8") as f:
         f.write("\n".join(config_options) + "\n")
```

To ensure we're not already tracking this file, please run:
🏁 Script executed:

```shell
#!/bin/bash
# Verify the generated template path is ignored and not tracked
rg -n '^tools/deployment/presto-clp/worker/config-template/clp\.properties$' .gitignore || true
git ls-files --error-unmatch tools/deployment/presto-clp/worker/config-template/clp.properties && echo "TRACKED" || echo "NOT_TRACKED"
```

Length of output: 282
Untrack clp.properties and use placeholder template for secrets
The file tools/deployment/presto-clp/worker/config-template/clp.properties is currently tracked in Git, which risks committing real AWS credentials. You must remove it from version control, add it to .gitignore, and update the generation logic to emit placeholders instead of actual secret values.
• tools/deployment/presto-clp/worker/config-template/clp.properties
– Run git rm --cached tools/deployment/presto-clp/worker/config-template/clp.properties
– Add tools/deployment/presto-clp/worker/config-template/clp.properties to .gitignore
• tools/deployment/presto-clp/scripts/init.py (around lines 216–232)
– Default PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE to "fs" via env_vars.get(...) to prevent KeyError
– In the "s3" branch, write literal ${…} placeholders instead of expanding AWS secret env vars
– Ensure the target directory exists with mkdir(parents=True, exist_ok=True)
Minimal diff to apply:
```diff
 def _generate_worker_clp_properties(worker_config_template_path: Path, env_vars: Dict[str, str]) -> bool:
-    clp_properties_path = worker_config_template_path / "clp.properties"
-    config_options = ["connector.name=clp"]
-    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
-        config_options.append("clp.storage-type=s3")
-        config_options.append(f'clp.s3-auth-provider={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"]}')
-        config_options.append(f'clp.s3-access-key-id={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"]}')
-        config_options.append(f'clp.s3-end-point={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"]}')
-        config_options.append(f'clp.s3-secret-access-key={env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"]}')
-    with clp_properties_path.open("w", encoding="utf-8") as f:
+    clp_properties_path = worker_config_template_path / "clp.properties"
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    config_options = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
+        # Use placeholders for secrets; expand them at container startup
+        config_options.append("clp.storage-type=s3")
+        config_options.append("clp.s3-auth-provider=${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}")
+        config_options.append("clp.s3-access-key-id=${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}")
+        config_options.append("clp.s3-end-point=${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}")
+        config_options.append("clp.s3-secret-access-key=${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}")
+    with clp_properties_path.open("w", encoding="utf-8") as f:
         f.write("\n".join(config_options) + "\n")
```

📝 Committable suggestion
```shell
local file_path=$1
local key=$2
local value=$3
local sensitive=${4:-false}
```
🧹 Nitpick (assertive)
Document the new "sensitive" parameter in update_config_file.
The function signature gained a fourth parameter, but the header comment still lists only three. Add the param to avoid misuse by future callers.
```diff
 # @param $1 Path to the properties file.
 # @param $2 The key to set.
 # @param $3 The value to set.
+# @param $4 (optional) "true" to mask the value when logging
```
```shell
if grep --quiet "^${key}=.*$" "$file_path"; then
    sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
else
    echo "${key}=${value}" >>"$file_path"
fi
```
🛠️ Refactor suggestion
Make key/value matching and replacement regex-safe; avoid failures on missing files.
Keys like node.internal-address contain dots that are regex metacharacters. Using grep/sed with unescaped keys can produce false positives. Also, grep on a non-existent file will fail under set -e. Guard the file and use fixed-string grep; escape replacement metacharacters in the value.
```diff
-    if grep --quiet "^${key}=.*$" "$file_path"; then
-        sed --in-place "s|^${key}=.*|${key}=${value}|" "$file_path"
-    else
-        echo "${key}=${value}" >>"$file_path"
-    fi
+    # Ensure the file exists to avoid grep/sed failures under `set -e`
+    [[ -f "$file_path" ]] || : >"$file_path"
+    # Match keys literally to avoid regex surprises with dots, etc.
+    if grep -F --quiet "^${key}=" "$file_path"; then
+        # Escape replacement metacharacters in value for sed
+        local escaped_value=${value//\\/\\\\}
+        escaped_value=${escaped_value//|/\\|}
+        escaped_value=${escaped_value//&/\\&}
+        local escaped_key
+        escaped_key=$(printf '%s' "$key" | sed -e 's/[.[\*^$+?{}|()]/\\&/g')
+        sed --in-place "s|^${escaped_key}=.*|${key}=${escaped_value}|" "$file_path"
+    else
+        echo "${key}=${value}" >>"$file_path"
+    fi
```

Committable suggestion skipped: line range outside the PR's diff.
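The escaping rules above are easy to get subtly wrong, so here is a small self-contained sketch of the suggested approach (an illustrative rewrite, not the exact function from generate-configs.sh; the `sensitive` parameter is omitted for brevity, and GNU sed is assumed):

```shell
#!/usr/bin/env bash
# Illustrative sketch of a regex-safe key=value upsert for .properties files.
update_config_file() {
    local file_path="$1"
    local key="$2"
    local value="$3"
    # Create the file so grep does not fail when it is missing (e.g. under `set -e`).
    [ -f "$file_path" ] || : >"$file_path"
    # Escape regex metacharacters so dotted keys like node.internal-address match literally.
    local escaped_key
    escaped_key=$(printf '%s' "$key" | sed -e 's/[].[^$*\\]/\\&/g')
    if grep -q "^${escaped_key}=" "$file_path"; then
        # Escape sed replacement metacharacters: backslash, ampersand, and the | delimiter.
        local escaped_value
        escaped_value=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
        sed -i "s|^${escaped_key}=.*|${key}=${escaped_value}|" "$file_path"
    else
        echo "${key}=${value}" >>"$file_path"
    fi
}
```

Note that escaping the key before `grep` keeps the match anchored to the start of the line; `grep -F` with a `^` prefix would treat the caret as a literal character rather than an anchor.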
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/src/user-guide/guides-using-presto.md (1)
116-127: Tiny grammar nit.

```diff
-* To use more than Presto worker, you can use the `--scale` option as follows:
+* To use more than one Presto worker, use the `--scale` option:
```
♻️ Duplicate comments (12)
docs/src/user-guide/guides-using-presto.md (4)
96-101: Confirm split-filter keys and briefly explain lower/upperBound mapping.Double-check the connector expects customOptions.rangeMapping.lowerBound/upperBound with values begin_timestamp/end_timestamp. If correct, add one clarifying sentence after the snippet.
} ``` + The lowerBound and upperBound map your dataset’s timestamp fields to CLP’s begin and end columns respectively.Also verify the reference link label below still points to the split-filter section (it currently ends with “#metadata-filter-config-file”). If not, update it to the split-filter anchor.
88-88: Use a colon to introduce the following instruction; minor readability fix.

Keeps the bullets as a single instruction flow.

```diff
-* Open and edit `coordinator/config-template/split-filter.json`.
+* Open and edit `coordinator/config-template/split-filter.json`:
```
168-173: Avoid reserved/non-existent identifier in example query.

`user` may not exist and is reserved in some SQL dialects. Provide a safe, copy-pasteable example.

````diff
-To query the logs in this dataset:
+To query the logs in this dataset:
 
-```sql
-SELECT user FROM default LIMIT 1;
-```
+```sql
+-- Replace <columnName> with one from the DESCRIBE output:
+SELECT "<columnName>" FROM default LIMIT 1;
+```
````
179-179: Remove extra blank line to satisfy markdownlint MD012.

tools/deployment/presto-clp/scripts/init.py (8)
10-21: Avoid double-configuring the root logger; use basicConfig with guard.

```diff
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
```
55-57: Handle YAML parse errors and read as UTF-8.

Avoid unhandled stack traces on malformed clp-config.yml.

```diff
-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1
```
71-75: Create parent dir, write with UTF-8, and restrict permissions (secrets in env).

```diff
-    with open(output_file, "w") as output_file_handle:
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
         for key, value in env_vars.items():
             output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)
```
81-88: Docstring: explicitly document params and return.

```diff
     """
     Adds environment variables for CLP config values to `env_vars`.
 
-    :param clp_config:
-    :param clp_package_dir:
-    :param env_vars:
-    :return: Whether the environment variables were successfully added.
+    :param clp_config: Parsed contents of etc/clp-config.yml.
+    :param clp_package_dir: Root directory of the CLP package.
+    :param env_vars: Output map mutated in-place with env vars.
+    :return: True on success; False if validation fails.
```
112-121: Set storage-type env for fs to keep invariants consistent.

```diff
     if "fs" == clp_archive_output_storage_type:
+        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "fs"
         env_vars["CLP_ARCHIVES_DIR"] = str(
```
141-148: Handle credentials.yml parse errors and read as UTF-8.

```diff
-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False
```
193-199: Use S3 service endpoint (not bucket-host) and avoid TLS pitfalls with dotted buckets.

```diff
-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    # Use region service endpoint; bucket is configured separately by the connector.
+    s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"
```

Optional: add an override knob (e.g., PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT_OVERRIDE) for GovCloud/China or S3-compatible services.
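To see why the bucket-qualified host is fragile, consider how a single-label TLS wildcard certificate matches hostnames. The sketch below is illustrative (a simplified RFC 6125-style check, not code from this PR):

```python
def service_endpoint(region: str) -> str:
    """Region service endpoint; the bucket is supplied separately."""
    return f"https://s3.{region}.amazonaws.com"


def wildcard_matches(host: str, pattern: str) -> bool:
    """Simplified single-label TLS wildcard match: '*' covers exactly one DNS label."""
    if not pattern.startswith("*."):
        return host == pattern
    suffix = pattern[1:]  # e.g. ".s3.us-east-1.amazonaws.com"
    label, sep, rest = host.partition(".")
    return sep == "." and "." + rest == suffix


# AWS serves *.s3.<region>.amazonaws.com, so a bucket name containing dots
# ("my.bucket") produces a host the wildcard cannot cover.
```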
239-256: Don't write plaintext AWS secrets to clp.properties; default fs storage; ensure dir exists.

Writing secrets to disk expands blast radius and risks accidental commits. Emit env placeholders and expand in-container; also avoid KeyError on fs and create the target directory.

```diff
-    properties = ["connector.name=clp"]
-    if "s3" == env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"]:
+    properties = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
         property_name_to_env_var_name = {
             "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
-            "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
-            "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
-            "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
-            "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
+            # Expand inside container; keep secrets out of the repo and host FS.
+            "clp.s3-auth-provider": "${PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER}",
+            "clp.s3-access-key-id": "${PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID}",
+            "clp.s3-end-point": "${PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT}",
+            "clp.s3-secret-access-key": "${PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY}",
         }
-        for property_name, env_var_name in property_name_to_env_var_name.items():
-            env_var_value = env_vars.get(env_var_name)
-            if not env_var_value:
-                logger.error("Internal error: Missing required env var '%s'", env_var_name)
-                return False
-            properties.append(f"{property_name}={env_var_value}")
+        for property_name, env_template in property_name_to_env_var_name.items():
+            properties.append(f"{property_name}={env_template}")
+    else:
+        properties.append("clp.storage-type=fs")
 
-    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
         f.write("\n".join(properties) + "\n")
```

Follow-up: ensure tools/deployment/presto-clp/worker/config-template/clp.properties is .gitignored and not tracked.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- `docs/src/user-guide/guides-using-presto.md` (5 hunks)
- `tools/deployment/presto-clp/scripts/init.py` (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.
Applied to files:
docs/src/user-guide/guides-using-presto.md
🪛 LanguageTool
docs/src/user-guide/guides-using-presto.md
[typographical] ~89-~89: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...
(DURING_THAT_TIME_COMMA)
🔇 Additional comments (6)
docs/src/user-guide/guides-using-presto.md (6)
58-66: S3 guidance is good; add merge-blocker note for worker image dependency.

Given this feature requires the Presto worker image with ca-certificates (y-scope/presto#55), add a short callout so users aren't blocked if they follow this guide before that image is available. Example: "Ensure you're using a worker image ≥ that includes ca-certificates."
Would you like me to propose the exact wording once the image tag is known?
155-159: Good note.

Clearly explains the default dataset behaviour.
160-167: LGTM on DESCRIBE example.
174-177: Known issue note is helpful.Clear warning about SELECT *.
197-197: Link likely points to old metadata-filter section; update to split-filter anchor.

Ensure [clp-connector-docs] targets split-filter config (not metadata-filter).
If needed, I can update the anchor once you confirm the canonical URL.
204-204: Reference looks good.

Links to the Velox issue for the SELECT * crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (6)
tools/deployment/presto-clp/scripts/init.py (6)
10-21: Avoid double-configuring the root logger when imported.

```diff
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
```
55-57: Handle YAML parse errors and read clp-config.yml as UTF-8.

```diff
-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1
```
71-74: Create parent dir, write with UTF-8, and chmod 0600 for the env file (contains secrets).

```diff
-    with open(output_file, "w") as output_file_handle:
-        for key, value in env_vars.items():
-            output_file_handle.write(f"{key}={value}\n")
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
+        for key, value in env_vars.items():
+            output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)
```
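A related sketch (a hypothetical helper, not the PR's code): creating the env file through `os.open` with a mode argument applies the 0600 permissions at creation time, avoiding the brief window in which an open-then-chmod approach leaves the file readable under the default umask:

```python
import os
from pathlib import Path
from typing import Dict


def write_env_file(path: Path, env_vars: Dict[str, str]) -> None:
    """Write KEY=VALUE lines, creating the file owner-read/write only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # O_CREAT applies the 0o600 mode at creation, before any bytes are written.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for key, value in env_vars.items():
            f.write(f"{key}={value}\n")
```

Note the mode only takes effect when the file is created; if the file already exists with looser permissions, an explicit chmod is still needed.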
112-121: Set storage-type env for fs to preserve invariants and avoid downstream KeyError.

```diff
     if "fs" == clp_archive_output_storage_type:
-        env_vars["CLP_ARCHIVES_DIR"] = str(
+        env_vars["CLP_ARCHIVES_DIR"] = str(
             _get_path_clp_config_value(
                 clp_config,
                 f"{archive_output_storage_key}.directory",
                 Path("var") / "data" / "archives",
                 clp_package_dir,
             )
         )
+        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "fs"
```
141-148: Handle credentials.yml parse errors and read as UTF-8.

```diff
-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False
```
170-191: S3: log missing keys and use the service endpoint (not bucket-qualified) to avoid TLS issues with dotted bucket names.

```diff
-    except KeyError:
-        return False
+    except KeyError as e:
+        logger.error("Missing required S3 config key: %s", e)
+        return False
@@
-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    # Prefer service endpoint; avoids TLS problems with dotted bucket names and works across partitions.
+    s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com/"
```

Please confirm the CLP Presto connector expects a service endpoint for `clp.s3-end-point` and does not require a bucket-qualified host.

Also applies to: 196-198
📜 Review details
📒 Files selected for processing (2)
- `docs/src/user-docs/guides-using-presto.md` (5 hunks)
- `tools/deployment/presto-clp/scripts/init.py` (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T13:35:33.366Z
Learnt from: anlowee
PR: y-scope/clp#1228
File: tools/deployment/presto-clp/scripts/init.py:298-306
Timestamp: 2025-09-02T13:35:33.366Z
Learning: In the CLP codebase, all configuration values loaded from YAML files are string values, so functions that retrieve config values should use Optional[str] rather than Optional[Any] for type hints.
Applied to files:
tools/deployment/presto-clp/scripts/init.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.
Applied to files:
docs/src/user-docs/guides-using-presto.md
🪛 LanguageTool
docs/src/user-docs/guides-using-presto.md
[typographical] ~89-~89: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...
(DURING_THAT_TIME_COMMA)
🔇 Additional comments (4)
docs/src/user-docs/guides-using-presto.md (3)
58-66: S3 setup guidance reads well and scopes auth support clearly — LGTM.
155-158: Querying section improvements are solid; the SELECT * warning is especially helpful.

Also applies to: 160-167, 168-177
88-101: Keep existing anchor.

The Presto connector docs only provide a "Metadata Filter Config File" section (anchor #metadata-filter-config-file); there is no "split-filter-config" anchor. Leave the link pointing to #metadata-filter-config-file as is.

Likely an incorrect or invalid review comment.
tools/deployment/presto-clp/scripts/init.py (1)
298-306: Type hint Optional[str] for _get_config_value matches the codebase convention — LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (8)
tools/deployment/presto-clp/scripts/init.py (8)
10-20: Guard logging setup to prevent duplicate handlers when imported.

Use basicConfig only if no handlers are present.

```diff
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
```
55-57: Handle YAML parse errors and read clp-config.yml as UTF-8.

Avoids noisy stack traces on malformed YAML.

```diff
-    with open(clp_config_file_path, "r") as clp_config_file:
-        clp_config = yaml.safe_load(clp_config_file)
+    try:
+        with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+            clp_config = yaml.safe_load(clp_config_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+        return 1
```
71-74: Ensure parent dir exists, write as UTF-8, and chmod 0600 (env contains secrets).

Prevents failures on missing dirs and reduces secret exposure.

```diff
-    with open(output_file, "w") as output_file_handle:
-        for key, value in env_vars.items():
-            output_file_handle.write(f"{key}={value}\n")
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, "w", encoding="utf-8") as output_file_handle:
+        for key, value in env_vars.items():
+            output_file_handle.write(f"{key}={value}\n")
+    try:
+        output_file.chmod(0o600)
+    except Exception as e:
+        logger.warning("Failed to set permissions on '%s': %s", output_file, e)
```
81-88: Docstring: document params and return contract.

Clarify usage for maintainers.

```diff
     """
-    Adds environment variables for CLP config values to `env_vars`.
-
-    :param clp_config:
-    :param clp_package_dir:
-    :param env_vars:
-    :return: Whether the environment variables were successfully added.
+    Adds environment variables for CLP config values to env_vars.
+
+    :param clp_config: Parsed contents of etc/clp-config.yml.
+    :param clp_package_dir: Root of the CLP package; used to resolve relative paths.
+    :param env_vars: Output map mutated in-place.
+    :return: True on success; False if validation fails.
     """
```
112-121: Set storage-type env for fs to keep invariants consistent.

Downstream code can rely on this always being present.

```diff
-    if "fs" == clp_archive_output_storage_type:
-        env_vars["CLP_ARCHIVES_DIR"] = str(
+    if "fs" == clp_archive_output_storage_type:
+        env_vars["PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"] = "fs"
+        env_vars["CLP_ARCHIVES_DIR"] = str(
```
146-148: Handle YAML parse errors for credentials.yml and read as UTF-8.

Fail gracefully with a clear message.

```diff
-    with open(credentials_file_path, "r") as credentials_file:
-        credentials = yaml.safe_load(credentials_file)
+    try:
+        with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+            credentials = yaml.safe_load(credentials_file)
+    except yaml.YAMLError as e:
+        logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+        return False
```
196-198: Use S3 service endpoint (avoid bucket-qualified host to prevent TLS issues with dotted bucket names).

Prefer the region service endpoint; the bucket should not be embedded in the host.

```diff
-    s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
+    s3_end_point = f"https://s3.{s3_region_code}.amazonaws.com"
```
239-258: Do not write plaintext AWS secrets to clp.properties; write ${ENV} placeholders, default storage-type, and ensure dir exists.

Reduces secret exposure and fixes KeyError on fs.

```diff
-    properties = ["connector.name=clp"]
-    if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
+    properties = ["connector.name=clp"]
+    storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+    if storage_type == "s3":
         property_name_to_env_var_name = {
             "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
             "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
             "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
             "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
             "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
         }
         for property_name, env_var_name in property_name_to_env_var_name.items():
             env_var_value = env_vars.get(env_var_name)
             if not env_var_value:
                 logger.error("Internal error: Missing required env var '%s'", env_var_name)
                 return False
-            properties.append(f"{property_name}={env_var_value}")
+            properties.append(f"{property_name}=${{{env_var_name}}}")
+    else:
+        properties.append("clp.storage-type=fs")
 
-    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
+    worker_config_template_path.mkdir(parents=True, exist_ok=True)
+    with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
         f.write("\n".join(properties) + "\n")
```
📜 Review details
📒 Files selected for processing (1)
- `tools/deployment/presto-clp/scripts/init.py` (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-02T13:35:33.366Z
Learnt from: anlowee
PR: y-scope/clp#1228
File: tools/deployment/presto-clp/scripts/init.py:298-306
Timestamp: 2025-09-02T13:35:33.366Z
Learning: In the CLP codebase, all configuration values loaded from YAML files are string values, so functions that retrieve config values should use Optional[str] rather than Optional[Any] for type hints.
Applied to files:
tools/deployment/presto-clp/scripts/init.py
🔇 Additional comments (2)
tools/deployment/presto-clp/scripts/init.py (2)
183-189: Good: both access_key_id and secret_access_key are now required.

Prevents late failures due to missing secrets.
298-306: Good: _get_config_value() returns Optional[str] per CLP config conventions.

Matches the project learning that YAML values are strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/src/user-docs/guides-using-presto.md (2)
114-121: Add a short note about S3 TLS prereq (CA certs in worker image).

Users on S3 may hit TLS failures without CA certificates in the Presto worker image. Add a warning to prevent confusion until the dependent image is available.

Apply after the compose command block:

````diff
 docker compose up
 ```
+
+:::{warning}
+If you are using S3 archives over HTTPS, ensure your Presto worker image includes CA certificates; otherwise S3 requests may fail with TLS errors.
+:::
````
197-197: Update docs anchor to split-filter section.

The link still targets `#metadata-filter-config-file`. With the move to split filters, point to the split-filter section.

Apply:

```diff
-[clp-connector-docs]: https://docs.yscope.com/presto/connector/clp.html#metadata-filter-config-file
+[clp-connector-docs]: https://docs.yscope.com/presto/connector/clp.html#split-filter-config-file
```
📜 Review details
📒 Files selected for processing (1)
- `docs/src/user-docs/guides-using-presto.md` (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.
Applied to files:
docs/src/user-docs/guides-using-presto.md
🪛 LanguageTool
docs/src/user-docs/guides-using-presto.md
[typographical] ~89-~89: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...
(DURING_THAT_TIME_COMMA)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-docs/guides-using-presto.md
190-190: Reference links and images should use a label that is defined
Missing link or image reference definition: "y-scope/velox#27"
(MD052, reference-links-images)
🔇 Additional comments (4)
docs/src/user-docs/guides-using-presto.md (4)
88-101: Confirm the split-filter JSON shape (`customOptions.rangeMapping`).

Double-check that `customOptions.rangeMapping.lowerBound|upperBound` is the expected nesting for the current CLP connector; older examples placed `rangeMapping` at the top level.

Would you like me to verify against the latest connector docs and propose an adjusted example if needed?
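To make the nesting under discussion concrete, here is a sketch of a script that emits such a filter config. The key names (`columnName`, and the placement of `rangeMapping` under `customOptions`) are assumptions taken from this review thread, not from the authoritative connector schema:

```python
import json

# Hypothetical split-filter config; "columnName" and the
# "customOptions.rangeMapping.lowerBound/upperBound" nesting are
# assumptions based on this review discussion.
split_filter = {
    "default": [
        {
            "columnName": "timestamp",
            "customOptions": {
                "rangeMapping": {
                    "lowerBound": "begin_timestamp",
                    "upperBound": "end_timestamp",
                },
            },
        }
    ]
}

# Print the config as it would appear in split-filter.json.
print(json.dumps(split_filter, indent=2))
```

If the connector expects `rangeMapping` at the top level instead, only the dictionary layout changes; the surrounding serialization stays the same.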
155-158: Good addition: default dataset note is clear.
160-167: LGTM: switch to DESCRIBE and clarify schema.
168-177: LGTM: warn about SELECT * crash with reference.
> The Presto CLP integration has the following limitations at present:
>
> * `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#27].
Fix broken Velox reference (build failure under MD052).
Text cites [y-scope/velox#27] but only #28 is defined below; also your warning above references #28. Unify to #28.
Apply:
```diff
-* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#27].
+* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#28].
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
* `SELECT *` currently causes a crash due to a [known issue][y-scope/velox#28].
```
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
190-190: Reference links and images should use a label that is defined
Missing link or image reference definition: "y-scope/velox#27"
(MD052, reference-links-images)
🤖 Prompt for AI Agents
In docs/src/user-docs/guides-using-presto.md around line 190, the markdown
reference uses [y-scope/velox#27] but the defined reference and the warning
above use #28; update the inline citation to [y-scope/velox#28] so it matches
the defined reference and resolves the MD052 build failure.
…; Fix other minor issues.
…ile path to error message.
kirkrodrigues left a comment
For the PR title, how about:
feat(presto-clp): Update docs and config generation to support reading archives from S3 and Presto split-filtering config.
> [credentials](guides-using-object-storage/clp-config.md#credentials) authentication type.
> :::
>
> 4. Continue following the [quick-start](./quick-start/index.md#using-clp) guide to start CLP and
Can we use `quick-start/index.md#using-clp`? Because at L63 it is `guides-using-object-storage/clp-config.md#credentials`.
Sure
> To query the logs in this dataset:
>
> ```sql
> SELECT * FROM default LIMIT 1;
> ```
Do we need to mention the limitation that it doesn't work for postgresql (or potentially other datasets)?
It does work for postgresql. The problem was that we needed to update the worker image to work with the latest clp. For elasticsearch, it works, but @timestamp shows up as NULL. That you'll need to verify on your end. I haven't tested other datasets. Does it not work for them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For mongodb (original dataset w/o preprocessing) the query directly failed (Query is gone? in presto-cli).
For cockroachdb and spark datasets:
```
presto:default> select * from datacockroachdb limit 1;
Query 20250903_152806_00005_35wkg failed: Failed to parse type [hb]. Type not registered.

presto:default> select * from dataspark limit 1;
Query is gone (server restarted?)
```
Do we have issues for all of the bugs causing those failures?
…xwei/s3-support-config
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/src/user-docs/guides-using-presto.md (2)
89-107: Clarify the JSON example and point to the full schema.

Consider adding a short sentence clarifying that the filter's column name must match the CLP timestamp key for that dataset, and that `lowerBound`/`upperBound` should be the CLP column names `begin_timestamp`/`end_timestamp` as shown. The rest looks correct.
135-137: Prefer `docker compose down --volumes` for clean-up.

`rm` won't remove networks/volumes; `down --volumes` fully tears down the stack.

Apply:

```diff
-docker compose rm
+docker compose down --volumes
```
♻️ Duplicate comments (6)
tools/deployment/presto-clp/scripts/init.py (6)
146-153: Handle YAML parse errors and read as UTF‑8 (credentials.yml).

Fail gracefully on malformed credentials.

Apply:

```diff
-with open(credentials_file_path, "r") as credentials_file:
-    credentials = yaml.safe_load(credentials_file)
+try:
+    with open(credentials_file_path, "r", encoding="utf-8") as credentials_file:
+        credentials = yaml.safe_load(credentials_file)
+except yaml.YAMLError as e:
+    logger.error("Failed to parse '%s': %s", credentials_file_path, e)
+    return False
```
55-57: Handle YAML parse errors and read as UTF‑8 (clp-config.yml).

Avoids stack traces on malformed YAML.

Apply:

```diff
-with open(clp_config_file_path, "r") as clp_config_file:
-    clp_config = yaml.safe_load(clp_config_file)
+try:
+    with open(clp_config_file_path, "r", encoding="utf-8") as clp_config_file:
+        clp_config = yaml.safe_load(clp_config_file)
+except yaml.YAMLError as e:
+    logger.error("Failed to parse '%s': %s", clp_config_file_path, e)
+    return 1
```
10-21: Avoid double-configuring the root logger (use basicConfig with a guard).

Prevents duplicate handlers when imported.

Apply:

```diff
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
+if not logging.getLogger().handlers:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
```
71-75: Create the parent dir, write with UTF‑8, and restrict permissions (the env file contains secrets).

Sets sane defaults for secret material.

Apply:

```diff
-with open(output_file, "w") as output_file_handle:
-    for key, value in env_vars.items():
-        output_file_handle.write(f"{key}={value}\n")
+output_file.parent.mkdir(parents=True, exist_ok=True)
+with open(output_file, "w", encoding="utf-8") as output_file_handle:
+    for key, value in env_vars.items():
+        output_file_handle.write(f"{key}={value}\n")
+try:
+    output_file.chmod(0o600)
+except Exception as e:
+    logger.warning("Failed to set permissions on '%s': %s", output_file, e)
```
314-317: Type hints: return `Optional[str]` for config values.

CLP YAML values are strings; tighten types for clarity.

Apply:

```diff
-def _get_config_value(
-    config: Dict[str, Any], key: str, default_value: Optional[Any] = None
-) -> Optional[Any]:
+def _get_config_value(
+    config: Dict[str, Any], key: str, default_value: Optional[str] = None
+) -> Optional[str]:
```
209-214: Use the S3 service endpoint (not bucket-qualified) and emit the bucket separately.

Avoids TLS issues with dotted bucket names and supports non-standard partitions.

Apply:

```diff
 env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER"] = "clp_package"
 env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID"] = s3_access_key_id
-s3_end_point = f"https://{s3_bucket}.s3.{s3_region_code}.amazonaws.com/"
-env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = s3_end_point
 env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY"] = s3_secret_access_key
+env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET"] = s3_bucket
+# Use the regional S3 service endpoint.
+env_vars["PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"] = f"https://s3.{s3_region_code}.amazonaws.com"
```
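To illustrate why the bucket-qualified endpoint is fragile, here is a minimal sketch of the two endpoint styles (the helper names are illustrative, not from `init.py`). With virtual-hosted addressing, the bucket name becomes part of the TLS hostname, and a dotted bucket like `my.logs` produces extra hostname labels that Amazon's wildcard certificate `*.s3.<region>.amazonaws.com` does not cover:

```python
def virtual_hosted_endpoint(bucket: str, region: str) -> str:
    # The bucket becomes part of the TLS hostname; a dotted bucket such as
    # "my.logs" yields "my.logs.s3.<region>.amazonaws.com", which a
    # single-label wildcard certificate cannot match.
    return f"https://{bucket}.s3.{region}.amazonaws.com/"


def regional_endpoint(region: str) -> str:
    # Service-level endpoint; the bucket is passed separately, so bucket
    # names never appear in the TLS hostname.
    return f"https://s3.{region}.amazonaws.com"
```

Emitting the bucket as its own property, as the diff above does, keeps the endpoint independent of bucket naming.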
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
docs/src/user-docs/guides-using-presto.md (7 hunks)
tools/deployment/presto-clp/scripts/init.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T13:35:33.396Z
Learnt from: anlowee
PR: y-scope/clp#1228
File: tools/deployment/presto-clp/scripts/init.py:298-306
Timestamp: 2025-09-02T13:35:33.396Z
Learning: In the CLP codebase, all configuration values loaded from YAML files are string values, so functions that retrieve config values should use Optional[str] rather than Optional[Any] for type hints.
Applied to files:
tools/deployment/presto-clp/scripts/init.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
PR: y-scope/clp#1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.
Applied to files:
docs/src/user-docs/guides-using-presto.md
🪛 LanguageTool
docs/src/user-docs/guides-using-presto.md
[typographical] ~90-~90: It appears that a comma is missing.
Context: ...late/split-filter.json`. * For each dataset you want to query, add a filter config ...
(DURING_THAT_TIME_COMMA)
🔇 Additional comments (6)
docs/src/user-docs/guides-using-presto.md (6)
19-19: Confirm the minimum CLP version aligns with split-filter support.

Please verify v0.5.0 is indeed the minimum that includes the split-filter support required by this doc and the new templates. If not, bump accordingly.

58-65: LGTM: S3/object-storage guidance and closed note block.

The guidance reads clearly and the admonition is properly closed; rendering/numbering should be stable.

66-66: LGTM: Quick-start anchor fix.

The relative anchor to quick-start usage looks correct.

121-128: LGTM: Scaling workers wording.

Clearer phrasing and example.
156-160: LGTM: Default dataset note is accurate and closed.
161-168: LGTM: Describe guidance for default and non-default datasets.
```python
database_host = _get_config_value(clp_config, "database.host", "localhost")
database_port = _get_config_value(clp_config, "database.port", 3306)
database_name = _get_config_value(clp_config, "database.name", "clp-db")
env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL"] = (
    f"jdbc:mysql://{database_host}:{database_port}"
)
env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME"] = database_name
```
🧹 Nitpick (assertive)
Keep config values as strings; adjust default port type.
Config values are strings in CLP; pass the default port as a string for consistency.
Apply:
```diff
-database_port = _get_config_value(clp_config, "database.port", 3306)
+database_port = _get_config_value(clp_config, "database.port", "3306")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 database_host = _get_config_value(clp_config, "database.host", "localhost")
-database_port = _get_config_value(clp_config, "database.port", 3306)
+database_port = _get_config_value(clp_config, "database.port", "3306")
 database_name = _get_config_value(clp_config, "database.name", "clp-db")
 env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL"] = (
     f"jdbc:mysql://{database_host}:{database_port}"
 )
 env_vars["PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME"] = database_name
```
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 104 to 110, the CLP
config values are treated as strings so the default database port should be a
string as well; change the call _get_config_value(clp_config, "database.port",
3306) to use "3306" as the default string (so database_port is always a string)
and ensure any downstream concatenation or usage treats database_port as a
string.
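For reference, the string-only contract being discussed can be sketched as a dotted-key lookup that always stringifies its result. This helper is illustrative — it assumes `_get_config_value` in `init.py` behaves roughly this way, which is not confirmed here:

```python
from typing import Any, Dict, Optional


def get_config_value(
    config: Dict[str, Any], key: str, default: Optional[str] = None
) -> Optional[str]:
    # Walk a dotted key such as "database.port" through nested dicts,
    # stringifying the result so callers always see string config values.
    node: Any = config
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return str(node)
```

With this contract, an integer `port: 3306` in the YAML and a string default `"3306"` both reach the caller as the string `"3306"`, so downstream f-string concatenation never mixes types.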
```python
except KeyError as e:
    logger.error(
        "Missing required key %s in '%s'",
        e,
        coordinator_common_env_file_path,
    )
    return False
```
🧹 Nitpick (assertive)
Log the missing key name without extra quotes.
Improves readability of error logs.
Apply:
```diff
-except KeyError as e:
-    logger.error(
-        "Missing required key %s in '%s'",
-        e,
-        coordinator_common_env_file_path,
-    )
+except KeyError as e:
+    missing_key = e.args[0] if e.args else str(e)
+    logger.error(
+        "Missing required key %s in '%s'",
+        missing_key,
+        coordinator_common_env_file_path,
+    )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 except KeyError as e:
+    missing_key = e.args[0] if e.args else str(e)
     logger.error(
         "Missing required key %s in '%s'",
-        e,
+        missing_key,
         coordinator_common_env_file_path,
     )
     return False
```
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 233 to 239, the
logger currently logs the KeyError object which prints the missing key with
extra quotes; replace the logged argument with the raw key name (use e.args[0])
so the message reads the key without surrounding quotes, e.g. pass e.args[0] to
logger.error instead of e.
```python
properties = ["connector.name=clp"]
if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
    property_name_to_env_var_name = {
        "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
        "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
        "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
        "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
        "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
    }

    for property_name, env_var_name in property_name_to_env_var_name.items():
        env_var_value = env_vars.get(env_var_name)
        if not env_var_value:
            logger.error("Internal error: Missing required env var '%s'", env_var_name)
            return False
        properties.append(f"{property_name}={env_var_value}")

with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
    f.write("\n".join(properties) + "\n")
```
Do not write plaintext secrets to clp.properties; write env placeholders and ensure dir exists.
Prevents secret sprawl; also adds fs default and bucket property.
Apply:
```diff
-properties = ["connector.name=clp"]
-if "s3" == env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE"):
-    property_name_to_env_var_name = {
-        "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
-        "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
-        "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
-        "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
-        "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
-    }
-
-    for property_name, env_var_name in property_name_to_env_var_name.items():
-        env_var_value = env_vars.get(env_var_name)
-        if not env_var_value:
-            logger.error("Internal error: Missing required env var '%s'", env_var_name)
-            return False
-        properties.append(f"{property_name}={env_var_value}")
-
-with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
+properties = ["connector.name=clp"]
+storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
+if storage_type == "s3":
+    property_name_to_env_var_name = {
+        "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
+        "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
+        "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
+        "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
+        "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
+        "clp.s3-bucket": "PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET",
+    }
+
+    for property_name, env_var_name in property_name_to_env_var_name.items():
+        if not env_vars.get(env_var_name):
+            logger.error("Internal error: Missing required env var '%s'", env_var_name)
+            return False
+        # Expand inside the container; avoid persisting secrets on disk.
+        properties.append(f"{property_name}=${{{env_var_name}}}")
+else:
+    properties.append("clp.storage-type=fs")
+
+worker_config_template_path.mkdir(parents=True, exist_ok=True)
+with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
     f.write("\n".join(properties) + "\n")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
properties = ["connector.name=clp"]
storage_type = env_vars.get("PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE", "fs")
if storage_type == "s3":
    property_name_to_env_var_name = {
        "clp.storage-type": "PRESTO_WORKER_CLPPROPERTIES_STORAGE_TYPE",
        "clp.s3-auth-provider": "PRESTO_WORKER_CLPPROPERTIES_S3_AUTH_PROVIDER",
        "clp.s3-access-key-id": "PRESTO_WORKER_CLPPROPERTIES_S3_ACCESS_KEY_ID",
        "clp.s3-end-point": "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT",
        "clp.s3-secret-access-key": "PRESTO_WORKER_CLPPROPERTIES_S3_SECRET_ACCESS_KEY",
        "clp.s3-bucket": "PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET",
    }

    for property_name, env_var_name in property_name_to_env_var_name.items():
        if not env_vars.get(env_var_name):
            logger.error("Internal error: Missing required env var '%s'", env_var_name)
            return False
        # Expand inside the container; avoid persisting secrets on disk.
        properties.append(f"{property_name}=${{{env_var_name}}}")
else:
    properties.append("clp.storage-type=fs")

# Ensure the template directory exists before writing
worker_config_template_path.mkdir(parents=True, exist_ok=True)
with open(worker_config_template_path / "clp.properties", "w", encoding="utf-8") as f:
    f.write("\n".join(properties) + "\n")
```
Description
This PR adds support for S3 config for Presto. It scans `clp-config.yaml` to decide whether the S3 config should be generated, i.e., when `archive_output.storage.type` is `s3`. Since there are no changes in usage from the user's end, there is no need to update the doc.
Note that this PR should not be merged before Presto updates the worker image dependency (y-scope/presto#55): the image needs the `ca-certificates` package installed.

Checklist
breaking change.
Validation performed
Tested with an S3-configured clp-package; it works.
Summary by CodeRabbit
New Features
Chores
Documentation