Merged
Changes from 5 commits
43 commits
3a84840
Init
anlowee Jul 27, 2025
9574059
Add new files to yaml linting; Fix yamllint violations.
kirkrodrigues Jul 27, 2025
8dd8795
Remove unnecessarily blank lines.
kirkrodrigues Jul 27, 2025
a8af40c
Remove unnecesary script.
kirkrodrigues Jul 27, 2025
2015a4a
Replace demo-assets/init.sh and demo CLP config file with more robust…
kirkrodrigues Jul 27, 2025
b78682f
Add missing return on error.
kirkrodrigues Jul 27, 2025
1247bba
Apply shell linters to worker/scripts/generate-configs.sh
kirkrodrigues Jul 28, 2025
1400a90
Use Docker compose to wait for the coordinator to be ready.
kirkrodrigues Jul 28, 2025
91040d1
Use jq to parse Presto version info.
kirkrodrigues Jul 28, 2025
9ec648c
Clean-up wget command.
kirkrodrigues Jul 28, 2025
4097478
Use /usr/bin/env.
kirkrodrigues Jul 28, 2025
b1ce135
Use function to update kv-pairs in config file. Set kv-pairs if they …
kirkrodrigues Jul 28, 2025
70a56d7
Move getting Presto coordinator version into a function.
kirkrodrigues Jul 28, 2025
99ac4e1
Minor edits for consistency.
kirkrodrigues Jul 28, 2025
6dde297
fix: Set error policies.
kirkrodrigues Jul 28, 2025
b22746d
Mark constants readonly.
kirkrodrigues Jul 28, 2025
708756d
Clean-up comments.
kirkrodrigues Jul 28, 2025
2eeda5c
Quote paths.
kirkrodrigues Jul 28, 2025
8bb98f8
Clean-up presto-clp/coordinator/scripts/generate-configs.sh.
kirkrodrigues Jul 28, 2025
9fc487d
Remove deprecated version property.
kirkrodrigues Jul 28, 2025
64e8edf
Alphabetize mounts.
kirkrodrigues Jul 28, 2025
00bd3f1
Rename environment variables for clarity.
kirkrodrigues Jul 28, 2025
28f5b70
fix: Remove spurious equals sign.
kirkrodrigues Jul 28, 2025
cccf9fe
Lint set-up-config.sh.
kirkrodrigues Jul 28, 2025
04f3d2a
Add docs and remove README.
kirkrodrigues Jul 28, 2025
3381eb8
Set coordinator log level to INFO.
kirkrodrigues Jul 28, 2025
bea5bb6
Validate CLP metadata database type.
kirkrodrigues Jul 28, 2025
1f68965
Use logging function rather than echos.
kirkrodrigues Jul 28, 2025
94b0210
Reorder functions.
kirkrodrigues Jul 28, 2025
22c53e0
Add new docs to index.
kirkrodrigues Jul 28, 2025
2f0de28
Add S3 limitation.
kirkrodrigues Jul 28, 2025
d621bf3
Add clone step to docs.
kirkrodrigues Jul 28, 2025
7300d20
Add required CLP version to docs.
kirkrodrigues Jul 28, 2025
4bc2f58
Rename PRESTO_WORKER_HTTPPORT.
kirkrodrigues Jul 28, 2025
80c41d4
Remove unnecessary quotes from env var files.
kirkrodrigues Jul 28, 2025
9d2146b
Address some rabbit feedback.
kirkrodrigues Jul 28, 2025
ff06c62
Address the nested field limitation comment
anlowee Jul 28, 2025
cf11694
Add metadata filter config
anlowee Jul 28, 2025
2618f13
Merge branch 'main' into xwei/yscope-comose
anlowee Jul 28, 2025
782b87d
Docs edits.
kirkrodrigues Jul 28, 2025
8107518
Add error checking for config files not existing.
kirkrodrigues Jul 28, 2025
4d896e5
More docs edits.
kirkrodrigues Jul 28, 2025
713670a
Remove extra spaces.
kirkrodrigues Jul 28, 2025
4 changes: 3 additions & 1 deletion taskfiles/lint.yaml
@@ -103,7 +103,8 @@ tasks:
 components/package-template/src/etc \
 docs \
 taskfile.yaml \
-taskfiles
+taskfiles \
+tools/deployment

check-cpp-format:
sources: &cpp_source_files
@@ -772,6 +773,7 @@ tasks:
- "components/clp-py-utils/clp_py_utils"
- "components/core/tools/scripts/utils"
- "components/job-orchestration/job_orchestration"
- "tools/deployment"
- "tools/scripts"
Comment on lines +776 to 777
Contributor

🧹 Nitpick (assertive)

Black/Ruff may recurse through large non-code trees under tools/deployment.

Running black . from tools/deployment will walk every sub-directory (configs, templates, logs). That can add noticeable time and, in worst cases, choke on extremely large data files a user may drop under that tree. Consider narrowing the path to tools/deployment/**/*.py or adding an --extend-exclude for obvious non-code directories (e.g., config-template, scripts/generated).

🤖 Prompt for AI Agents
In taskfiles/lint.yaml around lines 776 to 777, the current lint paths include
"tools/deployment" which causes Black and Ruff to recursively process large
non-code directories, slowing down linting and potentially causing errors.
Modify the lint paths to only include Python files under "tools/deployment" by
changing the path to "tools/deployment/**/*.py" or add an --extend-exclude
option to exclude known non-code directories like "configs" or "templates" to
prevent unnecessary recursion.
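
For illustration only, the narrowing suggested above could be sketched as follows. This is a minimal sketch, not how Black or Ruff select files: the helper name `collect_python_files` and the excluded directory names are hypothetical, and both linters have their own exclusion options that should be preferred in practice.

```python
import os
from pathlib import Path
from typing import Iterable, List


def collect_python_files(root: Path, excluded_dirs: Iterable[str]) -> List[Path]:
    """Return the sorted *.py files under `root`, never descending into a
    directory whose name is in `excluded_dirs` (hypothetical names)."""
    excluded = set(excluded_dirs)
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk skips the excluded subtrees entirely.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        found.extend(Path(dirpath) / name for name in filenames if name.endswith(".py"))
    return sorted(found)
```

Pruning `dirnames` in place is what avoids walking large non-code trees; filtering the results afterwards would still pay the traversal cost.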

- "docs/conf"
cmd: |-
73 changes: 73 additions & 0 deletions tools/deployment/presto-clp/README.md
@@ -0,0 +1,73 @@
# Setup local docker stack for presto + clp

## Install docker

Follow the guide here: [docker]

# Launch clp-package
Contributor

🧹 Nitpick (assertive)

Convert secondary titles to H2 and fix punctuation for Markdown lint-cleanliness

The document currently contains five #-level headings (Setup…, Launch…, Create Docker Cluster, Use cli: and Delete docker Cluster).
Markdown-lint (MD025/MD026) flags this as multiple H1s and trailing punctuation. Keeping a single top-level H1 improves SEO and stops most linters from failing CI.

-# Launch clp-package
+## Launch clp-package
-# Create Docker Cluster
+## Create Docker Cluster
-# Use cli:
+## Use CLI
-# Delete docker Cluster
+## Delete Docker Cluster

Also applies to: 35-41, 48-57, 65-69

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

7-7: Multiple top-level headings in the same document

(MD025, single-title, single-h1)

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/README.md around lines 1 to 7 and also lines
35-41, 48-57, and 65-69, change all secondary titles currently using a single #
(H1) to use ## (H2) to ensure only one top-level H1 heading in the document.
Additionally, remove any trailing punctuation from these headings to comply with
markdown lint rules MD025 and MD026. This will improve SEO and prevent linter
failures in CI.


1. Find the clp-package for test on our official website [clp-json-v0.4.0]. Here is a sample dataset for demo testing: [postgresql dataset].

2. Untar the clp-package and the postgresql dataset.

3. Launch:

```bash
# You probably want to run a python 3.9 or newer virtual environment
sbin/start-clp.sh
```

5. Compress:

```bash
# You can also use your own dataset
sbin/compress.sh --timestamp-key 'timestamp' /path/to/postgresql.log
```

6. Use the following command to update `.env`:

```bash
scripts/set-up-config.sh /path/to/clp-json-package
```

# Create Docker Cluster

Create a local docker stack:

```bash
docker compose up
```

To create a docker stack with more than 1 worker (e.g., 3 workers):
```
docker compose up --scale presto-worker=3
```

# Use cli:

After all containers are in "Started" states (check by `docker ps`):

```bash
# On your host
docker exec -it compose-presto-coordinator-1 sh

# In presto-coordinator container
/opt/presto-cli --catalog clp --schema default --server localhost:8080
```

Example query:
```sql
SELECT * FROM default LIMIT 1;
```
Contributor

🧹 Nitpick (assertive)

Minor: qualify SQL fenced block for syntax highlighting

Add sql after the back-ticks.

-```
+```sql
 SELECT * FROM default LIMIT 1;

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

61-61: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/README.md around lines 61 to 63, the SQL code
block is missing the language specifier for syntax highlighting. Add "sql"
immediately after the opening triple backticks to enable proper SQL syntax
highlighting for the code block.


# Delete docker Cluster

```bash
docker compose down
```



[clp-json-v0.4.0]: https://github.com/y-scope/clp/releases/tag/v0.4.0
[docker]: https://docs.docker.com/engine/install
[postgresql dataset]: https://zenodo.org/records/10516402
5 changes: 5 additions & 0 deletions tools/deployment/presto-clp/coordinator-common.env
@@ -0,0 +1,5 @@
PRESTO_COORDINATOR_HTTPPORT="8080"
PRESTO_COORDINATOR_SERVICENAME="presto-coordinator"

# node.properties
PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT="production"
14 changes: 14 additions & 0 deletions tools/deployment/presto-clp/coordinator.env
@@ -0,0 +1,14 @@
# clp.properties
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_PROVIDERTYPE="mysql"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_SPLITPROVIDER="mysql"

# config.properties
PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORY="1GB"
PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORYPERNODE="1GB"

# jvm.config
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE="4G"
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_G1HEAPREGIONSIZE="32M"

# log.properties
PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL="DEBUG"
@@ -0,0 +1,9 @@
connector.name=clp
clp.metadata-provider-type=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_PROVIDERTYPE}
clp.metadata-db-url=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL}
clp.metadata-db-name=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME}
clp.metadata-db-user=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER}
clp.metadata-db-password=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD}
clp.metadata-table-prefix=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_TABLEPREFIX}
clp.split-provider-type=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_SPLITPROVIDER}
clp.metadata-filter-config=/opt/presto-server/etc/metadata-filter.json
Contributor

⚠️ Potential issue

Several placeholders have no matching definitions – coordinator fails to start.

coordinator.env currently defines only METADATA_PROVIDERTYPE and SPLITPROVIDER.
The following variables are missing and will survive as literal ${…} after envsubst, causing Presto to error at startup:

  • PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL
  • …_DATABASE_NAME
  • …_DATABASE_USER
  • …_DATABASE_PASSWORD
  • …_METADATA_TABLEPREFIX

Add them to coordinator.env (or remove the placeholders) before merging.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/clp.properties lines
2 to 9, several environment variable placeholders used for database
configuration do not have corresponding definitions in coordinator.env, causing
the coordinator to fail at startup. To fix this, add the missing environment
variables PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL,
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME,
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER,
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD, and
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_TABLEPREFIX to the
coordinator.env file with appropriate values, or alternatively remove these
placeholders from clp.properties if they are not needed.
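
A check like the one described above can be automated before the containers start. The following is a minimal sketch under stated assumptions: the helper name `find_undefined_placeholders` is hypothetical, and only the `${NAME}` placeholder form used in these templates is recognised.

```python
import re
from typing import Mapping, Set

# Matches ${NAME} placeholders, where NAME follows the usual env-var naming rules.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_undefined_placeholders(template_text: str, env: Mapping[str, str]) -> Set[str]:
    """Return the placeholder names in `template_text` that `env` does not define."""
    return {name for name in _PLACEHOLDER.findall(template_text) if name not in env}


if __name__ == "__main__":
    # Shortened, hypothetical variable names for readability.
    template = "clp.metadata-db-url=${DB_URL}\nclp.metadata-db-name=${DB_NAME}\n"
    print(sorted(find_undefined_placeholders(template, {"DB_URL": "jdbc:mysql://localhost:3306"})))
```

Running this against each template with the merged contents of the env files would list any variable that still needs a definition.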

@@ -0,0 +1,13 @@
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=${PRESTO_COORDINATOR_HTTPPORT}
query.max-memory=${PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORY}
query.max-memory-per-node=${PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORYPERNODE}
discovery-server.enabled=true
discovery.uri=http://${PRESTO_COORDINATOR_SERVICENAME}:${PRESTO_COORDINATOR_HTTPPORT}
optimizer.optimize-hash-generation=false
regex-library=RE2J
use-alternative-function-signatures=true
inline-sql-functions=false
nested-data-serialization-enabled=false
native-execution-enabled=true
@@ -0,0 +1,9 @@
-server
-Xmx${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE}
-XX:+UseG1GC
-XX:G1HeapRegionSize=${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_G1HEAPREGIONSIZE}
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
@@ -0,0 +1 @@
com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL}
Contributor

⚠️ Potential issue

Avoid quoting the log level value in ${…}

Docker .env files often wrap string values in quotes ("DEBUG"). When the template engine substitutes the value here, the resulting line becomes:

com.facebook.presto="DEBUG"

The extra quotes are not a valid log level and Presto will fall back to the default (INFO). Either:

-com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL}
+com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL:-INFO}

and remove surrounding quotes in the .env file, or strip them inside the script that performs substitution.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL}
+com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL:-INFO}
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/log.properties at
line 1, the log level value substituted from the environment variable may
include quotes, causing invalid log level syntax. To fix this, ensure that the
substitution process strips any surrounding quotes from the environment variable
value before inserting it into the config file, or update the .env file to
provide the log level without quotes.
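
If the quotes were stripped in the substitution step rather than in the .env file, the logic could look roughly like this. It is a sketch only: the helper name `strip_matching_quotes` is hypothetical, and whether the values actually arrive quoted depends on how the env file is parsed, which this thread does not confirm.

```python
def strip_matching_quotes(value: str) -> str:
    """Remove one pair of matching single or double quotes around `value`, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value
```

Only a matching pair is removed, so a value with a single stray quote is passed through unchanged rather than silently altered.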

@@ -0,0 +1,2 @@
{
}
Comment on lines +1 to +2
Contributor

🧹 Nitpick (assertive)

Provide a minimal example or document expected schema

An empty JSON object is syntactically valid, but future maintainers may be unsure what keys are supported. A commented exemplar or pointer to docs beside this file would improve clarity without affecting runtime.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/metadata-filter.json
at lines 1 to 2, the JSON file is currently empty, which may confuse future
maintainers about the expected keys. Add a minimal example JSON object with
typical keys and values or include comments or a reference to documentation
explaining the expected schema to improve clarity without impacting runtime
behavior.

Comment on lines +1 to +2
Member

@anlowee I think we should explain to the user how to configure this for the timestamp field in their logs, right? And also that it may need to be different for each dataset they compress.

Contributor Author

Can we direct them to the related presto-doc section?

Member

That's not published and is more general than they need, right? We should write a simplified section for them here.

@@ -0,0 +1,2 @@
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.id=${PRESTO_COORDINATOR_SERVICENAME}
@@ -0,0 +1,17 @@
#!/bin/sh

# Exit on error
set -e

PRESTO_CONFIG_DIR="/opt/presto-server/etc"

# Substitute environemnt variables in config template
find /configs -type f | while read -r f; do
( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
done
Contributor

⚠️ Potential issue

Security risk: Avoid shell evaluation for template processing.

The current approach using shell evaluation (( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh) poses a security risk if template files contain malicious shell code. Additionally, the unquoted variable $f could cause issues with filenames containing spaces.

Consider using envsubst for safer environment variable substitution:

-# Substitute environemnt variables in config template
-find /configs -type f | while read -r f; do
-  ( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
-done
+# Substitute environment variables in config template
+find /configs -type f | while read -r f; do
+  envsubst < "$f" > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
+done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# Substitute environemnt variables in config template
-find /configs -type f | while read -r f; do
-  ( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
-done
+# Substitute environment variables in config template
+find /configs -type f | while read -r f; do
+  envsubst < "$f" > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
+done
🧰 Tools
🪛 Shellcheck (0.10.0)

[info] 10-10: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/scripts/generate-configs.sh around
lines 8 to 11, the script uses shell evaluation to substitute environment
variables in config templates, which is a security risk and mishandles filenames
with spaces. Replace the current shell evaluation approach with the use of
`envsubst` to safely substitute environment variables without executing
arbitrary shell code, and ensure all file path variables are properly quoted to
handle spaces correctly.
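
The same idea (substitute variables without ever handing the template to a shell) can also be sketched in Python with the standard library's `string.Template`. This is an illustration, not the approach taken in the PR: the function names are hypothetical, only files directly under the template directory are rendered (the original loop uses a recursive `find`), and `string.Template` differs from `envsubst` in that `$$` is an escape for a literal `$`.

```python
import os
from pathlib import Path
from string import Template
from typing import Mapping, Optional


def render_template(template_text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute $NAME and ${NAME} from `env` (default: os.environ) without
    invoking a shell. Unknown or malformed placeholders are left untouched."""
    return Template(template_text).safe_substitute(os.environ if env is None else env)


def render_config_dir(
    template_dir: Path, output_dir: Path, env: Optional[Mapping[str, str]] = None
) -> None:
    """Render every regular file directly under `template_dir` into `output_dir`."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for template_path in sorted(template_dir.iterdir()):
        if template_path.is_file():
            rendered = render_template(template_path.read_text(), env)
            (output_dir / template_path.name).write_text(rendered)
```

Because nothing is evaluated, shell syntax such as `$(...)` inside a template is copied through as plain text instead of being executed.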


# Setup the config directory hierarchy
rm -f ${PRESTO_CONFIG_DIR}/catalog/*

# Copy over files
mv ${PRESTO_CONFIG_DIR}/clp.properties ${PRESTO_CONFIG_DIR}/catalog
42 changes: 42 additions & 0 deletions tools/deployment/presto-clp/docker-compose.yaml
@@ -0,0 +1,42 @@
version: "3.9"

services:
  presto-coordinator:
    image: "ghcr.io/y-scope/presto/coordinator:dev"
Contributor

🛠️ Refactor suggestion

Consider using stable image tags instead of 'dev'

Using the 'dev' tag in production deployments can lead to unpredictable behavior as the image content may change without notice.

Consider using specific version tags or at minimum a more stable tag:

-    image: "ghcr.io/y-scope/presto/coordinator:dev"
+    image: "ghcr.io/y-scope/presto/coordinator:${PRESTO_VERSION:-latest}"

And similarly for the worker:

-    image: "ghcr.io/y-scope/presto/prestissimo-worker:dev"
+    image: "ghcr.io/y-scope/presto/prestissimo-worker:${PRESTO_VERSION:-latest}"
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/docker-compose.yaml at line 3, the image tag
"dev" is used, which is unstable for production. Replace the "dev" tag with a
specific version number or a stable tag to ensure predictable and consistent
deployments. Also, apply the same change to the worker image tag to maintain
stability across all services.

    entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
    env_file:
      - ".env"
      - "coordinator-common.env"
      - "coordinator.env"
    volumes:
      - "./coordinator/scripts:/scripts:ro"
      - "coordinator-config:/opt/presto-server/etc"
      - "./coordinator/config-template:/configs:ro"
    networks:
      - "presto"

  presto-worker:
Member

We should define a healthcheck for the worker as well. Otherwise, users who run docker compose up --detach --wait would see the services reported as started successfully even if the worker service has silently failed (since --detach hides the output from the console), which may confuse users about the status of the cluster.

@coderabbitai file an issue

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create GitHub issue for adding healthcheck to Presto worker service

gh issue create \
  --repo "y-scope/clp" \
  --title "Add healthcheck to Presto worker service in docker-compose.yaml" \
  --body "## Description

Currently, the Presto coordinator service in the docker-compose.yaml configuration has a healthcheck defined, but the worker service does not. This creates a potential issue where users running \`docker compose up --detach --wait\` may see services reported as started successfully, while the worker service could be silently failing.

Without a healthcheck on the worker, Docker Compose cannot accurately determine if the worker service is healthy and ready, which may cause confusion about the actual status of the Presto cluster.

## Context

Raised by @junhaoliao during review of PR #1132.

- PR: https://github.com/y-scope/clp/pull/1132
- Review comment: https://github.com/y-scope/clp/pull/1132#discussion_r2462811962

## Expected Behaviour

The worker service should have an appropriate healthcheck defined so that Docker Compose can correctly report the service status when using the \`--wait\` flag." \
  --assignee "junhaoliao"

Length of output: 42


@junhaoliao, I've filed issue #1504 to track adding a healthcheck to the Presto worker service. The issue has been assigned to you.
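
For readers unfamiliar with what such a healthcheck does, the retry logic can be sketched generically: probe repeatedly until success or until a retry limit is reached. The function name `wait_until_ready` and its parameters are hypothetical, and the probe itself is left to the caller because this thread does not specify which worker endpoint or command would indicate readiness.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping between failed attempts.
    Returns True as soon as `probe` succeeds and False if every attempt fails."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Typically a refused or reset connection while the service is starting.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Injecting `sleep` keeps the function easy to test; a real caller would pass a probe that performs the actual readiness check.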

    image: "ghcr.io/y-scope/presto/prestissimo-worker:dev"
    depends_on:
      presto-coordinator:
        condition: "service_started"
    entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
    env_file:
      - ".env"
      - "coordinator-common.env"
      - "worker.env"
    volumes:
      - "./worker/scripts:/scripts:ro"
      - "worker-config:/opt/presto-server/etc"
      - "./worker/config-template:/configs:ro"
      - "${CLP_PACKAGE_ARCHIVES}:${CLP_PACKAGE_ARCHIVES}"
    networks:
      - "presto"

volumes:
  coordinator-config:
  worker-config:

networks:
  presto:
    driver: "bridge"
1 change: 1 addition & 0 deletions tools/deployment/presto-clp/scripts/.gitignore
@@ -0,0 +1 @@
/.venv/
120 changes: 120 additions & 0 deletions tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py
@@ -0,0 +1,120 @@
import argparse
import logging
import sys
from pathlib import Path
from typing import Optional

import yaml

# Set up console logging
logging_console_handler = logging.StreamHandler()
logging_formatter = logging.Formatter(
    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
)
logging_console_handler.setFormatter(logging_formatter)

# Set up root logger
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(logging_console_handler)

# Create logger
logger = logging.getLogger(__name__)
Comment on lines +10 to +23
Contributor

🧹 Nitpick (assertive)

Consider using structured logging configuration

The logging setup could be more maintainable using dictionary configuration or the logging.basicConfig approach.

Consider simplifying the logging setup:

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
-
-# Create logger
-logger = logging.getLogger(__name__)
+# Set up logging
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+    datefmt="%Y-%m-%dT%H:%M:%S"
+)
+logger = logging.getLogger(__name__)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
-
-# Create logger
-logger = logging.getLogger(__name__)
+# Set up logging
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+    datefmt="%Y-%m-%dT%H:%M:%S"
+)
+logger = logging.getLogger(__name__)
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/generate-user-env-vars-file.py around
lines 10 to 23, the current logging setup manually creates handlers and
formatters, which can be simplified. Replace this manual setup with a structured
logging configuration using logging.basicConfig or a dictionary configuration to
define format, level, and handlers in a more maintainable and concise way.
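
The dictionary-configuration alternative mentioned above might look like the following. This is a sketch under the assumption that the same format, date format, and INFO level are wanted; the formatter and handler key `console` is an arbitrary label chosen for the example.

```python
import logging
import logging.config

LOGGING_CONFIG = {
    "version": 1,
    # Keep loggers that other modules created before this call.
    "disable_existing_loggers": False,
    "formatters": {
        "console": {
            "format": "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
            "datefmt": "%Y-%m-%dT%H:%M:%S",
        }
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "console"}
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}

logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)
```

Compared with `logging.basicConfig`, the dictionary form is more verbose but can be loaded from a file and extended with further handlers without touching code.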



def main(argv=None) -> int:
    if argv is None:
        argv = sys.argv

    args_parser = argparse.ArgumentParser(
        description="Generates an environment variables file for any user-configured properties."
    )
    args_parser.add_argument(
        "--clp-package-dir", help="CLP package directory.", required=True, type=Path
    )
    args_parser.add_argument(
        "--output-file", help="Path for the environment variables file.", required=True, type=Path
    )

    parsed_args = args_parser.parse_args(argv[1:])
    clp_package_dir: Path = parsed_args.clp_package_dir.resolve()
    output_file: Path = parsed_args.output_file

    clp_config_file_path = clp_package_dir / "etc" / "clp-config.yml"
    with open(clp_config_file_path, "r") as clp_config_file:
        clp_config = yaml.safe_load(clp_config_file)

    database_host = _get_config_value(clp_config, "database.host", "localhost")
    database_port = _get_config_value(clp_config, "database.port", 3306)
    database_name = _get_config_value(clp_config, "database.name", "clp-db")

    clp_archive_output_storage_type = _get_config_value(
        clp_config, "archive_output.storage.type", "fs"
    )
    if "fs" != clp_archive_output_storage_type:
        logger.error(
            "Expected CLP's archive_output.storage.type to be fs but found '%s'. Presto currently only supports"
            " reading archives from the fs storage type.",
            clp_archive_output_storage_type,
        )

    clp_archives_dir = _get_config_value(
        clp_config,
        "archive_output.storage.directory",
        str(clp_package_dir / "var" / "data" / "archives"),
    )

    credentials_file_path = clp_package_dir / "etc" / "credentials.yml"
    with open(credentials_file_path, "r") as credentials_file:
        credentials = yaml.safe_load(credentials_file)

    database_user = _get_config_value(credentials, "database.user")
    database_password = _get_config_value(credentials, "database.password")
    if not database_user or not database_password:
        logger.error(
            "database.user and database.password must be specified in '%s'.", credentials_file_path
Contributor Author

Shall we tell the user to launch clp-package first? That will generate the credentials.

        )
        return 1

    with open(output_file, "w") as env_file:
        env_file.write(
            "PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL"
            f"=jdbc:mysql://{database_host}:{database_port}\n"
        )
        env_file.write(
            f"PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME={database_name}\n"
        )
        env_file.write(
            f"PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER={database_user}\n"
        )
        env_file.write(
            f"PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD={database_password}\n"
        )
        env_file.write(f"PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_TABLEPREFIX=clp_\n")
        env_file.write(f"CLP_PACKAGE_ARCHIVES={clp_archives_dir}\n")

    return 0


def _get_config_value(config: dict, key: str, default_value: Optional[str] = None) -> str:
    """
    Gets the value corresponding to `key` from `config` if it exists.
    :param config: The config.
    :param key: The key to look for in the config, in dot notation (e.g., "database.host").
    :param default_value: The value to return if `key` doesn't exist in `config`.
    :return: The value corresponding to `key` if it exists, otherwise `default_value`.
    """

    keys = key.split(".")
    value = config
    for k in keys:
        if isinstance(value, dict) and k in value:
            value = value[k]
        else:
            return default_value
    return value


if "__main__" == __name__:
    sys.exit(main(sys.argv))
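
To make the dot-notation lookup concrete, here is a small worked example. It restates the lookup under the hypothetical name `get_config_value` so that it runs on its own, and the sample configs are made up for illustration.

```python
from typing import Any, Optional


def get_config_value(config: dict, key: str, default_value: Optional[Any] = None) -> Any:
    """Same dot-notation lookup as `_get_config_value`, restated for this example."""
    value: Any = config
    for k in key.split("."):
        if isinstance(value, dict) and k in value:
            value = value[k]
        else:
            return default_value
    return value


# Hypothetical config, shaped like the keys the script reads.
sample_config = {"database": {"host": "db.example", "port": 3306}}

# An existing key returns the configured value.
assert get_config_value(sample_config, "database.host", "localhost") == "db.example"
# A missing leaf falls back to the default.
assert get_config_value(sample_config, "database.name", "clp-db") == "clp-db"
# A path that runs through a non-dict value also falls back to the default.
assert get_config_value(sample_config, "database.host.tld", "n/a") == "n/a"
# A key that is present but set to null returns None, not the default.
assert get_config_value({"database": {"host": None}}, "database.host", "localhost") is None
```

The last case may matter for YAML files where a key is listed with an empty value: the default is only used when the key is absent.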
1 change: 1 addition & 0 deletions tools/deployment/presto-clp/scripts/requirements.txt
@@ -0,0 +1 @@
PyYAML
Contributor

🛠️ Refactor suggestion

Pin dependency version to ensure deterministic builds

Unversioned requirements can silently break the deployment if PyYAML ships a backwards-incompatible update. Pin to a tested version range, e.g.:

-PyYAML
+PyYAML>=6.0,<7.0

and revisit periodically.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-PyYAML
+PyYAML>=6.0,<7.0
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/requirements.txt at line 1, the PyYAML
dependency is unversioned, which can cause unpredictable build issues if a new
incompatible version is released. Fix this by specifying a version range for
PyYAML that is known to work, for example, pin it to a specific version or a
range like PyYAML>=6.0,<7.0, and update this periodically to maintain
compatibility.
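
To make concrete which releases a range such as `>=6.0,<7.0` admits, here is a rough sketch. It assumes purely numeric dotted versions and is a simplification, not a PEP 440 implementation (pip's real matching also handles pre-releases, and treats `6` and `6.0` as equal, which this sketch does not). The helper names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Turn a purely numeric dotted version such as "6.0.2" into (6, 0, 2)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_range(version: str, lower: str = "6.0", upper: str = "7.0") -> bool:
    """Return True if lower <= version < upper, mirroring `>=6.0,<7.0`."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)
```

With these defaults, any 6.x release is accepted while 5.x and 7.x releases are rejected, which is the behaviour the suggested pin aims for.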

28 changes: 28 additions & 0 deletions tools/deployment/presto-clp/scripts/set-up-config.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

set -eu
set -o pipefail

script_dir=$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

cUsage="Usage: ${BASH_SOURCE[0]} <clp-package-dir>"
if [ "$#" -lt 1 ] ; then
    echo "$cUsage"
    exit
fi
Contributor

🧹 Nitpick (assertive)

Return non-zero on incorrect usage

exit without an argument returns the status of the previous command (likely 0), so the script “succeeds” even when called incorrectly. Use a non-zero code.

-    echo "$cUsage"
-    exit
+    echo "$cUsage" >&2
+    exit 1
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/set-up-config.sh at lines 9 to 12, the
script uses 'exit' without an argument on incorrect usage, which returns the
previous command's status (likely 0). Change the 'exit' command to 'exit 1' to
ensure the script returns a non-zero status code indicating failure when called
incorrectly.
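
The exit-status behaviour described in this comment can be checked with a small sketch. It assumes a POSIX `sh` is available on `PATH`; the usage string is a placeholder and the helper name is hypothetical.

```python
import subprocess


def exit_status(script: str) -> int:
    """Run `script` with the system's POSIX `sh` and return its exit status."""
    return subprocess.run(["sh", "-c", script], capture_output=True).returncode


# Bare `exit` reuses the status of the preceding `echo`, which succeeds.
bare_exit = exit_status('echo "Usage: ..."; exit')

# An explicit code makes the failure visible to callers.
explicit_exit = exit_status('echo "Usage: ..." >&2; exit 1')
```

Here `bare_exit` is 0 and `explicit_exit` is 1, so only the second form lets a calling script or CI job detect the incorrect invocation.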

clp_package_dir=$1

Comment on lines +13 to +14
Contributor

🧹 Nitpick (assertive)

Quote positional parameter when capturing it

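Bash doesn't word-split the right-hand side of a plain assignment, so paths with spaces already survive here; quoting is a stylistic safeguard that keeps the line safe if it's later reused in a context where splitting applies.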

-clp_package_dir=$1
+clp_package_dir="$1"
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/set-up-config.sh at line 13, the
positional parameter $1 is assigned to clp_package_dir without quotes. Bash does
not word-split assignments, so this is not a bug, but enclose $1 in double quotes
as a safeguard in case the expansion is later reused where splitting applies.
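
As a side note on where word splitting actually applies, here is a minimal sketch, assuming a POSIX `sh` is on `PATH`. The path and the helper name are hypothetical.

```python
import subprocess


def run_sh(script: str, arg: str) -> str:
    """Run `script` under POSIX `sh` with `arg` as $1 and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", script, "sh", arg], capture_output=True, text=True, check=True
    )
    return result.stdout


path_with_spaces = "/tmp/my clp package"  # hypothetical path

# Unquoted assignment: the value is stored intact.
assigned = run_sh('dir=$1; printf "%s" "$dir"', path_with_spaces)

# Unquoted expansion as command arguments: split into three words.
unquoted_words = run_sh('dir=$1; set -- $dir; printf "%s" "$#"', path_with_spaces)

# Quoted expansion: still a single word.
quoted_words = run_sh('dir=$1; set -- "$dir"; printf "%s" "$#"', path_with_spaces)
```

The assignment preserves the spaces either way; the place where quoting matters is every later use of the variable, which the script already quotes (e.g., `"${clp_package_dir}"`).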

venv_dir=${script_dir}/.venv
if [ ! -d "${venv_dir}" ]; then
    echo "Setting up Python venv in '${venv_dir}'..."
    python3 -m venv "${script_dir}/.venv"
fi
source "${script_dir}/.venv/bin/activate"

echo "Installing required Python packages..."
pip3 install -r "${script_dir}/requirements.txt"

echo "Generating environment variables file for user-configured properties..."
python3 "${script_dir}/generate-user-env-vars-file.py" \
    --clp-package-dir "${clp_package_dir}" \
    --output-file "${script_dir}/../.env"
Member

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if .gitignore exists at tools/deployment/presto-clp/.gitignore
if [ -f "tools/deployment/presto-clp/.gitignore" ]; then
    echo "=== Existing .gitignore file found ==="
    cat tools/deployment/presto-clp/.gitignore
else
    echo "=== No .gitignore file exists at tools/deployment/presto-clp/.gitignore ==="
fi

echo ""
echo "=== Creating GitHub issue ==="

# Create the issue
gh issue create \
    --title "Add .env file to .gitignore in tools/deployment/presto-clp" \
    --body "The generated \`.env\` file in \`tools/deployment/presto-clp/\` should be added to a \`.gitignore\` file to prevent it from being committed to the repository.

**Context:**
The \`set-up-config.sh\` script generates a \`.env\` file containing environment variables at \`tools/deployment/presto-clp/.env\`. This file may contain sensitive information (database credentials, etc.) and should not be tracked by Git.

**Requested by:** @junhaoliao
**Related PR:** https://github.com/y-scope/clp/pull/1132
**Comment:** https://github.com/y-scope/clp/pull/1132#discussion_r2433423044" \
    --assignee junhaoliao

Length of output: 149


@junhaoliao, I've created issue #1422 to track adding the .env file to .gitignore in the tools/deployment/presto-clp/ directory. The issue has been assigned to you and includes all the relevant context and backlinks to this PR and comment.


🧠 Learnings used
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.

4 changes: 4 additions & 0 deletions tools/deployment/presto-clp/worker.env
@@ -0,0 +1,4 @@
PRESTO_WORKER_HTTPPORT="8080"

# node.properties
PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION="worker-location"
Contributor

🧹 Nitpick (assertive)

Rename variable for readability & future-proofing

Using PRESTO_WORKER_HTTPPORT without a separator makes the variable name harder to scan and deviates from the common *_HTTP_PORT pattern used elsewhere in the repo (e.g., PRESTO_COORDINATOR_HTTP_PORT). Consider renaming:

-PRESTO_WORKER_HTTPPORT="8080"
+PRESTO_WORKER_HTTP_PORT="8080"

and update every reference in the generation scripts accordingly.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker.env lines 1 to 4, rename the environment
variable PRESTO_WORKER_HTTPPORT to PRESTO_WORKER_HTTP_PORT to improve
readability and maintain consistency with other variables like
PRESTO_COORDINATOR_HTTP_PORT. After renaming, update all references to this
variable in the generation scripts to match the new name.

@@ -0,0 +1 @@
connector.name=clp
Member

We should set system-memory-gb to about 90% of the deployment machine's available memory, per https://prestodb.io/docs/current/presto_cpp/properties.html#system-memory-gb

@coderabbitai file an issue

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Create a GitHub issue for the system-memory-gb configuration
gh issue create \
  --title "Configure system-memory-gb for Presto worker deployment" \
  --body "## Description

The Presto worker configuration should include the \`system-memory-gb\` property set to approximately 90% of the available machine memory according to the [Presto documentation](https://prestodb.io/docs/current/presto_cpp/properties.html#system-memory-gb).

Currently, the worker configuration template at \`tools/deployment/presto-clp/worker/config-template/config.properties\` does not include this setting, which may lead to suboptimal memory usage.

## Suggested Action

Add the \`system-memory-gb\` property to the worker configuration template, either:
- As a hardcoded value with documentation explaining how to adjust it
- As an environment variable placeholder for flexible configuration

## References

- PR: https://github.com/y-scope/clp/pull/1132
- Comment: https://github.com/y-scope/clp/pull/1132#discussion_r2462809174
- Documentation: https://prestodb.io/docs/current/presto_cpp/properties.html#system-memory-gb

Requested by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1503 to track the system-memory-gb configuration requirement for the Presto worker deployment. The issue includes the documentation reference and suggests adding this property to the worker configuration template with a value set to approximately 90% of available memory.
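
One way the 90% value could be derived is sketched below. This is an illustration under stated assumptions, not the approach chosen in the issue: it uses total physical memory (rather than currently free memory), treats a "GB" as 2^30 bytes, and rounds down. The function names are hypothetical.

```python
import os

BYTES_PER_GIB = 1024**3


def system_memory_gb(total_bytes: int, percent: int = 90) -> int:
    """Return `percent` of `total_bytes`, rounded down to whole GiB."""
    return total_bytes * percent // 100 // BYTES_PER_GIB


def total_physical_memory_bytes() -> int:
    """Total physical memory as reported by sysconf (Linux and most Unix systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

A set-up script could then emit a line such as `system-memory-gb=<value>` using `system_memory_gb(total_physical_memory_bytes())`, or expose the value through an environment variable placeholder as the issue suggests.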

@@ -0,0 +1,6 @@
discovery.uri=http://${PRESTO_COORDINATOR_SERVICENAME}:${PRESTO_COORDINATOR_HTTPPORT}
presto.version=REPLACE_ME
Contributor

🧹 Nitpick (assertive)

Prefer an env placeholder over hard-coded REPLACE_ME for Presto version.

Replacing REPLACE_ME via sed in the bootstrap script works, but using an env var keeps the template declarative and avoids sed-ordering bugs:

-presto.version=REPLACE_ME
+presto.version=${PRESTO_VERSION}

The script can then simply export PRESTO_VERSION=$(curl …) and rely on envsubst.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker/config-template/config.properties at line
2, replace the hard-coded "REPLACE_ME" for presto.version with an environment
variable placeholder like "${PRESTO_VERSION}". This change allows the bootstrap
script to export PRESTO_VERSION dynamically and use envsubst for substitution,
making the template declarative and avoiding sed ordering issues.
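
The placeholder approach can be sketched with Python's `string.Template`, which understands the same `${NAME}` syntax. This is only an analogy for what `envsubst` would do in the bootstrap script; all values below are hypothetical.

```python
from string import Template

# Two lines from the worker config template; the second uses the placeholder
# proposed in the review instead of REPLACE_ME.
config_template = Template(
    "discovery.uri=http://${PRESTO_COORDINATOR_SERVICENAME}:${PRESTO_COORDINATOR_HTTPPORT}\n"
    "presto.version=${PRESTO_VERSION}\n"
)

# Hypothetical values, for illustration only.
values = {
    "PRESTO_COORDINATOR_SERVICENAME": "presto-coordinator",
    "PRESTO_COORDINATOR_HTTPPORT": "8080",
    "PRESTO_VERSION": "0.0.0-example",
}

rendered = config_template.substitute(values)
```

One difference worth keeping in mind: `Template.substitute` raises `KeyError` for a missing variable, whereas `envsubst` replaces an unset variable with an empty string, so a script relying on `envsubst` may want to verify that `PRESTO_VERSION` is set before rendering.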

http-server.http.port=${PRESTO_WORKER_HTTPPORT}
shutdown-onset-sec=1
register-test-functions=false
runtime-metrics-collection-enabled=false
@@ -0,0 +1,4 @@
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.internal-address=REPLACE_ME
node.location=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION}
node.id=REPLACE_ME
Contributor

⚠️ Potential issue

Environment variable name may be misleading for the worker template.

node.environment expands ${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}, which is defined in the coordinator env file. That couples the worker to the coordinator implementation detail and will break if someone supplies only worker.env.
Rename to a worker-scoped variable or alias it in worker.env to avoid hidden dependency.

-node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
+node.environment=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_ENVIRONMENT}
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker/config-template/node.properties lines 1 to
4, the environment variable used for node.environment is referencing a
coordinator-specific variable, creating an unintended dependency. Change the
variable to a worker-specific environment variable by renaming it to something
like PRESTO_WORKER_CONFIG_NODEPROPERTIES_ENVIRONMENT or create an alias in the
worker.env file that maps to the coordinator variable to avoid coupling and
ensure the worker config is self-contained.
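
The hidden dependency can be surfaced mechanically. The sketch below scans a template for `${NAME}` placeholders and reports those not defined in a given env file; the template text and the `worker.env` variable names are copied from this diff, while the helper name and regex are assumptions for illustration.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_variables(template_text: str, defined: set) -> set:
    """Return placeholder names used in `template_text` but absent from `defined`."""
    return set(PLACEHOLDER.findall(template_text)) - defined


# node.properties template and worker.env variable names, as shown in this diff.
node_properties = (
    "node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}\n"
    "node.internal-address=REPLACE_ME\n"
    "node.location=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION}\n"
    "node.id=REPLACE_ME\n"
)
worker_env_names = {"PRESTO_WORKER_HTTPPORT", "PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION"}

undefined = missing_variables(node_properties, worker_env_names)
```

With only `worker.env` supplied, `undefined` contains `PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT`, which is the coupling this comment points out.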

@@ -0,0 +1 @@
mutable-config=true