Skip to content
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
00efbc1
feat(deployment): Containerize CLP tools via Docker Compose.
junhaoliao Oct 28, 2025
ed3dde2
revert unrelated change
junhaoliao Oct 28, 2025
d307457
fix(docker): Simplify clp-runtime service configuration in Docker Com…
junhaoliao Oct 28, 2025
f871424
debloat the package by avoiding duplicating assets already in the con…
junhaoliao Oct 28, 2025
b03539a
fix(env): Update container image reference format in .common-env.sh
junhaoliao Oct 28, 2025
1b65afe
feat(core): Add resolve_host_path utility for path resolution in cont…
junhaoliao Oct 29, 2025
3b4d44c
lint
junhaoliao Oct 29, 2025
347b181
Merge branch 'main' into containerize-tools
junhaoliao Oct 29, 2025
74a3921
refactor(utils): Simplify path validation by removing host mount para…
junhaoliao Oct 29, 2025
885e9b5
revert unrelated change
junhaoliao Oct 29, 2025
0db2f30
fix package docker workflow
junhaoliao Oct 29, 2025
695ed77
lint
junhaoliao Oct 29, 2025
4f21e0f
docs(dev-docs): Update task command name for building packages.
junhaoliao Oct 29, 2025
a7c81a3
docs: Remove Python section and streamline Docker Desktop requirements
junhaoliao Oct 29, 2025
e6e8217
Merge branch 'main' into containerize-tools
junhaoliao Oct 29, 2025
be2d200
refactor(clp-py-utils): Remove redundant debug print in config direct…
junhaoliao Oct 29, 2025
d910944
chore(scripts): Add shellcheck directive comments for sourcing common…
junhaoliao Oct 29, 2025
163921b
fix SC2155
junhaoliao Oct 29, 2025
3f5c302
fix(scripts): Ensure CLP_DOCKER_SOCK_PATH is set only when undefined.
junhaoliao Oct 29, 2025
65d421f
fix(init): invert CLP_HOME check to use -z for unset detection
junhaoliao Oct 29, 2025
83c581c
fix(scripts): Remove exit 1; Remove redundant stderr redirection in …
junhaoliao Oct 29, 2025
e03a361
chore(clp-py-utils): Remove unused os import from core module.
junhaoliao Oct 29, 2025
299795f
Update volume definitions in docker-compose.runtime.yaml to use the e…
junhaoliao Oct 29, 2025
0ccd155
fix(deployment): Add default values and validations in docker-compose…
junhaoliao Oct 29, 2025
beacdcf
fix(scripts): Adjust formatting for better readability in .common-env…
junhaoliao Oct 29, 2025
c7141e7
move `>&2` before echo
junhaoliao Oct 29, 2025
25c2450
fix(scripts): Resolve host paths for config files during load operati…
junhaoliao Oct 29, 2025
0ac4501
fix(clp-config): Add `use_host_mount` parameter to validation methods…
junhaoliao Oct 29, 2025
479fc81
lint
junhaoliao Oct 29, 2025
9921d48
fix(ci): Update package build step name to indicate the package image…
junhaoliao Oct 30, 2025
a2dd5cf
refactor(clp-py-utils): Rename CONTAINER_HOST_ROOT_DIR -> CONTAINER_D…
junhaoliao Oct 30, 2025
8d6deb6
refactor(clp-py-utils): Rename resolve_host_path -> resolve_host_path…
junhaoliao Oct 30, 2025
c6c5164
Update docstring to clarify `path` translation return value
junhaoliao Oct 30, 2025
3b44945
refactor(clp-py-utils): Rename path -> host_path and resolved -> tran…
junhaoliao Oct 30, 2025
4b6f337
refactor(taskfile): Move NODE_ENV definition to webui task.
junhaoliao Oct 30, 2025
bb90b69
refactor(taskfile): Reorder package-build-deps dependencies for clarity.
junhaoliao Oct 30, 2025
ae825c0
refactor(taskfile): Change "{{.G_BUILD_DIR}}/{{.TASK}}.md5" -> G_WEBU…
junhaoliao Oct 30, 2025
9a66669
Add back package-template to package dependencies.
junhaoliao Oct 30, 2025
6983943
fix(clp-py-utils): Explain in docs that only single-level symlink are…
junhaoliao Oct 30, 2025
e372661
Merge branch 'main' into containerize-tools
junhaoliao Oct 30, 2025
bee1b79
Reorder dependencies
junhaoliao Oct 30, 2025
b0e8abf
shfmt - Apply suggestions from code review
junhaoliao Oct 31, 2025
5172963
avoid hardcoding names - Apply suggestions from code review
junhaoliao Oct 31, 2025
bb5d8d2
shfmt
junhaoliao Nov 3, 2025
95a8e1f
Use standard form for stderr redirection
junhaoliao Nov 3, 2025
395e42f
shfmt
junhaoliao Nov 3, 2025
3af2e4d
Rename dir -> compose_plugin_dir to avoid overlap with the command dir
junhaoliao Nov 3, 2025
198f398
add docs for resolve_host_path_in_container
junhaoliao Nov 3, 2025
1d02051
Merge branch 'main' into containerize-tools
junhaoliao Nov 4, 2025
3063e85
format - Apply suggestions from code review
junhaoliao Nov 4, 2025
306997a
split volumes into groups and add docs
junhaoliao Nov 4, 2025
4f6c5b4
Move resolved webui settings paths declarations closer to their usages
junhaoliao Nov 4, 2025
ef627c6
format - Apply suggestions from code review
junhaoliao Nov 4, 2025
75ada35
Merge branch 'main' into containerize-tools
junhaoliao Nov 4, 2025
969fde0
Clarify `HOME` forwarding into the container - Apply suggestions from…
junhaoliao Nov 4, 2025
732dc92
Merge branch 'main' into containerize-tools
junhaoliao Nov 4, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .github/workflows/clp-artifact-build.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -548,7 +548,8 @@ jobs:
${{needs.filter-relevant-changes.outputs.ubuntu_jammy_image_changed == 'false'
|| (github.event_name != 'pull_request' && github.ref == 'refs/heads/main')}}
run_command: >-
CLP_CPP_MAX_PARALLELISM_PER_BUILD_TASK=$(getconf _NPROCESSORS_ONLN) task package
CLP_CPP_MAX_PARALLELISM_PER_BUILD_TASK=$(getconf _NPROCESSORS_ONLN)
task package-build-deps

- uses: "./.github/actions/clp-build-runtime-image"
with:
Expand Down
76 changes: 50 additions & 26 deletions components/clp-package-utils/clp_package_utils/controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@
get_datasets_table_name,
get_files_table_name,
)
from clp_py_utils.core import resolve_host_path

from clp_package_utils.general import (
check_docker_dependencies,
Expand Down Expand Up @@ -119,9 +120,12 @@ def _set_up_env_for_database(self) -> EnvVarsDict:
logs_dir = self._clp_config.logs_directory / component_name
validate_db_config(self._clp_config, conf_logging_file, data_dir, logs_dir)

data_dir.mkdir(exist_ok=True, parents=True)
logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(data_dir, logs_dir)
resolved_data_dir = resolve_host_path(data_dir)
resolved_logs_dir = resolve_host_path(logs_dir)

resolved_data_dir.mkdir(exist_ok=True, parents=True)
resolved_logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(resolved_data_dir, resolved_logs_dir)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -166,8 +170,10 @@ def _set_up_env_for_queue(self) -> EnvVarsDict:
logs_dir = self._clp_config.logs_directory / component_name
validate_queue_config(self._clp_config, logs_dir)

logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(logs_dir)
resolved_logs_dir = resolve_host_path(logs_dir)

resolved_logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(resolved_logs_dir)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -204,9 +210,12 @@ def _set_up_env_for_redis(self) -> EnvVarsDict:
logs_dir = self._clp_config.logs_directory / component_name
validate_redis_config(self._clp_config, conf_file, data_dir, logs_dir)

data_dir.mkdir(exist_ok=True, parents=True)
logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(data_dir, logs_dir)
resolved_data_dir = resolve_host_path(data_dir)
resolved_logs_dir = resolve_host_path(logs_dir)

resolved_data_dir.mkdir(exist_ok=True, parents=True)
resolved_logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(resolved_data_dir, resolved_logs_dir)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -252,9 +261,12 @@ def _set_up_env_for_results_cache(self) -> EnvVarsDict:
logs_dir = self._clp_config.logs_directory / component_name
validate_results_cache_config(self._clp_config, conf_file, data_dir, logs_dir)

data_dir.mkdir(exist_ok=True, parents=True)
logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(data_dir, logs_dir)
resolved_data_dir = resolve_host_path(data_dir)
resolved_logs_dir = resolve_host_path(logs_dir)

resolved_data_dir.mkdir(exist_ok=True, parents=True)
resolved_logs_dir.mkdir(exist_ok=True, parents=True)
_chown_paths_if_root(resolved_data_dir, resolved_logs_dir)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -291,7 +303,8 @@ def _set_up_env_for_compression_scheduler(self) -> EnvVarsDict:
logger.info(f"Setting up environment for {component_name}...")

logs_dir = self._clp_config.logs_directory / component_name
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand All @@ -314,7 +327,8 @@ def _set_up_env_for_query_scheduler(self) -> EnvVarsDict:
logger.info(f"Setting up environment for {component_name}...")

logs_dir = self._clp_config.logs_directory / component_name
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand All @@ -336,7 +350,8 @@ def _set_up_env_for_compression_worker(self, num_workers: int) -> EnvVarsDict:
logger.info(f"Setting up environment for {component_name}...")

logs_dir = self._clp_config.logs_directory / component_name
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -365,7 +380,8 @@ def _set_up_env_for_query_worker(self, num_workers: int) -> EnvVarsDict:
logger.info(f"Setting up environment for {component_name}...")

logs_dir = self._clp_config.logs_directory / component_name
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand All @@ -392,7 +408,8 @@ def _set_up_env_for_reducer(self, num_workers: int) -> EnvVarsDict:
logger.info(f"Setting up environment for {component_name}...")

logs_dir = self._clp_config.logs_directory / component_name
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -426,8 +443,12 @@ def _set_up_env_for_webui(self, container_clp_config: CLPConfig) -> EnvVarsDict:
server_settings_json_path = (
self._clp_home / "var" / "www" / "webui" / "server" / "dist" / "settings.json"
)
resolved_client_settings_json_path = resolve_host_path(client_settings_json_path)
resolved_server_settings_json_path = resolve_host_path(server_settings_json_path)
validate_webui_config(
self._clp_config, client_settings_json_path, server_settings_json_path
self._clp_config,
client_settings_json_path,
server_settings_json_path,
)

# Read, update, and write back client's and server's settings.json
Expand Down Expand Up @@ -455,9 +476,9 @@ def _set_up_env_for_webui(self, container_clp_config: CLPConfig) -> EnvVarsDict:
"SqlDbCompressionJobsTableName": COMPRESSION_JOBS_TABLE_NAME,
}
client_settings_json = self._read_and_update_settings_json(
client_settings_json_path, client_settings_json_updates
resolved_client_settings_json_path, client_settings_json_updates
)
with open(client_settings_json_path, "w") as client_settings_json_file:
with open(resolved_client_settings_json_path, "w") as client_settings_json_file:
client_settings_json_file.write(json.dumps(client_settings_json))
Comment on lines +479 to 482
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Prefer json.dump with indentation for readability.

Small DX improvement for troubleshooting settings.

-        with open(resolved_client_settings_json_path, "w") as client_settings_json_file:
-            client_settings_json_file.write(json.dumps(client_settings_json))
+        with open(resolved_client_settings_json_path, "w") as client_settings_json_file:
+            json.dump(client_settings_json, client_settings_json_file, indent=2, sort_keys=True)
@@
-        with open(resolved_server_settings_json_path, "w") as settings_json_file:
-            settings_json_file.write(json.dumps(server_settings_json))
+        with open(resolved_server_settings_json_path, "w") as settings_json_file:
+            json.dump(server_settings_json, settings_json_file, indent=2, sort_keys=True)

Also applies to: 536-537

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/controller.py around lines
479-482 (and also apply the same change at lines ~536-537), replace the manual
json.dumps write with json.dump using an indent parameter for readability; open
the file as currently done but call json.dump(client_settings_json,
client_settings_json_file, indent=2) (or another suitable indent) so the written
JSON is pretty-printed and easier to troubleshoot.


server_settings_json_updates = {
Expand Down Expand Up @@ -510,9 +531,9 @@ def _set_up_env_for_webui(self, container_clp_config: CLPConfig) -> EnvVarsDict:
server_settings_json_updates["PrestoPort"] = None

server_settings_json = self._read_and_update_settings_json(
server_settings_json_path, server_settings_json_updates
resolved_server_settings_json_path, server_settings_json_updates
)
with open(server_settings_json_path, "w") as settings_json_file:
with open(resolved_server_settings_json_path, "w") as settings_json_file:
settings_json_file.write(json.dumps(server_settings_json))

env_vars = EnvVarsDict()
Expand Down Expand Up @@ -544,7 +565,8 @@ def _set_up_env_for_mcp_server(self) -> EnvVarsDict:

logs_dir = self._clp_config.logs_directory / component_name
validate_mcp_server_config(self._clp_config, logs_dir)
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -581,7 +603,8 @@ def _set_up_env_for_garbage_collector(self) -> EnvVarsDict:
logger.info(f"Setting up environment for {component_name}...")

logs_dir = self._clp_config.logs_directory / component_name
logs_dir.mkdir(parents=True, exist_ok=True)
resolved_logs_dir = resolve_host_path(logs_dir)
resolved_logs_dir.mkdir(parents=True, exist_ok=True)

env_vars = EnvVarsDict()

Expand Down Expand Up @@ -804,13 +827,14 @@ def get_or_create_instance_id(clp_config: CLPConfig) -> str:
:return: The instance ID.
"""
instance_id_file_path = clp_config.logs_directory / "instance-id"
resolved_instance_id_file_path = resolve_host_path(instance_id_file_path)

if instance_id_file_path.exists():
with open(instance_id_file_path, "r") as f:
if resolved_instance_id_file_path.exists():
with open(resolved_instance_id_file_path, "r") as f:
instance_id = f.readline()
else:
instance_id = str(uuid.uuid4())[-4:]
with open(instance_id_file_path, "w") as f:
with open(resolved_instance_id_file_path, "w") as f:
f.write(instance_id)

return instance_id
Comment on lines 841 to 852
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Trim minor style issue and reduce collision risk for instance IDs.

  • open(..., "r") mode is redundant; default is read (UP015).
  • Optional: use 8 hex chars to reduce collisions.
-    if resolved_instance_id_file_path.exists():
-        with open(resolved_instance_id_file_path, "r") as f:
+    if resolved_instance_id_file_path.exists():
+        with open(resolved_instance_id_file_path) as f:
             instance_id = f.readline()
     else:
-        instance_id = str(uuid.uuid4())[-4:]
+        instance_id = uuid.uuid4().hex[-8:]
         with open(resolved_instance_id_file_path, "w") as f:
             f.write(instance_id)
🧰 Tools
🪛 Ruff (0.14.2)

835-835: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/controller.py around lines 831
to 842, remove the redundant "r" mode when opening the instance-id file and trim
the read value, and generate a longer 8-hex-character instance id to reduce
collision risk: when reading, open with default read mode (omit "r") and use
.readline().strip() to remove trailing newlines; when creating a new id, use
uuid.uuid4().hex and take the last 8 characters (or otherwise produce 8 hex
chars) before writing it to the resolved path using the default write mode.

Expand Down
32 changes: 20 additions & 12 deletions components/clp-package-utils/clp_package_utils/general.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@
get_config_value,
make_config_path_absolute,
read_yaml_config_file,
resolve_host_path,
validate_path_could_be_dir,
)
from strenum import KebabCaseStrEnum
Expand Down Expand Up @@ -355,7 +356,8 @@ def dump_container_config(
"""
config_file_path_on_host = clp_config.logs_directory / config_filename
config_file_path_on_container = container_clp_config.logs_directory / config_filename
with open(config_file_path_on_host, "w") as f:
resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
with open(resolved_config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Create parent directory before writing the container config file.

Without this, first-run writes fail if the logs dir hierarchy is missing.

Apply this diff:

-    resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
-    with open(resolved_config_file_path_on_host, "w") as f:
+    resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
+    resolved_config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
+    with open(resolved_config_file_path_on_host, "w") as f:
         yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
with open(resolved_config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
resolved_config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
with open(resolved_config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 359 to
361, the code opens the resolved config file path for writing without ensuring
its parent directory exists; create the parent directory before opening the file
by resolving the path (use Path or os.path), computing parent_dir =
resolved_path.parent (or os.path.dirname), and calling os.makedirs(parent_dir,
exist_ok=True) (skip if parent_dir is empty or root), then open and write the
YAML as before so first-run writes won't fail when the logs dir hierarchy is
missing.


return config_file_path_on_container, config_file_path_on_host
Expand Down Expand Up @@ -427,8 +429,9 @@ def validate_config_key_existence(config, key):
def load_config_file(
config_file_path: pathlib.Path, default_config_file_path: pathlib.Path, clp_home: pathlib.Path
):
if config_file_path.exists():
raw_clp_config = read_yaml_config_file(config_file_path)
resolved_config_file_path = resolve_host_path(config_file_path)
if resolved_config_file_path.exists():
raw_clp_config = read_yaml_config_file(resolved_config_file_path)
if raw_clp_config is None:
clp_config = CLPConfig()
else:
Expand Down Expand Up @@ -464,16 +467,17 @@ def validate_credentials_file_path(
clp_config: CLPConfig, clp_home: pathlib.Path, generate_default_file: bool
):
credentials_file_path = clp_config.credentials_file_path
if not credentials_file_path.exists():
resolved_credentials_file_path = resolve_host_path(credentials_file_path)
if not resolved_credentials_file_path.exists():
if (
make_config_path_absolute(clp_home, CLP_DEFAULT_CREDENTIALS_FILE_PATH)
== credentials_file_path
and generate_default_file
):
generate_credentials_file(credentials_file_path)
generate_credentials_file(resolved_credentials_file_path)
else:
Comment on lines +469 to 476
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Also mkdir for generated default credentials file.

Generating the default file will fail if its parent dir does not exist.

Apply this diff:

-    resolved_credentials_file_path = resolve_host_path(credentials_file_path)
+    resolved_credentials_file_path = resolve_host_path(credentials_file_path)
@@
-            generate_credentials_file(resolved_credentials_file_path)
+            resolved_credentials_file_path.parent.mkdir(parents=True, exist_ok=True)
+            generate_credentials_file(resolved_credentials_file_path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not resolved_credentials_file_path.exists():
if (
make_config_path_absolute(clp_home, CLP_DEFAULT_CREDENTIALS_FILE_PATH)
== credentials_file_path
and generate_default_file
):
generate_credentials_file(credentials_file_path)
generate_credentials_file(resolved_credentials_file_path)
else:
if not resolved_credentials_file_path.exists():
if (
make_config_path_absolute(clp_home, CLP_DEFAULT_CREDENTIALS_FILE_PATH)
== credentials_file_path
and generate_default_file
):
resolved_credentials_file_path.parent.mkdir(parents=True, exist_ok=True)
generate_credentials_file(resolved_credentials_file_path)
else:
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 471 to
478, the code generates a default credentials file but doesn't ensure its parent
directory exists; before calling
generate_credentials_file(resolved_credentials_file_path) create the parent
directory for resolved_credentials_file_path (e.g.,
resolved_credentials_file_path.parent.mkdir(parents=True, exist_ok=True)) so
file creation won't fail when the directory is missing.

raise ValueError(f"Credentials file path '{credentials_file_path}' does not exist.")
elif not credentials_file_path.is_file():
elif not resolved_credentials_file_path.is_file():
raise ValueError(f"Credentials file path '{credentials_file_path}' is not a file.")


Expand Down Expand Up @@ -504,7 +508,8 @@ def validate_db_config(
data_dir: pathlib.Path,
logs_dir: pathlib.Path,
):
if not component_config.exists():
resolved_component_config = resolve_host_path(component_config)
if not resolved_component_config.exists():
raise ValueError(f"{DB_COMPONENT_NAME} configuration file missing: '{component_config}'.")
_validate_data_directory(data_dir, DB_COMPONENT_NAME)
_validate_log_directory(logs_dir, DB_COMPONENT_NAME)
Expand All @@ -524,7 +529,8 @@ def validate_redis_config(
data_dir: pathlib.Path,
logs_dir: pathlib.Path,
):
if not component_config.exists():
resolved_component_config = resolve_host_path(component_config)
if not resolved_component_config.exists():
raise ValueError(
f"{REDIS_COMPONENT_NAME} configuration file missing: '{component_config}'."
)
Expand All @@ -551,7 +557,8 @@ def validate_results_cache_config(
data_dir: pathlib.Path,
logs_dir: pathlib.Path,
):
if not component_config.exists():
resolved_component_config = resolve_host_path(component_config)
if not resolved_component_config.exists():
raise ValueError(
f"{RESULTS_CACHE_COMPONENT_NAME} configuration file missing: '{component_config}'."
)
Expand Down Expand Up @@ -583,7 +590,8 @@ def validate_webui_config(
server_settings_json_path: pathlib.Path,
):
for path in [client_settings_json_path, server_settings_json_path]:
if not path.exists():
resolved_path = resolve_host_path(path)
if not resolved_path.exists():
raise ValueError(f"{WEBUI_COMPONENT_NAME} {path} is not a valid path to settings.json")

validate_port(f"{WEBUI_COMPONENT_NAME}.port", clp_config.webui.host, clp_config.webui.port)
Expand Down Expand Up @@ -765,7 +773,7 @@ def _is_docker_compose_project_running(project_name: str) -> bool:

def _validate_data_directory(data_dir: pathlib.Path, component_name: str) -> None:
try:
validate_path_could_be_dir(data_dir)
validate_path_could_be_dir(resolve_host_path(data_dir))
except ValueError as ex:
raise ValueError(f"{component_name} data directory is invalid: {ex}")

Comment on lines 768 to 773
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Chain validation errors to preserve root cause.

Use from ex in _validate_data_directory and _validate_log_directory.

-    try:
-        validate_path_could_be_dir(resolve_host_path(data_dir))
-    except ValueError as ex:
-        raise ValueError(f"{component_name} data directory is invalid: {ex}")
+    try:
+        validate_path_could_be_dir(resolve_host_path(data_dir))
+    except ValueError as ex:
+        raise ValueError(f"{component_name} data directory is invalid") from ex
@@
-    try:
-        validate_path_could_be_dir(resolve_host_path(logs_dir))
-    except ValueError as ex:
-        raise ValueError(f"{component_name} logs directory is invalid: {ex}")
+    try:
+        validate_path_could_be_dir(resolve_host_path(logs_dir))
+    except ValueError as ex:
+        raise ValueError(f"{component_name} logs directory is invalid") from ex

Also applies to: 783-786

🧰 Tools
🪛 Ruff (0.14.2)

772-772: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


772-772: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines
768-773 and 783-786, the ValueError raised in the except blocks should chain the
original exception to preserve the root cause; update the raise statements to
use "raise ValueError(... ) from ex" so the original exception (ex) is attached
to the new ValueError for proper exception chaining and debugging.

Expand All @@ -779,6 +787,6 @@ def _validate_log_directory(logs_dir: pathlib.Path, component_name: str):
:raise ValueError: If the path is invalid or can't be a directory.
"""
try:
validate_path_could_be_dir(logs_dir)
validate_path_could_be_dir(resolve_host_path(logs_dir))
except ValueError as ex:
raise ValueError(f"{component_name} logs directory is invalid: {ex}")
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
StorageEngine,
StorageType,
)
from clp_py_utils.core import resolve_host_path

from clp_package_utils.general import (
CLPConfig,
Expand Down Expand Up @@ -220,7 +221,6 @@ def main(argv: List[str]) -> int:
)

necessary_mounts: List[Optional[DockerMount]] = [
mounts.clp_home,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mounts like this have to be removed, or they conflict with the contents inside the docker image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, can you explain more?

Copy link
Member Author

@junhaoliao junhaoliao Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, that comment was so unclear that it took me time to recall why it was done -

on the host, clp_home is essentially the clp-package directory root. previously we were mounting the package root into the container so the python modules in lib/python3/site-packages can be accessed in the container. however, we have now removed such contents from the package (recall "now the package is only 180K"); if we mount an empty directory into /opt/clp in the container, anything in /opt/clp is essentially gone. we should just use the assets in the docker image and remove such mounts then

mounts.logs_dir,
mounts.archives_output_dir,
]
Expand Down Expand Up @@ -281,7 +281,8 @@ def main(argv: List[str]) -> int:
logger.debug(f"Docker command failed: {shlex.join(cmd)}")

# Remove generated files
generated_config_path_on_host.unlink()
resolved_generated_config_path_on_host = resolve_host_path(generated_config_path_on_host)
resolved_generated_config_path_on_host.unlink()

return ret_code

Expand Down
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to make sure the .resolve() calls in this script can properly resolve relative and symlinks in the host path space; otherwise, the "mounted_path"s might point to some invalid file in the Package services containers

@coderabbitai create an issue to track this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the path resolution concern

gh issue create \
  --repo y-scope/clp \
  --title "Ensure .resolve() calls in compress.py properly handle host path resolution" \
  --body "## Description

In the containerized runtime environment introduced in PR #1512, the \`.resolve()\` calls in \`components/clp-package-utils/clp_package_utils/scripts/compress.py\` need to properly resolve relative paths and symlinks in the host path space. Otherwise, the \`mounted_path\`s might point to invalid files in the Package services containers.

## Context

The \`_generate_logs_list\` function uses \`pathlib.Path(...).resolve()\` to convert user-provided paths into absolute paths before mounting them in containers. However, in the containerized runtime, these resolve calls may operate in the container's filesystem space rather than the host's filesystem space.

## Related

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2495876730
- Requested by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1558 to track this concern about ensuring .resolve() calls in compress.py properly handle host path resolution.

You're right to flag this—the .resolve() calls at lines 48 and 69 (in _generate_logs_list) currently resolve paths in whatever filesystem context the script runs in. Since the script now executes inside the container, these calls would resolve symlinks and relative paths against the container's filesystem instead of the host's, which could lead to incorrect mounted_path values that don't correspond to valid files in the container mounts.


🧬 Code Graph Analysis Results

[content below identifies relevant snippets to address the user comment; do not perform the user comment analysis here]

  1. File: components/clp-py-utils/clp_py_utils/core.py
    Lines: 67-92
    Content:
    def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path:
    """
    Translates a host path to its container-mount equivalent. It also resolves a single level of
    symbolic link if the host path itself is a symlink.

    :param host_path: The host path.
    :return: The translated path.
    """
    host_path = host_path.absolute()
    translated_path = CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/")
    
    try:
        if not translated_path.is_symlink():
            return translated_path
    
        link_target = translated_path.readlink()
        if link_target.is_absolute():
            return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/")
        else:
            # If the symlink points to a relative path, resolve it relative to the symlink's parent.
            return (translated_path.parent / link_target).resolve()
    except OSError:
        # Ignore if reading the symlink fails (e.g., broken link or permission error).
        pass
    
    return translated_path
    
  2. File: components/clp-package-utils/clp_package_utils/scripts/compress.py
    Lines: 31-54 (approximate)
    Content (relevant excerpt):
    def _generate_logs_list(
    container_logs_list_path: pathlib.Path,
    parsed_args: argparse.Namespace,
    ) -> bool:
    """
    Generates logs list file for the native compression script.

    :param container_logs_list_path: Path to write logs list.
    :param parsed_args: Parsed command-line arguments.
    :return: Whether any paths were written to the logs list.
    """
    host_logs_list_path = parsed_args.path_list
    with open(container_logs_list_path, "w") as container_logs_list_file:
        if host_logs_list_path is None:
            for path in parsed_args.paths:
                resolved_path = pathlib.Path(path).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                    resolved_path.anchor
                )
                container_logs_list_file.write(f"{mounted_path}\n")
            return len(parsed_args.paths) != 0
    
        no_path_found = True
        resolved_host_logs_list_path = resolve_host_path_in_container(
            pathlib.Path(host_logs_list_path)
        )
        with open(resolved_host_logs_list_path, "r") as host_logs_list_file:
            for line in host_logs_list_file:
                stripped_path_str = line.rstrip()
                if "" == stripped_path_str:
                    # Skip empty paths
                    continue
                no_path_found = False
                resolved_path = pathlib.Path(stripped_path_str).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                    resolved_path.anchor
                )
                container_logs_list_file.write(f"{mounted_path}\n")
        return not no_path_found
    

Note: The user comment mentions ensuring .resolve() calls properly resolve relative and symlinks in host path space so that "mounted_path"s are valid inside containers. The above snippets illustrate:

  • How host paths are translated to container paths with potential single-level symlink handling (resolve_host_path_in_container).
  • How the script resolves host paths on the host side and computes mounted container paths (Path.resolve() + relative_to("/")) to form the mounted paths written to the logs list. These are the relevant points to audit and adjust for host-space resolution behavior.

🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
StorageEngine,
StorageType,
)
from clp_py_utils.core import resolve_host_path

from clp_package_utils.general import (
CONTAINER_INPUT_LOGS_ROOT_DIR,
Expand Down Expand Up @@ -56,7 +57,8 @@ def _generate_logs_list(
return len(parsed_args.paths) != 0

no_path_found = True
with open(host_logs_list_path, "r") as host_logs_list_file:
resolved_host_logs_list_path = resolve_host_path(pathlib.Path(host_logs_list_path))
with open(resolved_host_logs_list_path, "r") as host_logs_list_file:
for line in host_logs_list_file:
stripped_path_str = line.rstrip()
if "" == stripped_path_str:
Expand Down Expand Up @@ -230,20 +232,19 @@ def main(argv):
container_clp_config, clp_config, get_container_config_filename(container_name)
)

necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir]
necessary_mounts = [mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir]

# Write compression logs to a file
while True:
# Get unused output path
container_logs_list_filename = f"{uuid.uuid4()}.txt"
container_logs_list_path = clp_config.logs_directory / container_logs_list_filename
logs_list_path_on_container = (
container_clp_config.logs_directory / container_logs_list_filename
)
if not container_logs_list_path.exists():
logs_list_filename = f"{uuid.uuid4()}.txt"
logs_list_path_on_host = clp_config.logs_directory / logs_list_filename
resolved_logs_list_path_on_host = resolve_host_path(logs_list_path_on_host)
logs_list_path_on_container = container_clp_config.logs_directory / logs_list_filename
if not resolved_logs_list_path_on_host.exists():
break

if not _generate_logs_list(container_logs_list_path, parsed_args):
if not _generate_logs_list(resolved_logs_list_path_on_host, parsed_args):
Comment on lines 246 to 253
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Eliminate TOCTOU when generating logs-list temp file.

Use an exclusive create to avoid race between existence check and write.

-    while True:
-        # Get unused output path
-        logs_list_filename = f"{uuid.uuid4()}.txt"
-        logs_list_path_on_host = clp_config.logs_directory / logs_list_filename
-        resolved_logs_list_path_on_host = resolve_host_path(logs_list_path_on_host)
-        logs_list_path_on_container = container_clp_config.logs_directory / logs_list_filename
-        if not resolved_logs_list_path_on_host.exists():
-            break
+    import tempfile, os
+    resolved_logs_dir = resolve_host_path(clp_config.logs_directory)
+    fd, tmp_path = tempfile.mkstemp(dir=resolved_logs_dir, suffix=".txt")
+    os.close(fd)
+    resolved_logs_list_path_on_host = pathlib.Path(tmp_path)
+    logs_list_path_on_container = container_clp_config.logs_directory / resolved_logs_list_path_on_host.name
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logs_list_filename = f"{uuid.uuid4()}.txt"
logs_list_path_on_host = clp_config.logs_directory / logs_list_filename
resolved_logs_list_path_on_host = resolve_host_path(logs_list_path_on_host)
logs_list_path_on_container = container_clp_config.logs_directory / logs_list_filename
if not resolved_logs_list_path_on_host.exists():
break
if not _generate_logs_list(container_logs_list_path, parsed_args):
if not _generate_logs_list(resolved_logs_list_path_on_host, parsed_args):
import tempfile, os
resolved_logs_dir = resolve_host_path(clp_config.logs_directory)
fd, tmp_path = tempfile.mkstemp(dir=resolved_logs_dir, suffix=".txt")
os.close(fd)
resolved_logs_list_path_on_host = pathlib.Path(tmp_path)
logs_list_path_on_container = container_clp_config.logs_directory / resolved_logs_list_path_on_host.name
if not _generate_logs_list(resolved_logs_list_path_on_host, parsed_args):
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress.py around
lines 244-251, the current code does a separate exists() check then calls
_generate_logs_list(path), creating a TOCTOU window; replace that pattern with
an exclusive-create operation so the file is created atomically. Concretely,
attempt to create the temp logs-list file with exclusive semantics (e.g.,
open(path, "x") or os.open(path, O_CREAT|O_EXCL|...), or use
tempfile.NamedTemporaryFile with dir=... and delete=False) and then write the
logs list into that newly-created handle (or adapt _generate_logs_list to accept
a file/fd) so you fail fast if the filename already exists and avoid the race
between check and write.

logger.error("No filesystem paths given for compression.")
return -1

Expand All @@ -266,9 +267,11 @@ def main(argv):
logger.error("Compression failed.")
logger.debug(f"Docker command failed: {shlex.join(cmd)}")
else:
container_logs_list_path.unlink()
resolved_logs_list_path_on_host.unlink()

resolved_generated_config_path_on_host = resolve_host_path(generated_config_path_on_host)
resolved_generated_config_path_on_host.unlink()

generated_config_path_on_host.unlink()
return ret_code


Expand Down
Loading
Loading