Merged
56 commits
00efbc1
feat(deployment): Containerize CLP tools via Docker Compose.
junhaoliao Oct 28, 2025
ed3dde2
revert unrelated change
junhaoliao Oct 28, 2025
d307457
fix(docker): Simplify clp-runtime service configuration in Docker Com…
junhaoliao Oct 28, 2025
f871424
debloat the package by avoiding duplicating assets already in the con…
junhaoliao Oct 28, 2025
b03539a
fix(env): Update container image reference format in .common-env.sh
junhaoliao Oct 28, 2025
1b65afe
feat(core): Add resolve_host_path utility for path resolution in cont…
junhaoliao Oct 29, 2025
3b4d44c
lint
junhaoliao Oct 29, 2025
347b181
Merge branch 'main' into containerize-tools
junhaoliao Oct 29, 2025
74a3921
refactor(utils): Simplify path validation by removing host mount para…
junhaoliao Oct 29, 2025
885e9b5
revert unrelated change
junhaoliao Oct 29, 2025
0db2f30
fix package docker workflow
junhaoliao Oct 29, 2025
695ed77
lint
junhaoliao Oct 29, 2025
4f21e0f
docs(dev-docs): Update task command name for building packages.
junhaoliao Oct 29, 2025
a7c81a3
docs: Remove Python section and streamline Docker Desktop requirements
junhaoliao Oct 29, 2025
e6e8217
Merge branch 'main' into containerize-tools
junhaoliao Oct 29, 2025
be2d200
refactor(clp-py-utils): Remove redundant debug print in config direct…
junhaoliao Oct 29, 2025
d910944
chore(scripts): Add shellcheck directive comments for sourcing common…
junhaoliao Oct 29, 2025
163921b
fix SC2155
junhaoliao Oct 29, 2025
3f5c302
fix(scripts): Ensure CLP_DOCKER_SOCK_PATH is set only when undefined.
junhaoliao Oct 29, 2025
65d421f
fix(init): invert CLP_HOME check to use -z for unset detection
junhaoliao Oct 29, 2025
83c581c
fix(scripts): Remove exit 1; Remove redundant stderr redirection in …
junhaoliao Oct 29, 2025
e03a361
chore(clp-py-utils): Remove unused os import from core module.
junhaoliao Oct 29, 2025
299795f
Update volume definitions in docker-compose.runtime.yaml to use the e…
junhaoliao Oct 29, 2025
0ccd155
fix(deployment): Add default values and validations in docker-compose…
junhaoliao Oct 29, 2025
beacdcf
fix(scripts): Adjust formatting for better readability in .common-env…
junhaoliao Oct 29, 2025
c7141e7
move `>&2` before echo
junhaoliao Oct 29, 2025
25c2450
fix(scripts): Resolve host paths for config files during load operati…
junhaoliao Oct 29, 2025
0ac4501
fix(clp-config): Add `use_host_mount` parameter to validation methods…
junhaoliao Oct 29, 2025
479fc81
lint
junhaoliao Oct 29, 2025
9921d48
fix(ci): Update package build step name to indicate the package image…
junhaoliao Oct 30, 2025
a2dd5cf
refactor(clp-py-utils): Rename CONTAINER_HOST_ROOT_DIR -> CONTAINER_D…
junhaoliao Oct 30, 2025
8d6deb6
refactor(clp-py-utils): Rename resolve_host_path -> resolve_host_path…
junhaoliao Oct 30, 2025
c6c5164
Update docstring to clarify `path` translation return value
junhaoliao Oct 30, 2025
3b44945
refactor(clp-py-utils): Rename path -> host_path and resolved -> tran…
junhaoliao Oct 30, 2025
4b6f337
refactor(taskfile): Move NODE_ENV definition to webui task.
junhaoliao Oct 30, 2025
bb90b69
refactor(taskfile): Reorder package-build-deps dependencies for clarity.
junhaoliao Oct 30, 2025
ae825c0
refactor(taskfile): Change "{{.G_BUILD_DIR}}/{{.TASK}}.md5" -> G_WEBU…
junhaoliao Oct 30, 2025
9a66669
Add back package-template to package dependencies.
junhaoliao Oct 30, 2025
6983943
fix(clp-py-utils): Explain in docs that only single-level symlink are…
junhaoliao Oct 30, 2025
e372661
Merge branch 'main' into containerize-tools
junhaoliao Oct 30, 2025
bee1b79
Reorder dependencies
junhaoliao Oct 30, 2025
b0e8abf
shfmt - Apply suggestions from code review
junhaoliao Oct 31, 2025
5172963
avoid hardcoding names - Apply suggestions from code review
junhaoliao Oct 31, 2025
bb5d8d2
shfmt
junhaoliao Nov 3, 2025
95a8e1f
Use standard form for stderr redirection
junhaoliao Nov 3, 2025
395e42f
shfmt
junhaoliao Nov 3, 2025
3af2e4d
Rename dir -> compose_plugin_dir to avoid overlap with the command dir
junhaoliao Nov 3, 2025
198f398
add docs for resolve_host_path_in_container
junhaoliao Nov 3, 2025
1d02051
Merge branch 'main' into containerize-tools
junhaoliao Nov 4, 2025
3063e85
format - Apply suggestions from code review
junhaoliao Nov 4, 2025
306997a
split volumes into groups and add docs
junhaoliao Nov 4, 2025
4f6c5b4
Move resolved webui settings paths declarations closer to their usages
junhaoliao Nov 4, 2025
ef627c6
format - Apply suggestions from code review
junhaoliao Nov 4, 2025
75ada35
Merge branch 'main' into containerize-tools
junhaoliao Nov 4, 2025
969fde0
Clarify `HOME` forwarding into the container - Apply suggestions from…
junhaoliao Nov 4, 2025
732dc92
Merge branch 'main' into containerize-tools
junhaoliao Nov 4, 2025
@@ -396,7 +396,6 @@ def generate_container_start_cmd(
"--rm",
"--network", "host",
"-w", str(CONTAINER_CLP_HOME),
"-e", f"PYTHONPATH={clp_site_packages_dir}",
"-u", f"{os.getuid()}:{os.getgid()}",
"--name", container_name,
"--log-driver", "local"
@@ -220,7 +220,6 @@ def main(argv: List[str]) -> int:
)

necessary_mounts: List[Optional[DockerMount]] = [
mounts.clp_home,
Member Author

Mounts like this have to be removed; otherwise, they conflict with the contents inside the Docker image.

Member

Sorry, can you explain more?

Member Author

junhaoliao Nov 4, 2025

Sorry, that comment was so unclear that it took me a while to recall why this was done.

On the host, clp_home is essentially the root of the clp-package directory. Previously, we mounted the package root into the container so that the Python modules in lib/python3/site-packages could be accessed there. However, we have now removed those contents from the package (recall "now the package is only 180K"). If we mount an essentially empty directory at /opt/clp in the container, everything the image provides in /opt/clp is hidden. We should therefore use the assets in the Docker image and remove such mounts.

mounts.logs_dir,
mounts.archives_output_dir,
]
Member Author

We need to make sure the .resolve() calls in this script properly resolve relative paths and symlinks in the host path space; otherwise, the "mounted_path"s might point to invalid files in the Package services containers.

@coderabbitai create an issue to track this

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the path resolution concern

gh issue create \
  --repo y-scope/clp \
  --title "Ensure .resolve() calls in compress.py properly handle host path resolution" \
  --body "## Description

In the containerized runtime environment introduced in PR #1512, the \`.resolve()\` calls in \`components/clp-package-utils/clp_package_utils/scripts/compress.py\` need to properly resolve relative paths and symlinks in the host path space. Otherwise, the \`mounted_path\`s might point to invalid files in the Package services containers.

## Context

The \`_generate_logs_list\` function uses \`pathlib.Path(...).resolve()\` to convert user-provided paths into absolute paths before mounting them in containers. However, in the containerized runtime, these resolve calls may operate in the container's filesystem space rather than the host's filesystem space.

## Related

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2495876730
- Requested by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1558 to track this concern about ensuring .resolve() calls in compress.py properly handle host path resolution.

You're right to flag this—the .resolve() calls at lines 48 and 69 (in _generate_logs_list) currently resolve paths in whatever filesystem context the script runs in. Since the script now executes inside the container, these calls would resolve symlinks and relative paths against the container's filesystem instead of the host's, which could lead to incorrect mounted_path values that don't correspond to valid files in the container mounts.


🧬 Code Graph Analysis Results

[content below identifies relevant snippets to address the user comment; do not perform the user comment analysis here]

  1. File: components/clp-py-utils/clp_py_utils/core.py
    Lines: 67-92
    Content:
    def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path:
        """
        Translates a host path to its container-mount equivalent. It also resolves a single level of
        symbolic link if the host path itself is a symlink.

        :param host_path: The host path.
        :return: The translated path.
        """
        host_path = host_path.absolute()
        translated_path = CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/")

        try:
            if not translated_path.is_symlink():
                return translated_path

            link_target = translated_path.readlink()
            if link_target.is_absolute():
                return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/")
            else:
                # If the symlink points to a relative path, resolve it relative to the symlink's parent.
                return (translated_path.parent / link_target).resolve()
        except OSError:
            # Ignore if reading the symlink fails (e.g., broken link or permission error).
            pass

        return translated_path
    
  2. File: components/clp-package-utils/clp_package_utils/scripts/compress.py
    Lines: 31-54 (approximate)
    Content (relevant excerpt):
    def _generate_logs_list(
        container_logs_list_path: pathlib.Path,
        parsed_args: argparse.Namespace,
    ) -> bool:
        """
        Generates logs list file for the native compression script.

        :param container_logs_list_path: Path to write logs list.
        :param parsed_args: Parsed command-line arguments.
        :return: Whether any paths were written to the logs list.
        """
        host_logs_list_path = parsed_args.path_list
        with open(container_logs_list_path, "w") as container_logs_list_file:
            if host_logs_list_path is None:
                for path in parsed_args.paths:
                    resolved_path = pathlib.Path(path).resolve()
                    mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                        resolved_path.anchor
                    )
                    container_logs_list_file.write(f"{mounted_path}\n")
                return len(parsed_args.paths) != 0

            no_path_found = True
            resolved_host_logs_list_path = resolve_host_path_in_container(
                pathlib.Path(host_logs_list_path)
            )
            with open(resolved_host_logs_list_path, "r") as host_logs_list_file:
                for line in host_logs_list_file:
                    stripped_path_str = line.rstrip()
                    if "" == stripped_path_str:
                        # Skip empty paths
                        continue
                    no_path_found = False
                    resolved_path = pathlib.Path(stripped_path_str).resolve()
                    mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                        resolved_path.anchor
                    )
                    container_logs_list_file.write(f"{mounted_path}\n")
            return not no_path_found
    

Note: The user comment mentions ensuring .resolve() calls properly resolve relative and symlinks in host path space so that "mounted_path"s are valid inside containers. The above snippets illustrate:

  • How host paths are translated to container paths with potential single-level symlink handling (resolve_host_path_in_container).
  • How the script resolves host paths on the host side and computes mounted container paths (Path.resolve() + relative_to("/")) to form the mounted paths written to the logs list. These are the relevant points to audit and adjust for host-space resolution behavior.
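
As a rough illustration of the host-space concern raised above, the sketch below maps a user-supplied host path to a location under a host-root mount purely lexically, so a relative path is anchored at the host's working directory rather than the container's. `HOST_ROOT_IN_CONTAINER`, its value `/mnt/host`, and `translate_host_path` are hypothetical names chosen for this example; they are not identifiers from this PR.

```python
import pathlib

# Hypothetical mount point for the host's "/" inside the container (assumed value).
HOST_ROOT_IN_CONTAINER = pathlib.PurePosixPath("/mnt/host")


def translate_host_path(host_path: str, host_cwd: str) -> pathlib.PurePosixPath:
    """
    Maps a host path to its location under the host-root mount without touching
    the filesystem.

    :param host_path: A path as typed by the user on the host.
    :param host_cwd: The host's working directory, used to anchor relative paths.
    :return: The translated path under the (assumed) host-root mount.
    """
    path = pathlib.PurePosixPath(host_path)
    if not path.is_absolute():
        # Anchor at the host's cwd, not the container's working directory.
        path = pathlib.PurePosixPath(host_cwd) / path
    return HOST_ROOT_IN_CONTAINER / path.relative_to("/")


print(translate_host_path("logs/app.log", "/home/user"))
print(translate_host_path("/var/log/syslog", "/home/user"))
```

This sketch does not resolve symlinks or `..` components; doing that correctly would require reading the host filesystem through the mount, which is the open question the tracking issue covers.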

🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

@@ -230,7 +230,7 @@ def main(argv):
container_clp_config, clp_config, get_container_config_filename(container_name)
)

-    necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir]
+    necessary_mounts = [mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir]

# Write compression logs to a file
while True:
@@ -286,7 +286,7 @@ def main(argv):
container_clp_config, clp_config, get_container_config_filename(container_name)
)

-    necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir]
+    necessary_mounts = [mounts.data_dir, mounts.logs_dir]
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Mount AWS config into the container when configured.

Without mounts.aws_config_dir, profile-based S3 auth will fail inside the job container.

Apply this diff:

-    necessary_mounts = [mounts.data_dir, mounts.logs_dir]
+    necessary_mounts = [mounts.data_dir, mounts.logs_dir, mounts.aws_config_dir]

🏁 Script executed:

#!/bin/bash
# Check other scripts also include aws_config_dir when S3 is used
rg -nC2 'necessary_mounts\s*=\s*\[.*aws_config_dir' components/clp-package-utils/clp_package_utils/scripts

Length of output: 0


🏁 Script executed:

#!/bin/bash
# 1. Check the current state at line 291 in compress_from_s3.py
echo "=== Current code around line 291 ==="
sed -n '285,295p' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== All necessary_mounts definitions ==="
rg -n 'necessary_mounts\s*=' components/clp-package-utils/clp_package_utils/scripts

echo ""
echo "=== Check for aws_config_dir references ==="
rg -n 'aws_config_dir' components/clp-package-utils/clp_package_utils/scripts

echo ""
echo "=== Check for S3 usage context in compress_from_s3.py ==="
rg -n 'S3\|s3\|aws' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py | head -20

Length of output: 1638


🏁 Script executed:

#!/bin/bash
# Check decompress.py line 115 context
echo "=== decompress.py:115 context ==="
sed -n '110,125p' components/clp-package-utils/clp_package_utils/scripts/decompress.py

echo ""
echo "=== dataset_manager.py:143 context (conditional logic) ==="
sed -n '135,150p' components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py

echo ""
echo "=== Broader search for S3/credentials in compress_from_s3.py ==="
rg -n 'boto\|credentials\|profile\|AWS\|s3_' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Check if compress_from_s3.py references dataset_manager or similar patterns ==="
rg -n 'aws_config\|S3\|boto' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Check mounts object to understand aws_config_dir ==="
rg -n 'aws_config_dir' components/clp-package-utils/clp_package_utils -A2 -B2 | head -40

Length of output: 5049


🏁 Script executed:

#!/bin/bash
# Check if compress_from_s3.py calls generate_container_auth_options
echo "=== generate_container_auth_options in compress_from_s3.py ==="
rg -n 'generate_container_auth_options' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Full compress_from_s3.py file to understand S3 usage ==="
wc -l components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== First 50 lines to understand imports/purpose ==="
head -50 components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Lines around 291 with broader context (270-310) ==="
sed -n '270,310p' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Check for boto3, requests, or URL downloading ==="
grep -n 'boto3\|requests\|urlopen\|download\|fetch' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

Length of output: 3602


🏁 Script executed:

#!/bin/bash
# Check generate_container_auth_options to understand when aws_mount is needed
echo "=== generate_container_auth_options definition and logic ==="
rg -n 'def generate_container_auth_options' components/clp-package-utils/clp_package_utils -A 20

echo ""
echo "=== Check if compress_from_s3.py has any AWS config usage or should ==="
grep -n 'auth\|credential\|profile' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Search compress.py for comparison (similar compression script) ==="
sed -n '230,250p' components/clp-package-utils/clp_package_utils/scripts/compress.py

Length of output: 1323


🏁 Script executed:

#!/bin/bash
# Check what happens after necessary_mounts is created - how are URLs handled?
echo "=== compress_from_s3.py lines 310-333 (end of file) ==="
sed -n '310,333p' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Check if container receives credentials through env vars instead ==="
rg -n 'AWS\|credential\|profile' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Verify: does dataset_manager call generate_container_auth_options and get aws_mount? ==="
sed -n '138,148p' components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py

echo ""
echo "=== Look for where generate_container_auth_options is defined ==="
rg -n 'def generate_container_auth_options' components/clp-package-utils/clp_package_utils

Length of output: 1401


🏁 Script executed:

#!/bin/bash
# Search more broadly for generate_container_auth_options
echo "=== Search all Python files for generate_container_auth_options ==="
rg -n 'generate_container_auth_options' components/clp-package-utils/clp_package_utils

echo ""
echo "=== Check dataset_manager imports ==="
head -40 components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py | grep -n import

echo ""
echo "=== Search in clp_py_utils for the function ==="
find components/clp-package-utils -name '*.py' -type f -exec grep -l 'generate_container_auth_options' {} \;

Length of output: 859


🏁 Script executed:

#!/bin/bash
# Check what _generate_url_list does and if it uses AWS credentials
echo "=== compress_from_s3.py _generate_url_list function ==="
sed -n '38,120p' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

echo ""
echo "=== Check S3 constants ==="
rg -n 'S3_OBJECT_COMPRESSION\|S3_KEY_PREFIX_COMPRESSION' components/clp-package-utils/clp_package_utils/general.py

echo ""
echo "=== Check _validate_s3_object_args to see what S3 setup is needed ==="
sed -n '150,180p' components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

Length of output: 3886


Include AWS config directory mount AND environment variables when S3 is used, following the pattern in dataset_manager.py.

The suggested fix adds mounts.aws_config_dir to necessary_mounts, but this is incomplete. Similar S3 authentication is handled in dataset_manager.py by calling generate_container_auth_options(), which returns both an optional mount AND environment variables. The native compression script running with --input-type s3 requires both—mounting the AWS config and passing authentication environment variables to the container.

The fix should:

  1. Import generate_container_auth_options from clp_py_utils.s3_utils
  2. Call it to obtain aws_mount and aws_env_vars
  3. Conditionally append mounts.aws_config_dir and merge aws_env_vars into extra_env_vars, mirroring the logic at dataset_manager.py:138–148
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around line 291, the code only adds mounts.data_dir and mounts.logs_dir but
misses attaching the AWS config mount and environment variables when input-type
is s3; import generate_container_auth_options from clp_py_utils.s3_utils, call
it to get (aws_mount, aws_env_vars), and if aws_mount is present append
mounts.aws_config_dir to necessary_mounts (or append aws_mount as
dataset_manager does), then merge/extend aws_env_vars into extra_env_vars before
launching the container so the container receives both the AWS config mount and
the required auth environment variables (follow the same conditional merge logic
as dataset_manager.py lines ~138-148).
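
The merge step described above might look roughly like the following. This is only a sketch: the page does not show the real return type of `generate_container_auth_options`, so the `(optional mount, env-var dict)` tuple shape is an assumption, mounts are stood in for by plain strings, and `apply_auth_options` is a hypothetical helper name.

```python
from typing import Dict, List, Optional, Tuple


def apply_auth_options(
    necessary_mounts: List[str],
    extra_env_vars: Dict[str, str],
    auth_options: Tuple[Optional[str], Dict[str, str]],
) -> None:
    """
    Appends the optional AWS config mount and merges auth env vars, in place.

    :param necessary_mounts: Mounts to pass to the container (strings here for brevity).
    :param extra_env_vars: Environment variables to pass to the container.
    :param auth_options: Assumed shape: (mount or None, env vars to add).
    """
    aws_mount, aws_env_vars = auth_options
    if aws_mount is not None:
        # Only mount the AWS config directory when profile-based auth needs it.
        necessary_mounts.append(aws_mount)
    extra_env_vars.update(aws_env_vars)


mounts = ["data_dir", "logs_dir"]
env_vars = {"EXAMPLE_VAR": "value"}
apply_auth_options(mounts, env_vars, ("aws_config_dir", {"AWS_PROFILE": "default"}))
print(mounts, env_vars)
```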


while True:
container_url_list_filename = f"{uuid.uuid4()}.txt"
@@ -131,10 +131,7 @@ def main(argv: List[str]) -> int:
container_clp_config, clp_config, get_container_config_filename(container_name)
)

-    necessary_mounts = [
-        mounts.clp_home,
-        mounts.logs_dir,
-    ]
+    necessary_mounts = [mounts.logs_dir]
if clp_config.archive_output.storage.type == StorageType.FS:
necessary_mounts.append(mounts.archives_output_dir)

@@ -1,5 +1,6 @@
import argparse
import logging
import os
import pathlib
import shlex
import subprocess
@@ -110,7 +111,6 @@ def handle_extract_file_cmd(
extraction_dir.mkdir(exist_ok=True)
container_extraction_dir = pathlib.Path("/") / "mnt" / "extraction-dir"
necessary_mounts = [
mounts.clp_home,
mounts.data_dir,
mounts.logs_dir,
mounts.archives_output_dir,
@@ -205,7 +205,7 @@ def handle_extract_stream_cmd(
generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
container_clp_config, clp_config, get_container_config_filename(container_name)
)
-    necessary_mounts = [mounts.clp_home, mounts.logs_dir]
+    necessary_mounts = [mounts.logs_dir]
extra_env_vars = {
CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
@@ -298,8 +298,9 @@ def main(argv):
file_extraction_parser.add_argument(
"-f", "--files-from", help="A file listing all files to extract."
)
default_extraction_dir = pathlib.Path(os.environ.get("CLP_PWD_HOST", "."))
Member Author

If we don't explicitly set CLP_PWD_HOST on the host and read it here, `.` resolves to `/opt/clp`, the container's working directory.
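
A minimal sketch of the pattern this comment describes, assuming the wrapper script exports `CLP_PWD_HOST` before launching the container. `build_parser` is a hypothetical helper for this example, and the example path is made up.

```python
import argparse
import os
import pathlib


def build_parser() -> argparse.ArgumentParser:
    # Fall back to "." only when the wrapper did not export CLP_PWD_HOST.
    default_extraction_dir = pathlib.Path(os.environ.get("CLP_PWD_HOST", "."))
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-d",
        "--extraction-dir",
        metavar="DIR",
        default=default_extraction_dir,
        help="Extract files into DIR.",
    )
    return parser


# Simulate the wrapper script having exported the host's working directory.
os.environ["CLP_PWD_HOST"] = "/home/user/work"
args = build_parser().parse_args([])
print(args.extraction_dir)
```

Reading the variable when the parser is built keeps the default tied to the directory the user invoked the command from, rather than to the container's working directory.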

file_extraction_parser.add_argument(
-    "-d", "--extraction-dir", metavar="DIR", default=".", help="Extract files into DIR."
+    "-d", "--extraction-dir", metavar="DIR", default=default_extraction_dir, help="Extract files into DIR."
)

# IR extraction command parser
@@ -128,7 +128,7 @@ def main(argv):
generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
container_clp_config, clp_config, get_container_config_filename(container_name)
)
-    necessary_mounts = [mounts.clp_home, mounts.logs_dir]
+    necessary_mounts = [mounts.logs_dir]
extra_env_vars = {
CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
53 changes: 53 additions & 0 deletions components/package-template/src/sbin/.common-env.sh
@@ -0,0 +1,53 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root=$(readlink -f "$script_dir/..")

if [[ -n "${CLP_HOME:-}" ]]; then
    export CLP_HOME="$CLP_HOME"
else
    export CLP_HOME="$package_root"
fi

image_id_file="$CLP_HOME/clp-package-image.id"
version_file="$CLP_HOME/VERSION"

if [[ -f "$image_id_file" ]]; then
    image_id="$(tr -d '[:space:]' < "$image_id_file")"
    export CLP_PACKAGE_CONTAINER_IMAGE_REF="$image_id"
elif [[ -f "$version_file" ]]; then
    version="$(tr -d '[:space:]' < "$version_file")"
    export CLP_PACKAGE_CONTAINER_IMAGE_REF="clp-package:$version"
else
    echo "Error: Neither clp-package-image.id nor VERSION file exists." >&2
    return 1 2>/dev/null || exit 1
fi

uid="$(id --user 2>/dev/null || echo "1000")"
gid="$(getent group docker | cut -d: -f3 2>/dev/null || echo "999")"
export CLP_FIRST_PARTY_SERVICE_UID_GID="$uid:$gid"

export CLP_PWD_HOST="$(pwd)"

if [[ -z "${CLP_DOCKER_PLUGIN_DIR:-}" ]]; then
    for dir in \
        "$HOME/.docker/cli-plugins" \
        "/mnt/wsl/docker-desktop/cli-tools/usr/local/lib/docker/cli-plugins" \
        "/usr/local/lib/docker/cli-plugins" \
        "/usr/libexec/docker/cli-plugins"; do

        compose_plugin_path="$dir/docker-compose"
        if [[ -f "$compose_plugin_path" ]]; then
            export CLP_DOCKER_PLUGIN_DIR="$dir"
            break
        fi
    done
    if [[ -z "${CLP_DOCKER_PLUGIN_DIR:-}" ]]; then
        echo "Warning: Docker plugin directory not found; Docker Compose may not work inside container." >&2
    fi
fi

socket="$(docker context inspect --format '{{.Endpoints.docker.Host}}' 2>/dev/null | sed -E 's|^unix://||')"
if [[ -S "$socket" ]]; then
    export CLP_DOCKER_SOCK_PATH="$socket"
fi
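
For readers more comfortable with Python, the image-reference selection in the script above can be restated roughly as follows. This is an illustrative equivalent, not code from the PR; `resolve_image_ref` is a hypothetical name, and the version and image ID values in the usage lines are made up.

```python
import pathlib
import tempfile


def resolve_image_ref(clp_home: pathlib.Path) -> str:
    """
    Picks the container image reference the way the script does: prefer the
    image ID file, fall back to "clp-package:<VERSION>", else fail.
    """
    image_id_file = clp_home / "clp-package-image.id"
    version_file = clp_home / "VERSION"

    if image_id_file.is_file():
        # Mirrors `tr -d '[:space:]'`: drop every whitespace character.
        return "".join(image_id_file.read_text().split())
    if version_file.is_file():
        version = "".join(version_file.read_text().split())
        return f"clp-package:{version}"
    raise FileNotFoundError("Neither clp-package-image.id nor VERSION file exists.")


with tempfile.TemporaryDirectory() as tmp_dir:
    home = pathlib.Path(tmp_dir)
    (home / "VERSION").write_text("0.5.1\n")
    print(resolve_image_ref(home))
    (home / "clp-package-image.id").write_text("sha256:abc123\n")
    print(resolve_image_ref(home))
```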
@@ -1,9 +1,10 @@
#!/usr/bin/env bash
Member Author

We should add

set -o errexit
set -o nounset
set -o pipefail

or

set -eu
set -o pipefail

so that the scripts exit early on any error or unset variable, and so that a pipeline fails if any command in it fails.

@coderabbitai create an issue to track the improvement

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Create GitHub issue to track adding error handling flags to shell scripts
gh issue create \
  --title "Add error handling flags to shell scripts in package" \
  --body "## Context

User @junhaoliao requested tracking an improvement to add error handling flags to shell scripts in the package template.

## Description

Shell scripts in \`components/package-template/src/sbin/\` should include error handling flags at the beginning to make them more robust:

\`\`\`bash
set -eu
set -o pipefail
\`\`\`

These flags will:
- Exit early on any error (\`-e\` / \`errexit\`)
- Exit on unset variables (\`-u\` / \`nounset\`)
- Make pipelines fail if any command in them fails (\`pipefail\`)

Note: Use separate \`set -eu\` and \`set -o pipefail\` commands for consistency with the established pattern in CLP project build scripts.

## Affected Scripts

Scripts in \`components/package-template/src/sbin/\` directory, including:
- admin-tools/archive-manager.sh
- admin-tools/dataset-manager.sh
- compress-from-s3.sh
- compress.sh
- decompress.sh
- search.sh
- start-clp.sh
- stop-clp.sh
- .common-env.sh

## References

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2493126446
- Requested by: @junhaoliao" \
  --assignee junhaoliao \
  --repo y-scope/clp

Length of output: 42


@junhaoliao, I've created issue #1556 to track adding error handling flags to the shell scripts in the package template. The issue covers all the scripts in components/package-template/src/sbin/ and follows the established CLP project pattern of using separate set -eu and set -o pipefail commands for consistency with existing build scripts.


🧠 Learnings used
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).


script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/../.."
common_env_path="$script_dir/../.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.archive_manager \
"$@"
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/../.."
common_env_path="$script_dir/../.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.dataset_manager \
"$@"
5 changes: 3 additions & 2 deletions components/package-template/src/sbin/compress-from-s3.sh
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
common_env_path="$script_dir/.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.compress_from_s3 \
"$@"
5 changes: 3 additions & 2 deletions components/package-template/src/sbin/compress.sh
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
common_env_path="$script_dir/.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.compress \
"$@"
5 changes: 3 additions & 2 deletions components/package-template/src/sbin/decompress.sh
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
common_env_path="$script_dir/.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.decompress \
"$@"
5 changes: 3 additions & 2 deletions components/package-template/src/sbin/search.sh
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
common_env_path="$script_dir/.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.search \
"$@"
5 changes: 3 additions & 2 deletions components/package-template/src/sbin/start-clp.sh
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
common_env_path="$script_dir/.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.start_clp \
"$@"
5 changes: 3 additions & 2 deletions components/package-template/src/sbin/stop-clp.sh
@@ -1,9 +1,10 @@
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
common_env_path="$script_dir/.common-env.sh"
source "$common_env_path"

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime \
python3 \
-m clp_package_utils.scripts.stop_clp \
"$@"
21 changes: 21 additions & 0 deletions tools/deployment/package/docker-compose.runtime.yaml
Member Author

We don't strictly have to use Docker Compose for this container definition, but Docker Compose keeps the definition structured and avoids duplicated code.
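
To illustrate the duplication that the Compose file avoids, here is a rough sketch of what a plain `docker run` equivalent of the clp-runtime service in docker-compose.runtime.yaml might look like. The helper name `print_clp_runtime_docker_run` is hypothetical and not part of the PR; the mapping of Compose keys to `docker run` flags is approximate, and the sketch assumes that `CLP_HOME`, `CLP_DOCKER_PLUGIN_DIR` and `CLP_DOCKER_SOCK_PATH` have already been set (for example by sourcing `.common-env.sh`).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): prints, one argument per line, a
# plain `docker run` invocation that roughly mirrors the clp-runtime service.
# Assumes CLP_HOME, CLP_DOCKER_PLUGIN_DIR and CLP_DOCKER_SOCK_PATH are set.
print_clp_runtime_docker_run() {
    local args=(
        docker run --rm --interactive --tty
        --hostname "clp_runtime"
        --user "${CLP_FIRST_PARTY_SERVICE_UID_GID:-1000:999}"
        --log-driver "local"
        --stop-timeout 60
        --env "CLP_HOME=${CLP_HOME}"
        --env "CLP_PWD_HOST=${CLP_PWD_HOST:-${PWD}}"
        --volume "${CLP_DOCKER_PLUGIN_DIR}:/usr/local/lib/docker/cli-plugins:ro"
        --volume "${CLP_DOCKER_SOCK_PATH}:/var/run/docker.sock"
        --volume "${CLP_HOME}:${CLP_HOME}"
        --volume "/usr/bin/docker:/usr/bin/docker:ro"
        "${CLP_PACKAGE_CONTAINER_IMAGE_REF:-clp-package}"
        # Command to run inside the container, e.g. python3 -m <module>.
        "$@"
    )
    # Print instead of executing so the sketch runs without Docker.
    printf '%s\n' "${args[@]}"
}
```

Without a shared definition, every wrapper script under sbin/ would have to repeat this argument list, which is the duplication the Compose file sidesteps.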

Member Author

Host network mode should be used; otherwise, the start_clp.py script cannot properly check which IP addresses it can bind to on the host.

@coderabbitai file an issue
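
One way the host network mode suggestion could be tried out locally is a Compose override file, sketched below. The override file name is hypothetical, and this is not necessarily the change that will be made for the tracked issue; it only relies on Compose's `network_mode` key and its support for merging multiple `-f` files.

```shell
#!/usr/bin/env bash

# Hypothetical override file; not part of the PR.
override_path="$(mktemp -d)/docker-compose.runtime.host-network.yaml"

# Only the key being overridden is needed; Compose merges it into the
# clp-runtime service from the base file.
cat > "$override_path" << 'EOF'
services:
  clp-runtime:
    network_mode: "host"
EOF

# Print (rather than run) the command that would apply the override, since
# running it needs Docker and an installed CLP package.
printf 'docker compose -f %s -f %s run --rm clp-runtime python3 -m clp_package_utils.scripts.start_clp\n' \
    '$CLP_HOME/docker-compose.runtime.yaml' "$override_path"
```

Later `-f` files take precedence over earlier ones, so the base file stays untouched while experimenting.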

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for enabling host network mode in docker-compose.runtime.yaml

gh issue create \
  --repo y-scope/clp \
  --title "Enable host network mode for clp-runtime service in docker-compose.runtime.yaml" \
  --body "## Problem

The \`clp-runtime\` service in \`tools/deployment/package/docker-compose.runtime.yaml\` should use host network mode to allow the \`start_clp.py\` script to properly check what IP addresses it can bind to on the host.

## Context

Currently, the service runs with bridge networking, which isolates the container's network stack from the host. This prevents \`start_clp.py\` from correctly detecting available host IP addresses for binding.

## Backlinks

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2496275743
- Reported by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1560 to track enabling host network mode for the clp-runtime service. The issue includes the context about start_clp.py needing proper access to host IP addresses and references this PR and comment.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Learnt from: 20001020ycx
Repo: y-scope/clp PR: 1368
File: components/clp-mcp-server/clp_mcp_server/__init__.py:11-15
Timestamp: 2025-10-02T15:48:58.961Z
Learning: In the clp-mcp-server component (components/clp-mcp-server/clp_mcp_server/__init__.py), the default host binding of 0.0.0.0 is intentional because the server is designed to be deployed in Docker containers where this binding is necessary to accept external connections.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

@@ -0,0 +1,21 @@
x-service-defaults: &service_defaults
  image: "${CLP_PACKAGE_CONTAINER_IMAGE_REF:-clp-package}"
  logging:
    driver: "local"
  stop_grace_period: "60s"

services:
  clp-runtime:
    <<: *service_defaults
    hostname: "clp_runtime"
    user: "${CLP_FIRST_PARTY_SERVICE_UID_GID:-1000:999}"
    environment:
      CLP_HOME: "${CLP_HOME}"
      CLP_PWD_HOST: "${CLP_PWD_HOST:-${PWD}}"
    volumes:
      - "${CLP_DOCKER_PLUGIN_DIR}:/usr/local/lib/docker/cli-plugins:ro"
      - "${CLP_DOCKER_SOCK_PATH}:/var/run/docker.sock"
      - "${CLP_HOME}:${CLP_HOME}"
      - "/usr/bin/docker:/usr/bin/docker:ro"
    stdin_open: true
    tty: true