Conversation

@junhaoliao
Owner

@junhaoliao junhaoliao commented Oct 25, 2025

Summary

  • Inline the uv project directories directly in the check/update loops using YAML anchors.
  • Reuse G_COMPONENTS_DIR for component paths while handling other uv projects from the repo root.
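
The anchor-based loops described above can be sketched roughly as follows. Task names follow the PR; the project list and exact loop syntax are illustrative, not the actual contents of taskfiles/deps/lock.yaml:

```yaml
check-uv:
  cmds:
    - for: &uv_project_dirs
        - "components/clp-package-utils"
        - "components/clp-py-utils"
        - "integration-tests"
      cmd: "uv lock --check --project {{.ITEM}}"

update-uv:
  cmds:
    - for: *uv_project_dirs
      cmd: "uv lock --project {{.ITEM}}"
```

The anchor keeps the two loops over the same directory list from drifting apart while still inlining the list in the Taskfile.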

Testing

  • task deps:lock:check-uv

https://chatgpt.com/codex/tasks/task_e_68fcd91ef21c83278d173c54269455c2

Summary by CodeRabbit

  • New Features

    • Log conversion tool and CLI producing serialized run files.
    • S3- and FS-based compression tooling plus a helper script.
  • Refactor

    • Controller lifecycle API renamed and environment setup separated (affects startup flow).
  • Chores

    • CI/workflow and task targets to validate/update dependency lock files and include log-converter in builds/packages.
    • Container/network and deployment defaults adjusted for package networking and archive mounts.
  • Documentation

    • Docs for CI lock workflow and new --setup-only startup behavior.
  • Tests

    • Updated tests and integration configs for log-converter and related outputs.

@coderabbitai

coderabbitai bot commented Oct 25, 2025

Walkthrough

Adds UV lock Taskfile tasks and a GitHub Actions workflow; provides FS and S3 compression CLIs and a shell launcher; introduces a new C++ log-converter executable (parsing, buffered conversion, KV-IR serialization) with build integration; applies packaging, deployment, config, controller lifecycle, and test updates across multiple components.

Changes

UV lock tasks & CI
  • Files: taskfiles/deps/lock.yaml, .github/workflows/clp-uv-checks.yaml, docs/src/dev-docs/tooling-gh-workflows.md
  • Summary: Added check-uv, update-uv, and an internal uv-lock Taskfile task; added a new clp-uv-checks GitHub Actions workflow (matrix runners, cron/PR/push/dispatch triggers); documented the workflow.

Compression tooling (FS & S3)
  • Files: components/clp-package-utils/clp_package_utils/scripts/compress.py, components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py, components/clp-package-utils/clp_package_utils/scripts/native/compress.py, components/clp-package-utils/clp_package_utils/general.py, components/package-template/src/sbin/compress-from-s3.sh
  • Summary: Reworked the FS-only compressor internals and CLI; added a new S3-oriented CLI, compress_from_s3.py, with URL-list generation, container orchestration, and cleanup; extended the native compressor to support S3 inputs and helpers; added S3-related constants and a shell launcher.

C++ log-converter component
  • Files: components/core/src/clp_s/log_converter/* (log_converter.cpp, LogSerializer.*, CommandLineArguments.*, LogConverter.*, CMakeLists.txt), components/core/src/clp_s/CMakeLists.txt
  • Summary: Added a log-converter executable: command-line parsing, readers, a buffered LogConverter implementation, a LogSerializer producing KV-IR, and CMake targets; integrated log_converter into the project build.

FileWriter move semantics
  • Files: components/core/src/clp_s/FileWriter.hpp
  • Summary: Disabled copy operations and added a move constructor and move assignment operator for FileWriter.

Build & packaging updates
  • Files: taskfile.yaml, taskfiles/deps/main.yaml
  • Summary: Included log-converter in build/package flows; updated the log_surgeon tarball URL and checksum; added CMAKE_BUILD_TYPE=Release for a dependency.

Server utils & tests
  • Files: components/clp-mcp-server/clp_mcp_server/server/utils.py, components/clp-mcp-server/tests/server/test_utils.py
  • Summary: The log formatter now requires/includes a link field; tests updated (added a LINK attr and link values in raw entries and expected strings).

Tests & integration config
  • Files: components/core/tests/test-ParserWithUserSchema.cpp, integration-tests/tests/utils/config.py
  • Summary: Adjusted an expected token name in tests; added log-converter to the required core binaries in the integration test config.

Deployment & compose
  • Files: tools/deployment/presto-clp/docker-compose.yaml, tools/deployment/presto-clp/scripts/init.py, tools/deployment/presto-clp/.gitignore
  • Summary: Switched to CLP_STAGED_ARCHIVES_DIR and read-only archive mounts with an empty fallback volume; added a /.env ignore; integrated CLP package network naming and metadata DB host constants.

Controller & start flow
  • Files: components/clp-package-utils/clp_package_utils/controller.py, components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • Summary: Refactored the controller lifecycle: added set_up_env and renamed the abstract methods (start/stop); DockerComposeController implements env setup; start_clp.py adds --setup-only to run setup and exit.

Presto/container config & docs
  • Files: components/clp-py-utils/clp_py_utils/clp_config.py, docs/src/user-docs/quick-start/clp-text.md, docs/src/user-docs/quick-start/clp-json.md, docs/src/dev-docs/components-core/index.md, docs/src/dev-docs/tooling-gh-workflows.md
  • Summary: Added a PRESTO coordinator constant and a transform_for_container method; updated docs with --setup-only tips and a dependency version update; removed an example tying the DB host to non-localhost.

Docker image change
  • Files: tools/docker-images/clp-package/Dockerfile
  • Summary: Added an LD_LIBRARY_PATH ENV and copied the libmariadbcpp shared libraries into the CLP lib directory.

Misc scripts & packaging
  • Files: components/package-template/src/sbin/compress-from-s3.sh, taskfile.yaml
  • Summary: Added a shell launcher for the S3 compressor; included log-converter in the packaging/rsync targets.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant GH as GitHub Actions
  participant Runner as CI runner
  participant Repo as Repository
  participant TaskCLI as task CLI
  participant Task as Taskfile `check-uv`
  participant UV as internal `uv-lock`

  Note over GH: Trigger (PR / push / cron / dispatch)
  GH->>Runner: start job
  Runner->>Repo: checkout (with submodules)
  Runner->>Runner: install task CLI & uv tool
  Runner->>TaskCLI: run `task deps:lock:check-uv`
  TaskCLI->>Task: invoke `check-uv`
  loop per project
    Task->>Repo: set `PROJECT_DIR`
    Task->>UV: run `uv lock --check` in `PROJECT_DIR`
    UV-->>Task: return status
  end
  Task-->>TaskCLI: aggregate results
  TaskCLI-->>Runner: exit code
  Runner-->>GH: report status
```
```mermaid
sequenceDiagram
  autonumber
  participant User as Operator
  participant CLI as compress_from_s3
  participant Config as CLP config / DB
  participant S3 as S3 service
  participant FS as Host filesystem (temp files)
  participant Container as Compression container

  User->>CLI: run with S3 subcommand + args
  CLI->>Config: load CLP config & validate storage type
  CLI->>S3: parse & validate URLs / prefixes (bucket/region)
  CLI->>FS: write URL list file
  CLI->>Container: start container with mounts and config
  Container->>Container: run native compressor `--input-type s3` with URL list
  Container-->>CLI: exit code
  CLI->>FS: cleanup URL list & temp config on success
  CLI-->>User: return status
```
```mermaid
sequenceDiagram
  autonumber
  participant User as Operator
  participant LC as log-converter (main)
  participant CLA as CommandLineArguments
  participant Conv as LogConverter
  participant Serial as LogSerializer
  participant FS as filesystem/Reader

  User->>LC: invoke with args
  LC->>CLA: parse arguments (files, auth, output-dir)
  CLA-->>LC: parsed input paths
  LC->>FS: open each input (file or network reader)
  loop per input
    LC->>Conv: convert_file(path, reader, output-dir)
    Conv->>Serial: create serializer for output file
    Conv->>Conv: refill/parse buffer, produce messages
    Conv->>Serial: add_message(...), flush as needed
    Serial-->>Conv: success/failure
  end
  LC-->>User: exit code
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

  • Areas needing focused review:
    • C++ log-converter: buffer growth, parser integration, error propagation, ownership/move semantics (FileWriter/LogSerializer).
    • New S3 compression toolchain: CLI validation, S3 URL parsing, AWS credentials extraction, container mounts, temp-file handling, and cleanup.
    • Controller refactor: public API renames (set_up_env/start/stop) and DockerComposeController env generation/writing.
    • Taskfile & GitHub workflow: loop variable propagation and flagged paths.
    • Server utils & tests: formatter now requires link—verify all consumers/tests updated.
    • Packaging/build: CMake additions, Dockerfile library copy, and updated dependency tarball checksum.

Poem

🐇
I nibble paths and mount the skies,
I stash URLs where metadata lies,
I hop through locks and build the stream,
Compress and convert — a rabbit's dream,
CI hums softly as code flies.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 56.52%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title "Refine uv lock maintenance loops" directly relates to the primary changes in this pull request. The raw_summary shows that the PR introduces three new tasks in taskfiles/deps/lock.yaml (check-uv, update-uv, and an internal uv-lock task) that loop over uv projects, and the PR objectives explicitly state the intent to "inline the uv project directories directly in the check/update loops using YAML anchors", which aligns with what the title describes. The title is concise, specific, and clearly identifies the primary change without vague language or unnecessary detail.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39417f0 and b5dc6ca.

📒 Files selected for processing (1)
  • .github/workflows/clp-uv-checks.yaml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/clp-uv-checks.yaml

18-18: "paths" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (1)
.github/workflows/clp-uv-checks.yaml (1)

39-52: Workflow structure and actions are well-configured.

The remaining workflow configuration looks solid:

  • Matrix strategy properly targets macOS and Ubuntu runners (lines 31–36)
  • Checkout action includes recursive submodules (lines 39–41)
  • Task CLI and uv are installed correctly with pinned versions (lines 43–48)
  • Validation step invokes the lock check command (lines 50–52)

Once the YAML alias issue on line 18 is resolved, this workflow should execute as intended.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
taskfiles/deps/lock.yaml (2)

13-26: Add documentation comments for public tasks.

Unlike the check-rust and update-rust tasks below (with JSDoc-style parameter docs), the new check-uv and update-uv tasks lack documentation. Consider adding comments explaining the task purpose, parameters, and any important behavior.

Also applies to: 34-41
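
One possible shape for the suggested docs, mirroring the JSDoc-style comments on the Rust tasks. The comment wording and the CHECK var are illustrative; the internal uv-lock task's actual parameters may differ:

```yaml
# Checks that every uv project's `uv.lock` file is up to date with its
# `pyproject.toml`.
check-uv:
  cmds:
    - task: "uv-lock"
      vars:
        CHECK: "true"
```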


13-26: Consider explicit toolchain dependency.

The cargo tasks declare deps: ["toolchains:rust"] to ensure the Rust toolchain is available. For consistency and clarity, verify whether check-uv and update-uv should similarly declare a dependency on a UV toolchain (e.g., deps: ["toolchains:uv"]).

Also applies to: 34-41
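
If such a toolchain task exists (this sketch assumes a toolchains:uv task, which the repo may not define), the change is a one-line deps entry:

```yaml
check-uv:
  deps: ["toolchains:uv"]
  cmds:
    - task: "uv-lock"
```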

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb7f0d5 and 79b88a8.

📒 Files selected for processing (1)
  • taskfiles/deps/lock.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)

gibber9809 and others added 8 commits October 25, 2025 19:16
…ured text logs into KV-IR; Update log-surgeon to yscope/log-surgeon@840f262. (y-scope#1460)

Co-authored-by: Bingran Hu <[email protected]>
Co-authored-by: kirkrodrigues <[email protected]>
Co-authored-by: LinZhihao-723 <[email protected]>
…er Compose project to match the CLP package Compose project (fixes y-scope#1496). (y-scope#1500)
…or S3-object and S3-key-prefix compression; Restrict `compress.py` to accept local file paths only. (y-scope#1476)

Co-authored-by: LinZhihao-723 <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79b88a8 and 212a904.

📒 Files selected for processing (3)
  • .github/workflows/clp-uv-checks.yaml (1 hunks)
  • docs/src/dev-docs/tooling-gh-workflows.md (1 hunks)
  • taskfiles/deps/lock.yaml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • taskfiles/deps/lock.yaml
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/clp-uv-checks.yaml

18-18: "paths" section must be sequence node but got alias node with "" tag

(syntax-check)

🪛 GitHub Actions: clp-lint
.github/workflows/clp-uv-checks.yaml

[error] 1-1: wrong new line character: expected \n

🪛 GitHub Check: lint-check (ubuntu-24.04)
.github/workflows/clp-uv-checks.yaml

[warning] 39-39:
39:9 [comments] too few spaces before comment: expected 2


[failure] 1-1:
1:22 [new-lines] wrong new line character: expected \n

🪛 YAMLlint (1.37.1)
.github/workflows/clp-uv-checks.yaml

[error] 1-1: wrong new line character: expected \n

(new-lines)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (2)
docs/src/dev-docs/tooling-gh-workflows.md (1)

94-98: LGTM!

The documentation addition is clear, appropriately positioned, and accurately describes the new workflow's purpose.

.github/workflows/clp-uv-checks.yaml (1)

1-1: File line endings are already correct (LF).

Verification confirms that .github/workflows/clp-uv-checks.yaml currently uses Unix-style line endings (LF, \n) exclusively. No CRLF characters detected. The issue has been resolved.

Comment on lines 5 to 18
```yaml
    paths: &monitored_paths
      - ".github/workflows/clp-uv-checks.yaml"
      - "components/clp-mcp-server/pyproject.toml"
      - "components/clp-mcp-server/uv.lock"
      - "components/clp-package-utils/pyproject.toml"
      - "components/clp-package-utils/uv.lock"
      - "components/clp-py-utils/pyproject.toml"
      - "components/clp-py-utils/uv.lock"
      - "components/job-orchestration/pyproject.toml"
      - "components/job-orchestration/uv.lock"
      - "integration-tests/pyproject.toml"
      - "integration-tests/uv.lock"
  push:
    paths: *monitored_paths
```

⚠️ Potential issue | 🔴 Critical

YAML alias syntax is not supported in GitHub Actions paths triggers.

Actionlint reports that using an alias for the paths key (line 18: paths: *monitored_paths) is invalid. GitHub Actions workflow YAML does not support this syntax—the paths section must be a direct sequence node, not an alias reference.

This will likely cause the workflow to fail parsing.

Replace the alias reference on line 18 with the full list of paths:

```diff
-  push:
-    paths: *monitored_paths
+  push:
+    paths:
+      - ".github/workflows/clp-uv-checks.yaml"
+      - "components/clp-mcp-server/pyproject.toml"
+      - "components/clp-mcp-server/uv.lock"
+      - "components/clp-package-utils/pyproject.toml"
+      - "components/clp-package-utils/uv.lock"
+      - "components/clp-py-utils/pyproject.toml"
+      - "components/clp-py-utils/uv.lock"
+      - "components/job-orchestration/pyproject.toml"
+      - "components/job-orchestration/uv.lock"
+      - "integration-tests/pyproject.toml"
+      - "integration-tests/uv.lock"
```

Alternatively, if duplicating is not preferred, define the workflow in a separate script or use a workflow generation tool, but for inline YAML, direct sequence nodes are required.

🧰 Tools
🪛 actionlint (1.7.8)

18-18: "paths" section must be sequence node but got alias node with "" tag

(syntax-check)

🤖 Prompt for AI Agents
.github/workflows/clp-uv-checks.yaml around lines 5 to 18: the workflow uses a
YAML alias (&monitored_paths) and then references it with an alias at line 18
(paths: *monitored_paths), which GitHub Actions does not support; replace the
alias reference with the full explicit sequence of paths under the push.paths
key (i.e., copy the list from the monitored_paths definition into the push.paths
section) so push.paths is a direct sequence node; if duplication is
unacceptable, move path generation out of inline YAML into a script or workflow
generator, but for this file simply expand the alias inline.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (17)
components/clp-mcp-server/clp_mcp_server/server/utils.py (1)

86-89: Docstring says timestamp required, but code treats it as optional

The function emits "N/A" when timestamp is missing/invalid. Update docs to reflect required vs optional keys and the output format.

-    - "timestamp": An integer representing the epoch timestamp in milliseconds.
-    - "message": A string representing the log message.
-    - "link": A string representing the link to open the log viewer displaying the message.
+    - "message": A string representing the log message. Required.
+    - "link": A string representing the link to open the log viewer displaying the message. Required.
+    - "timestamp": An integer in epoch milliseconds. Optional; if missing/invalid, "N/A" is used.
 
-    The message will be formatted as `timestamp: <date string>, message: <message>, link: <link>`.
+    The message will be formatted as: `timestamp: <date string>, message: <message>, link: <link>`.
components/clp-mcp-server/tests/server/test_utils.py (3)

16-17: Annotate class-level mutables with ClassVar (RUF012) or move them to module scope

Ruff flags these attributes; they’re intended as class constants/fixtures. Annotate with typing.ClassVar (and optionally LINK) or lift them to module-level constants.

```diff
+from typing import ClassVar
@@
-class TestUtils:
+class TestUtils:
@@
-    LINK = "http://localhost:4000/"
+    LINK: ClassVar[str] = "http://localhost:4000/"
@@
-    INVALID_TYPE_ENTRIES = [
+    INVALID_TYPE_ENTRIES: ClassVar = [
@@
-    EXPECTED_INVALID_TYPE = [
+    EXPECTED_INVALID_TYPE: ClassVar = [
@@
-    INVALID_VALUE_ENTRIES = [
+    INVALID_VALUE_ENTRIES: ClassVar = [
@@
-    EXPECTED_INVALID_VALUE = [
+    EXPECTED_INVALID_VALUE: ClassVar = [
@@
-    MISSING_TIMESTAMP_AND_MESSAGE_ENTRY = [
+    MISSING_TIMESTAMP_AND_MESSAGE_ENTRY: ClassVar = [
@@
-    EXPECTED_MISSING_TIMESTAMP_AND_MESSAGE = [
+    EXPECTED_MISSING_TIMESTAMP_AND_MESSAGE: ClassVar = [
@@
-    RAW_LOG_ENTRIES = [
+    RAW_LOG_ENTRIES: ClassVar = [
@@
-    EXPECTED_RESULTS = [
+    EXPECTED_RESULTS: ClassVar = [
```

Alternatively, define these at module scope (outside the class) to avoid ClassVar annotations entirely. Based on static analysis hints.

Also applies to: 42-46, 49-60, 61-70, 73-88, 93-143, 145-172


200-206: Remove print() inside raises blocks

Printing inside pytest.raises adds noise and isn’t needed.

```diff
-        with pytest.raises(ValueError) as exc_info:
-            print(convert_date_string_to_epoch("2024.10.18T16:00:00.123"))
+        with pytest.raises(ValueError) as exc_info:
+            convert_date_string_to_epoch("2024.10.18T16:00:00.123")
@@
-        with pytest.raises(ValueError) as exc_info:
-            print(convert_date_string_to_epoch("2024-10-18T16-00-00-123"))
+        with pytest.raises(ValueError) as exc_info:
+            convert_date_string_to_epoch("2024-10-18T16-00-00-123")
```

24-41: Add a negative test for missing/invalid "link" to lock in behavior

Given link is now required in formatting, add a case with no link and with non-str link to assert they’re skipped with a warning (after guarding in utils).

I can draft the test and a caplog assertion if helpful.

tools/deployment/presto-clp/docker-compose.yaml (1)

46-48: Consider validating at least one storage path is configured.

The empty dummy volume is a clever pattern to allow optional mounts, but it means the container will start successfully even if neither CLP_ARCHIVES_DIR nor CLP_STAGED_ARCHIVES_DIR is set. While init.py ensures one is set based on storage type, consider adding a runtime validation in the worker startup script to fail fast if both mounts are empty.

Would you like me to generate a validation script that checks for at least one configured storage path during worker initialization?
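
One possible fail-fast check, sketched as shell functions for the worker entrypoint. The function names are illustrative, and the mount points passed to them would need to match the container-side paths in the compose file:

```shell
# Hypothetical fail-fast check: succeed only if at least one of the two
# candidate archive mounts exists and is non-empty.

has_contents() {
    # True if the directory exists and contains at least one entry.
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

require_archive_mount() {
    if ! has_contents "$1" && ! has_contents "$2"; then
        echo "error: neither archive mount is populated: $1, $2" >&2
        return 1
    fi
}
```

The entrypoint would call require_archive_mount on the two container-side mount points before launching the worker, so a misconfigured deployment fails immediately instead of serving empty data.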

taskfile.yaml (1)

112-119: DRY the repeated core binary list via a YAML anchor

The paths/targets for core binaries are duplicated across sources, rsync, generates, and TARGETS. Suggest centralizing with an anchor to prevent drift.

Example approach:

  • Define an anchor with the core binary paths once, e.g., under vars or at task scope:

    ```yaml
    core_bin_paths: &core_bin_paths
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clg"
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clo"
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clp"
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clp-s"
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/indexer"
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter"
      - "{{.G_CORE_COMPONENT_BUILD_DIR}}/reducer-server"
    ```

  • Reuse it with *core_bin_paths in package.sources, and join the list for rsync/TARGETS.

Also applies to: 157-167, 218-225, 230-230

components/core/src/clp_s/log_converter/CMakeLists.txt (1)

11-37: Add an install rule to simplify packaging

Optional: provide an install target so downstream packaging can use cmake --install instead of manual rsync.

Example:

```cmake
install(TARGETS log-converter RUNTIME DESTINATION bin)
```
components/core/src/clp_s/log_converter/LogSerializer.cpp (1)

42-46: Map exceptions to accurate error codes (optional)

Returning no_such_file_or_directory for any open failure loses fidelity (e.g., EACCES). Consider catching FileWriter::OperationFailed and mapping to the underlying errno/category, or return a generic io_error when errno is unknown.
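
A sketch of the suggested mapping; the helper name is hypothetical, and the real code would capture errno at the failure site inside FileWriter:

```cpp
#include <cerrno>
#include <system_error>

// Hypothetical helper: translate an errno value captured when opening the
// output file fails into a std::error_code, falling back to a generic I/O
// error when no errno was recorded.
auto errno_to_error_code(int captured_errno) -> std::error_code {
    if (0 == captured_errno) {
        return std::make_error_code(std::errc::io_error);
    }
    return {captured_errno, std::generic_category()};
}
```

With this, an EACCES failure surfaces as permission_denied instead of a blanket no_such_file_or_directory.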

components/core/src/clp_s/log_converter/log_converter.cpp (1)

61-71: Use create_directories for nested output paths

create_directory fails if parent dirs are missing. create_directories handles nested creation and returns success if already present.

```diff
-    if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec)
+    if (false == std::filesystem::create_directories(command_line_arguments.get_output_dir(), ec)
         && ec)
```
components/core/src/clp_s/log_converter/LogConverter.cpp (2)

75-86: Make timestamp stripping robust; don’t assume it’s a strict prefix.

substr(timestamp.length()) assumes the event string starts with the timestamp and drops exactly that many bytes, which can mis-slice when delimiters or non-prefix timestamps occur. Guard with a prefix check and trim a single spacer if present.

```diff
-            auto const& event{parser.get_log_parser().get_log_event_view()};
-            auto const message{event.to_string()};
-            if (nullptr != event.get_timestamp()) {
-                auto const timestamp{event.get_timestamp()->to_string_view()};
-                auto const message_without_timestamp{
-                        std::string_view{message}.substr(timestamp.length())
-                };
-                YSTDLIB_ERROR_HANDLING_TRYV(
-                        serializer.add_message(timestamp, message_without_timestamp)
-                );
-            } else {
-                YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message));
-            }
+            auto const& event{parser.get_log_parser().get_log_event_view()};
+            auto const message{event.to_string()};
+            if (nullptr != event.get_timestamp()) {
+                auto const timestamp{event.get_timestamp()->to_string_view()};
+                std::string_view msg_view{message};
+                if (msg_view.size() >= timestamp.size()
+                    && 0 == msg_view.compare(0, timestamp.size(), timestamp)) {
+                    msg_view.remove_prefix(timestamp.size());
+                    if (!msg_view.empty() && (msg_view.front() == ' ' || msg_view.front() == '\t')) {
+                        msg_view.remove_prefix(1);
+                    }
+                }
+                YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(timestamp, msg_view));
+            } else {
+                YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message));
+            }
```

Confirm LogSerializer::add_message eagerly copies data into its IR buffer. If it defers, views into m_buffer (timestamp) and into message would be unsafe once compact_buffer() or the local message goes out of scope.


110-112: Return an I/O error for reader failures, not “not_enough_memory”.

Mapping reader failures to std::errc::not_enough_memory is misleading and hurts debuggability. Prefer std::errc::io_error (or propagate a more specific code).

```diff
-    if (clp::ErrorCode_Success != rc) {
-        return std::errc::not_enough_memory;
-    }
+    if (clp::ErrorCode_Success != rc) {
+        return std::errc::io_error;
+    }
```
components/core/src/clp_s/log_converter/LogConverter.hpp (2)

49-52: Align documented error with implementation (use I/O error).

If you adopt the io_error change in refill_buffer, update this comment accordingly.

```diff
-     * - std::errc::not_enough_memory if `clp::ReaderInterface::try_read()` returns an error.
+     * - std::errc::io_error if `clp::ReaderInterface::try_read()` returns an error.
```

10-12: Remove unused include to keep the header minimal.

../InputConfig.hpp isn’t referenced in this header.

-#include "../InputConfig.hpp"
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (2)

84-90: Make auth parsing case-insensitive (optional).

Users may pass S3/None. Lowercase the input before comparison for better UX.

Example:

```cpp
std::string s(auth_method);
std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) { return std::tolower(c); });
if (s == cS3Auth) { ... } else if (s != cNoAuth) { throw ...; }
```

127-140: Optionally validate output directory early.

Fail fast if --output-dir is empty or non-existent, or create it if policy allows.
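
A minimal validation sketch. The function name and error-reporting style are illustrative; since CommandLineArguments already throws on bad input, an exception-based check is shown, and creating the directory instead is left as a policy choice:

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical early check for --output-dir: reject empty values and require
// an existing directory.
void validate_output_dir(std::string const& output_dir) {
    if (output_dir.empty()) {
        throw std::invalid_argument("--output-dir must be non-empty");
    }
    std::error_code ec;
    if (false == std::filesystem::is_directory(output_dir, ec) || ec) {
        throw std::invalid_argument("--output-dir must be an existing directory");
    }
}
```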

components/package-template/src/sbin/compress-from-s3.sh (1)

1-9: Improve portability and robustness (readlink -f, PYTHONPATH, strict mode)

  • readlink -f is not portable on macOS; add a Python fallback.
  • Preserve existing PYTHONPATH instead of replacing it.
  • Enable strict bash options.

Apply this diff:

```diff
 #!/usr/bin/env bash
+set -euo pipefail
+# IFS=$'\n\t'  # optionally tighten word splitting
-
-script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
 package_root="$script_dir/.."

-PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
-    python3 \
-    -m clp_package_utils.scripts.compress_from_s3 \
-    "$@"
+# Resolve site-packages path portably
+site_pkgs="$package_root/lib/python3/site-packages"
+if command -v readlink >/dev/null 2>&1 && readlink -f / >/dev/null 2>&1; then
+  site_pkgs_resolved="$(readlink -f "$site_pkgs")"
+else
+  site_pkgs_resolved="$(python3 - <<'PY' "$site_pkgs"
+import os, sys
+print(os.path.realpath(sys.argv[1]))
+PY
+)"
+fi
+
+PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site_pkgs_resolved" \
+  /usr/bin/env python3 -m clp_package_utils.scripts.compress_from_s3 "$@"
```
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)

222-278: Make S3 keys ordering deterministic and use path-aware common prefix

  • Using a set loses order; job specs can vary across runs. Prefer a stable list.
  • os.path.commonprefix is char-based; use a path-aware function to avoid partial component prefixes.

Apply this diff:

```diff
-import os
+import os
+import posixpath
@@
-    keys = set()
+    keys_set: set[str] = set()
+    key_list: list[str] = []
@@
-        if key in keys:
+        if key in keys_set:
             raise ValueError(f"Duplicate S3 key found: {key}.")
-        keys.add(key)
+        keys_set.add(key)
+        key_list.append(key)
@@
-    key_list: list[str] = list(keys)
-    key_prefix = os.path.commonprefix(key_list)
+    # Use POSIX path semantics for S3 keys and derive a component-safe common prefix
+    try:
+        key_prefix = posixpath.commonpath(key_list)
+    except ValueError:
+        key_prefix = posixpath.commonprefix(key_list)
@@
-    return region_code, bucket_name, key_prefix, key_list
+    return region_code, bucket_name, key_prefix, key_list
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 212a904 and cb57c0e.

📒 Files selected for processing (25)
  • components/clp-mcp-server/clp_mcp_server/server/utils.py (2 hunks)
  • components/clp-mcp-server/tests/server/test_utils.py (6 hunks)
  • components/clp-package-utils/clp_package_utils/general.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (9 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (6 hunks)
  • components/core/src/clp_s/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/FileWriter.hpp (2 hunks)
  • components/core/src/clp_s/log_converter/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/log_converter/CommandLineArguments.hpp (1 hunks)
  • components/core/src/clp_s/log_converter/LogConverter.cpp (1 hunks)
  • components/core/src/clp_s/log_converter/LogConverter.hpp (1 hunks)
  • components/core/src/clp_s/log_converter/LogSerializer.cpp (1 hunks)
  • components/core/src/clp_s/log_converter/LogSerializer.hpp (1 hunks)
  • components/core/src/clp_s/log_converter/log_converter.cpp (1 hunks)
  • components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
  • components/package-template/src/sbin/compress-from-s3.sh (1 hunks)
  • docs/src/dev-docs/components-core/index.md (1 hunks)
  • integration-tests/tests/utils/config.py (1 hunks)
  • taskfile.yaml (3 hunks)
  • taskfiles/deps/main.yaml (1 hunks)
  • tools/deployment/presto-clp/.gitignore (1 hunks)
  • tools/deployment/presto-clp/docker-compose.yaml (1 hunks)
  • tools/deployment/presto-clp/scripts/init.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • tools/deployment/presto-clp/.gitignore
  • components/core/src/clp_s/CMakeLists.txt
  • docs/src/dev-docs/components-core/index.md
  • taskfiles/deps/main.yaml
🧰 Additional context used
🧬 Code graph analysis (10)
components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
components/core/src/clp_s/FileWriter.hpp (2)
  • writer (42-49)
  • writer (42-42)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (5)
  • nodiscard (27-27)
  • nodiscard (29-29)
  • nodiscard (31-33)
  • nodiscard (35-35)
  • argc (25-25)
components/core/src/clp_s/log_converter/LogConverter.hpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
  • nodiscard (39-39)
  • nodiscard (43-43)
  • output_dir (33-34)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
  • m_network_auth (51-51)
components/core/src/clp_s/log_converter/LogSerializer.cpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (4)
  • output_dir (33-34)
  • timestamp (61-62)
  • message (71-72)
  • flush_buffer (77-81)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • AwsAuthentication (385-415)
  • CLPConfig (605-817)
  • StorageType (119-121)
components/job-orchestration/job_orchestration/scheduler/job_config.py (2)
  • S3InputConfig (31-42)
  • FsInputConfig (23-28)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • parse_s3_url (208-240)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • StorageEngine (94-96)
  • StorageType (119-121)
  • get_clp_connection_params_and_type (200-221)
components/clp-package-utils/clp_package_utils/general.py (10)
  • dump_container_config (344-361)
  • generate_container_config (231-312)
  • generate_container_name (163-168)
  • generate_container_start_cmd (375-416)
  • get_clp_home (144-160)
  • get_container_config_filename (340-341)
  • JobType (101-105)
  • load_config_file (427-449)
  • validate_and_load_db_credentials_file (480-484)
  • validate_dataset_name (634-660)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
components/core/src/clp_s/FileWriter.hpp (1)
  • path (127-127)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
  • output_dir (33-34)
  • message (71-72)
  • timestamp (61-62)
components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
  • create (28-49)
  • create (28-29)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (7)
  • CommandLineArguments (22-22)
  • CommandLineArguments (22-22)
  • nodiscard (27-27)
  • nodiscard (29-29)
  • nodiscard (31-33)
  • nodiscard (35-35)
  • argc (25-25)
components/core/src/clp_s/log_converter/LogConverter.hpp (3)
  • LogConverter (20-20)
  • path (35-36)
  • reader (53-54)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageType (119-121)
🪛 Ruff (0.14.1)
components/clp-mcp-server/tests/server/test_utils.py

42-46: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


49-60: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


61-70: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


73-88: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py

162-162: Avoid specifying long messages outside the exception class

(TRY003)


166-166: Avoid specifying long messages outside the exception class

(TRY003)


186-188: Avoid specifying long messages outside the exception class

(TRY003)


200-200: Avoid specifying long messages outside the exception class

(TRY003)


242-242: Avoid specifying long messages outside the exception class

(TRY003)


254-257: Avoid specifying long messages outside the exception class

(TRY003)


262-265: Avoid specifying long messages outside the exception class

(TRY003)


268-268: Avoid specifying long messages outside the exception class

(TRY003)


275-275: Avoid specifying long messages outside the exception class

(TRY003)


291-291: Avoid specifying long messages outside the exception class

(TRY003)


367-367: f-string without any placeholders

Remove extraneous f prefix

(F541)

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

252-252: Do not catch blind exception: Exception

(BLE001)


253-253: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


304-304: subprocess call: check for execution of untrusted input

(S603)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (14)
integration-tests/tests/utils/config.py (1)

30-38: Addition of "log-converter" to required binaries is verified and correct.

The log-converter binary is properly defined as a CMake build target and is included in the standard build pipeline (taskfile.yaml). It's built with the same configuration as other required binaries and output to the same directory. The change maintains alphabetical ordering and aligns with the actual build infrastructure.

components/core/tests/test-ParserWithUserSchema.cpp (2)

1-204: PR objectives don't match the file changes.

The PR objectives describe refining UV lock maintenance loops with YAML anchors and G_COMPONENTS_DIR, but this file is a C++ test for schema parser error messages. There appears to be a mismatch between the stated PR goals and the actual changes under review.


120-128: Manually verify that the test passes with the updated error message expectation.

The test assertion change from "AlphaNumeric" to "IdentifierCharacters" appears to be a systematic update reflecting a change in the external log_surgeon library's token naming. The schema file exists and the error location (line 3, column 4) is correct. However, since the error message is generated by the external log_surgeon::SchemaParser library and cannot be verified in the sandbox without compilation, you should:

  1. Run the test to confirm it passes: ctest --output-on-failure -R "Test error for colon missing schema file"
  2. Verify the log_surgeon library version supports the "IdentifierCharacters" token name
components/clp-mcp-server/clp_mcp_server/server/utils.py (1)

109-112: Code is asymmetric in defensive handling, but link is guaranteed by upstream contract

The review comment correctly identifies that obj["link"] uses direct indexing while timestamp and message use .get() with fallback handling. However, examining ClpConnector.read_results() shows the "link" field is explicitly constructed and added to every document before returning—it is contractually guaranteed to be present.

The asymmetry is valid from a code consistency perspective. The suggested fix would add defensive handling matching timestamp/message treatment. However, since upstream guarantees "link" presence, a KeyError would indicate a contract violation rather than a typical edge case.

Consider: either trust the data contract and document why link differs, or apply the diff for consistency and defensive robustness. Both are defensible choices depending on team preference.
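The trade-off above reduces to dict indexing versus `.get()`; a tiny sketch with a hypothetical result document:

```python
# Hypothetical search-result document: "link" is guaranteed by the upstream
# contract, while "timestamp" and "message" may be absent.
doc = {"link": "https://example.com/r/1", "message": "job started"}

link = doc["link"]                   # raises KeyError if the contract breaks
timestamp = doc.get("timestamp", 0)  # defensive fallback
message = doc.get("message", "")
```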

tools/deployment/presto-clp/scripts/init.py (1)

127-127: LGTM: Env var renamed to reflect S3 staging semantics.

The rename from CLP_ARCHIVES_DIR to CLP_STAGED_ARCHIVES_DIR for S3 storage accurately reflects that S3 uses a staging directory, while FS storage (line 118) continues to use CLP_ARCHIVES_DIR for direct archive access.

tools/deployment/presto-clp/docker-compose.yaml (1)

37-38: The volume mounts are already correctly configured with fixed container paths; worker configuration does not and should not reference them.

The docker-compose.yaml volumes at lines 37-38 are already using fixed container paths (/var/data/archives and /var/data/staged-archives), not dynamic environment variable paths. These mounted volumes are set up correctly and serve their intended purpose through environment variable substitution at deployment time (handled by init.py).

The worker's Presto configuration does not need to explicitly reference these container paths because:

  1. For filesystem storage: The CLP connector uses the environment variable CLP_ARCHIVES_DIR configured by init.py, not direct path references in worker config.
  2. For S3 storage: The worker's clp.properties is dynamically generated with S3 credentials and endpoint configuration, not filesystem paths.

The worker's configuration templates (config.properties, node.properties, velox.properties) are correctly unchanged. This is not a breaking change to worker configuration.

Likely an incorrect or invalid review comment.

taskfile.yaml (3)

117-117: log-converter included in package artifacts—LGTM

The new binary is wired into both the sources and rsync copy sets. Matches the CMake target name.

Also applies to: 163-163


223-231: core-build target list—LGTM

The generates and TARGETS now include log-converter and align with packaging paths.


298-311: Based on the verification, no action is required: Node.js v22.20.0 is available for all common OS/ARCH permutations.

The curl output confirms v22.20.0 is available for:

  • node-v22.20.0-linux-x64.tar.gz
  • node-v22.20.0-linux-arm64.tar.gz
  • node-v22.20.0-darwin-x64.tar.gz
  • node-v22.20.0-darwin-arm64.tar.gz

The taskfile uses dynamic {{OS}} and {{.NODEJS_ARCH}} variables (conditionally x64 or arm64) for the download pattern, so the hard-coded version v22.20.0 will download successfully across these standard CI environments.

components/core/src/clp_s/log_converter/CommandLineArguments.hpp (1)

1-49: API surface looks good.

No blocking issues found.

components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)

78-82: Remove the review comment: the code handles the final line without a newline correctly.

The try_read_to_delimiter implementation shows that when a file's last line lacks a trailing newline:

  1. The final getdelim() call reads remaining data successfully (returns byte count > 0)
  2. Since num_bytes_read >= 1, the data is appended to line and ErrorCodeSuccess is returned
  3. The loop body immediately processes and pushes the line to the destination vector
  4. Only on the next iteration does getdelim() return -1, causing ErrorCodeEndOfFile to be returned

No residual data is lost. The pattern of using try_read_to_delimiter with append=false is already used throughout the codebase (clp/Utils.cpp, clp/compression.cpp, glt/compression.cpp) without special EOF handling, confirming this approach is correct.

components/clp-package-utils/clp_package_utils/general.py (1)

52-54: S3 subcommand constants look good

Clear, kebab-cased, and used consistently across scripts. No issues.

components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

81-87: Explicitly setting --input-type fs is correct

Ensures the native compressor runs in FS mode. LGTM.


220-220: Mounting input_logs_dir is necessary for FS inputs

Good catch adding mounts.input_logs_dir to ensure container can see the input files.

Comment on lines +287 to +290
if not _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args):
logger.error("No S3 URLs given for compression.")
return -1


⚠️ Potential issue | 🟠 Major

Handle failures when reading --inputs-from cleanly

If the inputs file is missing or unreadable, the script raises and exits with a stack trace. Catch and log a clear error.

Apply this guard:

-    if not _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args):
+    try:
+        ok = _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args)
+    except Exception:
+        logger.exception("Failed to prepare S3 URL list for compression.")
+        return -1
+    if not ok:
         logger.error("No S3 URLs given for compression.")
         return -1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args):
logger.error("No S3 URLs given for compression.")
return -1
try:
ok = _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args)
except Exception:
logger.exception("Failed to prepare S3 URL list for compression.")
return -1
if not ok:
logger.error("No S3 URLs given for compression.")
return -1
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 287 to 290, the call to _generate_url_list can raise when the
--inputs-from file is missing or unreadable; wrap the call in a try/except,
catch the exception (e.g., Exception as e), log a clear error like "Failed to
read inputs file for --inputs-from: <error>" via logger.error, and return -1 to
exit cleanly instead of letting a stack trace propagate.
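The guard described above is a generic pattern; a minimal standalone sketch (the helper and file names are hypothetical, not the script's actual API):

```python
import logging

logger = logging.getLogger(__name__)


def _read_url_list(path: str) -> list[str]:
    """Hypothetical stand-in for the script's file-backed URL loading."""
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]


def main(inputs_from: str) -> int:
    try:
        urls = _read_url_list(inputs_from)
    except OSError:
        # logger.exception logs the traceback at ERROR level instead of
        # letting it propagate as an unhandled stack trace.
        logger.exception("Failed to read inputs file '%s'.", inputs_from)
        return -1
    if not urls:
        logger.error("No S3 URLs given for compression.")
        return -1
    return 0


rc = main("/nonexistent/url-list.txt")
```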

Comment on lines 36 to 72
def _generate_logs_list(
input_type: InputType,
container_logs_list_path: pathlib.Path,
parsed_args: argparse.Namespace,
) -> None:
if InputType.FS == input_type:
host_logs_list_path = parsed_args.path_list
with open(container_logs_list_path, "w") as container_logs_list_file:
if host_logs_list_path is not None:
with open(host_logs_list_path, "r") as host_logs_list_file:
for line in host_logs_list_file:
stripped_path_str = line.rstrip()
if "" == stripped_path_str:
# Skip empty paths
continue
resolved_path = pathlib.Path(stripped_path_str).resolve()
mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
resolved_path.anchor
)
container_logs_list_file.write(f"{mounted_path}\n")

) -> bool:
"""
Generates logs list file for the native compression script.
:param container_logs_list_path: Path to write logs list.
:param parsed_args: Parsed command-line arguments.
:return: Whether any paths were written to the logs list.
"""
host_logs_list_path = parsed_args.path_list
with open(container_logs_list_path, "w") as container_logs_list_file:
if host_logs_list_path is None:
for path in parsed_args.paths:
resolved_path = pathlib.Path(path).resolve()
mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
resolved_path.anchor
)
container_logs_list_file.write(f"{mounted_path}\n")

elif InputType.S3 == input_type:
with open(container_logs_list_path, "w") as container_logs_list_file:
container_logs_list_file.write(f"{parsed_args.paths[0]}\n")

else:
raise ValueError(f"Unsupported input type: {input_type}.")
return len(parsed_args.paths) != 0

no_path_found = True
with open(host_logs_list_path, "r") as host_logs_list_file:
for line in host_logs_list_file:
stripped_path_str = line.rstrip()
if "" == stripped_path_str:
# Skip empty paths
continue
no_path_found = False
resolved_path = pathlib.Path(stripped_path_str).resolve()
mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
resolved_path.anchor
)
container_logs_list_file.write(f"{mounted_path}\n")
return not no_path_found


⚠️ Potential issue | 🟠 Major

Validate inputs are under logs_input.directory and handle missing --path-list gracefully

  • If any provided path is outside the configured logs_input.directory, the container won't see it (mount only covers that root). Validate early to avoid hard-to-debug failures.
  • Opening a non-existent --path-list file currently raises and aborts without a friendly message.

Apply this refactor (adds input_root validation and robust file handling):

-def _generate_logs_list(
-    container_logs_list_path: pathlib.Path,
-    parsed_args: argparse.Namespace,
-) -> bool:
+def _generate_logs_list(
+    container_logs_list_path: pathlib.Path,
+    parsed_args: argparse.Namespace,
+    input_root: pathlib.Path,
+) -> bool:
@@
-    host_logs_list_path = parsed_args.path_list
-    with open(container_logs_list_path, "w") as container_logs_list_file:
+    host_logs_list_path = parsed_args.path_list
+    input_root = input_root.resolve()
+    with open(container_logs_list_path, "w") as container_logs_list_file:
         if host_logs_list_path is None:
             for path in parsed_args.paths:
                 resolved_path = pathlib.Path(path).resolve()
+                try:
+                    resolved_path.relative_to(input_root)
+                except ValueError:
+                    raise ValueError(f"Path '{resolved_path}' is outside logs_input.directory '{input_root}'.")
                 mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                     resolved_path.anchor
                 )
                 container_logs_list_file.write(f"{mounted_path}\n")
             return len(parsed_args.paths) != 0
 
-        no_path_found = True
-        with open(host_logs_list_path, "r") as host_logs_list_file:
+        no_path_found = True
+        try:
+            host_file = pathlib.Path(host_logs_list_path).resolve()
+            with open(host_file, "r") as host_logs_list_file:
                 for line in host_logs_list_file:
                     stripped_path_str = line.rstrip()
                     if "" == stripped_path_str:
                         # Skip empty paths
                         continue
                     no_path_found = False
-                resolved_path = pathlib.Path(stripped_path_str).resolve()
+                    resolved_path = pathlib.Path(stripped_path_str).resolve()
+                    try:
+                        resolved_path.relative_to(input_root)
+                    except ValueError:
+                        raise ValueError(f"Path '{resolved_path}' is outside logs_input.directory '{input_root}'.")
                     mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                         resolved_path.anchor
                     )
                     container_logs_list_file.write(f"{mounted_path}\n")
+        except FileNotFoundError as e:
+            raise ValueError(f"--path-list file not found: {host_logs_list_path}") from e
         return not no_path_found

And update the call site:

-    if not _generate_logs_list(container_logs_list_path, parsed_args):
+    if not _generate_logs_list(
+        container_logs_list_path,
+        parsed_args,
+        clp_config.logs_input.directory,
+    ):
         logger.error("No filesystem paths given for compression.")
         return -1

Committable suggestion skipped: line range outside the PR's diff.
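The containment check suggested above can be isolated into a small helper; a sketch under the assumption that both paths are resolved first (names are hypothetical):

```python
import pathlib


def is_under(path: pathlib.Path, root: pathlib.Path) -> bool:
    """Return whether `path` lies inside `root` (both already resolved)."""
    try:
        path.relative_to(root)
        return True
    except ValueError:
        return False


root = pathlib.Path("/var/log/clp")
inside = is_under(pathlib.Path("/var/log/clp/app/a.log"), root)
outside = is_under(pathlib.Path("/etc/passwd"), root)
# relative_to() alone is lexical, so resolve() first to stop ".." escapes.
sneaky = is_under(pathlib.Path("/var/log/clp/../../../etc").resolve(), root)
```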

Comment on lines +233 to 236
if not _generate_logs_list(container_logs_list_path, parsed_args):
logger.error("No filesystem paths given for compression.")
return -1


⚠️ Potential issue | 🟠 Major

Wrap logs list generation with a friendly error report

Currently, exceptions from reading --path-list surface as stack traces. Catch and report as user errors.

Apply this small guard if you prefer not to change the function signature:

-    if not _generate_logs_list(container_logs_list_path, parsed_args):
+    try:
+        ok = _generate_logs_list(container_logs_list_path, parsed_args)
+    except Exception:
+        logger.exception("Failed to prepare filesystem paths for compression.")
+        return -1
+    if not ok:
         logger.error("No filesystem paths given for compression.")
         return -1
📝 Committable suggestion


Suggested change
if not _generate_logs_list(container_logs_list_path, parsed_args):
logger.error("No filesystem paths given for compression.")
return -1
try:
ok = _generate_logs_list(container_logs_list_path, parsed_args)
except Exception:
logger.exception("Failed to prepare filesystem paths for compression.")
return -1
if not ok:
logger.error("No filesystem paths given for compression.")
return -1
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress.py around
lines 233 to 236, calling _generate_logs_list can raise exceptions that
currently surface as stack traces; wrap the call in a try/except that catches
Exception, logs a friendly user-facing error message including the exception
text (e.g. "Failed to read --path-list: <error>") and returns -1, and avoid
re-raising the exception so users see a concise error (optionally log full
traceback at debug level).

Comment on lines +33 to +49
// Delete copy constructor and assignment operator
FileWriter(FileWriter const&) = delete;
auto operator=(FileWriter const&) -> FileWriter& = delete;

// Define custom move constructor and assignment operator
FileWriter(FileWriter&& writer)
: m_file{std::exchange(writer.m_file, nullptr)},
m_fd{std::exchange(writer.m_fd, -1)} {}

auto operator=(FileWriter&& writer) -> FileWriter& {
if (this == &writer) {
return *this;
}
m_file = std::exchange(writer.m_file, nullptr);
m_fd = std::exchange(writer.m_fd, -1);
return *this;
}

⚠️ Potential issue | 🔴 Critical

Move assignment leaks existing handles; make moves noexcept and swap-based

If the target already owns an open file, the current move assignment overwrites m_file/m_fd without closing, leaking the resource. Also, marking moves noexcept helps container optimizations.

Apply this diff:

@@
-    // Delete copy constructor and assignment operator
+    // Delete copy constructor and assignment operator
     FileWriter(FileWriter const&) = delete;
     auto operator=(FileWriter const&) -> FileWriter& = delete;
 
-    // Define custom move constructor and assignment operator
-    FileWriter(FileWriter&& writer)
-            : m_file{std::exchange(writer.m_file, nullptr)},
-              m_fd{std::exchange(writer.m_fd, -1)} {}
+    // Define custom move constructor and assignment operator
+    FileWriter(FileWriter&& writer) noexcept
+            : m_file{std::exchange(writer.m_file, nullptr)},
+              m_fd{std::exchange(writer.m_fd, -1)} {}
 
-    auto operator=(FileWriter&& writer) -> FileWriter& {
-        if (this == &writer) {
-            return *this;
-        }
-        m_file = std::exchange(writer.m_file, nullptr);
-        m_fd = std::exchange(writer.m_fd, -1);
-        return *this;
-    }
+    friend void swap(FileWriter& a, FileWriter& b) noexcept {
+        using std::swap;
+        swap(a.m_file, b.m_file);
+        swap(a.m_fd, b.m_fd);
+    }
+
+    auto operator=(FileWriter&& writer) noexcept -> FileWriter& {
+        if (this != &writer) {
+            swap(*this, writer); // old resources end up in writer and are cleaned up by its dtor
+        }
+        return *this;
+    }
🤖 Prompt for AI Agents
In components/core/src/clp_s/FileWriter.hpp around lines 33 to 49, the move
assignment currently overwrites m_file/m_fd without releasing existing resources
(leak) and neither move constructor nor move assignment is marked noexcept;
replace the assignment logic with a swap-based move assignment (or explicitly
close existing handles before overwriting) so the target releases its current
file descriptor/file safely, implement the move constructor/assignment to leave
the source in a valid empty state, and mark both the move constructor and move
assignment noexcept to enable container optimizations.

Comment on lines +11 to +37
if(CLP_BUILD_EXECUTABLES)
add_executable(
log-converter
log_converter.cpp
${CLP_S_LOG_CONVERTER_SOURCES}
)
target_compile_features(log-converter PRIVATE cxx_std_20)
target_link_libraries(
log-converter
PRIVATE
Boost::program_options
clp_s::clp_dependencies
clp_s::io
fmt::fmt
log_surgeon::log_surgeon
msgpack-cxx
nlohmann_json::nlohmann_json
spdlog::spdlog
ystdlib::containers
ystdlib::error_handling
)
set_target_properties(
log-converter
PROPERTIES
RUNTIME_OUTPUT_DIRECTORY "${PROJECT_BINARY_DIR}"
)
endif()

⚠️ Potential issue | 🟠 Major

Packaging assumes this binary exists, but build is gated—risk of breakage

Taskfile unconditionally packages {{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter. If CLP_BUILD_EXECUTABLES is OFF, packaging will fail.

Options:

  • Ensure CLP_BUILD_EXECUTABLES is enabled in CI/package builds.
  • Or gate the packaging entries on the same option.
  • Or always build this target irrespective of CLP_BUILD_EXECUTABLES.
🤖 Prompt for AI Agents
components/core/src/clp_s/log_converter/CMakeLists.txt lines 11-37: Packaging
unconditionally expects the log-converter binary but this target is only created
when CLP_BUILD_EXECUTABLES is ON; to fix, make packaging robust by aligning
behavior with the same option—either enable CLP_BUILD_EXECUTABLES in CI/package
builds or (preferably) gate the packaging/taskfile entry that references
{{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter with the same
CLP_BUILD_EXECUTABLES check (or check for the target/file existence before
packaging) so packaging will only reference the binary when it is actually
built.

Comment on lines +67 to +76
std::string line;
while (true) {
error_code = reader.try_read_to_delimiter('\n', false, false, line);
if (ErrorCodeSuccess != error_code) {
break;
}
if (false == line.empty()) {
path_destination.push_back(line);
}
}

⚠️ Potential issue | 🟠 Major

Trim CRLF and trailing whitespace when reading paths.

Without trimming, Windows CRLF line endings leave a trailing '\r' that causes path resolution failures.

-    while (true) {
+    while (true) {
         error_code = reader.try_read_to_delimiter('\n', false, false, line);
         if (ErrorCodeSuccess != error_code) {
             break;
         }
-        if (false == line.empty()) {
-            path_destination.push_back(line);
-        }
+        // Trim trailing CR and spaces/tabs
+        while (!line.empty()
+               && (line.back() == '\r' || line.back() == ' ' || line.back() == '\t')) {
+            line.pop_back();
+        }
+        if (!line.empty()) {
+            path_destination.push_back(line);
+        }
     }
📝 Committable suggestion


Suggested change
std::string line;
while (true) {
error_code = reader.try_read_to_delimiter('\n', false, false, line);
if (ErrorCodeSuccess != error_code) {
break;
}
if (false == line.empty()) {
path_destination.push_back(line);
}
}
std::string line;
while (true) {
error_code = reader.try_read_to_delimiter('\n', false, false, line);
if (ErrorCodeSuccess != error_code) {
break;
}
// Trim trailing CR and spaces/tabs
while (!line.empty()
&& (line.back() == '\r' || line.back() == ' ' || line.back() == '\t')) {
line.pop_back();
}
if (!line.empty()) {
path_destination.push_back(line);
}
}
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/CommandLineArguments.cpp around lines
67 to 76, the loop reads lines but does not trim trailing CR/LF or other
whitespace, leaving a trailing '\r' on Windows that breaks path resolution;
after successfully reading a line, trim trailing whitespace (including '\r',
'\n', spaces, tabs) from the end of the string (e.g., use find_last_not_of and
erase or an equivalent rtrim) then only push_back non-empty trimmed paths into
path_destination.

Comment on lines +53 to +90
auto serializer{YSTDLIB_ERROR_HANDLING_TRYX(LogSerializer::create(output_dir, path.path))};

bool reached_end_of_stream{false};
while (false == reached_end_of_stream) {
    auto const num_bytes_read{YSTDLIB_ERROR_HANDLING_TRYX(refill_buffer(reader))};
    reached_end_of_stream = 0ULL == num_bytes_read;

    while (m_parser_offset < m_num_bytes_buffered) {
        auto const err{parser.parse_next_event(
                m_buffer.data(),
                m_num_bytes_buffered,
                m_parser_offset,
                reached_end_of_stream
        )};
        if (log_surgeon::ErrorCode::BufferOutOfBounds == err) {
            break;
        }
        if (log_surgeon::ErrorCode::Success != err) {
            return std::errc::no_message;
        }

        auto const& event{parser.get_log_parser().get_log_event_view()};
        auto const message{event.to_string()};
        if (nullptr != event.get_timestamp()) {
            auto const timestamp{event.get_timestamp()->to_string_view()};
            auto const message_without_timestamp{
                    std::string_view{message}.substr(timestamp.length())
            };
            YSTDLIB_ERROR_HANDLING_TRYV(
                    serializer.add_message(timestamp, message_without_timestamp)
            );
        } else {
            YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message));
        }
    }
}
serializer.close();
return ystdlib::error_handling::success();

⚠️ Potential issue | 🟠 Major

Ensure serializer is closed on all paths (RAII or scope guard).

Errors returned via TRYX, and the early return on a non-success parse result, exit before serializer.close(), yielding incomplete IR (missing the EOF marker) and a possibly unclosed file. Make LogSerializer close on destruction and make close() idempotent, or add a scope guard here.

Apply the LogSerializer.hpp diff from its file comment to get close-on-destroy and idempotency; the explicit serializer.close(); here then becomes harmless.

🤖 Prompt for AI Agents
components/core/src/clp_s/log_converter/LogConverter.cpp lines 53-90:
serializer.close() is only called on the happy path so early returns from TRYX
or parse failures can leave the serializer/file unclosed and EOF missing; make
LogSerializer's destructor call close() and ensure close() is idempotent (safe
to call multiple times) by applying the LogSerializer.hpp changes noted in the
review (close-on-destroy and guard against double-close), then retain the
explicit serializer.close() here (it becomes harmless) so all paths guarantee
closure.
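The close-on-destroy plus idempotent-close pattern requested above is language-agnostic; a minimal Python sketch (class and counter are hypothetical stand-ins for LogSerializer and the EOF write):

```python
class Serializer:
    """Sketch: close() is idempotent and also runs on destruction."""

    def __init__(self):
        self.closed = False
        self.eof_writes = 0  # stands in for writing the EOF marker

    def close(self):
        if self.closed:  # guard against double-close
            return
        self.eof_writes += 1
        self.closed = True

    def __del__(self):
        # Early returns in the caller still finalize the stream.
        self.close()

s = Serializer()
s.close()
s.close()  # explicit second call is a harmless no-op
print(s.eof_writes)  # → 1
```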

Comment on lines +3 to +18
#include <array>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>

#include <boost/uuid/random_generator.hpp>
#include <boost/uuid/uuid_io.hpp>
#include <msgpack.hpp>
#include <nlohmann/json.hpp>
#include <nlohmann/json_fwd.hpp>
#include <ystdlib/error_handling/Result.hpp>


⚠️ Potential issue | 🔴 Critical

Missing header for std::filesystem

This file uses std::filesystem::path but doesn't include <filesystem>. Add it to avoid transitive-include brittleness.

 #include <array>
 #include <cstdint>
 #include <exception>
+#include <filesystem>
 #include <optional>
 #include <string>
 #include <string_view>
📝 Committable suggestion


Suggested change

Before:

    #include <array>
    #include <cstdint>
    #include <exception>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <system_error>
    #include <utility>

    #include <boost/uuid/random_generator.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <msgpack.hpp>
    #include <nlohmann/json.hpp>
    #include <nlohmann/json_fwd.hpp>
    #include <ystdlib/error_handling/Result.hpp>

After:

    #include <array>
    #include <cstdint>
    #include <exception>
    #include <filesystem>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <system_error>
    #include <utility>

    #include <boost/uuid/random_generator.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <msgpack.hpp>
    #include <nlohmann/json.hpp>
    #include <nlohmann/json_fwd.hpp>
    #include <ystdlib/error_handling/Result.hpp>
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogSerializer.cpp around lines 3 to
18, the file uses std::filesystem::path but does not include <filesystem>; add
the header #include <filesystem> (preferably grouped with the other standard
headers at the top of the file) to ensure std::filesystem symbols are available
and avoid relying on transitive includes.

Comment on lines +41 to +46
clp_s::FileWriter writer;
try {
    writer.open(converted_path, clp_s::FileWriter::OpenMode::CreateForWriting);
} catch (std::exception const&) {
    return std::errc::no_such_file_or_directory;
}

⚠️ Potential issue | 🔴 Critical

Type mismatch: pass string to FileWriter::open

open expects std::string; converted_path is std::filesystem::path. Use string() and preserve error info.

-        writer.open(converted_path, clp_s::FileWriter::OpenMode::CreateForWriting);
+        writer.open(converted_path.string(), clp_s::FileWriter::OpenMode::CreateForWriting);
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogSerializer.cpp around lines 41 to
46, FileWriter::open expects a std::string but converted_path is a
std::filesystem::path; call open with converted_path.string() instead of
converted_path, and change the catch to catch (std::exception const& e) so you
can preserve and surface the error details (e.g. log or include e.what() before
returning std::errc::no_such_file_or_directory).

Comment on lines +75 to +83
 * Closes and flushes the serialized output.
 */
void close() {
    flush_buffer();
    m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof);
    m_writer.close();
}

private:

⚠️ Potential issue | 🟠 Major

Make close() idempotent and close on destruction to avoid incomplete outputs.

If callers return early, EOF isn’t written and the writer may remain open. Make close() idempotent and call it from the destructor.

-    // Destructor
-    ~LogSerializer() = default;
+    // Destructor
+    ~LogSerializer() noexcept {
+        if (!m_closed) {
+            try {
+                close();
+            } catch (...) {
+                // Swallow in destructor
+            }
+        }
+    }
@@
-    void close() {
-        flush_buffer();
-        m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof);
-        m_writer.close();
-    }
+    void close() {
+        if (m_closed) {
+            return;
+        }
+        flush_buffer();
+        m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof);
+        m_writer.close();
+        m_closed = true;
+    }
@@
-    clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t> m_serializer;
-    clp_s::FileWriter m_writer;
+    clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t> m_serializer;
+    clp_s::FileWriter m_writer;
+    bool m_closed{false};

Also applies to: 111-113

🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogSerializer.hpp around lines 75-83
(and similarly for 111-113), close() currently unconditionally writes EOF and
closes the writer which can be skipped on early returns and can be unsafe if
called multiple times; make close() idempotent by tracking a closed flag (e.g.,
m_closed) and return immediately if already closed, only write EOF/flush/close
once, and call close() from the destructor to ensure EOF is written and the
writer is closed when the object is destroyed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)

724-730: Protect secrets in .env and write atomically/minimally.

.env includes credentials (e.g., AWS secret). It’s written with default perms (often world‑readable). Use 0600 and avoid partial writes.

Apply:

-        # Write the environment variables to the `.env` file.
-        with open(f"{self._clp_home}/.env", "w") as env_file:
-            for key, value in env_vars.items():
-                if value is None:
-                    continue
-                env_file.write(f"{key}={value}\n")
+        # Write the environment variables to the `.env` file with restrictive perms.
+        env_path = self._clp_home / ".env"
+        fd = os.open(str(env_path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
+        with os.fdopen(fd, "w") as env_file:
+            for key, value in env_vars.items():
+                if value is None:
+                    continue
+                env_file.write(f"{key}={value}\n")
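The 0600 write can be exercised on its own; a small Python sketch of the suggested pattern (paths and variable names hypothetical):

```python
import os
import stat
import tempfile

def write_env_file(path, env_vars):
    """Create/truncate the file with 0600 perms before any secret is written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as env_file:
        for key, value in env_vars.items():
            if value is None:
                continue
            env_file.write(f"{key}={value}\n")

path = os.path.join(tempfile.mkdtemp(), ".env")
write_env_file(path, {"AWS_SECRET": "s3cr3t", "UNSET": None})
mode = stat.S_IMODE(os.stat(path).st_mode)
print(mode & 0o077)  # → 0 (no group/other permission bits)
```

Passing the mode to os.open at creation avoids the window where a default-perms file briefly exists before a later chmod.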
♻️ Duplicate comments (1)
.github/workflows/clp-uv-checks.yaml (1)

5-18: YAML alias syntax is not supported in GitHub Actions paths triggers (reiterating prior feedback).

Line 18 uses a YAML alias reference (paths: *monitored_paths), which GitHub Actions workflow syntax does not support. The paths section must be a direct sequence node, not an alias. This will cause the workflow to fail at parse time.

Replace the alias on line 18 with the explicit list of paths:

   push:
-    paths: *monitored_paths
+    paths:
+      - ".github/workflows/clp-uv-checks.yaml"
+      - "components/clp-mcp-server/pyproject.toml"
+      - "components/clp-mcp-server/uv.lock"
+      - "components/clp-package-utils/pyproject.toml"
+      - "components/clp-package-utils/uv.lock"
+      - "components/clp-py-utils/pyproject.toml"
+      - "components/clp-py-utils/uv.lock"
+      - "components/job-orchestration/pyproject.toml"
+      - "components/job-orchestration/uv.lock"
+      - "integration-tests/pyproject.toml"
+      - "integration-tests/uv.lock"

You may also consider removing the anchor definition at line 5 once the push section is expanded, unless it serves a purpose elsewhere in the workflow.

🧹 Nitpick comments (3)
components/clp-package-utils/clp_package_utils/controller.py (1)

751-756: Address Ruff S603/S607 subprocess warnings or suppress intentionally.

Calls are argument‑list (no shell) and preceded by dependency checks, so they’re safe. To quiet lint and document intent:

-        subprocess.run(
+        subprocess.run(  # noqa: S603
             cmd,
             cwd=self._clp_home,
             check=True,
         )
...
-        subprocess.run(
-            ["docker", "compose", "--project-name", self._project_name, "down"],
+        subprocess.run(  # noqa: S603,S607
+            ["docker", "compose", "--project-name", self._project_name, "down"],
             cwd=self._clp_home,
             check=True,
         )

Alternatively, keep as‑is and ignore the lints.

Also applies to: 783-787

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

69-71: Avoid bare except:; don’t swallow KeyboardInterrupt/SystemExit.

Use except Exception to keep termination signals intact.

-    except:
+    except Exception:
         logger.exception("Failed to load config.")
         return -1
...
-    except:
+    except Exception:
         logger.exception("Failed to create necessary directories.")
         return -1

Also applies to: 80-82
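The distinction matters because KeyboardInterrupt and SystemExit derive from BaseException, not Exception, so a bare except: swallows them while except Exception: lets them propagate. A quick check:

```python
def swallowed_by(exc, catch):
    """Return True if `except catch:` would swallow `exc`."""
    return isinstance(exc, catch)

print(swallowed_by(KeyboardInterrupt(), BaseException))  # bare except -> True
print(swallowed_by(KeyboardInterrupt(), Exception))      # except Exception -> False
print(swallowed_by(ValueError(), Exception))             # real errors still caught -> True
```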


20-20: Minor style nit: prefer logging.getLogger(__name__).

Current logger name is a file path. Using module name improves consistency.

-logger = logging.getLogger(__file__)
+logger = logging.getLogger(__name__)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb57c0e and 39417f0.

📒 Files selected for processing (12)
  • .github/workflows/clp-uv-checks.yaml (1 hunks)
  • components/clp-package-utils/clp_package_utils/controller.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (3 hunks)
  • docs/src/dev-docs/tooling-gh-workflows.md (1 hunks)
  • docs/src/user-docs/guides-using-presto.md (0 hunks)
  • docs/src/user-docs/quick-start/clp-json.md (1 hunks)
  • docs/src/user-docs/quick-start/clp-text.md (1 hunks)
  • taskfiles/deps/main.yaml (2 hunks)
  • tools/deployment/presto-clp/docker-compose.yaml (2 hunks)
  • tools/deployment/presto-clp/scripts/init.py (5 hunks)
  • tools/docker-images/clp-package/Dockerfile (2 hunks)
💤 Files with no reviewable changes (1)
  • docs/src/user-docs/guides-using-presto.md
✅ Files skipped from review due to trivial changes (1)
  • docs/src/user-docs/quick-start/clp-text.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/src/dev-docs/tooling-gh-workflows.md
🧰 Additional context used
🧬 Code graph analysis (2)
components/clp-package-utils/clp_package_utils/controller.py (2)
components/clp-package-utils/clp_package_utils/general.py (3)
  • check_docker_dependencies (171-194)
  • DockerComposeProjectNotRunningError (81-85)
  • DockerDependencyError (56-57)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • get_deployment_type (777-781)
  • DeploymentType (90-92)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-package-utils/clp_package_utils/controller.py (2)
  • set_up_env (90-94)
  • set_up_env (645-729)
🪛 actionlint (1.7.8)
.github/workflows/clp-uv-checks.yaml

18-18: "paths" section must be sequence node but got alias node with "" tag

(syntax-check)

🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/controller.py

751-751: subprocess call: check for execution of untrusted input

(S603)


783-783: subprocess call: check for execution of untrusted input

(S603)


784-784: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (22)
taskfiles/deps/main.yaml (3)

374-375: Verify log-surgeon tarball hash and confirm dependency update rationale.

The tarball URL has been updated to a newer commit (840f262), with an updated SHA256. Ensure:

  • The new tarball SHA256 is correct and verified from the source repository.
  • The commit change is intentional and includes desired fixes or improvements.
  • All downstream consumers are compatible with this version.

409-409: Consistent build type configuration.

Adding -DCMAKE_BUILD_TYPE=Release to mariadb-connector-cpp aligns with the pattern used elsewhere in the file (e.g., all other CMake-based dependencies specify Release builds). This improves reproducibility and consistency.


1-641: Mismatch between PR objectives and file scope.

The PR objectives describe refining "uv lock maintenance loops" using "YAML anchors" and "G_COMPONENTS_DIR", but this file contains only dependency installation tasks with version/build config updates. The uv-related changes may reside in the included lock.yaml file (line 4) or other files not provided for review. Ensure the PR description accurately reflects the changes in this file, or clarify the scope across multiple files.

.github/workflows/clp-uv-checks.yaml (2)

39-41: Comment spacing is correct.

This addresses the prior minor formatting concern—the checkout action comment now has proper 2-space formatting.


24-53: Workflow structure and task integration look good.

The job matrix strategy, concurrency settings, and step sequencing are well-designed. The invocation of task deps:lock:check-uv correctly delegates lock validation to the centralized task defined in taskfiles, supporting the PR's objective to manage UV project directories via task configuration.

components/clp-py-utils/clp_py_utils/clp_config.py (4)

35-35: LGTM - Presto coordinator component name constant.

The new constant follows the same naming pattern as other component names and is used appropriately in the Presto transformation method.


594-603: Verify: Missing default values for host and port fields.

The Presto class lacks default values for host and port fields, which is inconsistent with all other similar component classes (Database, QueryScheduler, Redis, Reducer, ResultsCache, Queue, WebUi, McpServer). Those classes all provide defaults: host = "localhost" and port = DEFAULT_PORT.

Without defaults, users must explicitly specify both host and port when configuring Presto. Is this intentional (to force explicit configuration for an external service), or should defaults be added for consistency?

If defaults should be added, apply this diff:

 class Presto(BaseModel):
     DEFAULT_PORT: ClassVar[int] = 8080

-    host: DomainStr
-    port: Port
+    host: DomainStr = "localhost"
+    port: Port = DEFAULT_PORT

605-609: LGTM - Environment variable validation.

The _get_env_var function correctly raises ValueError with a clear message when an environment variable is missing. This aligns with the documented behavior in the calling methods' docstrings.


825-826: LGTM - Conditional Presto transformation.

The conditional check appropriately transforms the Presto configuration when using the PRESTO query engine. The redundant self.presto is not None check is good defensive programming, even though the validator at lines 796-803 already enforces this constraint.

tools/docker-images/clp-package/Dockerfile (2)

21-22: Environment variables properly split for library path configuration.

Adding LD_LIBRARY_PATH to help the runtime locate the MariaDB C++ connector libraries is correct. The split across two ENV instructions maintains readability and follows standard Docker practices.


31-32: The glob pattern is correct for the DEB install layout.

When MariaDB Connector/C++ is built with -DINSTALL_LAYOUT=DEB, libraries are installed under lib/${CMAKE_LIBRARY_ARCHITECTURE} (e.g., lib/x86_64-linux-gnu/). The Dockerfile's glob pattern lib/*/libmariadbcpp.so* correctly matches these architecture-specific subdirectories created by the DEB layout.

The path structure is also correctly aligned: the taskfile defines INSTALL_PREFIX as build/deps/cpp/mariadb-connector-cpp-install, and the Dockerfile copies from ./build/deps/cpp/mariadb-connector-cpp-install/lib/*/libmariadbcpp.so*. The COPY instruction will successfully match and copy the libraries without silent failures.

Likely an incorrect or invalid review comment.

tools/deployment/presto-clp/scripts/init.py (4)

120-137: LGTM: Storage-specific environment variables.

The code correctly sets different environment variables (CLP_ARCHIVES_DIR for filesystem storage, CLP_STAGED_ARCHIVES_DIR for S3 storage) based on the storage type. This aligns with the docker-compose.yaml volume mount configuration.


169-173: LGTM: Instance ID-based network naming.

The network name format clp-package-{instance_id}_default correctly follows Docker Compose's default network naming convention. Error handling appropriately returns False when the instance ID cannot be determined.


285-322: LGTM: Robust helper function with comprehensive error handling.

The function has:

  • Proper type hints (Optional[str])
  • Descriptive error messages for each failure scenario
  • Appropriate exception handling (OSError covers file access issues)
  • Validation to ensure the instance ID is not empty
  • Explicit UTF-8 encoding

110-113: No issue found—code correctly matches CLP connector configuration expectations.

The CLP connector explicitly expects the metadata database URL and database name as separate configuration properties (clp.metadata-db-url and clp.metadata-db-name), not combined in a single JDBC URL. The code correctly implements this design by setting them as separate environment variables on lines 110-113, which is the expected pattern for the Presto CLP connector.

tools/deployment/presto-clp/docker-compose.yaml (4)

13-14: LGTM: Network configuration aligns with CLP package integration.

Both coordinator and worker services now use the clp-package external network, enabling communication with CLP package services. This aligns with the network name exported by init.py.

Also applies to: 42-43


46-47: LGTM: Well-documented dummy volume.

The explicit empty volume with a clear comment is a good practice for optional bind mounts. This prevents Docker from creating unnecessary anonymous volumes when environment variables aren't set.


36-41: Good pattern: Fallback to empty volume for optional mounts.

The use of ${VAR:-empty} provides a clean fallback when archive directories aren't configured. The read-only flag is appropriate for archive data. The hardcoded mount paths (/var/data/archives and /var/data/staged-archives) are consistent with the same paths used across the deployment system in docker-compose.base.yaml, confirming they match the expected configuration.


52-55: Network configuration is correct — CLP package implicitly creates the external network.

The configuration properly references an external network that is automatically created when the CLP package starts. The init.py script reads the instance-id file (created during CLP package initialization) and constructs the network name as clp-package-{instance_id}_default. Docker Compose then uses this project identifier to create the network, which Presto-CLP joins via the external: true declaration. The configuration correctly requires CLP package to be started first.

components/clp-package-utils/clp_package_utils/controller.py (1)

505-508: Good switch to container‑scoped Presto host/port.

Using container_clp_config for PrestoHost/PrestoPort aligns UI server settings with container networking.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

40-44: Nice addition: --setup-only short‑circuits after environment prep.

Flow is clear: validate → mkdirs → set_up_env → optional early exit. Matches docs.

Also applies to: 87-93

docs/src/user-docs/quick-start/clp-json.md (1)

20-23: Docs align with new --setup-only behavior.

Clear, concise tip; matches CLI.

Comment on lines +790 to +797
@staticmethod
def _get_num_workers() -> int:
    """
    :return: Number of worker processes to run.
    """
    # This will change when we move from single to multi-container workers. See y-scope/clp#1424
    return multiprocessing.cpu_count() // 2


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Clamp worker count to at least 1.

multiprocessing.cpu_count() // 2 returns 0 on single‑core hosts, yielding zero‑concurrency workers.

-        return multiprocessing.cpu_count() // 2
+        return max(1, multiprocessing.cpu_count() // 2)

Run to sanity‑check for any consumers expecting non‑zero:


🏁 Script executed:

#!/bin/bash
# Find usages of worker concurrency env vars
rg -n --no-heading -C2 'CLP_(COMPRESSION|QUERY|REDUCER)_WORKER_CONCURRENCY'

Length of output: 1827


I'll verify the scope of this issue by checking all usages of _get_num_workers() and examining how the concurrency parameter is consumed.


---

🏁 Script executed:

#!/bin/bash
# Find all calls to _get_num_workers()
rg -n -C3 '_get_num_workers'

Length of output: 1405



---

🏁 Script executed:

#!/bin/bash
# Examine the context around line 648 where num_workers is used
sed -n '645,385p' components/clp-package-utils/clp_package_utils/controller.py | head -50

Length of output: 33


🏁 Script executed:

#!/bin/bash
# Also check lines around 352 and 379 to see full context of env var usage
sed -n '348,385p' components/clp-package-utils/clp_package_utils/controller.py

Length of output: 1132



---

🏁 Script executed:

#!/bin/bash
# Find worker implementation files (search for job_orchestration.executor patterns)
find . -name "*.py" -type f | xargs grep -l "job_orchestration.executor" | head -20

Length of output: 795



---

🏁 Script executed:

#!/bin/bash
# Search for how --concurrency argument is used/parsed
rg -n "concurrency" components/job-orchestration/ --type py -C2

Length of output: 1739


🏁 Script executed:

#!/bin/bash
# Look for celery worker or argparse setup related to concurrency
rg -n "argparse\|--concurrency\|add_argument.*concurrency" components/job-orchestration/ --type py

Length of output: 0



---

🏁 Script executed:

#!/bin/bash
# Search for concurrency handling in compress worker
rg -n "concurrency" components/job-orchestration/job_orchestration/executor/compress/ --type py -C2

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for concurrency handling in query worker
rg -n "concurrency" components/job-orchestration/job_orchestration/executor/query/ --type py -C2

Length of output: 0



---

🏁 Script executed:

#!/bin/bash
# Find main entry points for workers (look for __main__ or celery worker setup)
rg -n "if __name__" components/job-orchestration/job_orchestration/executor/ --type py -A5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for celery app setup and worker configuration
rg -n "app.worker_main\|Celery\|celery_app" components/job-orchestration/job_orchestration/executor/ --type py -C2

Length of output: 0



---

🏁 Script executed:

#!/bin/bash
# Search for test files related to concurrency or worker setup
find . -path "*/test*" -name "*.py" | xargs grep -l "concurrency\|cpu_count" 2>/dev/null | head -10

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check for any comments or issues related to single-core or zero concurrency
rg -n "single.core\|concurrency.*0\|cpu_count" . --type py | grep -i "single\|core\|zero\|issue\|todo\|fixme"

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check what multiprocessing.cpu_count() actually returns for verification
cat > /tmp/test_cpu.py << 'EOF'
import multiprocessing
print(f"CPU count: {multiprocessing.cpu_count()}")
print(f"CPU count // 2: {multiprocessing.cpu_count() // 2}")
print(f"Single-core case (1 // 2): {1 // 2}")
print(f"With max(): {max(1, 1 // 2)}")
EOF
python3 /tmp/test_cpu.py

Length of output: 73


Clamp worker count to at least 1.

On single-core systems, multiprocessing.cpu_count() // 2 returns 0, causing zero-concurrency workers. This fix is necessary and consistent with the pattern already established in the reducer component (which applies max(int(parsed_args.concurrency), 1) at line 68).

-        return multiprocessing.cpu_count() // 2
+        return max(1, multiprocessing.cpu_count() // 2)
📝 Committable suggestion


Suggested change

Before:

    @staticmethod
    def _get_num_workers() -> int:
        """
        :return: Number of worker processes to run.
        """
        # This will change when we move from single to multi-container workers. See y-scope/clp#1424
        return multiprocessing.cpu_count() // 2

After:

    @staticmethod
    def _get_num_workers() -> int:
        """
        :return: Number of worker processes to run.
        """
        # This will change when we move from single to multi-container workers. See y-scope/clp#1424
        return max(1, multiprocessing.cpu_count() // 2)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/controller.py around lines 790
to 797, the worker count calculation can return 0 on single-core systems; change
the return to clamp the value to at least 1 (e.g., return
max(multiprocessing.cpu_count() // 2, 1)) so we never spawn zero workers.
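The clamped computation is easy to spot-check across core counts; a minimal sketch (function name hypothetical):

```python
import multiprocessing

def get_num_workers(cpu_count=None):
    """Half the cores, but never fewer than one worker."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return max(1, cpu_count // 2)

for cores in (1, 2, 3, 8):
    print(cores, "->", get_num_workers(cores))
# → 1 -> 1 (unclamped // 2 would give 0), 2 -> 1, 3 -> 1, 8 -> 4
```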

Comment on lines +10 to +13
# Database endpoint inside the CLP Package Docker network. Must match the constants defined in
# `components/clp-py-utils/clp_py_utils/clp_config.py`.
DATABASE_COMPONENT_NAME = "database"
DATABASE_DEFAULT_PORT = 3306

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify constant consistency with clp_py_utils/clp_config.py.

The comment states these constants must match those defined in components/clp-py-utils/clp_py_utils/clp_config.py. Ensure both files use the same values.


🏁 Script executed:

#!/bin/bash
# Verify that DATABASE_COMPONENT_NAME and DATABASE_DEFAULT_PORT match the constants in clp_py_utils/clp_config.py

if [[ -f "components/clp-py-utils/clp_py_utils/clp_config.py" ]]; then
    echo "=== Searching for DATABASE_COMPONENT_NAME and DATABASE_DEFAULT_PORT in clp_config.py ==="
    rg -n -C3 'DATABASE_COMPONENT_NAME|DATABASE_DEFAULT_PORT' components/clp-py-utils/clp_py_utils/clp_config.py
else
    echo "File components/clp-py-utils/clp_py_utils/clp_config.py not found"
fi

Length of output: 88


🏁 Script executed:

#!/bin/bash
# Step 1: Check if clp_config.py exists and view its contents
echo "=== Checking if clp_config.py exists ==="
if [[ -f "components/clp-py-utils/clp_py_utils/clp_config.py" ]]; then
    echo "File exists. Viewing contents..."
    cat components/clp-py-utils/clp_py_utils/clp_config.py
else
    echo "File not found at expected path. Searching for clp_config.py..."
    find . -name "clp_config.py" -type f 2>/dev/null
fi

# Step 2: View init.py to confirm current constants
echo -e "\n=== Viewing init.py ==="
if [[ -f "tools/deployment/presto-clp/scripts/init.py" ]]; then
    head -20 tools/deployment/presto-clp/scripts/init.py
fi

# Step 3: Search broadly for DATABASE constants in the codebase
echo -e "\n=== Searching for DATABASE constants in codebase ==="
rg -n 'DATABASE.*=' --type py | grep -E '(COMPONENT_NAME|DEFAULT_PORT|database)' | head -20

Length of output: 31740


Rename constants to match clp_config.py.

Constants defined in init.py don't match their equivalents in clp_config.py. Rename:

  • DATABASE_COMPONENT_NAMEDB_COMPONENT_NAME (matches clp_config.py line 25)
  • DATABASE_DEFAULT_PORTDB_DEFAULT_PORT or reference Database.DEFAULT_PORT from the config module (matches clp_config.py line 104)

The values are correct (3306 and "database"), but the constant names differ from those defined in the referenced file.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 10 to 13, the
constant names do not match the equivalents in
components/clp-py-utils/clp_py_utils/clp_config.py; rename
DATABASE_COMPONENT_NAME to DB_COMPONENT_NAME and DATABASE_DEFAULT_PORT to
DB_DEFAULT_PORT (or alternatively import and reference Database.DEFAULT_PORT
from the clp_config module) so the names align with clp_config.py while keeping
the same values ("database" and 3306).
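The underlying point is a single source of truth for these names; a hypothetical sketch (the stand-in definitions mirror clp_config.py, and the env-var names are illustrative):

```python
# Stand-ins for the definitions in clp_py_utils/clp_config.py.
DB_COMPONENT_NAME = "database"

class Database:
    DEFAULT_PORT = 3306

# init.py would import these instead of redefining them, e.g.:
#   from clp_py_utils.clp_config import DB_COMPONENT_NAME, Database
# The connector expects host and port as separate properties, so keep them split:
env = {
    "CLP_DB_HOST": DB_COMPONENT_NAME,
    "CLP_DB_PORT": str(Database.DEFAULT_PORT),
}
print(env)  # → {'CLP_DB_HOST': 'database', 'CLP_DB_PORT': '3306'}
```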

5 participants