Refine uv lock maintenance loops #7
Conversation
Walkthrough
Adds UV lock Taskfile tasks and a GitHub Actions workflow; provides FS and S3 compression CLIs and a shell launcher; introduces a new C++ log-converter executable (parsing, buffered conversion, KV-IR serialization) with build integration; and applies packaging, deployment, config, controller-lifecycle, and test updates across multiple components.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant GH as GitHub Actions
    participant Runner as CI runner
    participant Repo as Repository
    participant TaskCLI as task CLI
    participant Task as Taskfile `check-uv`
    participant UV as internal `uv-lock`
    Note over GH: Trigger (PR / push / cron / dispatch)
    GH->>Runner: start job
    Runner->>Repo: checkout (with submodules)
    Runner->>Runner: install task CLI & uv tool
    Runner->>TaskCLI: run `task deps:lock:check-uv`
    TaskCLI->>Task: invoke `check-uv`
    loop per project
        Task->>Repo: set `PROJECT_DIR`
        Task->>UV: run `uv lock --check` in `PROJECT_DIR`
        UV-->>Task: return status
    end
    Task-->>TaskCLI: aggregate results
    TaskCLI-->>Runner: exit code
    Runner-->>GH: report status
```

```mermaid
sequenceDiagram
    autonumber
    participant User as Operator
    participant CLI as compress_from_s3
    participant Config as CLP config / DB
    participant S3 as S3 service
    participant FS as Host filesystem (temp files)
    participant Container as Compression container
    User->>CLI: run with S3 subcommand + args
    CLI->>Config: load CLP config & validate storage type
    CLI->>S3: parse & validate URLs / prefixes (bucket/region)
    CLI->>FS: write URL list file
    CLI->>Container: start container with mounts and config
    Container->>Container: run native compressor `--input-type s3` with URL list
    Container-->>CLI: exit code
    CLI->>FS: cleanup URL list & temp config on success
    CLI-->>User: return status
```

```mermaid
sequenceDiagram
    autonumber
    participant User as Operator
    participant LC as log-converter (main)
    participant CLA as CommandLineArguments
    participant Conv as LogConverter
    participant Serial as LogSerializer
    participant FS as filesystem/Reader
    User->>LC: invoke with args
    LC->>CLA: parse arguments (files, auth, output-dir)
    CLA-->>LC: parsed input paths
    LC->>FS: open each input (file or network reader)
    loop per input
        LC->>Conv: convert_file(path, reader, output-dir)
        Conv->>Serial: create serializer for output file
        Conv->>Conv: refill/parse buffer, produce messages
        Conv->>Serial: add_message(...), flush as needed
        Serial-->>Conv: success/failure
    end
    LC-->>User: exit code
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120+ minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/clp-uv-checks.yaml
18-18: "paths" section must be sequence node but got alias node with "" tag (syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
🔇 Additional comments (1)
Actionable comments posted: 1
🧹 Nitpick comments (2)
taskfiles/deps/lock.yaml (2)
13-26: Add documentation comments for public tasks.
Unlike the `check-rust` and `update-rust` tasks below (with JSDoc-style parameter docs), the new `check-uv` and `update-uv` tasks lack documentation. Consider adding comments explaining each task's purpose, parameters, and any important behavior.
Also applies to: 34-41
13-26: Consider an explicit toolchain dependency.
The cargo tasks declare `deps: ["toolchains:rust"]` to ensure the Rust toolchain is available. For consistency and clarity, verify whether `check-uv` and `update-uv` should similarly declare a dependency on a UV toolchain (e.g., `deps: ["toolchains:uv"]`).
Also applies to: 34-41
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
taskfiles/deps/lock.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
…ured text logs into KV-IR; Update log-surgeon to yscope/log-surgeon@840f262. (y-scope#1460) Co-authored-by: Bingran Hu <[email protected]> Co-authored-by: kirkrodrigues <[email protected]> Co-authored-by: LinZhihao-723 <[email protected]>
… returned from KQL queries. (y-scope#1491) Co-authored-by: Lin Zhihao <[email protected]>
…rom VCS tracking (fixes y-scope#1422). (y-scope#1499) Co-authored-by: kirkrodrigues <[email protected]>
…er Compose project to match the CLP package Compose project (fixes y-scope#1496). (y-scope#1500)
…or S3-object and S3-key-prefix compression; Restrict `compress.py` to accept local file paths only. (y-scope#1476) Co-authored-by: LinZhihao-723 <[email protected]>
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/clp-uv-checks.yaml (1 hunks)
docs/src/dev-docs/tooling-gh-workflows.md (1 hunks)
taskfiles/deps/lock.yaml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- taskfiles/deps/lock.yaml
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/clp-uv-checks.yaml
18-18: "paths" section must be sequence node but got alias node with "" tag
(syntax-check)
🪛 GitHub Actions: clp-lint
.github/workflows/clp-uv-checks.yaml
[error] 1-1: wrong new line character: expected \n
🪛 GitHub Check: lint-check (ubuntu-24.04)
.github/workflows/clp-uv-checks.yaml
[warning] 39-39:
39:9 [comments] too few spaces before comment: expected 2
[failure] 1-1:
1:22 [new-lines] wrong new line character: expected \n
🪛 YAMLlint (1.37.1)
.github/workflows/clp-uv-checks.yaml
[error] 1-1: wrong new line character: expected \n
(new-lines)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (2)
docs/src/dev-docs/tooling-gh-workflows.md (1)
94-98: LGTM! The documentation addition is clear, appropriately positioned, and accurately describes the new workflow's purpose.
.github/workflows/clp-uv-checks.yaml (1)
1-1: File line endings are already correct (LF).
Verification confirms that .github/workflows/clp-uv-checks.yaml currently uses Unix-style line endings (LF, `\n`) exclusively. No CRLF characters detected. The issue has been resolved.
.github/workflows/clp-uv-checks.yaml (Outdated)
```yaml
    paths: &monitored_paths
      - ".github/workflows/clp-uv-checks.yaml"
      - "components/clp-mcp-server/pyproject.toml"
      - "components/clp-mcp-server/uv.lock"
      - "components/clp-package-utils/pyproject.toml"
      - "components/clp-package-utils/uv.lock"
      - "components/clp-py-utils/pyproject.toml"
      - "components/clp-py-utils/uv.lock"
      - "components/job-orchestration/pyproject.toml"
      - "components/job-orchestration/uv.lock"
      - "integration-tests/pyproject.toml"
      - "integration-tests/uv.lock"
  push:
    paths: *monitored_paths
```
YAML alias syntax is not supported in GitHub Actions paths triggers.
Actionlint reports that using an alias for the paths key (line 18: paths: *monitored_paths) is invalid. GitHub Actions workflow YAML does not support this syntax—the paths section must be a direct sequence node, not an alias reference.
This will likely cause the workflow to fail parsing.
Replace the alias reference on line 18 with the full list of paths:
```diff
-  push:
-    paths: *monitored_paths
+  push:
+    paths:
+      - ".github/workflows/clp-uv-checks.yaml"
+      - "components/clp-mcp-server/pyproject.toml"
+      - "components/clp-mcp-server/uv.lock"
+      - "components/clp-package-utils/pyproject.toml"
+      - "components/clp-package-utils/uv.lock"
+      - "components/clp-py-utils/pyproject.toml"
+      - "components/clp-py-utils/uv.lock"
+      - "components/job-orchestration/pyproject.toml"
+      - "components/job-orchestration/uv.lock"
+      - "integration-tests/pyproject.toml"
+      - "integration-tests/uv.lock"
```
Alternatively, if duplicating is not preferred, define the workflow in a separate script or use a workflow generation tool, but for inline YAML, direct sequence nodes are required.
🧰 Tools
🪛 actionlint (1.7.8)
18-18: "paths" section must be sequence node but got alias node with "" tag
(syntax-check)
🤖 Prompt for AI Agents
.github/workflows/clp-uv-checks.yaml around lines 5 to 18: the workflow uses a
YAML alias (&monitored_paths) and then references it with an alias at line 18
(paths: *monitored_paths), which GitHub Actions does not support; replace the
alias reference with the full explicit sequence of paths under the push.paths
key (i.e., copy the list from the monitored_paths definition into the push.paths
section) so push.paths is a direct sequence node; if duplication is
unacceptable, move path generation out of inline YAML into a script or workflow
generator, but for this file simply expand the alias inline.
Actionable comments posted: 10
🧹 Nitpick comments (17)
components/clp-mcp-server/clp_mcp_server/server/utils.py (1)
86-89: Docstring says timestamp required, but code treats it as optional.
The function emits "N/A" when timestamp is missing/invalid. Update docs to reflect required vs optional keys and the output format.
- - "timestamp": An integer representing the epoch timestamp in milliseconds. - - "message": A string representing the log message. - - "link": A string representing the link to open the log viewer displaying the message. + - "message": A string representing the log message. Required. + - "link": A string representing the link to open the log viewer displaying the message. Required. + - "timestamp": An integer in epoch milliseconds. Optional; if missing/invalid, "N/A" is used. - The message will be formatted as `timestamp: <date string>, message: <message>, link: <link>`. + The message will be formatted as: `timestamp: <date string>, message: <message>, link: <link>`.components/clp-mcp-server/tests/server/test_utils.py (3)
16-17: Annotate class-level mutables with ClassVar (RUF012) or move them to module scope.
Ruff flags these attributes; they're intended as class constants/fixtures. Annotate with typing.ClassVar (and optionally LINK) or lift them to module-level constants.
```diff
+from typing import ClassVar
@@
 class TestUtils:
@@
-    LINK = "http://localhost:4000/"
+    LINK: ClassVar[str] = "http://localhost:4000/"
@@
-    INVALID_TYPE_ENTRIES = [
+    INVALID_TYPE_ENTRIES: ClassVar = [
@@
-    EXPECTED_INVALID_TYPE = [
+    EXPECTED_INVALID_TYPE: ClassVar = [
@@
-    INVALID_VALUE_ENTRIES = [
+    INVALID_VALUE_ENTRIES: ClassVar = [
@@
-    EXPECTED_INVALID_VALUE = [
+    EXPECTED_INVALID_VALUE: ClassVar = [
@@
-    MISSING_TIMESTAMP_AND_MESSAGE_ENTRY = [
+    MISSING_TIMESTAMP_AND_MESSAGE_ENTRY: ClassVar = [
@@
-    EXPECTED_MISSING_TIMESTAMP_AND_MESSAGE = [
+    EXPECTED_MISSING_TIMESTAMP_AND_MESSAGE: ClassVar = [
@@
-    RAW_LOG_ENTRIES = [
+    RAW_LOG_ENTRIES: ClassVar = [
@@
-    EXPECTED_RESULTS = [
+    EXPECTED_RESULTS: ClassVar = [
```
Alternatively, define these at module scope (outside the class) to avoid ClassVar annotations entirely. Based on static analysis hints.
Also applies to: 42-46, 49-60, 61-70, 73-88, 93-143, 145-172
200-206: Remove print() inside raises blocks.
Printing inside pytest.raises adds noise and isn't needed.
```diff
-        with pytest.raises(ValueError) as exc_info:
-            print(convert_date_string_to_epoch("2024.10.18T16:00:00.123"))
+        with pytest.raises(ValueError) as exc_info:
+            convert_date_string_to_epoch("2024.10.18T16:00:00.123")
@@
-        with pytest.raises(ValueError) as exc_info:
-            print(convert_date_string_to_epoch("2024-10-18T16-00-00-123"))
+        with pytest.raises(ValueError) as exc_info:
+            convert_date_string_to_epoch("2024-10-18T16-00-00-123")
```
24-41: Add a negative test for missing/invalid "link" to lock in behavior.
Given link is now required in formatting, add a case with no link and with a non-str link to assert they're skipped with a warning (after guarding in utils).
I can draft the test and a caplog assertion if helpful.
tools/deployment/presto-clp/docker-compose.yaml (1)
46-48: Consider validating at least one storage path is configured.
The `empty` dummy volume is a clever pattern to allow optional mounts, but it means the container will start successfully even if neither `CLP_ARCHIVES_DIR` nor `CLP_STAGED_ARCHIVES_DIR` is set. While `init.py` ensures one is set based on storage type, consider adding a runtime validation in the worker startup script to fail fast if both mounts are empty.
Would you like me to generate a validation script that checks for at least one configured storage path during worker initialization?
taskfile.yaml (1)
112-119: DRY the repeated core binary list via a YAML anchor.
The paths/targets for core binaries are duplicated across sources, rsync, generates, and TARGETS. Suggest centralizing with an anchor to prevent drift.
Example approach:
- Define an anchor with core binary paths once, e.g., under vars or at task scope:
```yaml
core_bin_paths: &core_bin_paths
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clg"
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clo"
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clp"
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/clp-s"
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/indexer"
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter"
  - "{{.G_CORE_COMPONENT_BUILD_DIR}}/reducer-server"
```
- Reuse with `*core_bin_paths` in `package.sources` and use `join` for rsync/TARGETS.
Also applies to: 157-167, 218-225, 230-230
components/core/src/clp_s/log_converter/CMakeLists.txt (1)
11-37: Add an install rule to simplify packaging.
Optional: provide an install target so downstream packaging can use cmake --install instead of manual rsync.
Example:
```cmake
install(TARGETS log-converter RUNTIME DESTINATION bin)
```
components/core/src/clp_s/log_converter/LogSerializer.cpp (1)
42-46: Map exceptions to accurate error codes (optional).
Returning no_such_file_or_directory for any open failure loses fidelity (e.g., EACCES). Consider catching FileWriter::OperationFailed and mapping to the underlying errno/category, or return a generic io_error when errno is unknown.
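As a rough illustration of that errno-based mapping, here is a standalone sketch. `map_open_errno` is a hypothetical helper, not part of clp_s, and FileWriter's real exception type may expose the errno differently:

```cpp
#include <cerrno>
#include <cstdio>
#include <system_error>

// Hypothetical helper: translate the errno captured when an open fails into a
// std::errc, falling back to io_error for unrecognized values.
auto map_open_errno(int captured_errno) -> std::errc {
    switch (captured_errno) {
        case EACCES:
            return std::errc::permission_denied;
        case ENOENT:
            return std::errc::no_such_file_or_directory;
        case EISDIR:
            return std::errc::is_a_directory;
        default:
            return std::errc::io_error;
    }
}

int main() {
    errno = 0;
    if (nullptr == std::fopen("/nonexistent/dir/file.log", "wb")) {
        // Capture errno immediately, before any other call can clobber it.
        return static_cast<int>(map_open_errno(errno));
    }
    return 0;
}
```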
components/core/src/clp_s/log_converter/log_converter.cpp (1)
61-71: Use create_directories for nested output paths.
create_directory fails if parent dirs are missing; create_directories handles nested creation and returns success if the directory already exists.
```diff
-    if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec)
+    if (false == std::filesystem::create_directories(command_line_arguments.get_output_dir(), ec)
         && ec)
```
components/core/src/clp_s/log_converter/LogConverter.cpp (2)
75-86: Make timestamp stripping robust; don’t assume it’s a strict prefix.
`substr(timestamp.length())` assumes the event string starts with the timestamp and drops exactly that many bytes, which can mis-slice when delimiters or non-prefix timestamps occur. Guard with a prefix check and trim a single spacer if present.
```diff
-            auto const& event{parser.get_log_parser().get_log_event_view()};
-            auto const message{event.to_string()};
-            if (nullptr != event.get_timestamp()) {
-                auto const timestamp{event.get_timestamp()->to_string_view()};
-                auto const message_without_timestamp{
-                        std::string_view{message}.substr(timestamp.length())
-                };
-                YSTDLIB_ERROR_HANDLING_TRYV(
-                        serializer.add_message(timestamp, message_without_timestamp)
-                );
-            } else {
-                YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message));
-            }
+            auto const& event{parser.get_log_parser().get_log_event_view()};
+            auto const message{event.to_string()};
+            if (nullptr != event.get_timestamp()) {
+                auto const timestamp{event.get_timestamp()->to_string_view()};
+                std::string_view msg_view{message};
+                if (msg_view.size() >= timestamp.size()
+                    && 0 == msg_view.compare(0, timestamp.size(), timestamp)) {
+                    msg_view.remove_prefix(timestamp.size());
+                    if (!msg_view.empty() && (msg_view.front() == ' ' || msg_view.front() == '\t')) {
+                        msg_view.remove_prefix(1);
+                    }
+                }
+                YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(timestamp, msg_view));
+            } else {
+                YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message));
+            }
```
Confirm
`LogSerializer::add_message` eagerly copies data into its IR buffer. If it defers, views into `m_buffer` (the timestamp) and into the local `message` would be unsafe once `compact_buffer()` runs or `message` goes out of scope.
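To see why eager copying matters, here is a contrived standalone sketch (not clp code): a `std::string_view` into a reusable buffer dangles once that buffer is mutated, so a serializer that defers writes must copy the bytes first.

```cpp
#include <iostream>
#include <string>
#include <string_view>

int main() {
    std::string buffer{"2024-01-01 message body"};
    std::string_view view{buffer};  // borrows buffer's storage

    std::string eager_copy{view};  // safe: bytes copied while the view is still valid

    buffer.assign(4096, 'x');  // buffer reuse/reallocation invalidates `view`
    // Using `view` past this point would be undefined behavior; `eager_copy` stays valid.
    std::cout << eager_copy << '\n';
    return 0;
}
```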
110-112: Return an I/O error for reader failures, not "not_enough_memory".
Mapping reader failures to
`std::errc::not_enough_memory` is misleading and hurts debuggability. Prefer `std::errc::io_error` (or propagate a more specific code).
```diff
-    if (clp::ErrorCode_Success != rc) {
-        return std::errc::not_enough_memory;
-    }
+    if (clp::ErrorCode_Success != rc) {
+        return std::errc::io_error;
+    }
```
49-52: Align documented error with implementation (use I/O error).
If you adopt the
`io_error` change in `refill_buffer`, update this comment accordingly.
```diff
- * - std::errc::not_enough_memory if `clp::ReaderInterface::try_read()` returns an error.
+ * - std::errc::io_error if `clp::ReaderInterface::try_read()` returns an error.
```
10-12: Remove unused include to keep the header minimal.
`../InputConfig.hpp` isn't referenced in this header.
```diff
-#include "../InputConfig.hpp"
```
84-90: Make auth parsing case-insensitive (optional).Users may pass
S3/None. Lowercase the input before comparison for better UX.Example:
std::string s(auth_method); std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c){ return std::tolower(c); }); if (s == cS3Auth) { ... } else if (s != cNoAuth) { throw ...; }
127-140: Optionally validate output directory early.Fail fast if
--output-diris empty or non-existent, or create it if policy allows.components/package-template/src/sbin/compress-from-s3.sh (1)
1-9: Improve portability and robustness (readlink -f, PYTHONPATH, strict mode)
- readlink -f is not portable on macOS; add a Python fallback.
- Preserve existing PYTHONPATH instead of replacing it.
- Enable strict bash options.
Apply this diff:
```diff
 #!/usr/bin/env bash
+set -euo pipefail
+# IFS=$'\n\t'  # optionally tighten word splitting
 
 script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
 package_root="$script_dir/.."
-PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
-    python3 \
-    -m clp_package_utils.scripts.compress_from_s3 \
-    "$@"
+# Resolve site-packages path portably
+site_pkgs="$package_root/lib/python3/site-packages"
+if command -v readlink >/dev/null 2>&1 && readlink -f / >/dev/null 2>&1; then
+    site_pkgs_resolved="$(readlink -f "$site_pkgs")"
+else
+    site_pkgs_resolved="$(python3 - <<'PY' "$site_pkgs"
+import os, sys
+print(os.path.realpath(sys.argv[1]))
+PY
+)"
+fi
+
+PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site_pkgs_resolved" \
+    /usr/bin/env python3 -m clp_package_utils.scripts.compress_from_s3 "$@"
```
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)
222-278: Make S3 keys ordering deterministic and use path-aware common prefix
- Using a set loses order; job specs can vary across runs. Prefer a stable list.
- os.path.commonprefix is char-based; use a path-aware function to avoid partial component prefixes.
Apply this diff:
```diff
-import os
+import os
+import posixpath
@@
-    keys = set()
+    keys_set: set[str] = set()
+    key_list: list[str] = []
@@
-            if key in keys:
+            if key in keys_set:
                 raise ValueError(f"Duplicate S3 key found: {key}.")
-            keys.add(key)
+            keys_set.add(key)
+            key_list.append(key)
@@
-    key_list: list[str] = list(keys)
-    key_prefix = os.path.commonprefix(key_list)
+    # Use POSIX path semantics for S3 keys and derive a component-safe common prefix
+    try:
+        key_prefix = posixpath.commonpath(key_list)
+    except ValueError:
+        key_prefix = posixpath.commonprefix(key_list)
@@
-    return region_code, bucket_name, key_prefix, key_list
+    return region_code, bucket_name, key_prefix, key_list
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- components/clp-mcp-server/clp_mcp_server/server/utils.py (2 hunks)
- components/clp-mcp-server/tests/server/test_utils.py (6 hunks)
- components/clp-package-utils/clp_package_utils/general.py (1 hunks)
- components/clp-package-utils/clp_package_utils/scripts/compress.py (9 hunks)
- components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1 hunks)
- components/clp-package-utils/clp_package_utils/scripts/native/compress.py (6 hunks)
- components/core/src/clp_s/CMakeLists.txt (1 hunks)
- components/core/src/clp_s/FileWriter.hpp (2 hunks)
- components/core/src/clp_s/log_converter/CMakeLists.txt (1 hunks)
- components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1 hunks)
- components/core/src/clp_s/log_converter/CommandLineArguments.hpp (1 hunks)
- components/core/src/clp_s/log_converter/LogConverter.cpp (1 hunks)
- components/core/src/clp_s/log_converter/LogConverter.hpp (1 hunks)
- components/core/src/clp_s/log_converter/LogSerializer.cpp (1 hunks)
- components/core/src/clp_s/log_converter/LogSerializer.hpp (1 hunks)
- components/core/src/clp_s/log_converter/log_converter.cpp (1 hunks)
- components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
- components/package-template/src/sbin/compress-from-s3.sh (1 hunks)
- docs/src/dev-docs/components-core/index.md (1 hunks)
- integration-tests/tests/utils/config.py (1 hunks)
- taskfile.yaml (3 hunks)
- taskfiles/deps/main.yaml (1 hunks)
- tools/deployment/presto-clp/.gitignore (1 hunks)
- tools/deployment/presto-clp/docker-compose.yaml (1 hunks)
- tools/deployment/presto-clp/scripts/init.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- tools/deployment/presto-clp/.gitignore
- components/core/src/clp_s/CMakeLists.txt
- docs/src/dev-docs/components-core/index.md
- taskfiles/deps/main.yaml
🧰 Additional context used
🧬 Code graph analysis (10)
components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
components/core/src/clp_s/FileWriter.hpp (2)
writer(42-49)writer(42-42)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (5)
nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)
components/core/src/clp_s/log_converter/LogConverter.hpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
nodiscard(39-39)nodiscard(43-43)output_dir(33-34)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
m_network_auth(51-51)
components/core/src/clp_s/log_converter/LogSerializer.cpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (4)
output_dir(33-34)timestamp(61-62)message(71-72)flush_buffer(77-81)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
AwsAuthentication(385-415)CLPConfig(605-817)StorageType(119-121)components/job-orchestration/job_orchestration/scheduler/job_config.py (2)
S3InputConfig(31-42)FsInputConfig(23-28)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
parse_s3_url(208-240)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
StorageEngine(94-96)StorageType(119-121)get_clp_connection_params_and_type(200-221)components/clp-package-utils/clp_package_utils/general.py (10)
dump_container_config(344-361)generate_container_config(231-312)generate_container_name(163-168)generate_container_start_cmd(375-416)get_clp_home(144-160)get_container_config_filename(340-341)JobType(101-105)load_config_file(427-449)validate_and_load_db_credentials_file(480-484)validate_dataset_name(634-660)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
components/core/src/clp_s/FileWriter.hpp (1)
path(127-127)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(33-34)message(71-72)timestamp(61-62)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(28-49)create(28-29)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (7)
CommandLineArguments(22-22)CommandLineArguments(22-22)nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)components/core/src/clp_s/log_converter/LogConverter.hpp (3)
LogConverter(20-20)path(35-36)reader(53-54)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageType(119-121)
🪛 Ruff (0.14.1)
components/clp-mcp-server/tests/server/test_utils.py
42-46: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
49-60: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
61-70: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
73-88: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
162-162: Avoid specifying long messages outside the exception class
(TRY003)
166-166: Avoid specifying long messages outside the exception class
(TRY003)
186-188: Avoid specifying long messages outside the exception class
(TRY003)
200-200: Avoid specifying long messages outside the exception class
(TRY003)
242-242: Avoid specifying long messages outside the exception class
(TRY003)
254-257: Avoid specifying long messages outside the exception class
(TRY003)
262-265: Avoid specifying long messages outside the exception class
(TRY003)
268-268: Avoid specifying long messages outside the exception class
(TRY003)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
291-291: Avoid specifying long messages outside the exception class
(TRY003)
367-367: f-string without any placeholders
Remove extraneous f prefix
(F541)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
252-252: Do not catch blind exception: Exception
(BLE001)
253-253: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
304-304: subprocess call: check for execution of untrusted input
(S603)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (14)
integration-tests/tests/utils/config.py (1)
30-38: Addition of "log-converter" to required binaries is verified and correct.
The log-converter binary is properly defined as a CMake build target and is included in the standard build pipeline (taskfile.yaml). It's built with the same configuration as the other required binaries and output to the same directory. The change maintains alphabetical ordering and aligns with the actual build infrastructure.
components/core/tests/test-ParserWithUserSchema.cpp (2)
1-204: PR objectives don't match the file changes.
The PR objectives describe refining UV lock maintenance loops with YAML anchors and G_COMPONENTS_DIR, but this file is a C++ test for schema parser error messages. There appears to be a mismatch between the stated PR goals and the actual changes under review.
120-128: Manually verify that the test passes with the updated error message expectation.
The test assertion change from "AlphaNumeric" to "IdentifierCharacters" appears to be a systematic update reflecting a change in the external
`log_surgeon` library's token naming. The schema file exists and the error location (line 3, column 4) is correct. However, since the error message is generated by the external `log_surgeon::SchemaParser` library and cannot be verified in the sandbox without compilation, you should:
- Run the test to confirm it passes:
`ctest --output-on-failure -R "Test error for colon missing schema file"`
- Verify the log_surgeon library version supports the "IdentifierCharacters" token name
components/clp-mcp-server/clp_mcp_server/server/utils.py (1)
109-112: Code is asymmetric in defensive handling, but link is guaranteed by upstream contract.
The review comment correctly identifies that
obj["link"]uses direct indexing whiletimestampandmessageuse.get()with fallback handling. However, examiningClpConnector.read_results()shows the "link" field is explicitly constructed and added to every document before returning—it is contractually guaranteed to be present.The asymmetry is valid from a code consistency perspective. The suggested fix would add defensive handling matching timestamp/message treatment. However, since upstream guarantees "link" presence, a KeyError would indicate a contract violation rather than a typical edge case.
Consider: either trust the data contract and document why link differs, or apply the diff for consistency and defensive robustness. Both are defensible choices depending on team preference.
tools/deployment/presto-clp/scripts/init.py (1)
127-127: LGTM: Env var renamed to reflect S3 staging semantics.
The rename from
`CLP_ARCHIVES_DIR` to `CLP_STAGED_ARCHIVES_DIR` for S3 storage accurately reflects that S3 uses a staging directory, while FS storage (line 118) continues to use `CLP_ARCHIVES_DIR` for direct archive access.
tools/deployment/presto-clp/docker-compose.yaml (1)
37-38: The volume mounts are already correctly configured with fixed container paths; worker configuration does not and should not reference them.
The docker-compose.yaml volumes at lines 37-38 are already using fixed container paths (
`/var/data/archives` and `/var/data/staged-archives`), not dynamic environment-variable paths. These mounted volumes are set up correctly and serve their intended purpose through environment variable substitution at deployment time (handled by `init.py`).
- For filesystem storage: The CLP connector uses the environment variable
CLP_ARCHIVES_DIRconfigured byinit.py, not direct path references in worker config.- For S3 storage: The worker's
clp.propertiesis dynamically generated with S3 credentials and endpoint configuration, not filesystem paths.The worker's configuration templates (
config.properties,node.properties,velox.properties) are correctly unchanged. This is not a breaking change to worker configuration.Likely an incorrect or invalid review comment.
taskfile.yaml (3)
117-117: log-converter included in package artifacts—LGTMThe new binary is wired into both the sources and rsync copy sets. Matches the CMake target name.
Also applies to: 163-163
223-231: core-build target list—LGTMThe generates and TARGETS now include log-converter and align with packaging paths.
298-311: Based on the verification:No action required—Node.js v22.20.0 is available for all common OS/ARCH permutations
The curl output confirms v22.20.0 is available for:
node-v22.20.0-linux-x64.tar.gz✓node-v22.20.0-linux-arm64.tar.gz✓node-v22.20.0-darwin-x64.tar.gz✓node-v22.20.0-darwin-arm64.tar.gz✓The taskfile uses dynamic
{{OS}}and{{.NODEJS_ARCH}}variables (conditionally x64 or arm64) for the download pattern. The hard-coded version v22.20.0 is not broken and will successfully download across these standard CI environments.components/core/src/clp_s/log_converter/CommandLineArguments.hpp (1)
1-49: API surface looks good.No blocking issues found.
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
78-82: Remove the review comment - the code handles final line without newline correctly.The
try_read_to_delimiterimplementation shows that when a file's last line lacks a trailing newline:
- The final
getdelim()call reads remaining data successfully (returns byte count > 0)- Since
num_bytes_read >= 1, the data is appended tolineandErrorCodeSuccessis returned- The loop body immediately processes and pushes the line to the destination vector
- Only on the next iteration does
getdelim()return -1, causingErrorCodeEndOfFileto be returnedNo residual data is lost. The pattern of using
try_read_to_delimiterwithappend=falseis already used throughout the codebase (clp/Utils.cpp, clp/compression.cpp, glt/compression.cpp) without special EOF handling, confirming this approach is correct.components/clp-package-utils/clp_package_utils/general.py (1)
52-54: S3 subcommand constants look good.
Clear, kebab-cased, and used consistently across scripts. No issues.
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
81-87: Explicitly setting --input-type fs is correct.
Ensures the native compressor runs in FS mode. LGTM.
220-220: Mounting input_logs_dir is necessary for FS inputs.
Good catch adding mounts.input_logs_dir to ensure the container can see the input files.
```python
if not _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args):
    logger.error("No S3 URLs given for compression.")
    return -1
```
Handle failures when reading --inputs-from cleanly
If the inputs file is missing or unreadable, the script raises and exits with a stack trace. Catch and log a clear error.
Apply this guard:
```diff
-    if not _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args):
+    try:
+        ok = _generate_url_list(parsed_args.subcommand, container_url_list_path, parsed_args)
+    except Exception:
+        logger.exception("Failed to prepare S3 URL list for compression.")
+        return -1
+    if not ok:
         logger.error("No S3 URLs given for compression.")
         return -1
```
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 287 to 290, the call to _generate_url_list can raise when the
--inputs-from file is missing or unreadable; wrap the call in a try/except,
catch the exception (e.g., Exception as e), log a clear error like "Failed to
read inputs file for --inputs-from: <error>" via logger.error, and return -1 to
exit cleanly instead of letting a stack trace propagate.
Before:
```python
def _generate_logs_list(
    input_type: InputType,
    container_logs_list_path: pathlib.Path,
    parsed_args: argparse.Namespace,
) -> None:
    if InputType.FS == input_type:
        host_logs_list_path = parsed_args.path_list
        with open(container_logs_list_path, "w") as container_logs_list_file:
            if host_logs_list_path is not None:
                with open(host_logs_list_path, "r") as host_logs_list_file:
                    for line in host_logs_list_file:
                        stripped_path_str = line.rstrip()
                        if "" == stripped_path_str:
                            # Skip empty paths
                            continue
                        resolved_path = pathlib.Path(stripped_path_str).resolve()
                        mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                            resolved_path.anchor
                        )
                        container_logs_list_file.write(f"{mounted_path}\n")
    elif InputType.S3 == input_type:
        with open(container_logs_list_path, "w") as container_logs_list_file:
            container_logs_list_file.write(f"{parsed_args.paths[0]}\n")
    else:
        raise ValueError(f"Unsupported input type: {input_type}.")
```

After:
```python
) -> bool:
    """
    Generates logs list file for the native compression script.
    :param container_logs_list_path: Path to write logs list.
    :param parsed_args: Parsed command-line arguments.
    :return: Whether any paths were written to the logs list.
    """
    host_logs_list_path = parsed_args.path_list
    with open(container_logs_list_path, "w") as container_logs_list_file:
        if host_logs_list_path is None:
            for path in parsed_args.paths:
                resolved_path = pathlib.Path(path).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                    resolved_path.anchor
                )
                container_logs_list_file.write(f"{mounted_path}\n")
            return len(parsed_args.paths) != 0

        no_path_found = True
        with open(host_logs_list_path, "r") as host_logs_list_file:
            for line in host_logs_list_file:
                stripped_path_str = line.rstrip()
                if "" == stripped_path_str:
                    # Skip empty paths
                    continue
                no_path_found = False
                resolved_path = pathlib.Path(stripped_path_str).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                    resolved_path.anchor
                )
                container_logs_list_file.write(f"{mounted_path}\n")
        return not no_path_found
```
Validate inputs are under logs_input.directory and handle missing --path-list gracefully
- If any provided path is outside the configured logs_input.directory, the container won't see it (mount only covers that root). Validate early to avoid hard-to-debug failures.
- Opening a non-existent --path-list file currently raises and aborts without a friendly message.
Apply this refactor (adds input_root validation and robust file handling):
```diff
-def _generate_logs_list(
-    container_logs_list_path: pathlib.Path,
-    parsed_args: argparse.Namespace,
-) -> bool:
+def _generate_logs_list(
+    container_logs_list_path: pathlib.Path,
+    parsed_args: argparse.Namespace,
+    input_root: pathlib.Path,
+) -> bool:
@@
-    host_logs_list_path = parsed_args.path_list
-    with open(container_logs_list_path, "w") as container_logs_list_file:
+    host_logs_list_path = parsed_args.path_list
+    input_root = input_root.resolve()
+    with open(container_logs_list_path, "w") as container_logs_list_file:
         if host_logs_list_path is None:
             for path in parsed_args.paths:
                 resolved_path = pathlib.Path(path).resolve()
+                try:
+                    resolved_path.relative_to(input_root)
+                except ValueError:
+                    raise ValueError(
+                        f"Path '{resolved_path}' is outside logs_input.directory '{input_root}'."
+                    )
                 mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                     resolved_path.anchor
                 )
                 container_logs_list_file.write(f"{mounted_path}\n")
             return len(parsed_args.paths) != 0
-        no_path_found = True
-        with open(host_logs_list_path, "r") as host_logs_list_file:
+        no_path_found = True
+        try:
+            host_file = pathlib.Path(host_logs_list_path).resolve()
+            with open(host_file, "r") as host_logs_list_file:
                 for line in host_logs_list_file:
                     stripped_path_str = line.rstrip()
                     if "" == stripped_path_str:
                         # Skip empty paths
                         continue
                     no_path_found = False
-                    resolved_path = pathlib.Path(stripped_path_str).resolve()
+                    resolved_path = pathlib.Path(stripped_path_str).resolve()
+                    try:
+                        resolved_path.relative_to(input_root)
+                    except ValueError:
+                        raise ValueError(
+                            f"Path '{resolved_path}' is outside logs_input.directory '{input_root}'."
+                        )
                     mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                         resolved_path.anchor
                     )
                     container_logs_list_file.write(f"{mounted_path}\n")
+        except FileNotFoundError as e:
+            raise ValueError(f"--path-list file not found: {host_logs_list_path}") from e
         return not no_path_found
```
And update the call site:
```diff
-    if not _generate_logs_list(container_logs_list_path, parsed_args):
+    if not _generate_logs_list(
+        container_logs_list_path,
+        parsed_args,
+        clp_config.logs_input.directory,
+    ):
         logger.error("No filesystem paths given for compression.")
         return -1
```
Committable suggestion skipped: line range outside the PR's diff.
```python
if not _generate_logs_list(container_logs_list_path, parsed_args):
    logger.error("No filesystem paths given for compression.")
    return -1
```
Wrap logs list generation with a friendly error report
Currently, exceptions from reading --path-list surface as stack traces. Catch and report as user errors.
Apply this small guard if you prefer not to change the function signature:
```diff
-    if not _generate_logs_list(container_logs_list_path, parsed_args):
+    try:
+        ok = _generate_logs_list(container_logs_list_path, parsed_args)
+    except Exception:
+        logger.exception("Failed to prepare filesystem paths for compression.")
+        return -1
+    if not ok:
         logger.error("No filesystem paths given for compression.")
         return -1
```
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress.py around
lines 233 to 236, calling _generate_logs_list can raise exceptions that
currently surface as stack traces; wrap the call in a try/except that catches
Exception, logs a friendly user-facing error message including the exception
text (e.g. "Failed to read --path-list: <error>") and returns -1, and avoid
re-raising the exception so users see a concise error (optionally log full
traceback at debug level).
```cpp
// Delete copy constructor and assignment operator
FileWriter(FileWriter const&) = delete;
auto operator=(FileWriter const&) -> FileWriter& = delete;

// Define custom move constructor and assignment operator
FileWriter(FileWriter&& writer)
        : m_file{std::exchange(writer.m_file, nullptr)},
          m_fd{std::exchange(writer.m_fd, -1)} {}

auto operator=(FileWriter&& writer) -> FileWriter& {
    if (this == &writer) {
        return *this;
    }
    m_file = std::exchange(writer.m_file, nullptr);
    m_fd = std::exchange(writer.m_fd, -1);
    return *this;
}
```
Move assignment leaks existing handles; make moves noexcept and swap-based
If the target already owns an open file, the current move assignment overwrites m_file/m_fd without closing, leaking the resource. Also, marking moves noexcept helps container optimizations.
Apply this diff:
```diff
@@
     // Delete copy constructor and assignment operator
     FileWriter(FileWriter const&) = delete;
     auto operator=(FileWriter const&) -> FileWriter& = delete;
 
-    // Define custom move constructor and assignment operator
-    FileWriter(FileWriter&& writer)
-            : m_file{std::exchange(writer.m_file, nullptr)},
-              m_fd{std::exchange(writer.m_fd, -1)} {}
+    // Define custom move constructor and assignment operator
+    FileWriter(FileWriter&& writer) noexcept
+            : m_file{std::exchange(writer.m_file, nullptr)},
+              m_fd{std::exchange(writer.m_fd, -1)} {}
 
-    auto operator=(FileWriter&& writer) -> FileWriter& {
-        if (this == &writer) {
-            return *this;
-        }
-        m_file = std::exchange(writer.m_file, nullptr);
-        m_fd = std::exchange(writer.m_fd, -1);
-        return *this;
-    }
+    friend void swap(FileWriter& a, FileWriter& b) noexcept {
+        using std::swap;
+        swap(a.m_file, b.m_file);
+        swap(a.m_fd, b.m_fd);
+    }
+
+    auto operator=(FileWriter&& writer) noexcept -> FileWriter& {
+        if (this != &writer) {
+            swap(*this, writer);  // old resources end up in writer and are cleaned up by its dtor
+        }
+        return *this;
+    }
```
In components/core/src/clp_s/FileWriter.hpp around lines 33 to 49, the move
assignment currently overwrites m_file/m_fd without releasing existing resources
(leak) and neither move constructor nor move assignment is marked noexcept;
replace the assignment logic with a swap-based move assignment (or explicitly
close existing handles before overwriting) so the target releases its current
file descriptor/file safely, implement the move constructor/assignment to leave
the source in a valid empty state, and mark both the move constructor and move
assignment noexcept to enable container optimizations.
```cmake
if(CLP_BUILD_EXECUTABLES)
    add_executable(
        log-converter
        log_converter.cpp
        ${CLP_S_LOG_CONVERTER_SOURCES}
    )
    target_compile_features(log-converter PRIVATE cxx_std_20)
    target_link_libraries(
        log-converter
        PRIVATE
        Boost::program_options
        clp_s::clp_dependencies
        clp_s::io
        fmt::fmt
        log_surgeon::log_surgeon
        msgpack-cxx
        nlohmann_json::nlohmann_json
        spdlog::spdlog
        ystdlib::containers
        ystdlib::error_handling
    )
    set_target_properties(
        log-converter
        PROPERTIES
        RUNTIME_OUTPUT_DIRECTORY "${PROJECT_BINARY_DIR}"
    )
endif()
```
Packaging assumes this binary exists, but build is gated—risk of breakage
Taskfile unconditionally packages {{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter. If CLP_BUILD_EXECUTABLES is OFF, packaging will fail.
Options:
- Ensure CLP_BUILD_EXECUTABLES is enabled in CI/package builds.
- Or gate the packaging entries on the same option.
- Or always build this target irrespective of CLP_BUILD_EXECUTABLES.
🤖 Prompt for AI Agents
components/core/src/clp_s/log_converter/CMakeLists.txt lines 11-37: Packaging
unconditionally expects the log-converter binary but this target is only created
when CLP_BUILD_EXECUTABLES is ON; to fix, make packaging robust by aligning
behavior with the same option—either enable CLP_BUILD_EXECUTABLES in CI/package
builds or (preferably) gate the packaging/taskfile entry that references
{{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter with the same
CLP_BUILD_EXECUTABLES check (or check for the target/file existence before
packaging) so packaging will only reference the binary when it is actually
built.
```cpp
std::string line;
while (true) {
    error_code = reader.try_read_to_delimiter('\n', false, false, line);
    if (ErrorCodeSuccess != error_code) {
        break;
    }
    if (false == line.empty()) {
        path_destination.push_back(line);
    }
}
```
Trim CRLF and trailing whitespace when reading paths.
Without trimming, Windows CRLF leaves a trailing '\r' causing path resolution failures.
```diff
     while (true) {
         error_code = reader.try_read_to_delimiter('\n', false, false, line);
         if (ErrorCodeSuccess != error_code) {
             break;
         }
-        if (false == line.empty()) {
-            path_destination.push_back(line);
-        }
+        // Trim trailing CR and spaces/tabs
+        while (!line.empty()
+               && (line.back() == '\r' || line.back() == ' ' || line.back() == '\t')) {
+            line.pop_back();
+        }
+        if (!line.empty()) {
+            path_destination.push_back(line);
+        }
     }
```
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/CommandLineArguments.cpp around lines
67 to 76, the loop reads lines but does not trim trailing CR/LF or other
whitespace, leaving a trailing '\r' on Windows that breaks path resolution;
after successfully reading a line, trim trailing whitespace (including '\r',
'\n', spaces, tabs) from the end of the string (e.g., use find_last_not_of and
erase or an equivalent rtrim) then only push_back non-empty trimmed paths into
path_destination.
```cpp
auto serializer{YSTDLIB_ERROR_HANDLING_TRYX(LogSerializer::create(output_dir, path.path))};

bool reached_end_of_stream{false};
while (false == reached_end_of_stream) {
    auto const num_bytes_read{YSTDLIB_ERROR_HANDLING_TRYX(refill_buffer(reader))};
    reached_end_of_stream = 0ULL == num_bytes_read;

    while (m_parser_offset < m_num_bytes_buffered) {
        auto const err{parser.parse_next_event(
                m_buffer.data(),
                m_num_bytes_buffered,
                m_parser_offset,
                reached_end_of_stream
        )};
        if (log_surgeon::ErrorCode::BufferOutOfBounds == err) {
            break;
        }
        if (log_surgeon::ErrorCode::Success != err) {
            return std::errc::no_message;
        }

        auto const& event{parser.get_log_parser().get_log_event_view()};
        auto const message{event.to_string()};
        if (nullptr != event.get_timestamp()) {
            auto const timestamp{event.get_timestamp()->to_string_view()};
            auto const message_without_timestamp{
                    std::string_view{message}.substr(timestamp.length())
            };
            YSTDLIB_ERROR_HANDLING_TRYV(
                    serializer.add_message(timestamp, message_without_timestamp)
            );
        } else {
            YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message));
        }
    }
}
serializer.close();
return ystdlib::error_handling::success();
```
Ensure serializer is closed on all paths (RAII or scope guard).
Errors returned via TRYX and on non-success parse exit before serializer.close(), yielding incomplete IR (missing EOF) and a possibly unclosed file. Make LogSerializer close on destruction and idempotent, or add a scope guard here.
Apply the LogSerializer.hpp diff in its file comment to make close-on-destroy and idempotent; then keeping the explicit serializer.close(); here is harmless.
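For reference, here is a generic close-on-exit guard of the kind the comment suggests. This is a minimal standalone sketch, not LogSerializer's actual API:

```cpp
#include <cstdio>
#include <utility>

// Minimal scope guard: runs the stored callable when the scope exits, on both
// success and early-return/error paths.
template <typename Callable>
class ScopeGuard {
public:
    explicit ScopeGuard(Callable callable) : m_callable{std::move(callable)} {}
    ~ScopeGuard() { m_callable(); }
    ScopeGuard(ScopeGuard const&) = delete;
    auto operator=(ScopeGuard const&) -> ScopeGuard& = delete;

private:
    Callable m_callable;
};

int main() {
    ScopeGuard guard{[] { std::puts("serializer closed"); }};
    // Every return path from here on still triggers the guard's destructor.
    return 0;
}
```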
🤖 Prompt for AI Agents
components/core/src/clp_s/log_converter/LogConverter.cpp lines 53-90:
serializer.close() is only called on the happy path so early returns from TRYX
or parse failures can leave the serializer/file unclosed and EOF missing; make
LogSerializer's destructor call close() and ensure close() is idempotent (safe
to call multiple times) by applying the LogSerializer.hpp changes noted in the
review (close-on-destroy and guard against double-close), then retain the
explicit serializer.close() here (it becomes harmless) so all paths guarantee
closure.
```cpp
#include <array>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>

#include <boost/uuid/random_generator.hpp>
#include <boost/uuid/uuid_io.hpp>
#include <msgpack.hpp>
#include <nlohmann/json.hpp>
#include <nlohmann/json_fwd.hpp>
#include <ystdlib/error_handling/Result.hpp>
```
Missing header for std::filesystem.
This file uses std::filesystem::path but doesn't include `<filesystem>`. Add it to avoid transitive-include brittleness.
```diff
 #include <array>
 #include <cstdint>
 #include <exception>
+#include <filesystem>
 #include <optional>
 #include <string>
 #include <string_view>
```
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogSerializer.cpp around lines 3 to
18, the file uses std::filesystem::path but does not include <filesystem>; add
the header #include <filesystem> (preferably grouped with the other standard
headers at the top of the file) to ensure std::filesystem symbols are available
and avoid relying on transitive includes.
```cpp
    clp_s::FileWriter writer;
    try {
        writer.open(converted_path, clp_s::FileWriter::OpenMode::CreateForWriting);
    } catch (std::exception const&) {
        return std::errc::no_such_file_or_directory;
    }
```
Type mismatch: pass string to FileWriter::open
open expects std::string; converted_path is std::filesystem::path. Use string() and preserve error info.
```diff
-        writer.open(converted_path, clp_s::FileWriter::OpenMode::CreateForWriting);
+        writer.open(converted_path.string(), clp_s::FileWriter::OpenMode::CreateForWriting);
```

🤖 Prompt for AI Agents
+ writer.open(converted_path.string(), clp_s::FileWriter::OpenMode::CreateForWriting);🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogSerializer.cpp around lines 41 to
46, FileWriter::open expects a std::string but converted_path is a
std::filesystem::path; call open with converted_path.string() instead of
converted_path, and change the catch to catch (std::exception const& e) so you
can preserve and surface the error details (e.g. log or include e.what() before
returning std::errc::no_such_file_or_directory).
```cpp
    /**
     * Closes and flushes the serialized output.
     */
    void close() {
        flush_buffer();
        m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof);
        m_writer.close();
    }

private:
```
Make close() idempotent and close on destruction to avoid incomplete outputs.
If callers return early, EOF isn’t written and the writer may remain open. Make close() idempotent and call it from the destructor.
```diff
-    // Destructor
-    ~LogSerializer() = default;
+    // Destructor
+    ~LogSerializer() noexcept {
+        if (!m_closed) {
+            try {
+                close();
+            } catch (...) {
+                // Swallow in destructor
+            }
+        }
+    }
@@
-    void close() {
-        flush_buffer();
-        m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof);
-        m_writer.close();
-    }
+    void close() {
+        if (m_closed) {
+            return;
+        }
+        flush_buffer();
+        m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof);
+        m_writer.close();
+        m_closed = true;
+    }
@@
-    clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t> m_serializer;
-    clp_s::FileWriter m_writer;
+    clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t> m_serializer;
+    clp_s::FileWriter m_writer;
+    bool m_closed{false};
```

Also applies to: 111-113
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogSerializer.hpp around lines 75-83
(and similarly for 111-113), close() currently unconditionally writes EOF and
closes the writer which can be skipped on early returns and can be unsafe if
called multiple times; make close() idempotent by tracking a closed flag (e.g.,
m_closed) and return immediately if already closed, only write EOF/flush/close
once, and call close() from the destructor to ensure EOF is written and the
writer is closed when the object is destroyed.
… the package without starting components (resolves y-scope#1475). (y-scope#1502)
…r` and `spider_worker`) in the `clp-package` image; Set `mariadb-connector-cpp` build type to `Release` (fixes y-scope#1410). (y-scope#1414)
…deployments (fixes y-scope#1498). (y-scope#1501)
Co-authored-by: Lin Zhihao <[email protected]>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
724-730: Protect secrets in `.env` and write atomically/minimally.
`.env` includes credentials (e.g., the AWS secret). It's written with default perms (often world-readable). Use 0600 and avoid partial writes.
Apply:

```diff
-        # Write the environment variables to the `.env` file.
-        with open(f"{self._clp_home}/.env", "w") as env_file:
-            for key, value in env_vars.items():
-                if value is None:
-                    continue
-                env_file.write(f"{key}={value}\n")
+        # Write the environment variables to the `.env` file with restrictive perms.
+        env_path = self._clp_home / ".env"
+        fd = os.open(str(env_path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
+        with os.fdopen(fd, "w") as env_file:
+            for key, value in env_vars.items():
+                if value is None:
+                    continue
+                env_file.write(f"{key}={value}\n")
```
♻️ Duplicate comments (1)
.github/workflows/clp-uv-checks.yaml (1)
5-18: YAML alias syntax is not supported in GitHub Actions `paths` triggers (reiterating prior feedback).
Line 18 uses a YAML alias reference (`paths: *monitored_paths`), which GitHub Actions workflow syntax does not support. The `paths` section must be a direct sequence node, not an alias. This will cause the workflow to fail at parse time.
Replace the alias on line 18 with the explicit list of paths:

```diff
   push:
-    paths: *monitored_paths
+    paths:
+      - ".github/workflows/clp-uv-checks.yaml"
+      - "components/clp-mcp-server/pyproject.toml"
+      - "components/clp-mcp-server/uv.lock"
+      - "components/clp-package-utils/pyproject.toml"
+      - "components/clp-package-utils/uv.lock"
+      - "components/clp-py-utils/pyproject.toml"
+      - "components/clp-py-utils/uv.lock"
+      - "components/job-orchestration/pyproject.toml"
+      - "components/job-orchestration/uv.lock"
+      - "integration-tests/pyproject.toml"
+      - "integration-tests/uv.lock"
```

You may also consider removing the anchor definition at line 5 once the push section is expanded, unless it serves a purpose elsewhere in the workflow.
🧹 Nitpick comments (3)
components/clp-package-utils/clp_package_utils/controller.py (1)
751-756: Address Ruff S603/S607 subprocess warnings or suppress them intentionally.
The calls use argument lists (no shell) and are preceded by dependency checks, so they're safe. To quiet lint and document intent:

```diff
-        subprocess.run(
+        subprocess.run(  # noqa: S603
             cmd,
             cwd=self._clp_home,
             check=True,
         )
 ...
-        subprocess.run(
-            ["docker", "compose", "--project-name", self._project_name, "down"],
+        subprocess.run(  # noqa: S603,S607
+            ["docker", "compose", "--project-name", self._project_name, "down"],
             cwd=self._clp_home,
             check=True,
         )
```

Alternatively, keep as-is and ignore the lints.
Also applies to: 783-787
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
69-71: Avoid bare `except:`; don't swallow `KeyboardInterrupt`/`SystemExit`.
Use `except Exception` to keep termination signals intact.

```diff
-    except:
+    except Exception:
         logger.exception("Failed to load config.")
         return -1
 ...
-    except:
+    except Exception:
         logger.exception("Failed to create necessary directories.")
         return -1
```

Also applies to: 80-82
20-20: Minor style nit: prefer `logging.getLogger(__name__)`.
The current logger name is a file path. Using the module name improves consistency.

```diff
-logger = logging.getLogger(__file__)
+logger = logging.getLogger(__name__)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- `.github/workflows/clp-uv-checks.yaml` (1 hunks)
- `components/clp-package-utils/clp_package_utils/controller.py` (4 hunks)
- `components/clp-package-utils/clp_package_utils/scripts/start_clp.py` (2 hunks)
- `components/clp-py-utils/clp_py_utils/clp_config.py` (3 hunks)
- `docs/src/dev-docs/tooling-gh-workflows.md` (1 hunks)
- `docs/src/user-docs/guides-using-presto.md` (0 hunks)
- `docs/src/user-docs/quick-start/clp-json.md` (1 hunks)
- `docs/src/user-docs/quick-start/clp-text.md` (1 hunks)
- `taskfiles/deps/main.yaml` (2 hunks)
- `tools/deployment/presto-clp/docker-compose.yaml` (2 hunks)
- `tools/deployment/presto-clp/scripts/init.py` (5 hunks)
- `tools/docker-images/clp-package/Dockerfile` (2 hunks)
💤 Files with no reviewable changes (1)
- docs/src/user-docs/guides-using-presto.md
✅ Files skipped from review due to trivial changes (1)
- docs/src/user-docs/quick-start/clp-text.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/src/dev-docs/tooling-gh-workflows.md
🧰 Additional context used
🧬 Code graph analysis (2)
components/clp-package-utils/clp_package_utils/controller.py (2)
components/clp-package-utils/clp_package_utils/general.py (3)
- `check_docker_dependencies` (171-194)
- `DockerComposeProjectNotRunningError` (81-85)
- `DockerDependencyError` (56-57)

components/clp-py-utils/clp_py_utils/clp_config.py (2)
- `get_deployment_type` (777-781)
- `DeploymentType` (90-92)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-package-utils/clp_package_utils/controller.py (2)
- `set_up_env` (90-94)
- `set_up_env` (645-729)
🪛 actionlint (1.7.8)
.github/workflows/clp-uv-checks.yaml
18-18: "paths" section must be sequence node but got alias node with "" tag
(syntax-check)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/controller.py
751-751: subprocess call: check for execution of untrusted input
(S603)
783-783: subprocess call: check for execution of untrusted input
(S603)
784-784: Starting a process with a partial executable path
(S607)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (22)
taskfiles/deps/main.yaml (3)
374-375: Verify the log-surgeon tarball hash and confirm the dependency update rationale.
The tarball URL has been updated to a newer commit (840f262), with an updated SHA256. Ensure the following (a verification sketch follows this list):
- The new tarball SHA256 is correct and verified from the source repository.
- The commit change is intentional and includes desired fixes or improvements.
- All downstream consumers are compatible with this version.
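For a local sanity check, here is a minimal Python sketch of verifying a pinned tarball hash. The URL and expected digest below are placeholders, not the values actually pinned in taskfiles/deps/main.yaml:

```python
import hashlib
import urllib.request

# Placeholders; substitute the URL and SHA256 pinned in taskfiles/deps/main.yaml.
TARBALL_URL = "https://example.com/log-surgeon-840f262.tar.gz"
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"


def tarball_sha256(url: str) -> str:
    """Download the tarball and return its SHA256 hex digest."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as response:
        # Hash in 64 KiB chunks to avoid loading the whole tarball into memory.
        for chunk in iter(lambda: response.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    actual = tarball_sha256(TARBALL_URL)
    print("hash matches" if actual == EXPECTED_SHA256 else f"hash MISMATCH: {actual}")
```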
409-409: Consistent build type configuration.
Adding `-DCMAKE_BUILD_TYPE=Release` to mariadb-connector-cpp aligns with the pattern used elsewhere in the file (e.g., all other CMake-based dependencies specify Release builds). This improves reproducibility and consistency.
1-641: Mismatch between PR objectives and file scope.
The PR objectives describe refining "uv lock maintenance loops" using "YAML anchors" and "G_COMPONENTS_DIR", but this file contains only dependency installation tasks with version/build config updates. The uv-related changes may reside in the included `lock.yaml` file (line 4) or in other files not provided for review. Ensure the PR description accurately reflects the changes in this file, or clarify the scope across the multiple files involved.

.github/workflows/clp-uv-checks.yaml (2)
39-41: Comment spacing is correct.
This addresses the prior minor formatting concern: the checkout action comment now has proper 2-space formatting.
24-53: Workflow structure and task integration look good.
The job matrix strategy, concurrency settings, and step sequencing are well-designed. The invocation of `task deps:lock:check-uv` correctly delegates lock validation to the centralized task defined in the taskfiles, supporting the PR's objective to manage UV project directories via task configuration; a sketch of the underlying check loop follows.
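For illustration, a minimal Python sketch of the per-project check that the task performs. The project list below is an assumption; the authoritative set lives in the Taskfile:

```python
import subprocess
from pathlib import Path

# Hypothetical project list; the authoritative set is defined in the Taskfile.
UV_PROJECT_DIRS = [
    Path("components/clp-package-utils"),
    Path("components/clp-py-utils"),
    Path("integration-tests"),
]


def check_uv_locks(project_dirs: list[Path]) -> bool:
    """Run `uv lock --check` in each project dir and report aggregate success."""
    all_ok = True
    for project_dir in project_dirs:
        # `uv lock --check` exits non-zero when the lockfile is out of date.
        result = subprocess.run(["uv", "lock", "--check"], cwd=project_dir, check=False)
        if result.returncode != 0:
            print(f"Lockfile out of date or invalid: {project_dir}")
            all_ok = False
    return all_ok


if __name__ == "__main__":
    raise SystemExit(0 if check_uv_locks(UV_PROJECT_DIRS) else 1)
```

components/clp-py-utils/clp_py_utils/clp_config.py (4)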
35-35: LGTM - Presto coordinator component name constant.
The new constant follows the same naming pattern as other component names and is used appropriately in the Presto transformation method.
594-603: Verify: Missing default values for host and port fields.
The `Presto` class lacks default values for the `host` and `port` fields, which is inconsistent with all other similar component classes (`Database`, `QueryScheduler`, `Redis`, `Reducer`, `ResultsCache`, `Queue`, `WebUi`, `McpServer`). Those classes all provide defaults: `host = "localhost"` and `port = DEFAULT_PORT`.
Without defaults, users must explicitly specify both host and port when configuring Presto. Is this intentional (to force explicit configuration for an external service), or should defaults be added for consistency?
If defaults should be added, apply this diff:
```diff
 class Presto(BaseModel):
     DEFAULT_PORT: ClassVar[int] = 8080
-    host: DomainStr
-    port: Port
+    host: DomainStr = "localhost"
+    port: Port = DEFAULT_PORT
```
605-609: LGTM - Environment variable validation.
The `_get_env_var` function correctly raises `ValueError` with a clear message when an environment variable is missing. This aligns with the documented behavior in the calling methods' docstrings.
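For reference, the validation pattern being praised looks roughly like the following minimal sketch; the exact signature and error message in clp_config.py may differ:

```python
import os


def _get_env_var(name: str) -> str:
    """Return the named environment variable's value, raising ValueError when unset."""
    value = os.environ.get(name)
    if value is None:
        raise ValueError(f"Environment variable `{name}` is not set.")
    return value
```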
825-826: LGTM - Conditional Presto transformation.
The conditional check appropriately transforms the Presto configuration when using the PRESTO query engine. The redundant `self.presto is not None` check is good defensive programming, even though the validator at lines 796-803 already enforces this constraint.

tools/docker-images/clp-package/Dockerfile (2)
21-22: Environment variables properly split for library path configuration.
Adding `LD_LIBRARY_PATH` to help the runtime locate the MariaDB C++ connector libraries is correct. The split across two ENV instructions maintains readability and follows standard Docker practices.
31-32: The glob pattern is correct for the DEB install layout.
When MariaDB Connector/C++ is built with `-DINSTALL_LAYOUT=DEB`, libraries are installed under `lib/${CMAKE_LIBRARY_ARCHITECTURE}` (e.g., `lib/x86_64-linux-gnu/`). The Dockerfile's glob pattern `lib/*/libmariadbcpp.so*` correctly matches these architecture-specific subdirectories created by the DEB layout.
The path structure is also correctly aligned: the taskfile defines `INSTALL_PREFIX` as `build/deps/cpp/mariadb-connector-cpp-install`, and the Dockerfile copies from `./build/deps/cpp/mariadb-connector-cpp-install/lib/*/libmariadbcpp.so*`. The COPY instruction will successfully match and copy the libraries without silent failures.
Likely an incorrect or invalid review comment.
tools/deployment/presto-clp/scripts/init.py (4)
120-137: LGTM: Storage-specific environment variables.
The code correctly sets different environment variables (`CLP_ARCHIVES_DIR` for filesystem storage, `CLP_STAGED_ARCHIVES_DIR` for S3 storage) based on the storage type; a minimal sketch of this branching follows. This aligns with the docker-compose.yaml volume mount configuration.
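A rough sketch of the branching described above; the function name and storage-type literals are illustrative assumptions, not the exact code from init.py:

```python
# Hypothetical shape of the storage-specific env var selection in init.py.
def set_archive_env_vars(env_vars: dict[str, str], storage_type: str, archives_dir: str) -> None:
    """Point Presto at the right archive directory for the configured storage type."""
    if storage_type == "fs":
        env_vars["CLP_ARCHIVES_DIR"] = archives_dir
    elif storage_type == "s3":
        env_vars["CLP_STAGED_ARCHIVES_DIR"] = archives_dir
    else:
        raise ValueError(f"Unsupported storage type: {storage_type}")
```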
169-173: LGTM: Instance ID-based network naming.
The network name format `clp-package-{instance_id}_default` correctly follows Docker Compose's default network naming convention (see the sketch below). Error handling appropriately returns `False` when the instance ID cannot be determined.
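To make the convention concrete, a minimal sketch; the helper name is an assumption, not the exact function in init.py:

```python
def get_clp_network_name(instance_id: str) -> str:
    """Build the Compose network name for a CLP package instance.

    Docker Compose names a project's default network `<project-name>_default`,
    so a project named `clp-package-<instance-id>` yields the name below.
    """
    return f"clp-package-{instance_id}_default"


print(get_clp_network_name("abc123"))  # -> clp-package-abc123_default
```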
285-322: LGTM: Robust helper function with comprehensive error handling.
The function has (see the sketch after this list):
- Proper type hints (`Optional[str]`)
- Descriptive error messages for each failure scenario
- Appropriate exception handling (OSError covers file access issues)
- Validation to ensure the instance ID is not empty
- Explicit UTF-8 encoding
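An illustrative reconstruction of that helper's shape; the name, messages, and failure-handling details are assumptions rather than the exact implementation in init.py:

```python
from pathlib import Path
from typing import Optional


def read_instance_id(instance_id_path: Path) -> Optional[str]:
    """Read the CLP instance ID from disk, returning None on any failure."""
    try:
        instance_id = instance_id_path.read_text(encoding="utf-8").strip()
    except OSError as err:
        # OSError covers missing files, permission errors, and other I/O issues.
        print(f"Failed to read instance ID file `{instance_id_path}`: {err}")
        return None
    if not instance_id:
        print(f"Instance ID file `{instance_id_path}` is empty.")
        return None
    return instance_id
```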
110-113: No issue found; the code correctly matches CLP connector configuration expectations.
The CLP connector explicitly expects the metadata database URL and database name as separate configuration properties (`clp.metadata-db-url` and `clp.metadata-db-name`), not combined in a single JDBC URL. The code correctly implements this design by setting them as separate environment variables on lines 110-113, which is the expected pattern for the Presto CLP connector.

tools/deployment/presto-clp/docker-compose.yaml (4)
13-14: LGTM: Network configuration aligns with CLP package integration.
Both coordinator and worker services now use the `clp-package` external network, enabling communication with CLP package services. This aligns with the network name exported by `init.py`.
Also applies to: 42-43
46-47: LGTM: Well-documented dummy volume.
The explicit `empty` volume with a clear comment is a good practice for optional bind mounts. This prevents Docker from creating unnecessary anonymous volumes when environment variables aren't set.
36-41: Good pattern: Fallback to empty volume for optional mounts.
The use of `${VAR:-empty}` provides a clean fallback when archive directories aren't configured. The read-only flag is appropriate for archive data. The hardcoded mount paths (`/var/data/archives` and `/var/data/staged-archives`) are consistent with the same paths used across the deployment system in `docker-compose.base.yaml`, confirming they match the expected configuration.
52-55: Network configuration is correct; the CLP package implicitly creates the external network.
The configuration properly references an external network that is automatically created when the CLP package starts. The `init.py` script reads the instance-id file (created during CLP package initialization) and constructs the network name as `clp-package-{instance_id}_default`. Docker Compose then uses this project identifier to create the network, which Presto-CLP joins via the `external: true` declaration. The configuration correctly requires the CLP package to be started first.

components/clp-package-utils/clp_package_utils/controller.py (1)
505-508: Good switch to container-scoped Presto host/port.
Using `container_clp_config` for `PrestoHost`/`PrestoPort` aligns UI server settings with container networking.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
40-44: Nice addition: `--setup-only` short-circuits after environment prep.
Flow is clear: validate → mkdirs → set_up_env → optional early exit. Matches the docs.
Also applies to: 87-93
docs/src/user-docs/quick-start/clp-json.md (1)
20-23: Docs align with the new `--setup-only` behavior.
Clear, concise tip; matches the CLI.
```python
    @staticmethod
    def _get_num_workers() -> int:
        """
        :return: Number of worker processes to run.
        """
        # This will change when we move from single to multi-container workers. See y-scope/clp#1424
        return multiprocessing.cpu_count() // 2
```
🧩 Analysis chain
Clamp worker count to at least 1.
multiprocessing.cpu_count() // 2 returns 0 on single‑core hosts, yielding zero‑concurrency workers.
```diff
-        return multiprocessing.cpu_count() // 2
+        return max(1, multiprocessing.cpu_count() // 2)
```

Run to sanity-check for any consumers expecting non-zero:
🏁 Script executed:

```shell
#!/bin/bash
# Find usages of worker concurrency env vars
rg -n --no-heading -C2 'CLP_(COMPRESSION|QUERY|REDUCER)_WORKER_CONCURRENCY'
```

Length of output: 1827

I'll verify the scope of this issue by checking all usages of _get_num_workers() and examining how the concurrency parameter is consumed.

---

🏁 Script executed:

```shell
#!/bin/bash
# Find all calls to _get_num_workers()
rg -n -C3 '_get_num_workers'
```

Length of output: 1405

---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the context around line 648 where num_workers is used
sed -n '645,385p' components/clp-package-utils/clp_package_utils/controller.py | head -50
```

Length of output: 33

🏁 Script executed:

```shell
#!/bin/bash
# Also check lines around 352 and 379 to see full context of env var usage
sed -n '348,385p' components/clp-package-utils/clp_package_utils/controller.py
```

Length of output: 1132

---

🏁 Script executed:

```shell
#!/bin/bash
# Find worker implementation files (search for job_orchestration.executor patterns)
find . -name "*.py" -type f | xargs grep -l "job_orchestration.executor" | head -20
```

Length of output: 795

---

🏁 Script executed:

```shell
#!/bin/bash
# Search for how --concurrency argument is used/parsed
rg -n "concurrency" components/job-orchestration/ --type py -C2
```

Length of output: 1739

🏁 Script executed:

```shell
#!/bin/bash
# Look for celery worker or argparse setup related to concurrency
rg -n "argparse\|--concurrency\|add_argument.*concurrency" components/job-orchestration/ --type py
```

Length of output: 0

---

🏁 Script executed:

```shell
#!/bin/bash
# Search for concurrency handling in compress worker
rg -n "concurrency" components/job-orchestration/job_orchestration/executor/compress/ --type py -C2
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Search for concurrency handling in query worker
rg -n "concurrency" components/job-orchestration/job_orchestration/executor/query/ --type py
```

Length of output: 0

---

🏁 Script executed:

```shell
#!/bin/bash
# Find main entry points for workers (look for __main__ or celery worker setup)
rg -n "if __name__" components/job-orchestration/job_orchestration/executor/ --type py -A5
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Search for celery app setup and worker configuration
rg -n "app.worker_main\|Celery\|celery_app" components/job-orchestration/job_orchestration/executor/ --type py -C2
```

Length of output: 0

---

🏁 Script executed:

```shell
#!/bin/bash
# Search for test files related to concurrency or worker setup
find . -path "*/test*" -name "*.py" | xargs grep -l "concurrency\|cpu_count" 2>/dev/null | head -10
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check for any comments or issues related to single-core or zero concurrency
rg -n "single.core\|concurrency.*0\|cpu_count" . --type py | grep -i "single\|core\|zero\|issue\|todo\|fixme"
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check what multiprocessing.cpu_count() actually returns for verification
cat > /tmp/test_cpu.py << 'EOF'
import multiprocessing
print(f"CPU count: {multiprocessing.cpu_count()}")
print(f"CPU count // 2: {multiprocessing.cpu_count() // 2}")
print(f"Single-core case (1 // 2): {1 // 2}")
print(f"With max(): {max(1, 1 // 2)}")
EOF
python3 /tmp/test_cpu.py
```

Length of output: 73
Clamp worker count to at least 1.
On single-core systems, multiprocessing.cpu_count() // 2 returns 0, causing zero-concurrency workers. This fix is necessary and consistent with the pattern already established in the reducer component (which applies max(int(parsed_args.concurrency), 1) at line 68).
```diff
-        return multiprocessing.cpu_count() // 2
+        return max(1, multiprocessing.cpu_count() // 2)
```
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/controller.py around lines 790
to 797, the worker count calculation can return 0 on single-core systems; change
the return to clamp the value to at least 1 (e.g., return
max(multiprocessing.cpu_count() // 2, 1)) so we never spawn zero workers.
```python
# Database endpoint inside the CLP Package Docker network. Must match the constants defined in
# `components/clp-py-utils/clp_py_utils/clp_config.py`.
DATABASE_COMPONENT_NAME = "database"
DATABASE_DEFAULT_PORT = 3306
```
🧩 Analysis chain
Verify constant consistency with clp_py_utils/clp_config.py.
The comment states these constants must match those defined in components/clp-py-utils/clp_py_utils/clp_config.py. Ensure both files use the same values.
🏁 Script executed:

```shell
#!/bin/bash
# Verify that DATABASE_COMPONENT_NAME and DATABASE_DEFAULT_PORT match the constants in clp_config.py
if [[ -f "components/clp-py-utils/clp_py_utils/clp_config.py" ]]; then
  echo "=== Searching for DATABASE_COMPONENT_NAME and DATABASE_DEFAULT_PORT in clp_config.py ==="
  rg -n -C3 'DATABASE_COMPONENT_NAME|DATABASE_DEFAULT_PORT' components/clp-py-utils/clp_py_utils/clp_config.py
else
  echo "File components/clp-py-utils/clp_py_utils/clp_config.py not found"
fi
```

Length of output: 88
🏁 Script executed:

```shell
#!/bin/bash
# Step 1: Check if clp_config.py exists and view its contents
echo "=== Checking if clp_config.py exists ==="
if [[ -f "components/clp-py-utils/clp_py_utils/clp_config.py" ]]; then
  echo "File exists. Viewing contents..."
  cat components/clp-py-utils/clp_py_utils/clp_config.py
else
  echo "File not found at expected path. Searching for clp_config.py..."
  find . -name "clp_config.py" -type f 2>/dev/null
fi

# Step 2: View init.py to confirm current constants
echo -e "\n=== Viewing init.py ==="
if [[ -f "tools/deployment/presto-clp/scripts/init.py" ]]; then
  head -20 tools/deployment/presto-clp/scripts/init.py
fi

# Step 3: Search broadly for DATABASE constants in the codebase
echo -e "\n=== Searching for DATABASE constants in codebase ==="
rg -n 'DATABASE.*=' --type py | grep -E '(COMPONENT_NAME|DEFAULT_PORT|database)' | head -20
```

Length of output: 31740
Rename constants to match clp_config.py.
Constants defined in init.py don't match their equivalents in clp_config.py. Rename:
- `DATABASE_COMPONENT_NAME` → `DB_COMPONENT_NAME` (matches clp_config.py line 25)
- `DATABASE_DEFAULT_PORT` → `DB_DEFAULT_PORT`, or reference `Database.DEFAULT_PORT` from the config module (matches clp_config.py line 104)
The values are correct (3306 and "database"), but the constant names differ from those defined in the referenced file.
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/scripts/init.py around lines 10 to 13, the
constant names do not match the equivalents in
components/clp-py-utils/clp_py_utils/clp_config.py; rename
DATABASE_COMPONENT_NAME to DB_COMPONENT_NAME and DATABASE_DEFAULT_PORT to
DB_DEFAULT_PORT (or alternatively import and reference Database.DEFAULT_PORT
from the clp_config module) so the names align with clp_config.py while keeping
the same values ("database" and 3306).
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_68fcd91ef21c83278d173c54269455c2
Summary by CodeRabbit
New Features
Refactor
Chores
Documentation
Tests