refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). #1152

junhaoliao · 2025-08-03T06:09:40Z

Description

This PR standardizes how configuration and credentials are passed to containers, addressing security concerns and code complexity in the start_clp.py script.

The key changes are:
Standardize configuration and credential delivery to improve consistency and security by:

Separating credentials from general configuration.
- Instead, passing all credentials via environment variables to avoid leaking them in temporary files.
Using a single generated configuration file for all non-credential configuration data. (instead of generating config files per component)

See #1149 for more details.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Code / File inspection

Inspected that all components properly load credentials from environment variables.
Confirmed that the single generated configuration file contains all necessary non-credential configuration data

Manual tests

Tested container startup with the new configuration approach:
- cd clp-package/sbin/
- ./start-clp.sh
Tested wrapper scripts:
- ./compress.sh ~/samples/hive-24hr/
- ./decompress.sh x
- ./search.sh "1"
- ./admin-tools/archive-manager.sh find
Performed a search with query string "1" in the webui and observed results returned successfully.

Enabled garbage_collector and observed archives / search results are purged in below configurations:

results_cache.retention_period = 1 + archive_output.retention_period = 5800000 + performed a search: observed search results were cleaned in around one minute after query completed.
results_cache.retention_period = 1 + archive_output.retention_period = 1 + performed a search: observed no search result; archives were cleaned in around one minute after compression.

results_cache:
  # Retention period for search results, in minutes. Set to null to disable automatic deletion.
  retention_period: 1

## Where archives should be output to
archive_output:
  # Retention period for archives, in minutes. Set to null to disable automatic deletion.
  retention_period: 1 # 5800000 for validation with the result cache 

## Garbage collector config
garbage_collector:
  logging_level: "INFO"

  # Interval (in minutes) at which garbage collector jobs run
  sweep_interval:
    archive: 1
    search_result: 1

Summary by CodeRabbit

New Features
- Inject extra environment variables into container startup; add shared generated container config file and mount support; new helpers to construct common, credential and connection env lists.
Bug Fixes
- Load DB/queue/Redis credentials from environment when present; omit sensitive fields from serialized exports; improved error logging with tracebacks.
Documentation
- Added -c short alias and clarified --config help text.
Refactor
- Centralized container config dumping and env/credential handling; standardized generated config filename usage.

…ronment variables

coderabbitai · 2025-08-03T06:10:03Z

Warning

Rate limit exceeded

@junhaoliao has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 11 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 31342c3 and ca2db62.

📒 Files selected for processing (1)

components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)

Walkthrough

Adds shared generated-config file support and mounting, moves secret credentials to environment-variable flows with new env-name constants and load-from-env methods, extends container start to accept extra environment variables, centralizes config dumping, and updates many scripts and the CLPConfig model to use the new plumbing.

Changes

Cohort / File(s)	Change Summary
Config model & env constants `components/clp-py-utils/clp_py_utils/clp_config.py`	Added CLP_SHARED_CONFIG_FILENAME and CLP_*_ENV_VAR_NAME constants; Database/Queue/Redis gained dump/load-from-file and load-from-env methods; credential fields made optional; secrets excluded from dump_to_primitive_dict; CLPConfig.get_shared_config_file_path() and updated CLPConfig.dump_to_primitive_dict.
CLP Docker / config utilities `components/clp-package-utils/clp_package_utils/general.py`	Added `generated_config_file: Optional[DockerMount]` to `CLPDockerMounts`; generate_container_config now mounts shared config when missing; dump_container_config signature changed to accept config_filename and return host/container paths; added dump_shared_container_config and get_container_config_filename; generate_container_start_cmd accepts extra_env_vars and injects `-e KEY=VALUE`; added env helper functions (get_common_env_vars_list, get_credential_env_vars_list, get_celery_connection_env_vars_list) and formatting helpers.
Containerized script entrypoints `components/clp-package-utils/clp_package_utils/scripts/*` `.../archive_manager.py`, `compress.py`, `decompress.py`, `search.py`, `dataset_manager.py`	Use get_container_config_filename(container_name) for dumped config filename; construct extra_env_vars mapping DB creds to CLP_DB_* env-name constants; pass extra_env_vars as new 4th argument to generate_container_start_cmd; imports updated to use new constants/helpers.
Native (no-container) scripts `components/clp-package-utils/clp_package_utils/scripts/native/*` `archive_manager.py`, `compress.py`, `decompress.py`, `search.py`	After config validation, call `clp_config.database.load_credentials_from_env()` to load DB credentials from environment; native decompress now uses CLP_DB_*_ENV_VAR_NAME keys for subprocess env.
start_clp orchestration `components/clp-package-utils/clp_package_utils/scripts/start_clp.py`	Replaced per-container YAML dumps with shared generated config path (`container_clp_config.get_shared_config_file_path()`); add mounts.generated_config_file to mounts; centralized env-var construction via new helpers; call dump_shared_container_config; removed per-container YAML writes.
DB / init CLP scripts `components/clp-py-utils/clp_py_utils/*` `create-db-tables.py`, `initialize-clp-metadata-db.py`, `initialize-orchestration-db.py`	CLI: add `-c` alias for `--config`; switch to `pathlib.Path` and CLPConfig.parse_obj(...); call `clp_config.database.load_credentials_from_env()` after parsing; added explicit ValidationError/ValueError handling and adjusted imports to use CLPConfig.
Job orchestration (schedulers/executor/gc) `components/job-orchestration/...` `compression_scheduler.py`, `query_scheduler.py`, `compression_task.py`, `garbage_collector.py`	After CLPConfig parse, call `clp_config.database.load_credentials_from_env()`; unify ValidationError/ValueError handling; use logger.exception for generic failures; update executor env-var keys to CLP_DB_* constants; adjust startup logs to use component names.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Script
    participant CLPConfig
    participant OS
    participant Docker

    User->>Script: invoke CLI (start_clp / compress / search / decompress / native)
    Script->>CLPConfig: parse YAML -> CLPConfig
    Script->>OS: read env (CLP_DB_*, CLP_QUEUE_*, CLP_REDIS_*)
    OS-->>Script: env values
    Script->>CLPConfig: clp_config.database.load_credentials_from_env()
    alt shared config required
        Script->>OS: dump_shared_container_config / dump_container_config(config_filename)
        OS-->>Script: host config path
        Script->>Docker: ensure generated_config_file mount present
    end
    Script->>Script: build extra_env_vars (KEY=VALUE) + common env list
    Script->>Docker: generate_container_start_cmd(..., extra_env_vars)
    Docker-->>Script: container started with mounts and injected env

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat(package): Add dataset-manager scripts to support listing datasets, and deleting them entirely. #1144 — updates scripts to use the new container config filename helper and pass extra_env_vars at call sites (directly relevant to script adjustments).
feat(clp-package): Add support for more AWS authentication methods. #743 — touches generate_container_config and mounting logic (related to generated_config_file and mount handling).
feat(clp-package): Add support for uploading extracted streams to S3. #662 — modifies generate_container_config/start logic and may conflict with shared-config mount and dump changes.

Suggested reviewers

kirkrodrigues
haiqi96

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
732-752: Missing generated_config_dir mount in worker configuration.

The function sets CLP_CONFIG_PATH environment variable pointing to the generated config file but doesn't include mounts.generated_config_dir in the necessary mounts. This could cause the worker to fail accessing the config file.
     necessary_mounts = [
         mounts.clp_home,
         mounts.data_dir,
         mounts.logs_dir,
+        mounts.generated_config_dir,
     ]

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 581bd46 and 489476f.

📒 Files selected for processing (17)

components/clp-package-utils/clp_package_utils/general.py (8 hunks)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/compress.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/search.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (10 hunks)
components/clp-py-utils/clp_py_utils/clp_config.py (5 hunks)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1 hunks)
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (3 hunks)
components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (3 hunks)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1 hunks)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1 hunks)

🧰 Additional context used

🧠 Learnings (15)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

📚 Learning: in `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variabl...

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/native/search.py
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/native/search.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: in clp installation scripts, consistency across platform scripts is prioritized over defensive progr...

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: in `fs_compression_task.py`, when refactoring the archive processing loop in python, the `src_archiv...

Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: for installation scripts in the clp project, prefer explicit error handling over automatic dependenc...

Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/search.py

📚 Learning: in the current clp codebase implementation, dataset validation using validate_dataset() is performed...

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/search.py

📚 Learning: in the clp project, when reviewing cmakelists.txt changes that introduce new compression library dep...

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: in the c++ `filedecompressor` implementations at `components/core/src/clp/clp/filedecompressor.cpp` ...

Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: in the compression scheduler, the team prefers initializing in-memory caches from the database at st...

Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

📚 Learning: the `query_scheduler.py` file operates with single threading, and multithreading is not used....

Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py

📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...

Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py

📚 Learning: in the clp codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils...

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: lint-check (macos-15)

🔇 Additional comments (40)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

34-34: LGTM: Import addition supports configuration standardization.

The import of load_worker_config aligns with the PR's goal of standardizing configuration and credential delivery to containers.

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

120-126: LGTM: Improved security by passing database credentials via environment variables.

The addition of extra_env_vars containing database credentials and updating the generate_container_start_cmd call aligns with the PR's security improvements. This approach prevents credentials from being embedded in temporary configuration files.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

209-219: LGTM: Consistent credential handling standardization.

The addition of extra_env_vars for database credentials follows the same secure pattern implemented across other scripts. This ensures consistent and secure credential delivery to containers.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

105-131: LGTM: Secure credential handling in file extraction function.

The addition of extra_env_vars containing database credentials and updating the generate_container_start_cmd call ensures secure credential delivery for file extraction operations.

211-217: LGTM: Consistent credential handling in stream extraction function.

The implementation matches the pattern used in the file extraction function, ensuring consistent and secure credential handling across both extraction methods.

components/clp-py-utils/clp_py_utils/create-db-tables.py (1)

22-22: Confirm configuration-loading updates in create-db-tables.py

I see the new -c alias and updated help text (“CLP configuration file.”) on line 22; that part looks good. However, I wasn’t able to find any use of CLPConfig.parse_obj() or load_database_credentials_from_env() in this script—only the help text change is present. If you intended to expand the configuration-loading logic here, please verify those changes have been applied or update the summary accordingly.

components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

208-211: LGTM - Credential extraction for environment variables

The extraction of database credentials into environment variables aligns with the PR's goal to standardize credential delivery to containers.

231-231: LGTM - Environment variable injection into container start command

The modification to pass extra_env_vars to generate_container_start_cmd correctly implements the standardized approach for credential delivery to containers.

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (2)

3-3: LGTM - Import addition for environment variable access

The os module import is necessary for accessing environment variables in the credential override block.

192-197: LGTM - Environment variable override for database credentials

The implementation correctly overrides database credentials from environment variables and provides appropriate error handling for missing variables. This aligns with the PR's standardization of credential delivery.

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

4-4: LGTM - Import addition for environment variable access

The os module import is necessary for accessing environment variables in the credential override logic.

233-238: LGTM - Environment variable override for database credentials

The implementation correctly overrides database credentials from environment variables with proper error handling. This follows the established pattern across native scripts for credential standardization.

components/clp-package-utils/clp_package_utils/scripts/native/search.py (2)

7-7: LGTM - Import addition for environment variable access

The os module import enables access to environment variables for the credential override functionality.

301-306: LGTM - Environment variable override for database credentials

The implementation correctly loads database credentials from environment variables with appropriate error handling. This is consistent with the pattern used across other native scripts in this PR.

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)

1136-1137: LGTM - Standardized configuration loading with environment variable credentials

The change to use CLPConfig.load_from_generated_config_file() and load_database_credentials_from_env() correctly implements the new standardized approach for configuration and credential loading.

1138-1143: LGTM - Improved exception handling

The exception handling is now more specific, catching ValidationError and KeyError separately from generic exceptions, which provides better error reporting and aligns with the new configuration loading approach.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

406-407: LGTM! Improved configuration loading pattern.

The switch to CLPConfig.load_from_generated_config_file() and environment-based credential loading aligns well with the PR's goal to standardize configuration delivery and enhance security by separating credentials from config files.

408-414: Enhanced error handling with proper exception categorization.

Good improvement in exception handling - catching specific ValidationError and KeyError exceptions separately from generic exceptions, with appropriate logging for debugging purposes.

components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (2)

42-42: Consistent command-line interface improvement.

Adding the -c short option alongside --config improves usability and maintains consistency with other scripts in the codebase.

48-56: Excellent standardization of configuration loading.

The shift from parsing only the Database section to parsing the full CLPConfig with environment variable credential overrides aligns perfectly with the PR's security and standardization objectives. The error handling appropriately catches validation and configuration errors.

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2)

47-56: Consistent configuration standardization pattern.

This follows the same secure pattern established in other initialization scripts - using full CLPConfig parsing with environment variable credential overrides. The error handling is comprehensive and appropriate.

59-60: Proper configuration object usage.

Using clp_config.database directly after the unified configuration loading is the correct approach and maintains consistency with the new configuration management pattern.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (2)

122-127: Secure credential override implementation.

The environment variable credential override pattern enhances security by separating sensitive data from configuration files. The error handling appropriately catches missing environment variables.

234-239: Consistent credential override pattern.

Good consistency - the same environment variable credential override pattern is applied in both extraction functions, ensuring uniform security handling throughout the script.

components/clp-py-utils/clp_py_utils/clp_config.py (6)

612-612: Well-designed generated config file path.

The generated_config_file_path field provides a centralized location for container configuration files, supporting the PR's goal of standardizing configuration delivery.

624-626: Proper path resolution for generated config.

The generated config file path is correctly made absolute, ensuring consistent path handling across different execution contexts.

765-771: Secure environment variable credential loading.

The database credential loading method properly validates that required environment variables are present and provides clear error messages for missing credentials.

772-778: Consistent queue credential handling.

The queue credential loading follows the same secure pattern as the database credentials, maintaining consistency across different service credentials.

779-784: Complete credential environment variable support.

The Redis credential loading completes the environment variable support for all major service credentials, enabling full separation of sensitive data from configuration files.

799-808: Enhanced security in configuration export.

Excellent security improvement - removing sensitive credential fields and the generated config file path from exported configuration prevents accidental exposure of sensitive information.

components/clp-package-utils/clp_package_utils/general.py (5)

100-100: LGTM! Consistent with existing mount attributes.

The addition of generated_config_dir follows the established pattern for optional Docker mounts in the class.

286-297: LGTM! Mount handling follows established patterns.

The logic correctly checks if the path is already mounted before creating a new mount, consistent with how other paths are handled.

403-407: LGTM! Node-specific path handling is consistent.

The modification correctly makes the generated config file path node-specific by including the hostname, following the same pattern as data and logs directories.

623-676: LGTM! Well-structured centralized environment variable generation.

The function effectively centralizes environment variable generation with clear flag-based controls. The implementation correctly formats credentials and connection strings.

Note: Ensure that Docker logs and process listings are properly secured since credentials are passed via environment variables.

333-336: No compatibility issue with dict[str, str]
The project’s Poetry configuration specifies python = ">= 3.9" (in clp-package-utils, as well as other modules), so built-in generics like dict[str, str] are fully supported. No changes needed.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (5)

244-252: LGTM! Proper credential handling for database table creation.

The function correctly uses the centralized environment variable generation with database credentials and includes all necessary mounts.

602-610: LGTM! Scheduler configuration properly centralized.

The function correctly generates environment variables with Celery connection parameters and includes component-specific variables for logging.

998-1012: LGTM! Reducer configuration includes all necessary credentials.

The function correctly generates environment variables with database, queue, and Redis credentials, and properly includes the generated config mount.

925-938: LGTM! WebUI configuration properly handles database credentials.

The function correctly generates environment variables with database credentials and web-specific variables. The absence of generated_config_dir mount appears intentional as WebUI uses settings.json files rather than the generated config.

1146-1148: LGTM! Config generation properly centralized.

The config file is correctly generated once before starting any components, ensuring all containers use the same configuration file as intended by the PR objectives.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 489476f and 0e3f412.

📒 Files selected for processing (3)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (10 hunks)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1 hunks)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1 hunks)

🧰 Additional context used

🧠 Learnings (7)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

📚 Learning: in the compression scheduler, the team prefers initializing in-memory caches from the database at st...

Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...

Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py

📚 Learning: the `query_scheduler.py` file operates with single threading, and multithreading is not used....

Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py

📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

🧬 Code Graph Analysis (2)

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

load_database_credentials_from_env (765-770)

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

load_database_credentials_from_env (765-770)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: lint-check (ubuntu-24.04)
GitHub Check: lint-check (macos-15)

🔇 Additional comments (10)

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

1137-1137: LGTM! Credential loading from environment variables implemented correctly.

The addition of clp_config.load_database_credentials_from_env() aligns with the PR objective to pass credentials via environment variables instead of configuration files.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

407-407: LGTM! Credential loading from environment variables implemented correctly.

The addition of clp_config.load_database_credentials_from_env() aligns with the PR objective to pass credentials via environment variables instead of configuration files.

412-412: Good improvement! Enhanced error logging with stack traces.

Changing from logger.error(ex) to logger.exception(ex) provides valuable stack trace information for debugging generic exceptions.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7)

50-51: LGTM! Import additions align with configuration standardization.

The new imports for dump_container_config and generate_common_environment_variables support the centralized configuration management approach.

244-261: Well-implemented credential separation for database table creation.

The changes correctly:

Pass database credentials via environment variables using the standardized utility

Include the generated config directory in mounts

Reference the centralized configuration file

This improves security by avoiding credential exposure through temporary files.

602-629: Excellent standardization of scheduler configuration handling.

The refactoring correctly:

Centralizes environment variable generation with appropriate credential flags

Maintains all necessary environment variables (logs directory, logging level)

Adds the generated config directory to mounts

Uses the pre-generated configuration file path

This aligns perfectly with the PR's goal of standardizing configuration delivery.

733-748: Verify if the generated config directory mount is needed for workers.

The environment variable generation is correctly implemented. However, I notice that unlike other components, the mounts.generated_config_dir is not added to necessary_mounts.

Since the config path is passed via CLP_CONFIG_PATH environment variable, please confirm whether workers need direct access to the generated config directory.

926-935: Appropriate credential handling for WebUI component.

The changes correctly pass database credentials via environment variables while maintaining the WebUI-specific configuration approach through settings.json files.

999-1022: Comprehensive credential separation for reducer component.

Excellent implementation that:

Passes all required credentials (database, queue, redis) via environment variables

Includes the generated config directory in mounts

Uses the centralized configuration file

This provides a secure and standardized approach for the reducer component.

1147-1149: Perfect placement of centralized configuration generation.

The dump_container_config call is correctly positioned after all validations and directory creation, ensuring a single configuration file is generated before any components start. This is the cornerstone of the configuration standardization effort.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py

haiqi96

Best effort review for now

components/clp-package-utils/clp_package_utils/general.py

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py

components/clp-py-utils/clp_py_utils/clp_config.py

…ntainer_environment_variables

… of empty dictionary

…ile'

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (3)

components/clp-py-utils/clp_py_utils/clp_config.py (2)
611-611: Consider using a getter method instead of a member variable.

Based on past feedback, having generated_config_file_path as a member variable allows users to potentially overwrite the file path, which may not be desirable. Consider implementing this as a getter method instead:
def get_generated_config_file_path(self) -> pathlib.Path:
    return self.logs_directory / ".clp-config.yml"
This approach would ensure the path is always computed relative to the current logs_directory and prevents unwanted modifications.

765-783: Consider creating a reusable helper method for environment variable loading.

The three methods have similar structure with repeated try-catch blocks for environment variable loading. As suggested in past feedback, consider extracting this into a reusable helper method:
def _get_env_var(self, name: str) -> str:
    try:
        return os.environ[name]
    except KeyError as ex:
        raise ValueError(f"Missing environment variable: {ex}")
Then refactor the methods:
def load_database_credentials_from_env(self):
    self.database.username = self._get_env_var("CLP_DB_USER")
    self.database.password = self._get_env_var("CLP_DB_PASS")
This reduces code duplication and improves maintainability.
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

121-127: Extract duplicated credential override logic to utils.

The logic for overriding database credentials from environment variables is duplicated between handle_extract_stream_cmd and handle_extract_file_cmd. As previously discussed, this should be extracted into a helper function in native/utils.py.

Also applies to: 233-239

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05283d5 and 00ceeb8.

📒 Files selected for processing (8)

components/clp-package-utils/clp_package_utils/general.py (9 hunks)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (11 hunks)
components/clp-py-utils/clp_py_utils/clp_config.py (5 hunks)

🧰 Additional context used

🧠 Learnings (13)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

📚 Learning: 2024-11-15T16:21:52.122Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2025-07-07T17:41:15.655Z

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2024-11-18T16:49:20.248Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2025-07-01T14:51:19.172Z

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2025-07-23T09:54:45.185Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2024-12-05T16:32:21.507Z

Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2025-07-03T20:10:43.789Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: 2025-07-03T12:58:18.407Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: 2025-07-28T08:33:57.487Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

📚 Learning: 2024-10-10T05:50:15.348Z

Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/SchemaTreeNode.hpp:0-0
Timestamp: 2024-10-10T05:50:15.348Z
Learning: In this codebase, assertions are not used; the project policy is to avoid adding assertions in the code.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-07-25T21:29:48.947Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1126
File: .gitignore:5-5
Timestamp: 2025-07-25T21:29:48.947Z
Learning: In the CLP project, the .clang-format file is maintained in the yscope-dev-utils submodule and copied over to the main CLP repository, so it should be ignored in .gitignore to prevent accidental commits of the copied file and maintain the single source of truth in the submodule.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

🧬 Code Graph Analysis (6)

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (330-366)

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (330-366)

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (330-366)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (330-366)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

components/clp-py-utils/clp_py_utils/core.py (1)

make_config_path_absolute (42-53)

components/clp-package-utils/clp_package_utils/general.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (589-809)

🔇 Additional comments (7)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

624-626: LGTM!

The generated config file path is correctly made absolute using the existing make_config_path_absolute utility function, which is consistent with how other paths are handled in this method.

798-808: LGTM!

The sensitive credential removal is correctly implemented. Excluding generated_config_file_path from the exported configuration and removing all credential fields (username, password for database/queue, and username for redis) ensures no sensitive information leaks through configuration exports.

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

121-127: LGTM!

The database credentials are correctly passed as environment variables to the container. This implementation aligns with the PR objective of separating credentials from configuration files and follows the same pattern used in other scripts (archive_manager.py, compress.py, decompress.py).

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

214-220: LGTM!

The database credentials are correctly passed as environment variables to the container start command. This follows the same consistent pattern implemented across other scripts and aligns with the PR objective of externalizing sensitive credentials from configuration files.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

126-132: LGTM!

The database credentials are correctly passed as environment variables in the file extraction command. The implementation follows the same pattern used consistently across other scripts.

211-218: LGTM!

The database credentials are correctly passed as environment variables in the stream extraction command. This maintains consistency with the file extraction function and other scripts in the codebase.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

225-231: LGTM!

The database credentials are correctly passed as environment variables to the container start command. This implementation is consistent with the pattern applied across all other scripts (search.py, archive_manager.py, decompress.py) and successfully achieves the PR objective of externalizing sensitive credentials from configuration files.

components/clp-package-utils/clp_package_utils/general.py

coderabbitai · 2025-08-07T23:37:11Z

components/clp-package-utils/clp_package_utils/general.py

+    if not is_path_already_mounted(
+        clp_home,
+        CONTAINER_CLP_HOME,
+        clp_config.generated_config_file_path,
+        container_clp_config.generated_config_file_path,
+    ):
+        docker_mounts.generated_config_dir = DockerMount(
+            DockerMountType.BIND,
+            clp_config.generated_config_file_path,
+            container_clp_config.generated_config_file_path,
+        )
+


⚠️ Potential issue

Mount the parent directory of the config file, not the file itself.

The code attempts to mount generated_config_file_path which is a file path, but Docker requires mounting directories. You should mount the parent directory instead.

Apply this diff to fix the mounting issue:

if not is_path_already_mounted( clp_home, CONTAINER_CLP_HOME, - clp_config.generated_config_file_path, - container_clp_config.generated_config_file_path, + clp_config.generated_config_file_path.parent, + container_clp_config.generated_config_file_path.parent, ): docker_mounts.generated_config_dir = DockerMount( DockerMountType.BIND, - clp_config.generated_config_file_path, - container_clp_config.generated_config_file_path, + clp_config.generated_config_file_path.parent, + container_clp_config.generated_config_file_path.parent, )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if not is_path_already_mounted(

clp_home,

CONTAINER_CLP_HOME,

clp_config.generated_config_file_path,

container_clp_config.generated_config_file_path,

):

docker_mounts.generated_config_dir = DockerMount(

DockerMountType.BIND,

clp_config.generated_config_file_path,

container_clp_config.generated_config_file_path,

)

if not is_path_already_mounted(

clp_home,

CONTAINER_CLP_HOME,

clp_config.generated_config_file_path.parent,

container_clp_config.generated_config_file_path.parent,

):

docker_mounts.generated_config_dir = DockerMount(

DockerMountType.BIND,

clp_config.generated_config_file_path.parent,

container_clp_config.generated_config_file_path.parent,

)

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/general.py around lines 275 to 286, the code mounts the config file path directly, but Docker requires mounting directories. Modify the mount source and target paths to use the parent directory of the config file paths instead of the file paths themselves, ensuring the mount points are directories.

We are using --mount which seems to work for file.

@haiqi96 You're right—Docker bind mounts support files with --mount as long as the source exists. No change is required to switch to the parent directory.

Two small hardening tweaks you might consider:

Make the config bind read-only to avoid accidental writes:

pass is_read_only=True when constructing the DockerMount for the generated config file.

Optionally skip the file-level mount if the logs directory is already mounted (the file will be available via that directory), to avoid redundant mounts.

Otherwise, keeping the file mount is fine given dump_container_config runs before container start.

🧠 Learnings used

Learnt from: junhaoliao PR: y-scope/clp#1152 File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613 Timestamp: 2025-08-08T06:59:42.409Z Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Learnt from: Bill-hbrhbr PR: y-scope/clp#1122 File: components/core/src/clp/clp/CMakeLists.txt:175-195 Timestamp: 2025-07-23T09:54:45.185Z Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

coderabbitai · 2025-08-07T23:37:11Z

components/clp-package-utils/clp_package_utils/general.py

+def generate_common_environment_variables(
+    include_clp_home=True,
+) -> List[str]:


💡 Verification agent

🧩 Analysis chain

Function signature mismatch with usage in start_clp.py.

The function generate_common_environment_variables doesn't accept container_clp_config as a parameter, but it's being called with this argument at line 1002 in start_clp.py. This will cause a TypeError at runtime.

🏁 Script executed:

#!/bin/bash # Verify all calls to generate_common_environment_variables rg -A 2 "generate_common_environment_variables\(" --type py

Length of output: 2317

We should confirm the exact locations in start_clp.py where the incorrect calls occur so we can update them precisely. Let’s list those lines with context:

🏁 Script executed:

#!/bin/bash rg -n -C3 "generate_common_environment_variables" components/clp-package-utils/clp_package_utils/scripts/start_clp.py

Length of output: 1838

Fix incorrect calls to generate_common_environment_variables in start_clp.py.

The function signature in components/clp-package-utils/clp_package_utils/general.py is

def generate_common_environment_variables( include_clp_home: bool = True, ) -> List[str]: ...

It does not accept a container_clp_config argument, yet in start_clp.py it’s being called like this:

• start_clp.py:247
*generate_common_environment_variables(container_clp_config),

• start_clp.py:608
*generate_common_environment_variables(container_clp_config),

• start_clp.py:929
*generate_common_environment_variables(container_clp_config),

• start_clp.py:1002
*generate_common_environment_variables(container_clp_config, include_clp_home=True),

To resolve the TypeError at runtime, update these invocations so they only pass the include_clp_home flag (or rely on its default), for example:

- *generate_common_environment_variables(container_clp_config), + *generate_common_environment_variables(), - *generate_common_environment_variables(container_clp_config, include_clp_home=True), + *generate_common_environment_variables(include_clp_home=True),

Please apply these changes at the locations above.

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/general.py at lines 622 to 624, the function generate_common_environment_variables only accepts the include_clp_home boolean argument. In start_clp.py at lines 247, 608, 929, and 1002, calls to this function incorrectly pass container_clp_config as an argument. To fix the TypeError, update these calls to remove container_clp_config and only pass include_clp_home if needed, or rely on its default value.

Can you check the code again? i don't think we are calling generate_common_environment_variables with container_clp_config

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -e echo "== Definition ==" rg -n -C1 $'^def generate_common_environment_variables\\(' components echo "" echo "== All call sites ==" rg -n -C3 "generate_common_environment_variables\\(" --type py echo "" echo "== start_clp.py call sites ==" fd start_clp.py components | xargs -I{} rg -n -C3 "generate_common_environment_variables\\(" {}

Length of output: 6350

@coderabbit-user Thanks, @haiqi96 — I rechecked the code with a repo-wide search. All calls to generate_common_environment_variables are correct; none passes container_clp_config.

Verified call sites in start_clp.py:

Line 246: *generate_common_environment_variables()

Line 607: *generate_common_environment_variables()

Line 738: *generate_common_environment_variables(include_clp_home=True)

Line 930: *generate_common_environment_variables()

Line 1005: *generate_common_environment_variables(include_clp_home=True)

Conclusion: No TypeError risk here. The earlier note was incorrect; no changes needed.

🧬 Code Graph Analysis Results

Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 313-346):

def generate_common_environment_variables( include_clp_home=True, ) -> List[str]: """ Generate a list of common environment variables for Docker containers. :param include_clp_home: :return: List of environment variable strings in the format "KEY=VALUE". """ clp_site_packages_dir = CONTAINER_CLP_HOME / "lib" / "python3" / "site-packages" env_vars = [f"PYTHONPATH={clp_site_packages_dir}"] if include_clp_home: env_vars.append(f"CLP_HOME={CONTAINER_CLP_HOME}") return env_vars

This function generate_common_environment_variables does not take any container_clp_config parameter. It only optionally includes the CLP_HOME environment variable and always includes PYTHONPATH.

Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 348-374):

def generate_credential_environment_variables( container_clp_config: CLPConfig, include_db_credentials=False, include_queue_credentials=False, include_redis_credentials=False, ) -> List[str]: """ Generate a list of credential environment variables for Docker containers. :param container_clp_config: :param include_db_credentials: :param include_queue_credentials: :param include_redis_credentials: :return: List of environment variable strings in the format "KEY=VALUE". """ env_vars = [] if include_db_credentials: env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}") if include_queue_credentials: env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}") env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}") if include_redis_credentials: env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}") return env_vars

This function generate_credential_environment_variables explicitly takes container_clp_config as a parameter and uses it to generate environment variables related to credentials.

Summary of generate_container_config function (lines 180-236):

This function takes a clp_config and clp_home and returns a tuple (container_clp_config, docker_mounts) where:

container_clp_config is a deep copy of clp_config with paths adjusted for container usage.

docker_mounts is an instance of CLPDockerMounts containing DockerMount objects for various directories.

This function prepares the container configuration but does not call generate_common_environment_variables.

Summary of generate_container_start_cmd function (lines 270-300):

This function generates the Docker run command line as a list of strings. It accepts:

container_name

container_mounts (list of DockerMount or None)

container_image

extra_env_vars (optional dictionary of environment variables)

It sets some environment variables like PYTHONPATH and CLP_HOME internally, but it does not call generate_common_environment_variables with a container_clp_config argument.

Summary of generate_common_environment_variables usage:

The function generate_common_environment_variables does not accept a container_clp_config argument.

It only optionally includes CLP_HOME and always includes PYTHONPATH.

Credential environment variables are generated separately by generate_credential_environment_variables which does accept container_clp_config.

Summary of CLPConfig class (from clp_py_utils/clp_config.py):

CLPConfig is a configuration class with attributes like database, queue, redis, etc.

It has methods to load credentials from files or environment variables.

It is used as the type for container_clp_config in the environment variable generation functions.

These snippets and summaries provide context on the function signatures and usage of generate_common_environment_variables and related functions in the file general.py. They clarify that generate_common_environment_variables is not called with container_clp_config anywhere in this file.

🧠 Learnings used

Learnt from: haiqi96 PR: y-scope/clp#594 File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65 Timestamp: 2024-11-18T16:49:20.248Z Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Learnt from: junhaoliao PR: y-scope/clp#1152 File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613 Timestamp: 2025-08-08T06:59:42.409Z Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Learnt from: Bill-hbrhbr PR: y-scope/clp#1122 File: components/core/src/clp/clp/CMakeLists.txt:175-195 Timestamp: 2025-07-23T09:54:45.185Z Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

… path

…cutor.utils

…unctions

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (6)

components/clp-package-utils/clp_package_utils/general.py (6)

99-99: Nit: align typing style (use Optional instead of typing.Optional)

You already import Optional; use it here for consistency.

Apply this diff:

-        self.generated_config_file: typing.Optional[DockerMount] = None
+        self.generated_config_file: Optional[DockerMount] = None

290-301: Make generated config bind mount read-only

The generated config should not be modified by containers. Mark the mount as read-only.

Apply this diff:

-        docker_mounts.generated_config_file = DockerMount(
-            DockerMountType.BIND,
-            clp_config.get_generated_config_file_path(),
-            container_clp_config.get_generated_config_file_path(),
-        )
+        docker_mounts.generated_config_file = DockerMount(
+            DockerMountType.BIND,
+            clp_config.get_generated_config_file_path(),
+            container_clp_config.get_generated_config_file_path(),
+            True,
+        )

391-397: Security: avoid leaking secrets via CLI env “-e KEY=VALUE”

Passing credentials via “-e KEY=VALUE” exposes them in process lists, shell history, and build logs. Prefer emitting only keys in the Docker CLI and pass the actual values via subprocess.run(..., env=...) so values don’t appear on the command line.

Two-step hardening (minimal surface change here, update call sites accordingly):

Build docker cmd with “-e KEY” only.
At container start, merge os.environ with the env map and pass it via subprocess’s env parameter.

Apply this diff for the builder:

-    env_vars = {
-        "PYTHONPATH": clp_site_packages_dir,
-        **(extra_env_vars if extra_env_vars is not None else {}),
-    }
-    for key, value in env_vars.items():
-        container_start_cmd.extend(["-e", f"{key}={value}"])
+    env_vars = {
+        "PYTHONPATH": str(clp_site_packages_dir),
+        **(extra_env_vars if extra_env_vars is not None else {}),
+    }
+    for key in env_vars.keys():
+        container_start_cmd.extend(["-e", key])

Follow-up: I can generate the required call-site changes (merging env dicts and passing env=...) across scripts if you’d like.

681-690: Defensive checks: avoid emitting None-valued credentials into env

If credentials aren’t loaded, this can produce “KEY=None” entries. Raise early with a clear message.

Apply this diff:

-    if include_db_credentials:
-        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
-        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
+    if include_db_credentials:
+        if (
+            container_clp_config.database.username is None
+            or container_clp_config.database.password is None
+        ):
+            raise ValueError("Database credentials not loaded.")
+        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
+        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
@@
-    if include_queue_credentials:
-        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
-        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
+    if include_queue_credentials:
+        if (
+            container_clp_config.queue.username is None
+            or container_clp_config.queue.password is None
+        ):
+            raise ValueError("Queue credentials not loaded.")
+        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
+        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
@@
-    if include_redis_credentials:
-        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
+    if include_redis_credentials:
+        if container_clp_config.redis.password is None:
+            raise ValueError("Redis password not loaded.")
+        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")

362-377: Nit: clarify docstring about extra_env_vars and reserved keys

Document that PYTHONPATH is pre-set and that callers should avoid overriding container-managed keys unless intentionally doing so.

Apply this diff:

-    :param extra_env_vars: Environment variables to set on top of the predefined ones.
+    :param extra_env_vars: Environment variables to set on top of the predefined ones.
+                           Note: PYTHONPATH is set by this function; avoid overriding unless
+                           intentional. Prefer not to pass secrets here if command-line visibility
+                           is a concern; pass them via subprocess env instead.

701-709: URL-encode credentials in Celery URLs; handle optional Redis auth

Raw credentials can break URLs if they contain reserved characters. Also omit the “:password@” segment when Redis has no password.

Apply this diff:

-    env_vars = [
-        f"BROKER_URL=amqp://"
-        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
-        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
-        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
-        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
-        f"{container_clp_config.redis.query_backend_database}",
-    ]
+    from urllib.parse import quote
+    q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="")
+    q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="")
+    broker = f"amqp://{q_user}:{q_pass}@{container_clp_config.queue.host}:{container_clp_config.queue.port}"
+
+    r_pass = container_clp_config.redis.password
+    redis_db = container_clp_config.redis.query_backend_database
+    if r_pass:
+        backend = (
+            "redis://default:"
+            f"{quote(r_pass, safe='')}@{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+        )
+    else:
+        backend = f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+
+    env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4124b60 and 3fe8368.

📒 Files selected for processing (1)

components/clp-package-utils/clp_package_utils/general.py (7 hunks)

🧰 Additional context used

🧠 Learnings (5)

📚 Learning: 2025-07-23T09:54:45.185Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-09T04:07:27.083Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2024-11-18T16:49:20.248Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-08T06:59:42.436Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-06-02T18:22:24.060Z

Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

🧬 Code Graph Analysis (1)

components/clp-package-utils/clp_package_utils/general.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (14)

get_generated_config_file_path (925-926)

dump_to_primitive_dict (249-250)

dump_to_primitive_dict (361-362)

dump_to_primitive_dict (456-457)

dump_to_primitive_dict (552-553)

dump_to_primitive_dict (569-572)

dump_to_primitive_dict (601-604)

dump_to_primitive_dict (702-705)

dump_to_primitive_dict (724-727)

dump_to_primitive_dict (934-950)

dump_to_primitive_dict (962-970)

load_credentials_from_file (252-262)

load_credentials_from_file (364-373)

load_credentials_from_file (459-469)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (5)

components/clp-package-utils/clp_package_utils/general.py (5)

12-12: LGTM: typing imports expanded appropriately

Importing Dict, List, Optional, Tuple is fine and used below.

17-18: LGTM: CLP_GENERATED_CONFIG_FILENAME import

Matches new shared-config workflow.

343-347: LGTM: thin wrapper for per-container config dump

Clear and reuses shared dump logic.

349-360: LGTM: shared generated-config dumper

Docstring and behaviour align with the new “single generated config file” approach.

475-490: LGTM: switch to component-level load_credentials_from_file

Consistent with refactor centralizing credential loading on the sub-configs.

coderabbitai · 2025-08-19T13:47:53Z

components/clp-package-utils/clp_package_utils/general.py

+def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str):
    """
-    Writes the given config to the logs directory so that it's accessible in the container.
+    Writes the given container config to the logs directory, so that it's accessible in the
+    container.
+
    :param container_clp_config: The config to write.
    :param clp_config: The corresponding config on the host (used to determine the logs directory).
-    :param container_name:
+    :param config_filename:
    :return: The path to the config file in the container and on the host.
    """
-    container_config_filename = f".{container_name}-config.yml"
-    config_file_path_on_host = clp_config.logs_directory / container_config_filename
-    config_file_path_on_container = container_clp_config.logs_directory / container_config_filename
+    config_file_path_on_host = clp_config.logs_directory / config_filename
+    config_file_path_on_container = container_clp_config.logs_directory / config_filename
    with open(config_file_path_on_host, "w") as f:
        yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)

    return config_file_path_on_container, config_file_path_on_host


🛠️ Refactor suggestion

Ensure directory exists before writing config; add return type

Prevent failures when the node-specific logs directory hasn’t been created yet. Also, the function returns a tuple—add its annotation.

Apply this diff:

-def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str): +def dump_config( + container_clp_config: CLPConfig, + clp_config: CLPConfig, + config_filename: str, +) -> Tuple[pathlib.Path, pathlib.Path]: @@ - with open(config_file_path_on_host, "w") as f: + config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True) + with open(config_file_path_on_host, "w") as f: yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str):

"""

Writes the given config to the logs directory so that it's accessible in the container.

Writes the given container config to the logs directory, so that it's accessible in the

container.

:param container_clp_config: The config to write.

:param clp_config: The corresponding config on the host (used to determine the logs directory).

:param container_name:

:param config_filename:

:return: The path to the config file in the container and on the host.

"""

container_config_filename = f".{container_name}-config.yml"

config_file_path_on_host = clp_config.logs_directory / container_config_filename

config_file_path_on_container = container_clp_config.logs_directory / container_config_filename

config_file_path_on_host = clp_config.logs_directory / config_filename

config_file_path_on_container = container_clp_config.logs_directory / config_filename

with open(config_file_path_on_host, "w") as f:

yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)

return config_file_path_on_container, config_file_path_on_host

def dump_config(

container_clp_config: CLPConfig,

clp_config: CLPConfig,

config_filename: str,

) -> Tuple[pathlib.Path, pathlib.Path]:

"""

Writes the given container config to the logs directory, so that it's accessible in the

container.

:param container_clp_config: The config to write.

:param clp_config: The corresponding config on the host (used to determine the logs directory).

:param config_filename:

:return: The path to the config file in the container and on the host.

"""

config_file_path_on_host = clp_config.logs_directory / config_filename

config_file_path_on_container = container_clp_config.logs_directory / config_filename

config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)

with open(config_file_path_on_host, "w") as f:

yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)

return config_file_path_on_container, config_file_path_on_host

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/general.py around lines 325 to 340, ensure the target logs directory exists before opening the file and add an explicit return type annotation for the function; to fix, call mkdir(parents=True, exist_ok=True) on clp_config.logs_directory (or the parent directory of config_file_path_on_host) before opening the file to write, and update the function signature to include the return type -> Tuple[Path, Path] (and import Tuple and Path from typing/pathlib if not already imported).

if the parent path doesn't exist, it's probably fine to let it fail to show any implementation error during development

@junhaoliao That's a fair point about letting it fail fast to catch implementation errors during development. The explicit failure when the directory doesn't exist will indeed help identify setup issues early rather than masking them with automatic directory creation.

The return type annotation suggestion is still valid though, as it's separate from the directory creation behavior and improves type safety:

def dump_config( container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str, ) -> Tuple[pathlib.Path, pathlib.Path]:

✏️ Learnings added

Learnt from: junhaoliao PR: y-scope/clp#1152 File: components/clp-package-utils/clp_package_utils/general.py:0-0 Timestamp: 2025-08-19T14:41:28.856Z Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

components/clp-package-utils/clp_package_utils/general.py

…`dump_shared_container_config`

…ping functions

junhaoliao · 2025-08-19T14:41:26Z

@kirkrodrigues i just revalidated. it's ready for review again

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
123-151: Bug: DB credentials are not passed to the container

Unlike other scripts, this wrapper does not pass CLP_DB_USER/CLP_DB_PASS to the container, so native.dataset_manager inside the container likely cannot connect to the DB.

Apply this change locally (minimal fix consistent with current general.generate_container_start_cmd API):
@@
     container_name = generate_container_name("dataset-manager")
 
     container_clp_config, mounts = generate_container_config(clp_config, clp_home)
     generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
-        container_clp_config, clp_config, f".{container_name}-config.yml"
+        container_clp_config, clp_config, f".{container_name}-config.yml"
     )
@@
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container
-    )
+    # Pass DB credentials to the container
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
+    )
Outside this range, add the imports at the top:
from clp_py_utils.clp_config import CLP_DB_USER_ENV_VAR_NAME, CLP_DB_PASS_ENV_VAR_NAME
Optional hardening (recommended across the repo): avoid leaking secrets via Docker argv by emitting only “-e KEY” and injecting values via subprocess env; see related comments on archive_manager/compress/decompress and the generate_container_start_cmd refactor.

♻️ Duplicate comments (20)

components/clp-py-utils/clp_py_utils/clp_config.py (5)
252-263: Deprecate file-based DB credential loader to align with env-only policy

This PR’s objective is to pass credentials exclusively via environment variables for containers. Keep file-based loading only for non-container/legacy workflows and add a clear deprecation note to guide callers.

Apply:
     def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        """
+        DEPRECATED: Prefer environment variables via load_credentials_from_env().
+        Retained for non-container workflows only; slated for removal in a future release.
+        """
         config = read_yaml_config_file(credentials_file_path)
264-270: Fail fast on empty DB creds loaded from env

_get_env_var() guards missing vars, but empty strings still pass. Validate non-empty here to surface misconfig early.

Apply:
     def load_credentials_from_env(self):
         """
         :raise ValueError: if any expected environment variable is not set.
         """
-        self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME)
-        self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME)
+        self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME)
+        self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME)
+        if not self.username:
+            raise ValueError(f"{CLP_DB_USER_ENV_VAR_NAME} cannot be empty")
+        if not self.password:
+            raise ValueError(f"{CLP_DB_PASS_ENV_VAR_NAME} cannot be empty")
364-374: Deprecate file-based Redis credential loader (containers should use env)

Mark this as deprecated for parity with DB/Queue and to steer callers away from file-based secrets.

Apply:
     def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        """
+        DEPRECATED: Prefer environment variables via load_credentials_from_env().
+        Retained for non-container workflows only; slated for removal in a future release.
+        """
         config = read_yaml_config_file(credentials_file_path)
459-470: Deprecate file-based Queue credential loader

Same rationale as DB/Redis — containers should use env-vars.

Apply:
     def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        """
+        DEPRECATED: Prefer environment variables via load_credentials_from_env().
+        Retained for non-container workflows only; slated for removal in a future release.
+        """
         config = read_yaml_config_file(credentials_file_path)
471-477: Fail fast on empty Queue creds loaded from env

Prevent silent propagation of blank credentials.

Apply:
     def load_credentials_from_env(self):
         """
         :raise ValueError: if any expected environment variable is not set.
         """
-        self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME)
-        self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME)
+        self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME)
+        self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME)
+        if not self.username:
+            raise ValueError(f"{CLP_QUEUE_USER_ENV_VAR_NAME} cannot be empty")
+        if not self.password:
+            raise ValueError(f"{CLP_QUEUE_PASS_ENV_VAR_NAME} cannot be empty")
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
225-231: Security: DB credentials leak via Docker CLI (-e KEY=VALUE)

Passing secrets as “-e KEY=VALUE” exposes them via ps and error logs. Emit only names on the CLI and inject values via subprocess env.

Apply within this file after updating general.generate_container_start_cmd to accept names only:
-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd: List[str] = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    # Provide only names to the Docker CLI; inject values via subprocess env.
+    container_start_cmd: List[str] = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )
Then execute with injected env (outside this range):
import os  # ensure imported at top
# ...
proc = subprocess.run(cmd, env={**os.environ, **extra_env_vars})
Additionally, avoid logging full command lines that include env values.
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
227-233: Security: DB credentials leak via Docker CLI (-e KEY=VALUE)

Same issue: secrets are exposed via argv. Switch to passing only names on the CLI and values via subprocess env.

Apply here once general.generate_container_start_cmd is refactored to take env var names:
-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )
Outside this range:
import os  # at top if missing
# ...
subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
128-134: Security: DB credentials leak via Docker CLI (-e KEY=VALUE) in file extraction path

Do not embed secret values in argv; pass only names on the CLI and inject values via subprocess env.

Apply after refactoring general.generate_container_start_cmd:
-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )
Outside this range:
import os  # at top if missing
# ...
subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
201-210: Security: DB credentials leak via Docker CLI (-e KEY=VALUE) in stream extraction path

Same fix as above for the second path in this script.

Apply:
-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )
Outside this range:
# Ensure subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
components/clp-package-utils/clp_package_utils/scripts/search.py (1)

120-126: Security: Avoid exposing DB credentials via Docker CLI

Database credentials are currently exposed in the Docker command line through -e KEY=VALUE, making them visible via ps and in logs. This is the same security issue flagged in previous reviews.

The extra_env_vars dictionary contains sensitive database credentials that will be embedded in the Docker command. This needs to be fixed by either:

Passing only environment variable names to the Docker CLI (e.g., -e KEY) and providing values via subprocess environment

Using Docker secrets or config files for credential management

See the previous review comment for the detailed fix.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)

610-616: Scheduler logging level misconfiguration persists

The compression scheduler still incorrectly uses the query scheduler's logging level. This known issue is already tracked.

As previously identified and tracked in issue #1220, Line 615 sets CLP_LOGGING_LEVEL from clp_config.query_scheduler.logging_level for both schedulers.

739-763: Redis database parameter ignored in generic_start_worker

The redis_database parameter passed to generic_start_worker isn't used when calling get_celery_connection_env_vars_list, causing workers to always use the query backend database.

Line 741 calls get_celery_connection_env_vars_list(container_clp_config) without passing the redis_database parameter, so the RESULT_BACKEND will always point to the query backend database instead of the worker-specific one.

932-960: Security: WebUI exposes DB and AWS credentials on Docker CLI

Database credentials and AWS credentials are embedded in the Docker command line as KEY=VALUE pairs.

Lines 934 and 947-948 add credentials directly to the command line. These should be passed via subprocess environment variables instead.

1074-1093: Security: Garbage collector leaks credentials on Docker CLI

Database and AWS credentials are exposed in the Docker command line.

Line 1076 adds database credentials to the Docker command via get_credential_env_vars_list, exposing them in process listings. The same issue exists with AWS credentials on line 1091.

251-261: Security: DB credentials leak via Docker CLI in create_db_tables

Database credentials are being embedded directly in the Docker command line, exposing them via process listings and debug logs.

The env_vars list contains CLP_DB_USER=<value> and CLP_DB_PASS=<value> which will be visible in process listings. This needs to be fixed as noted in previous reviews.

610-623: Security: Credentials exposed on Docker CLI for schedulers

Database and AWS credentials are passed as -e KEY=VALUE on the Docker command line, making them visible via process listings.

Both database credentials (via get_credential_env_vars_list) and AWS credentials (when applicable) are exposed in the Docker command. This needs the same fix as other components.
components/clp-package-utils/clp_package_utils/general.py (4)
385-391: Security: Container start command embeds secrets in Docker CLI

The generate_container_start_cmd function adds environment variables as -e KEY=VALUE to the Docker command, exposing any secrets passed via extra_env_vars.

Lines 389-390 embed the full KEY=VALUE pairs in the Docker command line. To avoid leaking secrets:

Emit only -e KEY (without values) in the Docker command

Pass actual values via the env parameter of subprocess.run()

This requires updating all callers to provide environment variables through subprocess rather than the command line.

689-704: Celery connection always uses query backend database

The function hardcodes container_clp_config.redis.query_backend_database for the RESULT_BACKEND, preventing proper database isolation between compression and query operations.

Line 701 always uses query_backend_database. The function needs to accept a redis_database parameter to allow callers to specify the correct database for each scheduler/worker type.

675-677: Add defensive checks for database credentials

The function should verify credentials are loaded before using them to avoid propagating None values.

Add validation before using credentials:
 if include_db_credentials:
+    container_clp_config.database.ensure_credentials_loaded()
     env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
     env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
335-338: Ensure parent directory exists before writing config file

The function writes to clp_config.logs_directory / config_filename without ensuring the directory exists.

Create the parent directory before writing:
 config_file_path_on_host = clp_config.logs_directory / config_filename
 config_file_path_on_container = container_clp_config.logs_directory / config_filename
+config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
 with open(config_file_path_on_host, "w") as f:
     yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3fe8368 and ae8f581.

📒 Files selected for processing (8)

components/clp-package-utils/clp_package_utils/general.py (7 hunks)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/compress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (4 hunks)
components/clp-package-utils/clp_package_utils/scripts/search.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (13 hunks)
components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)

🧰 Additional context used

🧠 Learnings (6)

📚 Learning: 2025-08-13T14:48:49.020Z

Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py

📚 Learning: 2025-07-23T09:54:45.185Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2025-08-09T04:07:27.083Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2024-11-18T16:49:20.248Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-08T06:59:42.436Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2025-06-02T18:22:24.060Z

Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

🧬 Code Graph Analysis (7)

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

components/clp-package-utils/clp_package_utils/general.py (2)

DockerMount (66-85)

generate_container_start_cmd (356-397)

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (356-397)

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (356-397)

components/clp-package-utils/clp_package_utils/general.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (4)

get_shared_config_file_path (925-926)

load_credentials_from_file (252-262)

load_credentials_from_file (364-373)

load_credentials_from_file (459-469)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

generate_container_start_cmd (356-397)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

components/clp-py-utils/clp_py_utils/core.py (2)

read_yaml_config_file (56-62)

get_config_value (25-39)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)

components/clp-package-utils/clp_package_utils/general.py (4)

dump_shared_container_config (343-353)

get_celery_connection_env_vars_list (689-704)

get_credential_env_vars_list (659-686)

get_common_env_vars_list (642-656)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

get_shared_config_file_path (925-926)

StorageType (107-109)

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

generate_container_auth_options (101-177)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (13)

components/clp-py-utils/clp_py_utils/clp_config.py (9)

84-84: Good: shared generated-config filename is centralized

Defining CLP_SHARED_CONFIG_FILENAME avoids string drift across call sites.

88-94: Good: env-var names hoisted to constants

Centralizing these reduces typos and eases reuse across scripts and services.

249-251: Good: succinct secret stripping in DB dump

Using Pydantic’s exclude keeps credentials out of the serialized config.

361-363: Good: Redis secret excluded from dump

Keeps password out of generated/shared config.

375-380: Clarify Redis password policy vs. implementation

Model marks password Optional[str] = None, but loader requires CLP_REDIS_PASS to be set. Confirm intent:

If Redis auth is mandatory in container flows, keep as-is and update docstrings elsewhere accordingly.

If password is truly optional, consider allowing unset env without raising and rely on None to mean “no-auth”.

Would you like me to patch load_credentials_from_env to treat a missing CLP_REDIS_PASS as unauthenticated Redis?

453-458: Good: Queue secrets excluded from dump

Avoids accidental leakage into generated config.

780-785: Good: simplified _get_env_var with os.getenv

This is concise and produces cleaner error messages than KeyError.

925-926: Good: shared-config path uses the declared constant

Avoids hardcoding ".clp-config.yml" in multiple places.

935-946: Good: maintainable top-level dump via excluded keys + loop

Reduces duplication and keeps serialization cohesive for nested configs.

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)

127-128: LGTM: dot-prefixed generated config filename

Matches the new convention for hidden, on-host generated config files.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

217-218: LGTM: dot-prefixed generated config filename

Consistent with the new hidden-on-host convention for generated configs.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

207-208: LGTM: dot-prefixed generated config filename

Consistent with the standardized approach for generated configs.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

104-105: LGTM: dot-prefixed generated config filename

Matches the new hidden filename convention and keeps host filesystem tidy.

coderabbitai · 2025-08-19T14:42:31Z

components/clp-package-utils/clp_package_utils/general.py

+    env_vars = [
+        f"BROKER_URL=amqp://"
+        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
+        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
+        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
+        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
+        f"{container_clp_config.redis.query_backend_database}",


🛠️ Refactor suggestion

URL-encode credentials in Celery connection strings

Credentials containing special characters (e.g., @, :, /) will break the connection URLs.

Apply URL encoding to credentials before embedding them in connection strings:

+from urllib.parse import quote def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]: """ :param container_clp_config: :return: A list of Celery connection environment variables for Docker containers, in the format "KEY=VALUE". """ + queue_user = quote(container_clp_config.queue.username, safe='') + queue_pass = quote(container_clp_config.queue.password, safe='') + redis_pass = quote(container_clp_config.redis.password, safe='') env_vars = [ f"BROKER_URL=amqp://" - f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" + f"{queue_user}:{queue_pass}@" f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", - f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" + f"RESULT_BACKEND=redis://default:{redis_pass}@" f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" f"{container_clp_config.redis.query_backend_database}", ]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

env_vars = [

f"BROKER_URL=amqp://"

f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"

f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",

f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"

f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"

f"{container_clp_config.redis.query_backend_database}",

from urllib.parse import quote

def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:

"""

:param container_clp_config:

:return: A list of Celery connection environment variables for Docker containers, in the format

"KEY=VALUE".

"""

queue_user = quote(container_clp_config.queue.username, safe='')

queue_pass = quote(container_clp_config.queue.password, safe='')

redis_pass = quote(container_clp_config.redis.password, safe='')

env_vars = [

f"BROKER_URL=amqp://"

f"{queue_user}:{queue_pass}@"

f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",

f"RESULT_BACKEND=redis://default:{redis_pass}@"

f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"

f"{container_clp_config.redis.query_backend_database}",

]

return env_vars

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/general.py around lines 695 to 701, the Celery BROKER_URL and RESULT_BACKEND strings embed raw usernames and passwords which will break when credentials contain special characters; URL-encode the queue username and password and the redis password (using urllib.parse.quote_plus or quote) before interpolating them into the connection URLs so special characters are percent-encoded, and update the env_vars construction to use the encoded credential variables.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
739-746: Workers: wire redis_database through to Celery env

generic_start_worker accepts redis_database but it isn’t used when building env vars; RESULT_BACKEND will default to the query DB. Pass it through:
-    env_vars = [
-        *get_common_env_vars_list(include_clp_home_env_var=True),
-        *get_celery_connection_env_vars_list(container_clp_config),
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=redis_database,
+        ),
         f"CLP_CONFIG_PATH={container_clp_config.get_shared_config_file_path()}",
         f"CLP_LOGS_DIR={container_logs_dir}",
         f"CLP_LOGGING_LEVEL={worker_config.logging_level}",
         f"CLP_WORKER_LOG_PATH={container_worker_log_path}",
     ]
components/clp-package-utils/clp_package_utils/general.py (1)
358-399: Optional hardening: avoid embedding env values in generated docker cmd

generate_container_start_cmd currently emits -e KEY=VALUE for all extras, which can leak secrets on the CLI. Consider emitting keys only and requiring callers to pass values via subprocess.run(env=...). If you adopt this, update docstrings accordingly and adjust call sites.

Example change:
-    if extra_env_vars is not None:
-        for key, value in extra_env_vars.items():
-            container_start_cmd.extend(["-e", f"{key}={value}"])
+    if extra_env_vars is not None:
+        for key in extra_env_vars.keys():
+            container_start_cmd.extend(["-e", key])

♻️ Duplicate comments (5)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

615-615: Compression scheduler logging level still taken from query scheduler (tracked)

CLP_LOGGING_LEVEL uses clp_config.query_scheduler.logging_level for both schedulers. This is the known misconfiguration tracked in your issue; leaving as-is if you prefer to fix in a follow-up.

If you want to patch now, replace with:
-        f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}",
+        f"CLP_LOGGING_LEVEL={(
+            clp_config.compression_scheduler.logging_level
+            if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+            else clp_config.query_scheduler.logging_level
+        )}",
610-616: RESULT_BACKEND should use the scheduler-specific Redis DB

The Celery result backend currently always targets the query DB. Compression scheduler should use its own DB to avoid mixing result sets.

Update the call to parameterise the Redis DB (paired with a small helper change in general.py):
-        *get_celery_connection_env_vars_list(container_clp_config),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=(
+                container_clp_config.redis.compression_backend_database
+                if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+                else container_clp_config.redis.query_backend_database
+            ),
+        ),

components/clp-package-utils/clp_package_utils/general.py (3)

99-99: Nit: prefer Optional over typing.Optional

You already import Optional; align to Optional[DockerMount] for consistency.

-        self.generated_config_file: typing.Optional[DockerMount] = None
+        self.generated_config_file: Optional[DockerMount] = None

325-336: Add return type to dump_container_config

The function returns a tuple of paths; annotate for clarity and type-safety.

-def dump_container_config(
-    container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str
-):
+def dump_container_config(
+    container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str
+) -> Tuple[pathlib.Path, pathlib.Path]:

661-689: Guard against missing credentials in env builder

Avoid emitting strings with “None” if a loader wasn’t called; add defensive checks.

-    if include_db_credentials:
-        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
-        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
+    if include_db_credentials:
+        if (
+            container_clp_config.database.username is None
+            or container_clp_config.database.password is None
+        ):
+            raise ValueError("Database credentials not loaded.")
+        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
+        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
@@
-    if include_queue_credentials:
-        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
-        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
+    if include_queue_credentials:
+        if (
+            container_clp_config.queue.username is None
+            or container_clp_config.queue.password is None
+        ):
+            raise ValueError("Queue credentials not loaded.")
+        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
+        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
@@
-    if include_redis_credentials:
-        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
+    if include_redis_credentials:
+        if container_clp_config.redis.password is None:
+            raise ValueError("Redis password not loaded.")
+        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ae8f581 and f063bd9.

📒 Files selected for processing (2)

components/clp-package-utils/clp_package_utils/general.py (7 hunks)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (13 hunks)

🧰 Additional context used

🧠 Learnings (7)

📚 Learning: 2025-08-08T06:59:42.436Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py
components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-07-23T09:54:45.185Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py
components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-09T04:07:27.083Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py
components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2024-11-18T16:49:20.248Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-06-02T18:22:24.060Z

Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-19T14:41:28.856Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

🧬 Code Graph Analysis (2)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)

components/clp-package-utils/clp_package_utils/general.py (6)

dump_shared_container_config (345-355)

generate_container_config (229-310)

get_celery_connection_env_vars_list (691-706)

get_clp_home (109-125)

get_common_env_vars_list (644-658)

get_credential_env_vars_list (661-688)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

get_shared_config_file_path (925-926)

StorageType (107-109)

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

generate_container_auth_options (101-177)

components/clp-package-utils/clp_package_utils/general.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (5)

get_shared_config_file_path (925-926)

CLPConfig (787-956)

load_credentials_from_file (252-262)

load_credentials_from_file (364-373)

load_credentials_from_file (459-469)

🔇 Additional comments (5)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)

52-58: Imports cleanup aligns with the new helpers

Good move to import and adopt the shared config dump and env-var helpers; this improves consistency and reduces duplication.

269-269: Use of shared config file path is correct

Switching to the unified shared config path is consistent with the PR objective and reduces per-component YAML churn.

1006-1014: Reducer: shared config mount and path usage look right

Including mounts.generated_config_file and using get_shared_config_file_path is correct and aligns with the unified-config flow.

Also applies to: 1019-1026

1246-1248: Centralised shared-config dump

Dumping the shared config once and mounting it everywhere removes duplication and the risk of per-component config drift. Good change.

components/clp-package-utils/clp_package_utils/general.py (1)

644-658: LGTM: concise and reusable common env builder

Good split; the include_clp_home_env_var flag makes intent explicit and keeps call sites tidy.

coderabbitai · 2025-08-19T14:51:34Z

components/clp-package-utils/clp_package_utils/general.py

+    if not is_path_already_mounted(
+        clp_home,
+        CONTAINER_CLP_HOME,
+        clp_config.get_shared_config_file_path(),
+        container_clp_config.get_shared_config_file_path(),
+    ):
+        docker_mounts.generated_config_file = DockerMount(
+            DockerMountType.BIND,
+            clp_config.get_shared_config_file_path(),
+            container_clp_config.get_shared_config_file_path(),
+        )
+


🧹 Nitpick (assertive)

Mark the shared config file mount read-only

The shared config should not be mutated in-container. Make the bind mount ro.

- docker_mounts.generated_config_file = DockerMount( - DockerMountType.BIND, - clp_config.get_shared_config_file_path(), - container_clp_config.get_shared_config_file_path(), - ) + docker_mounts.generated_config_file = DockerMount( + DockerMountType.BIND, + clp_config.get_shared_config_file_path(), + container_clp_config.get_shared_config_file_path(), + True, + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if not is_path_already_mounted(

clp_home,

CONTAINER_CLP_HOME,

clp_config.get_shared_config_file_path(),

container_clp_config.get_shared_config_file_path(),

):

docker_mounts.generated_config_file = DockerMount(

DockerMountType.BIND,

clp_config.get_shared_config_file_path(),

container_clp_config.get_shared_config_file_path(),

)

if not is_path_already_mounted(

clp_home,

CONTAINER_CLP_HOME,

clp_config.get_shared_config_file_path(),

container_clp_config.get_shared_config_file_path(),

):

docker_mounts.generated_config_file = DockerMount(

DockerMountType.BIND,

clp_config.get_shared_config_file_path(),

container_clp_config.get_shared_config_file_path(),

True,

)

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/general.py around lines 290 to 301, the shared config file is currently bind-mounted into the container as writable; change the DockerMount used for docker_mounts.generated_config_file to be read-only so the in-container process cannot mutate it. Modify the DockerMount construction to include the read-only option (for example by adding mode='ro' or readonly=True or the equivalent parameter your DockerMount class expects) when creating the bind mount, ensuring the source and target paths remain the same.

coderabbitai · 2025-08-19T14:51:34Z

components/clp-package-utils/clp_package_utils/general.py

+def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
+    """
+    :param container_clp_config:
+    :return: A list of Celery connection environment variables for Docker containers, in the format
+    "KEY=VALUE".
+    """
+    env_vars = [
+        f"BROKER_URL=amqp://"
+        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
+        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
+        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
+        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
+        f"{container_clp_config.redis.query_backend_database}",
+    ]
+
+    return env_vars


🛠️ Refactor suggestion

Celery env: parameterise Redis DB and URL-encode credentials

This helper hardcodes the query DB and embeds raw credentials. Parameterise the DB index and encode user/pass to avoid URL parsing issues.

-def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]: +def get_celery_connection_env_vars_list( + container_clp_config: CLPConfig, + redis_database: Optional[int] = None, +) -> List[str]: """ :param container_clp_config: :return: A list of Celery connection environment variables for Docker containers, in the format "KEY=VALUE". """ - env_vars = [ - f"BROKER_URL=amqp://" - f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" - f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", - f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" - f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" - f"{container_clp_config.redis.query_backend_database}", - ] + from urllib.parse import quote + q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="") + q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="") + r_pass = container_clp_config.redis.password + redis_db = ( + container_clp_config.redis.query_backend_database + if redis_database is None + else redis_database + ) + broker = ( + f"amqp://{q_user}:{q_pass}@" + f"{container_clp_config.queue.host}:{container_clp_config.queue.port}" + ) + if r_pass: + backend = ( + "redis://default:" + f"{quote(r_pass, safe='')}@" + f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}" + ) + else: + backend = ( + f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}" + ) + env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"] return env_vars

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:

"""

:param container_clp_config:

:return: A list of Celery connection environment variables for Docker containers, in the format

"KEY=VALUE".

"""

env_vars = [

f"BROKER_URL=amqp://"

f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"

f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",

f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"

f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"

f"{container_clp_config.redis.query_backend_database}",

]

return env_vars

def get_celery_connection_env_vars_list(

container_clp_config: CLPConfig,

redis_database: Optional[int] = None,

) -> List[str]:

"""

:param container_clp_config:

:return: A list of Celery connection environment variables for Docker containers, in the format

"KEY=VALUE".

"""

from urllib.parse import quote

q_user = "" if container_clp_config.queue.username is None else quote(

container_clp_config.queue.username, safe=""

)

q_pass = "" if container_clp_config.queue.password is None else quote(

container_clp_config.queue.password, safe=""

)

r_pass = container_clp_config.redis.password

redis_db = (

container_clp_config.redis.query_backend_database

if redis_database is None

else redis_database

)

broker = (

f"amqp://{q_user}:{q_pass}@"

f"{container_clp_config.queue.host}:{container_clp_config.queue.port}"

)

if r_pass:

backend = (

"redis://default:"

f"{quote(r_pass, safe='')}@"

f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"

)

else:

backend = (

f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"

)

env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]

return env_vars

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/general.py around lines 691 to 706, the Celery env helper currently hardcodes the Redis DB index and injects raw credentials into URLs; update it to parameterise the Redis DB index (use the existing config field, e.g., redis.query_backend_database or rename to redis.db if needed) and URL-encode both broker and redis username/password to avoid URL parsing issues. Import and use urllib.parse.quote_plus to encode container_clp_config.queue.username/container_clp_config.queue.password and container_clp_config.redis.username/container_clp_config.redis.password, then build BROKER_URL and RESULT_BACKEND using the encoded credentials and the config DB index instead of a hardcoded value; return the env_vars list as before.

coderabbitai · 2025-08-19T14:51:35Z

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

+    necessary_mounts = [mounts.clp_home, mounts.logs_dir, mounts.generated_config_file]
    aws_mount, aws_env_vars = generate_container_auth_options(clp_config, component_name)


🧹 Nitpick (assertive)

Scheduler: add-on — consider not embedding queue/redis secrets on CLI

As a follow-up hardening, avoid BROKER_URL/RESULT_BACKEND values on the CLI; pass only keys via -e and inject values via subprocess.run(env=...). This mirrors the pattern in create_db_tables/webui/garbage_collector suggestions.

Also applies to: 622-623, 628-636, 638-639

coderabbitai · 2025-08-19T14:51:35Z

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

+        mounts.generated_config_file,
+    ]
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
+        f"CLP_LOGS_DIR={container_logs_dir}",
+        f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
    ]


⚠️ Potential issue

Garbage collector: remove secrets from CLI; inject via subprocess env

Same leakage pattern as WebUI. Emit keys-only on the CLI and pass actual secret values via env.

Apply:

- necessary_mounts = [ + necessary_mounts = [ mounts.clp_home, mounts.logs_dir, mounts.generated_config_file, ] - env_vars = [ - *get_common_env_vars_list(include_clp_home_env_var=True), - *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), - f"CLP_LOGS_DIR={container_logs_dir}", - f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}", - ] + env_vars = [ + *get_common_env_vars_list(include_clp_home_env_var=True), + # Keys only; values injected via env below + "CLP_DB_USER", + "CLP_DB_PASS", + f"CLP_LOGS_DIR={container_logs_dir}", + f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}", + ] @@ - if aws_env_vars: - env_vars.extend(aws_env_vars) + if aws_env_vars: + # Convert KEY=VALUE entries into keys-only now; stash values for env injection + for kv in aws_env_vars: + env_vars.append(kv.split("=", 1)[0]) @@ - cmd = container_start_cmd + garbage_collector_cmd - subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True) + cmd = container_start_cmd + garbage_collector_cmd + secrets_env = { + "CLP_DB_USER": container_clp_config.database.username, + "CLP_DB_PASS": container_clp_config.database.password, + } + if aws_env_vars: + for kv in aws_env_vars: + k, v = kv.split("=", 1) + secrets_env[k] = v + subprocess.run( + cmd, + stdout=subprocess.DEVNULL, + check=True, + env={**os.environ, **secrets_env}, + )

Also applies to: 1087-1094, 1097-1104

kirkrodrigues

A few minor nits.

For the PR title, how about:

refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149).

components/clp-py-utils/clp_py_utils/clp_config.py

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

Co-authored-by: kirkrodrigues <[email protected]>

…tion into a separate function

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
739-746: Workers: ensure RESULT_BACKEND uses the intended Redis DB

generic_start_worker receives redis_database but get_celery_connection_env_vars_list ignores it, so workers will point to the query DB. Pass the parameter:
-    env_vars = [
-        *get_common_env_vars_list(include_clp_home_env_var=True),
-        *get_celery_connection_env_vars_list(container_clp_config),
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=redis_database,
+        ),
         f"CLP_CONFIG_PATH={container_clp_config.get_shared_config_file_path()}",
         f"CLP_LOGS_DIR={container_logs_dir}",
         f"CLP_LOGGING_LEVEL={worker_config.logging_level}",
         f"CLP_WORKER_LOG_PATH={container_worker_log_path}",
     ]
@@
-    append_docker_options(container_start_cmd, necessary_mounts, env_vars)
+    append_docker_options(container_start_cmd, necessary_mounts, env_vars)
(Helper signature change in general.py required; see my comment there.)

Also applies to: 763-763
components/clp-package-utils/clp_package_utils/general.py (1)
362-397: Avoid embedding KEY=VALUE in docker argv in generate_container_start_cmd

The function emits -e KEY=VALUE which can expose secrets via ps/logs. Prefer -e KEY only and require callers to pass values via subprocess.run(env=...). If acceptable, I can provide a full patch; for now:
-    for key, value in env_vars.items():
-        container_start_cmd.extend(["-e", f"{key}={value}"])
+    for key in env_vars.keys():
+        container_start_cmd.extend(["-e", key])
Callers should merge env dicts into subprocess.run(env=...).
components/clp-py-utils/clp_py_utils/clp_config.py (1)
796-796: Syntax error in type annotation: remove ‘()’ from Reducer

Typing the field as Reducer() = Reducer() is invalid. It must be the class type on the left-hand side.
-    reducer: Reducer() = Reducer()
+    reducer: Reducer = Reducer()

♻️ Duplicate comments (10)

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

226-232: Critical: DB credentials are exposed on the Docker CLI (-e KEY=VALUE). Pass only names to Docker and inject values via subprocess env.

Current generate_container_start_cmd emits "-e KEY=VALUE", which leaks secrets via process listings and logs. Provide only env var names on the docker CLI and pass the values via subprocess.run(..., env=...). This also prevents accidental leakage in debug logs.

Apply within this hunk (after updating general.generate_container_start_cmd accordingly):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd: List[str] = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd: List[str] = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk, inject values at execution and avoid logging the full command with secrets:

# Add at top-level imports
import os  # outside this hunk

# Replace the subprocess invocation and debug logging
proc = subprocess.run(cmd, env={**os.environ, **extra_env_vars})
ret_code = proc.returncode
if 0 != ret_code:
    logger.error("Archive manager failed.")
    # Avoid printing the full command string to prevent accidental leakage.
    logger.debug("Docker command failed (redacted).")

Companion change required in components/clp-package-utils/clp_package_utils/general.py: update generate_container_start_cmd to accept a list of env-var names and emit only "-e KEY" (no "=VALUE"). For example:

-def generate_container_start_cmd(
-    container_name: str,
-    container_mounts: List[Optional[DockerMount]],
-    container_image: str,
-    extra_env_vars: Optional[Dict[str, str]] = None,
-) -> List[str]:
+def generate_container_start_cmd(
+    container_name: str,
+    container_mounts: List[Optional[DockerMount]],
+    container_image: str,
+    env_var_names: Optional[List[str]] = None,
+) -> List[str]:
@@
-    env_vars = {
-        "PYTHONPATH": clp_site_packages_dir,
-        **(extra_env_vars if extra_env_vars is not None else {}),
-    }
-    for key, value in env_vars.items():
-        container_start_cmd.extend(["-e", f"{key}={value}"])
+    # Provide only names; values must be injected via subprocess.run(env=...).
+    container_start_cmd.extend(["-e", f"PYTHONPATH={clp_site_packages_dir}"])
+    for key in (env_var_names or []):
+        container_start_cmd.extend(["-e", key])

This pattern should be applied consistently across all scripts that pass secrets.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

129-135: Critical: DB credentials leak on Docker CLI; pass only env names and inject values via subprocess env. Also ensure cleanup on failure.

Security: Avoid "-e KEY=VALUE" on the CLI. Provide only names; inject values via subprocess.run(env=...).
Robustness: The generated config file is not removed on failure because unlink() is after a return in the except path—use a finally.

Apply within this hunk (after updating general.generate_container_start_cmd to accept names):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk: inject env and guarantee cleanup:

import os  # at top if missing

cmd = container_start_cmd + extract_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
except subprocess.CalledProcessError:
    logger.exception("Docker or file extraction command failed.")
    return -1
finally:
    # Remove generated files even on failure
    generated_config_path_on_host.unlink(missing_ok=True)

Optionally, factor out DB env assembly to avoid duplication:

def _build_db_env(clp_config: CLPConfig) -> dict[str, str]:
    return {
        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
    }
# ...
extra_env_vars = _build_db_env(clp_config)

202-211: Mirror the secret-handling and cleanup fixes in the stream extraction path

Apply the same secure env handling and robust cleanup:

-    generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
-        container_clp_config, clp_config, get_container_config_filename(container_name)
-    )
+    generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
+        container_clp_config, clp_config, get_container_config_filename(container_name)
+    )
@@
-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk:

import os  # if missing

cmd = container_start_cmd + extract_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
except subprocess.CalledProcessError:
    logger.exception("Docker or stream extraction command failed.")
    return -1
finally:
    generated_config_path_on_host.unlink(missing_ok=True)

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

121-127: Critical: Don’t expose DB creds via Docker CLI; inject via subprocess env and add failure-safe cleanup

Pass only env var names to Docker; inject values via env in subprocess.run.
Ensure generated config file is removed even if search fails (currently not removed on CalledProcessError).

Apply within this hunk (after general.generate_container_start_cmd is updated to accept names):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk:

import os  # if missing

cmd = container_start_cmd + search_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
finally:
    generated_config_path_on_host.unlink(missing_ok=True)

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

228-234: Critical: Secrets leak via “-e KEY=VALUE”. Pass only names to Docker and inject values via subprocess env; ensure cleanup on failure.

Security: Avoid including DB creds on the docker argv.
Reliability: If the docker run fails, the generated config file isn’t removed.

Apply within this hunk (after updating general.generate_container_start_cmd):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk, inject env and protect cleanup:

import os  # at top if missing

cmd = container_start_cmd + compress_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
finally:
    generated_config_path_on_host.unlink(missing_ok=True)

Optional DRY: factor out a small helper to build DB env vars (as suggested previously) and reuse across scripts.

def _build_db_env(clp_config: CLPConfig) -> dict[str, str]:
    return {
        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
    }

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

610-616: Fix scheduler logging level and use the correct Redis DB for RESULT_BACKEND

Two issues:

CLP_LOGGING_LEVEL always uses the query scheduler’s logging level.
get_celery_connection_env_vars_list always uses the query backend DB; compression should use its own DB.

Apply:

     env_vars = [
         *get_common_env_vars_list(),
         *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
-        *get_celery_connection_env_vars_list(container_clp_config),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=(
+                container_clp_config.redis.compression_backend_database
+                if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+                else container_clp_config.redis.query_backend_database
+            ),
+        ),
         f"CLP_LOGS_DIR={container_logs_dir}",
-        f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}",
+        f"CLP_LOGGING_LEVEL={(
+            clp_config.compression_scheduler.logging_level
+            if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+            else clp_config.query_scheduler.logging_level
+        )}",
     ]

Note: This requires updating the helper signature in general.py (see my comment there).

components/clp-package-utils/clp_package_utils/general.py (4)

99-99: Typing style: use Optional instead of typing.Optional

You already import Optional. Keep annotations consistent.

-        self.generated_config_file: typing.Optional[DockerMount] = None
+        self.generated_config_file: Optional[DockerMount] = None

665-693: Defensively ensure credentials are loaded before emitting env vars

Prevent “None” values from being emitted when loaders weren’t called; fail fast for required creds.

     if include_db_credentials:
+        container_clp_config.database.ensure_credentials_loaded()
         env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
         env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")

     if include_queue_credentials:
+        if (
+            container_clp_config.queue.username is None
+            or container_clp_config.queue.password is None
+        ):
+            raise ValueError("Queue credentials not loaded.")
         env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
         env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")

     if include_redis_credentials:
+        if container_clp_config.redis.password is None:
+            raise ValueError("Redis password not loaded.")
         env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")

290-301: Mount the shared config file read-only

The shared config should not be mutated in-container. Mark the bind mount readonly.

-        docker_mounts.generated_config_file = DockerMount(
-            DockerMountType.BIND,
-            clp_config.get_shared_config_file_path(),
-            container_clp_config.get_shared_config_file_path(),
-        )
+        docker_mounts.generated_config_file = DockerMount(
+            DockerMountType.BIND,
+            clp_config.get_shared_config_file_path(),
+            container_clp_config.get_shared_config_file_path(),
+            True,
+        )

695-710: Celery env: parameterize Redis DB and URL-encode credentials

Hardcoding the query DB breaks compression components; also embed raw credentials which can break URL parsing. Add a redis_database parameter and encode credentials.

-def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
+def get_celery_connection_env_vars_list(
+    container_clp_config: CLPConfig,
+    redis_database: Optional[int] = None,
+) -> List[str]:
@@
-    env_vars = [
-        f"BROKER_URL=amqp://"
-        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
-        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
-        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
-        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
-        f"{container_clp_config.redis.query_backend_database}",
-    ]
+    from urllib.parse import quote
+    q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="")
+    q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="")
+    r_pass = container_clp_config.redis.password
+    db_index = (
+        container_clp_config.redis.query_backend_database
+        if redis_database is None
+        else redis_database
+    )
+    broker = f"amqp://{q_user}:{q_pass}@{container_clp_config.queue.host}:{container_clp_config.queue.port}"
+    if r_pass:
+        backend = (
+            "redis://default:"
+            f"{quote(r_pass, safe='')}@{container_clp_config.redis.host}:{container_clp_config.redis.port}/{db_index}"
+        )
+    else:
+        backend = f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{db_index}"
+    env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
 
     return env_vars

Update call sites in start_clp.py (schedulers/workers) accordingly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f063bd9 and 31342c3.

📒 Files selected for processing (8)

components/clp-package-utils/clp_package_utils/general.py (7 hunks)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (2 hunks)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (5 hunks)
components/clp-package-utils/clp_package_utils/scripts/search.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (13 hunks)
components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)

🧰 Additional context used

🧠 Learnings (9)

📚 Learning: 2025-08-13T14:48:49.020Z

Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py

📚 Learning: 2024-11-15T16:22:33.635Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2025-07-23T09:54:45.185Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2025-08-09T04:07:27.083Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2024-11-18T16:49:20.248Z

Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-08T06:59:42.436Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py

📚 Learning: 2025-06-02T18:22:24.060Z

Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-08-19T14:41:28.856Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

components/clp-package-utils/clp_package_utils/general.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

🧬 Code Graph Analysis (8)

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)

components/clp-package-utils/clp_package_utils/general.py (1)

get_container_config_filename (325-326)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

components/clp-package-utils/clp_package_utils/general.py (2)

get_container_config_filename (325-326)

generate_container_start_cmd (362-403)

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

components/clp-package-utils/clp_package_utils/general.py (2)

get_container_config_filename (325-326)

generate_container_start_cmd (362-403)

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

components/clp-package-utils/clp_package_utils/general.py (3)

get_container_config_filename (325-326)

DockerMount (66-85)

generate_container_start_cmd (362-403)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

components/clp-py-utils/clp_py_utils/core.py (2)

read_yaml_config_file (56-62)

get_config_value (25-39)

components/clp-package-utils/clp_package_utils/general.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (14)

get_shared_config_file_path (925-926)

dump_to_primitive_dict (249-250)

dump_to_primitive_dict (361-362)

dump_to_primitive_dict (456-457)

dump_to_primitive_dict (552-553)

dump_to_primitive_dict (569-572)

dump_to_primitive_dict (601-604)

dump_to_primitive_dict (702-705)

dump_to_primitive_dict (724-727)

dump_to_primitive_dict (934-956)

dump_to_primitive_dict (968-976)

load_credentials_from_file (252-262)

load_credentials_from_file (364-373)

load_credentials_from_file (459-469)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)

components/clp-package-utils/clp_package_utils/general.py (6)

dump_shared_container_config (349-359)

generate_container_config (229-310)

get_celery_connection_env_vars_list (695-710)

get_clp_home (109-125)

get_common_env_vars_list (648-662)

get_credential_env_vars_list (665-692)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

get_shared_config_file_path (925-926)

StorageType (107-109)

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

generate_container_auth_options (101-177)

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

components/clp-package-utils/clp_package_utils/general.py (2)

get_container_config_filename (325-326)

generate_container_start_cmd (362-403)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (22)

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)

22-22: Switch to get_container_config_filename + updated dump signature: LGTM

Importing and using get_container_config_filename(container_name) to drive dump_container_config is consistent with the unified-config approach and avoids per-component filename logic spread across call sites.

Also aligns with our previous learning that dataset validation should be done at the wrapper level; this file maintains that pattern.

Also applies to: 128-129

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

218-219: Config filename resolution via get_container_config_filename: LGTM

Dumping config with an explicit filename derived from the container name makes the flow clearer and standardizes generated file naming across scripts.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

105-106: Config filename via get_container_config_filename: LGTM

Consistent with the new unified config flow.

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

118-119: Using get_container_config_filename when dumping config: LGTM

Matches the standardization across scripts.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

208-209: Config filename standardization via get_container_config_filename: LGTM

Good move towards a single pattern for container config file creation.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)

52-58: Imports LGTM; new helpers align with refactor goals

The added imports for dump_shared_container_config, get_celery_connection_env_vars_list, get_common_env_vars_list, and get_credential_env_vars_list match the new standardized env/config flow.

269-269: LGTM: use shared config path

Switching to the unified shared config file is correct and aligns with the PR objective.

617-617: Shared config mount added — good catch

Adding mounts.generated_config_file here prevents reliance on the logs dir mount and avoids drift.

631-636: LGTM: pass unified config to the scheduler

Using the shared config file for schedulers is consistent with the new approach.

1006-1014: LGTM: reducer picks up shared config and mounts it

Using CLP_CONFIG with the shared config file and adding the generated_config_file mount is consistent and correct.

Also applies to: 1019-1025

1246-1247: LGTM: shared config dump on startup

Dumping the shared container config once at startup ensures consumers can read non-credential settings consistently.

components/clp-package-utils/clp_package_utils/general.py (4)

325-327: LGTM: centralised container-config filename generator

Having a single helper prevents filename drift across consumers.

329-346: Docstring and return contract are clear

dump_container_config’s contract (container path and host path) is clear and matches usage. Good split from dump_shared_container_config.

471-490: File-based credential loaders invoked intentionally — confirm intent

These helpers load credentials from the credentials file. Given the PR’s scope is to avoid writing creds to temporary per-container files (not to remove the credentials file entirely), this usage is fine. If the long-term plan is env-only source of truth, consider deprecating these and switching to load_credentials_from_env().

648-663: LGTM: common env list

Returning only PYTHONPATH and optionally CLP_HOME keeps the surface minimal and re-usable.

components/clp-py-utils/clp_py_utils/clp_config.py (7)

85-86: LGTM: single source of truth for shared config filename

Using CLP_SHARED_CONFIG_FILENAME avoids filename drift and is referenced by start scripts and general.py.

88-94: LGTM: env-var name constants introduced

Centralising credential env-var names reduces typos and improves consistency across components.

347-354: Redis password optional is sensible

Allowing None for password matches Redis’ optional auth. Good alignment with downstream URL builder once encoded/omitted accordingly.

453-458: Queue: secret scrubbing looks correct

dump_to_primitive_dict excludes username/password as intended.

Also applies to: 456-458

780-785: _get_env_var: concise and clear

Helper is minimal and produces clean error messages. Good.

925-927: LGTM: shared config path uses the declared filename constant

Correct and consistent with the rest of the stack.

934-956: Top-level dump: refactor to avoid duplicating nested serialization

Your refactor to build d excluding nested keys then appending each nested dump is cleaner and maintainable.

coderabbitai · 2025-08-19T16:17:45Z

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

+    env_vars = [
+        *get_common_env_vars_list(),
+        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
+    ]
+    necessary_mounts = [
+        mounts.clp_home,
+        mounts.data_dir,
+        mounts.logs_dir,
+        mounts.generated_config_file,
+    ]


⚠️ Potential issue

Stop leaking DB credentials in create_db_tables; inject via subprocess env

Env vars currently include CLP_DB_USER/CLP_DB_PASS as KEY=VALUE and you log the full command. This leaks credentials via ps and debug logs. Emit key names only on the CLI and pass actual values via subprocess.run(env=...).

Apply:

@@ - env_vars = [ - *get_common_env_vars_list(), - *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), - ] + from clp_py_utils.clp_config import ( + CLP_DB_USER_ENV_VAR_NAME, + CLP_DB_PASS_ENV_VAR_NAME, + ) + env_vars = [ + *get_common_env_vars_list(), + # Keys only; values injected via subprocess env below + CLP_DB_USER_ENV_VAR_NAME, + CLP_DB_PASS_ENV_VAR_NAME, + ] @@ - cmd = container_start_cmd + create_tables_cmd - logger.debug(" ".join(cmd)) - subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True) + cmd = container_start_cmd + create_tables_cmd + logger.debug(" ".join(cmd)) + secrets_env = { + CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password, + } + subprocess.run( + cmd, + stdout=subprocess.DEVNULL, + check=True, + env={**os.environ, **secrets_env}, + )

Also applies to: 274-276

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around lines 251-260 (and also apply same change to 274-276), the current env_vars list contains secret values (CLP_DB_USER/CLP_DB_PASS) which are being concatenated into the logged CLI string and can leak via ps/debug logs; instead build two things: (1) a display-only list of env KEY names (e.g., "CLP_DB_USER", "CLP_DB_PASS", plus other keys) that you can safely join and log, and (2) an actual env dict mapping keys to their secret values (pulled from get_credential_env_vars_list or container_clp_config) which you pass into subprocess.run(..., env=env_dict); update any subprocess.run calls that currently embed KEY=VALUE in the command to pass only the command arguments and the env parameter, and ensure all logs print only the key names, not the values.

coderabbitai · 2025-08-19T16:17:46Z

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

    if aws_env_vars:
-        necessary_env_vars.extend(aws_env_vars)
+        env_vars.extend(aws_env_vars)
    if (


🛠️ Refactor suggestion

Avoid embedding AWS secrets on the CLI; prefer keys-only + env injection

aws_env_vars are KEY=VALUE pairs and will leak via ps/logs. Emit only keys and pass values via subprocess.run(env=...). If you want, I can provide a concrete patch after you confirm keys-only mode here.

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around lines 621-623, the code currently extends env_vars with aws_env_vars (KEY=VALUE pairs) which embeds secret values into CLI args/logs; instead, parse aws_env_vars into key/value pairs, add only the keys (not "KEY=VALUE") to the command argument list (or flag list) and pass the actual values via the subprocess environment parameter (e.g., merge into the env dict passed to subprocess.run); ensure you handle entries without '=' safely and do not log the values anywhere.

coderabbitai · 2025-08-19T16:17:46Z

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

+    env_vars = [
+        *get_common_env_vars_list(),
+        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
        f"NODE_PATH={node_path}",
        f"HOST={clp_config.webui.host}",
        f"PORT={clp_config.webui.port}",
-        f"CLP_DB_USER={clp_config.database.username}",
-        f"CLP_DB_PASS={clp_config.database.password}",
        f"NODE_ENV=production",
    ]


⚠️ Potential issue

WebUI: remove DB/AWS secrets from CLI; inject via subprocess env

Both DB and AWS credentials are emitted as KEY=VALUE on the docker argv, and then the full command is executed/logged. Switch to keys-only on the CLI and pass values via env.

Apply:

@@ - env_vars = [ - *get_common_env_vars_list(), - *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), - f"NODE_PATH={node_path}", - f"HOST={clp_config.webui.host}", - f"PORT={clp_config.webui.port}", - f"NODE_ENV=production", - ] + from clp_py_utils.clp_config import ( + CLP_DB_USER_ENV_VAR_NAME, + CLP_DB_PASS_ENV_VAR_NAME, + ) + env_vars = [ + *get_common_env_vars_list(), + # Keys only; values injected via secrets_env below + CLP_DB_USER_ENV_VAR_NAME, + CLP_DB_PASS_ENV_VAR_NAME, + f"NODE_PATH={node_path}", + f"HOST={clp_config.webui.host}", + f"PORT={clp_config.webui.port}", + f"NODE_ENV=production", + ] @@ - if AwsAuthType.credentials == auth.type: - credentials = auth.credentials - env_vars.append(f"AWS_ACCESS_KEY_ID={credentials.access_key_id}") - env_vars.append(f"AWS_SECRET_ACCESS_KEY={credentials.secret_access_key}") + if AwsAuthType.credentials == auth.type: + # Keys only; values injected via secrets_env below + env_vars.append("AWS_ACCESS_KEY_ID") + env_vars.append("AWS_SECRET_ACCESS_KEY") @@ - if aws_env_vars: - env_vars.extend(aws_env_vars) + if aws_env_vars: + # Convert KEY=VALUE entries into keys-only; stash values to secrets_env + for kv in aws_env_vars: + env_vars.append(kv.split("=", 1)[0]) @@ - cmd = container_cmd + node_cmd - subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True) + cmd = container_cmd + node_cmd + secrets_env = { + CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password, + } + if StorageType.S3 == stream_storage.type: + auth = stream_storage.s3_config.aws_authentication + if AwsAuthType.credentials == auth.type: + credentials = auth.credentials + secrets_env["AWS_ACCESS_KEY_ID"] = credentials.access_key_id + secrets_env["AWS_SECRET_ACCESS_KEY"] = credentials.secret_access_key + else: + if aws_env_vars: + for kv in aws_env_vars: + k, v = kv.split("=", 1) + secrets_env[k] = v + subprocess.run( + cmd, + stdout=subprocess.DEVNULL, + check=True, + env={**os.environ, **secrets_env}, + )

Also applies to: 947-948, 956-956, 960-960, 967-969

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around lines 932-939 (also update the similar spots at 947-948, 956, 960, and 967-969), the current code places full DB/AWS KEY=VALUE pairs directly into the docker command args and logs them; change this to emit only the keys (e.g., "DB_USER", "DB_PASS", "AWS_ACCESS_KEY_ID", etc.) on the CLI and move the corresponding values into the subprocess environment dict passed to the process invocation (populate env from get_credential_env_vars_list but strip values from the argv and set them in os.environ or the subprocess env param); ensure the docker argv contains only KEY names (no =VALUE), and remove sensitive values from any logged command output by constructing the argv separately from the env with values kept only in the subprocess env.

coderabbitai · 2025-08-19T16:17:46Z

components/clp-package-utils/clp_package_utils/scripts/start_clp.py

    necessary_mounts = [
        mounts.clp_home,
        mounts.logs_dir,
+        mounts.generated_config_file,
+    ]
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
+        f"CLP_LOGS_DIR={container_logs_dir}",
+        f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
    ]


⚠️ Potential issue

Garbage collector: remove DB/AWS secrets from CLI; inject via subprocess env

Same leak pattern as create_db_tables/WebUI. Emit keys-only and pass values via env.

Apply:

@@ - env_vars = [ - *get_common_env_vars_list(include_clp_home_env_var=True), - *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), - f"CLP_LOGS_DIR={container_logs_dir}", - f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}", - ] + from clp_py_utils.clp_config import ( + CLP_DB_USER_ENV_VAR_NAME, + CLP_DB_PASS_ENV_VAR_NAME, + ) + env_vars = [ + *get_common_env_vars_list(include_clp_home_env_var=True), + # Keys only; values injected via env below + CLP_DB_USER_ENV_VAR_NAME, + CLP_DB_PASS_ENV_VAR_NAME, + f"CLP_LOGS_DIR={container_logs_dir}", + f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}", + ] @@ - if aws_env_vars: - env_vars.extend(aws_env_vars) + if aws_env_vars: + for kv in aws_env_vars: + env_vars.append(kv.split("=", 1)[0]) @@ - cmd = container_start_cmd + garbage_collector_cmd - subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True) + cmd = container_start_cmd + garbage_collector_cmd + secrets_env = { + CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password, + } + if aws_env_vars: + for kv in aws_env_vars: + k, v = kv.split("=", 1) + secrets_env[k] = v + subprocess.run( + cmd, + stdout=subprocess.DEVNULL, + check=True, + env={**os.environ, **secrets_env}, + )

Also applies to: 1091-1091, 1093-1101

coderabbitai · 2025-08-19T16:17:46Z

components/clp-py-utils/clp_py_utils/clp_config.py

+    def dump_to_primitive_dict(self):
+        return self.dict(exclude={"username", "password"})
+
+    def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        config = read_yaml_config_file(credentials_file_path)
+        if config is None:
+            raise ValueError(f"Credentials file '{credentials_file_path}' is empty.")
+        try:
+            self.username = get_config_value(config, f"{DB_COMPONENT_NAME}.user")
+            self.password = get_config_value(config, f"{DB_COMPONENT_NAME}.password")
+        except KeyError as ex:
+            raise ValueError(
+                f"Credentials file '{credentials_file_path}' does not contain key '{ex}'."
+            )
+
+    def load_credentials_from_env(self):
+        """
+        :raise ValueError: if any expected environment variable is not set.
+        """
+        self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME)
+        self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME)
+


🧹 Nitpick (assertive)

Database: secret scrubbing + env loader look good; docstring minor

dump_to_primitive_dict correctly excludes secrets; env loader is straightforward. Small docstring nit: you’re using “:raise ValueError:” which is acceptable per team preference; no change required.

🤖 Prompt for AI Agents

components/clp-py-utils/clp_py_utils/clp_config.py around lines 249 to 270: no functional changes required — dump_to_primitive_dict correctly excludes secrets, load_credentials_from_file and load_credentials_from_env behave as expected, and the docstring using ":raise ValueError:" is acceptable per team preference; leave the code as-is.

coderabbitai · 2025-08-19T16:17:46Z

components/clp-py-utils/clp_py_utils/clp_config.py

+    def dump_to_primitive_dict(self):
+        return self.dict(exclude={"password"})
+
+    def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        config = read_yaml_config_file(credentials_file_path)
+        if config is None:
+            raise ValueError(f"Credentials file '{credentials_file_path}' is empty.")
+        try:
+            self.password = get_config_value(config, f"{REDIS_COMPONENT_NAME}.password")
+        except KeyError as ex:
+            raise ValueError(
+                f"Credentials file '{credentials_file_path}' does not contain key '{ex}'."
+            )
+
+    def load_credentials_from_env(self):
+        """
+        :raise ValueError: if any expected environment variable is not set.
+        """
+        self.password = _get_env_var(CLP_REDIS_PASS_ENV_VAR_NAME)
+


🧹 Nitpick (assertive)

Redis: secret scrubbing + env loader are fine; optional: docstring “raises”

Behaviour is correct; if you later standardize on “:raises ValueError:” across the file, update here for consistency.

🤖 Prompt for AI Agents

In components/clp-py-utils/clp_py_utils/clp_config.py around lines 361 to 380, update the load_credentials_from_env docstring to match the project’s docstring style by changing the ":raise ValueError:" line to ":raises ValueError:" (plural) so it is consistent with other docstrings in the file; ensure punctuation and spacing match adjacent docstrings.

coderabbitai · 2025-08-19T16:17:46Z

components/clp-py-utils/clp_py_utils/clp_config.py

+    def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        config = read_yaml_config_file(credentials_file_path)
+        if config is None:
+            raise ValueError(f"Credentials file '{credentials_file_path}' is empty.")
+        try:
+            self.username = get_config_value(config, f"{QUEUE_COMPONENT_NAME}.user")
+            self.password = get_config_value(config, f"{QUEUE_COMPONENT_NAME}.password")
+        except KeyError as ex:
+            raise ValueError(
+                f"Credentials file '{credentials_file_path}' does not contain key '{ex}'."
+            )
+
+    def load_credentials_from_env(self):
+        """
+        :raise ValueError: if any expected environment variable is not set.
+        """
+        self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME)
+        self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME)



🧹 Nitpick (assertive)

Queue: file/env loaders consistent; optional: validate non-empty values

If you want early failure on blank strings (not just missing vars), add non-empty checks here similar to DB. Can be a follow-up.

🤖 Prompt for AI Agents

In components/clp-py-utils/clp_py_utils/clp_config.py around lines 459 to 477, the credential loaders accept empty strings which can silently allow invalid credentials; add explicit non-empty validation after reading username and password in both load_credentials_from_file and load_credentials_from_env, and raise ValueError with a clear message including the credentials_file_path or environment variable names when a value is empty; mirror the DB loader behavior (check for falsy/blank values, not just missing keys) so both file and env paths fail fast on blank credentials.

junhaoliao · 2025-10-28T08:12:14Z

components/clp-package-utils/clp_package_utils/general.py

        "--log-driver", "local"
    ]
+    env_vars = {
+        "PYTHONPATH": clp_site_packages_dir,


the above "-e", f"PYTHONPATH={clp_site_packages_dir}", line was not removed, as a mistake possibly during branch syncs with main, which causes the environment variable to be defined twice. The duplication isn't harmful in the correctness though.

@coderabbitai create an issue to track the removal of the deplicate

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Create GitHub issue to track duplicate PYTHONPATH removal gh issue create \ --title "Remove duplicate PYTHONPATH environment variable in generate_container_start_cmd" \ --body "## Description In \`components/clp-package-utils/clp_package_utils/general.py\`, the \`generate_container_start_cmd\` function defines the PYTHONPATH environment variable twice: 1. Directly in the \`container_start_cmd\` array: \`\"-e\", f\"PYTHONPATH={clp_site_packages_dir}\"\` 2. In the \`env_vars\` dictionary, which is then iterated to add environment variables to the command This duplication occurred during branch syncs and should be cleaned up. While not harmful to correctness, only one definition is necessary. ## References - PR: https://github.com/y-scope/clp/pull/1152 - Comment: https://github.com/y-scope/clp/pull/1152#discussion_r2468365115 - Reported by: @junhaoliao" \ --assignee junhaoliao \ --repo y-scope/clp

Length of output: 42

@junhaoliao, I've created issue #1513 to track the removal of the duplicate PYTHONPATH environment variable definition in generate_container_start_cmd. The issue has been assigned to you.

🧬 Code Graph Analysis Results

[clp-package-utils/general.py] Summary: generate_container_start_cmd

Description: Creates the docker run command to start a container with mounts, environment variables, and a name.

Parameters:

container_name: str

container_mounts: List[Optional[DockerMount]]

container_image: str

extra_env_vars: Optional[Dict[str, str]]

Returns: List[str] representing the docker run command.

Key behavior:

Defines clp_site_packages_dir as CONTAINER_CLP_HOME / "lib" / "python3" / "site-packages".

Includes an environment variable "-e PYTHONPATH={clp_site_packages_dir}" unconditionally.

Merges extra_env_vars (if provided) into the command as additional -e KEY=VALUE pairs.

Appends mount arguments for non-None mounts.

Important details:

The command explicitly sets PYTHONPATH in the container startup command, which may duplicate with other environment-variable provisioning (e.g., a separate list of environment vars that also sets PYTHONPATH).

This function is central to how the container’s environment is prepared and could interact with other env var provisioning logic.

[clp-package-utils/general.py] Summary: get_common_env_vars_list

Description: Returns a list of common environment variables for Docker containers.

Parameters:

include_clp_home_env_var: bool (default True)

Returns: List[str] where each item is in the form KEY=VALUE.

Key behavior:

Always includes PYTHONPATH={CONTAINER_CLP_HOME}/lib/python3/site-packages.

If include_clp_home_env_var is True, also includes CLP_HOME={CONTAINER_CLP_HOME}.

Important details:

This list provides a standard set of env vars for containers.

Note: PYTHONPATH is defined here, which can interact with the PYTHONPATH set in generate_container_start_cmd (potential duplication if both methods are used).

junhaoliao added 4 commits August 3, 2025 00:37

feat: Enhance database configuration by loading credentials from envi…

b143237

…ronment variables

refactor

6cddeda

lint

f9d5b59

fix

489476f

junhaoliao requested a review from a team as a code owner August 3, 2025 06:09

junhaoliao changed the title ~~refactor: Standardize configuration and credential delivery to containers (resolves #1149).~~ refactor: Standardize configuration and credential passing to containers (resolves #1149). Aug 3, 2025

junhaoliao changed the title ~~refactor: Standardize configuration and credential passing to containers (resolves #1149).~~ refactor(package): Standardize configuration and credential passing to containers (resolves #1149). Aug 3, 2025

junhaoliao requested a review from haiqi96 August 3, 2025 06:12

coderabbitai bot reviewed Aug 3, 2025

View reviewed changes

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py Outdated Show resolved Hide resolved

fix issues caused by refactoring

0e3f412

junhaoliao marked this pull request as draft August 3, 2025 06:17

coderabbitai bot reviewed Aug 3, 2025

View reviewed changes

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py Outdated Show resolved Hide resolved

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py Outdated Show resolved Hide resolved

junhaoliao marked this pull request as ready for review August 5, 2025 14:45

fix: Update exception handling to catch ValueError instead of KeyError

05283d5

haiqi96 reviewed Aug 6, 2025

View reviewed changes

junhaoliao added 8 commits August 7, 2025 17:30

refactor: Remove unused logger initialization in general.py

34c5e8f

reorder generated_config_file_path

842efac

fix: Update type hint for extra_env_vars to use Dict instead of dict

05715fb

refactor: make extra_env_vars optional

87d07c6

refactor: rename generate_common_environment_variables to generate_co…

73334ed

…ntainer_environment_variables

improve environment variable generation for Docker containers

866334d

Set default value for extra_env_vars to None for immutability instead…

dff9bef

… of empty dictionary

Rename 'load_and_validate_config_file' to 'validate_and_load_config_f…

00ceeb8

…ile'

coderabbitai bot reviewed Aug 7, 2025

View reviewed changes

junhaoliao added 5 commits August 7, 2025 19:40

Add get_generated_config_file_path method to dynamically generate the…

cc9a7c7

… path

extract environment variable retrieval logic

4e1c272

lint

7d82bec

update error logging in query scheduler

1b7c70c

Remove unused import of load_worker_config from job_orchestration.exe…

781bb16

…cutor.utils

junhaoliao added 4 commits August 19, 2025 09:45

remove unused variable clp_site_packages_dir

e15547e

add missing mount to start_garbage_collector

4bba4ed

Update function docstrings raise to use consistent colon placement

6ab2f80

refactor(clp-package-utils): rename environment variable generation f…

cd4437d

…unctions

coderabbitai bot reviewed Aug 19, 2025

View reviewed changes

refactor(clp-package-utils): rename dump_shared_config function to …

14e0219

…`dump_shared_container_config`

coderabbitai bot mentioned this pull request Aug 19, 2025

Database credential configuration naming inconsistency between YAML and object properties #1232

Open

junhaoliao added 4 commits August 19, 2025 10:05

refactor(clp-package-utils): rename and restructure configuration dum…

33bc068

…ping functions

refactor(config): rename generated config file to shared config file

ae8f581

Merge branch 'main' into db-config-file

0e52cb1

lint

f063bd9

junhaoliao requested a review from kirkrodrigues August 19, 2025 14:38

coderabbitai bot reviewed Aug 19, 2025

View reviewed changes

kirkrodrigues previously approved these changes Aug 19, 2025

View reviewed changes

components/clp-py-utils/clp_py_utils/clp_config.py Outdated Show resolved Hide resolved

components/clp-package-utils/clp_package_utils/scripts/start_clp.py Outdated Show resolved Hide resolved

remove redundant comment - Apply suggestions from code review

04f81b7

Co-authored-by: kirkrodrigues <[email protected]>

junhaoliao dismissed kirkrodrigues’s stale review via 04f81b7 August 19, 2025 15:55

junhaoliao and others added 2 commits August 19, 2025 11:55

order constants - Apply suggestions from code review

34c430f

Co-authored-by: kirkrodrigues <[email protected]>

refactor(clp-package-utils): extract container config filename genera…

31342c3

…tion into a separate function

coderabbitai bot reviewed Aug 19, 2025

View reviewed changes

Merge branch 'main' into db-config-file

ca2db62

junhaoliao changed the title ~~refactor(package): Standardize configuration and credentials passing to containers (resolves #1149).~~ refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). Aug 19, 2025

junhaoliao requested a review from kirkrodrigues August 19, 2025 17:46

kirkrodrigues approved these changes Aug 19, 2025

View reviewed changes

junhaoliao merged commit dab1010 into y-scope:main Aug 19, 2025
8 of 9 checks passed

junhaoliao deleted the db-config-file branch August 19, 2025 19:23

haiqi96 mentioned this pull request Aug 19, 2025

fix(package): Pass credentials via env vars to dataset_manager container. #1239

Merged

3 tasks

junhaoliao commented Oct 28, 2025

View reviewed changes

coderabbitai bot mentioned this pull request Oct 28, 2025

Remove duplicate PYTHONPATH environment variable in generate_container_start_cmd #1513

Closed

		necessary_mounts = [mounts.clp_home, mounts.logs_dir, mounts.generated_config_file]
		aws_mount, aws_env_vars = generate_container_auth_options(clp_config, component_name)

refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). #1152

refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). #1152

Uh oh!

Conversation

junhaoliao commented Aug 3, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Code / File inspection

Manual tests

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

haiqi96 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

haiqi96 Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

haiqi96 Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 8, 2025

Choose a reason for hiding this comment

Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 313-346):

Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 348-374):

Summary of generate_container_config function (lines 180-236):

Summary of generate_container_start_cmd function (lines 270-300):

Summary of generate_common_environment_variables usage:

Summary of CLPConfig class (from clp_py_utils/clp_config.py):

junhaoliao commented Aug 3, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 3, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

coderabbitai bot Aug 7, 2025 •

edited

Loading

coderabbitai bot Aug 7, 2025 •

edited

Loading

Relevant snippet from `components/clp-package-utils/clp_package_utils/general.py` (lines 313-346):

Relevant snippet from `components/clp-package-utils/clp_package_utils/general.py` (lines 348-374):

Summary of `generate_container_config` function (lines 180-236):

Summary of `generate_container_start_cmd` function (lines 270-300):

Summary of `generate_common_environment_variables` usage:

Summary of `CLPConfig` class (from `clp_py_utils/clp_config.py`):

coderabbitai bot Aug 19, 2025 •

edited

Loading