Skip to content

Conversation

@junhaoliao
Copy link
Member

@junhaoliao junhaoliao commented Aug 3, 2025

Description

This PR standardizes how configuration and credentials are passed to containers, addressing security concerns and code complexity in the start_clp.py script.

The key changes are:
Standardize configuration and credential delivery to improve consistency and security by:

  • Separating credentials from general configuration.
    • Instead, passing all credentials via environment variables to avoid leaking them in temporary files.
  • Using a single generated configuration file for all non-credential configuration data. (instead of generating config files per component)

See #1149 for more details.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Code / File inspection

  1. Inspected that all components properly load credentials from environment variables.
  2. Confirmed that the single generated configuration file contains all necessary non-credential configuration data

Manual tests

  1. Tested container startup with the new configuration approach:

    • cd clp-package/sbin/
    • ./start-clp.sh
  2. Tested wrapper scripts:

    • ./compress.sh ~/samples/hive-24hr/
    • ./decompress.sh x
    • ./search.sh "1"
    • ./admin-tools/archive-manager.sh find
  3. Performed a search with query string "1" in the webui and observed results returned successfully.

  4. Enabled garbage_collector and observed archives / search results are purged in below configurations:

    1. results_cache.retention_period = 1 + archive_output.retention_period = 5800000 + performed a search: observed search results were cleaned in around one minute after query completed.
    2. results_cache.retention_period = 1 + archive_output.retention_period = 1 + performed a search: observed no search result; archives were cleaned in around one minute after compression.
    results_cache:
      # Retention period for search results, in minutes. Set to null to disable automatic deletion.
      retention_period: 1
    
    ## Where archives should be output to
    archive_output:
      # Retention period for archives, in minutes. Set to null to disable automatic deletion.
      retention_period: 1 # 5800000 for validation with the result cache 
    
    ## Garbage collector config
    garbage_collector:
      logging_level: "INFO"
    
      # Interval (in minutes) at which garbage collector jobs run
      sweep_interval:
        archive: 1
        search_result: 1

Summary by CodeRabbit

  • New Features

    • Inject extra environment variables into container startup; add shared generated container config file and mount support; new helpers to construct common, credential and connection env lists.
  • Bug Fixes

    • Load DB/queue/Redis credentials from environment when present; omit sensitive fields from serialized exports; improved error logging with tracebacks.
  • Documentation

    • Added -c short alias and clarified --config help text.
  • Refactor

    • Centralized container config dumping and env/credential handling; standardized generated config filename usage.

@junhaoliao junhaoliao requested a review from a team as a code owner August 3, 2025 06:09
@junhaoliao junhaoliao changed the title refactor: Standardize configuration and credential delivery to containers (resolves #1149). refactor: Standardize configuration and credential passing to containers (resolves #1149). Aug 3, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 3, 2025

Warning

Rate limit exceeded

@junhaoliao has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 11 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 31342c3 and ca2db62.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)

Walkthrough

Adds shared generated-config file support and mounting, moves secret credentials to environment-variable flows with new env-name constants and load-from-env methods, extends container start to accept extra environment variables, centralizes config dumping, and updates many scripts and the CLPConfig model to use the new plumbing.

Changes

Cohort / File(s) Change Summary
Config model & env constants
components/clp-py-utils/clp_py_utils/clp_config.py
Added CLP_SHARED_CONFIG_FILENAME and CLP_*_ENV_VAR_NAME constants; Database/Queue/Redis gained dump/load-from-file and load-from-env methods; credential fields made optional; secrets excluded from dump_to_primitive_dict; CLPConfig.get_shared_config_file_path() and updated CLPConfig.dump_to_primitive_dict.
CLP Docker / config utilities
components/clp-package-utils/clp_package_utils/general.py
Added generated_config_file: Optional[DockerMount] to CLPDockerMounts; generate_container_config now mounts shared config when missing; dump_container_config signature changed to accept config_filename and return host/container paths; added dump_shared_container_config and get_container_config_filename; generate_container_start_cmd accepts extra_env_vars and injects -e KEY=VALUE; added env helper functions (get_common_env_vars_list, get_credential_env_vars_list, get_celery_connection_env_vars_list) and formatting helpers.
Containerized script entrypoints
components/clp-package-utils/clp_package_utils/scripts/*
.../archive_manager.py, compress.py, decompress.py, search.py, dataset_manager.py
Use get_container_config_filename(container_name) for dumped config filename; construct extra_env_vars mapping DB creds to CLP_DB_* env-name constants; pass extra_env_vars as new 4th argument to generate_container_start_cmd; imports updated to use new constants/helpers.
Native (no-container) scripts
components/clp-package-utils/clp_package_utils/scripts/native/*
archive_manager.py, compress.py, decompress.py, search.py
After config validation, call clp_config.database.load_credentials_from_env() to load DB credentials from environment; native decompress now uses CLP_DB_*_ENV_VAR_NAME keys for subprocess env.
start_clp orchestration
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
Replaced per-container YAML dumps with shared generated config path (container_clp_config.get_shared_config_file_path()); add mounts.generated_config_file to mounts; centralized env-var construction via new helpers; call dump_shared_container_config; removed per-container YAML writes.
DB / init CLP scripts
components/clp-py-utils/clp_py_utils/*
create-db-tables.py, initialize-clp-metadata-db.py, initialize-orchestration-db.py
CLI: add -c alias for --config; switch to pathlib.Path and CLPConfig.parse_obj(...); call clp_config.database.load_credentials_from_env() after parsing; added explicit ValidationError/ValueError handling and adjusted imports to use CLPConfig.
Job orchestration (schedulers/executor/gc)
components/job-orchestration/...
compression_scheduler.py, query_scheduler.py, compression_task.py, garbage_collector.py
After CLPConfig parse, call clp_config.database.load_credentials_from_env(); unify ValidationError/ValueError handling; use logger.exception for generic failures; update executor env-var keys to CLP_DB_* constants; adjust startup logs to use component names.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Script
    participant CLPConfig
    participant OS
    participant Docker

    User->>Script: invoke CLI (start_clp / compress / search / decompress / native)
    Script->>CLPConfig: parse YAML -> CLPConfig
    Script->>OS: read env (CLP_DB_*, CLP_QUEUE_*, CLP_REDIS_*)
    OS-->>Script: env values
    Script->>CLPConfig: clp_config.database.load_credentials_from_env()
    alt shared config required
        Script->>OS: dump_shared_container_config / dump_container_config(config_filename)
        OS-->>Script: host config path
        Script->>Docker: ensure generated_config_file mount present
    end
    Script->>Script: build extra_env_vars (KEY=VALUE) + common env list
    Script->>Docker: generate_container_start_cmd(..., extra_env_vars)
    Docker-->>Script: container started with mounts and injected env
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • kirkrodrigues
  • haiqi96
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@junhaoliao junhaoliao changed the title refactor: Standardize configuration and credential passing to containers (resolves #1149). refactor(package): Standardize configuration and credential passing to containers (resolves #1149). Aug 3, 2025
@junhaoliao junhaoliao requested a review from haiqi96 August 3, 2025 06:12
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

732-752: Missing generated_config_dir mount in worker configuration.

The function sets CLP_CONFIG_PATH environment variable pointing to the generated config file but doesn't include mounts.generated_config_dir in the necessary mounts. This could cause the worker to fail accessing the config file.

     necessary_mounts = [
         mounts.clp_home,
         mounts.data_dir,
         mounts.logs_dir,
+        mounts.generated_config_dir,
     ]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 581bd46 and 489476f.

📒 Files selected for processing (17)
  • components/clp-package-utils/clp_package_utils/general.py (8 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (10 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (5 hunks)
  • components/clp-py-utils/clp_py_utils/create-db-tables.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (3 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1 hunks)
🧰 Additional context used
🧠 Learnings (15)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
📚 Learning: in `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variabl...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: in clp installation scripts, consistency across platform scripts is prioritized over defensive progr...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in `fs_compression_task.py`, when refactoring the archive processing loop in python, the `src_archiv...
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: for installation scripts in the clp project, prefer explicit error handling over automatic dependenc...
Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
📚 Learning: in the current clp codebase implementation, dataset validation using validate_dataset() is performed...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
📚 Learning: in the clp project, when reviewing cmakelists.txt changes that introduce new compression library dep...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in the c++ `filedecompressor` implementations at `components/core/src/clp/clp/filedecompressor.cpp` ...
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in the compression scheduler, the team prefers initializing in-memory caches from the database at st...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: the `query_scheduler.py` file operates with single threading, and multithreading is not used....
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: in the clp codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (40)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

34-34: LGTM: Import addition supports configuration standardization.

The import of load_worker_config aligns with the PR's goal of standardizing configuration and credential delivery to containers.

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

120-126: LGTM: Improved security by passing database credentials via environment variables.

The addition of extra_env_vars containing database credentials and updating the generate_container_start_cmd call aligns with the PR's security improvements. This approach prevents credentials from being embedded in temporary configuration files.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

209-219: LGTM: Consistent credential handling standardization.

The addition of extra_env_vars for database credentials follows the same secure pattern implemented across other scripts. This ensures consistent and secure credential delivery to containers.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

105-131: LGTM: Secure credential handling in file extraction function.

The addition of extra_env_vars containing database credentials and updating the generate_container_start_cmd call ensures secure credential delivery for file extraction operations.


211-217: LGTM: Consistent credential handling in stream extraction function.

The implementation matches the pattern used in the file extraction function, ensuring consistent and secure credential handling across both extraction methods.

components/clp-py-utils/clp_py_utils/create-db-tables.py (1)

22-22: Confirm configuration-loading updates in create-db-tables.py

I see the new -c alias and updated help text (“CLP configuration file.”) on line 22; that part looks good. However, I wasn’t able to find any use of CLPConfig.parse_obj() or load_database_credentials_from_env() in this script—only the help text change is present. If you intended to expand the configuration-loading logic here, please verify those changes have been applied or update the summary accordingly.

components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

208-211: LGTM - Credential extraction for environment variables

The extraction of database credentials into environment variables aligns with the PR's goal to standardize credential delivery to containers.


231-231: LGTM - Environment variable injection into container start command

The modification to pass extra_env_vars to generate_container_start_cmd correctly implements the standardized approach for credential delivery to containers.

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (2)

3-3: LGTM - Import addition for environment variable access

The os module import is necessary for accessing environment variables in the credential override block.


192-197: LGTM - Environment variable override for database credentials

The implementation correctly overrides database credentials from environment variables and provides appropriate error handling for missing variables. This aligns with the PR's standardization of credential delivery.

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

4-4: LGTM - Import addition for environment variable access

The os module import is necessary for accessing environment variables in the credential override logic.


233-238: LGTM - Environment variable override for database credentials

The implementation correctly overrides database credentials from environment variables with proper error handling. This follows the established pattern across native scripts for credential standardization.

components/clp-package-utils/clp_package_utils/scripts/native/search.py (2)

7-7: LGTM - Import addition for environment variable access

The os module import enables access to environment variables for the credential override functionality.


301-306: LGTM - Environment variable override for database credentials

The implementation correctly loads database credentials from environment variables with appropriate error handling. This is consistent with the pattern used across other native scripts in this PR.

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)

1136-1137: LGTM - Standardized configuration loading with environment variable credentials

The change to use CLPConfig.load_from_generated_config_file() and load_database_credentials_from_env() correctly implements the new standardized approach for configuration and credential loading.


1138-1143: LGTM - Improved exception handling

The exception handling is now more specific, catching ValidationError and KeyError separately from generic exceptions, which provides better error reporting and aligns with the new configuration loading approach.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

406-407: LGTM! Improved configuration loading pattern.

The switch to CLPConfig.load_from_generated_config_file() and environment-based credential loading aligns well with the PR's goal to standardize configuration delivery and enhance security by separating credentials from config files.


408-414: Enhanced error handling with proper exception categorization.

Good improvement in exception handling - catching specific ValidationError and KeyError exceptions separately from generic exceptions, with appropriate logging for debugging purposes.

components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (2)

42-42: Consistent command-line interface improvement.

Adding the -c short option alongside --config improves usability and maintains consistency with other scripts in the codebase.


48-56: Excellent standardization of configuration loading.

The shift from parsing only the Database section to parsing the full CLPConfig with environment variable credential overrides aligns perfectly with the PR's security and standardization objectives. The error handling appropriately catches validation and configuration errors.

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2)

47-56: Consistent configuration standardization pattern.

This follows the same secure pattern established in other initialization scripts - using full CLPConfig parsing with environment variable credential overrides. The error handling is comprehensive and appropriate.


59-60: Proper configuration object usage.

Using clp_config.database directly after the unified configuration loading is the correct approach and maintains consistency with the new configuration management pattern.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (2)

122-127: Secure credential override implementation.

The environment variable credential override pattern enhances security by separating sensitive data from configuration files. The error handling appropriately catches missing environment variables.


234-239: Consistent credential override pattern.

Good consistency - the same environment variable credential override pattern is applied in both extraction functions, ensuring uniform security handling throughout the script.

components/clp-py-utils/clp_py_utils/clp_config.py (6)

612-612: Well-designed generated config file path.

The generated_config_file_path field provides a centralized location for container configuration files, supporting the PR's goal of standardizing configuration delivery.


624-626: Proper path resolution for generated config.

The generated config file path is correctly made absolute, ensuring consistent path handling across different execution contexts.


765-771: Secure environment variable credential loading.

The database credential loading method properly validates that required environment variables are present and provides clear error messages for missing credentials.


772-778: Consistent queue credential handling.

The queue credential loading follows the same secure pattern as the database credentials, maintaining consistency across different service credentials.


779-784: Complete credential environment variable support.

The Redis credential loading completes the environment variable support for all major service credentials, enabling full separation of sensitive data from configuration files.


799-808: Enhanced security in configuration export.

Excellent security improvement - removing sensitive credential fields and the generated config file path from exported configuration prevents accidental exposure of sensitive information.

components/clp-package-utils/clp_package_utils/general.py (5)

100-100: LGTM! Consistent with existing mount attributes.

The addition of generated_config_dir follows the established pattern for optional Docker mounts in the class.


286-297: LGTM! Mount handling follows established patterns.

The logic correctly checks if the path is already mounted before creating a new mount, consistent with how other paths are handled.


403-407: LGTM! Node-specific path handling is consistent.

The modification correctly makes the generated config file path node-specific by including the hostname, following the same pattern as data and logs directories.


623-676: LGTM! Well-structured centralized environment variable generation.

The function effectively centralizes environment variable generation with clear flag-based controls. The implementation correctly formats credentials and connection strings.

Note: Ensure that Docker logs and process listings are properly secured since credentials are passed via environment variables.


333-336: No compatibility issue with dict[str, str]
The project’s Poetry configuration specifies python = ">= 3.9" (in clp-package-utils, as well as other modules), so built-in generics like dict[str, str] are fully supported. No changes needed.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (5)

244-252: LGTM! Proper credential handling for database table creation.

The function correctly uses the centralized environment variable generation with database credentials and includes all necessary mounts.


602-610: LGTM! Scheduler configuration properly centralized.

The function correctly generates environment variables with Celery connection parameters and includes component-specific variables for logging.


998-1012: LGTM! Reducer configuration includes all necessary credentials.

The function correctly generates environment variables with database, queue, and Redis credentials, and properly includes the generated config mount.


925-938: LGTM! WebUI configuration properly handles database credentials.

The function correctly generates environment variables with database credentials and web-specific variables. The absence of generated_config_dir mount appears intentional as WebUI uses settings.json files rather than the generated config.


1146-1148: LGTM! Config generation properly centralized.

The config file is correctly generated once before starting any components, ensuring all containers use the same configuration file as intended by the PR objectives.

@junhaoliao junhaoliao marked this pull request as draft August 3, 2025 06:17
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 489476f and 0e3f412.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (10 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: in the compression scheduler, the team prefers initializing in-memory caches from the database at st...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: the `query_scheduler.py` file operates with single threading, and multithreading is not used....
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧬 Code Graph Analysis (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • load_database_credentials_from_env (765-770)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • load_database_credentials_from_env (765-770)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (10)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

1137-1137: LGTM! Credential loading from environment variables implemented correctly.

The addition of clp_config.load_database_credentials_from_env() aligns with the PR objective to pass credentials via environment variables instead of configuration files.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

407-407: LGTM! Credential loading from environment variables implemented correctly.

The addition of clp_config.load_database_credentials_from_env() aligns with the PR objective to pass credentials via environment variables instead of configuration files.


412-412: Good improvement! Enhanced error logging with stack traces.

Changing from logger.error(ex) to logger.exception(ex) provides valuable stack trace information for debugging generic exceptions.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7)

50-51: LGTM! Import additions align with configuration standardization.

The new imports for dump_container_config and generate_common_environment_variables support the centralized configuration management approach.


244-261: Well-implemented credential separation for database table creation.

The changes correctly:

  • Pass database credentials via environment variables using the standardized utility
  • Include the generated config directory in mounts
  • Reference the centralized configuration file

This improves security by avoiding credential exposure through temporary files.


602-629: Excellent standardization of scheduler configuration handling.

The refactoring correctly:

  • Centralizes environment variable generation with appropriate credential flags
  • Maintains all necessary environment variables (logs directory, logging level)
  • Adds the generated config directory to mounts
  • Uses the pre-generated configuration file path

This aligns perfectly with the PR's goal of standardizing configuration delivery.


733-748: Verify if the generated config directory mount is needed for workers.

The environment variable generation is correctly implemented. However, I notice that unlike other components, the mounts.generated_config_dir is not added to necessary_mounts.

Since the config path is passed via CLP_CONFIG_PATH environment variable, please confirm whether workers need direct access to the generated config directory.


926-935: Appropriate credential handling for WebUI component.

The changes correctly pass database credentials via environment variables while maintaining the WebUI-specific configuration approach through settings.json files.


999-1022: Comprehensive credential separation for reducer component.

Excellent implementation that:

  • Passes all required credentials (database, queue, redis) via environment variables
  • Includes the generated config directory in mounts
  • Uses the centralized configuration file

This provides a secure and standardized approach for the reducer component.


1147-1149: Perfect placement of centralized configuration generation.

The dump_container_config call is correctly positioned after all validations and directory creation, ensuring a single configuration file is generated before any components start. This is the cornerstone of the configuration standardization effort.

@junhaoliao junhaoliao marked this pull request as ready for review August 5, 2025 14:45
Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best effort review for now

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (3)
components/clp-py-utils/clp_py_utils/clp_config.py (2)

611-611: Consider using a getter method instead of a member variable.

Based on past feedback, having generated_config_file_path as a member variable allows users to potentially overwrite the file path, which may not be desirable. Consider implementing this as a getter method instead:

def get_generated_config_file_path(self) -> pathlib.Path:
    return self.logs_directory / ".clp-config.yml"

This approach would ensure the path is always computed relative to the current logs_directory and prevents unwanted modifications.


765-783: Consider creating a reusable helper method for environment variable loading.

The three methods have similar structure with repeated try-catch blocks for environment variable loading. As suggested in past feedback, consider extracting this into a reusable helper method:

def _get_env_var(self, name: str) -> str:
    try:
        return os.environ[name]
    except KeyError as ex:
        raise ValueError(f"Missing environment variable: {ex}")

Then refactor the methods:

def load_database_credentials_from_env(self):
    self.database.username = self._get_env_var("CLP_DB_USER")
    self.database.password = self._get_env_var("CLP_DB_PASS")

This reduces code duplication and improves maintainability.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

121-127: Extract duplicated credential override logic to utils.

The logic for overriding database credentials from environment variables is duplicated between handle_extract_stream_cmd and handle_extract_file_cmd. As previously discussed, this should be extracted into a helper function in native/utils.py.

Also applies to: 233-239

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05283d5 and 00ceeb8.

📒 Files selected for processing (8)
  • components/clp-package-utils/clp_package_utils/general.py (9 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (11 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (5 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-07-01T14:51:19.172Z
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2024-12-05T16:32:21.507Z
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-07-03T20:10:43.789Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: 2025-07-03T12:58:18.407Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: 2025-07-28T08:33:57.487Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: 2024-10-10T05:50:15.348Z
Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/SchemaTreeNode.hpp:0-0
Timestamp: 2024-10-10T05:50:15.348Z
Learning: In this codebase, assertions are not used; the project policy is to avoid adding assertions in the code.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-07-25T21:29:48.947Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1126
File: .gitignore:5-5
Timestamp: 2025-07-25T21:29:48.947Z
Learning: In the CLP project, the .clang-format file is maintained in the yscope-dev-utils submodule and copied over to the main CLP repository, so it should be ignored in .gitignore to prevent accidental commits of the copied file and maintain the single source of truth in the submodule.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (6)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (330-366)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (330-366)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (330-366)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (330-366)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
  • make_config_path_absolute (42-53)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • CLPConfig (589-809)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/clp_config.py (2)

624-626: LGTM!

The generated config file path is correctly made absolute using the existing make_config_path_absolute utility function, which is consistent with how other paths are handled in this method.


798-808: LGTM!

The sensitive credential removal is correctly implemented. Excluding generated_config_file_path from the exported configuration and removing all credential fields (username, password for database/queue, and username for redis) ensures no sensitive information leaks through configuration exports.

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

121-127: LGTM!

The database credentials are correctly passed as environment variables to the container. This implementation aligns with the PR objective of separating credentials from configuration files and follows the same pattern used in other scripts (archive_manager.py, compress.py, decompress.py).

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

214-220: LGTM!

The database credentials are correctly passed as environment variables to the container start command. This follows the same consistent pattern implemented across other scripts and aligns with the PR objective of externalizing sensitive credentials from configuration files.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

126-132: LGTM!

The database credentials are correctly passed as environment variables in the file extraction command. The implementation follows the same pattern used consistently across other scripts.


211-218: LGTM!

The database credentials are correctly passed as environment variables in the stream extraction command. This maintains consistency with the file extraction function and other scripts in the codebase.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

225-231: LGTM!

The database credentials are correctly passed as environment variables to the container start command. This implementation is consistent with the pattern applied across all other scripts (search.py, archive_manager.py, decompress.py) and successfully achieves the PR objective of externalizing sensitive credentials from configuration files.

Comment on lines 275 to 286
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
clp_config.generated_config_file_path,
container_clp_config.generated_config_file_path,
):
docker_mounts.generated_config_dir = DockerMount(
DockerMountType.BIND,
clp_config.generated_config_file_path,
container_clp_config.generated_config_file_path,
)

Copy link
Contributor

@coderabbitai coderabbitai bot Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Mount the parent directory of the config file, not the file itself.

The code attempts to mount generated_config_file_path which is a file path, but Docker requires mounting directories. You should mount the parent directory instead.

Apply this diff to fix the mounting issue:

 if not is_path_already_mounted(
     clp_home,
     CONTAINER_CLP_HOME,
-    clp_config.generated_config_file_path,
-    container_clp_config.generated_config_file_path,
+    clp_config.generated_config_file_path.parent,
+    container_clp_config.generated_config_file_path.parent,
 ):
     docker_mounts.generated_config_dir = DockerMount(
         DockerMountType.BIND,
-        clp_config.generated_config_file_path,
-        container_clp_config.generated_config_file_path,
+        clp_config.generated_config_file_path.parent,
+        container_clp_config.generated_config_file_path.parent,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
clp_config.generated_config_file_path,
container_clp_config.generated_config_file_path,
):
docker_mounts.generated_config_dir = DockerMount(
DockerMountType.BIND,
clp_config.generated_config_file_path,
container_clp_config.generated_config_file_path,
)
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
clp_config.generated_config_file_path.parent,
container_clp_config.generated_config_file_path.parent,
):
docker_mounts.generated_config_dir = DockerMount(
DockerMountType.BIND,
clp_config.generated_config_file_path.parent,
container_clp_config.generated_config_file_path.parent,
)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 275 to
286, the code mounts the config file path directly, but Docker requires mounting
directories. Modify the mount source and target paths to use the parent
directory of the config file paths instead of the file paths themselves,
ensuring the mount points are directories.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using --mount which seems to work for file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haiqi96 You're right—Docker bind mounts support files with --mount as long as the source exists. No change is required to switch to the parent directory.

Two small hardening tweaks you might consider:

  • Make the config bind read-only to avoid accidental writes:
    • pass is_read_only=True when constructing the DockerMount for the generated config file.
  • Optionally skip the file-level mount if the logs directory is already mounted (the file will be available via that directory), to avoid redundant mounts.

Otherwise, keeping the file mount is fine given dump_container_config runs before container start.


🧠 Learnings used
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.409Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Comment on lines 622 to 624
def generate_common_environment_variables(
include_clp_home=True,
) -> List[str]:
Copy link
Contributor

@coderabbitai coderabbitai bot Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Function signature mismatch with usage in start_clp.py.

The function generate_common_environment_variables doesn't accept container_clp_config as a parameter, but it's being called with this argument at line 1002 in start_clp.py. This will cause a TypeError at runtime.


🏁 Script executed:

#!/bin/bash
# Verify all calls to generate_common_environment_variables
rg -A 2 "generate_common_environment_variables\(" --type py

Length of output: 2317


We should confirm the exact locations in start_clp.py where the incorrect calls occur so we can update them precisely. Let’s list those lines with context:


🏁 Script executed:

#!/bin/bash
rg -n -C3 "generate_common_environment_variables" components/clp-package-utils/clp_package_utils/scripts/start_clp.py

Length of output: 1838


Fix incorrect calls to generate_common_environment_variables in start_clp.py.

The function signature in components/clp-package-utils/clp_package_utils/general.py is

def generate_common_environment_variables(
    include_clp_home: bool = True,
) -> List[str]:
    ...

It does not accept a container_clp_config argument, yet in start_clp.py it’s being called like this:

• start_clp.py:247
*generate_common_environment_variables(container_clp_config),

• start_clp.py:608
*generate_common_environment_variables(container_clp_config),

• start_clp.py:929
*generate_common_environment_variables(container_clp_config),

• start_clp.py:1002
*generate_common_environment_variables(container_clp_config, include_clp_home=True),

To resolve the TypeError at runtime, update these invocations so they only pass the include_clp_home flag (or rely on its default), for example:

- *generate_common_environment_variables(container_clp_config),
+ *generate_common_environment_variables(),

- *generate_common_environment_variables(container_clp_config, include_clp_home=True),
+ *generate_common_environment_variables(include_clp_home=True),

Please apply these changes at the locations above.

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py at lines 622 to
624, the function generate_common_environment_variables only accepts the
include_clp_home boolean argument. In start_clp.py at lines 247, 608, 929, and
1002, calls to this function incorrectly pass container_clp_config as an
argument. To fix the TypeError, update these calls to remove
container_clp_config and only pass include_clp_home if needed, or rely on its
default value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the code again? i don't think we are calling generate_common_environment_variables with container_clp_config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e

echo "== Definition =="
rg -n -C1 $'^def generate_common_environment_variables\\(' components

echo ""
echo "== All call sites =="
rg -n -C3 "generate_common_environment_variables\\(" --type py

echo ""
echo "== start_clp.py call sites =="
fd start_clp.py components | xargs -I{} rg -n -C3 "generate_common_environment_variables\\(" {}

Length of output: 6350


@coderabbit-user Thanks, @haiqi96 — I rechecked the code with a repo-wide search. All calls to generate_common_environment_variables are correct; none passes container_clp_config.

Verified call sites in start_clp.py:

  • Line 246: *generate_common_environment_variables()
  • Line 607: *generate_common_environment_variables()
  • Line 738: *generate_common_environment_variables(include_clp_home=True)
  • Line 930: *generate_common_environment_variables()
  • Line 1005: *generate_common_environment_variables(include_clp_home=True)

Conclusion: No TypeError risk here. The earlier note was incorrect; no changes needed.


🧬 Code Graph Analysis Results

Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 313-346):

def generate_common_environment_variables(
    include_clp_home=True,
) -> List[str]:
    """
    Generate a list of common environment variables for Docker containers.

    :param include_clp_home:
    :return: List of environment variable strings in the format "KEY=VALUE".
    """
    clp_site_packages_dir = CONTAINER_CLP_HOME / "lib" / "python3" / "site-packages"
    env_vars = [f"PYTHONPATH={clp_site_packages_dir}"]

    if include_clp_home:
        env_vars.append(f"CLP_HOME={CONTAINER_CLP_HOME}")

    return env_vars

This function generate_common_environment_variables does not take any container_clp_config parameter. It only optionally includes the CLP_HOME environment variable and always includes PYTHONPATH.


Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 348-374):

def generate_credential_environment_variables(
    container_clp_config: CLPConfig,
    include_db_credentials=False,
    include_queue_credentials=False,
    include_redis_credentials=False,
) -> List[str]:
    """
    Generate a list of credential environment variables for Docker containers.

    :param container_clp_config:
    :param include_db_credentials:
    :param include_queue_credentials:
    :param include_redis_credentials:
    :return: List of environment variable strings in the format "KEY=VALUE".
    """
    env_vars = []

    if include_db_credentials:
        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")

    if include_queue_credentials:
        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")

    if include_redis_credentials:
        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")

    return env_vars

This function generate_credential_environment_variables explicitly takes container_clp_config as a parameter and uses it to generate environment variables related to credentials.


Summary of generate_container_config function (lines 180-236):

This function takes a clp_config and clp_home and returns a tuple (container_clp_config, docker_mounts) where:

  • container_clp_config is a deep copy of clp_config with paths adjusted for container usage.
  • docker_mounts is an instance of CLPDockerMounts containing DockerMount objects for various directories.

This function prepares the container configuration but does not call generate_common_environment_variables.


Summary of generate_container_start_cmd function (lines 270-300):

This function generates the Docker run command line as a list of strings. It accepts:

  • container_name
  • container_mounts (list of DockerMount or None)
  • container_image
  • extra_env_vars (optional dictionary of environment variables)

It sets some environment variables like PYTHONPATH and CLP_HOME internally, but it does not call generate_common_environment_variables with a container_clp_config argument.


Summary of generate_common_environment_variables usage:

  • The function generate_common_environment_variables does not accept a container_clp_config argument.
  • It only optionally includes CLP_HOME and always includes PYTHONPATH.
  • Credential environment variables are generated separately by generate_credential_environment_variables which does accept container_clp_config.

Summary of CLPConfig class (from clp_py_utils/clp_config.py):

  • CLPConfig is a configuration class with attributes like database, queue, redis, etc.
  • It has methods to load credentials from files or environment variables.
  • It is used as the type for container_clp_config in the environment variable generation functions.

These snippets and summaries provide context on the function signatures and usage of generate_common_environment_variables and related functions in the file general.py. They clarify that generate_common_environment_variables is not called with container_clp_config anywhere in this file.


🧠 Learnings used
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.409Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
components/clp-package-utils/clp_package_utils/general.py (6)

99-99: Nit: align typing style (use Optional instead of typing.Optional)

You already import Optional; use it here for consistency.

Apply this diff:

-        self.generated_config_file: typing.Optional[DockerMount] = None
+        self.generated_config_file: Optional[DockerMount] = None

290-301: Make generated config bind mount read-only

The generated config should not be modified by containers. Mark the mount as read-only.

Apply this diff:

-        docker_mounts.generated_config_file = DockerMount(
-            DockerMountType.BIND,
-            clp_config.get_generated_config_file_path(),
-            container_clp_config.get_generated_config_file_path(),
-        )
+        docker_mounts.generated_config_file = DockerMount(
+            DockerMountType.BIND,
+            clp_config.get_generated_config_file_path(),
+            container_clp_config.get_generated_config_file_path(),
+            True,
+        )

391-397: Security: avoid leaking secrets via CLI env “-e KEY=VALUE”

Passing credentials via “-e KEY=VALUE” exposes them in process lists, shell history, and build logs. Prefer emitting only keys in the Docker CLI and pass the actual values via subprocess.run(..., env=...) so values don’t appear on the command line.

Two-step hardening (minimal surface change here, update call sites accordingly):

  • Build docker cmd with “-e KEY” only.
  • At container start, merge os.environ with the env map and pass it via subprocess’s env parameter.

Apply this diff for the builder:

-    env_vars = {
-        "PYTHONPATH": clp_site_packages_dir,
-        **(extra_env_vars if extra_env_vars is not None else {}),
-    }
-    for key, value in env_vars.items():
-        container_start_cmd.extend(["-e", f"{key}={value}"])
+    env_vars = {
+        "PYTHONPATH": str(clp_site_packages_dir),
+        **(extra_env_vars if extra_env_vars is not None else {}),
+    }
+    for key in env_vars.keys():
+        container_start_cmd.extend(["-e", key])

Follow-up: I can generate the required call-site changes (merging env dicts and passing env=...) across scripts if you’d like.


681-690: Defensive checks: avoid emitting None-valued credentials into env

If credentials aren’t loaded, this can produce “KEY=None” entries. Raise early with a clear message.

Apply this diff:

-    if include_db_credentials:
-        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
-        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
+    if include_db_credentials:
+        if (
+            container_clp_config.database.username is None
+            or container_clp_config.database.password is None
+        ):
+            raise ValueError("Database credentials not loaded.")
+        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
+        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
@@
-    if include_queue_credentials:
-        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
-        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
+    if include_queue_credentials:
+        if (
+            container_clp_config.queue.username is None
+            or container_clp_config.queue.password is None
+        ):
+            raise ValueError("Queue credentials not loaded.")
+        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
+        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
@@
-    if include_redis_credentials:
-        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
+    if include_redis_credentials:
+        if container_clp_config.redis.password is None:
+            raise ValueError("Redis password not loaded.")
+        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")

362-377: Nit: clarify docstring about extra_env_vars and reserved keys

Document that PYTHONPATH is pre-set and that callers should avoid overriding container-managed keys unless intentionally doing so.

Apply this diff:

-    :param extra_env_vars: Environment variables to set on top of the predefined ones.
+    :param extra_env_vars: Environment variables to set on top of the predefined ones.
+                           Note: PYTHONPATH is set by this function; avoid overriding unless
+                           intentional. Prefer not to pass secrets here if command-line visibility
+                           is a concern; pass them via subprocess env instead.

701-709: URL-encode credentials in Celery URLs; handle optional Redis auth

Raw credentials can break URLs if they contain reserved characters. Also omit the “:password@” segment when Redis has no password.

Apply this diff:

-    env_vars = [
-        f"BROKER_URL=amqp://"
-        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
-        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
-        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
-        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
-        f"{container_clp_config.redis.query_backend_database}",
-    ]
+    from urllib.parse import quote
+    q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="")
+    q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="")
+    broker = f"amqp://{q_user}:{q_pass}@{container_clp_config.queue.host}:{container_clp_config.queue.port}"
+
+    r_pass = container_clp_config.redis.password
+    redis_db = container_clp_config.redis.query_backend_database
+    if r_pass:
+        backend = (
+            "redis://default:"
+            f"{quote(r_pass, safe='')}@{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+        )
+    else:
+        backend = f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+
+    env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4124b60 and 3fe8368.

📒 Files selected for processing (1)
  • components/clp-package-utils/clp_package_utils/general.py (7 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (14)
  • get_generated_config_file_path (925-926)
  • dump_to_primitive_dict (249-250)
  • dump_to_primitive_dict (361-362)
  • dump_to_primitive_dict (456-457)
  • dump_to_primitive_dict (552-553)
  • dump_to_primitive_dict (569-572)
  • dump_to_primitive_dict (601-604)
  • dump_to_primitive_dict (702-705)
  • dump_to_primitive_dict (724-727)
  • dump_to_primitive_dict (934-950)
  • dump_to_primitive_dict (962-970)
  • load_credentials_from_file (252-262)
  • load_credentials_from_file (364-373)
  • load_credentials_from_file (459-469)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (5)
components/clp-package-utils/clp_package_utils/general.py (5)

12-12: LGTM: typing imports expanded appropriately

Importing Dict, List, Optional, Tuple is fine and used below.


17-18: LGTM: CLP_GENERATED_CONFIG_FILENAME import

Matches new shared-config workflow.


343-347: LGTM: thin wrapper for per-container config dump

Clear and reuses shared dump logic.


349-360: LGTM: shared generated-config dumper

Docstring and behaviour align with the new “single generated config file” approach.


475-490: LGTM: switch to component-level load_credentials_from_file

Consistent with refactor centralizing credential loading on the sub-configs.

Comment on lines 325 to 340
def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str):
"""
Writes the given config to the logs directory so that it's accessible in the container.
Writes the given container config to the logs directory, so that it's accessible in the
container.
:param container_clp_config: The config to write.
:param clp_config: The corresponding config on the host (used to determine the logs directory).
:param container_name:
:param config_filename:
:return: The path to the config file in the container and on the host.
"""
container_config_filename = f".{container_name}-config.yml"
config_file_path_on_host = clp_config.logs_directory / container_config_filename
config_file_path_on_container = container_clp_config.logs_directory / container_config_filename
config_file_path_on_host = clp_config.logs_directory / config_filename
config_file_path_on_container = container_clp_config.logs_directory / config_filename
with open(config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)

return config_file_path_on_container, config_file_path_on_host
Copy link
Contributor

@coderabbitai coderabbitai bot Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Ensure directory exists before writing config; add return type

Prevent failures when the node-specific logs directory hasn’t been created yet. Also, the function returns a tuple—add its annotation.

Apply this diff:

-def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str):
+def dump_config(
+    container_clp_config: CLPConfig,
+    clp_config: CLPConfig,
+    config_filename: str,
+) -> Tuple[pathlib.Path, pathlib.Path]:
@@
-    with open(config_file_path_on_host, "w") as f:
+    config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
+    with open(config_file_path_on_host, "w") as f:
         yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str):
"""
Writes the given config to the logs directory so that it's accessible in the container.
Writes the given container config to the logs directory, so that it's accessible in the
container.
:param container_clp_config: The config to write.
:param clp_config: The corresponding config on the host (used to determine the logs directory).
:param container_name:
:param config_filename:
:return: The path to the config file in the container and on the host.
"""
container_config_filename = f".{container_name}-config.yml"
config_file_path_on_host = clp_config.logs_directory / container_config_filename
config_file_path_on_container = container_clp_config.logs_directory / container_config_filename
config_file_path_on_host = clp_config.logs_directory / config_filename
config_file_path_on_container = container_clp_config.logs_directory / config_filename
with open(config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
return config_file_path_on_container, config_file_path_on_host
def dump_config(
container_clp_config: CLPConfig,
clp_config: CLPConfig,
config_filename: str,
) -> Tuple[pathlib.Path, pathlib.Path]:
"""
Writes the given container config to the logs directory, so that it's accessible in the
container.
:param container_clp_config: The config to write.
:param clp_config: The corresponding config on the host (used to determine the logs directory).
:param config_filename:
:return: The path to the config file in the container and on the host.
"""
config_file_path_on_host = clp_config.logs_directory / config_filename
config_file_path_on_container = container_clp_config.logs_directory / config_filename
config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
with open(config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
return config_file_path_on_container, config_file_path_on_host
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 325 to
340, ensure the target logs directory exists before opening the file and add an
explicit return type annotation for the function; to fix, call
mkdir(parents=True, exist_ok=True) on clp_config.logs_directory (or the parent
directory of config_file_path_on_host) before opening the file to write, and
update the function signature to include the return type -> Tuple[Path, Path]
(and import Tuple and Path from typing/pathlib if not already imported).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the parent path doesn't exist, it's probably fine to let it fail to show any implementation error during development

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junhaoliao That's a fair point about letting it fail fast to catch implementation errors during development. The explicit failure when the directory doesn't exist will indeed help identify setup issues early rather than masking them with automatic directory creation.

The return type annotation suggestion is still valid though, as it's separate from the directory creation behavior and improves type safety:

def dump_config(
    container_clp_config: CLPConfig,
    clp_config: CLPConfig,
    config_filename: str,
) -> Tuple[pathlib.Path, pathlib.Path]:

✏️ Learnings added
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@junhaoliao
Copy link
Member Author

@kirkrodrigues i just revalidated. it's ready for review again

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)

123-151: Bug: DB credentials are not passed to the container

Unlike other scripts, this wrapper does not pass CLP_DB_USER/CLP_DB_PASS to the container, so native.dataset_manager inside the container likely cannot connect to the DB.

Apply this change locally (minimal fix consistent with current general.generate_container_start_cmd API):

@@
     container_name = generate_container_name("dataset-manager")
 
     container_clp_config, mounts = generate_container_config(clp_config, clp_home)
     generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
-        container_clp_config, clp_config, f".{container_name}-config.yml"
+        container_clp_config, clp_config, f".{container_name}-config.yml"
     )
@@
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container
-    )
+    # Pass DB credentials to the container
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
+    )

Outside this range, add the imports at the top:

from clp_py_utils.clp_config import CLP_DB_USER_ENV_VAR_NAME, CLP_DB_PASS_ENV_VAR_NAME

Optional hardening (recommended across the repo): avoid leaking secrets via Docker argv by emitting only “-e KEY” and injecting values via subprocess env; see related comments on archive_manager/compress/decompress and the generate_container_start_cmd refactor.

♻️ Duplicate comments (20)
components/clp-py-utils/clp_py_utils/clp_config.py (5)

252-263: Deprecate file-based DB credential loader to align with env-only policy

This PR’s objective is to pass credentials exclusively via environment variables for containers. Keep file-based loading only for non-container/legacy workflows and add a clear deprecation note to guide callers.

Apply:

     def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        """
+        DEPRECATED: Prefer environment variables via load_credentials_from_env().
+        Retained for non-container workflows only; slated for removal in a future release.
+        """
         config = read_yaml_config_file(credentials_file_path)

264-270: Fail fast on empty DB creds loaded from env

_get_env_var() guards missing vars, but empty strings still pass. Validate non-empty here to surface misconfig early.

Apply:

     def load_credentials_from_env(self):
         """
         :raise ValueError: if any expected environment variable is not set.
         """
-        self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME)
-        self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME)
+        self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME)
+        self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME)
+        if not self.username:
+            raise ValueError(f"{CLP_DB_USER_ENV_VAR_NAME} cannot be empty")
+        if not self.password:
+            raise ValueError(f"{CLP_DB_PASS_ENV_VAR_NAME} cannot be empty")

364-374: Deprecate file-based Redis credential loader (containers should use env)

Mark this as deprecated for parity with DB/Queue and to steer callers away from file-based secrets.

Apply:

     def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        """
+        DEPRECATED: Prefer environment variables via load_credentials_from_env().
+        Retained for non-container workflows only; slated for removal in a future release.
+        """
         config = read_yaml_config_file(credentials_file_path)

459-470: Deprecate file-based Queue credential loader

Same rationale as DB/Redis — containers should use env-vars.

Apply:

     def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
+        """
+        DEPRECATED: Prefer environment variables via load_credentials_from_env().
+        Retained for non-container workflows only; slated for removal in a future release.
+        """
         config = read_yaml_config_file(credentials_file_path)

471-477: Fail fast on empty Queue creds loaded from env

Prevent silent propagation of blank credentials.

Apply:

     def load_credentials_from_env(self):
         """
         :raise ValueError: if any expected environment variable is not set.
         """
-        self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME)
-        self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME)
+        self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME)
+        self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME)
+        if not self.username:
+            raise ValueError(f"{CLP_QUEUE_USER_ENV_VAR_NAME} cannot be empty")
+        if not self.password:
+            raise ValueError(f"{CLP_QUEUE_PASS_ENV_VAR_NAME} cannot be empty")
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

225-231: Security: DB credentials leak via Docker CLI (-e KEY=VALUE)

Passing secrets as “-e KEY=VALUE” exposes them via ps and error logs. Emit only names on the CLI and inject values via subprocess env.

Apply within this file after updating general.generate_container_start_cmd to accept names only:

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd: List[str] = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    # Provide only names to the Docker CLI; inject values via subprocess env.
+    container_start_cmd: List[str] = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Then execute with injected env (outside this range):

import os  # ensure imported at top
# ...
proc = subprocess.run(cmd, env={**os.environ, **extra_env_vars})

Additionally, avoid logging full command lines that include env values.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

227-233: Security: DB credentials leak via Docker CLI (-e KEY=VALUE)

Same issue: secrets are exposed via argv. Switch to passing only names on the CLI and values via subprocess env.

Apply here once general.generate_container_start_cmd is refactored to take env var names:

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this range:

import os  # at top if missing
# ...
subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

128-134: Security: DB credentials leak via Docker CLI (-e KEY=VALUE) in file extraction path

Do not embed secret values in argv; pass only names on the CLI and inject values via subprocess env.

Apply after refactoring general.generate_container_start_cmd:

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this range:

import os  # at top if missing
# ...
subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})

201-210: Security: DB credentials leak via Docker CLI (-e KEY=VALUE) in stream extraction path

Same fix as above for the second path in this script.

Apply:

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this range:

# Ensure subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
components/clp-package-utils/clp_package_utils/scripts/search.py (1)

120-126: Security: Avoid exposing DB credentials via Docker CLI

Database credentials are currently exposed in the Docker command line through -e KEY=VALUE, making them visible via ps and in logs. This is the same security issue flagged in previous reviews.

The extra_env_vars dictionary contains sensitive database credentials that will be embedded in the Docker command. This needs to be fixed by either:

  1. Passing only environment variable names to the Docker CLI (e.g., -e KEY) and providing values via subprocess environment
  2. Using Docker secrets or config files for credential management

See the previous review comment for the detailed fix.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)

610-616: Scheduler logging level misconfiguration persists

The compression scheduler still incorrectly uses the query scheduler's logging level. This known issue is already tracked.

As previously identified and tracked in issue #1220, Line 615 sets CLP_LOGGING_LEVEL from clp_config.query_scheduler.logging_level for both schedulers.


739-763: Redis database parameter ignored in generic_start_worker

The redis_database parameter passed to generic_start_worker isn't used when calling get_celery_connection_env_vars_list, causing workers to always use the query backend database.

Line 741 calls get_celery_connection_env_vars_list(container_clp_config) without passing the redis_database parameter, so the RESULT_BACKEND will always point to the query backend database instead of the worker-specific one.


932-960: Security: WebUI exposes DB and AWS credentials on Docker CLI

Database credentials and AWS credentials are embedded in the Docker command line as KEY=VALUE pairs.

Lines 934 and 947-948 add credentials directly to the command line. These should be passed via subprocess environment variables instead.


1074-1093: Security: Garbage collector leaks credentials on Docker CLI

Database and AWS credentials are exposed in the Docker command line.

Line 1076 adds database credentials to the Docker command via get_credential_env_vars_list, exposing them in process listings. The same issue exists with AWS credentials on line 1091.


251-261: Security: DB credentials leak via Docker CLI in create_db_tables

Database credentials are being embedded directly in the Docker command line, exposing them via process listings and debug logs.

The env_vars list contains CLP_DB_USER=<value> and CLP_DB_PASS=<value> which will be visible in process listings. This needs to be fixed as noted in previous reviews.


610-623: Security: Credentials exposed on Docker CLI for schedulers

Database and AWS credentials are passed as -e KEY=VALUE on the Docker command line, making them visible via process listings.

Both database credentials (via get_credential_env_vars_list) and AWS credentials (when applicable) are exposed in the Docker command. This needs the same fix as other components.

components/clp-package-utils/clp_package_utils/general.py (4)

385-391: Security: Container start command embeds secrets in Docker CLI

The generate_container_start_cmd function adds environment variables as -e KEY=VALUE to the Docker command, exposing any secrets passed via extra_env_vars.

Lines 389-390 embed the full KEY=VALUE pairs in the Docker command line. To avoid leaking secrets:

  1. Emit only -e KEY (without values) in the Docker command
  2. Pass actual values via the env parameter of subprocess.run()

This requires updating all callers to provide environment variables through subprocess rather than the command line.


689-704: Celery connection always uses query backend database

The function hardcodes container_clp_config.redis.query_backend_database for the RESULT_BACKEND, preventing proper database isolation between compression and query operations.

Line 701 always uses query_backend_database. The function needs to accept a redis_database parameter to allow callers to specify the correct database for each scheduler/worker type.


675-677: Add defensive checks for database credentials

The function should verify credentials are loaded before using them to avoid propagating None values.

Add validation before using credentials:

 if include_db_credentials:
+    container_clp_config.database.ensure_credentials_loaded()
     env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
     env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")

335-338: Ensure parent directory exists before writing config file

The function writes to clp_config.logs_directory / config_filename without ensuring the directory exists.

Create the parent directory before writing:

 config_file_path_on_host = clp_config.logs_directory / config_filename
 config_file_path_on_container = container_clp_config.logs_directory / config_filename
+config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
 with open(config_file_path_on_host, "w") as f:
     yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3fe8368 and ae8f581.

📒 Files selected for processing (8)
  • components/clp-package-utils/clp_package_utils/general.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (13 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (7)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
  • DockerMount (66-85)
  • generate_container_start_cmd (356-397)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (356-397)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (356-397)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
  • get_shared_config_file_path (925-926)
  • load_credentials_from_file (252-262)
  • load_credentials_from_file (364-373)
  • load_credentials_from_file (459-469)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • generate_container_start_cmd (356-397)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
  • read_yaml_config_file (56-62)
  • get_config_value (25-39)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (4)
  • dump_shared_container_config (343-353)
  • get_celery_connection_env_vars_list (689-704)
  • get_credential_env_vars_list (659-686)
  • get_common_env_vars_list (642-656)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • get_shared_config_file_path (925-926)
  • StorageType (107-109)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_container_auth_options (101-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (13)
components/clp-py-utils/clp_py_utils/clp_config.py (9)

84-84: Good: shared generated-config filename is centralized

Defining CLP_SHARED_CONFIG_FILENAME avoids string drift across call sites.


88-94: Good: env-var names hoisted to constants

Centralizing these reduces typos and eases reuse across scripts and services.


249-251: Good: succinct secret stripping in DB dump

Using Pydantic’s exclude keeps credentials out of the serialized config.


361-363: Good: Redis secret excluded from dump

Keeps password out of generated/shared config.


375-380: Clarify Redis password policy vs. implementation

Model marks password Optional[str] = None, but loader requires CLP_REDIS_PASS to be set. Confirm intent:

  • If Redis auth is mandatory in container flows, keep as-is and update docstrings elsewhere accordingly.
  • If password is truly optional, consider allowing unset env without raising and rely on None to mean “no-auth”.

Would you like me to patch load_credentials_from_env to treat a missing CLP_REDIS_PASS as unauthenticated Redis?


453-458: Good: Queue secrets excluded from dump

Avoids accidental leakage into generated config.


780-785: Good: simplified _get_env_var with os.getenv

This is concise and produces cleaner error messages than KeyError.


925-926: Good: shared-config path uses the declared constant

Avoids hardcoding ".clp-config.yml" in multiple places.


935-946: Good: maintainable top-level dump via excluded keys + loop

Reduces duplication and keeps serialization cohesive for nested configs.

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)

127-128: LGTM: dot-prefixed generated config filename

Matches the new convention for hidden, on-host generated config files.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

217-218: LGTM: dot-prefixed generated config filename

Consistent with the new hidden-on-host convention for generated configs.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

207-208: LGTM: dot-prefixed generated config filename

Consistent with the standardized approach for generated configs.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

104-105: LGTM: dot-prefixed generated config filename

Matches the new hidden filename convention and keeps host filesystem tidy.

Comment on lines +695 to +701
env_vars = [
f"BROKER_URL=amqp://"
f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
f"{container_clp_config.redis.query_backend_database}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

URL-encode credentials in Celery connection strings

Credentials containing special characters (e.g., @, :, /) will break the connection URLs.

Apply URL encoding to credentials before embedding them in connection strings:

+from urllib.parse import quote

 def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
     """
     :param container_clp_config:
     :return: A list of Celery connection environment variables for Docker containers, in the format
     "KEY=VALUE".
     """
+    queue_user = quote(container_clp_config.queue.username, safe='')
+    queue_pass = quote(container_clp_config.queue.password, safe='')
+    redis_pass = quote(container_clp_config.redis.password, safe='')
     env_vars = [
         f"BROKER_URL=amqp://"
-        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
+        f"{queue_user}:{queue_pass}@"
         f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
-        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
+        f"RESULT_BACKEND=redis://default:{redis_pass}@"
         f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
         f"{container_clp_config.redis.query_backend_database}",
     ]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
env_vars = [
f"BROKER_URL=amqp://"
f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
f"{container_clp_config.redis.query_backend_database}",
from urllib.parse import quote
def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
"""
:param container_clp_config:
:return: A list of Celery connection environment variables for Docker containers, in the format
"KEY=VALUE".
"""
queue_user = quote(container_clp_config.queue.username, safe='')
queue_pass = quote(container_clp_config.queue.password, safe='')
redis_pass = quote(container_clp_config.redis.password, safe='')
env_vars = [
f"BROKER_URL=amqp://"
f"{queue_user}:{queue_pass}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
f"RESULT_BACKEND=redis://default:{redis_pass}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
f"{container_clp_config.redis.query_backend_database}",
]
return env_vars
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 695 to
701, the Celery BROKER_URL and RESULT_BACKEND strings embed raw usernames and
passwords which will break when credentials contain special characters;
URL-encode the queue username and password and the redis password (using
urllib.parse.quote_plus or quote) before interpolating them into the connection
URLs so special characters are percent-encoded, and update the env_vars
construction to use the encoded credential variables.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

739-746: Workers: wire redis_database through to Celery env

generic_start_worker accepts redis_database but it isn’t used when building env vars; RESULT_BACKEND will default to the query DB. Pass it through:

-    env_vars = [
-        *get_common_env_vars_list(include_clp_home_env_var=True),
-        *get_celery_connection_env_vars_list(container_clp_config),
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=redis_database,
+        ),
         f"CLP_CONFIG_PATH={container_clp_config.get_shared_config_file_path()}",
         f"CLP_LOGS_DIR={container_logs_dir}",
         f"CLP_LOGGING_LEVEL={worker_config.logging_level}",
         f"CLP_WORKER_LOG_PATH={container_worker_log_path}",
     ]
components/clp-package-utils/clp_package_utils/general.py (1)

358-399: Optional hardening: avoid embedding env values in generated docker cmd

generate_container_start_cmd currently emits -e KEY=VALUE for all extras, which can leak secrets on the CLI. Consider emitting keys only and requiring callers to pass values via subprocess.run(env=...). If you adopt this, update docstrings accordingly and adjust call sites.

Example change:

-    if extra_env_vars is not None:
-        for key, value in extra_env_vars.items():
-            container_start_cmd.extend(["-e", f"{key}={value}"])
+    if extra_env_vars is not None:
+        for key in extra_env_vars.keys():
+            container_start_cmd.extend(["-e", key])
♻️ Duplicate comments (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

615-615: Compression scheduler logging level still taken from query scheduler (tracked)

CLP_LOGGING_LEVEL uses clp_config.query_scheduler.logging_level for both schedulers. This is the known misconfiguration tracked in your issue; leaving as-is if you prefer to fix in a follow-up.

If you want to patch now, replace with:

-        f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}",
+        f"CLP_LOGGING_LEVEL={(
+            clp_config.compression_scheduler.logging_level
+            if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+            else clp_config.query_scheduler.logging_level
+        )}",

610-616: RESULT_BACKEND should use the scheduler-specific Redis DB

The Celery result backend currently always targets the query DB. Compression scheduler should use its own DB to avoid mixing result sets.

Update the call to parameterise the Redis DB (paired with a small helper change in general.py):

-        *get_celery_connection_env_vars_list(container_clp_config),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=(
+                container_clp_config.redis.compression_backend_database
+                if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+                else container_clp_config.redis.query_backend_database
+            ),
+        ),
components/clp-package-utils/clp_package_utils/general.py (3)

99-99: Nit: prefer Optional over typing.Optional

You already import Optional; align to Optional[DockerMount] for consistency.

-        self.generated_config_file: typing.Optional[DockerMount] = None
+        self.generated_config_file: Optional[DockerMount] = None

325-336: Add return type to dump_container_config

The function returns a tuple of paths; annotate for clarity and type-safety.

-def dump_container_config(
-    container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str
-):
+def dump_container_config(
+    container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str
+) -> Tuple[pathlib.Path, pathlib.Path]:

661-689: Guard against missing credentials in env builder

Avoid emitting strings with “None” if a loader wasn’t called; add defensive checks.

-    if include_db_credentials:
-        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
-        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
+    if include_db_credentials:
+        if (
+            container_clp_config.database.username is None
+            or container_clp_config.database.password is None
+        ):
+            raise ValueError("Database credentials not loaded.")
+        env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
+        env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
@@
-    if include_queue_credentials:
-        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
-        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
+    if include_queue_credentials:
+        if (
+            container_clp_config.queue.username is None
+            or container_clp_config.queue.password is None
+        ):
+            raise ValueError("Queue credentials not loaded.")
+        env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
+        env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
@@
-    if include_redis_credentials:
-        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
+    if include_redis_credentials:
+        if container_clp_config.redis.password is None:
+            raise ValueError("Redis password not loaded.")
+        env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ae8f581 and f063bd9.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/general.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (13 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-19T14:41:28.856Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (6)
  • dump_shared_container_config (345-355)
  • generate_container_config (229-310)
  • get_celery_connection_env_vars_list (691-706)
  • get_clp_home (109-125)
  • get_common_env_vars_list (644-658)
  • get_credential_env_vars_list (661-688)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • get_shared_config_file_path (925-926)
  • StorageType (107-109)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_container_auth_options (101-177)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • get_shared_config_file_path (925-926)
  • CLPConfig (787-956)
  • load_credentials_from_file (252-262)
  • load_credentials_from_file (364-373)
  • load_credentials_from_file (459-469)
🔇 Additional comments (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)

52-58: Imports cleanup aligns with the new helpers

Good move to import and adopt the shared config dump and env-var helpers; this improves consistency and reduces duplication.


269-269: Use of shared config file path is correct

Switching to the unified shared config path is consistent with the PR objective and reduces per-component YAML churn.


1006-1014: Reducer: shared config mount and path usage look right

  • Including mounts.generated_config_file and using get_shared_config_file_path is correct and aligns with the unified-config flow.

Also applies to: 1019-1026


1246-1248: Centralised shared-config dump

Dumping the shared config once and mounting it everywhere removes duplication and the risk of per-component config drift. Good change.

components/clp-package-utils/clp_package_utils/general.py (1)

644-658: LGTM: concise and reusable common env builder

Good split; the include_clp_home_env_var flag makes intent explicit and keeps call sites tidy.

Comment on lines +290 to +301
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
clp_config.get_shared_config_file_path(),
container_clp_config.get_shared_config_file_path(),
):
docker_mounts.generated_config_file = DockerMount(
DockerMountType.BIND,
clp_config.get_shared_config_file_path(),
container_clp_config.get_shared_config_file_path(),
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Mark the shared config file mount read-only

The shared config should not be mutated in-container. Make the bind mount ro.

-        docker_mounts.generated_config_file = DockerMount(
-            DockerMountType.BIND,
-            clp_config.get_shared_config_file_path(),
-            container_clp_config.get_shared_config_file_path(),
-        )
+        docker_mounts.generated_config_file = DockerMount(
+            DockerMountType.BIND,
+            clp_config.get_shared_config_file_path(),
+            container_clp_config.get_shared_config_file_path(),
+            True,
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
clp_config.get_shared_config_file_path(),
container_clp_config.get_shared_config_file_path(),
):
docker_mounts.generated_config_file = DockerMount(
DockerMountType.BIND,
clp_config.get_shared_config_file_path(),
container_clp_config.get_shared_config_file_path(),
)
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
clp_config.get_shared_config_file_path(),
container_clp_config.get_shared_config_file_path(),
):
docker_mounts.generated_config_file = DockerMount(
DockerMountType.BIND,
clp_config.get_shared_config_file_path(),
container_clp_config.get_shared_config_file_path(),
True,
)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 290 to
301, the shared config file is currently bind-mounted into the container as
writable; change the DockerMount used for docker_mounts.generated_config_file to
be read-only so the in-container process cannot mutate it. Modify the
DockerMount construction to include the read-only option (for example by adding
mode='ro' or readonly=True or the equivalent parameter your DockerMount class
expects) when creating the bind mount, ensuring the source and target paths
remain the same.

Comment on lines +691 to +706
def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
"""
:param container_clp_config:
:return: A list of Celery connection environment variables for Docker containers, in the format
"KEY=VALUE".
"""
env_vars = [
f"BROKER_URL=amqp://"
f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
f"{container_clp_config.redis.query_backend_database}",
]

return env_vars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Celery env: parameterise Redis DB and URL-encode credentials

This helper hardcodes the query DB and embeds raw credentials. Parameterise the DB index and encode user/pass to avoid URL parsing issues.

-def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
+def get_celery_connection_env_vars_list(
+    container_clp_config: CLPConfig,
+    redis_database: Optional[int] = None,
+) -> List[str]:
     """
     :param container_clp_config:
     :return: A list of Celery connection environment variables for Docker containers, in the format
     "KEY=VALUE".
     """
-    env_vars = [
-        f"BROKER_URL=amqp://"
-        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
-        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
-        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
-        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
-        f"{container_clp_config.redis.query_backend_database}",
-    ]
+    from urllib.parse import quote
+    q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="")
+    q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="")
+    r_pass = container_clp_config.redis.password
+    redis_db = (
+        container_clp_config.redis.query_backend_database
+        if redis_database is None
+        else redis_database
+    )
+    broker = (
+        f"amqp://{q_user}:{q_pass}@"
+        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}"
+    )
+    if r_pass:
+        backend = (
+            "redis://default:"
+            f"{quote(r_pass, safe='')}@"
+            f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+        )
+    else:
+        backend = (
+            f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+        )
+    env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
 
     return env_vars
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
"""
:param container_clp_config:
:return: A list of Celery connection environment variables for Docker containers, in the format
"KEY=VALUE".
"""
env_vars = [
f"BROKER_URL=amqp://"
f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
f"{container_clp_config.redis.query_backend_database}",
]
return env_vars
def get_celery_connection_env_vars_list(
container_clp_config: CLPConfig,
redis_database: Optional[int] = None,
) -> List[str]:
"""
:param container_clp_config:
:return: A list of Celery connection environment variables for Docker containers, in the format
"KEY=VALUE".
"""
from urllib.parse import quote
q_user = "" if container_clp_config.queue.username is None else quote(
container_clp_config.queue.username, safe=""
)
q_pass = "" if container_clp_config.queue.password is None else quote(
container_clp_config.queue.password, safe=""
)
r_pass = container_clp_config.redis.password
redis_db = (
container_clp_config.redis.query_backend_database
if redis_database is None
else redis_database
)
broker = (
f"amqp://{q_user}:{q_pass}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}"
)
if r_pass:
backend = (
"redis://default:"
f"{quote(r_pass, safe='')}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
)
else:
backend = (
f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
)
env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
return env_vars
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 691 to
706, the Celery env helper currently hardcodes the Redis DB index and injects
raw credentials into URLs; update it to parameterise the Redis DB index (use the
existing config field, e.g., redis.query_backend_database or rename to redis.db
if needed) and URL-encode both broker and redis username/password to avoid URL
parsing issues. Import and use urllib.parse.quote_plus to encode
container_clp_config.queue.username/container_clp_config.queue.password and
container_clp_config.redis.username/container_clp_config.redis.password, then
build BROKER_URL and RESULT_BACKEND using the encoded credentials and the config
DB index instead of a hardcoded value; return the env_vars list as before.

Comment on lines +617 to 618
necessary_mounts = [mounts.clp_home, mounts.logs_dir, mounts.generated_config_file]
aws_mount, aws_env_vars = generate_container_auth_options(clp_config, component_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Scheduler: add-on — consider not embedding queue/redis secrets on CLI

As a follow-up hardening, avoid BROKER_URL/RESULT_BACKEND values on the CLI; pass only keys via -e and inject values via subprocess.run(env=...). This mirrors the pattern in create_db_tables/webui/garbage_collector suggestions.

Also applies to: 622-623, 628-636, 638-639

Comment on lines +1072 to 1079
mounts.generated_config_file,
]
env_vars = [
*get_common_env_vars_list(include_clp_home_env_var=True),
*get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
f"CLP_LOGS_DIR={container_logs_dir}",
f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Garbage collector: remove secrets from CLI; inject via subprocess env

Same leakage pattern as WebUI. Emit keys-only on the CLI and pass actual secret values via env.

Apply:

-    necessary_mounts = [
+    necessary_mounts = [
         mounts.clp_home,
         mounts.logs_dir,
         mounts.generated_config_file,
     ]
-    env_vars = [
-        *get_common_env_vars_list(include_clp_home_env_var=True),
-        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
-        f"CLP_LOGS_DIR={container_logs_dir}",
-        f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
-    ]
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        # Keys only; values injected via env below
+        "CLP_DB_USER",
+        "CLP_DB_PASS",
+        f"CLP_LOGS_DIR={container_logs_dir}",
+        f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
+    ]
@@
-    if aws_env_vars:
-        env_vars.extend(aws_env_vars)
+    if aws_env_vars:
+        # Convert KEY=VALUE entries into keys-only now; stash values for env injection
+        for kv in aws_env_vars:
+            env_vars.append(kv.split("=", 1)[0])
@@
-    cmd = container_start_cmd + garbage_collector_cmd
-    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+    cmd = container_start_cmd + garbage_collector_cmd
+    secrets_env = {
+        "CLP_DB_USER": container_clp_config.database.username,
+        "CLP_DB_PASS": container_clp_config.database.password,
+    }
+    if aws_env_vars:
+        for kv in aws_env_vars:
+            k, v = kv.split("=", 1)
+            secrets_env[k] = v
+    subprocess.run(
+        cmd,
+        stdout=subprocess.DEVNULL,
+        check=True,
+        env={**os.environ, **secrets_env},
+    )

Also applies to: 1087-1094, 1097-1104

kirkrodrigues
kirkrodrigues previously approved these changes Aug 19, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor nits.

For the PR title, how about:

refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

739-746: Workers: ensure RESULT_BACKEND uses the intended Redis DB

generic_start_worker receives redis_database but get_celery_connection_env_vars_list ignores it, so workers will point to the query DB. Pass the parameter:

-    env_vars = [
-        *get_common_env_vars_list(include_clp_home_env_var=True),
-        *get_celery_connection_env_vars_list(container_clp_config),
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=redis_database,
+        ),
         f"CLP_CONFIG_PATH={container_clp_config.get_shared_config_file_path()}",
         f"CLP_LOGS_DIR={container_logs_dir}",
         f"CLP_LOGGING_LEVEL={worker_config.logging_level}",
         f"CLP_WORKER_LOG_PATH={container_worker_log_path}",
     ]
@@
-    append_docker_options(container_start_cmd, necessary_mounts, env_vars)
+    append_docker_options(container_start_cmd, necessary_mounts, env_vars)

(Helper signature change in general.py required; see my comment there.)

Also applies to: 763-763

components/clp-package-utils/clp_package_utils/general.py (1)

362-397: Avoid embedding KEY=VALUE in docker argv in generate_container_start_cmd

The function emits -e KEY=VALUE which can expose secrets via ps/logs. Prefer -e KEY only and require callers to pass values via subprocess.run(env=...). If acceptable, I can provide a full patch; for now:

-    for key, value in env_vars.items():
-        container_start_cmd.extend(["-e", f"{key}={value}"])
+    for key in env_vars.keys():
+        container_start_cmd.extend(["-e", key])

Callers should merge env dicts into subprocess.run(env=...).

components/clp-py-utils/clp_py_utils/clp_config.py (1)

796-796: Syntax error in type annotation: remove ‘()’ from Reducer

Typing the field as Reducer() = Reducer() is invalid. It must be the class type on the left-hand side.

-    reducer: Reducer() = Reducer()
+    reducer: Reducer = Reducer()
♻️ Duplicate comments (10)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

226-232: Critical: DB credentials are exposed on the Docker CLI (-e KEY=VALUE). Pass only names to Docker and inject values via subprocess env.

Current generate_container_start_cmd emits "-e KEY=VALUE", which leaks secrets via process listings and logs. Provide only env var names on the docker CLI and pass the values via subprocess.run(..., env=...). This also prevents accidental leakage in debug logs.

Apply within this hunk (after updating general.generate_container_start_cmd accordingly):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd: List[str] = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd: List[str] = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk, inject values at execution and avoid logging the full command with secrets:

# Add at top-level imports
import os  # outside this hunk

# Replace the subprocess invocation and debug logging
proc = subprocess.run(cmd, env={**os.environ, **extra_env_vars})
ret_code = proc.returncode
if 0 != ret_code:
    logger.error("Archive manager failed.")
    # Avoid printing the full command string to prevent accidental leakage.
    logger.debug("Docker command failed (redacted).")

Companion change required in components/clp-package-utils/clp_package_utils/general.py: update generate_container_start_cmd to accept a list of env-var names and emit only "-e KEY" (no "=VALUE"). For example:

-def generate_container_start_cmd(
-    container_name: str,
-    container_mounts: List[Optional[DockerMount]],
-    container_image: str,
-    extra_env_vars: Optional[Dict[str, str]] = None,
-) -> List[str]:
+def generate_container_start_cmd(
+    container_name: str,
+    container_mounts: List[Optional[DockerMount]],
+    container_image: str,
+    env_var_names: Optional[List[str]] = None,
+) -> List[str]:
@@
-    env_vars = {
-        "PYTHONPATH": clp_site_packages_dir,
-        **(extra_env_vars if extra_env_vars is not None else {}),
-    }
-    for key, value in env_vars.items():
-        container_start_cmd.extend(["-e", f"{key}={value}"])
+    # Provide only names; values must be injected via subprocess.run(env=...).
+    container_start_cmd.extend(["-e", f"PYTHONPATH={clp_site_packages_dir}"])
+    for key in (env_var_names or []):
+        container_start_cmd.extend(["-e", key])

This pattern should be applied consistently across all scripts that pass secrets.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

129-135: Critical: DB credentials leak on Docker CLI; pass only env names and inject values via subprocess env. Also ensure cleanup on failure.

  • Security: Avoid "-e KEY=VALUE" on the CLI. Provide only names; inject values via subprocess.run(env=...).
  • Robustness: The generated config file is not removed on failure because unlink() is after a return in the except path—use a finally.

Apply within this hunk (after updating general.generate_container_start_cmd to accept names):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk: inject env and guarantee cleanup:

import os  # at top if missing

cmd = container_start_cmd + extract_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
except subprocess.CalledProcessError:
    logger.exception("Docker or file extraction command failed.")
    return -1
finally:
    # Remove generated files even on failure
    generated_config_path_on_host.unlink(missing_ok=True)

Optionally, factor out DB env assembly to avoid duplication:

def _build_db_env(clp_config: CLPConfig) -> dict[str, str]:
    return {
        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
    }
# ...
extra_env_vars = _build_db_env(clp_config)

202-211: Mirror the secret-handling and cleanup fixes in the stream extraction path

Apply the same secure env handling and robust cleanup:

-    generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
-        container_clp_config, clp_config, get_container_config_filename(container_name)
-    )
+    generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
+        container_clp_config, clp_config, get_container_config_filename(container_name)
+    )
@@
-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk:

import os  # if missing

cmd = container_start_cmd + extract_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
except subprocess.CalledProcessError:
    logger.exception("Docker or stream extraction command failed.")
    return -1
finally:
    generated_config_path_on_host.unlink(missing_ok=True)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)

121-127: Critical: Don’t expose DB creds via Docker CLI; inject via subprocess env and add failure-safe cleanup

  • Pass only env var names to Docker; inject values via env in subprocess.run.
  • Ensure generated config file is removed even if search fails (currently not removed on CalledProcessError).

Apply within this hunk (after general.generate_container_start_cmd is updated to accept names):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk:

import os  # if missing

cmd = container_start_cmd + search_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
finally:
    generated_config_path_on_host.unlink(missing_ok=True)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

228-234: Critical: Secrets leak via “-e KEY=VALUE”. Pass only names to Docker and inject values via subprocess env; ensure cleanup on failure.

  • Security: Avoid including DB creds on the docker argv.
  • Reliability: If the docker run fails, the generated config file isn’t removed.

Apply within this hunk (after updating general.generate_container_start_cmd):

-    extra_env_vars = {
-        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
-        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
-    }
-    container_start_cmd = generate_container_start_cmd(
-        container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
-    )
+    extra_env_vars = {
+        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
+    }
+    container_start_cmd = generate_container_start_cmd(
+        container_name,
+        necessary_mounts,
+        clp_config.execution_container,
+        list(extra_env_vars.keys()),
+    )

Outside this hunk, inject env and protect cleanup:

import os  # at top if missing

cmd = container_start_cmd + compress_cmd
try:
    subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
finally:
    generated_config_path_on_host.unlink(missing_ok=True)

Optional DRY: factor out a small helper to build DB env vars (as suggested previously) and reuse across scripts.

def _build_db_env(clp_config: CLPConfig) -> dict[str, str]:
    return {
        CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
        CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
    }
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

610-616: Fix scheduler logging level and use the correct Redis DB for RESULT_BACKEND

Two issues:

  • CLP_LOGGING_LEVEL always uses the query scheduler’s logging level.
  • get_celery_connection_env_vars_list always uses the query backend DB; compression should use its own DB.

Apply:

     env_vars = [
         *get_common_env_vars_list(),
         *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
-        *get_celery_connection_env_vars_list(container_clp_config),
+        *get_celery_connection_env_vars_list(
+            container_clp_config,
+            redis_database=(
+                container_clp_config.redis.compression_backend_database
+                if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+                else container_clp_config.redis.query_backend_database
+            ),
+        ),
         f"CLP_LOGS_DIR={container_logs_dir}",
-        f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}",
+        f"CLP_LOGGING_LEVEL={(
+            clp_config.compression_scheduler.logging_level
+            if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME
+            else clp_config.query_scheduler.logging_level
+        )}",
     ]

Note: This requires updating the helper signature in general.py (see my comment there).

components/clp-package-utils/clp_package_utils/general.py (4)

99-99: Typing style: use Optional instead of typing.Optional

You already import Optional. Keep annotations consistent.

-        self.generated_config_file: typing.Optional[DockerMount] = None
+        self.generated_config_file: Optional[DockerMount] = None

665-693: Defensively ensure credentials are loaded before emitting env vars

Prevent “None” values from being emitted when loaders weren’t called; fail fast for required creds.

     if include_db_credentials:
+        container_clp_config.database.ensure_credentials_loaded()
         env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
         env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")

     if include_queue_credentials:
+        if (
+            container_clp_config.queue.username is None
+            or container_clp_config.queue.password is None
+        ):
+            raise ValueError("Queue credentials not loaded.")
         env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
         env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")

     if include_redis_credentials:
+        if container_clp_config.redis.password is None:
+            raise ValueError("Redis password not loaded.")
         env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")

290-301: Mount the shared config file read-only

The shared config should not be mutated in-container. Mark the bind mount readonly.

-        docker_mounts.generated_config_file = DockerMount(
-            DockerMountType.BIND,
-            clp_config.get_shared_config_file_path(),
-            container_clp_config.get_shared_config_file_path(),
-        )
+        docker_mounts.generated_config_file = DockerMount(
+            DockerMountType.BIND,
+            clp_config.get_shared_config_file_path(),
+            container_clp_config.get_shared_config_file_path(),
+            True,
+        )

695-710: Celery env: parameterize Redis DB and URL-encode credentials

Hardcoding the query DB breaks compression components; also embed raw credentials which can break URL parsing. Add a redis_database parameter and encode credentials.

-def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
+def get_celery_connection_env_vars_list(
+    container_clp_config: CLPConfig,
+    redis_database: Optional[int] = None,
+) -> List[str]:
@@
-    env_vars = [
-        f"BROKER_URL=amqp://"
-        f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
-        f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
-        f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
-        f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
-        f"{container_clp_config.redis.query_backend_database}",
-    ]
+    from urllib.parse import quote
+    q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="")
+    q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="")
+    r_pass = container_clp_config.redis.password
+    db_index = (
+        container_clp_config.redis.query_backend_database
+        if redis_database is None
+        else redis_database
+    )
+    broker = f"amqp://{q_user}:{q_pass}@{container_clp_config.queue.host}:{container_clp_config.queue.port}"
+    if r_pass:
+        backend = (
+            "redis://default:"
+            f"{quote(r_pass, safe='')}@{container_clp_config.redis.host}:{container_clp_config.redis.port}/{db_index}"
+        )
+    else:
+        backend = f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{db_index}"
+    env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
 
     return env_vars

Update call sites in start_clp.py (schedulers/workers) accordingly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f063bd9 and 31342c3.

📒 Files selected for processing (8)
  • components/clp-package-utils/clp_package_utils/general.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (5 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (13 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py
📚 Learning: 2024-11-15T16:22:33.635Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-19T14:41:28.856Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧬 Code Graph Analysis (8)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • get_container_config_filename (325-326)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
  • get_container_config_filename (325-326)
  • generate_container_start_cmd (362-403)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
  • get_container_config_filename (325-326)
  • generate_container_start_cmd (362-403)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (3)
  • get_container_config_filename (325-326)
  • DockerMount (66-85)
  • generate_container_start_cmd (362-403)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
  • read_yaml_config_file (56-62)
  • get_config_value (25-39)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (14)
  • get_shared_config_file_path (925-926)
  • dump_to_primitive_dict (249-250)
  • dump_to_primitive_dict (361-362)
  • dump_to_primitive_dict (456-457)
  • dump_to_primitive_dict (552-553)
  • dump_to_primitive_dict (569-572)
  • dump_to_primitive_dict (601-604)
  • dump_to_primitive_dict (702-705)
  • dump_to_primitive_dict (724-727)
  • dump_to_primitive_dict (934-956)
  • dump_to_primitive_dict (968-976)
  • load_credentials_from_file (252-262)
  • load_credentials_from_file (364-373)
  • load_credentials_from_file (459-469)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (6)
  • dump_shared_container_config (349-359)
  • generate_container_config (229-310)
  • get_celery_connection_env_vars_list (695-710)
  • get_clp_home (109-125)
  • get_common_env_vars_list (648-662)
  • get_credential_env_vars_list (665-692)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • get_shared_config_file_path (925-926)
  • StorageType (107-109)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_container_auth_options (101-177)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
  • get_container_config_filename (325-326)
  • generate_container_start_cmd (362-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (22)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)

22-22: Switch to get_container_config_filename + updated dump signature: LGTM

Importing and using get_container_config_filename(container_name) to drive dump_container_config is consistent with the unified-config approach and avoids per-component filename logic spread across call sites.

Also aligns with our previous learning that dataset validation should be done at the wrapper level; this file maintains that pattern.

Also applies to: 128-129

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

218-219: Config filename resolution via get_container_config_filename: LGTM

Dumping config with an explicit filename derived from the container name makes the flow clearer and standardizes generated file naming across scripts.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

105-106: Config filename via get_container_config_filename: LGTM

Consistent with the new unified config flow.

components/clp-package-utils/clp_package_utils/scripts/search.py (1)

118-119: Using get_container_config_filename when dumping config: LGTM

Matches the standardization across scripts.

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

208-209: Config filename standardization via get_container_config_filename: LGTM

Good move towards a single pattern for container config file creation.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)

52-58: Imports LGTM; new helpers align with refactor goals

The added imports for dump_shared_container_config, get_celery_connection_env_vars_list, get_common_env_vars_list, and get_credential_env_vars_list match the new standardized env/config flow.


269-269: LGTM: use shared config path

Switching to the unified shared config file is correct and aligns with the PR objective.


617-617: Shared config mount added — good catch

Adding mounts.generated_config_file here prevents reliance on the logs dir mount and avoids drift.


631-636: LGTM: pass unified config to the scheduler

Using the shared config file for schedulers is consistent with the new approach.


1006-1014: LGTM: reducer picks up shared config and mounts it

Using CLP_CONFIG with the shared config file and adding the generated_config_file mount is consistent and correct.

Also applies to: 1019-1025


1246-1247: LGTM: shared config dump on startup

Dumping the shared container config once at startup ensures consumers can read non-credential settings consistently.

components/clp-package-utils/clp_package_utils/general.py (4)

325-327: LGTM: centralised container-config filename generator

Having a single helper prevents filename drift across consumers.


329-346: Docstring and return contract are clear

dump_container_config’s contract (container path and host path) is clear and matches usage. Good split from dump_shared_container_config.


471-490: File-based credential loaders invoked intentionally — confirm intent

These helpers load credentials from the credentials file. Given the PR’s scope is to avoid writing creds to temporary per-container files (not to remove the credentials file entirely), this usage is fine. If the long-term plan is env-only source of truth, consider deprecating these and switching to load_credentials_from_env().


648-663: LGTM: common env list

Returning only PYTHONPATH and optionally CLP_HOME keeps the surface minimal and re-usable.

components/clp-py-utils/clp_py_utils/clp_config.py (7)

85-86: LGTM: single source of truth for shared config filename

Using CLP_SHARED_CONFIG_FILENAME avoids filename drift and is referenced by start scripts and general.py.


88-94: LGTM: env-var name constants introduced

Centralising credential env-var names reduces typos and improves consistency across components.


347-354: Redis password optional is sensible

Allowing None for password matches Redis’ optional auth. Good alignment with downstream URL builder once encoded/omitted accordingly.


453-458: Queue: secret scrubbing looks correct

dump_to_primitive_dict excludes username/password as intended.

Also applies to: 456-458


780-785: _get_env_var: concise and clear

Helper is minimal and produces clean error messages. Good.


925-927: LGTM: shared config path uses the declared filename constant

Correct and consistent with the rest of the stack.


934-956: Top-level dump: refactor to avoid duplicating nested serialization

Your refactor to build d excluding nested keys then appending each nested dump is cleaner and maintainable.

Comment on lines +251 to +260
env_vars = [
*get_common_env_vars_list(),
*get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
]
necessary_mounts = [
mounts.clp_home,
mounts.data_dir,
mounts.logs_dir,
mounts.generated_config_file,
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Stop leaking DB credentials in create_db_tables; inject via subprocess env

Env vars currently include CLP_DB_USER/CLP_DB_PASS as KEY=VALUE and you log the full command. This leaks credentials via ps and debug logs. Emit key names only on the CLI and pass actual values via subprocess.run(env=...).

Apply:

@@
-    env_vars = [
-        *get_common_env_vars_list(),
-        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
-    ]
+    from clp_py_utils.clp_config import (
+        CLP_DB_USER_ENV_VAR_NAME,
+        CLP_DB_PASS_ENV_VAR_NAME,
+    )
+    env_vars = [
+        *get_common_env_vars_list(),
+        # Keys only; values injected via subprocess env below
+        CLP_DB_USER_ENV_VAR_NAME,
+        CLP_DB_PASS_ENV_VAR_NAME,
+    ]
@@
-    cmd = container_start_cmd + create_tables_cmd
-    logger.debug(" ".join(cmd))
-    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+    cmd = container_start_cmd + create_tables_cmd
+    logger.debug(" ".join(cmd))
+    secrets_env = {
+        CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password,
+    }
+    subprocess.run(
+        cmd,
+        stdout=subprocess.DEVNULL,
+        check=True,
+        env={**os.environ, **secrets_env},
+    )

Also applies to: 274-276

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 251-260 (and also apply same change to 274-276), the current env_vars list
contains secret values (CLP_DB_USER/CLP_DB_PASS) which are being concatenated
into the logged CLI string and can leak via ps/debug logs; instead build two
things: (1) a display-only list of env KEY names (e.g., "CLP_DB_USER",
"CLP_DB_PASS", plus other keys) that you can safely join and log, and (2) an
actual env dict mapping keys to their secret values (pulled from
get_credential_env_vars_list or container_clp_config) which you pass into
subprocess.run(..., env=env_dict); update any subprocess.run calls that
currently embed KEY=VALUE in the command to pass only the command arguments and
the env parameter, and ensure all logs print only the key names, not the values.

Comment on lines 621 to 623
if aws_env_vars:
necessary_env_vars.extend(aws_env_vars)
env_vars.extend(aws_env_vars)
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid embedding AWS secrets on the CLI; prefer keys-only + env injection

aws_env_vars are KEY=VALUE pairs and will leak via ps/logs. Emit only keys and pass values via subprocess.run(env=...). If you want, I can provide a concrete patch after you confirm keys-only mode here.

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 621-623, the code currently extends env_vars with aws_env_vars (KEY=VALUE
pairs) which embeds secret values into CLI args/logs; instead, parse
aws_env_vars into key/value pairs, add only the keys (not "KEY=VALUE") to the
command argument list (or flag list) and pass the actual values via the
subprocess environment parameter (e.g., merge into the env dict passed to
subprocess.run); ensure you handle entries without '=' safely and do not log the
values anywhere.

Comment on lines +932 to 939
env_vars = [
*get_common_env_vars_list(),
*get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
f"NODE_PATH={node_path}",
f"HOST={clp_config.webui.host}",
f"PORT={clp_config.webui.port}",
f"CLP_DB_USER={clp_config.database.username}",
f"CLP_DB_PASS={clp_config.database.password}",
f"NODE_ENV=production",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

WebUI: remove DB/AWS secrets from CLI; inject via subprocess env

Both DB and AWS credentials are emitted as KEY=VALUE on the docker argv, and then the full command is executed/logged. Switch to keys-only on the CLI and pass values via env.

Apply:

@@
-    env_vars = [
-        *get_common_env_vars_list(),
-        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
-        f"NODE_PATH={node_path}",
-        f"HOST={clp_config.webui.host}",
-        f"PORT={clp_config.webui.port}",
-        f"NODE_ENV=production",
-    ]
+    from clp_py_utils.clp_config import (
+        CLP_DB_USER_ENV_VAR_NAME,
+        CLP_DB_PASS_ENV_VAR_NAME,
+    )
+    env_vars = [
+        *get_common_env_vars_list(),
+        # Keys only; values injected via secrets_env below
+        CLP_DB_USER_ENV_VAR_NAME,
+        CLP_DB_PASS_ENV_VAR_NAME,
+        f"NODE_PATH={node_path}",
+        f"HOST={clp_config.webui.host}",
+        f"PORT={clp_config.webui.port}",
+        f"NODE_ENV=production",
+    ]
@@
-        if AwsAuthType.credentials == auth.type:
-            credentials = auth.credentials
-            env_vars.append(f"AWS_ACCESS_KEY_ID={credentials.access_key_id}")
-            env_vars.append(f"AWS_SECRET_ACCESS_KEY={credentials.secret_access_key}")
+        if AwsAuthType.credentials == auth.type:
+            # Keys only; values injected via secrets_env below
+            env_vars.append("AWS_ACCESS_KEY_ID")
+            env_vars.append("AWS_SECRET_ACCESS_KEY")
@@
-            if aws_env_vars:
-                env_vars.extend(aws_env_vars)
+            if aws_env_vars:
+                # Convert KEY=VALUE entries into keys-only; stash values to secrets_env
+                for kv in aws_env_vars:
+                    env_vars.append(kv.split("=", 1)[0])
@@
-    cmd = container_cmd + node_cmd
-    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+    cmd = container_cmd + node_cmd
+    secrets_env = {
+        CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password,
+    }
+    if StorageType.S3 == stream_storage.type:
+        auth = stream_storage.s3_config.aws_authentication
+        if AwsAuthType.credentials == auth.type:
+            credentials = auth.credentials
+            secrets_env["AWS_ACCESS_KEY_ID"] = credentials.access_key_id
+            secrets_env["AWS_SECRET_ACCESS_KEY"] = credentials.secret_access_key
+        else:
+            if aws_env_vars:
+                for kv in aws_env_vars:
+                    k, v = kv.split("=", 1)
+                    secrets_env[k] = v
+    subprocess.run(
+        cmd,
+        stdout=subprocess.DEVNULL,
+        check=True,
+        env={**os.environ, **secrets_env},
+    )

Also applies to: 947-948, 956-956, 960-960, 967-969

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 932-939 (also update the similar spots at 947-948, 956, 960, and 967-969),
the current code places full DB/AWS KEY=VALUE pairs directly into the docker
command args and logs them; change this to emit only the keys (e.g., "DB_USER",
"DB_PASS", "AWS_ACCESS_KEY_ID", etc.) on the CLI and move the corresponding
values into the subprocess environment dict passed to the process invocation
(populate env from get_credential_env_vars_list but strip values from the argv
and set them in os.environ or the subprocess env param); ensure the docker argv
contains only KEY names (no =VALUE), and remove sensitive values from any logged
command output by constructing the argv separately from the env with values kept
only in the subprocess env.

Comment on lines 1069 to 1079
necessary_mounts = [
mounts.clp_home,
mounts.logs_dir,
mounts.generated_config_file,
]
env_vars = [
*get_common_env_vars_list(include_clp_home_env_var=True),
*get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
f"CLP_LOGS_DIR={container_logs_dir}",
f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Garbage collector: remove DB/AWS secrets from CLI; inject via subprocess env

Same leak pattern as create_db_tables/WebUI. Emit keys-only and pass values via env.

Apply:

@@
-    env_vars = [
-        *get_common_env_vars_list(include_clp_home_env_var=True),
-        *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
-        f"CLP_LOGS_DIR={container_logs_dir}",
-        f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
-    ]
+    from clp_py_utils.clp_config import (
+        CLP_DB_USER_ENV_VAR_NAME,
+        CLP_DB_PASS_ENV_VAR_NAME,
+    )
+    env_vars = [
+        *get_common_env_vars_list(include_clp_home_env_var=True),
+        # Keys only; values injected via env below
+        CLP_DB_USER_ENV_VAR_NAME,
+        CLP_DB_PASS_ENV_VAR_NAME,
+        f"CLP_LOGS_DIR={container_logs_dir}",
+        f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
+    ]
@@
-    if aws_env_vars:
-        env_vars.extend(aws_env_vars)
+    if aws_env_vars:
+        for kv in aws_env_vars:
+            env_vars.append(kv.split("=", 1)[0])
@@
-    cmd = container_start_cmd + garbage_collector_cmd
-    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+    cmd = container_start_cmd + garbage_collector_cmd
+    secrets_env = {
+        CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username,
+        CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password,
+    }
+    if aws_env_vars:
+        for kv in aws_env_vars:
+            k, v = kv.split("=", 1)
+            secrets_env[k] = v
+    subprocess.run(
+        cmd,
+        stdout=subprocess.DEVNULL,
+        check=True,
+        env={**os.environ, **secrets_env},
+    )

Also applies to: 1091-1091, 1093-1101

Comment on lines +249 to +270
def dump_to_primitive_dict(self):
return self.dict(exclude={"username", "password"})

def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
config = read_yaml_config_file(credentials_file_path)
if config is None:
raise ValueError(f"Credentials file '{credentials_file_path}' is empty.")
try:
self.username = get_config_value(config, f"{DB_COMPONENT_NAME}.user")
self.password = get_config_value(config, f"{DB_COMPONENT_NAME}.password")
except KeyError as ex:
raise ValueError(
f"Credentials file '{credentials_file_path}' does not contain key '{ex}'."
)

def load_credentials_from_env(self):
"""
:raise ValueError: if any expected environment variable is not set.
"""
self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME)
self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Database: secret scrubbing + env loader look good; docstring minor

dump_to_primitive_dict correctly excludes secrets; env loader is straightforward. Small docstring nit: you’re using “:raise ValueError:” which is acceptable per team preference; no change required.

🤖 Prompt for AI Agents
components/clp-py-utils/clp_py_utils/clp_config.py around lines 249 to 270: no
functional changes required — dump_to_primitive_dict correctly excludes secrets,
load_credentials_from_file and load_credentials_from_env behave as expected, and
the docstring using ":raise ValueError:" is acceptable per team preference;
leave the code as-is.

Comment on lines +361 to +380
def dump_to_primitive_dict(self):
return self.dict(exclude={"password"})

def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
config = read_yaml_config_file(credentials_file_path)
if config is None:
raise ValueError(f"Credentials file '{credentials_file_path}' is empty.")
try:
self.password = get_config_value(config, f"{REDIS_COMPONENT_NAME}.password")
except KeyError as ex:
raise ValueError(
f"Credentials file '{credentials_file_path}' does not contain key '{ex}'."
)

def load_credentials_from_env(self):
"""
:raise ValueError: if any expected environment variable is not set.
"""
self.password = _get_env_var(CLP_REDIS_PASS_ENV_VAR_NAME)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Redis: secret scrubbing + env loader are fine; optional: docstring “raises”

Behaviour is correct; if you later standardize on “:raises ValueError:” across the file, update here for consistency.

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 361 to 380,
update the load_credentials_from_env docstring to match the project’s docstring
style by changing the ":raise ValueError:" line to ":raises ValueError:"
(plural) so it is consistent with other docstrings in the file; ensure
punctuation and spacing match adjacent docstrings.

Comment on lines +459 to 477
def load_credentials_from_file(self, credentials_file_path: pathlib.Path):
config = read_yaml_config_file(credentials_file_path)
if config is None:
raise ValueError(f"Credentials file '{credentials_file_path}' is empty.")
try:
self.username = get_config_value(config, f"{QUEUE_COMPONENT_NAME}.user")
self.password = get_config_value(config, f"{QUEUE_COMPONENT_NAME}.password")
except KeyError as ex:
raise ValueError(
f"Credentials file '{credentials_file_path}' does not contain key '{ex}'."
)

def load_credentials_from_env(self):
"""
:raise ValueError: if any expected environment variable is not set.
"""
self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME)
self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Queue: file/env loaders consistent; optional: validate non-empty values

If you want early failure on blank strings (not just missing vars), add non-empty checks here similar to DB. Can be a follow-up.

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 459 to 477,
the credential loaders accept empty strings which can silently allow invalid
credentials; add explicit non-empty validation after reading username and
password in both load_credentials_from_file and load_credentials_from_env, and
raise ValueError with a clear message including the credentials_file_path or
environment variable names when a value is empty; mirror the DB loader behavior
(check for falsy/blank values, not just missing keys) so both file and env paths
fail fast on blank credentials.

@junhaoliao junhaoliao changed the title refactor(package): Standardize configuration and credentials passing to containers (resolves #1149). refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). Aug 19, 2025
@junhaoliao junhaoliao merged commit dab1010 into y-scope:main Aug 19, 2025
8 of 9 checks passed
@junhaoliao junhaoliao deleted the db-config-file branch August 19, 2025 19:23
"--log-driver", "local"
]
env_vars = {
"PYTHONPATH": clp_site_packages_dir,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the above "-e", f"PYTHONPATH={clp_site_packages_dir}", line was not removed, as a mistake possibly during branch syncs with main, which causes the environment variable to be defined twice. The duplication isn't harmful in the correctness though.

@coderabbitai create an issue to track the removal of the deplicate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Create GitHub issue to track duplicate PYTHONPATH removal
gh issue create \
  --title "Remove duplicate PYTHONPATH environment variable in generate_container_start_cmd" \
  --body "## Description

In \`components/clp-package-utils/clp_package_utils/general.py\`, the \`generate_container_start_cmd\` function defines the PYTHONPATH environment variable twice:

1. Directly in the \`container_start_cmd\` array: \`\"-e\", f\"PYTHONPATH={clp_site_packages_dir}\"\`
2. In the \`env_vars\` dictionary, which is then iterated to add environment variables to the command

This duplication occurred during branch syncs and should be cleaned up. While not harmful to correctness, only one definition is necessary.

## References

- PR: https://github.com/y-scope/clp/pull/1152
- Comment: https://github.com/y-scope/clp/pull/1152#discussion_r2468365115
- Reported by: @junhaoliao" \
  --assignee junhaoliao \
  --repo y-scope/clp

Length of output: 42


@junhaoliao, I've created issue #1513 to track the removal of the duplicate PYTHONPATH environment variable definition in generate_container_start_cmd. The issue has been assigned to you.


🧬 Code Graph Analysis Results

[clp-package-utils/general.py] Summary: generate_container_start_cmd

  • Description: Creates the docker run command to start a container with mounts, environment variables, and a name.
  • Parameters:
    • container_name: str
    • container_mounts: List[Optional[DockerMount]]
    • container_image: str
    • extra_env_vars: Optional[Dict[str, str]]
  • Returns: List[str] representing the docker run command.
  • Key behavior:
    • Defines clp_site_packages_dir as CONTAINER_CLP_HOME / "lib" / "python3" / "site-packages".
    • Includes an environment variable "-e PYTHONPATH={clp_site_packages_dir}" unconditionally.
    • Merges extra_env_vars (if provided) into the command as additional -e KEY=VALUE pairs.
    • Appends mount arguments for non-None mounts.
  • Important details:
    • The command explicitly sets PYTHONPATH in the container startup command, which may duplicate with other environment-variable provisioning (e.g., a separate list of environment vars that also sets PYTHONPATH).
    • This function is central to how the container’s environment is prepared and could interact with other env var provisioning logic.

[clp-package-utils/general.py] Summary: get_common_env_vars_list

  • Description: Returns a list of common environment variables for Docker containers.
  • Parameters:
    • include_clp_home_env_var: bool (default True)
  • Returns: List[str] where each item is in the form KEY=VALUE.
  • Key behavior:
    • Always includes PYTHONPATH={CONTAINER_CLP_HOME}/lib/python3/site-packages.
    • If include_clp_home_env_var is True, also includes CLP_HOME={CONTAINER_CLP_HOME}.
  • Important details:
    • This list provides a standard set of env vars for containers.
    • Note: PYTHONPATH is defined here, which can interact with the PYTHONPATH set in generate_container_start_cmd (potential duplication if both methods are used).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants