-
Notifications
You must be signed in to change notification settings - Fork 83
refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). #1152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning Rate limit exceeded@junhaoliao has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 11 minutes and 41 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughAdds shared generated-config file support and mounting, moves secret credentials to environment-variable flows with new env-name constants and load-from-env methods, extends container start to accept extra environment variables, centralizes config dumping, and updates many scripts and the CLPConfig model to use the new plumbing. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant Script
participant CLPConfig
participant OS
participant Docker
User->>Script: invoke CLI (start_clp / compress / search / decompress / native)
Script->>CLPConfig: parse YAML -> CLPConfig
Script->>OS: read env (CLP_DB_*, CLP_QUEUE_*, CLP_REDIS_*)
OS-->>Script: env values
Script->>CLPConfig: clp_config.database.load_credentials_from_env()
alt shared config required
Script->>OS: dump_shared_container_config / dump_container_config(config_filename)
OS-->>Script: host config path
Script->>Docker: ensure generated_config_file mount present
end
Script->>Script: build extra_env_vars (KEY=VALUE) + common env list
Script->>Docker: generate_container_start_cmd(..., extra_env_vars)
Docker-->>Script: container started with mounts and injected env
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
732-752: Missinggenerated_config_dirmount in worker configuration.The function sets
CLP_CONFIG_PATHenvironment variable pointing to the generated config file but doesn't includemounts.generated_config_dirin the necessary mounts. This could cause the worker to fail accessing the config file.necessary_mounts = [ mounts.clp_home, mounts.data_dir, mounts.logs_dir, + mounts.generated_config_dir, ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (17)
components/clp-package-utils/clp_package_utils/general.py(8 hunks)components/clp-package-utils/clp_package_utils/scripts/archive_manager.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/compress.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/decompress.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/native/decompress.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/native/search.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/search.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(10 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(5 hunks)components/clp-py-utils/clp_py_utils/create-db-tables.py(1 hunks)components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py(3 hunks)components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py(3 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(1 hunks)components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py(1 hunks)components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py(1 hunks)
🧰 Additional context used
🧠 Learnings (15)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
📚 Learning: in `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variabl...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/archive_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/compress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/native/search.pycomponents/clp-package-utils/clp_package_utils/scripts/native/compress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/native/search.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/compress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: in clp installation scripts, consistency across platform scripts is prioritized over defensive progr...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in `fs_compression_task.py`, when refactoring the archive processing loop in python, the `src_archiv...
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/compress.pycomponents/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: for installation scripts in the clp project, prefer explicit error handling over automatic dependenc...
Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/search.py
📚 Learning: in the current clp codebase implementation, dataset validation using validate_dataset() is performed...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/search.py
📚 Learning: in the clp project, when reviewing cmakelists.txt changes that introduce new compression library dep...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/compress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in the c++ `filedecompressor` implementations at `components/core/src/clp/clp/filedecompressor.cpp` ...
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: in the compression scheduler, the team prefers initializing in-memory caches from the database at st...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: the `query_scheduler.py` file operates with single threading, and multithreading is not used....
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: in the clp codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (40)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
34-34: LGTM: Import addition supports configuration standardization.The import of
load_worker_configaligns with the PR's goal of standardizing configuration and credential delivery to containers.components/clp-package-utils/clp_package_utils/scripts/search.py (1)
120-126: LGTM: Improved security by passing database credentials via environment variables.The addition of
extra_env_varscontaining database credentials and updating thegenerate_container_start_cmdcall aligns with the PR's security improvements. This approach prevents credentials from being embedded in temporary configuration files.components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
209-219: LGTM: Consistent credential handling standardization.The addition of
extra_env_varsfor database credentials follows the same secure pattern implemented across other scripts. This ensures consistent and secure credential delivery to containers.components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
105-131: LGTM: Secure credential handling in file extraction function.The addition of
extra_env_varscontaining database credentials and updating thegenerate_container_start_cmdcall ensures secure credential delivery for file extraction operations.
211-217: LGTM: Consistent credential handling in stream extraction function.The implementation matches the pattern used in the file extraction function, ensuring consistent and secure credential handling across both extraction methods.
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
22-22: Confirm configuration-loading updates in create-db-tables.pyI see the new
-calias and updated help text (“CLP configuration file.”) on line 22; that part looks good. However, I wasn’t able to find any use ofCLPConfig.parse_obj()orload_database_credentials_from_env()in this script—only the help text change is present. If you intended to expand the configuration-loading logic here, please verify those changes have been applied or update the summary accordingly.components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
208-211: LGTM - Credential extraction for environment variablesThe extraction of database credentials into environment variables aligns with the PR's goal to standardize credential delivery to containers.
231-231: LGTM - Environment variable injection into container start commandThe modification to pass
extra_env_varstogenerate_container_start_cmdcorrectly implements the standardized approach for credential delivery to containers.components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (2)
3-3: LGTM - Import addition for environment variable accessThe
osmodule import is necessary for accessing environment variables in the credential override block.
192-197: LGTM - Environment variable override for database credentialsThe implementation correctly overrides database credentials from environment variables and provides appropriate error handling for missing variables. This aligns with the PR's standardization of credential delivery.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
4-4: LGTM - Import addition for environment variable accessThe
osmodule import is necessary for accessing environment variables in the credential override logic.
233-238: LGTM - Environment variable override for database credentialsThe implementation correctly overrides database credentials from environment variables with proper error handling. This follows the established pattern across native scripts for credential standardization.
components/clp-package-utils/clp_package_utils/scripts/native/search.py (2)
7-7: LGTM - Import addition for environment variable accessThe
osmodule import enables access to environment variables for the credential override functionality.
301-306: LGTM - Environment variable override for database credentialsThe implementation correctly loads database credentials from environment variables with appropriate error handling. This is consistent with the pattern used across other native scripts in this PR.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)
1136-1137: LGTM - Standardized configuration loading with environment variable credentialsThe change to use
CLPConfig.load_from_generated_config_file()andload_database_credentials_from_env()correctly implements the new standardized approach for configuration and credential loading.
1138-1143: LGTM - Improved exception handlingThe exception handling is now more specific, catching
ValidationErrorandKeyErrorseparately from generic exceptions, which provides better error reporting and aligns with the new configuration loading approach.components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
406-407: LGTM! Improved configuration loading pattern.The switch to
CLPConfig.load_from_generated_config_file()and environment-based credential loading aligns well with the PR's goal to standardize configuration delivery and enhance security by separating credentials from config files.
408-414: Enhanced error handling with proper exception categorization.Good improvement in exception handling - catching specific
ValidationErrorandKeyErrorexceptions separately from generic exceptions, with appropriate logging for debugging purposes.components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (2)
42-42: Consistent command-line interface improvement.Adding the
-cshort option alongside--configimproves usability and maintains consistency with other scripts in the codebase.
48-56: Excellent standardization of configuration loading.The shift from parsing only the Database section to parsing the full CLPConfig with environment variable credential overrides aligns perfectly with the PR's security and standardization objectives. The error handling appropriately catches validation and configuration errors.
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2)
47-56: Consistent configuration standardization pattern.This follows the same secure pattern established in other initialization scripts - using full CLPConfig parsing with environment variable credential overrides. The error handling is comprehensive and appropriate.
59-60: Proper configuration object usage.Using
clp_config.databasedirectly after the unified configuration loading is the correct approach and maintains consistency with the new configuration management pattern.components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (2)
122-127: Secure credential override implementation.The environment variable credential override pattern enhances security by separating sensitive data from configuration files. The error handling appropriately catches missing environment variables.
234-239: Consistent credential override pattern.Good consistency - the same environment variable credential override pattern is applied in both extraction functions, ensuring uniform security handling throughout the script.
components/clp-py-utils/clp_py_utils/clp_config.py (6)
612-612: Well-designed generated config file path.The
generated_config_file_pathfield provides a centralized location for container configuration files, supporting the PR's goal of standardizing configuration delivery.
624-626: Proper path resolution for generated config.The generated config file path is correctly made absolute, ensuring consistent path handling across different execution contexts.
765-771: Secure environment variable credential loading.The database credential loading method properly validates that required environment variables are present and provides clear error messages for missing credentials.
772-778: Consistent queue credential handling.The queue credential loading follows the same secure pattern as the database credentials, maintaining consistency across different service credentials.
779-784: Complete credential environment variable support.The Redis credential loading completes the environment variable support for all major service credentials, enabling full separation of sensitive data from configuration files.
799-808: Enhanced security in configuration export.Excellent security improvement - removing sensitive credential fields and the generated config file path from exported configuration prevents accidental exposure of sensitive information.
components/clp-package-utils/clp_package_utils/general.py (5)
100-100: LGTM! Consistent with existing mount attributes.The addition of
generated_config_dirfollows the established pattern for optional Docker mounts in the class.
286-297: LGTM! Mount handling follows established patterns.The logic correctly checks if the path is already mounted before creating a new mount, consistent with how other paths are handled.
403-407: LGTM! Node-specific path handling is consistent.The modification correctly makes the generated config file path node-specific by including the hostname, following the same pattern as data and logs directories.
623-676: LGTM! Well-structured centralized environment variable generation.The function effectively centralizes environment variable generation with clear flag-based controls. The implementation correctly formats credentials and connection strings.
Note: Ensure that Docker logs and process listings are properly secured since credentials are passed via environment variables.
333-336: No compatibility issue withdict[str, str]
The project’s Poetry configuration specifiespython = ">= 3.9"(inclp-package-utils, as well as other modules), so built-in generics likedict[str, str]are fully supported. No changes needed.components/clp-package-utils/clp_package_utils/scripts/start_clp.py (5)
244-252: LGTM! Proper credential handling for database table creation.The function correctly uses the centralized environment variable generation with database credentials and includes all necessary mounts.
602-610: LGTM! Scheduler configuration properly centralized.The function correctly generates environment variables with Celery connection parameters and includes component-specific variables for logging.
998-1012: LGTM! Reducer configuration includes all necessary credentials.The function correctly generates environment variables with database, queue, and Redis credentials, and properly includes the generated config mount.
925-938: LGTM! WebUI configuration properly handles database credentials.The function correctly generates environment variables with database credentials and web-specific variables. The absence of
generated_config_dirmount appears intentional as WebUI uses settings.json files rather than the generated config.
1146-1148: LGTM! Config generation properly centralized.The config file is correctly generated once before starting any components, ensuring all containers use the same configuration file as intended by the PR objectives.
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py(10 hunks)components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py(1 hunks)components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py(1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: in the compression scheduler, the team prefers initializing in-memory caches from the database at st...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: the `query_scheduler.py` file operates with single threading, and multithreading is not used....
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧬 Code Graph Analysis (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
load_database_credentials_from_env(765-770)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
load_database_credentials_from_env(765-770)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (10)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)
1137-1137: LGTM! Credential loading from environment variables implemented correctly.The addition of
clp_config.load_database_credentials_from_env()aligns with the PR objective to pass credentials via environment variables instead of configuration files.components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
407-407: LGTM! Credential loading from environment variables implemented correctly.The addition of
clp_config.load_database_credentials_from_env()aligns with the PR objective to pass credentials via environment variables instead of configuration files.
412-412: Good improvement! Enhanced error logging with stack traces.Changing from
logger.error(ex)tologger.exception(ex)provides valuable stack trace information for debugging generic exceptions.components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7)
50-51: LGTM! Import additions align with configuration standardization.The new imports for
dump_container_configandgenerate_common_environment_variablessupport the centralized configuration management approach.
244-261: Well-implemented credential separation for database table creation.The changes correctly:
- Pass database credentials via environment variables using the standardized utility
- Include the generated config directory in mounts
- Reference the centralized configuration file
This improves security by avoiding credential exposure through temporary files.
602-629: Excellent standardization of scheduler configuration handling.The refactoring correctly:
- Centralizes environment variable generation with appropriate credential flags
- Maintains all necessary environment variables (logs directory, logging level)
- Adds the generated config directory to mounts
- Uses the pre-generated configuration file path
This aligns perfectly with the PR's goal of standardizing configuration delivery.
733-748: Verify if the generated config directory mount is needed for workers.The environment variable generation is correctly implemented. However, I notice that unlike other components, the
mounts.generated_config_diris not added tonecessary_mounts.Since the config path is passed via
CLP_CONFIG_PATHenvironment variable, please confirm whether workers need direct access to the generated config directory.
926-935: Appropriate credential handling for WebUI component.The changes correctly pass database credentials via environment variables while maintaining the WebUI-specific configuration approach through settings.json files.
999-1022: Comprehensive credential separation for reducer component.Excellent implementation that:
- Passes all required credentials (database, queue, redis) via environment variables
- Includes the generated config directory in mounts
- Uses the centralized configuration file
This provides a secure and standardized approach for the reducer component.
1147-1149: Perfect placement of centralized configuration generation.The
dump_container_configcall is correctly positioned after all validations and directory creation, ensuring a single configuration file is generated before any components start. This is the cornerstone of the configuration standardization effort.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
Outdated
Show resolved
Hide resolved
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
Outdated
Show resolved
Hide resolved
haiqi96
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best effort review for now
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
Outdated
Show resolved
Hide resolved
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
Outdated
Show resolved
Hide resolved
components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py
Outdated
Show resolved
Hide resolved
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py
Outdated
Show resolved
Hide resolved
…ntainer_environment_variables
… of empty dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (3)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
611-611: Consider using a getter method instead of a member variable.Based on past feedback, having
generated_config_file_pathas a member variable allows users to potentially overwrite the file path, which may not be desirable. Consider implementing this as a getter method instead:def get_generated_config_file_path(self) -> pathlib.Path: return self.logs_directory / ".clp-config.yml"This approach would ensure the path is always computed relative to the current logs_directory and prevents unwanted modifications.
765-783: Consider creating a reusable helper method for environment variable loading.The three methods have similar structure with repeated try-catch blocks for environment variable loading. As suggested in past feedback, consider extracting this into a reusable helper method:
def _get_env_var(self, name: str) -> str: try: return os.environ[name] except KeyError as ex: raise ValueError(f"Missing environment variable: {ex}")Then refactor the methods:
def load_database_credentials_from_env(self): self.database.username = self._get_env_var("CLP_DB_USER") self.database.password = self._get_env_var("CLP_DB_PASS")This reduces code duplication and improves maintainability.
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
121-127: Extract duplicated credential override logic to utils.The logic for overriding database credentials from environment variables is duplicated between
handle_extract_stream_cmdandhandle_extract_file_cmd. As previously discussed, this should be extracted into a helper function innative/utils.py.Also applies to: 233-239
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (8)
components/clp-package-utils/clp_package_utils/general.py(9 hunks)components/clp-package-utils/clp_package_utils/scripts/archive_manager.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/compress.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/decompress.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/native/decompress.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/search.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(11 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(5 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/archive_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-07-01T14:51:19.172Z
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2024-12-05T16:32:21.507Z
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.pycomponents/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-07-03T20:10:43.789Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: 2025-07-03T12:58:18.407Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: 2025-07-28T08:33:57.487Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
📚 Learning: 2024-10-10T05:50:15.348Z
Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/SchemaTreeNode.hpp:0-0
Timestamp: 2024-10-10T05:50:15.348Z
Learning: In this codebase, assertions are not used; the project policy is to avoid adding assertions in the code.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-07-25T21:29:48.947Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1126
File: .gitignore:5-5
Timestamp: 2025-07-25T21:29:48.947Z
Learning: In the CLP project, the .clang-format file is maintained in the yscope-dev-utils submodule and copied over to the main CLP repository, so it should be ignored in .gitignore to prevent accidental commits of the copied file and maintain the single source of truth in the submodule.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (6)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(330-366)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(330-366)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(330-366)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(330-366)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
make_config_path_absolute(42-53)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
CLPConfig(589-809)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
624-626: LGTM!The generated config file path is correctly made absolute using the existing
make_config_path_absoluteutility function, which is consistent with how other paths are handled in this method.
798-808: LGTM!The sensitive credential removal is correctly implemented. Excluding
generated_config_file_pathfrom the exported configuration and removing all credential fields (username,passwordfor database/queue, andusernamefor redis) ensures no sensitive information leaks through configuration exports.components/clp-package-utils/clp_package_utils/scripts/search.py (1)
121-127: LGTM!The database credentials are correctly passed as environment variables to the container. This implementation aligns with the PR objective of separating credentials from configuration files and follows the same pattern used in other scripts (
archive_manager.py,compress.py,decompress.py).components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
214-220: LGTM!The database credentials are correctly passed as environment variables to the container start command. This follows the same consistent pattern implemented across other scripts and aligns with the PR objective of externalizing sensitive credentials from configuration files.
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
126-132: LGTM!The database credentials are correctly passed as environment variables in the file extraction command. The implementation follows the same pattern used consistently across other scripts.
211-218: LGTM!The database credentials are correctly passed as environment variables in the stream extraction command. This maintains consistency with the file extraction function and other scripts in the codebase.
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
225-231: LGTM!The database credentials are correctly passed as environment variables to the container start command. This implementation is consistent with the pattern applied across all other scripts (
search.py,archive_manager.py,decompress.py) and successfully achieves the PR objective of externalizing sensitive credentials from configuration files.
| if not is_path_already_mounted( | ||
| clp_home, | ||
| CONTAINER_CLP_HOME, | ||
| clp_config.generated_config_file_path, | ||
| container_clp_config.generated_config_file_path, | ||
| ): | ||
| docker_mounts.generated_config_dir = DockerMount( | ||
| DockerMountType.BIND, | ||
| clp_config.generated_config_file_path, | ||
| container_clp_config.generated_config_file_path, | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mount the parent directory of the config file, not the file itself.
The code attempts to mount generated_config_file_path which is a file path, but Docker requires mounting directories. You should mount the parent directory instead.
Apply this diff to fix the mounting issue:
if not is_path_already_mounted(
clp_home,
CONTAINER_CLP_HOME,
- clp_config.generated_config_file_path,
- container_clp_config.generated_config_file_path,
+ clp_config.generated_config_file_path.parent,
+ container_clp_config.generated_config_file_path.parent,
):
docker_mounts.generated_config_dir = DockerMount(
DockerMountType.BIND,
- clp_config.generated_config_file_path,
- container_clp_config.generated_config_file_path,
+ clp_config.generated_config_file_path.parent,
+ container_clp_config.generated_config_file_path.parent,
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not is_path_already_mounted( | |
| clp_home, | |
| CONTAINER_CLP_HOME, | |
| clp_config.generated_config_file_path, | |
| container_clp_config.generated_config_file_path, | |
| ): | |
| docker_mounts.generated_config_dir = DockerMount( | |
| DockerMountType.BIND, | |
| clp_config.generated_config_file_path, | |
| container_clp_config.generated_config_file_path, | |
| ) | |
| if not is_path_already_mounted( | |
| clp_home, | |
| CONTAINER_CLP_HOME, | |
| clp_config.generated_config_file_path.parent, | |
| container_clp_config.generated_config_file_path.parent, | |
| ): | |
| docker_mounts.generated_config_dir = DockerMount( | |
| DockerMountType.BIND, | |
| clp_config.generated_config_file_path.parent, | |
| container_clp_config.generated_config_file_path.parent, | |
| ) |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 275 to
286, the code mounts the config file path directly, but Docker requires mounting
directories. Modify the mount source and target paths to use the parent
directory of the config file paths instead of the file paths themselves,
ensuring the mount points are directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using --mount which seems to work for file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haiqi96 You're right—Docker bind mounts support files with --mount as long as the source exists. No change is required to switch to the parent directory.
Two small hardening tweaks you might consider:
- Make the config bind read-only to avoid accidental writes:
- pass is_read_only=True when constructing the DockerMount for the generated config file.
- Optionally skip the file-level mount if the logs directory is already mounted (the file will be available via that directory), to avoid redundant mounts.
Otherwise, keeping the file mount is fine given dump_container_config runs before container start.
🧠 Learnings used
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.409Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
| def generate_common_environment_variables( | ||
| include_clp_home=True, | ||
| ) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Function signature mismatch with usage in start_clp.py.
The function generate_common_environment_variables doesn't accept container_clp_config as a parameter, but it's being called with this argument at line 1002 in start_clp.py. This will cause a TypeError at runtime.
🏁 Script executed:
#!/bin/bash
# Verify all calls to generate_common_environment_variables
rg -A 2 "generate_common_environment_variables\(" --type pyLength of output: 2317
We should confirm the exact locations in start_clp.py where the incorrect calls occur so we can update them precisely. Let’s list those lines with context:
🏁 Script executed:
#!/bin/bash
rg -n -C3 "generate_common_environment_variables" components/clp-package-utils/clp_package_utils/scripts/start_clp.pyLength of output: 1838
Fix incorrect calls to generate_common_environment_variables in start_clp.py.
The function signature in components/clp-package-utils/clp_package_utils/general.py is
def generate_common_environment_variables(
include_clp_home: bool = True,
) -> List[str]:
...It does not accept a container_clp_config argument, yet in start_clp.py it’s being called like this:
• start_clp.py:247
*generate_common_environment_variables(container_clp_config),
• start_clp.py:608
*generate_common_environment_variables(container_clp_config),
• start_clp.py:929
*generate_common_environment_variables(container_clp_config),
• start_clp.py:1002
*generate_common_environment_variables(container_clp_config, include_clp_home=True),
To resolve the TypeError at runtime, update these invocations so they only pass the include_clp_home flag (or rely on its default), for example:
- *generate_common_environment_variables(container_clp_config),
+ *generate_common_environment_variables(),
- *generate_common_environment_variables(container_clp_config, include_clp_home=True),
+ *generate_common_environment_variables(include_clp_home=True),Please apply these changes at the locations above.
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py at lines 622 to
624, the function generate_common_environment_variables only accepts the
include_clp_home boolean argument. In start_clp.py at lines 247, 608, 929, and
1002, calls to this function incorrectly pass container_clp_config as an
argument. To fix the TypeError, update these calls to remove
container_clp_config and only pass include_clp_home if needed, or rely on its
default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the code again? i don't think we are calling generate_common_environment_variables with container_clp_config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -e
echo "== Definition =="
rg -n -C1 $'^def generate_common_environment_variables\\(' components
echo ""
echo "== All call sites =="
rg -n -C3 "generate_common_environment_variables\\(" --type py
echo ""
echo "== start_clp.py call sites =="
fd start_clp.py components | xargs -I{} rg -n -C3 "generate_common_environment_variables\\(" {}
Length of output: 6350
@coderabbit-user Thanks, @haiqi96 — I rechecked the code with a repo-wide search. All calls to generate_common_environment_variables are correct; none passes container_clp_config.
Verified call sites in start_clp.py:
- Line 246: *generate_common_environment_variables()
- Line 607: *generate_common_environment_variables()
- Line 738: *generate_common_environment_variables(include_clp_home=True)
- Line 930: *generate_common_environment_variables()
- Line 1005: *generate_common_environment_variables(include_clp_home=True)
Conclusion: No TypeError risk here. The earlier note was incorrect; no changes needed.
🧬 Code Graph Analysis Results
Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 313-346):
def generate_common_environment_variables(
include_clp_home=True,
) -> List[str]:
"""
Generate a list of common environment variables for Docker containers.
:param include_clp_home:
:return: List of environment variable strings in the format "KEY=VALUE".
"""
clp_site_packages_dir = CONTAINER_CLP_HOME / "lib" / "python3" / "site-packages"
env_vars = [f"PYTHONPATH={clp_site_packages_dir}"]
if include_clp_home:
env_vars.append(f"CLP_HOME={CONTAINER_CLP_HOME}")
return env_varsThis function generate_common_environment_variables does not take any container_clp_config parameter. It only optionally includes the CLP_HOME environment variable and always includes PYTHONPATH.
Relevant snippet from components/clp-package-utils/clp_package_utils/general.py (lines 348-374):
def generate_credential_environment_variables(
container_clp_config: CLPConfig,
include_db_credentials=False,
include_queue_credentials=False,
include_redis_credentials=False,
) -> List[str]:
"""
Generate a list of credential environment variables for Docker containers.
:param container_clp_config:
:param include_db_credentials:
:param include_queue_credentials:
:param include_redis_credentials:
:return: List of environment variable strings in the format "KEY=VALUE".
"""
env_vars = []
if include_db_credentials:
env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}")
env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
if include_queue_credentials:
env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}")
env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}")
if include_redis_credentials:
env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
return env_varsThis function generate_credential_environment_variables explicitly takes container_clp_config as a parameter and uses it to generate environment variables related to credentials.
Summary of generate_container_config function (lines 180-236):
This function takes a clp_config and clp_home and returns a tuple (container_clp_config, docker_mounts) where:
container_clp_configis a deep copy ofclp_configwith paths adjusted for container usage.docker_mountsis an instance ofCLPDockerMountscontainingDockerMountobjects for various directories.
This function prepares the container configuration but does not call generate_common_environment_variables.
Summary of generate_container_start_cmd function (lines 270-300):
This function generates the Docker run command line as a list of strings. It accepts:
container_namecontainer_mounts(list ofDockerMountor None)container_imageextra_env_vars(optional dictionary of environment variables)
It sets some environment variables like PYTHONPATH and CLP_HOME internally, but it does not call generate_common_environment_variables with a container_clp_config argument.
Summary of generate_common_environment_variables usage:
- The function
generate_common_environment_variablesdoes not accept acontainer_clp_configargument. - It only optionally includes
CLP_HOMEand always includesPYTHONPATH. - Credential environment variables are generated separately by
generate_credential_environment_variableswhich does acceptcontainer_clp_config.
Summary of CLPConfig class (from clp_py_utils/clp_config.py):
CLPConfigis a configuration class with attributes likedatabase,queue,redis, etc.- It has methods to load credentials from files or environment variables.
- It is used as the type for
container_clp_configin the environment variable generation functions.
These snippets and summaries provide context on the function signatures and usage of generate_common_environment_variables and related functions in the file general.py. They clarify that generate_common_environment_variables is not called with container_clp_config anywhere in this file.
🧠 Learnings used
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.409Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (6)
components/clp-package-utils/clp_package_utils/general.py (6)
99-99: Nit: align typing style (use Optional instead of typing.Optional)You already import Optional; use it here for consistency.
Apply this diff:
- self.generated_config_file: typing.Optional[DockerMount] = None + self.generated_config_file: Optional[DockerMount] = None
290-301: Make generated config bind mount read-onlyThe generated config should not be modified by containers. Mark the mount as read-only.
Apply this diff:
- docker_mounts.generated_config_file = DockerMount( - DockerMountType.BIND, - clp_config.get_generated_config_file_path(), - container_clp_config.get_generated_config_file_path(), - ) + docker_mounts.generated_config_file = DockerMount( + DockerMountType.BIND, + clp_config.get_generated_config_file_path(), + container_clp_config.get_generated_config_file_path(), + True, + )
391-397: Security: avoid leaking secrets via CLI env “-e KEY=VALUE”Passing credentials via “-e KEY=VALUE” exposes them in process lists, shell history, and build logs. Prefer emitting only keys in the Docker CLI and pass the actual values via subprocess.run(..., env=...) so values don’t appear on the command line.
Two-step hardening (minimal surface change here, update call sites accordingly):
- Build docker cmd with “-e KEY” only.
- At container start, merge os.environ with the env map and pass it via subprocess’s env parameter.
Apply this diff for the builder:
- env_vars = { - "PYTHONPATH": clp_site_packages_dir, - **(extra_env_vars if extra_env_vars is not None else {}), - } - for key, value in env_vars.items(): - container_start_cmd.extend(["-e", f"{key}={value}"]) + env_vars = { + "PYTHONPATH": str(clp_site_packages_dir), + **(extra_env_vars if extra_env_vars is not None else {}), + } + for key in env_vars.keys(): + container_start_cmd.extend(["-e", key])Follow-up: I can generate the required call-site changes (merging env dicts and passing env=...) across scripts if you’d like.
681-690: Defensive checks: avoid emitting None-valued credentials into envIf credentials aren’t loaded, this can produce “KEY=None” entries. Raise early with a clear message.
Apply this diff:
- if include_db_credentials: - env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") - env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}") + if include_db_credentials: + if ( + container_clp_config.database.username is None + or container_clp_config.database.password is None + ): + raise ValueError("Database credentials not loaded.") + env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") + env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}") @@ - if include_queue_credentials: - env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}") - env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}") + if include_queue_credentials: + if ( + container_clp_config.queue.username is None + or container_clp_config.queue.password is None + ): + raise ValueError("Queue credentials not loaded.") + env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}") + env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}") @@ - if include_redis_credentials: - env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}") + if include_redis_credentials: + if container_clp_config.redis.password is None: + raise ValueError("Redis password not loaded.") + env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
362-377: Nit: clarify docstring about extra_env_vars and reserved keysDocument that PYTHONPATH is pre-set and that callers should avoid overriding container-managed keys unless intentionally doing so.
Apply this diff:
- :param extra_env_vars: Environment variables to set on top of the predefined ones. + :param extra_env_vars: Environment variables to set on top of the predefined ones. + Note: PYTHONPATH is set by this function; avoid overriding unless + intentional. Prefer not to pass secrets here if command-line visibility + is a concern; pass them via subprocess env instead.
701-709: URL-encode credentials in Celery URLs; handle optional Redis authRaw credentials can break URLs if they contain reserved characters. Also omit the “:password@” segment when Redis has no password.
Apply this diff:
- env_vars = [ - f"BROKER_URL=amqp://" - f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" - f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", - f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" - f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" - f"{container_clp_config.redis.query_backend_database}", - ] + from urllib.parse import quote + q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="") + q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="") + broker = f"amqp://{q_user}:{q_pass}@{container_clp_config.queue.host}:{container_clp_config.queue.port}" + + r_pass = container_clp_config.redis.password + redis_db = container_clp_config.redis.query_backend_database + if r_pass: + backend = ( + "redis://default:" + f"{quote(r_pass, safe='')}@{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}" + ) + else: + backend = f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}" + + env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
components/clp-package-utils/clp_package_utils/general.py(7 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (14)
get_generated_config_file_path(925-926)dump_to_primitive_dict(249-250)dump_to_primitive_dict(361-362)dump_to_primitive_dict(456-457)dump_to_primitive_dict(552-553)dump_to_primitive_dict(569-572)dump_to_primitive_dict(601-604)dump_to_primitive_dict(702-705)dump_to_primitive_dict(724-727)dump_to_primitive_dict(934-950)dump_to_primitive_dict(962-970)load_credentials_from_file(252-262)load_credentials_from_file(364-373)load_credentials_from_file(459-469)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (5)
components/clp-package-utils/clp_package_utils/general.py (5)
12-12: LGTM: typing imports expanded appropriatelyImporting Dict, List, Optional, Tuple is fine and used below.
17-18: LGTM: CLP_GENERATED_CONFIG_FILENAME importMatches new shared-config workflow.
343-347: LGTM: thin wrapper for per-container config dumpClear and reuses shared dump logic.
349-360: LGTM: shared generated-config dumperDocstring and behaviour align with the new “single generated config file” approach.
475-490: LGTM: switch to component-level load_credentials_from_fileConsistent with refactor centralizing credential loading on the sub-configs.
| def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str): | ||
| """ | ||
| Writes the given config to the logs directory so that it's accessible in the container. | ||
| Writes the given container config to the logs directory, so that it's accessible in the | ||
| container. | ||
| :param container_clp_config: The config to write. | ||
| :param clp_config: The corresponding config on the host (used to determine the logs directory). | ||
| :param container_name: | ||
| :param config_filename: | ||
| :return: The path to the config file in the container and on the host. | ||
| """ | ||
| container_config_filename = f".{container_name}-config.yml" | ||
| config_file_path_on_host = clp_config.logs_directory / container_config_filename | ||
| config_file_path_on_container = container_clp_config.logs_directory / container_config_filename | ||
| config_file_path_on_host = clp_config.logs_directory / config_filename | ||
| config_file_path_on_container = container_clp_config.logs_directory / config_filename | ||
| with open(config_file_path_on_host, "w") as f: | ||
| yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f) | ||
|
|
||
| return config_file_path_on_container, config_file_path_on_host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Ensure directory exists before writing config; add return type
Prevent failures when the node-specific logs directory hasn’t been created yet. Also, the function returns a tuple—add its annotation.
Apply this diff:
-def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str):
+def dump_config(
+ container_clp_config: CLPConfig,
+ clp_config: CLPConfig,
+ config_filename: str,
+) -> Tuple[pathlib.Path, pathlib.Path]:
@@
- with open(config_file_path_on_host, "w") as f:
+ config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
+ with open(config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def dump_config(container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str): | |
| """ | |
| Writes the given config to the logs directory so that it's accessible in the container. | |
| Writes the given container config to the logs directory, so that it's accessible in the | |
| container. | |
| :param container_clp_config: The config to write. | |
| :param clp_config: The corresponding config on the host (used to determine the logs directory). | |
| :param container_name: | |
| :param config_filename: | |
| :return: The path to the config file in the container and on the host. | |
| """ | |
| container_config_filename = f".{container_name}-config.yml" | |
| config_file_path_on_host = clp_config.logs_directory / container_config_filename | |
| config_file_path_on_container = container_clp_config.logs_directory / container_config_filename | |
| config_file_path_on_host = clp_config.logs_directory / config_filename | |
| config_file_path_on_container = container_clp_config.logs_directory / config_filename | |
| with open(config_file_path_on_host, "w") as f: | |
| yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f) | |
| return config_file_path_on_container, config_file_path_on_host | |
| def dump_config( | |
| container_clp_config: CLPConfig, | |
| clp_config: CLPConfig, | |
| config_filename: str, | |
| ) -> Tuple[pathlib.Path, pathlib.Path]: | |
| """ | |
| Writes the given container config to the logs directory, so that it's accessible in the | |
| container. | |
| :param container_clp_config: The config to write. | |
| :param clp_config: The corresponding config on the host (used to determine the logs directory). | |
| :param config_filename: | |
| :return: The path to the config file in the container and on the host. | |
| """ | |
| config_file_path_on_host = clp_config.logs_directory / config_filename | |
| config_file_path_on_container = container_clp_config.logs_directory / config_filename | |
| config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True) | |
| with open(config_file_path_on_host, "w") as f: | |
| yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f) | |
| return config_file_path_on_container, config_file_path_on_host |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 325 to
340, ensure the target logs directory exists before opening the file and add an
explicit return type annotation for the function; to fix, call
mkdir(parents=True, exist_ok=True) on clp_config.logs_directory (or the parent
directory of config_file_path_on_host) before opening the file to write, and
update the function signature to include the return type -> Tuple[Path, Path]
(and import Tuple and Path from typing/pathlib if not already imported).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the parent path doesn't exist, it's probably fine to let it fail to show any implementation error during development
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@junhaoliao That's a fair point about letting it fail fast to catch implementation errors during development. The explicit failure when the directory doesn't exist will indeed help identify setup issues early rather than masking them with automatic directory creation.
The return type annotation suggestion is still valid though, as it's separate from the directory creation behavior and improves type safety:
def dump_config(
container_clp_config: CLPConfig,
clp_config: CLPConfig,
config_filename: str,
) -> Tuple[pathlib.Path, pathlib.Path]:✏️ Learnings added
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
…`dump_shared_container_config`
|
@kirkrodrigues i just revalidated. it's ready for review again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
123-151: Bug: DB credentials are not passed to the containerUnlike other scripts, this wrapper does not pass CLP_DB_USER/CLP_DB_PASS to the container, so native.dataset_manager inside the container likely cannot connect to the DB.
Apply this change locally (minimal fix consistent with current general.generate_container_start_cmd API):
@@ container_name = generate_container_name("dataset-manager") container_clp_config, mounts = generate_container_config(clp_config, clp_home) generated_config_path_on_container, generated_config_path_on_host = dump_container_config( - container_clp_config, clp_config, f".{container_name}-config.yml" + container_clp_config, clp_config, f".{container_name}-config.yml" ) @@ - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container - ) + # Pass DB credentials to the container + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, necessary_mounts, clp_config.execution_container, extra_env_vars + )Outside this range, add the imports at the top:
from clp_py_utils.clp_config import CLP_DB_USER_ENV_VAR_NAME, CLP_DB_PASS_ENV_VAR_NAMEOptional hardening (recommended across the repo): avoid leaking secrets via Docker argv by emitting only “-e KEY” and injecting values via subprocess env; see related comments on archive_manager/compress/decompress and the generate_container_start_cmd refactor.
♻️ Duplicate comments (20)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
252-263: Deprecate file-based DB credential loader to align with env-only policyThis PR’s objective is to pass credentials exclusively via environment variables for containers. Keep file-based loading only for non-container/legacy workflows and add a clear deprecation note to guide callers.
Apply:
def load_credentials_from_file(self, credentials_file_path: pathlib.Path): + """ + DEPRECATED: Prefer environment variables via load_credentials_from_env(). + Retained for non-container workflows only; slated for removal in a future release. + """ config = read_yaml_config_file(credentials_file_path)
264-270: Fail fast on empty DB creds loaded from env_get_env_var() guards missing vars, but empty strings still pass. Validate non-empty here to surface misconfig early.
Apply:
def load_credentials_from_env(self): """ :raise ValueError: if any expected environment variable is not set. """ - self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME) - self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME) + self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME) + self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME) + if not self.username: + raise ValueError(f"{CLP_DB_USER_ENV_VAR_NAME} cannot be empty") + if not self.password: + raise ValueError(f"{CLP_DB_PASS_ENV_VAR_NAME} cannot be empty")
364-374: Deprecate file-based Redis credential loader (containers should use env)Mark this as deprecated for parity with DB/Queue and to steer callers away from file-based secrets.
Apply:
def load_credentials_from_file(self, credentials_file_path: pathlib.Path): + """ + DEPRECATED: Prefer environment variables via load_credentials_from_env(). + Retained for non-container workflows only; slated for removal in a future release. + """ config = read_yaml_config_file(credentials_file_path)
459-470: Deprecate file-based Queue credential loaderSame rationale as DB/Redis — containers should use env-vars.
Apply:
def load_credentials_from_file(self, credentials_file_path: pathlib.Path): + """ + DEPRECATED: Prefer environment variables via load_credentials_from_env(). + Retained for non-container workflows only; slated for removal in a future release. + """ config = read_yaml_config_file(credentials_file_path)
471-477: Fail fast on empty Queue creds loaded from envPrevent silent propagation of blank credentials.
Apply:
def load_credentials_from_env(self): """ :raise ValueError: if any expected environment variable is not set. """ - self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME) - self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME) + self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME) + self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME) + if not self.username: + raise ValueError(f"{CLP_QUEUE_USER_ENV_VAR_NAME} cannot be empty") + if not self.password: + raise ValueError(f"{CLP_QUEUE_PASS_ENV_VAR_NAME} cannot be empty")components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
225-231: Security: DB credentials leak via Docker CLI (-e KEY=VALUE)Passing secrets as “-e KEY=VALUE” exposes them via ps and error logs. Emit only names on the CLI and inject values via subprocess env.
Apply within this file after updating general.generate_container_start_cmd to accept names only:
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd: List[str] = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + # Provide only names to the Docker CLI; inject values via subprocess env. + container_start_cmd: List[str] = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Then execute with injected env (outside this range):
import os # ensure imported at top # ... proc = subprocess.run(cmd, env={**os.environ, **extra_env_vars})Additionally, avoid logging full command lines that include env values.
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
227-233: Security: DB credentials leak via Docker CLI (-e KEY=VALUE)Same issue: secrets are exposed via argv. Switch to passing only names on the CLI and values via subprocess env.
Apply here once general.generate_container_start_cmd is refactored to take env var names:
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this range:
import os # at top if missing # ... subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
128-134: Security: DB credentials leak via Docker CLI (-e KEY=VALUE) in file extraction pathDo not embed secret values in argv; pass only names on the CLI and inject values via subprocess env.
Apply after refactoring general.generate_container_start_cmd:
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this range:
import os # at top if missing # ... subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})
201-210: Security: DB credentials leak via Docker CLI (-e KEY=VALUE) in stream extraction pathSame fix as above for the second path in this script.
Apply:
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this range:
# Ensure subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars})components/clp-package-utils/clp_package_utils/scripts/search.py (1)
120-126: Security: Avoid exposing DB credentials via Docker CLIDatabase credentials are currently exposed in the Docker command line through
-e KEY=VALUE, making them visible viapsand in logs. This is the same security issue flagged in previous reviews.The extra_env_vars dictionary contains sensitive database credentials that will be embedded in the Docker command. This needs to be fixed by either:
- Passing only environment variable names to the Docker CLI (e.g.,
-e KEY) and providing values via subprocess environment- Using Docker secrets or config files for credential management
See the previous review comment for the detailed fix.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)
610-616: Scheduler logging level misconfiguration persistsThe compression scheduler still incorrectly uses the query scheduler's logging level. This known issue is already tracked.
As previously identified and tracked in issue #1220, Line 615 sets
CLP_LOGGING_LEVELfromclp_config.query_scheduler.logging_levelfor both schedulers.
739-763: Redis database parameter ignored in generic_start_workerThe
redis_databaseparameter passed togeneric_start_workerisn't used when callingget_celery_connection_env_vars_list, causing workers to always use the query backend database.Line 741 calls
get_celery_connection_env_vars_list(container_clp_config)without passing theredis_databaseparameter, so the RESULT_BACKEND will always point to the query backend database instead of the worker-specific one.
932-960: Security: WebUI exposes DB and AWS credentials on Docker CLIDatabase credentials and AWS credentials are embedded in the Docker command line as
KEY=VALUEpairs.Lines 934 and 947-948 add credentials directly to the command line. These should be passed via subprocess environment variables instead.
1074-1093: Security: Garbage collector leaks credentials on Docker CLIDatabase and AWS credentials are exposed in the Docker command line.
Line 1076 adds database credentials to the Docker command via
get_credential_env_vars_list, exposing them in process listings. The same issue exists with AWS credentials on line 1091.
251-261: Security: DB credentials leak via Docker CLI in create_db_tablesDatabase credentials are being embedded directly in the Docker command line, exposing them via process listings and debug logs.
The env_vars list contains
CLP_DB_USER=<value>andCLP_DB_PASS=<value>which will be visible in process listings. This needs to be fixed as noted in previous reviews.
610-623: Security: Credentials exposed on Docker CLI for schedulersDatabase and AWS credentials are passed as
-e KEY=VALUEon the Docker command line, making them visible via process listings.Both database credentials (via
get_credential_env_vars_list) and AWS credentials (when applicable) are exposed in the Docker command. This needs the same fix as other components.components/clp-package-utils/clp_package_utils/general.py (4)
385-391: Security: Container start command embeds secrets in Docker CLIThe
generate_container_start_cmdfunction adds environment variables as-e KEY=VALUEto the Docker command, exposing any secrets passed viaextra_env_vars.Lines 389-390 embed the full
KEY=VALUEpairs in the Docker command line. To avoid leaking secrets:
- Emit only
-e KEY(without values) in the Docker command- Pass actual values via the
envparameter ofsubprocess.run()This requires updating all callers to provide environment variables through subprocess rather than the command line.
689-704: Celery connection always uses query backend databaseThe function hardcodes
container_clp_config.redis.query_backend_databasefor the RESULT_BACKEND, preventing proper database isolation between compression and query operations.Line 701 always uses
query_backend_database. The function needs to accept aredis_databaseparameter to allow callers to specify the correct database for each scheduler/worker type.
675-677: Add defensive checks for database credentialsThe function should verify credentials are loaded before using them to avoid propagating
Nonevalues.Add validation before using credentials:
if include_db_credentials: + container_clp_config.database.ensure_credentials_loaded() env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}")
335-338: Ensure parent directory exists before writing config fileThe function writes to
clp_config.logs_directory / config_filenamewithout ensuring the directory exists.Create the parent directory before writing:
config_file_path_on_host = clp_config.logs_directory / config_filename config_file_path_on_container = container_clp_config.logs_directory / config_filename +config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True) with open(config_file_path_on_host, "w") as f: yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
components/clp-package-utils/clp_package_utils/general.py(7 hunks)components/clp-package-utils/clp_package_utils/scripts/archive_manager.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/compress.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/decompress.py(4 hunks)components/clp-package-utils/clp_package_utils/scripts/search.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(13 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(9 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-package-utils/clp_package_utils/general.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-package-utils/clp_package_utils/general.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/clp-package-utils/clp_package_utils/general.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (7)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
DockerMount(66-85)generate_container_start_cmd(356-397)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(356-397)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(356-397)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
get_shared_config_file_path(925-926)load_credentials_from_file(252-262)load_credentials_from_file(364-373)load_credentials_from_file(459-469)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(356-397)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
read_yaml_config_file(56-62)get_config_value(25-39)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (4)
dump_shared_container_config(343-353)get_celery_connection_env_vars_list(689-704)get_credential_env_vars_list(659-686)get_common_env_vars_list(642-656)components/clp-py-utils/clp_py_utils/clp_config.py (2)
get_shared_config_file_path(925-926)StorageType(107-109)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
generate_container_auth_options(101-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (13)
components/clp-py-utils/clp_py_utils/clp_config.py (9)
84-84: Good: shared generated-config filename is centralizedDefining CLP_SHARED_CONFIG_FILENAME avoids string drift across call sites.
88-94: Good: env-var names hoisted to constantsCentralizing these reduces typos and eases reuse across scripts and services.
249-251: Good: succinct secret stripping in DB dumpUsing Pydantic’s exclude keeps credentials out of the serialized config.
361-363: Good: Redis secret excluded from dumpKeeps password out of generated/shared config.
375-380: Clarify Redis password policy vs. implementationModel marks password Optional[str] = None, but loader requires CLP_REDIS_PASS to be set. Confirm intent:
- If Redis auth is mandatory in container flows, keep as-is and update docstrings elsewhere accordingly.
- If password is truly optional, consider allowing unset env without raising and rely on None to mean “no-auth”.
Would you like me to patch load_credentials_from_env to treat a missing CLP_REDIS_PASS as unauthenticated Redis?
453-458: Good: Queue secrets excluded from dumpAvoids accidental leakage into generated config.
780-785: Good: simplified _get_env_var with os.getenvThis is concise and produces cleaner error messages than KeyError.
925-926: Good: shared-config path uses the declared constantAvoids hardcoding ".clp-config.yml" in multiple places.
935-946: Good: maintainable top-level dump via excluded keys + loopReduces duplication and keeps serialization cohesive for nested configs.
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
127-128: LGTM: dot-prefixed generated config filenameMatches the new convention for hidden, on-host generated config files.
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
217-218: LGTM: dot-prefixed generated config filenameConsistent with the new hidden-on-host convention for generated configs.
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
207-208: LGTM: dot-prefixed generated config filenameConsistent with the standardized approach for generated configs.
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
104-105: LGTM: dot-prefixed generated config filenameMatches the new hidden filename convention and keeps host filesystem tidy.
| env_vars = [ | ||
| f"BROKER_URL=amqp://" | ||
| f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" | ||
| f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", | ||
| f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" | ||
| f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" | ||
| f"{container_clp_config.redis.query_backend_database}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
URL-encode credentials in Celery connection strings
Credentials containing special characters (e.g., @, :, /) will break the connection URLs.
Apply URL encoding to credentials before embedding them in connection strings:
+from urllib.parse import quote
def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
"""
:param container_clp_config:
:return: A list of Celery connection environment variables for Docker containers, in the format
"KEY=VALUE".
"""
+ queue_user = quote(container_clp_config.queue.username, safe='')
+ queue_pass = quote(container_clp_config.queue.password, safe='')
+ redis_pass = quote(container_clp_config.redis.password, safe='')
env_vars = [
f"BROKER_URL=amqp://"
- f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
+ f"{queue_user}:{queue_pass}@"
f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
- f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
+ f"RESULT_BACKEND=redis://default:{redis_pass}@"
f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
f"{container_clp_config.redis.query_backend_database}",
]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| env_vars = [ | |
| f"BROKER_URL=amqp://" | |
| f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" | |
| f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", | |
| f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" | |
| f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" | |
| f"{container_clp_config.redis.query_backend_database}", | |
| from urllib.parse import quote | |
| def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]: | |
| """ | |
| :param container_clp_config: | |
| :return: A list of Celery connection environment variables for Docker containers, in the format | |
| "KEY=VALUE". | |
| """ | |
| queue_user = quote(container_clp_config.queue.username, safe='') | |
| queue_pass = quote(container_clp_config.queue.password, safe='') | |
| redis_pass = quote(container_clp_config.redis.password, safe='') | |
| env_vars = [ | |
| f"BROKER_URL=amqp://" | |
| f"{queue_user}:{queue_pass}@" | |
| f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", | |
| f"RESULT_BACKEND=redis://default:{redis_pass}@" | |
| f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" | |
| f"{container_clp_config.redis.query_backend_database}", | |
| ] | |
| return env_vars |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 695 to
701, the Celery BROKER_URL and RESULT_BACKEND strings embed raw usernames and
passwords which will break when credentials contain special characters;
URL-encode the queue username and password and the redis password (using
urllib.parse.quote_plus or quote) before interpolating them into the connection
URLs so special characters are percent-encoded, and update the env_vars
construction to use the encoded credential variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
739-746: Workers: wire redis_database through to Celery env
generic_start_workeracceptsredis_databasebut it isn’t used when building env vars;RESULT_BACKENDwill default to the query DB. Pass it through:- env_vars = [ - *get_common_env_vars_list(include_clp_home_env_var=True), - *get_celery_connection_env_vars_list(container_clp_config), + env_vars = [ + *get_common_env_vars_list(include_clp_home_env_var=True), + *get_celery_connection_env_vars_list( + container_clp_config, + redis_database=redis_database, + ), f"CLP_CONFIG_PATH={container_clp_config.get_shared_config_file_path()}", f"CLP_LOGS_DIR={container_logs_dir}", f"CLP_LOGGING_LEVEL={worker_config.logging_level}", f"CLP_WORKER_LOG_PATH={container_worker_log_path}", ]components/clp-package-utils/clp_package_utils/general.py (1)
358-399: Optional hardening: avoid embedding env values in generated docker cmd
generate_container_start_cmdcurrently emits-e KEY=VALUEfor all extras, which can leak secrets on the CLI. Consider emitting keys only and requiring callers to pass values viasubprocess.run(env=...). If you adopt this, update docstrings accordingly and adjust call sites.Example change:
- if extra_env_vars is not None: - for key, value in extra_env_vars.items(): - container_start_cmd.extend(["-e", f"{key}={value}"]) + if extra_env_vars is not None: + for key in extra_env_vars.keys(): + container_start_cmd.extend(["-e", key])
♻️ Duplicate comments (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
615-615: Compression scheduler logging level still taken from query scheduler (tracked)
CLP_LOGGING_LEVELusesclp_config.query_scheduler.logging_levelfor both schedulers. This is the known misconfiguration tracked in your issue; leaving as-is if you prefer to fix in a follow-up.If you want to patch now, replace with:
- f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}", + f"CLP_LOGGING_LEVEL={( + clp_config.compression_scheduler.logging_level + if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME + else clp_config.query_scheduler.logging_level + )}",
610-616: RESULT_BACKEND should use the scheduler-specific Redis DBThe Celery result backend currently always targets the query DB. Compression scheduler should use its own DB to avoid mixing result sets.
Update the call to parameterise the Redis DB (paired with a small helper change in general.py):
- *get_celery_connection_env_vars_list(container_clp_config), + *get_celery_connection_env_vars_list( + container_clp_config, + redis_database=( + container_clp_config.redis.compression_backend_database + if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME + else container_clp_config.redis.query_backend_database + ), + ),components/clp-package-utils/clp_package_utils/general.py (3)
99-99: Nit: prefer Optional over typing.OptionalYou already import Optional; align to
Optional[DockerMount]for consistency.- self.generated_config_file: typing.Optional[DockerMount] = None + self.generated_config_file: Optional[DockerMount] = None
325-336: Add return type to dump_container_configThe function returns a tuple of paths; annotate for clarity and type-safety.
-def dump_container_config( - container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str -): +def dump_container_config( + container_clp_config: CLPConfig, clp_config: CLPConfig, config_filename: str +) -> Tuple[pathlib.Path, pathlib.Path]:
661-689: Guard against missing credentials in env builderAvoid emitting strings with “None” if a loader wasn’t called; add defensive checks.
- if include_db_credentials: - env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") - env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}") + if include_db_credentials: + if ( + container_clp_config.database.username is None + or container_clp_config.database.password is None + ): + raise ValueError("Database credentials not loaded.") + env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") + env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}") @@ - if include_queue_credentials: - env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}") - env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}") + if include_queue_credentials: + if ( + container_clp_config.queue.username is None + or container_clp_config.queue.password is None + ): + raise ValueError("Queue credentials not loaded.") + env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}") + env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}") @@ - if include_redis_credentials: - env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}") + if include_redis_credentials: + if container_clp_config.redis.password is None: + raise ValueError("Redis password not loaded.") + env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/general.py(7 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(13 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.pycomponents/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.pycomponents/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.pycomponents/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-19T14:41:28.856Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (6)
dump_shared_container_config(345-355)generate_container_config(229-310)get_celery_connection_env_vars_list(691-706)get_clp_home(109-125)get_common_env_vars_list(644-658)get_credential_env_vars_list(661-688)components/clp-py-utils/clp_py_utils/clp_config.py (2)
get_shared_config_file_path(925-926)StorageType(107-109)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
generate_container_auth_options(101-177)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
get_shared_config_file_path(925-926)CLPConfig(787-956)load_credentials_from_file(252-262)load_credentials_from_file(364-373)load_credentials_from_file(459-469)
🔇 Additional comments (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)
52-58: Imports cleanup aligns with the new helpersGood move to import and adopt the shared config dump and env-var helpers; this improves consistency and reduces duplication.
269-269: Use of shared config file path is correctSwitching to the unified shared config path is consistent with the PR objective and reduces per-component YAML churn.
1006-1014: Reducer: shared config mount and path usage look right
- Including
mounts.generated_config_fileand usingget_shared_config_file_pathis correct and aligns with the unified-config flow.Also applies to: 1019-1026
1246-1248: Centralised shared-config dumpDumping the shared config once and mounting it everywhere removes duplication and the risk of per-component config drift. Good change.
components/clp-package-utils/clp_package_utils/general.py (1)
644-658: LGTM: concise and reusable common env builderGood split; the
include_clp_home_env_varflag makes intent explicit and keeps call sites tidy.
| if not is_path_already_mounted( | ||
| clp_home, | ||
| CONTAINER_CLP_HOME, | ||
| clp_config.get_shared_config_file_path(), | ||
| container_clp_config.get_shared_config_file_path(), | ||
| ): | ||
| docker_mounts.generated_config_file = DockerMount( | ||
| DockerMountType.BIND, | ||
| clp_config.get_shared_config_file_path(), | ||
| container_clp_config.get_shared_config_file_path(), | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Mark the shared config file mount read-only
The shared config should not be mutated in-container. Make the bind mount ro.
- docker_mounts.generated_config_file = DockerMount(
- DockerMountType.BIND,
- clp_config.get_shared_config_file_path(),
- container_clp_config.get_shared_config_file_path(),
- )
+ docker_mounts.generated_config_file = DockerMount(
+ DockerMountType.BIND,
+ clp_config.get_shared_config_file_path(),
+ container_clp_config.get_shared_config_file_path(),
+ True,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not is_path_already_mounted( | |
| clp_home, | |
| CONTAINER_CLP_HOME, | |
| clp_config.get_shared_config_file_path(), | |
| container_clp_config.get_shared_config_file_path(), | |
| ): | |
| docker_mounts.generated_config_file = DockerMount( | |
| DockerMountType.BIND, | |
| clp_config.get_shared_config_file_path(), | |
| container_clp_config.get_shared_config_file_path(), | |
| ) | |
| if not is_path_already_mounted( | |
| clp_home, | |
| CONTAINER_CLP_HOME, | |
| clp_config.get_shared_config_file_path(), | |
| container_clp_config.get_shared_config_file_path(), | |
| ): | |
| docker_mounts.generated_config_file = DockerMount( | |
| DockerMountType.BIND, | |
| clp_config.get_shared_config_file_path(), | |
| container_clp_config.get_shared_config_file_path(), | |
| True, | |
| ) |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 290 to
301, the shared config file is currently bind-mounted into the container as
writable; change the DockerMount used for docker_mounts.generated_config_file to
be read-only so the in-container process cannot mutate it. Modify the
DockerMount construction to include the read-only option (for example by adding
mode='ro' or readonly=True or the equivalent parameter your DockerMount class
expects) when creating the bind mount, ensuring the source and target paths
remain the same.
| def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]: | ||
| """ | ||
| :param container_clp_config: | ||
| :return: A list of Celery connection environment variables for Docker containers, in the format | ||
| "KEY=VALUE". | ||
| """ | ||
| env_vars = [ | ||
| f"BROKER_URL=amqp://" | ||
| f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" | ||
| f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", | ||
| f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" | ||
| f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" | ||
| f"{container_clp_config.redis.query_backend_database}", | ||
| ] | ||
|
|
||
| return env_vars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Celery env: parameterise Redis DB and URL-encode credentials
This helper hardcodes the query DB and embeds raw credentials. Parameterise the DB index and encode user/pass to avoid URL parsing issues.
-def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]:
+def get_celery_connection_env_vars_list(
+ container_clp_config: CLPConfig,
+ redis_database: Optional[int] = None,
+) -> List[str]:
"""
:param container_clp_config:
:return: A list of Celery connection environment variables for Docker containers, in the format
"KEY=VALUE".
"""
- env_vars = [
- f"BROKER_URL=amqp://"
- f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
- f"{container_clp_config.queue.host}:{container_clp_config.queue.port}",
- f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
- f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
- f"{container_clp_config.redis.query_backend_database}",
- ]
+ from urllib.parse import quote
+ q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="")
+ q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="")
+ r_pass = container_clp_config.redis.password
+ redis_db = (
+ container_clp_config.redis.query_backend_database
+ if redis_database is None
+ else redis_database
+ )
+ broker = (
+ f"amqp://{q_user}:{q_pass}@"
+ f"{container_clp_config.queue.host}:{container_clp_config.queue.port}"
+ )
+ if r_pass:
+ backend = (
+ "redis://default:"
+ f"{quote(r_pass, safe='')}@"
+ f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+ )
+ else:
+ backend = (
+ f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}"
+ )
+ env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"]
return env_vars📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]: | |
| """ | |
| :param container_clp_config: | |
| :return: A list of Celery connection environment variables for Docker containers, in the format | |
| "KEY=VALUE". | |
| """ | |
| env_vars = [ | |
| f"BROKER_URL=amqp://" | |
| f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" | |
| f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", | |
| f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" | |
| f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" | |
| f"{container_clp_config.redis.query_backend_database}", | |
| ] | |
| return env_vars | |
| def get_celery_connection_env_vars_list( | |
| container_clp_config: CLPConfig, | |
| redis_database: Optional[int] = None, | |
| ) -> List[str]: | |
| """ | |
| :param container_clp_config: | |
| :return: A list of Celery connection environment variables for Docker containers, in the format | |
| "KEY=VALUE". | |
| """ | |
| from urllib.parse import quote | |
| q_user = "" if container_clp_config.queue.username is None else quote( | |
| container_clp_config.queue.username, safe="" | |
| ) | |
| q_pass = "" if container_clp_config.queue.password is None else quote( | |
| container_clp_config.queue.password, safe="" | |
| ) | |
| r_pass = container_clp_config.redis.password | |
| redis_db = ( | |
| container_clp_config.redis.query_backend_database | |
| if redis_database is None | |
| else redis_database | |
| ) | |
| broker = ( | |
| f"amqp://{q_user}:{q_pass}@" | |
| f"{container_clp_config.queue.host}:{container_clp_config.queue.port}" | |
| ) | |
| if r_pass: | |
| backend = ( | |
| "redis://default:" | |
| f"{quote(r_pass, safe='')}@" | |
| f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}" | |
| ) | |
| else: | |
| backend = ( | |
| f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{redis_db}" | |
| ) | |
| env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"] | |
| return env_vars |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 691 to
706, the Celery env helper currently hardcodes the Redis DB index and injects
raw credentials into URLs; update it to parameterise the Redis DB index (use the
existing config field, e.g., redis.query_backend_database or rename to redis.db
if needed) and URL-encode both broker and redis username/password to avoid URL
parsing issues. Import and use urllib.parse.quote_plus to encode
container_clp_config.queue.username/container_clp_config.queue.password and
container_clp_config.redis.username/container_clp_config.redis.password, then
build BROKER_URL and RESULT_BACKEND using the encoded credentials and the config
DB index instead of a hardcoded value; return the env_vars list as before.
| necessary_mounts = [mounts.clp_home, mounts.logs_dir, mounts.generated_config_file] | ||
| aws_mount, aws_env_vars = generate_container_auth_options(clp_config, component_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Scheduler: add-on — consider not embedding queue/redis secrets on CLI
As a follow-up hardening, avoid BROKER_URL/RESULT_BACKEND values on the CLI; pass only keys via -e and inject values via subprocess.run(env=...). This mirrors the pattern in create_db_tables/webui/garbage_collector suggestions.
Also applies to: 622-623, 628-636, 638-639
| mounts.generated_config_file, | ||
| ] | ||
| env_vars = [ | ||
| *get_common_env_vars_list(include_clp_home_env_var=True), | ||
| *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), | ||
| f"CLP_LOGS_DIR={container_logs_dir}", | ||
| f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}", | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Garbage collector: remove secrets from CLI; inject via subprocess env
Same leakage pattern as WebUI. Emit keys-only on the CLI and pass actual secret values via env.
Apply:
- necessary_mounts = [
+ necessary_mounts = [
mounts.clp_home,
mounts.logs_dir,
mounts.generated_config_file,
]
- env_vars = [
- *get_common_env_vars_list(include_clp_home_env_var=True),
- *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
- f"CLP_LOGS_DIR={container_logs_dir}",
- f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
- ]
+ env_vars = [
+ *get_common_env_vars_list(include_clp_home_env_var=True),
+ # Keys only; values injected via env below
+ "CLP_DB_USER",
+ "CLP_DB_PASS",
+ f"CLP_LOGS_DIR={container_logs_dir}",
+ f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
+ ]
@@
- if aws_env_vars:
- env_vars.extend(aws_env_vars)
+ if aws_env_vars:
+ # Convert KEY=VALUE entries into keys-only now; stash values for env injection
+ for kv in aws_env_vars:
+ env_vars.append(kv.split("=", 1)[0])
@@
- cmd = container_start_cmd + garbage_collector_cmd
- subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+ cmd = container_start_cmd + garbage_collector_cmd
+ secrets_env = {
+ "CLP_DB_USER": container_clp_config.database.username,
+ "CLP_DB_PASS": container_clp_config.database.password,
+ }
+ if aws_env_vars:
+ for kv in aws_env_vars:
+ k, v = kv.split("=", 1)
+ secrets_env[k] = v
+ subprocess.run(
+ cmd,
+ stdout=subprocess.DEVNULL,
+ check=True,
+ env={**os.environ, **secrets_env},
+ )Also applies to: 1087-1094, 1097-1104
kirkrodrigues
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor nits.
For the PR title, how about:
refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149).
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
Outdated
Show resolved
Hide resolved
Co-authored-by: kirkrodrigues <[email protected]>
Co-authored-by: kirkrodrigues <[email protected]>
…tion into a separate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
739-746: Workers: ensure RESULT_BACKEND uses the intended Redis DBgeneric_start_worker receives redis_database but get_celery_connection_env_vars_list ignores it, so workers will point to the query DB. Pass the parameter:
- env_vars = [ - *get_common_env_vars_list(include_clp_home_env_var=True), - *get_celery_connection_env_vars_list(container_clp_config), + env_vars = [ + *get_common_env_vars_list(include_clp_home_env_var=True), + *get_celery_connection_env_vars_list( + container_clp_config, + redis_database=redis_database, + ), f"CLP_CONFIG_PATH={container_clp_config.get_shared_config_file_path()}", f"CLP_LOGS_DIR={container_logs_dir}", f"CLP_LOGGING_LEVEL={worker_config.logging_level}", f"CLP_WORKER_LOG_PATH={container_worker_log_path}", ] @@ - append_docker_options(container_start_cmd, necessary_mounts, env_vars) + append_docker_options(container_start_cmd, necessary_mounts, env_vars)(Helper signature change in general.py required; see my comment there.)
Also applies to: 763-763
components/clp-package-utils/clp_package_utils/general.py (1)
362-397: Avoid embedding KEY=VALUE in docker argv in generate_container_start_cmdThe function emits -e KEY=VALUE which can expose secrets via ps/logs. Prefer -e KEY only and require callers to pass values via subprocess.run(env=...). If acceptable, I can provide a full patch; for now:
- for key, value in env_vars.items(): - container_start_cmd.extend(["-e", f"{key}={value}"]) + for key in env_vars.keys(): + container_start_cmd.extend(["-e", key])Callers should merge env dicts into subprocess.run(env=...).
components/clp-py-utils/clp_py_utils/clp_config.py (1)
796-796: Syntax error in type annotation: remove ‘()’ from ReducerTyping the field as Reducer() = Reducer() is invalid. It must be the class type on the left-hand side.
- reducer: Reducer() = Reducer() + reducer: Reducer = Reducer()
♻️ Duplicate comments (10)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
226-232: Critical: DB credentials are exposed on the Docker CLI (-e KEY=VALUE). Pass only names to Docker and inject values via subprocess env.Current generate_container_start_cmd emits "-e KEY=VALUE", which leaks secrets via process listings and logs. Provide only env var names on the docker CLI and pass the values via subprocess.run(..., env=...). This also prevents accidental leakage in debug logs.
Apply within this hunk (after updating general.generate_container_start_cmd accordingly):
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd: List[str] = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd: List[str] = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this hunk, inject values at execution and avoid logging the full command with secrets:
# Add at top-level imports import os # outside this hunk # Replace the subprocess invocation and debug logging proc = subprocess.run(cmd, env={**os.environ, **extra_env_vars}) ret_code = proc.returncode if 0 != ret_code: logger.error("Archive manager failed.") # Avoid printing the full command string to prevent accidental leakage. logger.debug("Docker command failed (redacted).")Companion change required in components/clp-package-utils/clp_package_utils/general.py: update generate_container_start_cmd to accept a list of env-var names and emit only "-e KEY" (no "=VALUE"). For example:
-def generate_container_start_cmd( - container_name: str, - container_mounts: List[Optional[DockerMount]], - container_image: str, - extra_env_vars: Optional[Dict[str, str]] = None, -) -> List[str]: +def generate_container_start_cmd( + container_name: str, + container_mounts: List[Optional[DockerMount]], + container_image: str, + env_var_names: Optional[List[str]] = None, +) -> List[str]: @@ - env_vars = { - "PYTHONPATH": clp_site_packages_dir, - **(extra_env_vars if extra_env_vars is not None else {}), - } - for key, value in env_vars.items(): - container_start_cmd.extend(["-e", f"{key}={value}"]) + # Provide only names; values must be injected via subprocess.run(env=...). + container_start_cmd.extend(["-e", f"PYTHONPATH={clp_site_packages_dir}"]) + for key in (env_var_names or []): + container_start_cmd.extend(["-e", key])This pattern should be applied consistently across all scripts that pass secrets.
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
129-135: Critical: DB credentials leak on Docker CLI; pass only env names and inject values via subprocess env. Also ensure cleanup on failure.
- Security: Avoid "-e KEY=VALUE" on the CLI. Provide only names; inject values via subprocess.run(env=...).
- Robustness: The generated config file is not removed on failure because unlink() is after a return in the except path—use a finally.
Apply within this hunk (after updating general.generate_container_start_cmd to accept names):
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this hunk: inject env and guarantee cleanup:
import os # at top if missing cmd = container_start_cmd + extract_cmd try: subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars}) except subprocess.CalledProcessError: logger.exception("Docker or file extraction command failed.") return -1 finally: # Remove generated files even on failure generated_config_path_on_host.unlink(missing_ok=True)Optionally, factor out DB env assembly to avoid duplication:
def _build_db_env(clp_config: CLPConfig) -> dict[str, str]: return { CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, } # ... extra_env_vars = _build_db_env(clp_config)
202-211: Mirror the secret-handling and cleanup fixes in the stream extraction pathApply the same secure env handling and robust cleanup:
- generated_config_path_on_container, generated_config_path_on_host = dump_container_config( - container_clp_config, clp_config, get_container_config_filename(container_name) - ) + generated_config_path_on_container, generated_config_path_on_host = dump_container_config( + container_clp_config, clp_config, get_container_config_filename(container_name) + ) @@ - extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this hunk:
import os # if missing cmd = container_start_cmd + extract_cmd try: subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars}) except subprocess.CalledProcessError: logger.exception("Docker or stream extraction command failed.") return -1 finally: generated_config_path_on_host.unlink(missing_ok=True)components/clp-package-utils/clp_package_utils/scripts/search.py (1)
121-127: Critical: Don’t expose DB creds via Docker CLI; inject via subprocess env and add failure-safe cleanup
- Pass only env var names to Docker; inject values via env in subprocess.run.
- Ensure generated config file is removed even if search fails (currently not removed on CalledProcessError).
Apply within this hunk (after general.generate_container_start_cmd is updated to accept names):
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this hunk:
import os # if missing cmd = container_start_cmd + search_cmd try: subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars}) finally: generated_config_path_on_host.unlink(missing_ok=True)components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
228-234: Critical: Secrets leak via “-e KEY=VALUE”. Pass only names to Docker and inject values via subprocess env; ensure cleanup on failure.
- Security: Avoid including DB creds on the docker argv.
- Reliability: If the docker run fails, the generated config file isn’t removed.
Apply within this hunk (after updating general.generate_container_start_cmd):
- extra_env_vars = { - CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, - CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, - } - container_start_cmd = generate_container_start_cmd( - container_name, necessary_mounts, clp_config.execution_container, extra_env_vars - ) + extra_env_vars = { + CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, + CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, + } + container_start_cmd = generate_container_start_cmd( + container_name, + necessary_mounts, + clp_config.execution_container, + list(extra_env_vars.keys()), + )Outside this hunk, inject env and protect cleanup:
import os # at top if missing cmd = container_start_cmd + compress_cmd try: subprocess.run(cmd, check=True, env={**os.environ, **extra_env_vars}) finally: generated_config_path_on_host.unlink(missing_ok=True)Optional DRY: factor out a small helper to build DB env vars (as suggested previously) and reuse across scripts.
def _build_db_env(clp_config: CLPConfig) -> dict[str, str]: return { CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, }components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
610-616: Fix scheduler logging level and use the correct Redis DB for RESULT_BACKENDTwo issues:
- CLP_LOGGING_LEVEL always uses the query scheduler’s logging level.
- get_celery_connection_env_vars_list always uses the query backend DB; compression should use its own DB.
Apply:
env_vars = [ *get_common_env_vars_list(), *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), - *get_celery_connection_env_vars_list(container_clp_config), + *get_celery_connection_env_vars_list( + container_clp_config, + redis_database=( + container_clp_config.redis.compression_backend_database + if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME + else container_clp_config.redis.query_backend_database + ), + ), f"CLP_LOGS_DIR={container_logs_dir}", - f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}", + f"CLP_LOGGING_LEVEL={( + clp_config.compression_scheduler.logging_level + if component_name == COMPRESSION_SCHEDULER_COMPONENT_NAME + else clp_config.query_scheduler.logging_level + )}", ]Note: This requires updating the helper signature in general.py (see my comment there).
components/clp-package-utils/clp_package_utils/general.py (4)
99-99: Typing style: use Optional instead of typing.OptionalYou already import Optional. Keep annotations consistent.
- self.generated_config_file: typing.Optional[DockerMount] = None + self.generated_config_file: Optional[DockerMount] = None
665-693: Defensively ensure credentials are loaded before emitting env varsPrevent “None” values from being emitted when loaders weren’t called; fail fast for required creds.
if include_db_credentials: + container_clp_config.database.ensure_credentials_loaded() env_vars.append(f"CLP_DB_USER={container_clp_config.database.username}") env_vars.append(f"CLP_DB_PASS={container_clp_config.database.password}") if include_queue_credentials: + if ( + container_clp_config.queue.username is None + or container_clp_config.queue.password is None + ): + raise ValueError("Queue credentials not loaded.") env_vars.append(f"CLP_QUEUE_USER={container_clp_config.queue.username}") env_vars.append(f"CLP_QUEUE_PASS={container_clp_config.queue.password}") if include_redis_credentials: + if container_clp_config.redis.password is None: + raise ValueError("Redis password not loaded.") env_vars.append(f"CLP_REDIS_PASS={container_clp_config.redis.password}")
290-301: Mount the shared config file read-onlyThe shared config should not be mutated in-container. Mark the bind mount readonly.
- docker_mounts.generated_config_file = DockerMount( - DockerMountType.BIND, - clp_config.get_shared_config_file_path(), - container_clp_config.get_shared_config_file_path(), - ) + docker_mounts.generated_config_file = DockerMount( + DockerMountType.BIND, + clp_config.get_shared_config_file_path(), + container_clp_config.get_shared_config_file_path(), + True, + )
695-710: Celery env: parameterize Redis DB and URL-encode credentialsHardcoding the query DB breaks compression components; also embed raw credentials which can break URL parsing. Add a redis_database parameter and encode credentials.
-def get_celery_connection_env_vars_list(container_clp_config: CLPConfig) -> List[str]: +def get_celery_connection_env_vars_list( + container_clp_config: CLPConfig, + redis_database: Optional[int] = None, +) -> List[str]: @@ - env_vars = [ - f"BROKER_URL=amqp://" - f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@" - f"{container_clp_config.queue.host}:{container_clp_config.queue.port}", - f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@" - f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/" - f"{container_clp_config.redis.query_backend_database}", - ] + from urllib.parse import quote + q_user = "" if container_clp_config.queue.username is None else quote(container_clp_config.queue.username, safe="") + q_pass = "" if container_clp_config.queue.password is None else quote(container_clp_config.queue.password, safe="") + r_pass = container_clp_config.redis.password + db_index = ( + container_clp_config.redis.query_backend_database + if redis_database is None + else redis_database + ) + broker = f"amqp://{q_user}:{q_pass}@{container_clp_config.queue.host}:{container_clp_config.queue.port}" + if r_pass: + backend = ( + "redis://default:" + f"{quote(r_pass, safe='')}@{container_clp_config.redis.host}:{container_clp_config.redis.port}/{db_index}" + ) + else: + backend = f"redis://{container_clp_config.redis.host}:{container_clp_config.redis.port}/{db_index}" + env_vars = [f"BROKER_URL={broker}", f"RESULT_BACKEND={backend}"] return env_varsUpdate call sites in start_clp.py (schedulers/workers) accordingly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
components/clp-package-utils/clp_package_utils/general.py(7 hunks)components/clp-package-utils/clp_package_utils/scripts/archive_manager.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/compress.py(4 hunks)components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/decompress.py(5 hunks)components/clp-package-utils/clp_package_utils/scripts/search.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(13 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(9 hunks)
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py
📚 Learning: 2024-11-15T16:22:33.635Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-package-utils/clp_package_utils/general.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-package-utils/clp_package_utils/general.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/clp-package-utils/clp_package_utils/general.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-19T14:41:28.856Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.856Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧬 Code Graph Analysis (8)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
get_container_config_filename(325-326)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
get_container_config_filename(325-326)generate_container_start_cmd(362-403)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
get_container_config_filename(325-326)generate_container_start_cmd(362-403)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (3)
get_container_config_filename(325-326)DockerMount(66-85)generate_container_start_cmd(362-403)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
read_yaml_config_file(56-62)get_config_value(25-39)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (14)
get_shared_config_file_path(925-926)dump_to_primitive_dict(249-250)dump_to_primitive_dict(361-362)dump_to_primitive_dict(456-457)dump_to_primitive_dict(552-553)dump_to_primitive_dict(569-572)dump_to_primitive_dict(601-604)dump_to_primitive_dict(702-705)dump_to_primitive_dict(724-727)dump_to_primitive_dict(934-956)dump_to_primitive_dict(968-976)load_credentials_from_file(252-262)load_credentials_from_file(364-373)load_credentials_from_file(459-469)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (6)
dump_shared_container_config(349-359)generate_container_config(229-310)get_celery_connection_env_vars_list(695-710)get_clp_home(109-125)get_common_env_vars_list(648-662)get_credential_env_vars_list(665-692)components/clp-py-utils/clp_py_utils/clp_config.py (2)
get_shared_config_file_path(925-926)StorageType(107-109)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
generate_container_auth_options(101-177)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
get_container_config_filename(325-326)generate_container_start_cmd(362-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (22)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
22-22: Switch to get_container_config_filename + updated dump signature: LGTMImporting and using get_container_config_filename(container_name) to drive dump_container_config is consistent with the unified-config approach and avoids per-component filename logic spread across call sites.
Also aligns with our previous learning that dataset validation should be done at the wrapper level; this file maintains that pattern.
Also applies to: 128-129
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
218-219: Config filename resolution via get_container_config_filename: LGTMDumping config with an explicit filename derived from the container name makes the flow clearer and standardizes generated file naming across scripts.
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
105-106: Config filename via get_container_config_filename: LGTMConsistent with the new unified config flow.
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
118-119: Using get_container_config_filename when dumping config: LGTMMatches the standardization across scripts.
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
208-209: Config filename standardization via get_container_config_filename: LGTMGood move towards a single pattern for container config file creation.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)
52-58: Imports LGTM; new helpers align with refactor goalsThe added imports for dump_shared_container_config, get_celery_connection_env_vars_list, get_common_env_vars_list, and get_credential_env_vars_list match the new standardized env/config flow.
269-269: LGTM: use shared config pathSwitching to the unified shared config file is correct and aligns with the PR objective.
617-617: Shared config mount added — good catchAdding mounts.generated_config_file here prevents reliance on the logs dir mount and avoids drift.
631-636: LGTM: pass unified config to the schedulerUsing the shared config file for schedulers is consistent with the new approach.
1006-1014: LGTM: reducer picks up shared config and mounts itUsing CLP_CONFIG with the shared config file and adding the generated_config_file mount is consistent and correct.
Also applies to: 1019-1025
1246-1247: LGTM: shared config dump on startupDumping the shared container config once at startup ensures consumers can read non-credential settings consistently.
components/clp-package-utils/clp_package_utils/general.py (4)
325-327: LGTM: centralised container-config filename generatorHaving a single helper prevents filename drift across consumers.
329-346: Docstring and return contract are cleardump_container_config’s contract (container path and host path) is clear and matches usage. Good split from dump_shared_container_config.
471-490: File-based credential loaders invoked intentionally — confirm intentThese helpers load credentials from the credentials file. Given the PR’s scope is to avoid writing creds to temporary per-container files (not to remove the credentials file entirely), this usage is fine. If the long-term plan is env-only source of truth, consider deprecating these and switching to load_credentials_from_env().
648-663: LGTM: common env listReturning only PYTHONPATH and optionally CLP_HOME keeps the surface minimal and re-usable.
components/clp-py-utils/clp_py_utils/clp_config.py (7)
85-86: LGTM: single source of truth for shared config filenameUsing CLP_SHARED_CONFIG_FILENAME avoids filename drift and is referenced by start scripts and general.py.
88-94: LGTM: env-var name constants introducedCentralising credential env-var names reduces typos and improves consistency across components.
347-354: Redis password optional is sensibleAllowing None for password matches Redis’ optional auth. Good alignment with downstream URL builder once encoded/omitted accordingly.
453-458: Queue: secret scrubbing looks correctdump_to_primitive_dict excludes username/password as intended.
Also applies to: 456-458
780-785: _get_env_var: concise and clearHelper is minimal and produces clean error messages. Good.
925-927: LGTM: shared config path uses the declared filename constantCorrect and consistent with the rest of the stack.
934-956: Top-level dump: refactor to avoid duplicating nested serializationYour refactor to build d excluding nested keys then appending each nested dump is cleaner and maintainable.
| env_vars = [ | ||
| *get_common_env_vars_list(), | ||
| *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), | ||
| ] | ||
| necessary_mounts = [ | ||
| mounts.clp_home, | ||
| mounts.data_dir, | ||
| mounts.logs_dir, | ||
| mounts.generated_config_file, | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stop leaking DB credentials in create_db_tables; inject via subprocess env
Env vars currently include CLP_DB_USER/CLP_DB_PASS as KEY=VALUE and you log the full command. This leaks credentials via ps and debug logs. Emit key names only on the CLI and pass actual values via subprocess.run(env=...).
Apply:
@@
- env_vars = [
- *get_common_env_vars_list(),
- *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
- ]
+ from clp_py_utils.clp_config import (
+ CLP_DB_USER_ENV_VAR_NAME,
+ CLP_DB_PASS_ENV_VAR_NAME,
+ )
+ env_vars = [
+ *get_common_env_vars_list(),
+ # Keys only; values injected via subprocess env below
+ CLP_DB_USER_ENV_VAR_NAME,
+ CLP_DB_PASS_ENV_VAR_NAME,
+ ]
@@
- cmd = container_start_cmd + create_tables_cmd
- logger.debug(" ".join(cmd))
- subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+ cmd = container_start_cmd + create_tables_cmd
+ logger.debug(" ".join(cmd))
+ secrets_env = {
+ CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username,
+ CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password,
+ }
+ subprocess.run(
+ cmd,
+ stdout=subprocess.DEVNULL,
+ check=True,
+ env={**os.environ, **secrets_env},
+ )Also applies to: 274-276
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 251-260 (and also apply same change to 274-276), the current env_vars list
contains secret values (CLP_DB_USER/CLP_DB_PASS) which are being concatenated
into the logged CLI string and can leak via ps/debug logs; instead build two
things: (1) a display-only list of env KEY names (e.g., "CLP_DB_USER",
"CLP_DB_PASS", plus other keys) that you can safely join and log, and (2) an
actual env dict mapping keys to their secret values (pulled from
get_credential_env_vars_list or container_clp_config) which you pass into
subprocess.run(..., env=env_dict); update any subprocess.run calls that
currently embed KEY=VALUE in the command to pass only the command arguments and
the env parameter, and ensure all logs print only the key names, not the values.
| if aws_env_vars: | ||
| necessary_env_vars.extend(aws_env_vars) | ||
| env_vars.extend(aws_env_vars) | ||
| if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Avoid embedding AWS secrets on the CLI; prefer keys-only + env injection
aws_env_vars are KEY=VALUE pairs and will leak via ps/logs. Emit only keys and pass values via subprocess.run(env=...). If you want, I can provide a concrete patch after you confirm keys-only mode here.
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 621-623, the code currently extends env_vars with aws_env_vars (KEY=VALUE
pairs) which embeds secret values into CLI args/logs; instead, parse
aws_env_vars into key/value pairs, add only the keys (not "KEY=VALUE") to the
command argument list (or flag list) and pass the actual values via the
subprocess environment parameter (e.g., merge into the env dict passed to
subprocess.run); ensure you handle entries without '=' safely and do not log the
values anywhere.
| env_vars = [ | ||
| *get_common_env_vars_list(), | ||
| *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), | ||
| f"NODE_PATH={node_path}", | ||
| f"HOST={clp_config.webui.host}", | ||
| f"PORT={clp_config.webui.port}", | ||
| f"CLP_DB_USER={clp_config.database.username}", | ||
| f"CLP_DB_PASS={clp_config.database.password}", | ||
| f"NODE_ENV=production", | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WebUI: remove DB/AWS secrets from CLI; inject via subprocess env
Both DB and AWS credentials are emitted as KEY=VALUE on the docker argv, and then the full command is executed/logged. Switch to keys-only on the CLI and pass values via env.
Apply:
@@
- env_vars = [
- *get_common_env_vars_list(),
- *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
- f"NODE_PATH={node_path}",
- f"HOST={clp_config.webui.host}",
- f"PORT={clp_config.webui.port}",
- f"NODE_ENV=production",
- ]
+ from clp_py_utils.clp_config import (
+ CLP_DB_USER_ENV_VAR_NAME,
+ CLP_DB_PASS_ENV_VAR_NAME,
+ )
+ env_vars = [
+ *get_common_env_vars_list(),
+ # Keys only; values injected via secrets_env below
+ CLP_DB_USER_ENV_VAR_NAME,
+ CLP_DB_PASS_ENV_VAR_NAME,
+ f"NODE_PATH={node_path}",
+ f"HOST={clp_config.webui.host}",
+ f"PORT={clp_config.webui.port}",
+ f"NODE_ENV=production",
+ ]
@@
- if AwsAuthType.credentials == auth.type:
- credentials = auth.credentials
- env_vars.append(f"AWS_ACCESS_KEY_ID={credentials.access_key_id}")
- env_vars.append(f"AWS_SECRET_ACCESS_KEY={credentials.secret_access_key}")
+ if AwsAuthType.credentials == auth.type:
+ # Keys only; values injected via secrets_env below
+ env_vars.append("AWS_ACCESS_KEY_ID")
+ env_vars.append("AWS_SECRET_ACCESS_KEY")
@@
- if aws_env_vars:
- env_vars.extend(aws_env_vars)
+ if aws_env_vars:
+ # Convert KEY=VALUE entries into keys-only; stash values to secrets_env
+ for kv in aws_env_vars:
+ env_vars.append(kv.split("=", 1)[0])
@@
- cmd = container_cmd + node_cmd
- subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+ cmd = container_cmd + node_cmd
+ secrets_env = {
+ CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username,
+ CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password,
+ }
+ if StorageType.S3 == stream_storage.type:
+ auth = stream_storage.s3_config.aws_authentication
+ if AwsAuthType.credentials == auth.type:
+ credentials = auth.credentials
+ secrets_env["AWS_ACCESS_KEY_ID"] = credentials.access_key_id
+ secrets_env["AWS_SECRET_ACCESS_KEY"] = credentials.secret_access_key
+ else:
+ if aws_env_vars:
+ for kv in aws_env_vars:
+ k, v = kv.split("=", 1)
+ secrets_env[k] = v
+ subprocess.run(
+ cmd,
+ stdout=subprocess.DEVNULL,
+ check=True,
+ env={**os.environ, **secrets_env},
+ )Also applies to: 947-948, 956-956, 960-960, 967-969
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 932-939 (also update the similar spots at 947-948, 956, 960, and 967-969),
the current code places full DB/AWS KEY=VALUE pairs directly into the docker
command args and logs them; change this to emit only the keys (e.g., "DB_USER",
"DB_PASS", "AWS_ACCESS_KEY_ID", etc.) on the CLI and move the corresponding
values into the subprocess environment dict passed to the process invocation
(populate env from get_credential_env_vars_list but strip values from the argv
and set them in os.environ or the subprocess env param); ensure the docker argv
contains only KEY names (no =VALUE), and remove sensitive values from any logged
command output by constructing the argv separately from the env with values kept
only in the subprocess env.
| necessary_mounts = [ | ||
| mounts.clp_home, | ||
| mounts.logs_dir, | ||
| mounts.generated_config_file, | ||
| ] | ||
| env_vars = [ | ||
| *get_common_env_vars_list(include_clp_home_env_var=True), | ||
| *get_credential_env_vars_list(container_clp_config, include_db_credentials=True), | ||
| f"CLP_LOGS_DIR={container_logs_dir}", | ||
| f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}", | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Garbage collector: remove DB/AWS secrets from CLI; inject via subprocess env
Same leak pattern as create_db_tables/WebUI. Emit keys-only and pass values via env.
Apply:
@@
- env_vars = [
- *get_common_env_vars_list(include_clp_home_env_var=True),
- *get_credential_env_vars_list(container_clp_config, include_db_credentials=True),
- f"CLP_LOGS_DIR={container_logs_dir}",
- f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
- ]
+ from clp_py_utils.clp_config import (
+ CLP_DB_USER_ENV_VAR_NAME,
+ CLP_DB_PASS_ENV_VAR_NAME,
+ )
+ env_vars = [
+ *get_common_env_vars_list(include_clp_home_env_var=True),
+ # Keys only; values injected via env below
+ CLP_DB_USER_ENV_VAR_NAME,
+ CLP_DB_PASS_ENV_VAR_NAME,
+ f"CLP_LOGS_DIR={container_logs_dir}",
+ f"CLP_LOGGING_LEVEL={clp_config.garbage_collector.logging_level}",
+ ]
@@
- if aws_env_vars:
- env_vars.extend(aws_env_vars)
+ if aws_env_vars:
+ for kv in aws_env_vars:
+ env_vars.append(kv.split("=", 1)[0])
@@
- cmd = container_start_cmd + garbage_collector_cmd
- subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+ cmd = container_start_cmd + garbage_collector_cmd
+ secrets_env = {
+ CLP_DB_USER_ENV_VAR_NAME: container_clp_config.database.username,
+ CLP_DB_PASS_ENV_VAR_NAME: container_clp_config.database.password,
+ }
+ if aws_env_vars:
+ for kv in aws_env_vars:
+ k, v = kv.split("=", 1)
+ secrets_env[k] = v
+ subprocess.run(
+ cmd,
+ stdout=subprocess.DEVNULL,
+ check=True,
+ env={**os.environ, **secrets_env},
+ )Also applies to: 1091-1091, 1093-1101
| def dump_to_primitive_dict(self): | ||
| return self.dict(exclude={"username", "password"}) | ||
|
|
||
| def load_credentials_from_file(self, credentials_file_path: pathlib.Path): | ||
| config = read_yaml_config_file(credentials_file_path) | ||
| if config is None: | ||
| raise ValueError(f"Credentials file '{credentials_file_path}' is empty.") | ||
| try: | ||
| self.username = get_config_value(config, f"{DB_COMPONENT_NAME}.user") | ||
| self.password = get_config_value(config, f"{DB_COMPONENT_NAME}.password") | ||
| except KeyError as ex: | ||
| raise ValueError( | ||
| f"Credentials file '{credentials_file_path}' does not contain key '{ex}'." | ||
| ) | ||
|
|
||
| def load_credentials_from_env(self): | ||
| """ | ||
| :raise ValueError: if any expected environment variable is not set. | ||
| """ | ||
| self.username = _get_env_var(CLP_DB_USER_ENV_VAR_NAME) | ||
| self.password = _get_env_var(CLP_DB_PASS_ENV_VAR_NAME) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Database: secret scrubbing + env loader look good; docstring minor
dump_to_primitive_dict correctly excludes secrets; env loader is straightforward. Small docstring nit: you’re using “:raise ValueError:” which is acceptable per team preference; no change required.
🤖 Prompt for AI Agents
components/clp-py-utils/clp_py_utils/clp_config.py around lines 249 to 270: no
functional changes required — dump_to_primitive_dict correctly excludes secrets,
load_credentials_from_file and load_credentials_from_env behave as expected, and
the docstring using ":raise ValueError:" is acceptable per team preference;
leave the code as-is.
| def dump_to_primitive_dict(self): | ||
| return self.dict(exclude={"password"}) | ||
|
|
||
| def load_credentials_from_file(self, credentials_file_path: pathlib.Path): | ||
| config = read_yaml_config_file(credentials_file_path) | ||
| if config is None: | ||
| raise ValueError(f"Credentials file '{credentials_file_path}' is empty.") | ||
| try: | ||
| self.password = get_config_value(config, f"{REDIS_COMPONENT_NAME}.password") | ||
| except KeyError as ex: | ||
| raise ValueError( | ||
| f"Credentials file '{credentials_file_path}' does not contain key '{ex}'." | ||
| ) | ||
|
|
||
| def load_credentials_from_env(self): | ||
| """ | ||
| :raise ValueError: if any expected environment variable is not set. | ||
| """ | ||
| self.password = _get_env_var(CLP_REDIS_PASS_ENV_VAR_NAME) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Redis: secret scrubbing + env loader are fine; optional: docstring “raises”
Behaviour is correct; if you later standardize on “:raises ValueError:” across the file, update here for consistency.
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 361 to 380,
update the load_credentials_from_env docstring to match the project’s docstring
style by changing the ":raise ValueError:" line to ":raises ValueError:"
(plural) so it is consistent with other docstrings in the file; ensure
punctuation and spacing match adjacent docstrings.
| def load_credentials_from_file(self, credentials_file_path: pathlib.Path): | ||
| config = read_yaml_config_file(credentials_file_path) | ||
| if config is None: | ||
| raise ValueError(f"Credentials file '{credentials_file_path}' is empty.") | ||
| try: | ||
| self.username = get_config_value(config, f"{QUEUE_COMPONENT_NAME}.user") | ||
| self.password = get_config_value(config, f"{QUEUE_COMPONENT_NAME}.password") | ||
| except KeyError as ex: | ||
| raise ValueError( | ||
| f"Credentials file '{credentials_file_path}' does not contain key '{ex}'." | ||
| ) | ||
|
|
||
| def load_credentials_from_env(self): | ||
| """ | ||
| :raise ValueError: if any expected environment variable is not set. | ||
| """ | ||
| self.username = _get_env_var(CLP_QUEUE_USER_ENV_VAR_NAME) | ||
| self.password = _get_env_var(CLP_QUEUE_PASS_ENV_VAR_NAME) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Queue: file/env loaders consistent; optional: validate non-empty values
If you want early failure on blank strings (not just missing vars), add non-empty checks here similar to DB. Can be a follow-up.
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 459 to 477,
the credential loaders accept empty strings which can silently allow invalid
credentials; add explicit non-empty validation after reading username and
password in both load_credentials_from_file and load_credentials_from_env, and
raise ValueError with a clear message including the credentials_file_path or
environment variable names when a value is empty; mirror the DB loader behavior
(check for falsy/blank values, not just missing keys) so both file and env paths
fail fast on blank credentials.
| "--log-driver", "local" | ||
| ] | ||
| env_vars = { | ||
| "PYTHONPATH": clp_site_packages_dir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the above "-e", f"PYTHONPATH={clp_site_packages_dir}", line was not removed, as a mistake possibly during branch syncs with main, which causes the environment variable to be defined twice. The duplication isn't harmful in the correctness though.
@coderabbitai create an issue to track the removal of the deplicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Create GitHub issue to track duplicate PYTHONPATH removal
gh issue create \
--title "Remove duplicate PYTHONPATH environment variable in generate_container_start_cmd" \
--body "## Description
In \`components/clp-package-utils/clp_package_utils/general.py\`, the \`generate_container_start_cmd\` function defines the PYTHONPATH environment variable twice:
1. Directly in the \`container_start_cmd\` array: \`\"-e\", f\"PYTHONPATH={clp_site_packages_dir}\"\`
2. In the \`env_vars\` dictionary, which is then iterated to add environment variables to the command
This duplication occurred during branch syncs and should be cleaned up. While not harmful to correctness, only one definition is necessary.
## References
- PR: https://github.com/y-scope/clp/pull/1152
- Comment: https://github.com/y-scope/clp/pull/1152#discussion_r2468365115
- Reported by: @junhaoliao" \
--assignee junhaoliao \
--repo y-scope/clp
Length of output: 42
@junhaoliao, I've created issue #1513 to track the removal of the duplicate PYTHONPATH environment variable definition in generate_container_start_cmd. The issue has been assigned to you.
🧬 Code Graph Analysis Results
[clp-package-utils/general.py] Summary: generate_container_start_cmd
- Description: Creates the docker run command to start a container with mounts, environment variables, and a name.
- Parameters:
- container_name: str
- container_mounts: List[Optional[DockerMount]]
- container_image: str
- extra_env_vars: Optional[Dict[str, str]]
- Returns: List[str] representing the docker run command.
- Key behavior:
- Defines clp_site_packages_dir as CONTAINER_CLP_HOME / "lib" / "python3" / "site-packages".
- Includes an environment variable "-e PYTHONPATH={clp_site_packages_dir}" unconditionally.
- Merges extra_env_vars (if provided) into the command as additional -e KEY=VALUE pairs.
- Appends mount arguments for non-None mounts.
- Important details:
- The command explicitly sets PYTHONPATH in the container startup command, which may duplicate with other environment-variable provisioning (e.g., a separate list of environment vars that also sets PYTHONPATH).
- This function is central to how the container’s environment is prepared and could interact with other env var provisioning logic.
[clp-package-utils/general.py] Summary: get_common_env_vars_list
- Description: Returns a list of common environment variables for Docker containers.
- Parameters:
- include_clp_home_env_var: bool (default True)
- Returns: List[str] where each item is in the form KEY=VALUE.
- Key behavior:
- Always includes PYTHONPATH={CONTAINER_CLP_HOME}/lib/python3/site-packages.
- If include_clp_home_env_var is True, also includes CLP_HOME={CONTAINER_CLP_HOME}.
- Important details:
- This list provides a standard set of env vars for containers.
- Note: PYTHONPATH is defined here, which can interact with the PYTHONPATH set in generate_container_start_cmd (potential duplication if both methods are used).
Description
This PR standardizes how configuration and credentials are passed to containers, addressing security concerns and code complexity in the
start_clp.pyscript.The key changes are:
Standardize configuration and credential delivery to improve consistency and security by:
See #1149 for more details.
Checklist
breaking change.
Validation performed
Code / File inspection
Manual tests
Tested container startup with the new configuration approach:
cd clp-package/sbin/./start-clp.shTested wrapper scripts:
./compress.sh ~/samples/hive-24hr/./decompress.sh x./search.sh "1"./admin-tools/archive-manager.sh findPerformed a search with query string "1" in the webui and observed results returned successfully.
Enabled
garbage_collectorand observed archives / search results are purged in below configurations:results_cache.retention_period = 1+archive_output.retention_period = 5800000+ performed a search: observed search results were cleaned in around one minute after query completed.results_cache.retention_period = 1+archive_output.retention_period = 1+ performed a search: observed no search result; archives were cleaned in around one minute after compression.Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Refactor