fix(package): Refactor dataset handling in the package. #1023
Conversation
## Walkthrough
This change removes hardcoded table suffix constants throughout the codebase and replaces them with dynamic table name retrieval functions that accept an optional dataset parameter. It eliminates storage engine-based conditional logic for table naming and path construction, centralizes table name generation, and updates function signatures and usages to propagate dataset information where needed.
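The dynamic table-name retrieval described above can be sketched as follows. This is a minimal illustration only; the exact helper names, suffixes, and separator used in the PR may differ:

```python
from typing import Optional


def _get_table_name(table_prefix: str, suffix: str, dataset: Optional[str]) -> str:
    """Build a table name, inserting the dataset segment when one is given."""
    if dataset is not None:
        return f"{table_prefix}{dataset}_{suffix}"
    return f"{table_prefix}{suffix}"


def get_archives_table_name(table_prefix: str, dataset: Optional[str] = None) -> str:
    # Replaces concatenation with a hardcoded archives-table suffix constant.
    return _get_table_name(table_prefix, "archives", dataset)


def get_files_table_name(table_prefix: str, dataset: Optional[str] = None) -> str:
    # Replaces concatenation with a hardcoded files-table suffix constant.
    return _get_table_name(table_prefix, "files", dataset)
```

Centralizing the construction this way means callers pass an optional dataset instead of branching on the storage engine at every call site.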
## Changes
| Files/Areas | Change Summary |
|---|---|
| `clp_py_utils/clp_config.py` | Removed six string constants representing table suffix names for archives, datasets, files, tags, and column metadata. |
| `clp_py_utils/clp_metadata_db_utils.py` | Introduced a helper for dynamic table name construction; replaced all table name concatenations with new getter functions; updated function signatures to accept optional dataset parameters; updated `add_dataset` to use `ArchiveOutput`; centralized table name logic. |
| `clp_py_utils/s3_utils.py` | Renamed `s3_put` parameter from `dest_file_name` to `dest_path` and clarified S3 key construction in the docstring. |
| `clp_package_utils/scripts/native/archive_manager.py`<br>`clp_package_utils/scripts/native/decompress.py`<br>`clp_package_utils/scripts/start_clp.py` | Replaced hardcoded table suffixes with dynamic table name getters; removed storage engine-based table prefix and dataset logic; updated function signatures to accept optional dataset; improved archive directory path handling. |
| `job_orchestration/executor/compress/compression_task.py` | Replaced hardcoded suffixes with table name getter functions; added optional dataset parameter to metadata update functions; centralized S3 upload logic with dataset-aware pathing. |
| `job_orchestration/executor/query/extract_stream_task.py`<br>`job_orchestration/executor/query/fs_search_task.py` | Changed S3 object key construction to avoid mutating config; now builds key with dataset as a variable. |
| `job_orchestration/scheduler/compress/compression_scheduler.py` | Removed storage engine parameter; simplified dataset registration and tags table naming; added helper for dataset caching; used dynamic table name getters. |
| `job_orchestration/scheduler/query/query_scheduler.py` | Replaced fixed table suffixes with dynamic table name getters; removed storage engine parameter from functions; propagated dataset parameter throughout; updated job handle and query logic to use dataset-aware table names. |
| `job_orchestration/scheduler/job_config.py` | Changed `dataset` attributes in input config classes from non-optional to optional string, removing default dataset values. |
| `clp_package_utils/scripts/native/compress.py` | Added optional `dataset` parameter to `_generate_clp_io_config` and passed it to input config constructors; set `dataset` based on storage engine condition in `main`. |
| `clp_package_utils/scripts/native/search.py` | Added optional `dataset` parameter to search-related functions; propagated dataset through job creation and monitoring; assigned dataset based on storage engine condition in `main`. |
| `webui/server/settings.json` | Added `"ClpStorageEngine": "clp"` configuration entry. |
| `webui/server/src/configConstants.ts` | Added constants `CLP_STORAGE_ENGINE_CLP_S` and `CLP_DEFAULT_DATASET_NAME` for storage engine and default dataset identification. |
| `webui/server/src/fastify-v2/routes/api/search/index.ts` | Added conditional dataset assignment in search query submission based on storage engine setting. |
| `webui/server/src/plugins/DbManager.ts` | Added conditional dataset property to job config based on storage engine setting in JSON extraction job submission. |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Scheduler
participant MetadataDBUtils
participant Executor
participant S3
participant DB
Scheduler->>MetadataDBUtils: get_tags_table_name(table_prefix, dataset)
Scheduler->>MetadataDBUtils: add_dataset(..., archive_output)
MetadataDBUtils->>DB: INSERT INTO datasets_table ...
Executor->>MetadataDBUtils: get_archives_table_name(table_prefix, dataset)
Executor->>MetadataDBUtils: get_archive_tags_table_name(table_prefix, dataset)
Executor->>DB: INSERT/SELECT using dynamic table names
Executor->>S3: _upload_archive_to_s3(archive_src_path, archive_id, dataset)
S3-->>Executor: Upload complete
```

Possibly related PRs

Suggested reviewers
Actionable comments posted: 2
🔭 Outside diff range comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
84-95: Add docstring and consider using a dataclass for parameters.

The function has too many parameters and lacks documentation:

```diff
 def update_job_metadata_and_tags(db_cursor, job_id, table_prefix, dataset, tag_ids, archive_stats):
+    """
+    Update job metadata and tags for an archive.
+
+    :param db_cursor: Database cursor
+    :param job_id: Job identifier
+    :param table_prefix: Table name prefix
+    :param dataset: Optional dataset name
+    :param tag_ids: List of tag IDs
+    :param archive_stats: Archive statistics dictionary
+    """
```

Consider creating a dataclass to group related parameters and reduce the argument count.

97-110: Add docstring for the update_archive_metadata function.

```diff
 def update_archive_metadata(db_cursor, table_prefix, dataset, archive_stats):
+    """
+    Insert archive metadata into the database.
+
+    :param db_cursor: Database cursor
+    :param table_prefix: Table name prefix
+    :param dataset: Optional dataset name
+    :param archive_stats: Dictionary containing archive statistics
+    """
     archive_stats_defaults = {
```
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (1)
288-290: Consider extracting the repeated archive output directory logic.

The pattern `archives_dir / dataset if dataset is not None else archives_dir` appears in multiple places. While simple, extracting it to a helper function would improve maintainability:

```diff
+def _get_archive_output_dir(archives_dir: Path, dataset: typing.Optional[str]) -> Path:
+    """Get the archive output directory, optionally prefixed with dataset."""
+    return archives_dir / dataset if dataset is not None else archives_dir
```

Then use it as:

```diff
-archive_output_dir: Path = (
-    archives_dir / dataset if dataset is not None else archives_dir
-)
+archive_output_dir: Path = _get_archive_output_dir(archives_dir, dataset)
```

Also applies to: 392-392
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (11)
- components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (10 hunks)
- components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (0 hunks)
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2 hunks)
- components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (6 hunks)
- components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2 hunks)
- components/clp-py-utils/clp_py_utils/s3_utils.py (2 hunks)
- components/job-orchestration/job_orchestration/executor/compress/compression_task.py (8 hunks)
- components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (1 hunks)
- components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (1 hunks)
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
- components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (11 hunks)
💤 Files with no reviewable changes (1)
- components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
🧰 Additional context used
🪛 Pylint (3.3.7)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
[error] 37-37: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
[error] 15-19: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
[convention] 25-25: Missing function or method docstring
(C0116)
[convention] 29-29: Missing function or method docstring
(C0116)
[convention] 33-33: Missing function or method docstring
(C0116)
[convention] 37-37: Missing function or method docstring
(C0116)
[convention] 41-41: Missing function or method docstring
(C0116)
[convention] 45-45: Missing function or method docstring
(C0116)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
[error] 22-25: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
[convention] 73-73: Missing function or method docstring
(C0116)
[convention] 84-84: Missing function or method docstring
(C0116)
[refactor] 84-84: Too many arguments (6/5)
(R0913)
[refactor] 84-84: Too many positional arguments (6/5)
(R0917)
[convention] 97-97: Missing function or method docstring
(C0116)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
[error] 40-46: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
🔇 Additional comments (22)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (4)
98-117: Good refactoring to use centralized table name generation.

The change to accept a dataset parameter and use the `get_files_table_name` helper function improves consistency and modularity.
120-130: Consistent refactoring pattern applied.

The function now uses the centralized table name helper. Note that this function requires a dataset parameter (not optional), which appears intentional for column metadata tables.

143-186: Proper use of centralized table name helpers.

The changes consistently apply the new table name generation pattern. The path construction at line 182 correctly appends the dataset name for dataset-specific storage organization.

203-224: Excellent refactoring to use centralized table name generation.

The function now consistently uses the new table name getter functions and properly passes the dataset parameter to the table creation functions. This eliminates direct string manipulation of table prefixes.
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (1)
100-103: Good refactoring to avoid mutating shared configuration.

The change from mutating `s3_config.key_prefix` to creating a local `s3_object_key` variable improves immutability and prevents side effects. This is a more predictable approach that won't affect other uses of the s3_config object.

components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (1)

65-68: Consistent S3 key construction refactoring.

This change follows the same pattern as in extract_stream_task.py, avoiding mutation of the shared `s3_config.key_prefix` by creating a local `s3_object_key` variable. Good consistency across related files.

components/clp-py-utils/clp_py_utils/s3_utils.py (1)
282-310: Good semantic improvement with parameter renaming.

Renaming `dest_file_name` to `dest_path` better reflects the parameter's purpose, especially with the updated functionality where it can include dataset-specific path components. The documentation update clearly explains the behavior.

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (1)
11-11: Good fix for web UI loading issue.

The addition of the `CLP_DEFAULT_DATASET_NAME` import and explicit creation of metadata tables with the default dataset for the CLP_S storage engine addresses the web UI loading bug mentioned in the PR objectives. The comment clearly explains the rationale.

Also applies to: 59-62
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
37-37: LGTM! Clean refactoring to use centralized table name getters.

The changes properly initialize the dataset parameter based on the storage engine and use the new centralized table name getter functions, improving consistency across the codebase.
Also applies to: 868-877
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
196-199: Verify the removal of dataset from archive storage directory path.

The archive storage directory path no longer includes the dataset subdirectory. Please confirm this is intentional and won't affect downstream components that may expect the dataset-prefixed path structure.

189-192: Clean implementation of dataset-aware table naming.

Good use of the Optional type for dataset_name and proper propagation to the table name getter functions.
Also applies to: 280-292
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (1)
15-19: Well-structured refactoring to support dataset-aware table naming.

The changes properly:
- Import the centralized table name getter functions
- Initialize dataset based on storage engine
- Update function signatures to accept optional dataset parameter
- Use the table name getters consistently throughout
Also applies to: 196-199, 241-241, 268-269, 340-341, 360-362
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
146-156: Good implementation of dataset-aware S3 upload.

The helper function properly handles the optional dataset parameter by prefixing the S3 destination path when a dataset is provided.

297-297: Consistent dataset propagation throughout the compression flow.

The changes properly initialize and propagate the dataset parameter through all relevant function calls, maintaining consistency with the refactored table naming approach.
Also applies to: 371-371, 389-391, 396-396
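The dataset-aware destination-path logic praised in the comment above can be sketched like this. The helper name and path layout are assumptions for illustration, not the PR's exact code:

```python
from typing import Optional


def build_s3_dest_path(archive_id: str, dataset: Optional[str]) -> str:
    # Prefix the S3 destination with the dataset name when one is provided;
    # otherwise upload under the archive ID alone.
    return f"{dataset}/{archive_id}" if dataset is not None else archive_id
```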
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (8)
40-46: Import looks correct, static analysis warning likely a false positive.

The import of the new table name helper functions aligns with the refactoring objectives. The static analysis warning about the import error is likely due to the tool not having access to the full package structure.

115-127: Proper propagation of dataset parameter.

The addition of the optional dataset parameter and its propagation to `get_archive_and_file_split_ids_for_ir_extraction` is consistent with the refactoring pattern.

162-175: Consistent implementation of dataset parameter.

The addition of the optional dataset parameter and its propagation to `archive_exists` follows the same pattern as other refactored functions.

391-424: Excellent refactoring to use centralized table name functions.

The replacement of hardcoded table suffixes with calls to `get_archives_table_name`, `get_archive_tags_table_name`, and `get_tags_table_name` improves maintainability and follows the centralized table name handling pattern.

450-481: Clean implementation of dataset-aware table name resolution.

The use of `get_files_table_name` with the dataset parameter properly centralizes table name construction.

484-498: Proper use of centralized table name function.

The implementation correctly uses `get_archives_table_name` with the dataset parameter.

668-694: Well-implemented dataset extraction and validation logic.

The code correctly extracts the dataset for CLP_S storage engine and validates its existence. The error handling for invalid datasets is appropriate.
One minor observation: Consider initializing `dataset` to `None` explicitly for clarity, though the current typing annotation is sufficient.
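The dataset extraction-and-validation flow described above might look roughly like this. The engine names, the default-dataset fallback, and the function name are assumptions for illustration:

```python
from typing import Optional, Set

CLP_S = "clp-s"
DEFAULT_DATASET = "default"


def resolve_dataset(storage_engine: str, requested: Optional[str],
                    existing_datasets: Set[str]) -> Optional[str]:
    # Only the clp-s engine uses datasets; fall back to the default name
    # and reject datasets that were never registered.
    if storage_engine != CLP_S:
        return None
    dataset = requested if requested is not None else DEFAULT_DATASET
    if dataset not in existing_datasets:
        raise ValueError(f"Dataset {dataset!r} does not exist")
    return dataset
```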
733-739: Consistent dataset propagation to extraction handlers.

The dataset parameter is properly passed to both `IrExtractionHandle` and `JsonExtractionHandle` constructors, maintaining consistency across the refactoring.
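The non-mutating S3 key construction praised in the extract_stream_task/fs_search_task comments can be sketched as follows; the config shape here is an assumption:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Config:
    key_prefix: str


def build_stream_object_key(s3_config: S3Config, dataset: str, stream_id: str) -> str:
    # Compose the full object key in a local variable instead of appending
    # to s3_config.key_prefix, leaving the shared config untouched.
    return f"{s3_config.key_prefix}{dataset}/{stream_id}"
```

Freezing the dataclass makes accidental mutation of the shared config a hard error rather than a silent side effect.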
```diff
 def update_tags(db_cursor, table_prefix, dataset, archive_id, tag_ids):
+    tags_table_name = get_tags_table_name(table_prefix, dataset)
     db_cursor.executemany(
         f"""
-        INSERT INTO {table_prefix}{ARCHIVE_TAGS_TABLE_SUFFIX} (archive_id, tag_id)
+        INSERT INTO {tags_table_name} (archive_id, tag_id)
         VALUES (%s, %s)
         """,
         [(archive_id, tag_id) for tag_id in tag_ids],
     )
```

🛠️ Refactor suggestion

Add docstring for the update_tags function.

This function lacks documentation. Please add a docstring explaining its purpose and parameters:

```diff
 def update_tags(db_cursor, table_prefix, dataset, archive_id, tag_ids):
+    """
+    Insert tags for an archive into the database.
+
+    :param db_cursor: Database cursor
+    :param table_prefix: Table name prefix
+    :param dataset: Optional dataset name
+    :param archive_id: Archive identifier
+    :param tag_ids: List of tag IDs to associate with the archive
+    """
     tags_table_name = get_tags_table_name(table_prefix, dataset)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def update_tags(db_cursor, table_prefix, dataset, archive_id, tag_ids):
    """
    Insert tags for an archive into the database.

    :param db_cursor: Database cursor
    :param table_prefix: Table name prefix
    :param dataset: Optional dataset name
    :param archive_id: Archive identifier
    :param tag_ids: List of tag IDs to associate with the archive
    """
    tags_table_name = get_tags_table_name(table_prefix, dataset)
    db_cursor.executemany(
        f"""
        INSERT INTO {tags_table_name} (archive_id, tag_id)
        VALUES (%s, %s)
        """,
        [(archive_id, tag_id) for tag_id in tag_ids],
    )
```
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 73-73: Missing function or method docstring
(C0116)
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
around lines 73 to 82, the update_tags function lacks a docstring. Add a clear
docstring at the start of the function describing its purpose, which is to
insert multiple tag associations for a given archive into the database, and
explain the parameters: db_cursor (database cursor for executing queries),
table_prefix (prefix for table names), dataset (dataset identifier), archive_id
(ID of the archive), and tag_ids (list of tag IDs to associate).
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
74-74: Address the missing docstring from previous review.

The function signature has been updated with a new `dataset` parameter, but the docstring is still missing as noted in the previous review.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (7)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (7)
22-26: LGTM! Clean import organization.

The imports for the new utility functions are properly organized and align with the refactoring objectives.

85-87: LGTM! Consistent parameter threading.

The dataset parameter is properly added to the function signature and correctly passed through to the `update_tags` function.

98-111: LGTM! Clean refactoring of table name handling.

The function properly uses the centralized utility function for table name generation and correctly threads the dataset parameter.

147-157: LGTM! Well-structured helper function.

The `_upload_archive_to_s3` helper function properly encapsulates S3 upload logic with dataset-aware path construction. The conditional path prefixing logic is correct.

298-298: LGTM! Clear variable initialization.

The `input_dataset` variable is properly initialized and will be set appropriately for CLP_S storage engine.

372-372: LGTM! Consistent use of helper function.

The S3 upload call correctly uses the new helper function with the dataset parameter.

390-400: LGTM! Proper dataset parameter handling.

Both function calls correctly pass the `input_dataset` parameter, maintaining consistency with the refactoring approach.
Actionable comments posted: 3
♻️ Duplicate comments (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
74-74: Add docstring for the update_tags function.

This function lacks documentation. Please add a docstring explaining its purpose and parameters.

```diff
 def update_tags(db_cursor, table_prefix, dataset, archive_id, tag_ids):
+    """
+    Insert tags for an archive into the database.
+
+    :param db_cursor: Database cursor
+    :param table_prefix: Table name prefix
+    :param dataset: Optional dataset name
+    :param archive_id: Archive identifier
+    :param tag_ids: List of tag IDs to associate with the archive
+    """
     archive_tags_table_name = get_archive_tags_table_name(table_prefix, dataset)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (8 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
[error] 22-26: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
[warning] 22-26: Unused get_tags_table_name imported from clp_py_utils.clp_metadata_db_utils
(W0611)
[convention] 74-74: Missing function or method docstring
(C0116)
[convention] 85-85: Missing function or method docstring
(C0116)
[refactor] 85-85: Too many arguments (6/5)
(R0913)
[refactor] 85-85: Too many positional arguments (6/5)
(R0917)
[convention] 98-98: Missing function or method docstring
(C0116)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (6)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (6)
74-82: LGTM! Variable naming issue has been resolved.

The function correctly uses `archive_tags_table_name` in the SQL query, resolving the previously identified variable naming bug.

147-157: LGTM! Well-structured helper function for S3 uploads.

The new `_upload_archive_to_s3` helper function properly encapsulates S3 upload logic with dataset-aware path construction. The implementation correctly handles the optional dataset parameter by prefixing the destination path when a dataset is provided.

298-298: LGTM! Proper initialization of dataset variable.

The `input_dataset` variable is correctly initialized as None and will be set appropriately for CLP_S storage engine later in the function.

372-372: LGTM! Consistent use of new S3 upload helper.

The call to `_upload_archive_to_s3` properly passes all required parameters including the dataset, maintaining consistency with the refactored approach.

390-392: LGTM! Proper dataset parameter propagation.

The call to `update_archive_metadata` correctly includes the `input_dataset` parameter, maintaining consistency with the refactored function signature.

397-397: LGTM! Consistent dataset parameter usage.

The call to `update_job_metadata_and_tags` properly includes the `input_dataset` parameter, aligning with the updated function signature.
```python
) as metadata_db_cursor:
    if StorageEngine.CLP_S == storage_engine:
        create_datasets_table(metadata_db_cursor, table_prefix)
        # Note: the webui still expect tables with default dataset
```
There is still a small issue: CLP_DEFAULT_DATASET_NAME is not added to the datasets table until the first compression job with dataset = CLP_DEFAULT_DATASET_NAME.

This behavior will not cause any functional bug, so I feel it is acceptable, but we could also explicitly add CLP_DEFAULT_DATASET_NAME to the datasets table here by calling add_dataset.
it's added when we compress the first archive that uses the default dataset. You guys can add the set of metadata tables for default dataset here if the webui requires it upon launching.
On a second thought, this change creates the tables for CLP_DEFAULT_DATASET_NAME but does not add CLP_DEFAULT_DATASET_NAME to the clp_datasets table.
It's more proper to use `add_dataset()`, but that would require bringing in clp_config.archive_output and introduce a ripple of changes in the argparsers.
@kirkrodrigues what do you think?
Can we call add_datasets in start_clp.py? That way we wouldn't need to pass in the config file.
Can you elaborate in the PR description? Can you fill out the checklist?

Updated PR description and filled out the checklist.
Actionable comments posted: 1
♻️ Duplicate comments (3)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3)
73-82: Add docstring for the update_tags function.

This function lacks documentation. Please add a docstring explaining its purpose and parameters.

84-94: Add docstring for the update_job_metadata_and_tags function.

This function lacks documentation and should include a docstring explaining its purpose and parameters.

97-123: Add docstring for the update_archive_metadata function.

This function lacks documentation and should include a docstring explaining its purpose and parameters.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (8 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: gibber9809
PR: y-scope/clp#672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (6)
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: The s3_put API in clp_py_utils.s3_utils internally handles timeout and retry mechanisms for S3 operations.
🪛 Pylint (3.3.7)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
[error] 22-25: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
[convention] 73-73: Missing function or method docstring
(C0116)
[convention] 84-84: Missing function or method docstring
(C0116)
[refactor] 84-84: Too many arguments (6/5)
(R0913)
[refactor] 84-84: Too many positional arguments (6/5)
(R0917)
[convention] 97-97: Missing function or method docstring
(C0116)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (5)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (5)
22-25: LGTM! Import changes support dataset-aware table naming. The updated imports align with the refactoring objective to replace hardcoded table name constants with centralized getter functions that support dataset parameters.
310-310: LGTM! Proper dataset variable initialization. The `input_dataset` variable is correctly initialized as `Optional[str]` and properly set for CLP_S storage engine workflows.
384-384: LGTM! S3 upload call properly includes dataset parameter. The call to the new helper function correctly passes the `input_dataset` parameter for dataset-aware S3 path construction.
402-404: LGTM! Archive metadata update properly includes dataset parameter. The function call correctly passes the `input_dataset` parameter to support dataset-aware table naming.
409-409: LGTM! Job metadata update properly includes dataset parameter. The function call correctly passes the `input_dataset` parameter to maintain consistency with the dataset-aware refactoring.
Bill-hbrhbr
left a comment
Three major points:
- In the query scheduler, a lot of `dataset` arg passing is not needed, because one can directly get the dataset from all classes derived from `QueryJobConfig`.
- As discussed in previous PRs, `ExtractIrJobConfig` should never have a non-`None` `dataset`. So all dataset handling when we are dealing with IR extraction jobs can be safely removed to reduce confusion.
- Make all dataset types in `components/job-orchestration/job_orchestration/scheduler/job_config.py` optional with a default value of `None`. In this way we can change every `if CLP_S == storage_engine` check to an `if dataset is not None` check, which is more robust down the road if we support dataset features in other scenarios.

I'll follow this design when modifying my dataset interface PR.
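The third point above can be pictured with a minimal sketch. The field and helper names below are illustrative (a plain dataclass stands in for the real job config classes), and the point is the shape of the check, not the actual scheduler code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryJobConfig:
    # Optional with a default of None, as suggested above.
    dataset: Optional[str] = None


def resolve_table_prefix(table_prefix: str, config: QueryJobConfig) -> str:
    # Branch on the dataset itself rather than on the storage engine, so the
    # logic keeps working if other storage engines gain dataset support later.
    if config.dataset is not None:
        return f"{table_prefix}{config.dataset}_"
    return table_prefix
```

With this shape, a non-dataset workflow simply leaves `dataset` as `None` and takes the unprefixed path.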
```python
) as metadata_db_cursor:
    if StorageEngine.CLP_S == storage_engine:
        create_datasets_table(metadata_db_cursor, table_prefix)
        # Note: the webui still expect tables with default dataset
```
On second thought, this change creates the tables for `CLP_DEFAULT_DATASET_NAME` but does not add `CLP_DEFAULT_DATASET_NAME` to the `clp_datasets` table.
It would be more proper to use `add_dataset()`, but that would require bringing in `clp_config.archive_output` and introduce a ripple of changes in the argparsers.
@kirkrodrigues what do you think?
Actionable comments posted: 2
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
876-882: Long line in `client_settings_json_updates` breaks Black / Pylint limits

Static analysis still reports >100-char lines (line 878). Re-wrap the offending entry to unblock CI:

```diff
- "MongoDbSearchResultsMetadataCollectionName": clp_config.webui.results_metadata_collection_name,
+ "MongoDbSearchResultsMetadataCollectionName": (
+     clp_config.webui.results_metadata_collection_name
+ ),
```

Everything else in this block (the switch to `get_*_table_name`) looks solid. Nice use of the new helpers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- `components/clp-package-utils/clp_package_utils/scripts/start_clp.py` (2 hunks)
- `components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py` (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: gibber9809
PR: y-scope/clp#672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.545Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.237Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (10)
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.237Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.545Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (2)
- `get_archives_table_name` (215-216)
- `get_files_table_name` (227-228)

components/clp-py-utils/clp_py_utils/clp_config.py (1)
- `StorageEngine` (47-49)
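For context, dataset-aware getters of this shape might look like the following sketch. The signatures are inferred from the call sites discussed in this review, not copied from `clp_metadata_db_utils.py`:

```python
from typing import Optional


def _get_table_name(table_prefix: str, suffix: str, dataset: Optional[str]) -> str:
    # Centralized name generation: the optional dataset is folded into the
    # name instead of callers hardcoding suffix constants per storage engine.
    if dataset is not None:
        return f"{table_prefix}{dataset}_{suffix}"
    return f"{table_prefix}{suffix}"


def get_archives_table_name(table_prefix: str, dataset: Optional[str] = None) -> str:
    return _get_table_name(table_prefix, "archives", dataset)


def get_files_table_name(table_prefix: str, dataset: Optional[str] = None) -> str:
    return _get_table_name(table_prefix, "files", dataset)
```

Centralizing the generation this way is what lets the PR delete the storage-engine-based conditionals at each call site.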
🪛 Pylint (3.3.7)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
[error] 36-39: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
[convention] 878-878: Line too long (104/100)
(C0301)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
[error] 23-23: Unable to import 'clp_py_utils.clp_logging'
(E0401)
[warning] 424-424: TODO: This should be removed once the webui supports aggregating stats from multiple
(W0511)
[refactor] 456-456: Too many arguments (6/5)
(R0913)
[refactor] 456-456: Too many positional arguments (6/5)
(R0917)
🔇 Additional comments (5)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)
17-17: LGTM: Import additions support the refactoring. The new imports for `CLP_DEFAULT_DATASET_NAME` and `get_tags_table_name` properly support the dynamic table name generation and default dataset creation logic. Also applies to: 27-27
187-191: LGTM: Dataset handling logic is correct. The variable renaming from `dataset_name` to `dataset` improves consistency with the input config field, and the conditional logic properly checks for a non-null dataset before adding it to the existing datasets.
264-264: LGTM: Dynamic table name generation improves maintainability. The use of the `get_tags_table_name()` function instead of hardcoded table names centralizes table name logic and supports the dataset-aware table naming approach mentioned in the PR objectives. Also applies to: 266-266, 271-271
418-434: LGTM: Default dataset creation fixes the web UI bug. This logic directly addresses the bug described in the PR objectives where the web UI fails to load before the first CLP-S compression job. Creating the default dataset at startup ensures the required tables exist when the web UI queries them.
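The startup fix described in the 418-434 comment can be pictured as an idempotent bootstrap. Everything below is an illustration under assumptions: `sqlite3` stands in for the real MySQL connection, the table layouts and helper name are invented, and `"default"` is assumed as the value of `CLP_DEFAULT_DATASET_NAME`:

```python
import sqlite3  # stand-in for the real MySQL connection

CLP_DEFAULT_DATASET_NAME = "default"  # assumed value, for illustration only


def ensure_default_dataset(cursor, table_prefix: str) -> None:
    # Idempotent: CREATE TABLE IF NOT EXISTS lets the scheduler run this on
    # every startup, so the web UI finds the tables it queries even before
    # the first CLP-S compression job has run.
    for suffix in ("archives", "files", "tags"):
        cursor.execute(
            f"CREATE TABLE IF NOT EXISTS "
            f"{table_prefix}{CLP_DEFAULT_DATASET_NAME}_{suffix} (id TEXT)"
        )


conn = sqlite3.connect(":memory:")
ensure_default_dataset(conn.cursor(), "clp_")
# Safe to call again on the next startup.
ensure_default_dataset(conn.cursor(), "clp_")
```

The idempotence is what makes it safe to run unconditionally at scheduler startup rather than tracking whether the bootstrap already happened.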
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
36-39: Verify `clp_py_utils.clp_metadata_db_utils` is packaged and importable

Please confirm that the `clp_metadata_db_utils.py` file makes it into your published artifact and can be imported cleanly:

• Build and inspect the wheel:

```shell
cd components/clp-py-utils
poetry build -f wheel
unzip -l dist/clp_py_utils-*.whl | grep clp_metadata_db_utils.py
```

Ensure the path `clp_py_utils/clp_metadata_db_utils.py` appears in the list.

• Test installation in a fresh venv:

```shell
pip install dist/clp_py_utils-*.whl
python - <<'PY'
import clp_py_utils.clp_metadata_db_utils
print("Import OK")
PY
```

• Check your packaging config and package exports:
– components/clp-py-utils/pyproject.toml
– components/clp-py-utils/clp_py_utils/__init__.py (make sure you're not restricting imports via `__all__`)

If the module is missing, add an explicit include in your pyproject.toml, for example:

```toml
[[tool.poetry.packages]]
include = "clp_py_utils"
```
```python
dataset: Optional[str] = None
if StorageEngine.CLP_S == clp_config.package.storage_engine:
    table_prefix = f"{table_prefix}{CLP_DEFAULT_DATASET_NAME}_"
    dataset = CLP_DEFAULT_DATASET_NAME
```
🧹 Nitpick (assertive)
Minor nit: collapse the two-step dataset assignment
The explicit type annotation is helpful, but the two-step pattern is verbose. You could inline it without losing clarity:
```diff
- dataset: Optional[str] = None
- if StorageEngine.CLP_S == clp_config.package.storage_engine:
-     dataset = CLP_DEFAULT_DATASET_NAME
+ dataset: Optional[str] = (
+     CLP_DEFAULT_DATASET_NAME
+     if StorageEngine.CLP_S == clp_config.package.storage_engine
+     else None
+ )
```

Reduces cognitive load and keeps the flow tight.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Before
dataset: Optional[str] = None
if StorageEngine.CLP_S == clp_config.package.storage_engine:
    table_prefix = f"{table_prefix}{CLP_DEFAULT_DATASET_NAME}_"
    dataset = CLP_DEFAULT_DATASET_NAME

# After
dataset: Optional[str] = (
    CLP_DEFAULT_DATASET_NAME
    if StorageEngine.CLP_S == clp_config.package.storage_engine
    else None
)
```
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 873 to 875, the variable 'dataset' is assigned in two steps with an
explicit type annotation followed by a conditional assignment. Simplify this by
combining the declaration and conditional assignment into a single line using a
conditional expression, preserving the Optional[str] type clarity while reducing
verbosity and improving readability.
Bill-hbrhbr
left a comment
lgtm
```python
from sql_adapter import SQL_Adapter
```
just curious. did you revert this change because it's not necessary for this PR?
Oui
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Bingran Hu <[email protected]>
…CLP storage engine; fix bug where query scheduler local cache is not properly updated in place
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- `components/clp-package-utils/clp_package_utils/scripts/native/compress.py` (4 hunks)
- `components/clp-package-utils/clp_package_utils/scripts/native/search.py` (9 hunks)
- `components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py` (4 hunks)
- `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` (7 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: gibber9809
PR: y-scope/clp#672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (10)
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.658Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
components/clp-package-utils/clp_package_utils/scripts/native/search.py (2)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (12)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.862Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:869-870
Timestamp: 2025-06-10T05:56:56.572Z
Learning: The team prefers keeping simple conditional logic inline rather than extracting it into helper functions, especially for straightforward operations like string concatenation based on storage engine type.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.130Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
🧬 Code Graph Analysis (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (2)
  get_tags_table_name (231-232)
  add_dataset (137-170)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  StorageEngine (47-49)
components/clp-package-utils/clp_package_utils/scripts/native/search.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  Database (80-159)
  ResultsCache (282-309)
  StorageEngine (47-49)
🪛 Pylint (3.3.7)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
[error] 10-10: Unable to import 'brotli'
(E0401)
[error] 11-11: Unable to import 'msgpack'
(E0401)
[error] 12-17: Unable to import 'clp_py_utils.clp_config'
(E0401)
[warning] 148-148: Using an f-string that does not have any interpolated variables
(W1309)
components/clp-package-utils/clp_package_utils/scripts/native/search.py
[error] 13-18: Unable to import 'clp_py_utils.clp_config'
(E0401)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
[error] 39-45: Unable to import 'clp_py_utils.clp_metadata_db_utils'
(E0401)
🪛 Flake8 (7.2.0)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
[error] 148-148: f-string is missing placeholders
(F541)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (15)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4)
8-8: LGTM! Import additions align with dataset parameter functionality. The new imports properly support the updated function signatures and dataset handling logic.
Also applies to: 12-17
138-143: LGTM! Function signature correctly updated for dataset parameter. The addition of the optional `dataset` parameter maintains backwards compatibility while enabling dataset-aware configuration.
150-150: LGTM! Dataset parameter properly passed to config constructors. The dataset parameter is correctly propagated to both the `FsInputConfig` and `S3InputConfig` constructors.
Also applies to: 165-165
236-241: LGTM! Dataset determination logic is correct. The conditional logic properly assigns `CLP_DEFAULT_DATASET_NAME` for the CLP_S storage engine and `None` otherwise, which aligns with the dataset handling strategy across the codebase.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)
26-26: LGTM! Import addition supports dynamic table naming. The `get_tags_table_name` import enables the transition from hardcoded table suffixes to centralized table name generation.
186-197: LGTM! Simplified dataset handling improves code clarity. Directly retrieving the dataset from the input config and conditionally adding it to the database eliminates the previous storage engine conditional logic, making the code more straightforward and maintainable.
270-270: LGTM! Dynamic table name generation replaces hardcoded suffixes. Using `get_tags_table_name(table_prefix, dataset)` instead of hardcoded table suffix concatenation provides better consistency and centralized table naming logic.
Also applies to: 272-272, 277-277
424-424: LGTM! Direct storage engine access is cleaner. Accessing the storage engine directly from `clp_config.package.storage_engine` without storing it locally simplifies the code and aligns with the removal of the `clp_storage_engine` parameter.
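To illustrate the pattern the review describes, here is a minimal sketch of a centralized table-name helper with an optional dataset parameter. The real implementation lives in `clp_py_utils/clp_metadata_db_utils.py`; the exact naming scheme below (prefix, dataset, suffix) is an assumption for illustration only.

```python
from typing import Optional


def get_tags_table_name(table_prefix: str, dataset: Optional[str] = None) -> str:
    # Dataset-specific workflows (e.g., CLP_S) get a per-dataset table name;
    # non-dataset workflows fall back to the prefix-plus-suffix form.
    if dataset is not None:
        return f"{table_prefix}{dataset}_tags"
    return f"{table_prefix}tags"
```

Centralizing the concatenation this way is what lets callers drop the storage engine conditionals: they pass `dataset=None` for non-dataset workflows and a dataset name otherwise.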
13-18: LGTM! Import additions support dataset parameter functionality. The new imports properly enable dataset handling and storage engine conditional logic throughout the search workflow.
40-40: LGTM! Dataset parameter consistently propagated through call chain. The dataset parameter is properly threaded through all functions from `main` down to `create_and_monitor_job_in_db`, maintaining clean function signatures and enabling dataset-aware search job configuration.
Also applies to: 52-52, 123-123, 158-158, 196-196, 211-211, 225-225
296-300: LGTM! Dataset determination logic aligns with codebase patterns. The conditional assignment of `CLP_DEFAULT_DATASET_NAME` for the CLP_S storage engine and `None` otherwise follows the same pattern established in other components, ensuring consistency across the codebase.
Also applies to: 306-306
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (4)
39-45: LGTM! Import additions enable dynamic table name resolution. The imported functions (`fetch_existing_datasets`, `get_archive_tags_table_name`, `get_archives_table_name`, `get_files_table_name`, `get_tags_table_name`) replace hardcoded table suffix logic with centralized table name generation.
171-171: LGTM! Dynamic table name functions replace hardcoded suffixes. The transition to using `get_archives_table_name`, `get_archive_tags_table_name`, `get_tags_table_name`, and `get_files_table_name` with optional dataset parameters provides better consistency and centralized table naming logic throughout the query scheduler.
Also applies to: 393-393, 395-395, 403-408, 464-464, 480-484
630-633: LGTM! Improved dataset validation logic. Using `existing_datasets.update(new_datasets)` instead of overwriting the set is more efficient and maintains the accumulated dataset knowledge across multiple validation calls.
663-678: LGTM! Robust dataset validation with proper error handling. The addition of dataset validation for query jobs with appropriate error handling ensures that jobs with invalid datasets are properly failed and logged, preventing downstream processing issues.
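The cache-merge validation pattern praised above can be sketched as follows. The function name, the injected fetcher, and the set-based cache are assumptions for this sketch; the scheduler's actual code shapes may differ.

```python
from typing import Callable, Set


def dataset_exists(
    dataset: str,
    existing_datasets: Set[str],
    fetch_existing_datasets: Callable[[], Set[str]],
) -> bool:
    """Check `dataset` against a cached set, refreshing the cache on a miss."""
    if dataset in existing_datasets:
        return True
    # Merge via update() rather than overwriting the set, so datasets
    # discovered in earlier validation calls are retained.
    existing_datasets.update(fetch_existing_datasets())
    return dataset in existing_datasets
```

On a miss the caller can then mark the job as failed and log the error, matching the behaviour described for lines 663-678.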
dataset = (
    CLP_DEFAULT_DATASET_NAME
    if StorageEngine.CLP_S == clp_config.package.storage_engine
    else None
)
Since this is only used inside _generate_clp_io_config, why not move this into _generate_clp_io_config?
…into DatasetRefactor
Co-authored-by: kirkrodrigues <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)
152-152: LGTM! Error message improvements are correctly applied. The removal of unnecessary f-string formatting for static error messages is a good improvement that addresses the previous linting issues.
Also applies to: 161-161
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2 hunks)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (7 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: gibber9809
PR: y-scope/clp#672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (9)
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (6)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (11)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py:239-244
Timestamp: 2025-06-29T22:01:05.569Z
Learning: In the CLP codebase, dataset validation is handled at the outer layer in wrapper scripts (like clp_package_utils/scripts/search.py) that call native implementation scripts (like clp_package_utils/scripts/native/search.py). The native scripts are internal components that receive pre-validated parameters, so they don't need their own dataset validation logic.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:869-870
Timestamp: 2025-06-10T05:56:56.572Z
Learning: The team prefers keeping simple conditional logic inline rather than extracting it into helper functions, especially for straightforward operations like string concatenation based on storage engine type.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
🧬 Code Graph Analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (2)
  get_tags_table_name (231-232)
  add_dataset (137-170)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  StorageEngine (47-49)
🪛 Flake8 (7.2.0)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
[error] 8-8: 'typing.Optional' imported but unused
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (14)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
12-17: LGTM! Import additions support the dataset handling refactoring. The new imports for `CLP_DEFAULT_DATASET_NAME` and `StorageEngine` are correctly added to support the internal dataset determination logic.
143-147: LGTM! Dataset determination logic is correctly implemented. The logic properly determines the dataset based on the storage engine type: using the default dataset name for CLP_S and `None` for other engines. This encapsulation within the function aligns well with the PR's objective to centralize dataset handling.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (5)
26-26: LGTM! The addition of the `get_tags_table_name` import aligns with the PR's objective to use dynamic table name retrieval functions.
156-162: Function signature simplified appropriately. The removal of the `clp_storage_engine` parameter aligns with the broader refactoring to eliminate storage engine-based conditional logic throughout the codebase.
186-198: Dataset handling logic is properly implemented. The code correctly retrieves the dataset from the input config and adds it if not already present. The caching mechanism with `existing_datasets` is efficient.
271-282: Dynamic table name generation properly implemented. The use of `get_tags_table_name` with the dataset parameter ensures correct table resolution for dataset-specific workflows.
425-428: Direct config access improves code clarity. Accessing the storage engine directly from the config eliminates unnecessary variable storage while maintaining the same functionality.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (7)
39-45: LGTM! The addition of table name getter function imports supports the transition to dynamic table name resolution.
171-171: Dataset parameter correctly propagated. The addition of the dataset parameter from the job config ensures the correct dataset-specific archives table is queried.
393-409: Dynamic table resolution properly implemented for archive searches. The function correctly extracts the dataset and uses it to generate appropriate table names for the archives, archive tags, and tags tables.
464-464: Files table correctly uses `None` for the dataset parameter. This aligns with the learning that file-related functions no longer use dataset parameters, indicating files are not dataset-specific.
477-490: Function signature properly updated for dataset support. The addition of the optional dataset parameter and dynamic table name generation ensures correct dataset-specific archive queries.
631-634: Dataset validation correctly updates the existing set. The use of `update()` instead of assignment ensures that previously cached datasets are retained while adding newly discovered ones.
664-679: Robust dataset validation with proper error handling. The code correctly validates the dataset existence and marks the job as failed with appropriate error logging when the dataset doesn't exist.
 import typing
 from contextlib import closing
-from typing import List
+from typing import List, Optional, Union
🧹 Nitpick (assertive)
Remove unused import.
The Optional import is not used anywhere in the code and should be removed to keep imports clean.
-from typing import List, Optional, Union
+from typing import List, Union
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import List, Union
🧰 Tools
🪛 Flake8 (7.2.0)
[error] 8-8: 'typing.Optional' imported but unused
(F401)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py at
line 8, the import statement includes Optional which is not used anywhere in the
code. Remove Optional from the import statement to clean up unused imports and
keep the code tidy.
    ResultSetHeader,
} from "mysql2/promise";

import settings from "../../settings.json";
You need `with { type: "json" }` here.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/webui/server/src/plugins/DbManager.ts(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.
⚙️ Source: CodeRabbit Configuration File
List of files the instruction was applied to:
components/webui/server/src/plugins/DbManager.ts
🧠 Learnings (2)
📓 Common learnings
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.
Learnt from: gibber9809
PR: y-scope/clp#672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
components/webui/server/src/plugins/DbManager.ts (1)
Learnt from: junhaoliao
PR: y-scope/clp#596
File: components/log-viewer-webui/client/src/api/query.js:16-23
Timestamp: 2024-11-21T15:51:33.203Z
Learning: In `components/log-viewer-webui/client/src/api/query.js`, the `ExtractJsonResp` type definition is accurate as-is and does not require modification. When suggesting changes to type definitions, ensure they align with the server-side definitions, referencing the source code if necessary.
🧬 Code Graph Analysis (1)
components/webui/server/src/plugins/DbManager.ts (1)
components/webui/server/src/configConstants.ts (2)
  CLP_STORAGE_ENGINE_CLP_S (8-8)
  CLP_DEFAULT_DATASET_NAME (7-7)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
components/webui/server/src/plugins/DbManager.ts (1)
12-16: Confirm tool-chain support for the `with { type: "json" }` import assertion. Node.js currently only supports the older `assert { type: "json" }` syntax unless run with experimental flags, and TypeScript requires `resolveJsonModule` plus `"module": "NodeNext"` to emit the correct loader code. Please ensure the runtime (and test) environment actually understands the `with` form, otherwise the server will fail to start.
jobConfig = {
    dataset: CLP_STORAGE_ENGINE_CLP_S === settings.ClpStorageEngine ?
        CLP_DEFAULT_DATASET_NAME :
        null,
    archive_id: streamId,
    target_chunk_size: targetUncompressedSize,
};
🧹 Nitpick (assertive)
Avoid sending a `dataset: null` field when it's not needed
Downstream scheduler components typically look for the `dataset` key only when the storage engine is CLP_S. Serialising the field with a null value adds noise and may confuse strict schema validators.
- jobConfig = {
- dataset: CLP_STORAGE_ENGINE_CLP_S === settings.ClpStorageEngine ?
- CLP_DEFAULT_DATASET_NAME :
- null,
- archive_id: streamId,
- target_chunk_size: targetUncompressedSize,
- };
+ jobConfig = {
+ archive_id: streamId,
+ target_chunk_size: targetUncompressedSize,
+ ...(CLP_STORAGE_ENGINE_CLP_S === settings.ClpStorageEngine && {
+ dataset: CLP_DEFAULT_DATASET_NAME,
+ }),
+ };This keeps the payload minimal and makes future schema evolution easier.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
jobConfig = {
    archive_id: streamId,
    target_chunk_size: targetUncompressedSize,
    ...(CLP_STORAGE_ENGINE_CLP_S === settings.ClpStorageEngine && {
        dataset: CLP_DEFAULT_DATASET_NAME,
    }),
};
🤖 Prompt for AI Agents
In components/webui/server/src/plugins/DbManager.ts around lines 132 to 138, the
jobConfig object always includes the dataset field, even when its value is null.
To avoid sending dataset: null, modify the code to only add the dataset field to
jobConfig when settings.ClpStorageEngine equals CLP_STORAGE_ENGINE_CLP_S. This
can be done by conditionally adding the dataset property instead of assigning
null, ensuring the field is omitted entirely when not needed.
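As a standalone sketch of the conditional-spread pattern the suggestion relies on (the constant and field names mirror the snippet above; `buildJobConfig` is a hypothetical helper for illustration, not the actual DbManager code):

```javascript
const CLP_STORAGE_ENGINE_CLP_S = "clp-s";
const CLP_DEFAULT_DATASET_NAME = "default";

// Hypothetical helper illustrating the pattern; not the actual server code.
function buildJobConfig(storageEngine, streamId, targetUncompressedSize) {
    return {
        archive_id: streamId,
        target_chunk_size: targetUncompressedSize,
        // Spreading `false` into an object literal is a no-op, so the
        // `dataset` key is omitted entirely for non-clp-s engines.
        ...(CLP_STORAGE_ENGINE_CLP_S === storageEngine && {
            dataset: CLP_DEFAULT_DATASET_NAME,
        }),
    };
}
```

With `"clp-s"` the result carries a `dataset` field; with any other engine the key is absent altogether, so strict schema validators never see a `null`.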
```diff
@@ -0,0 +1,9 @@
+// NOTE: These settings are duplicated from components/webui/client/src/config/index.ts, but will be
+// removed in the near future.
+const CLP_STORAGE_ENGINE_CLP_S = "clp-s";
```
is it better to use an enum like

```ts
enum CLP_STORAGE_ENGINE {
    CLP = "clp",
    CLP_S = "clp-s",
}
```

then compare the values against it?
probably better to move stuff from this file into the typings directory
We have this enum in the client code. This constant is just temporary to keep the webui working in this PR. In #1050, we're going to pass the dataset from the client code, so the logic in the last two commits can be reverted.
Or are you suggesting we duplicate the entire enum in the server code for now?
right i was going to suggest reusing / duplicating this enum: https://github.com/haiqi96/clp_fork/blob/7b42568fa9d335ecbe85f13edc5d3c111b4e8587/components/webui/client/src/config/index.ts
so the logic in the last two commits can be reverted.
sure. in that case i think we can temporarily leave the constant as-is then
junhaoliao
left a comment
kirkrodrigues
left a comment
For the PR title & body, how about:
refactor(clp-package): Refactor dataset-related logic:
- Add helper methods to generate SQL table names.
- Make `dataset` optional in job configs.
- Change if-guards around dataset logic from checking for a specific storage engine to checking whether the dataset is set.
Description
This PR refactored the following:
The PR also fixes a bug where the old webui fails to load before the first CLP-S compression job.
Previously, the old webui would show an empty page before the first compression job was launched. This is fixed by creating the set of tables with `dataset = default` when the package starts, here
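A minimal sketch of the kind of table-name helper the walkthrough describes, where the generated name depends on an optional dataset (the function name, prefix, and naming scheme are illustrative assumptions, not the actual clp-package API):

```javascript
// Hypothetical table-name helper; the real naming scheme may differ.
function getArchivesTableName(tablePrefix, dataset = null) {
    // When a dataset is set, the table is namespaced per dataset;
    // otherwise the legacy un-namespaced name is returned.
    return null === dataset
        ? `${tablePrefix}archives`
        : `${tablePrefix}${dataset}_archives`;
}
```

Centralizing name generation this way lets callers pass the dataset through instead of branching on the storage engine at every call site.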
Checklist
breaking change.
Validation performed
Manually tested:
Summary by CodeRabbit
Refactor
Style
Chores