Conversation

@haiqi96 haiqi96 commented Jun 25, 2025

Description

Note: this PR is based on #1023

This PR includes the following changes:

  1. Added a new retention_cleaner component to the package that launches one or more asyncio tasks to handle retention for different components.
  2. Updated clp_config.py to support specifying retention periods and retention frequencies for archives and search results.
  3. Updated the S3 API for deleting objects from a bucket and updated the required permissions in the user docs.
  4. Implemented a targetBuffer class to support fault tolerance.

This PR also handles a potential race condition between the search scheduler and archive retention: an archive selected for removal by the retention cleaner may still be under search.
To avoid this, we require the following:

  1. The search scheduler will not search any archive that is considered out-of-date.
  2. The archive retention cleaner shall not start deleting archives until all search jobs that started before the expiry_ts have finished, which ensures that no job will still be searching the archive.
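
Rule 2 above reduces to a simple check over the start times of currently running search jobs; a minimal sketch with assumed names (not the PR's actual code):

```python
from typing import List


def can_start_deletion(
    expiry_ts_secs: float, running_job_start_times_secs: List[float]
) -> bool:
    """Rule 2: deletion may begin only once every search job that started
    before the expiry timestamp has finished, i.e., no such job remains in
    the running-jobs list."""
    return all(t >= expiry_ts_secs for t in running_job_start_times_secs)
```

Combined with rule 1, this guarantees that by the time deletion starts, no running job can be holding a reference to an out-of-date archive.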

Note:
This PR doesn't support stream retention yet; it will be properly supported in a future PR.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Tested corner case and basic functionality of archive retention for both clp-json and clp-text on fs. The test script can be found at https://github.com/haiqi96/retention_tester
  • Manually tested S3 basic functionality
  • Manually tested search results garbage collection

Summary by CodeRabbit

  • New Features

    • Added a configurable garbage collector component with CLI start/stop integration and per-container config handling.
  • Improvements

    • New retention-period and sweep-interval configuration; scheduler now filters by retention and a DB index improves job lookup performance.
    • S3 utilities support batched object deletion; IAM examples updated to include s3:DeleteObject.
    • Standardized directory validation and safer, coordinated metadata-plus-storage deletion with fault-tolerant buffering.

@LinZhihao-723 LinZhihao-723 left a comment

The name target is very vague in this context. Let's try to make it clear and specific. Thus I made the following suggestions:

@LinZhihao-723 LinZhihao-723 left a comment

Reviewed changes to the query scheduler. I don't have more comments for now. I will do another round when the current comments are resolved.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (6)
components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py (3)

25-34: Handle potential ValueError more gracefully.

The function raises a ValueError that isn't caught by the caller, which could crash the garbage collector unexpectedly.

Apply this diff to handle the error gracefully:

 def _get_latest_doc_timestamp(collection: pymongo.collection.Collection) -> int:
     latest_doc = collection.find_one(sort=[(MONGODB_ID_KEY, pymongo.DESCENDING)])
     if latest_doc is None:
         return 0
 
     object_id = latest_doc[MONGODB_ID_KEY]
     if isinstance(object_id, ObjectId):
         return int(object_id.generation_time.timestamp())
-    raise ValueError(f"{object_id} is not an ObjectID")
+    logger.warning(f"Unexpected ID type in collection: {type(object_id)}. Skipping.")
+    return 0
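
For context, an ObjectId's generation_time comes from the timestamp embedded in its first four bytes. A stdlib-only sketch of the same extraction (illustrative; real code should keep using pymongo's ObjectId):

```python
from datetime import datetime, timezone


def object_id_creation_epoch_secs(object_id_hex: str) -> int:
    """Extract the Unix timestamp embedded in a MongoDB ObjectId.

    The first 4 of an ObjectId's 12 bytes are a big-endian Unix timestamp;
    pymongo exposes the same value as ObjectId.generation_time."""
    if len(object_id_hex) != 24:
        raise ValueError(f"{object_id_hex!r} is not a valid ObjectId hex string")
    return int(object_id_hex[:8], 16)


# Example: an ObjectId created in October 2012.
created = datetime.fromtimestamp(
    object_id_creation_epoch_secs("507f1f77bcf86cd799439011"), tz=timezone.utc
)
```

This is why the latest document's `_id` can stand in for a "last written" timestamp without storing one explicitly.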

53-56: Document the rationale for filtering numeric job IDs.

Add a comment explaining why non-numeric collection names are skipped:

         for job_id in collection_names:
+            # Skip non-numeric collections as they are not job result collections
             if not job_id.isdigit():
                 continue

83-85: Consider more specific exception handling.

The broad exception catch might hide specific MongoDB connection or operation errors.

Replace the broad exception handling with more specific error handling:

+import pymongo.errors
+
 async def search_result_garbage_collector(
     clp_config: CLPConfig, log_directory: pathlib.Path, logging_level: str
 ) -> None:
     configure_logger(logger, logging_level, log_directory, SEARCH_RESULT_GARBAGE_COLLECTOR_NAME)
 
     sweep_interval_secs = clp_config.garbage_collector.sweep_interval.search_result * MIN_TO_SECONDS
 
     logger.info(f"{SEARCH_RESULT_GARBAGE_COLLECTOR_NAME} started.")
     try:
         while True:
-            _collect_and_sweep_expired_search_results(
-                clp_config.results_cache, clp_config.webui.results_metadata_collection_name
-            )
+            try:
+                _collect_and_sweep_expired_search_results(
+                    clp_config.results_cache, clp_config.webui.results_metadata_collection_name
+                )
+            except pymongo.errors.ConnectionFailure as e:
+                logger.exception(f"MongoDB connection failed: {e}")
+            except pymongo.errors.OperationFailure as e:
+                logger.exception(f"MongoDB operation failed: {e}")
             await asyncio.sleep(sweep_interval_secs)
     except Exception:
         logger.exception(f"{SEARCH_RESULT_GARBAGE_COLLECTOR_NAME} exited with failure.")
         raise
components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py (2)

72-73: Use conventional comparison order.

-    if 0 == num_candidates_to_delete:
+    if num_candidates_to_delete == 0:

86-88: Split long comment to comply with line length limit.

-        # Note: If dataset is not None, candidates are expected to be in the format
-        # `<dataset>/<archive_id>`
+        # Note: If dataset is not None, candidates are expected to be in the format of
+        # `<dataset>/<archive_id>`
components/job-orchestration/job_orchestration/garbage_collector/utils.py (1)

21-28: Consider adding error handling for file operations.

The function should handle potential exceptions from file operations and validate that the log directory exists.

 def configure_logger(
     logger: logging.Logger, logging_level: str, log_directory: pathlib.Path, handler_name: str
 ):
+    if not log_directory.exists():
+        log_directory.mkdir(parents=True, exist_ok=True)
+    
     log_file = log_directory / f"{handler_name}.log"
-    logging_file_handler = logging.FileHandler(filename=log_file, encoding="utf-8")
-    logging_file_handler.setFormatter(get_logging_formatter())
-    logger.addHandler(logging_file_handler)
-    set_logging_level(logger, logging_level)
+    try:
+        logging_file_handler = logging.FileHandler(filename=log_file, encoding="utf-8")
+        logging_file_handler.setFormatter(get_logging_formatter())
+        logger.addHandler(logging_file_handler)
+        set_logging_level(logger, logging_level)
+    except (OSError, IOError) as e:
+        logger.error(f"Failed to configure file handler for {handler_name}: {e}")
+        raise
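
The suggested mkdir guard can be seen in runnable form (a standalone sketch with assumed names, not the PR's actual utils module):

```python
import logging
import pathlib
import tempfile


def configure_logger(logger, level, log_directory: pathlib.Path, handler_name: str):
    # Create the directory first: logging.FileHandler raises FileNotFoundError
    # when a parent directory of the log file path is missing.
    log_directory.mkdir(parents=True, exist_ok=True)
    log_file = log_directory / f"{handler_name}.log"
    handler = logging.FileHandler(filename=log_file, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


with tempfile.TemporaryDirectory() as tmp:
    # The nested directory does not exist yet; the mkdir above makes this safe.
    log_dir = pathlib.Path(tmp) / "var" / "log" / "gc"
    demo_logger = logging.getLogger("gc_demo")
    handler = configure_logger(demo_logger, logging.INFO, log_dir, "archive_gc")
    demo_logger.info("sweep complete")
    handler.close()
    log_text = (log_dir / "archive_gc.log").read_text(encoding="utf-8")
```

Without the `mkdir` call, constructing the `FileHandler` fails on a fresh deployment whose log directory has not been created yet.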
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b93bb4b and a4546cf.

📒 Files selected for processing (5)
  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (4 hunks)
  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py (1 hunks)
  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py (1 hunks)
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (10 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: in `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variabl...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: in `fs_compression_task.py`, when refactoring the archive processing loop in python, the `src_archiv...
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py
📚 Learning: the column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-sp...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: in the clp codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: in python, `logger.exception()` automatically logs the full exception stack trace, so there's no nee...
Learnt from: haiqi96
PR: y-scope/clp#665
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:142-152
Timestamp: 2025-01-22T20:26:25.942Z
Learning: In Python, `logger.exception()` automatically logs the full exception stack trace, so there's no need to explicitly add the exception string to the message.

Applied to files:

  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
📚 Learning: in the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions...
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:654-656
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions are intentionally rethrown to be handled gracefully at a higher level.

Applied to files:

  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
📚 Learning: the current implementation assumes single-threaded execution, so race conditions in functions like `...
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (7)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (1)

343-359: Good refactoring to use centralized deletion function.

The change from DELETE ... RETURNING to SELECT followed by delete_archives_from_metadata_db improves maintainability by centralizing deletion logic and ensuring proper ordering of foreign key deletions.

components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py (1)

99-147: Well-designed logic for safe archive expiry calculation.

The function correctly handles the race condition between query scheduling and archive deletion by considering running query jobs' creation times. The documentation clearly explains the safety guarantees.

components/job-orchestration/job_orchestration/garbage_collector/utils.py (2)

54-56: Good use of modern datetime API.

Using datetime.fromtimestamp with timezone.utc is the correct approach and avoids the deprecated utcfromtimestamp.


90-148: Well-designed fault-tolerant buffer implementation.

The DeletionCandidatesBuffer class provides excellent fault tolerance with recovery capabilities. The separation between in-memory candidates and those to be persisted is a good design choice for crash recovery.

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (3)

394-408: Well-implemented retention period filtering.

The addition of archive_end_ts_lower_bound parameter and the filtering logic correctly ensures that archives outside the retention period are excluded from search. The condition (end_timestamp >= {archive_end_ts_lower_bound} OR end_timestamp = 0) properly handles both completed archives within retention and ongoing archives.


682-690: Correct calculation of retention cutoff timestamp.

The conversion from job creation time to archive retention cutoff is properly implemented, correctly converting minutes to milliseconds and accounting for the time difference.


1203-1203: Proper integration of retention configuration.

The retention period is correctly passed from the configuration through the call chain, ensuring the feature is properly integrated into the system.

@LinZhihao-723 LinZhihao-723 left a comment

A few missing comments.


def persist_new_candidates(self) -> None:
"""
Writes any new candidates added since initialization to the recovery file.

This isn't true, right? It should be since the last call of persist_new_candidates, not "since initialization". Can you update this description?
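
The corrected semantics — persisting only what was added since the last call — can be sketched as follows (illustrative names; not the PR's actual DeletionCandidatesBuffer):

```python
import pathlib
import tempfile


class CandidateBuffer:
    """Minimal fault-tolerant buffer: candidates are appended to a recovery
    file so an interrupted run can resume where it left off."""

    def __init__(self, recovery_file: pathlib.Path) -> None:
        self._recovery_file = recovery_file
        self._candidates = (
            recovery_file.read_text(encoding="utf-8").splitlines()
            if recovery_file.exists()
            else []
        )
        # Everything loaded from the recovery file is already persisted.
        self._num_persisted = len(self._candidates)

    def add(self, candidate: str) -> None:
        self._candidates.append(candidate)

    def persist_new_candidates(self) -> None:
        """Append candidates added since the LAST call of this method
        (not since initialization) to the recovery file."""
        new_candidates = self._candidates[self._num_persisted:]
        if new_candidates:
            with self._recovery_file.open("a", encoding="utf-8") as f:
                f.writelines(c + "\n" for c in new_candidates)
        self._num_persisted = len(self._candidates)


with tempfile.TemporaryDirectory() as tmp:
    recovery = pathlib.Path(tmp) / "candidates.txt"
    buf = CandidateBuffer(recovery)
    buf.add("archive-1")
    buf.persist_new_candidates()
    buf.add("archive-2")
    buf.persist_new_candidates()  # appends only "archive-2"
    persisted = recovery.read_text(encoding="utf-8").splitlines()
```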

@haiqi96 haiqi96 changed the title feat(package): Add retention period support for archives and search results. feat(package): Add garbage collectors to support retention for archives and search results. Aug 7, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (8)
components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py (2)

25-34: Handle potential ValueError in timestamp retrieval.

The function raises a ValueError if the ObjectId type check fails, but this exception is not caught in the calling code. This could crash the garbage collector unexpectedly.


53-55: Document the rationale for filtering numeric job IDs.

The code only processes collections with numeric names. Consider adding a comment explaining why non-numeric collection names are skipped.

components/clp-package-utils/clp_package_utils/general.py (3)

585-593: Add missing docstring.

Add a docstring to document the function.


99-104: Improve exception handling and add documentation.

The helper function needs proper exception chaining and a docstring.


185-190: Improve exception handling and add documentation.

The helper function needs proper exception chaining and a docstring.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

1058-1063: Add function docstring.

Please add a docstring describing the function's purpose and parameters.


1078-1079: Specify encoding when opening file.

Always specify encoding to avoid platform-dependent behaviour.

components/job-orchestration/job_orchestration/garbage_collector/utils.py (1)

21-28: Ensure log directory exists before creating log file.

The function should ensure the log directory exists to prevent FileNotFoundError.

 def configure_logger(
     logger: logging.Logger, logging_level: str, log_directory: pathlib.Path, handler_name: str
 ):
+    log_directory.mkdir(parents=True, exist_ok=True)
     log_file = log_directory / f"{handler_name}.log"
     logging_file_handler = logging.FileHandler(filename=log_file, encoding="utf-8")
     logging_file_handler.setFormatter(get_logging_formatter())
     logger.addHandler(logging_file_handler)
     set_logging_level(logger, logging_level)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c4821a and 74af600.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/general.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7 hunks)
  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py (1 hunks)
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (14)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-01-22T20:26:25.942Z
Learnt from: haiqi96
PR: y-scope/clp#665
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:142-152
Timestamp: 2025-01-22T20:26:25.942Z
Learning: In Python, `logger.exception()` automatically logs the full exception stack trace, so there's no need to explicitly add the exception string to the message.

Applied to files:

  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
📚 Learning: 2024-10-07T21:18:02.510Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:654-656
Timestamp: 2024-10-07T21:18:02.510Z
Learning: In the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions are intentionally rethrown to be handled gracefully at a higher level.

Applied to files:

  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
📚 Learning: 2024-12-16T21:49:06.483Z
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

  • components/job-orchestration/job_orchestration/garbage_collector/utils.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-02-12T22:24:17.723Z
Learnt from: jackluo923
PR: y-scope/clp#718
File: components/core/tools/scripts/utils/create-debian-package.py:41-41
Timestamp: 2025-02-12T22:24:17.723Z
Learning: For the clp-core Debian package creation script, strict version format validation is considered unnecessary complexity and should be avoided.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-07-03T12:58:18.407Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-07-01T14:51:19.172Z
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-06-28T07:10:47.295Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-07-03T20:10:43.789Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
🧬 Code Graph Analysis (2)
components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • ResultsCache (283-317)
  • get_uri (316-317)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/garbage_collector/utils.py (2)
  • configure_logger (21-28)
  • get_expiry_epoch_secs (42-51)
components/clp-package-utils/clp_package_utils/general.py (2)
components/clp-py-utils/clp_py_utils/core.py (1)
  • validate_path_could_be_dir (65-72)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • CLPConfig (629-815)
  • validate_logs_input_config (666-680)

Comment on lines +43 to +70
def _collect_and_sweep_expired_search_results(
result_cache_config: ResultsCache, results_metadata_collection_name: str
):
expiry_epoch = get_expiry_epoch_secs(result_cache_config.retention_period)

logger.debug(f"Searching for search jobs finished before {expiry_epoch}.")
deleted_job_ids: List[int] = []
with pymongo.MongoClient(result_cache_config.get_uri()) as results_cache_client:
results_cache_db = results_cache_client.get_default_database()
collection_names = results_cache_db.list_collection_names()
for job_id in collection_names:
if not job_id.isdigit():
continue

job_results_collection = results_cache_db.get_collection(job_id)
collection_timestamp = _get_latest_doc_timestamp(job_results_collection)
if collection_timestamp >= expiry_epoch:
continue

_delete_result_metadata(results_cache_db, results_metadata_collection_name, job_id)
job_results_collection.drop()
deleted_job_ids.append(int(job_id))

if len(deleted_job_ids) != 0:
logger.debug(f"Deleted search results of job(s): {deleted_job_ids}.")
else:
logger.debug(f"No search results matched the expiry criteria.")

@coderabbitai coderabbitai bot Aug 7, 2025

⚠️ Potential issue

Add specific exception handling for MongoDB operations.

The function calls _get_latest_doc_timestamp which can raise a ValueError, but this isn't caught. Additionally, MongoDB operations could fail with connection or operation errors.

Consider catching specific exceptions:

 def _collect_and_sweep_expired_search_results(
     result_cache_config: ResultsCache, results_metadata_collection_name: str
 ):
     expiry_epoch = get_expiry_epoch_secs(result_cache_config.retention_period)
 
     logger.debug(f"Searching for search jobs finished before {expiry_epoch}.")
     deleted_job_ids: List[int] = []
-    with pymongo.MongoClient(result_cache_config.get_uri()) as results_cache_client:
-        results_cache_db = results_cache_client.get_default_database()
-        collection_names = results_cache_db.list_collection_names()
-        for job_id in collection_names:
-            if not job_id.isdigit():
-                continue
-
-            job_results_collection = results_cache_db.get_collection(job_id)
-            collection_timestamp = _get_latest_doc_timestamp(job_results_collection)
-            if collection_timestamp >= expiry_epoch:
-                continue
-
-            _delete_result_metadata(results_cache_db, results_metadata_collection_name, job_id)
-            job_results_collection.drop()
-            deleted_job_ids.append(int(job_id))
+    try:
+        with pymongo.MongoClient(result_cache_config.get_uri()) as results_cache_client:
+            results_cache_db = results_cache_client.get_default_database()
+            collection_names = results_cache_db.list_collection_names()
+            for job_id in collection_names:
+                if not job_id.isdigit():
+                    continue
+
+                try:
+                    job_results_collection = results_cache_db.get_collection(job_id)
+                    collection_timestamp = _get_latest_doc_timestamp(job_results_collection)
+                    if collection_timestamp >= expiry_epoch:
+                        continue
+
+                    _delete_result_metadata(results_cache_db, results_metadata_collection_name, job_id)
+                    job_results_collection.drop()
+                    deleted_job_ids.append(int(job_id))
+                except ValueError as e:
+                    logger.warning(f"Skipping job {job_id} due to error: {e}")
+                    continue
+    except pymongo.errors.ConnectionFailure as e:
+        logger.exception(f"MongoDB connection failed: {e}")
+        raise
+    except pymongo.errors.OperationFailure as e:
+        logger.exception(f"MongoDB operation failed: {e}")
+        raise
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
around lines 43 to 70, add specific exception handling for MongoDB operations
and the call to _get_latest_doc_timestamp which may raise a ValueError. Wrap the
MongoDB client usage and operations in try-except blocks to catch
pymongo-specific exceptions like ConnectionError or OperationFailure, and also
catch ValueError from _get_latest_doc_timestamp. Log appropriate error messages
for these exceptions to handle failures gracefully without crashing the
function.

Member

We forward the errors to the top-level catch, as DB exceptions are considered non-recoverable in our use-case. We will exit on such errors.

Contributor

@LinZhihao-723, understood! That makes sense for a garbage collector service where database connectivity is essential for safe operation. Forwarding DB exceptions to the top-level handler and exiting on non-recoverable errors is a reasonable architectural choice to prevent inconsistent state.

🐰
📚 noting the architectural decision
💭


✏️ Learnings added
Learnt from: LinZhihao-723
PR: y-scope/clp#1035
File: components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py:43-70
Timestamp: 2025-08-13T18:55:32.034Z
Learning: In the CLP garbage collector components, DB exceptions (like MongoDB ConnectionFailure and OperationFailure) are considered non-recoverable and are forwarded to top-level exception handlers to exit the application rather than being handled locally.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
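The exit-on-DB-error architecture recorded in this learning can be sketched roughly as follows. This is an illustrative stand-alone example: `ConnectionFailure` is a local stand-in for the pymongo exception, and `sweep_once`/`main` are invented names, not the actual CLP code.

```python
import logging

logger = logging.getLogger(__name__)


class ConnectionFailure(Exception):
    """Stand-in for pymongo.errors.ConnectionFailure, to keep the sketch self-contained."""


def sweep_once() -> None:
    # A sweep task raises DB errors instead of handling them locally.
    raise ConnectionFailure("cannot reach results cache")


def main() -> int:
    try:
        sweep_once()
    except ConnectionFailure:
        # Non-recoverable: log and exit so a supervisor can restart the service.
        logger.exception("Results-cache DB error; exiting garbage collector.")
        return 1
    return 0
```

The top-level handler is the single place where DB failures terminate the process, keeping the sweep tasks themselves free of local retry logic.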

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)

612-626: Fix: Compression scheduler uses the query scheduler’s logging level.

Per team learning, generic_start_scheduler should respect each scheduler’s own logging level, not always query’s. Recommend computing the logging level based on component_name.

Apply this diff:

@@
-    necessary_env_vars = [
+    scheduler_logging_level = (
+        clp_config.compression_scheduler.logging_level
+        if COMPRESSION_SCHEDULER_COMPONENT_NAME == component_name
+        else clp_config.query_scheduler.logging_level
+    )
+
+    necessary_env_vars = [
         f"PYTHONPATH={clp_site_packages_dir}",
         (
             f"BROKER_URL=amqp://"
             f"{container_clp_config.queue.username}:{container_clp_config.queue.password}@"
             f"{container_clp_config.queue.host}:{container_clp_config.queue.port}"
         ),
         (
             f"RESULT_BACKEND=redis://default:{container_clp_config.redis.password}@"
             f"{container_clp_config.redis.host}:{container_clp_config.redis.port}/"
             f"{container_clp_config.redis.query_backend_database}"
         ),
         f"CLP_LOGS_DIR={container_logs_dir}",
-        f"CLP_LOGGING_LEVEL={clp_config.query_scheduler.logging_level}",
+        f"CLP_LOGGING_LEVEL={scheduler_logging_level}",
     ]

1062-1067: Add a concise docstring to describe the new component starter.

Keeps consistency and satisfies linting (missing-docstring) noted in prior reviews.

Apply this diff:

 def start_garbage_collector(
     instance_id: str,
     clp_config: CLPConfig,
     container_clp_config: CLPConfig,
     mounts: CLPDockerMounts,
 ):
+    """
+    Start the retention cleaner (garbage collector) container.
+
+    - Skips startup when no retention period is configured.
+    - Writes a per-container config into the logs directory.
+    - Mounts logs/archives/stream output directories as needed (FS), and configures AWS auth when required.
+    - Launches job_orchestration.garbage_collector.garbage_collector with the config path.
+
+    :param instance_id: Short identifier appended to container names.
+    :param clp_config: Host-side CLP configuration.
+    :param container_clp_config: Container-side CLP configuration for runtime.
+    :param mounts: Precomputed docker mounts for CLP directories and optional AWS config.
+    """

1071-1075: Use lazy formatting for logging (non-blocking, for future consistency).

Prefer logger’s lazy formatting to avoid unnecessary string interpolation when the log level filters messages out. Acknowledge prior decision to keep f-strings for now; leaving this as a gentle reminder.

Apply this diff:

-        logger.info(f"Retention period is not configured, skipping {component_name} creation...")
+        logger.info("Retention period is not configured, skipping %s creation...", component_name)
@@
-    logger.info(f"Starting {component_name}...")
+    logger.info("Starting %s...", component_name)
@@
-    logger.info(f"Started {component_name}.")
+    logger.info("Started %s.", component_name)

Also applies to: 1141-1141
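For reference, the difference between the two logging styles (with an assumed component name) is:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("start_clp")

component_name = "garbage_collector"  # assumed value for illustration

# Eager: the f-string is always built, even when INFO is filtered out.
logger.info(f"Started {component_name}.")

# Lazy: the message is interpolated only if the record is actually emitted.
logger.info("Started %s.", component_name)
```

Both emit the same message when the level is enabled; the lazy form simply defers the string formatting.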


1082-1083: Specify encoding when writing the container config.

Avoid platform-dependent defaults; use explicit UTF-8.

Apply this diff:

-    with open(container_config_file_path, "w") as f:
+    with open(container_config_file_path, "w", encoding="utf-8") as f:
         yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74af600 and 34a06a5.

📒 Files selected for processing (1)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (5)
  • is_retention_period_configured (585-592)
  • validate_log_directory (185-189)
  • validate_logs_input_config (498-499)
  • validate_output_storage_config (502-507)
  • CLPDockerMounts (86-96)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • validate_logs_input_config (666-680)
  • CLPConfig (629-815)
  • StorageType (53-55)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_container_auth_options (98-172)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

25-25: Wiring imports and validators for retention/gc component look consistent.

Importing GARBAGE_COLLECTOR_NAME and the new validation helpers aligns with the PR scope and is consistent with patterns used for other components.

Also applies to: 58-58, 65-67


1180-1180: LGTM: CLI, credentials validation, storage validation, and startup wiring for GC are correct.

  • Adds subparser for GARBAGE_COLLECTOR_NAME.
  • Includes DB credentials validation for GC.
  • Validates output storage (needed for FS/S3 deletion).
  • Wires start_garbage_collector into the startup sequence.

Also applies to: 1205-1206, 1240-1243, 1311-1312

Member

@LinZhihao-723 LinZhihao-723 left a comment

Last-min comment on the docstring.

Comment on lines +43 to +70
def _collect_and_sweep_expired_search_results(
    result_cache_config: ResultsCache, results_metadata_collection_name: str
):
    expiry_epoch = get_expiry_epoch_secs(result_cache_config.retention_period)

    logger.debug(f"Searching for search jobs finished before {expiry_epoch}.")
    deleted_job_ids: List[int] = []
    with pymongo.MongoClient(result_cache_config.get_uri()) as results_cache_client:
        results_cache_db = results_cache_client.get_default_database()
        collection_names = results_cache_db.list_collection_names()
        for job_id in collection_names:
            if not job_id.isdigit():
                continue

            job_results_collection = results_cache_db.get_collection(job_id)
            collection_timestamp = _get_latest_doc_timestamp(job_results_collection)
            if collection_timestamp >= expiry_epoch:
                continue

            _delete_result_metadata(results_cache_db, results_metadata_collection_name, job_id)
            job_results_collection.drop()
            deleted_job_ids.append(int(job_id))

    if len(deleted_job_ids) != 0:
        logger.debug(f"Deleted search results of job(s): {deleted_job_ids}.")
    else:
        logger.debug(f"No search results matched the expiry criteria.")

Member

We forward the errors to the top-level catch, as DB exceptions are considered non-recoverable in our use-case. We will exit on such errors.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
components/job-orchestration/job_orchestration/garbage_collector/utils.py (3)

21-28: Logger configuration is missing critical error handling.

The function doesn't validate if the log directory exists or handle file operation exceptions, which could cause the garbage collector to fail unexpectedly during initialization.
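A hardened setup along those lines might look like this; the function name and checks are illustrative, not CLP's actual helper:

```python
import logging
from pathlib import Path


def configure_file_logger(logger: logging.Logger, log_file: Path, level: str) -> None:
    """Attach a file handler, failing fast instead of crashing later mid-sweep."""
    if not log_file.parent.is_dir():
        raise ValueError(f"Log directory does not exist: {log_file.parent}")
    handler = logging.FileHandler(log_file, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level.upper())
```

Validating the directory up front turns a late, confusing file-open failure into an immediate, descriptive error at startup.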


114-116: Add explicit encoding to file operations.

The recovery file is opened without specifying encoding, which can lead to inconsistent behaviour across platforms.

Apply this diff to specify UTF-8 encoding:

-        with open(self._recovery_file_path, "r") as f:
+        with open(self._recovery_file_path, "r", encoding="utf-8") as f:

135-137: Add explicit encoding to file write operations.

The recovery file is opened for appending without specifying encoding, which can lead to inconsistencies across environments.

Apply this diff to specify UTF-8 encoding:

-        with open(self._recovery_file_path, "a") as f:
+        with open(self._recovery_file_path, "a", encoding="utf-8") as f:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34a06a5 and 7633a6c.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/job-orchestration/job_orchestration/garbage_collector/utils.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
  • ArchiveOutput (500-554)
  • FsStorage (405-421)
  • StorageEngine (48-50)
  • StorageType (53-55)
components/clp-py-utils/clp_py_utils/clp_logging.py (2)
  • get_logging_formatter (13-14)
  • set_logging_level (36-41)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • s3_delete_objects (315-352)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (9)
components/job-orchestration/job_orchestration/garbage_collector/utils.py (9)

31-39: LGTM! Well-designed validation logic.

The storage type validation correctly enforces that S3 storage is only supported with CLP_S engine, and properly validates supported storage types.


42-51: LGTM! Clear and correct expiry calculation.

The function correctly calculates the expiry cutoff based on current time and retention period, with good documentation explaining the logic.
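The cutoff computation is essentially the following sketch, assuming the retention period is expressed in minutes (the function and parameter names here are illustrative):

```python
import time


def get_expiry_epoch_secs(retention_period_mins: int) -> int:
    # Anything whose latest timestamp falls before this epoch (in seconds)
    # has outlived the retention period and is a deletion candidate.
    return int(time.time()) - retention_period_mins * 60
```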


54-55: Good fix for deprecated datetime method.

The use of datetime.fromtimestamp(expiry_epoch_secs, tz=timezone.utc) correctly replaces the deprecated utcfromtimestamp method.
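The fix matters because `datetime.utcfromtimestamp` is deprecated since Python 3.12 and returns a naive datetime; the replacement yields a timezone-aware one:

```python
from datetime import datetime, timezone

expiry_epoch_secs = 1_755_000_000  # illustrative epoch value

# Old (deprecated, naive): datetime.utcfromtimestamp(expiry_epoch_secs)
# New (timezone-aware UTC):
expiry_dt = datetime.fromtimestamp(expiry_epoch_secs, tz=timezone.utc)
```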


77-87: LGTM! Clean abstraction for different storage backends.

The function correctly delegates to the appropriate deletion method based on storage type, with proper error handling for unsupported types.


90-101: Well-designed fault-tolerant buffer class.

The class provides excellent fault tolerance for deletion candidates with clear recovery semantics and proper validation.


103-117: Recovery logic is robust and well-implemented.

The initialization properly handles recovery file existence, validates file types, and loads previous candidates correctly.


118-121: Efficient duplicate prevention logic.

The method correctly prevents duplicate candidates while maintaining the persistence queue for new additions.


126-139: Atomic persistence with proper cleanup.

The method correctly appends new candidates to the recovery file and clears the persistence buffer, maintaining consistency.


141-150: Clean resource cleanup implementation.

The clear method properly removes both in-memory state and the recovery file, ensuring no orphaned recovery data remains.
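Taken together, the behaviours reviewed above (recovery on init, duplicate prevention, append-only persistence, full cleanup) can be sketched as a minimal buffer. The class and method names are illustrative, not the actual `TargetBuffer` API:

```python
from pathlib import Path
from typing import List, Set


class DeletionBuffer:
    """Minimal sketch of a fault-tolerant deletion-candidate buffer."""

    def __init__(self, recovery_file_path: Path) -> None:
        self._recovery_file_path = recovery_file_path
        self._candidates: Set[str] = set()
        self._pending: List[str] = []
        if recovery_file_path.exists():
            if not recovery_file_path.is_file():
                raise ValueError(f"{recovery_file_path} exists but is not a file.")
            # Recover candidates persisted by a previous, interrupted run.
            with open(recovery_file_path, "r", encoding="utf-8") as f:
                self._candidates = {line.strip() for line in f if line.strip()}

    def add(self, candidate: str) -> None:
        # Skip duplicates; only genuinely new candidates await persistence.
        if candidate not in self._candidates:
            self._candidates.add(candidate)
            self._pending.append(candidate)

    def persist(self) -> None:
        # Append new candidates to the recovery file before any deletion runs.
        with open(self._recovery_file_path, "a", encoding="utf-8") as f:
            for candidate in self._pending:
                f.write(candidate + "\n")
        self._pending.clear()

    def clear(self) -> None:
        # All candidates handled: drop in-memory state and the recovery file.
        self._candidates.clear()
        self._pending.clear()
        self._recovery_file_path.unlink(missing_ok=True)
```

If the process dies between `persist()` and `clear()`, the next run reconstructs the candidate set from the recovery file and can safely retry the deletions.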

Comment on lines +58 to +74
def execute_fs_deletion(fs_storage_config: FsStorage, candidate: str) -> None:
    """
    Deletes a candidate (either a directory or a file) from the filesystem storage. The full path
    of the candidate is constructed as `fs_storage_config.directory / candidate`. The function
    performs no action if the candidate does not exist.

    :param fs_storage_config:
    :param candidate: Relative path of the file or directory to delete.
    """
    path_to_delete = fs_storage_config.directory / candidate
    if not path_to_delete.exists():
        return

    if path_to_delete.is_dir():
        shutil.rmtree(path_to_delete)
    else:
        os.remove(path_to_delete)
Contributor

🧹 Nitpick (assertive)

Consider adding debug logging for deletion operations.

While the filesystem deletion logic is correct, adding debug-level logging would help with troubleshooting retention issues in production.

Consider adding debug logging like this:

 def execute_fs_deletion(fs_storage_config: FsStorage, candidate: str) -> None:
     """
     Deletes a candidate (either a directory or a file) from the filesystem storage. The full path
     of the candidate is constructed as `fs_storage_config.directory / candidate`. The function
     performs no action if the candidate does not exist.

     :param fs_storage_config:
     :param candidate: Relative path of the file or directory to delete.
     """
     path_to_delete = fs_storage_config.directory / candidate
     if not path_to_delete.exists():
+        logging.debug(f"Deletion candidate does not exist: {path_to_delete}")
         return

+    logging.debug(f"Deleting {'directory' if path_to_delete.is_dir() else 'file'}: {path_to_delete}")
     if path_to_delete.is_dir():
         shutil.rmtree(path_to_delete)
     else:
         os.remove(path_to_delete)
🤖 Prompt for AI Agents
In components/job-orchestration/job_orchestration/garbage_collector/utils.py
around lines 58-74, add debug-level logging around the filesystem deletion flow:
log the resolved path_to_delete and the candidate at the start, log a debug
message when the path does not exist (and you return), log a debug message just
before deleting (indicating whether you will rmtree or remove), and log a debug
confirmation after successful deletion; use the module logger (e.g.,
logging.getLogger(__name__)) and ensure any exceptions raised during deletion
are logged at debug/error level with the path info before re-raising or
handling.

LinZhihao-723 previously approved these changes Aug 13, 2025
Member

@LinZhihao-723 LinZhihao-723 left a comment

For the PR title, how about:

feat(package): Add garbage collectors to delete archives and search results based on a configurable retention policy.

@haiqi96 haiqi96 changed the title feat(package): Add garbage collectors to support retention for archives and search results. feat(package): Add garbage collectors to delete archives and search results based on a configurable retention policy. Aug 13, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (2)

380-380: Use lazy formatting in logging.

-            logger.warning(f"Archive {archive_id} is not a directory. Skipping deletion.")
+            logger.warning("Archive %s is not a directory. Skipping deletion.", archive_id)

383-383: Use lazy formatting in logging.

-        logger.info(f"Deleting archive {archive_id} from disk.")
+        logger.info("Deleting archive %s from disk.", archive_id)
♻️ Duplicate comments (7)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (1)

297-297: Use lazy formatting in logging.

-                    logger.warning(f"Archive {archive_id} in database not found on disk.")
+                    logger.warning("Archive %s in database not found on disk.", archive_id)
components/clp-py-utils/clp_py_utils/clp_config.py (2)

673-687: Add docstrings to configuration classes.

The new classes lack documentation explaining their purpose and fields.

Add docstrings to improve code documentation:

 class SweepInterval(BaseModel):
+    """
+    Configuration for garbage collector sweep intervals.
+    
+    Attributes:
+        archive: Sweep interval in minutes for archive garbage collection (default: 60).
+        search_result: Sweep interval in minutes for search result garbage collection (default: 30).
+    """
     archive: int = 60
     search_result: int = 30

689-697: Add docstring to GarbageCollector class.

Add docstring to improve code documentation:

 class GarbageCollector(BaseModel):
+    """
+    Configuration for the garbage collector component.
+    
+    Attributes:
+        logging_level: Logging level for garbage collection operations (default: "INFO").
+        sweep_interval: Sweep interval configuration for different resource types.
+    """
     logging_level: str = "INFO"
     sweep_interval: SweepInterval = SweepInterval()
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)

1064-1069: Add function docstring.

Please add a docstring describing the function's purpose and parameters.

 def start_garbage_collector(
     instance_id: str,
     clp_config: CLPConfig,
     container_clp_config: CLPConfig,
     mounts: CLPDockerMounts,
 ):
+    """
+    Starts the garbage collector component for retention management.
+    
+    :param instance_id: Unique identifier for this CLP instance.
+    :param clp_config: The host CLP configuration.
+    :param container_clp_config: The container-specific CLP configuration.
+    :param mounts: Docker mount configurations for the container.
+    """
     component_name = GARBAGE_COLLECTOR_NAME

1084-1085: Specify encoding when opening file.

Always specify encoding to avoid platform-dependent behaviour.

-    with open(container_config_file_path, "w") as f:
+    with open(container_config_file_path, "w", encoding="utf-8") as f:

1118-1123: Ensure FS output directories exist before bind-mounting.

Docker bind mounts require host paths to exist. Unlike workers (which mkdir the outputs), this function doesn't create them. If they don't exist, docker run will fail.

Apply this diff:

     # Add necessary mounts for archives and streams.
     if StorageType.FS == clp_config.archive_output.storage.type:
+        # Ensure the directory exists before bind-mounting
+        clp_config.archive_output.get_directory().mkdir(parents=True, exist_ok=True)
         necessary_mounts.append(mounts.archives_output_dir)
     if StorageType.FS == clp_config.stream_output.storage.type:
+        # Ensure the directory exists before bind-mounting
+        clp_config.stream_output.get_directory().mkdir(parents=True, exist_ok=True)
         necessary_mounts.append(mounts.stream_output_dir)

1143-1143: Use lazy formatting in logging statement.

-    logger.info(f"Started {component_name}.")
+    logger.info("Started %s.", component_name)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7633a6c and 8ca206e.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (7 hunks)
  • components/package-template/src/etc/clp-config.yml (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (3)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (2)
  • delete_archives_from_metadata_db (222-261)
  • get_archives_table_name (268-269)
components/clp-py-utils/clp_py_utils/sql_adapter.py (1)
  • SQL_Adapter (60-131)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • get_clp_connection_params_and_type (209-230)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
components/clp-package-utils/clp_package_utils/general.py (5)
  • is_retention_period_configured (585-592)
  • validate_log_directory (185-189)
  • validate_logs_input_config (498-499)
  • validate_output_storage_config (502-507)
  • CLPDockerMounts (86-96)
components/clp-py-utils/clp_py_utils/clp_config.py (10)
  • validate_logs_input_config (736-750)
  • CLPConfig (699-891)
  • dump_to_primitive_dict (471-472)
  • dump_to_primitive_dict (488-491)
  • dump_to_primitive_dict (520-523)
  • dump_to_primitive_dict (621-624)
  • dump_to_primitive_dict (643-646)
  • dump_to_primitive_dict (878-891)
  • dump_to_primitive_dict (903-911)
  • StorageType (90-92)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_container_auth_options (98-172)
🔇 Additional comments (21)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (3)

327-331: LGTM! The initialization placement is appropriate.

Moving these initializations outside the try block is fine since they're not expected to raise exceptions during normal execution. The actual database operations that could fail remain properly wrapped in the try-catch block.


343-349: Good refactoring to use SELECT instead of DELETE with RETURNING.

The change from DELETE ... RETURNING id to SELECT id followed by a separate delete operation provides better separation of concerns and allows for validation before deletion.
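The two-step pattern can be illustrated with an in-memory SQLite table; the schema and names are invented for the example, not CLP's metadata schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE archives (id TEXT PRIMARY KEY, end_timestamp INTEGER)")
cur.executemany(
    "INSERT INTO archives VALUES (?, ?)",
    [("archive-1", 100), ("archive-2", 200), ("archive-3", 300)],
)

expiry_epoch = 250
# Step 1: SELECT the candidate IDs so they can be validated (e.g., checked
# on disk) before anything is removed.
cur.execute("SELECT id FROM archives WHERE end_timestamp < ?", (expiry_epoch,))
candidate_ids = [row[0] for row in cur.fetchall()]

# Step 2: delete only the validated IDs in a separate statement.
cur.executemany("DELETE FROM archives WHERE id = ?", [(i,) for i in candidate_ids])
conn.commit()
```

Splitting the read from the write also avoids relying on `DELETE ... RETURNING`, which not all database engines support.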


359-359: Excellent consolidation of deletion logic.

Using the centralized delete_archives_from_metadata_db function simplifies the code and ensures proper deletion order (files → archive_tags → archives) to respect foreign key constraints.

components/package-template/src/etc/clp-config.yml (3)

56-57: LGTM!

The retention period configuration for search results is well-structured with proper documentation.


78-79: LGTM!

The retention period configuration for archives is consistent with the search results configuration and well-documented.


109-117: LGTM!

The garbage collector configuration is well-structured with appropriate logging level and sweep interval settings for archives and search results.

components/clp-py-utils/clp_py_utils/clp_config.py (6)

29-29: LGTM!

The constant follows the existing naming convention for component names.


358-358: LGTM!

The optional retention period field is properly typed.


380-385: LGTM!

The validator correctly ensures retention period is positive when set.


577-577: LGTM!

The optional retention period field for archives is consistent with the results cache implementation.


609-614: LGTM!

The validator for archive retention period is consistent with the results cache validator.


715-715: LGTM!

The garbage collector field is properly integrated into CLPConfig with a sensible default.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (9)

25-25: LGTM!

The import is correctly added to support the garbage collector component.


59-59: LGTM!

The retention period check function is properly imported.


65-67: LGTM!

The validation function imports are properly added to support the garbage collector.


1072-1074: LGTM!

Early return when retention is not configured prevents unnecessary container creation.


1182-1182: LGTM!

The garbage collector parser is correctly added to the component args parser.


1216-1216: LGTM!

The garbage collector is correctly included in the database credentials validation targets.


1251-1253: LGTM!

The garbage collector is correctly included in output storage validation.


1333-1334: LGTM!

The garbage collector is properly integrated into the startup flow, being started for both ALL_TARGET_NAME and its specific target.
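
The dispatch rule here reduces to a simple membership test (the constant values below are placeholders mirroring the names in the review, not the real ones):

```python
# Hypothetical reduction of the startup dispatch: a component starts when
# the requested target is either the "all" target or the component itself.
# Constant values are placeholders, not the actual CLP definitions.
ALL_TARGET_NAME = "all"
GARBAGE_COLLECTOR_NAME = "garbage_collector"

def should_start(component_name: str, target: str) -> bool:
    return target in (ALL_TARGET_NAME, component_name)
```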


1242-1246: Query workers do not require logs_input validation — validate_logs_input_config is only invoked for the compression worker

start_clp.py only calls validate_logs_input_config when the target is ALL or COMPRESSION_WORKER_COMPONENT_NAME (start_clp.py:1246). start_query_worker (defined at ~line 680 in start_clp.py) performs no logs_input validation; query code operates on archives/archives-config instead.

Files of interest:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py — validate_logs_input_config(...) at ~line 1246; start_query_worker defined ~line 680.
  • components/clp-py-utils/clp_py_utils/clp_config.py — logs_input definition and validate_logs_input_config() implementation.
  • components/clp-package-utils/clp_package_utils/general.py — wrapper/mounting logic and validate_logs_input_config() (around lines ~240 and ~498).
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py — calls clp_config.validate_logs_input_config().

Action: Ignore the original concern — the code correctly validates logs_input only for compression-related components.

Likely an incorrect or invalid review comment.

Comment on lines 360 to 362
for archive_id in archive_ids:
    logger.info(f"Deleted archive {archive_id} from the database.")

@coderabbitai coderabbitai bot Aug 13, 2025

🧹 Nitpick (assertive)

Consider batching the deletion logging for better performance.

When deleting many archives, logging each deletion individually could impact performance. Consider logging a summary instead.

-            for archive_id in archive_ids:
-                logger.info(f"Deleted archive {archive_id} from the database.")
+            logger.info(f"Deleted {len(archive_ids)} archives from the database.")
+            logger.debug(f"Deleted archive IDs: {', '.join(archive_ids)}")
🤖 Prompt for AI Agents
In
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
around lines 360 to 362, the code logs each deleted archive inside the loop
which can degrade performance for large batches; change this to log a single
summary after the deletions complete — collect the deleted archive_ids during
the loop (or count them), then call logger.info once with a concise summary
(e.g., total deleted, and optionally a short sample or first/last N ids) rather
than one log per id.

Contributor Author

This is unrelated to this PR. will skip.

Contributor

@haiqi96 Understood! You're absolutely right that logging performance optimization is outside the scope of this retention and garbage collection PR. I'll skip this suggestion to keep the focus on the core functionality being added.


🧠 Learnings used
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (1)

376-388: Harden disk deletion to avoid aborting after DB commit

If rmtree raises (permissions, transient IO), you’ll return success after DB commit with archives still on disk. Catch and log errors per archive and propagate a non-zero exit if any deletions fail.

Apply this diff:

-    for archive_id in archive_ids:
-        archive_path = archive_output_dir / archive_id
-        if not archive_path.is_dir():
-            logger.warning(f"Archive {archive_id} is not a directory. Skipping deletion.")
-            continue
-
-        logger.info(f"Deleting archive {archive_id} from disk.")
-        shutil.rmtree(archive_path)
-
-    logger.info(f"Finished deleting archives from disk.")
-
-    return 0
+    deletion_failed = False
+    for archive_id in archive_ids:
+        archive_path = archive_output_dir / archive_id
+        if not archive_path.is_dir():
+            logger.warning("Archive %s is not a directory. Skipping deletion.", archive_id)
+            continue
+
+        logger.info("Deleting archive %s from disk.", archive_id)
+        try:
+            shutil.rmtree(archive_path)
+        except FileNotFoundError:
+            logger.warning("Archive %s already removed from disk.", archive_id)
+        except Exception as ex:
+            deletion_failed = True
+            logger.error("Failed to delete archive %s from disk: %s", archive_id, ex)
+
+    logger.info("Finished deleting archives from disk.")
+
+    return -1 if deletion_failed else 0
components/clp-py-utils/clp_py_utils/clp_config.py (1)

708-708: Fix invalid type annotation for reducer field

Current annotation uses a call expression in the type position, which is invalid Python and will break imports.

Apply this diff:

-    reducer: Reducer() = Reducer()
+    reducer: Reducer = Reducer()
♻️ Duplicate comments (5)
components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (1)

327-333: Acknowledging choice to keep adapter initialisation outside try

Given the earlier discussion, keeping SQL_Adapter creation outside the try is acceptable for your context.

components/clp-py-utils/clp_py_utils/clp_config.py (2)

673-687: Add a short docstring to SweepInterval for clarity

Helps users and tooling understand units and defaults at a glance.

Example:

"""
Sweep intervals (in minutes) for garbage-collector jobs.

Attributes:
archive: Interval for archive sweeps (default: 60).
search_result: Interval for search result sweeps (default: 30).
"""


689-697: Add a brief docstring to GarbageCollector

Documenting the purpose and fields improves maintainability.

Example:

"""
Configuration for the garbage-collector component.

Attributes:
logging_level: Logging level for the component (default: "INFO").
sweep_interval: Per-task sweep intervals in minutes.
"""

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

1064-1144: Ensure FS output dirs exist before bind-mounting (prevents docker run failure)

Docker bind mounts require host paths to exist. Create them if using FS storage so the container start doesn’t fail.

Apply this diff:

     # Add necessary mounts for archives and streams.
     if StorageType.FS == clp_config.archive_output.storage.type:
+        clp_config.archive_output.get_directory().mkdir(parents=True, exist_ok=True)
         necessary_mounts.append(mounts.archives_output_dir)
     if StorageType.FS == clp_config.stream_output.storage.type:
+        clp_config.stream_output.get_directory().mkdir(parents=True, exist_ok=True)
         necessary_mounts.append(mounts.stream_output_dir)

1084-1086: Specify file encoding when writing the container config

Prevents platform-dependent defaults and encoding surprises.

Apply this diff:

-    with open(container_config_file_path, "w") as f:
+    with open(container_config_file_path, "w", encoding="utf-8") as f:
         yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7633a6c and 8ca206e.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (7 hunks)
  • components/package-template/src/etc/clp-config.yml (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
🔇 Additional comments (16)
components/package-template/src/etc/clp-config.yml (3)

56-58: Clear retention toggle for search results looks good

Comment and default align with intended behaviour (“null” disables automatic deletion).


78-80: Archive retention config reads well

Wording is precise; default keeps deletion off by default. Matches the model fields.


109-117: Garbage collector section and naming are consistent

Using “garbage_collector” and “sweep_interval” (singular keys) aligns with the updated API and avoids ambiguity.

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py (3)

12-14: Centralising metadata deletes via helper is a good move

Importing and using delete_archives_from_metadata_db reduces duplication and keeps DB schema logic in one place.


345-349: Selecting IDs before deletion improves clarity and dry-run support

Fetching IDs with a SELECT prior to deletion plays nicely with dry-run and downstream validation.


359-362: Good use of delete_archives_from_metadata_db

This consolidates table-specific deletion logic and reduces the chance of partial metadata cleanup.

components/clp-py-utils/clp_py_utils/clp_config.py (4)

29-29: New component name constant is consistent

GARBAGE_COLLECTOR_NAME aligns with CLI wiring and container naming conventions.


358-385: results_cache.retention_period and validator are correct

Optional minutes with a >0 constraint if set is clear and matches the YAML guidance.


577-577: archive_output.retention_period and validator are correct

Same validation semantics as results cache; consistent and minimal.

Also applies to: 609-614


715-715: Wiring garbage_collector into CLPConfig is appropriate

Defaulting to GarbageCollector() keeps the config self-contained without requiring YAML entries.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)

25-25: Importing GARBAGE_COLLECTOR_NAME keeps naming consistent across layers

Matches the config model and CLI additions.


59-68: Good: using existing validation helpers

validate_log_directory, validate_logs_input_config, and validate_output_storage_config are applied in the right places for GC wiring.


1182-1183: CLI target added for the garbage collector

This enables starting the GC independently; good addition.


1216-1221: DB credentials loading includes the GC—correct

The GC needs DB access for archive metadata and job orchestration; inclusion here is necessary.


1243-1254: Validation coverage looks right

  • logs_input validated only where needed (compression worker paths)
  • output storage validated for workers and GC

1333-1335: GC startup wired into main flow appropriately

Starts on ALL or explicit target; respects the configured retention toggle in the function.

Comment on lines +1064 to +1070
def start_garbage_collector(
    instance_id: str,
    clp_config: CLPConfig,
    container_clp_config: CLPConfig,
    mounts: CLPDockerMounts,
):
    component_name = GARBAGE_COLLECTOR_NAME
Contributor

🧹 Nitpick (assertive)

Add a concise docstring to start_garbage_collector

A brief description of purpose and parameters will help future maintainers.

Example:

"""
Start the garbage-collector container if any retention period is configured.

Args:
instance_id: Short instance identifier used in container names.
clp_config: Host-side CLPConfig with absolute host paths/credentials.
container_clp_config: Container-side CLPConfig with container paths.
mounts: Precomputed Docker mounts for CLP directories and optional AWS config.
"""

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/start_clp.py around
lines 1064 to 1070, add a concise docstring to the start_garbage_collector
function describing its purpose and parameters; include a short one-line summary
(what the function does) and parameter entries for instance_id, clp_config,
container_clp_config, and mounts that mention expected types/semantics (e.g.,
host vs container config, instance identifier, precomputed Docker mounts),
following the provided example format and placement immediately under the def
line.
