
fix(clp-package): Add dataset to metadata database after input paths are processed for compression jobs (fixes #2091).#2092

Open

quinntaylormitchell wants to merge 1 commit into y-scope:main from quinntaylormitchell:compression-dataset-addition
Conversation


@quinntaylormitchell quinntaylormitchell commented Mar 12, 2026

Description

This PR addresses issue #2091 by calling _ensure_dataset_exists() after input paths are processed for a compression job, not before.

Note: My only concern with this implementation is that it only protects against path-processing failures; datasets will still be added to the metadata database if the compression job fails in the core. I also tested calling _ensure_dataset_exists() from _complete_compression_job() instead. While that did strictly fix the issue (failed jobs no longer added their datasets to the metadata database), it broke all compression, because a compression job needs its dataset to exist in the metadata database before the job starts.
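The ordering described above can be sketched as follows. This is a minimal illustration of the fix's control flow, not the actual code in compression_scheduler.py: the names schedule_compression_job, metadata_db, and the "/data" path-validity rule are all hypothetical stand-ins.

```python
def schedule_compression_job(input_paths, metadata_db, dataset="my_dataset",
                             storage_engine="clp-s"):
    """Sketch: process input paths first; register the dataset only on success."""
    # Hypothetical path processing: treat paths under /data as valid.
    valid_paths = [p for p in input_paths if p.startswith("/data")]
    if not valid_paths:
        # Path processing failed, so no dataset row is ever written --
        # this is the behavior issue #2091 asks for.
        raise ValueError("no valid input paths; dataset not registered")
    if storage_engine == "clp-s":
        # The dataset is added only after path processing succeeds (the fix),
        # rather than at the start of scheduling.
        metadata_db.setdefault("datasets", set()).add(dataset)
    return valid_paths
```

Note that, as the paragraph above explains, a failure later in the core would still leave the dataset registered; this sketch only captures the path-processing case.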

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ran the replication steps described in issue #2091; datasets are only added to the metadata database if the paths in the compression command are valid.

Summary by CodeRabbit

  • Refactor
    • Modified the timing of dataset validation in the compression scheduler to occur after input processing instead of at the start, with the validation now conditional on the CLP_S storage engine configuration.

@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner March 12, 2026 16:48

coderabbitai bot commented Mar 12, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 5798e0e and cef42bb.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

Walkthrough

The change defers dataset existence validation in search_and_schedule_new_tasks from the method's start to after input path processing, and restricts the check to execute only when using StorageEngine.CLP_S. The validation now retrieves the dataset from the input configuration instead of pre-computed existing datasets.
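The storage-engine gate mentioned above can be sketched like this. The StorageEngine enum values and the maybe_ensure_dataset helper are illustrative assumptions, not the project's actual API; the point is only that the existence check runs solely for CLP_S.

```python
from enum import Enum

class StorageEngine(Enum):
    # Hypothetical enum mirroring the two storage engines the scheduler
    # distinguishes between; actual definitions live elsewhere in the package.
    CLP = "clp"
    CLP_S = "clp_s"

def maybe_ensure_dataset(engine, dataset, ensure_fn):
    """Run the dataset-existence check only for the clp-s storage engine."""
    if engine is StorageEngine.CLP_S:
        # Only clp-s organizes archives by dataset, so only it needs the
        # metadata-database row to exist.
        ensure_fn(dataset)
        return True
    return False
```

A caller would pass the real _ensure_dataset_exists() (or equivalent) as ensure_fn; for plain CLP the call is skipped entirely.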

Changes

  • Compression Scheduler Logic (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py): Deferred the dataset existence check from method entry to the post-input-processing block; the check is now gated on StorageEngine.CLP_S and uses the dataset value from the input configuration instead of pre-fetched datasets.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)
  • Description Check — Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title accurately describes the main change (moving dataset addition to the metadata database to occur after input path processing for compression jobs), which directly addresses issue #2091.
  • Docstring Coverage — Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.



