
[OPIK-4842] [BE] Refactor retention jobs to Quartz, extract estimation#5898

Merged
ldaugusto merged 13 commits into main from
daniela/opik-4842-retention-quartz-refactor
Mar 27, 2026

Conversation

@ldaugusto
Contributor

@ldaugusto ldaugusto commented Mar 26, 2026

Details

Refactors the retention job infrastructure from the Managed+Flux.interval pattern to proper Quartz jobs, and extracts velocity estimation from the HTTP request handler into a background job.

Three independent Quartz jobs:

  • RetentionSlidingWindowJob — regular sliding-window cycle, every 30min (configurable via executionsPerDay), fraction-based workspace sharding
  • RetentionEstimationJob — velocity estimation for newly created rules, every 5min (configurable via catchUp.estimationIntervalMinutes). Fixes the blocking HTTP thread concern raised in the review of PR #5820 ([OPIK-4842] [BE] Catch-up job for apply-to-past retention rules)
  • RetentionCatchUpJob — progressive historical deletion, every 45min (configurable via catchUp.intervalMinutes)
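The fraction-based workspace sharding used by the sliding-window job can be sketched in plain Java. This is a hypothetical illustration, not the actual Opik implementation: the idea is that with `executionsPerDay` runs per day, each run processes only the workspaces whose hash falls into the current fraction, so every workspace is visited exactly once per day.

```java
import java.util.List;
import java.util.UUID;

// Sketch of fraction-based workspace sharding (hypothetical names).
// With executionsPerDay runs per day, each run handles the slice of
// workspaces whose hash bucket equals the current fraction, so every
// workspace is visited exactly once per day.
public class ShardingSketch {

    static boolean belongsToFraction(UUID workspaceId, int fraction, int executionsPerDay) {
        // Math.floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(workspaceId.hashCode(), executionsPerDay);
        return bucket == fraction;
    }

    static List<UUID> shard(List<UUID> workspaces, int fraction, int executionsPerDay) {
        return workspaces.stream()
                .filter(id -> belongsToFraction(id, fraction, executionsPerDay))
                .toList();
    }

    public static void main(String[] args) {
        List<UUID> all = List.of(UUID.randomUUID(), UUID.randomUUID(), UUID.randomUUID());
        int executionsPerDay = 48; // one run every 30 minutes
        // Across all 48 fractions, each workspace appears in exactly one shard.
        long total = 0;
        for (int f = 0; f < executionsPerDay; f++) {
            total += shard(all, f, executionsPerDay).size();
        }
        System.out.println(total == all.size()); // prints "true"
    }
}
```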

Key changes:

  • Each job has its own distributed Redis lock — catch-up/estimation never block regular retention
  • Lock TTL equals the scheduling interval for each job (30min for sliding window, 5min for estimation, 45min for catch-up). Combined with holdUntilExpiry=true, this prevents redundant runs across multiple instances
  • @DisallowConcurrentExecution + InterruptableJob for Quartz-level safety; Redis lock is the primary distributed guard
  • doJob() uses .block() with try/catch for proper error handling and graceful shutdown via AtomicBoolean interrupted
  • Rule creation no longer calls ClickHouse — saves with velocity=null, cursor=null, catchUpDone=false. The estimation job picks it up within 5 minutes
  • Follows established codebase patterns (TraceThreadsClosingJob, MetricsAlertJob)
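The interaction between lock TTL and holdUntilExpiry described above can be sketched with an in-memory stand-in for the Redis-backed bestEffortLock (hypothetical API; the real lock lives in Redis so it is shared across instances). Because the lock is deliberately never released early, setting the TTL equal to the scheduling interval means at most one run happens per interval, even when N instances fire the same trigger:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of a best-effort lock with holdUntilExpiry semantics
// (hypothetical names). The lock is NOT released when the work finishes;
// it stays held until the TTL expires, so TTL == scheduling interval
// yields exactly one execution per interval across instances.
public class BestEffortLockSketch {

    private final Map<String, Instant> locks = new HashMap<>();

    /** Runs the work only if the lock is free; holds the lock until the TTL expires. */
    synchronized boolean runIfLockFree(String lockName, Duration ttl, Runnable work) {
        Instant now = Instant.now();
        Instant expiry = locks.get(lockName);
        if (expiry != null && expiry.isAfter(now)) {
            return false; // another instance holds the lock -> skip this run
        }
        locks.put(lockName, now.plus(ttl)); // acquire and hold for the full TTL
        work.run();
        return true;
    }

    public static void main(String[] args) {
        BestEffortLockSketch lock = new BestEffortLockSketch();
        int[] runs = {0};
        // Two "instances" fire within the same 30-minute interval:
        boolean first = lock.runIfLockFree("retention_run", Duration.ofMinutes(30), () -> runs[0]++);
        boolean second = lock.runIfLockFree("retention_run", Duration.ofMinutes(30), () -> runs[0]++);
        System.out.println(first + " " + second + " " + runs[0]); // prints "true false 1"
    }
}
```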

Observability (OpenTelemetry metrics):

  • Job-level: runCounter (success/skipped_lock/error) and runDuration histogram on all 3 jobs
  • Sliding window: workspacesProcessed counter, rowsToDelete counter (by table)
  • Catch-up: rulesProcessed (by tier: small/medium/large), rulesCompleted, rowsToDelete (by table)
  • Estimation: rulesEstimated counter, velocityValues histogram
  • Pre-delete counts: Lightweight SELECT count() before each delete batch for observability — upper-bound ceiling with >99% precision (excludes expensive experiment_items exclusion subquery to avoid join cost)
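The counter-with-result-tag shape used above can be illustrated with a minimal in-memory registry (hypothetical sketch; the real jobs use OpenTelemetry instruments). A counter like opik.retention.&lt;job&gt;.run is keyed by its result tag:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal in-memory sketch of a tagged counter (hypothetical registry;
// the actual implementation uses OpenTelemetry). Each (metric, tag)
// combination gets its own monotonically increasing counter.
public class TaggedCounterSketch {

    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    static void increment(String metric, String tagKey, String tagValue) {
        String key = metric + "{" + tagKey + "=" + tagValue + "}";
        COUNTERS.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public static void main(String[] args) {
        increment("opik.retention.sliding_window.run", "result", "success");
        increment("opik.retention.sliding_window.run", "result", "success");
        increment("opik.retention.sliding_window.run", "result", "skipped_lock");
        System.out.println(
                COUNTERS.get("opik.retention.sliding_window.run{result=success}").sum()); // prints "2"
    }
}
```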

Review feedback addressed:

  • Consolidated setRetentionJobs() in lifecycle listener
  • findUnestimatedCatchUpRules limited to 10 per tick
  • isTooManyRowsException moved to RetentionUtils
  • getCatchUpInterval() renamed to avoid shadowing parent getInterval()
  • Expanded catch-up index to cover all equality + ORDER BY columns
  • Migration bumped from 000061 to 000062 (collision with workspace rule project index)

Depends on: PR #5820 (catch-up job) — already merged to main.

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-4797
  • OPIK-4842

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4
    • Scope: Implementation with human-driven architecture and design decisions
    • Human verification: Pair-programmed throughout — human led architecture, concurrency strategy, tier design, and all key technical decisions

Testing

Commands run:

cd apps/opik-backend
mvn test -Dtest="RetentionPolicyServiceTest"
mvn test -Dtest="RetentionEstimationServiceTest,RetentionUtilsTest"

Results: 20 tests, 0 failures, 0 errors

Scenarios validated:

  • All existing retention tests pass unchanged
  • Catch-up integration test updated: estimation now runs as separate step before catch-up

Documentation

Retention documentation will come in a future task.

@github-actions github-actions bot added the java, Backend, and tests labels Mar 26, 2026
@github-actions
Contributor

github-actions bot commented Mar 26, 2026

Backend Tests - Integration Group 8

291 tests ±0   289 ✅ ±0   2 💤 ±0   0 ❌ ±0
 29 suites ±0   29 files ±0   9m 56s ⏱️ +15s

Results for commit e4d4859. ± Comparison against base commit eee4994.

♻️ This comment has been updated with latest results.

Base automatically changed from daniela/opik-4842-retention-catchup-job to main March 26, 2026 17:58

Split the single RetentionPolicyJob (Managed pattern) into two
independent Quartz jobs:

- RetentionSlidingWindowJob: regular sliding-window cycle, runs every
  (24*60)/executionsPerDay minutes (default 30min), fraction-based
  workspace sharding
- RetentionCatchUpJob: progressive historical deletion, runs on its
  own schedule (default 60min), separate distributed lock

Both follow the established Quartz pattern (Job + InterruptableJob,
@DisallowConcurrentExecution, bestEffortLock) and are registered via
OpikGuiceyLifecycleEventListener.

Benefits:
- Separate locks: catch-up never blocks regular retention
- Independent schedules: catch-up can run less frequently
- Either can be disabled independently
- Follows codebase conventions (TraceThreadsClosingJob pattern)

Move velocity estimation from the synchronous HTTP rule creation
endpoint into a dedicated RetentionEstimationJob (3rd Quartz job):

- RetentionEstimationJob: runs every 5 min (configurable), finds
  rules with catchUpDone=false and no velocity, estimates velocity
  + cursor for each, updates the rule in MySQL
- RetentionRuleService.create(): no longer calls ClickHouse during
  rule creation. Saves rule with velocity=null, cursor=null,
  catchUpDone=false. The estimation job picks it up within minutes.
- RetentionEstimationService: extracted from RetentionRuleServiceImpl,
  contains estimateVelocity, scoutFirstDataCursor, isTooManyRowsException

This fixes the blocking HTTP thread concern raised in PR #5820 review:
the scouting loop could make up to ~18 sequential ClickHouse queries
for huge workspaces; it now runs in a background job instead.

After rebase onto main, use the new bestEffortLock overload with
holdUntilExpiry=true on all three retention jobs. The lock is held
for the full interval (30min/5min/45min) so that with N instances,
only one execution happens per interval, not N sequential ones.
@ldaugusto ldaugusto force-pushed the daniela/opik-4842-retention-quartz-refactor branch from c3b7074 to e4d4859 Compare March 26, 2026 18:10
…DEBUG

- Remove catchUp.enabled check from rule creation so rules are always
  marked for catch-up when applyToPast=true, surviving temporary job disablement
- Convert findUnestimatedCatchUpRules SQL to text block
- Demote SpanDAO.estimateVelocityForRetention log to DEBUG
- Demote TraceDAO.scoutFirstDayWithData log to DEBUG
…nds, fix reactor thread blocking

- Remove InterruptableJob, AtomicBoolean interrupted, and @DisallowConcurrentExecution
  from all 3 retention jobs. Concurrency is guarded by Redis lock (holdUntilExpiry),
  not Quartz. doJob() returns immediately via subscribe(), freeing the Quartz thread.
- Remove unused lockTimeoutSeconds from RetentionConfig and CatchUpConfig.
- Add subscribeOn(boundedElastic) to estimation job to avoid blocking a reactor thread
  (estimatePendingRules calls .block() on DAO chains internally).
- Retention deletes and estimation are idempotent — incomplete work during shutdown
  is safely retried on the next cycle.
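The fire-and-forget scheduling shape this commit describes can be sketched with plain java.util.concurrent in place of Reactor (an illustrative analogue, not the actual code): doJob() hands the blocking work to a background pool and returns immediately, freeing the scheduler thread, much like Mono.fromRunnable(work).subscribeOn(Schedulers.boundedElastic()).subscribe().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch (plain java.util.concurrent instead of Reactor) of a non-blocking
// doJob(): the blocking work runs on a background pool while the caller
// (the Quartz thread, in the real code) returns immediately.
public class NonBlockingJobSketch {

    // Stand-in for Schedulers.boundedElastic(): a small pool for blocking work.
    private static final ExecutorService BLOCKING_POOL = Executors.newFixedThreadPool(2);

    static void doJob(Runnable blockingWork) {
        // Fire-and-forget: submit() returns right away.
        BLOCKING_POOL.submit(blockingWork);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        long start = System.nanoTime();
        doJob(() -> {
            try {
                Thread.sleep(200); // simulate slow, blocking retention work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        });
        long returnMillis = (System.nanoTime() - start) / 1_000_000;
        boolean finished = done.await(2, TimeUnit.SECONDS);
        // doJob returned well before the 200ms of work completed.
        System.out.println(finished && returnMillis < 100); // prints "true"
        BLOCKING_POOL.shutdown();
    }
}
```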
Comment on lines +50 to +61
@Override
public void doJob(JobExecutionContext context) {
    // estimatePendingRules() calls .block() on DAO reactive chains internally,
    // so we use subscribeOn(boundedElastic) to avoid blocking a reactor thread.
    lockService.bestEffortLock(
            RUN_LOCK,
            Mono.fromRunnable(estimationService::estimatePendingRules)
                    .subscribeOn(Schedulers.boundedElastic()),
            Mono.fromRunnable(() -> log.debug(
                    "Retention estimation: could not acquire lock, another instance is running")),
            Duration.ofMinutes(config.getCatchUp().getEstimationIntervalMinutes()),
            Duration.ZERO,

lockService.bestEffortLock(...).subscribe(...) is duplicated across the retention jobs. Should we extract a shared helper like AbstractRetentionJob or RetentionJobRunner.runWithLock(lock, workMono, interval, failLog, successLog) to centralize the lock/subscribe/logging boilerplate?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low


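The helper the reviewer suggests could look roughly like this (hypothetical names and a fake lock-service interface for illustration; the real version would wrap the codebase's lockService.bestEffortLock(...).subscribe() chain):

```java
import java.time.Duration;

// Sketch of the shared runWithLock helper suggested in the review
// (hypothetical names). Centralizes the acquire/run/log boilerplate
// currently duplicated across the three retention jobs.
public class RetentionJobRunnerSketch {

    @FunctionalInterface
    interface LockService {
        // Returns true if the lock was acquired and the work ran.
        boolean bestEffortLock(String lockName, Duration ttl, Runnable work);
    }

    static void runWithLock(LockService lockService, String lockName,
                            Duration interval, Runnable work, String jobName) {
        boolean ran = lockService.bestEffortLock(lockName, interval, work);
        if (!ran) {
            System.out.println(jobName + ": could not acquire lock, another instance is running");
        }
    }

    public static void main(String[] args) {
        // Fake lock service that always refuses, to exercise the skip path:
        LockService alwaysHeld = (name, ttl, work) -> false;
        runWithLock(alwaysHeld, "retention_estimation_run", Duration.ofMinutes(5),
                () -> {}, "RetentionEstimationJob");
        // prints "RetentionEstimationJob: could not acquire lock, another instance is running"
    }
}
```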

@github-actions
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 11

 39 files, 39 suites, 3m 36s ⏱️
198 tests: 197 ✅, 0 💤, 0 ❌, 1 🔥
197 runs:  197 ✅, 0 💤, 0 ❌

For more details on these errors, see this check.

Results for commit 8d0198e.

♻️ This comment has been updated with latest results.

@ldaugusto ldaugusto marked this pull request as ready for review March 27, 2026 10:25
@ldaugusto ldaugusto requested a review from a team as a code owner March 27, 2026 10:25
@thiagohora
Contributor

💡 suggestion | Observability

None of the three new jobs emit any metrics. Given that these are deletion jobs with direct data impact, adding instrumentation would make operational monitoring significantly easier.

Suggested additions:

Metric                                      Type         Tags                               Jobs
retention.job.<name>.run.total              Counter      result=success|skipped_lock|error  All three
retention.job.<name>.run.duration           Histogram                                       All three
retention.sliding_window.rows_deleted       Counter      workspace_id                       SlidingWindow
retention.sliding_window.fraction           Counter/tag                                     SlidingWindow
retention.catch_up.rules_processed          Counter      tier=small|medium|large            CatchUp
retention.catch_up.rows_deleted             Counter      workspace_id, tier                 CatchUp
retention.catch_up.rules_completed          Counter                                         CatchUp
retention.estimation.rules_estimated        Counter                                         Estimation
retention.estimation.velocity_value         Histogram                                       Estimation
retention.query.estimate_velocity.duration  Histogram    workspace_id                       Estimation
retention.query.scout_first_day.duration    Histogram    workspace_id                       Estimation

The per-workspace rows_deleted and rules_completed counters are the most operationally valuable — they let you verify backfill progress and detect stalled workspaces without querying the database directly.

🤖 Review posted via /review-github-pr

- Add @DisallowConcurrentExecution, InterruptableJob, .block() to all 3 jobs
- Add LIMIT 10 to findUnestimatedCatchUpRules to bound work per cycle
- Move isTooManyRowsException to RetentionUtils
- Rename CatchUpConfig.getInterval() to getCatchUpInterval() for clarity
- Expand idx_catch_up_pending to cover all equality + ORDER BY columns
- Standardize lock-not-acquired logs to DEBUG across all jobs
- Consolidate 3 retention setup methods into single setRetentionJobs()
…bility

- Add run counter and duration histogram to all 3 retention jobs
- Add domain-specific counters to services (workspaces processed, rules
  processed/completed, rules estimated, velocity values)
- Add lightweight pre-delete row counts (SELECT count) in SpanDAO/TraceDAO
  for observability — upper-bound ceiling with >99% precision, excludes
  expensive experiment_items exclusion subquery
- Bump migration from 000061 to 000062 (collision with workspace rule
  project index migration on main)
@github-actions
Copy link
Copy Markdown
Contributor

📋 PR Linter Failed

Missing Section. The description is missing the ## Documentation section.

@ldaugusto
Contributor Author

Addressed in 7bfe150. Here's what was implemented:

Job-level metrics (all 3 jobs):

  • opik.retention.<job>.run — Counter with result=success|skipped_lock|error
  • opik.retention.<job>.duration — Histogram (ms), recorded on success/error only (not skipped_lock)

Domain-level metrics:

  • opik.retention.sliding_window.workspaces_processed — Counter
  • opik.retention.sliding_window.rows_to_delete — Counter, tagged by table=traces|spans
  • opik.retention.catch_up.rules_processed — Counter, tagged by tier=small|medium|large
  • opik.retention.catch_up.rules_completed — Counter
  • opik.retention.catch_up.rows_to_delete — Counter, tagged by table=traces|spans
  • opik.retention.estimation.rules_estimated — Counter
  • opik.retention.estimation.velocity_value — Histogram

On rows_deleted vs rows_to_delete: ClickHouse lightweight DELETEs don't return affected row counts. Instead, we issue a lightweight SELECT count() before each delete batch. This is an upper-bound ceiling with >99% precision — it excludes the experiment_items exclusion subquery to avoid the join cost. The metric is named rows_to_delete to reflect that it's a pre-delete estimate, not an exact post-delete count.

Skipped: Per-workspace tags on row counters (high-cardinality risk) and per-query duration histograms for estimation (would need deeper plumbing for marginal value). The fraction is already visible in the job logs.

…-delete ordering

- FIFO ordering (ORDER BY created_at ASC) prevents a consistently failing
  rule from starving other pending rules
- Document why counts run sequentially before deletes (not in parallel):
  metric must reflect what's about to be removed, cost is minimal via
  ClickHouse primary key index
…onale

Sequential counts avoid overloading ClickHouse and the collected metrics
help assess query cost over time.
Without the SETTINGS clause in the SQL template, the log_comment
placeholder from getSTWithLogComment was never injected into the
query, making these counts invisible in ClickHouse query logs.
…weight

catch_up_velocity is always a range predicate (<, >=, BETWEEN) in finder
queries, so MySQL's B-tree can't use columns after it for ORDER BY. The
new index (catch_up_done, enabled, apply_to_past, catch_up_cursor)
satisfies both filter and sort from a single index scan. Velocity
filtering happens post-index with negligible cost on a small table.
@ldaugusto ldaugusto merged commit a24543e into main Mar 27, 2026
77 checks passed
@ldaugusto ldaugusto deleted the daniela/opik-4842-retention-quartz-refactor branch March 27, 2026 15:52