
[OPIK-4842] [BE] Refactor retention jobs to Quartz, extract estimation#5898

Merged
ldaugusto merged 13 commits into main from
daniela/opik-4842-retention-quartz-refactor
Mar 27, 2026

Conversation

@ldaugusto
Contributor

@ldaugusto ldaugusto commented Mar 26, 2026

Details

Refactors the retention job infrastructure from the Managed+Flux.interval pattern to proper Quartz jobs, and extracts velocity estimation from the HTTP request handler into a background job.

Three independent Quartz jobs:

  • RetentionSlidingWindowJob — regular sliding-window cycle, every 30min (configurable via executionsPerDay), fraction-based workspace sharding
  • RetentionEstimationJob — velocity estimation for newly created rules, every 5min (configurable via catchUp.estimationIntervalMinutes). Fixes the blocking HTTP thread concern raised in the review of PR #5820 ([OPIK-4842] [BE] Catch-up job for apply-to-past retention rules)
  • RetentionCatchUpJob — progressive historical deletion, every 45min (configurable via catchUp.intervalMinutes)
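The fraction-based workspace sharding used by the sliding-window job can be sketched in plain Java. This is a hypothetical illustration, not the actual Opik implementation: the idea is that with `executionsPerDay` runs per day, each run processes only the workspaces whose hash falls into the current fraction, so every workspace is visited exactly once per day.

```java
import java.util.List;
import java.util.UUID;

// Sketch of fraction-based workspace sharding (hypothetical names).
// With executionsPerDay runs per day, each run handles the slice of
// workspaces whose hash bucket equals the current fraction, so every
// workspace is visited exactly once per day.
public class ShardingSketch {

    static boolean belongsToFraction(UUID workspaceId, int fraction, int executionsPerDay) {
        // Math.floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(workspaceId.hashCode(), executionsPerDay);
        return bucket == fraction;
    }

    static List<UUID> shard(List<UUID> workspaces, int fraction, int executionsPerDay) {
        return workspaces.stream()
                .filter(id -> belongsToFraction(id, fraction, executionsPerDay))
                .toList();
    }

    public static void main(String[] args) {
        List<UUID> all = List.of(UUID.randomUUID(), UUID.randomUUID(), UUID.randomUUID());
        int executionsPerDay = 48; // one run every 30 minutes
        // Across all 48 fractions, each workspace appears in exactly one shard.
        long total = 0;
        for (int f = 0; f < executionsPerDay; f++) {
            total += shard(all, f, executionsPerDay).size();
        }
        System.out.println(total == all.size()); // prints "true"
    }
}
```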

Key changes:

  • Each job has its own distributed Redis lock — catch-up/estimation never block regular retention
  • Lock TTL equals the scheduling interval for each job (30min for sliding window, 5min for estimation, 45min for catch-up). Combined with holdUntilExpiry=true, this prevents redundant runs across multiple instances
  • @DisallowConcurrentExecution + InterruptableJob for Quartz-level safety; Redis lock is the primary distributed guard
  • doJob() uses .block() with try/catch for proper error handling and graceful shutdown via AtomicBoolean interrupted
  • Rule creation no longer calls ClickHouse — saves with velocity=null, cursor=null, catchUpDone=false. The estimation job picks it up within 5 minutes
  • Follows established codebase patterns (TraceThreadsClosingJob, MetricsAlertJob)
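The interaction between lock TTL and holdUntilExpiry described above can be sketched with an in-memory stand-in for the Redis-backed bestEffortLock (hypothetical API; the real lock lives in Redis so it is shared across instances). Because the lock is deliberately never released early, setting the TTL equal to the scheduling interval means at most one run happens per interval, even when N instances fire the same trigger:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of a best-effort lock with holdUntilExpiry semantics
// (hypothetical names). The lock is NOT released when the work finishes;
// it stays held until the TTL expires, so TTL == scheduling interval
// yields exactly one execution per interval across instances.
public class BestEffortLockSketch {

    private final Map<String, Instant> locks = new HashMap<>();

    /** Runs the work only if the lock is free; holds the lock until the TTL expires. */
    synchronized boolean runIfLockFree(String lockName, Duration ttl, Runnable work) {
        Instant now = Instant.now();
        Instant expiry = locks.get(lockName);
        if (expiry != null && expiry.isAfter(now)) {
            return false; // another instance holds the lock -> skip this run
        }
        locks.put(lockName, now.plus(ttl)); // acquire and hold for the full TTL
        work.run();
        return true;
    }

    public static void main(String[] args) {
        BestEffortLockSketch lock = new BestEffortLockSketch();
        int[] runs = {0};
        // Two "instances" fire within the same 30-minute interval:
        boolean first = lock.runIfLockFree("retention_run", Duration.ofMinutes(30), () -> runs[0]++);
        boolean second = lock.runIfLockFree("retention_run", Duration.ofMinutes(30), () -> runs[0]++);
        System.out.println(first + " " + second + " " + runs[0]); // prints "true false 1"
    }
}
```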

Observability (OpenTelemetry metrics):

  • Job-level: runCounter (success/skipped_lock/error) and runDuration histogram on all 3 jobs
  • Sliding window: workspacesProcessed counter, rowsToDelete counter (by table)
  • Catch-up: rulesProcessed (by tier: small/medium/large), rulesCompleted, rowsToDelete (by table)
  • Estimation: rulesEstimated counter, velocityValues histogram
  • Pre-delete counts: Lightweight SELECT count() before each delete batch for observability — upper-bound ceiling with >99% precision (excludes expensive experiment_items exclusion subquery to avoid join cost)
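The counter-with-result-tag shape used above can be illustrated with a minimal in-memory registry (hypothetical sketch; the real jobs use OpenTelemetry instruments). A counter like opik.retention.&lt;job&gt;.run is keyed by its result tag:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal in-memory sketch of a tagged counter (hypothetical registry;
// the actual implementation uses OpenTelemetry). Each (metric, tag)
// combination gets its own monotonically increasing counter.
public class TaggedCounterSketch {

    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    static void increment(String metric, String tagKey, String tagValue) {
        String key = metric + "{" + tagKey + "=" + tagValue + "}";
        COUNTERS.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public static void main(String[] args) {
        increment("opik.retention.sliding_window.run", "result", "success");
        increment("opik.retention.sliding_window.run", "result", "success");
        increment("opik.retention.sliding_window.run", "result", "skipped_lock");
        System.out.println(
                COUNTERS.get("opik.retention.sliding_window.run{result=success}").sum()); // prints "2"
    }
}
```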

Review feedback addressed:

  • Consolidated setRetentionJobs() in lifecycle listener
  • findUnestimatedCatchUpRules limited to 10 per tick
  • isTooManyRowsException moved to RetentionUtils
  • getCatchUpInterval() renamed to avoid shadowing parent getInterval()
  • Expanded catch-up index to cover all equality + ORDER BY columns
  • Migration bumped from 000061 to 000062 (collision with workspace rule project index)

Depends on: PR #5820 (catch-up job) — already merged to main.

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-4797
  • OPIK-4842

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4
    • Scope: Implementation with human-driven architecture and design decisions
    • Human verification: Pair-programmed throughout — human led architecture, concurrency strategy, tier design, and all key technical decisions

Testing

Commands run:

cd apps/opik-backend
mvn test -Dtest="RetentionPolicyServiceTest"
mvn test -Dtest="RetentionEstimationServiceTest,RetentionUtilsTest"

Results: 20 tests, 0 failures, 0 errors

Scenarios validated:

  • All existing retention tests pass unchanged
  • Catch-up integration test updated: estimation now runs as separate step before catch-up

Documentation

Retention documentation will come in a future task.

@github-actions github-actions bot added the java, Backend, and tests labels Mar 26, 2026
@github-actions
Contributor

github-actions bot commented Mar 26, 2026

Backend Tests - Integration Group 8

291 tests ±0   289 ✅ ±0   2 💤 ±0   0 ❌ ±0
 29 suites ±0   29 files ±0   9m 56s ⏱️ +15s

Results for commit e4d4859. ± Comparison against base commit eee4994.

♻️ This comment has been updated with latest results.

Base automatically changed from daniela/opik-4842-retention-catchup-job to main March 26, 2026 17:58

Split the single RetentionPolicyJob (Managed pattern) into two
independent Quartz jobs:

- RetentionSlidingWindowJob: regular sliding-window cycle, runs every
  (24*60)/executionsPerDay minutes (default 30min), fraction-based
  workspace sharding
- RetentionCatchUpJob: progressive historical deletion, runs on its
  own schedule (default 60min), separate distributed lock

Both follow the established Quartz pattern (Job + InterruptableJob,
@DisallowConcurrentExecution, bestEffortLock) and are registered via
OpikGuiceyLifecycleEventListener.

Benefits:
- Separate locks: catch-up never blocks regular retention
- Independent schedules: catch-up can run less frequently
- Either can be disabled independently
- Follows codebase conventions (TraceThreadsClosingJob pattern)

Move velocity estimation from the synchronous HTTP rule creation
endpoint into a dedicated RetentionEstimationJob (3rd Quartz job):

- RetentionEstimationJob: runs every 5 min (configurable), finds
  rules with catchUpDone=false and no velocity, estimates velocity
  + cursor for each, updates the rule in MySQL
- RetentionRuleService.create(): no longer calls ClickHouse during
  rule creation. Saves rule with velocity=null, cursor=null,
  catchUpDone=false. The estimation job picks it up within minutes.
- RetentionEstimationService: extracted from RetentionRuleServiceImpl,
  contains estimateVelocity, scoutFirstDataCursor, isTooManyRowsException

This fixes the blocking HTTP thread concern raised in PR #5820 review:
the scouting loop could make up to ~18 sequential ClickHouse queries
for huge workspaces; it now runs in a background job instead.

After rebase onto main, use the new bestEffortLock overload with
holdUntilExpiry=true on all three retention jobs. The lock is held
for the full interval (30min/5min/45min) so that with N instances,
only one execution happens per interval, not N sequential ones.
@ldaugusto ldaugusto force-pushed the daniela/opik-4842-retention-quartz-refactor branch from c3b7074 to e4d4859 Compare March 26, 2026 18:10
…DEBUG

- Remove catchUp.enabled check from rule creation so rules are always
  marked for catch-up when applyToPast=true, surviving temporary job disablement
- Convert findUnestimatedCatchUpRules SQL to text block
- Demote SpanDAO.estimateVelocityForRetention log to DEBUG
- Demote TraceDAO.scoutFirstDayWithData log to DEBUG
…nds, fix reactor thread blocking

- Remove InterruptableJob, AtomicBoolean interrupted, and @DisallowConcurrentExecution
  from all 3 retention jobs. Concurrency is guarded by Redis lock (holdUntilExpiry),
  not Quartz. doJob() returns immediately via subscribe(), freeing the Quartz thread.
- Remove unused lockTimeoutSeconds from RetentionConfig and CatchUpConfig.
- Add subscribeOn(boundedElastic) to estimation job to avoid blocking a reactor thread
  (estimatePendingRules calls .block() on DAO chains internally).
- Retention deletes and estimation are idempotent — incomplete work during shutdown
  is safely retried on the next cycle.
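The fire-and-forget scheduling shape this commit describes can be sketched with plain java.util.concurrent in place of Reactor (an illustrative analogue, not the actual code): doJob() hands the blocking work to a background pool and returns immediately, freeing the scheduler thread, much like Mono.fromRunnable(work).subscribeOn(Schedulers.boundedElastic()).subscribe().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch (plain java.util.concurrent instead of Reactor) of a non-blocking
// doJob(): the blocking work runs on a background pool while the caller
// (the Quartz thread, in the real code) returns immediately.
public class NonBlockingJobSketch {

    // Stand-in for Schedulers.boundedElastic(): a small pool for blocking work.
    private static final ExecutorService BLOCKING_POOL = Executors.newFixedThreadPool(2);

    static void doJob(Runnable blockingWork) {
        // Fire-and-forget: submit() returns right away.
        BLOCKING_POOL.submit(blockingWork);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        long start = System.nanoTime();
        doJob(() -> {
            try {
                Thread.sleep(200); // simulate slow, blocking retention work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        });
        long returnMillis = (System.nanoTime() - start) / 1_000_000;
        boolean finished = done.await(2, TimeUnit.SECONDS);
        // doJob returned well before the 200ms of work completed.
        System.out.println(finished && returnMillis < 100); // prints "true"
        BLOCKING_POOL.shutdown();
    }
}
```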
Comment on lines +50 to +61
@Override
public void doJob(JobExecutionContext context) {
    // estimatePendingRules() calls .block() on DAO reactive chains internally,
    // so we use subscribeOn(boundedElastic) to avoid blocking a reactor thread.
    lockService.bestEffortLock(
            RUN_LOCK,
            Mono.fromRunnable(estimationService::estimatePendingRules)
                    .subscribeOn(Schedulers.boundedElastic()),
            Mono.fromRunnable(() -> log.debug(
                    "Retention estimation: could not acquire lock, another instance is running")),
            Duration.ofMinutes(config.getCatchUp().getEstimationIntervalMinutes()),
            Duration.ZERO,

lockService.bestEffortLock(...).subscribe(...) is duplicated across the retention jobs. Should we extract a shared helper like AbstractRetentionJob or RetentionJobRunner.runWithLock(lock, workMono, interval, failLog, successLog) to centralize the lock/subscribe/logging boilerplate?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low


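The helper the reviewer suggests could look roughly like this (hypothetical names and a fake lock-service interface for illustration; the real version would wrap the codebase's lockService.bestEffortLock(...).subscribe() chain):

```java
import java.time.Duration;

// Sketch of the shared runWithLock helper suggested in the review
// (hypothetical names). Centralizes the acquire/run/log boilerplate
// currently duplicated across the three retention jobs.
public class RetentionJobRunnerSketch {

    @FunctionalInterface
    interface LockService {
        // Returns true if the lock was acquired and the work ran.
        boolean bestEffortLock(String lockName, Duration ttl, Runnable work);
    }

    static void runWithLock(LockService lockService, String lockName,
                            Duration interval, Runnable work, String jobName) {
        boolean ran = lockService.bestEffortLock(lockName, interval, work);
        if (!ran) {
            System.out.println(jobName + ": could not acquire lock, another instance is running");
        }
    }

    public static void main(String[] args) {
        // Fake lock service that always refuses, to exercise the skip path:
        LockService alwaysHeld = (name, ttl, work) -> false;
        runWithLock(alwaysHeld, "retention_estimation_run", Duration.ofMinutes(5),
                () -> {}, "RetentionEstimationJob");
        // prints "RetentionEstimationJob: could not acquire lock, another instance is running"
    }
}
```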

@github-actions
Contributor

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 11

 39 files, 39 suites, 3m 36s ⏱️
198 tests: 197 ✅, 0 💤, 0 ❌, 1 🔥
197 runs:  197 ✅, 0 💤, 0 ❌

For more details on these errors, see this check.

Results for commit 8d0198e.

♻️ This comment has been updated with latest results.

@ldaugusto ldaugusto marked this pull request as ready for review March 27, 2026 10:25
@ldaugusto ldaugusto requested a review from a team as a code owner March 27, 2026 10:25
@thiagohora
Contributor

💡 suggestion | Observability

None of the three new jobs emit any metrics. Given that these are deletion jobs with direct data impact, adding instrumentation would make operational monitoring significantly easier.

Suggested additions:

Metric                                      Type         Tags                               Jobs
retention.job.<name>.run.total              Counter      result=success|skipped_lock|error  All three
retention.job.<name>.run.duration           Histogram                                       All three
retention.sliding_window.rows_deleted       Counter      workspace_id                       SlidingWindow
retention.sliding_window.fraction           Counter/tag                                     SlidingWindow
retention.catch_up.rules_processed          Counter      tier=small|medium|large            CatchUp
retention.catch_up.rows_deleted             Counter      workspace_id, tier                 CatchUp
retention.catch_up.rules_completed          Counter                                         CatchUp
retention.estimation.rules_estimated        Counter                                         Estimation
retention.estimation.velocity_value         Histogram                                       Estimation
retention.query.estimate_velocity.duration  Histogram    workspace_id                       Estimation
retention.query.scout_first_day.duration    Histogram    workspace_id                       Estimation

The per-workspace rows_deleted and rules_completed counters are the most operationally valuable — they let you verify backfill progress and detect stalled workspaces without querying the database directly.

🤖 Review posted via /review-github-pr

- Add @DisallowConcurrentExecution, InterruptableJob, .block() to all 3 jobs
- Add LIMIT 10 to findUnestimatedCatchUpRules to bound work per cycle
- Move isTooManyRowsException to RetentionUtils
- Rename CatchUpConfig.getInterval() to getCatchUpInterval() for clarity
- Expand idx_catch_up_pending to cover all equality + ORDER BY columns
- Standardize lock-not-acquired logs to DEBUG across all jobs
- Consolidate 3 retention setup methods into single setRetentionJobs()
…bility

- Add run counter and duration histogram to all 3 retention jobs
- Add domain-specific counters to services (workspaces processed, rules
  processed/completed, rules estimated, velocity values)
- Add lightweight pre-delete row counts (SELECT count) in SpanDAO/TraceDAO
  for observability — upper-bound ceiling with >99% precision, excludes
  expensive experiment_items exclusion subquery
- Bump migration from 000061 to 000062 (collision with workspace rule
  project index migration on main)
@github-actions
Copy link
Copy Markdown
Contributor

📋 PR Linter Failed

Missing Section. The description is missing the ## Documentation section.

@ldaugusto
Contributor Author

Addressed in 7bfe150. Here's what was implemented:

Job-level metrics (all 3 jobs):

  • opik.retention.<job>.run — Counter with result=success|skipped_lock|error
  • opik.retention.<job>.duration — Histogram (ms), recorded on success/error only (not skipped_lock)

Domain-level metrics:

  • opik.retention.sliding_window.workspaces_processed — Counter
  • opik.retention.sliding_window.rows_to_delete — Counter, tagged by table=traces|spans
  • opik.retention.catch_up.rules_processed — Counter, tagged by tier=small|medium|large
  • opik.retention.catch_up.rules_completed — Counter
  • opik.retention.catch_up.rows_to_delete — Counter, tagged by table=traces|spans
  • opik.retention.estimation.rules_estimated — Counter
  • opik.retention.estimation.velocity_value — Histogram

On rows_deleted vs rows_to_delete: ClickHouse lightweight DELETEs don't return affected row counts. Instead, we issue a lightweight SELECT count() before each delete batch. This is an upper-bound ceiling with >99% precision — it excludes the experiment_items exclusion subquery to avoid the join cost. The metric is named rows_to_delete to reflect that it's a pre-delete estimate, not an exact post-delete count.

Skipped: Per-workspace tags on row counters (high-cardinality risk) and per-query duration histograms for estimation (would need deeper plumbing for marginal value). The fraction is already visible in the job logs.

…-delete ordering

- FIFO ordering (ORDER BY created_at ASC) prevents a consistently failing
  rule from starving other pending rules
- Document why counts run sequentially before deletes (not in parallel):
  metric must reflect what's about to be removed, cost is minimal via
  ClickHouse primary key index
…onale

Sequential counts avoid overloading ClickHouse and the collected metrics
help assess query cost over time.
Without the SETTINGS clause in the SQL template, the log_comment
placeholder from getSTWithLogComment was never injected into the
query, making these counts invisible in ClickHouse query logs.
…weight

catch_up_velocity is always a range predicate (<, >=, BETWEEN) in finder
queries, so MySQL's B-tree can't use columns after it for ORDER BY. The
new index (catch_up_done, enabled, apply_to_past, catch_up_cursor)
satisfies both filter and sort from a single index scan. Velocity
filtering happens post-index with negligible cost on a small table.
@ldaugusto ldaugusto merged commit a24543e into main Mar 27, 2026
77 checks passed
@ldaugusto ldaugusto deleted the daniela/opik-4842-retention-quartz-refactor branch March 27, 2026 15:52