
fix: defer report scheduler checkpoint until enqueue succeeds#1590

Open
ixchio wants to merge 1 commit into lmnr-ai:dev from ixchio:fix/report-scheduler-checkpoint-race

Conversation

@ixchio

@ixchio ixchio commented Apr 6, 2026

Summary

Fixes #1586 — the reports scheduler was advancing its checkpoint before the enqueue loop completed, causing silently missed report triggers on process crash or MQ failure.

Problem

In check_and_enqueue, the REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY was updated to now at the top of the function (line 99), before any messages were actually published to the queue. This created a race condition:

  1. Process crash mid-enqueue: If the process crashed after the checkpoint write but before all reports were enqueued, the next run would start from the already-advanced checkpoint — permanently skipping the pending hour buckets.
  2. MQ unavailability: If push_to_reports_queue failed (e.g. RabbitMQ down), errors were only logged and Ok(()) was still returned, so the checkpoint moved forward and the failed triggers were never retried.
  3. Silent cache errors: The result of cache.insert(...) was discarded with let _ =, hiding any cache write failures.
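The failure mode above can be illustrated with a minimal sketch of the pre-fix control flow. The types here are simplified stand-ins (a plain `HashMap` for the cache, a closure for `push_to_reports_queue`), not the actual app-server types:

```rust
use std::collections::HashMap;

const LAST_CHECK_KEY: &str = "REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY";

/// Illustrative shape of the original buggy flow: the checkpoint is
/// advanced to `now` before anything is published.
fn check_and_enqueue_old(
    cache: &mut HashMap<String, i64>,
    enqueue: &mut dyn FnMut(u32) -> Result<(), String>,
    pending_reports: &[u32],
    now: i64,
) -> Result<(), String> {
    // 1. Checkpoint written up front; in the original code the cache
    //    result was discarded with `let _ =`, hiding write failures.
    cache.insert(LAST_CHECK_KEY.to_string(), now);

    // 2. Enqueue loop runs afterwards. A crash or MQ failure here means
    //    the next tick resumes from the already-advanced checkpoint, so
    //    the pending reports are never retried.
    for id in pending_reports {
        if let Err(e) = enqueue(*id) {
            // Logged only; Ok(()) is still returned to the caller.
            eprintln!("enqueue failed for report {id}: {e}");
        }
    }
    Ok(())
}
```

Even with every enqueue failing, the caller sees `Ok(())` and the checkpoint sits past the lost reports, which is exactly the silent-skip behavior described above.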

Fix

  • Incremental checkpoint advancement: The checkpoint is now updated per hour bucket, only after all reports in that bucket have been successfully enqueued.
  • Fail-stop on enqueue errors: If any report fails to enqueue, checkpoint advancement stops immediately. The failed hour bucket (and any remaining ones) will be retried on the next scheduler tick.
  • Error propagation for cache writes: cache.insert(...) errors are now propagated via ? instead of silently ignored, so the caller can log and handle them.
  • Empty-window handling: When no hour boundaries need processing, the checkpoint is still advanced to now to prevent re-scanning the same sub-hour window.
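The fixed ordering can be sketched as follows. This is a simplified model of the logic, not the actual scheduler code: `Cache` and the bucket representation are assumptions, and error handling is collapsed to `Result<(), String>`:

```rust
use std::collections::HashMap;

const LAST_CHECK_KEY: &str = "REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY";

/// Stand-in for the Redis-backed cache; insert errors are propagated.
struct Cache {
    store: HashMap<String, i64>,
}

impl Cache {
    fn insert(&mut self, key: &str, value: i64) -> Result<(), String> {
        self.store.insert(key.to_string(), value);
        Ok(())
    }
}

/// Enqueue reports bucket by bucket, advancing the checkpoint only after
/// every report in a bucket has been enqueued successfully.
fn check_and_enqueue(
    cache: &mut Cache,
    hour_buckets: &[(i64, Vec<u32>)], // (triggered_at, report ids)
    enqueue: &mut dyn FnMut(u32) -> Result<(), String>,
    now: i64,
) -> Result<(), String> {
    if hour_buckets.is_empty() {
        // Empty window: still advance to `now` so the same sub-hour
        // window is not rescanned on the next tick.
        cache.insert(LAST_CHECK_KEY, now)?;
        return Ok(());
    }
    for (triggered_at, report_ids) in hour_buckets {
        for id in report_ids {
            // Fail-stop: an enqueue error returns early, holding the
            // checkpoint at the last fully processed bucket so this
            // bucket (and later ones) are retried next tick.
            enqueue(*id)?;
        }
        // All reports in this bucket enqueued; safe to advance.
        cache.insert(LAST_CHECK_KEY, *triggered_at)?;
    }
    Ok(())
}
```

The invariant is visible in the structure: the only `insert` of the checkpoint for a bucket sits after the inner loop that enqueues that bucket, so the checkpoint can never pass a timestamp whose reports were not all published.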

Testing

  • cargo check passes with no compilation errors.
  • The change is minimal and surgical — only app-server/src/reports/scheduler.rs is modified (23 insertions, 4 deletions).
  • Behavioral correctness follows from the invariant that the checkpoint can never advance past a timestamp whose reports have not all been successfully enqueued.

Note

Medium Risk
Changes the report scheduler's checkpointing semantics and error handling; a mistake could cause duplicate triggers or stalled scheduling if failures aren't handled as expected.

Overview
Fixes the reports scheduler so the REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY checkpoint is not advanced until report triggers are actually enqueued.

Checkpoint writes now happen per processed hour bucket (and propagate cache write errors), while enqueue failures cause the function to error out and hold the checkpoint for retry; when there are no hour boundaries, the checkpoint still advances to now to avoid rescanning the same sub-hour window.

Reviewed by Cursor Bugbot for commit 3279eb5.

@CLAassistant

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.

@greptile-apps

greptile-apps bot commented Apr 6, 2026

Greptile Summary

This PR fixes a correctness bug in the reports scheduler where the checkpoint (REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY) was advanced before report messages were actually published to the queue, causing permanently missed triggers on process crash or MQ failure. The fix moves the checkpoint advance to after each hour bucket is fully enqueued, with fail-stop semantics on any enqueue error.

Changes:

  • Checkpoint is now updated per hour bucket, only after all reports in that bucket are successfully enqueued
  • Enqueue failures halt further checkpoint advancement, allowing the failed bucket to be retried on the next scheduler tick
  • cache.insert(...) result is now propagated via ? instead of silently discarded
  • Empty-window case still advances the checkpoint to now to prevent re-scanning the same sub-hour window

Issues found:

  • Duplicate notification risk on partial bucket failure (no idempotency): If n reports exist in an hour bucket and reports 1…k succeed but report k+1 fails, all_enqueued = false causes the checkpoint to stay at the previous hour boundary. On the next tick the entire bucket is retried, so reports 1…k are enqueued a second time and will generate duplicate emails/Slack notifications for those users. Confirmed by reading generator.rs — there is no deduplication or idempotency check on (report_id, triggered_at) in the consumer.
  • Silent swallowing of partial failure: Returning Ok(()) (instead of Err(...)) when !all_enqueued means the error-level log in run_reports_scheduler is never triggered; only the inner log::warn! fires, which may be missed in alerting pipelines.
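The duplicate-delivery concern could be addressed with a consumer-side dedup guard keyed on (report_id, triggered_at). The sketch below is hypothetical and not part of this PR or the existing generator: it uses an in-process `HashSet` as a stand-in, where a real implementation would use an atomic shared store (for example a Redis `SET` with `NX` and a TTL) so the guard survives restarts and works across consumer instances:

```rust
use std::collections::HashSet;

/// Hypothetical idempotency guard for the reports consumer: records each
/// (report_id, triggered_at) pair on first sight and rejects repeats, so
/// a retried bucket does not resend notifications already delivered.
struct DedupGuard {
    seen: HashSet<(u32, i64)>,
}

impl DedupGuard {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true exactly once per (report_id, triggered_at) pair;
    /// `HashSet::insert` returns false when the key was already present.
    fn first_delivery(&mut self, report_id: u32, triggered_at: i64) -> bool {
        self.seen.insert((report_id, triggered_at))
    }
}
```

With such a guard, the scheduler's at-least-once retry semantics become safe: re-enqueued messages for an already-delivered (report, trigger time) pair are skipped instead of producing duplicate emails or Slack messages.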

Confidence Score: 3/5

The PR fixes a real correctness bug but introduces a retry path that can cause duplicate user-facing notifications due to the absence of consumer-side idempotency.

The original bug (checkpoint advancing before enqueue) is correctly fixed and the incremental advancement logic is sound. However, the retry semantics now activated expose a pre-existing gap: the RabbitMQ consumer (ReportsGenerator) has no idempotency protection on (report_id, triggered_at), so any partial-bucket failure followed by a retry will deliver duplicate emails/Slack messages to users. This is a P1 reliability concern in a user-facing notification path. The secondary issue (returning Ok(()) on partial failure) suppresses error-level alerting. Together these reduce confidence to 3/5.

Relevant locations: app-server/src/reports/scheduler.rs, lines 117-145 (the partial-failure retry path), and app-server/src/reports/generator.rs (the consumer lacks an idempotency guard on (report_id, triggered_at)).

Important Files Changed

Filename Overview
app-server/src/reports/scheduler.rs Checkpoint is now advanced per-bucket after successful enqueue (fixing the original race), but partial-bucket failures will re-enqueue already-successful reports, causing duplicate notifications since the consumer has no idempotency protection.

Sequence Diagram

sequenceDiagram
    participant Sched as Reports Scheduler
    participant Cache as Redis Cache
    participant DB as PostgreSQL
    participant MQ as RabbitMQ
    participant Gen as ReportsGenerator

    Sched->>Cache: try_acquire_lock(LOCK_KEY, 900s)
    Cache-->>Sched: Ok(true)

    Sched->>Cache: get(LAST_CHECK_KEY)
    Cache-->>Sched: last_check_ts

    Sched->>Sched: hour_boundaries_between(last_check, now)

    alt No hour boundaries
        Sched->>Cache: insert(LAST_CHECK_KEY, now)
        Sched-->>Sched: return Ok(())
    else Has hour boundaries
        loop For each (weekday, hour, triggered_at)
            Sched->>DB: get_reports_for_weekday_and_hour(weekday, hour)
            DB-->>Sched: reports[]

            loop For each report
                Sched->>MQ: push_to_reports_queue(ReportTriggerMessage)
                alt Enqueue succeeds
                    MQ-->>Sched: Ok(())
                else Enqueue fails
                    MQ-->>Sched: Err(e)
                    Sched->>Sched: all_enqueued = false
                    Note over Sched: Logs error, continues loop
                end
            end

            alt all_enqueued = false
                Sched->>Sched: warn + return Ok(())
                Note over Sched,Cache: Checkpoint NOT advanced; bucket retried next tick
                Note over MQ,Gen: Already-enqueued reports will be processed AGAIN (no idempotency)
            else all_enqueued = true
                Sched->>Cache: insert(LAST_CHECK_KEY, triggered_at)
                Cache-->>Sched: Ok(())
            end
        end
    end

    Sched->>Cache: release_lock(LOCK_KEY)

    MQ-->>Gen: deliver ReportTriggerMessage
    Gen->>Gen: generate report + send notifications
    Note over Gen: No dedup check on (report_id, triggered_at)


Move the REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY checkpoint update from
before the enqueue loop to after each hour bucket is fully processed.

Previously the checkpoint was advanced to 'now' before any messages
were published. If the process crashed or the message queue was
unavailable, the pending hour buckets were permanently skipped because
the next run would start from the already-advanced checkpoint.

Changes:
- Advance checkpoint incrementally per hour bucket, only after all
  reports in that bucket have been successfully enqueued.
- Stop advancing checkpoint on first enqueue failure so the failed
  hour bucket (and any remaining ones) are retried on the next tick.
- Propagate cache.insert errors via '?' instead of silently ignoring
  them with 'let _ ='.
- When no hour boundaries need processing, still advance checkpoint
  to avoid re-scanning the same sub-hour window.

Fixes lmnr-ai#1586
@ixchio force-pushed the fix/report-scheduler-checkpoint-race branch from 8242593 to 3279eb5 on April 6, 2026 at 19:45
@ixchio

ixchio commented Apr 6, 2026

Hey @dinmukhamedm — this addresses #1586, the checkpoint race in the reports scheduler. The core issue is that the checkpoint was moving forward before the enqueue loop finished, so a crash or MQ hiccup would silently drop scheduled reports with no retry.

The fix is pretty small — just defers the checkpoint write to after each hour bucket is fully enqueued, and returns an error on partial failure so it actually shows up in logs. I also addressed the Greptile bot's review feedback (returning Err instead of Ok on failure for proper error-level logging).

Re: the duplicate notification concern the bot flagged — that's a pre-existing at-least-once delivery gap in the consumer, not something this PR introduces. Happy to open a follow-up issue for adding idempotency on the generator side if that'd be useful.

Would appreciate a review when you get a chance, thanks!
