
fix: defer report scheduler checkpoint until enqueue succeeds#1590

Open
ixchio wants to merge 1 commit into lmnr-ai:dev from ixchio:fix/report-scheduler-checkpoint-race

Conversation

@ixchio

@ixchio ixchio commented Apr 6, 2026

Summary

Fixes #1586 — the reports scheduler was advancing its checkpoint before the enqueue loop completed, causing silently missed report triggers on process crash or MQ failure.

Problem

In check_and_enqueue, the REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY was updated to now at the top of the function (line 99), before any messages were actually published to the queue. This created a race condition:

  1. Process crash mid-enqueue: If the process crashed after the checkpoint write but before all reports were enqueued, the next run would start from the already-advanced checkpoint — permanently skipping the pending hour buckets.
  2. MQ unavailability: If push_to_reports_queue failed (e.g. RabbitMQ down), errors were only logged and Ok(()) was still returned, so the checkpoint moved forward and the failed triggers were never retried.
  3. Silent cache errors: The result of cache.insert(...) was discarded with let _ =, hiding any cache write failures.
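The failure mode above can be illustrated with a minimal sketch of the pre-fix control flow. The types here are simplified stand-ins (a plain `HashMap` for the cache, a closure for `push_to_reports_queue`), not the actual app-server types:

```rust
use std::collections::HashMap;

const LAST_CHECK_KEY: &str = "REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY";

/// Illustrative shape of the original buggy flow: the checkpoint is
/// advanced to `now` before anything is published.
fn check_and_enqueue_old(
    cache: &mut HashMap<String, i64>,
    enqueue: &mut dyn FnMut(u32) -> Result<(), String>,
    pending_reports: &[u32],
    now: i64,
) -> Result<(), String> {
    // 1. Checkpoint written up front; in the original code the cache
    //    result was discarded with `let _ =`, hiding write failures.
    cache.insert(LAST_CHECK_KEY.to_string(), now);

    // 2. Enqueue loop runs afterwards. A crash or MQ failure here means
    //    the next tick resumes from the already-advanced checkpoint, so
    //    the pending reports are never retried.
    for id in pending_reports {
        if let Err(e) = enqueue(*id) {
            // Logged only; Ok(()) is still returned to the caller.
            eprintln!("enqueue failed for report {id}: {e}");
        }
    }
    Ok(())
}
```

Even with every enqueue failing, the caller sees `Ok(())` and the checkpoint sits past the lost reports, which is exactly the silent-skip behavior described above.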

Fix

  • Incremental checkpoint advancement: The checkpoint is now updated per hour bucket, only after all reports in that bucket have been successfully enqueued.
  • Fail-stop on enqueue errors: If any report fails to enqueue, checkpoint advancement stops immediately. The failed hour bucket (and any remaining ones) will be retried on the next scheduler tick.
  • Error propagation for cache writes: cache.insert(...) errors are now propagated via ? instead of silently ignored, so the caller can log and handle them.
  • Empty-window handling: When no hour boundaries need processing, the checkpoint is still advanced to now to prevent re-scanning the same sub-hour window.
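The fixed ordering can be sketched as follows. This is a simplified model of the logic, not the actual scheduler code: `Cache` and the bucket representation are assumptions, and error handling is collapsed to `Result<(), String>`:

```rust
use std::collections::HashMap;

const LAST_CHECK_KEY: &str = "REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY";

/// Stand-in for the Redis-backed cache; insert errors are propagated.
struct Cache {
    store: HashMap<String, i64>,
}

impl Cache {
    fn insert(&mut self, key: &str, value: i64) -> Result<(), String> {
        self.store.insert(key.to_string(), value);
        Ok(())
    }
}

/// Enqueue reports bucket by bucket, advancing the checkpoint only after
/// every report in a bucket has been enqueued successfully.
fn check_and_enqueue(
    cache: &mut Cache,
    hour_buckets: &[(i64, Vec<u32>)], // (triggered_at, report ids)
    enqueue: &mut dyn FnMut(u32) -> Result<(), String>,
    now: i64,
) -> Result<(), String> {
    if hour_buckets.is_empty() {
        // Empty window: still advance to `now` so the same sub-hour
        // window is not rescanned on the next tick.
        cache.insert(LAST_CHECK_KEY, now)?;
        return Ok(());
    }
    for (triggered_at, report_ids) in hour_buckets {
        for id in report_ids {
            // Fail-stop: an enqueue error returns early, holding the
            // checkpoint at the last fully processed bucket so this
            // bucket (and later ones) are retried next tick.
            enqueue(*id)?;
        }
        // All reports in this bucket enqueued; safe to advance.
        cache.insert(LAST_CHECK_KEY, *triggered_at)?;
    }
    Ok(())
}
```

The invariant is visible in the structure: the only `insert` of the checkpoint for a bucket sits after the inner loop that enqueues that bucket, so the checkpoint can never pass a timestamp whose reports were not all published.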

Testing

  • cargo check passes with no compilation errors.
  • The change is minimal and surgical — only app-server/src/reports/scheduler.rs is modified (23 insertions, 4 deletions).
  • Behavioral correctness follows from the invariant that the checkpoint can never advance past a timestamp whose reports have not all been successfully enqueued.

Note

Medium Risk
Changes the report scheduler's checkpointing semantics and error handling; a mistake could cause duplicate triggers or stalled scheduling if failures aren't handled as expected.

Overview
Fixes the reports scheduler so the REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY checkpoint is not advanced until report triggers are actually enqueued.

Checkpoint writes now happen per processed hour bucket (and propagate cache write errors), while enqueue failures cause the function to error out and hold the checkpoint for retry; when there are no hour boundaries, the checkpoint still advances to now to avoid rescanning the same sub-hour window.

Reviewed by Cursor Bugbot for commit 3279eb5.

@CLAassistant

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.

@greptile-apps

greptile-apps bot commented Apr 6, 2026

Greptile Summary

This PR fixes a correctness bug in the reports scheduler where the checkpoint (REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY) was advanced before report messages were actually published to the queue, causing permanently missed triggers on process crash or MQ failure. The fix moves the checkpoint advance to after each hour bucket is fully enqueued, with fail-stop semantics on any enqueue error.

Changes:

  • Checkpoint is now updated per hour bucket, only after all reports in that bucket are successfully enqueued
  • Enqueue failures halt further checkpoint advancement, allowing the failed bucket to be retried on the next scheduler tick
  • cache.insert(...) result is now propagated via ? instead of silently discarded
  • Empty-window case still advances the checkpoint to now to prevent re-scanning the same sub-hour window

Issues found:

  • Duplicate notification risk on partial bucket failure (no idempotency): If n reports exist in an hour bucket and reports 1…k succeed but report k+1 fails, all_enqueued = false causes the checkpoint to stay at the previous hour boundary. On the next tick the entire bucket is retried, so reports 1…k are enqueued a second time and will generate duplicate emails/Slack notifications for those users. Confirmed by reading generator.rs — there is no deduplication or idempotency check on (report_id, triggered_at) in the consumer.
  • Silent swallowing of partial failure: Returning Ok(()) (instead of Err(...)) when !all_enqueued means the error-level log in run_reports_scheduler is never triggered; only the inner log::warn! fires, which may be missed in alerting pipelines.
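The duplicate-delivery concern could be addressed with a consumer-side dedup guard keyed on (report_id, triggered_at). The sketch below is hypothetical and not part of this PR or the existing generator: it uses an in-process `HashSet` as a stand-in, where a real implementation would use an atomic shared store (for example a Redis `SET` with `NX` and a TTL) so the guard survives restarts and works across consumer instances:

```rust
use std::collections::HashSet;

/// Hypothetical idempotency guard for the reports consumer: records each
/// (report_id, triggered_at) pair on first sight and rejects repeats, so
/// a retried bucket does not resend notifications already delivered.
struct DedupGuard {
    seen: HashSet<(u32, i64)>,
}

impl DedupGuard {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true exactly once per (report_id, triggered_at) pair;
    /// `HashSet::insert` returns false when the key was already present.
    fn first_delivery(&mut self, report_id: u32, triggered_at: i64) -> bool {
        self.seen.insert((report_id, triggered_at))
    }
}
```

With such a guard, the scheduler's at-least-once retry semantics become safe: re-enqueued messages for an already-delivered (report, trigger time) pair are skipped instead of producing duplicate emails or Slack messages.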

Confidence Score: 3/5

The PR fixes a real correctness bug but introduces a retry path that can cause duplicate user-facing notifications due to the absence of consumer-side idempotency.

The original bug (checkpoint advancing before enqueue) is correctly fixed and the incremental advancement logic is sound. However, the retry semantics now activated expose a pre-existing gap: the RabbitMQ consumer (ReportsGenerator) has no idempotency protection on (report_id, triggered_at), so any partial-bucket failure followed by a retry will deliver duplicate emails/Slack messages to users. This is a P1 reliability concern in a user-facing notification path. The secondary issue (returning Ok(()) on partial failure) suppresses error-level alerting. Together these reduce confidence to 3/5.

Relevant locations: app-server/src/reports/scheduler.rs, lines 117-145 (the partial-failure retry path), and app-server/src/reports/generator.rs (the consumer lacks an idempotency guard on (report_id, triggered_at)).

Important Files Changed

Filename Overview
app-server/src/reports/scheduler.rs Checkpoint is now advanced per-bucket after successful enqueue (fixing the original race), but partial-bucket failures will re-enqueue already-successful reports, causing duplicate notifications since the consumer has no idempotency protection.

Sequence Diagram

sequenceDiagram
    participant Sched as Reports Scheduler
    participant Cache as Redis Cache
    participant DB as PostgreSQL
    participant MQ as RabbitMQ
    participant Gen as ReportsGenerator

    Sched->>Cache: try_acquire_lock(LOCK_KEY, 900s)
    Cache-->>Sched: Ok(true)

    Sched->>Cache: get(LAST_CHECK_KEY)
    Cache-->>Sched: last_check_ts

    Sched->>Sched: hour_boundaries_between(last_check, now)

    alt No hour boundaries
        Sched->>Cache: insert(LAST_CHECK_KEY, now)
        Sched-->>Sched: return Ok(())
    else Has hour boundaries
        loop For each (weekday, hour, triggered_at)
            Sched->>DB: get_reports_for_weekday_and_hour(weekday, hour)
            DB-->>Sched: reports[]

            loop For each report
                Sched->>MQ: push_to_reports_queue(ReportTriggerMessage)
                alt Enqueue succeeds
                    MQ-->>Sched: Ok(())
                else Enqueue fails
                    MQ-->>Sched: Err(e)
                    Sched->>Sched: all_enqueued = false
                    Note over Sched: Logs error, continues loop
                end
            end

            alt all_enqueued = false
                Sched->>Sched: warn + return Ok(())
                Note over Sched,Cache: Checkpoint NOT advanced; bucket retried next tick
                Note over MQ,Gen: Already-enqueued reports will be processed AGAIN (no idempotency)
            else all_enqueued = true
                Sched->>Cache: insert(LAST_CHECK_KEY, triggered_at)
                Cache-->>Sched: Ok(())
            end
        end
    end

    Sched->>Cache: release_lock(LOCK_KEY)

    MQ-->>Gen: deliver ReportTriggerMessage
    Gen->>Gen: generate report + send notifications
    Note over Gen: No dedup check on (report_id, triggered_at)


Move the REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY checkpoint update from
before the enqueue loop to after each hour bucket is fully processed.

Previously the checkpoint was advanced to 'now' before any messages
were published. If the process crashed or the message queue was
unavailable, the pending hour buckets were permanently skipped because
the next run would start from the already-advanced checkpoint.

Changes:
- Advance checkpoint incrementally per hour bucket, only after all
  reports in that bucket have been successfully enqueued.
- Stop advancing checkpoint on first enqueue failure so the failed
  hour bucket (and any remaining ones) are retried on the next tick.
- Propagate cache.insert errors via '?' instead of silently ignoring
  them with 'let _ ='.
- When no hour boundaries need processing, still advance checkpoint
  to avoid re-scanning the same sub-hour window.

Fixes lmnr-ai#1586
@ixchio force-pushed the fix/report-scheduler-checkpoint-race branch from 8242593 to 3279eb5 on April 6, 2026 at 19:45
@ixchio

ixchio commented Apr 6, 2026

Hey @dinmukhamedm — this addresses #1586, the checkpoint race in the reports scheduler. The core issue is that the checkpoint was moving forward before the enqueue loop finished, so a crash or MQ hiccup would silently drop scheduled reports with no retry.

The fix is pretty small — just defers the checkpoint write to after each hour bucket is fully enqueued, and returns an error on partial failure so it actually shows up in logs. I also addressed the Greptile bot's review feedback (returning Err instead of Ok on failure for proper error-level logging).

Re: the duplicate notification concern the bot flagged — that's a pre-existing at-least-once delivery gap in the consumer, not something this PR introduces. Happy to open a follow-up issue for adding idempotency on the generator side if that'd be useful.

Would appreciate a review when you get a chance, thanks!
