Skip to content

Conversation

p-schlickmann
Copy link
Contributor

@p-schlickmann p-schlickmann commented Jul 25, 2025

Hey @rosa

While I was trying to reduce the flakiness of test/integration/instrumentation_test.rb, I uncovered a real issue:

On rare occasions the scheduler would recurse with a near-zero delay and enqueue 2-3 jobs in the same second, which 100% shouldn't happen. The output was something like:

SolidQueue::Job.all.map(&:scheduled_at)
# ⇒ [
#   2025-08-01T02:41:38.995992Z,
#   2025-08-01T02:41:38.998557Z,
#   2025-08-01T02:41:38.999440Z
# ]

# Note they are all scheduled when it's very close to the next whole-second.

Explanation:

def schedule(task)
  # 1) Create a scheduled task that will wait `delay_from_now` seconds and then run the block
  scheduled_task = Concurrent::ScheduledTask.new(task.delay_from_now, args: [ self, task, task.next_time ]) do |thread_schedule, thread_task, thread_task_run_at|
    # 2) Schedule the next occurrence, which calls #schedule again
    thread_schedule.schedule_task(thread_task)
  
    wrap_in_app_executor do
      # 3) Enqueues the run and creates SolidQueue::Job
      thread_task.enqueue(at: thread_task_run_at)
    end
  end
  
  scheduled_task.add_observer do |_, _, error|
    handle_thread_error(error) if error && !error.is_a?(Concurrent::CancelledOperationError)
  end
  
  # 4) If `delay_from_now` is very small, this will execute the block immediately
  scheduled_task.tap(&:execute)
end

So what happens is:

  • delay_from_now function returns something very close to 0, because the timestamp is 2025-08-01T02:41:38.995992Z
  • Step 4 with delay_from_now ≈ 0 executes the block right away
  • Step 2 inside that block calls #schedule_task, which invokes #schedule again (still zero delay), and so on, nesting 2-3 immediate runs before any enqueue happens.
  • Only after all that recursion, Step 3 enqueues the first job (and creates the DB record).

The fix

By clamping the delay to at least 100 ms - when the raw delay is just a few milliseconds before the next whole-second mark - that extra 0.1 s pushes the scheduled timestamp into the following second, guaranteeing Step 3 (the enqueue) has time to complete and prevents Step 2 from going into recursion.

It's kinda confusing 😅, but the risk should be very low, and I have confirmed the test no longer fails:

# Runs ok!
for i in {1..1000}; do bin/rails test test/integration/instrumentation_test.rb || break; done

@p-schlickmann p-schlickmann changed the title Attempt to reduce flakiness of test/integration/instrumentation_test.rb Adding 100ms minimum delay for recurring tasks Aug 6, 2025
Copy link
Member

@rosa rosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh, interesting! Great catch!

@rosa rosa merged commit c445201 into rails:main Sep 13, 2025
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants