
Conversation


Copilot AI commented Dec 3, 2025

Build retries show unexplained large gaps (30m to 5h) instead of the expected 5-minute delay. This PR documents the investigation findings and potential causes in the Celery configuration.

Investigation Summary

Potential causes for retry timing gaps with Redis broker + acks_late=True:

  1. Redis sorted set polling — ETA/countdown tasks stored in sorted set, polled periodically; high contention delays pickup
  2. visibility_timeout interaction — 5h timeout prevents duplicate task redelivery but doesn't affect ETA scheduling directly
  3. Worker availability — tasks wait when all workers busy at countdown expiry
  4. Queue backlog — ETA tasks compete with regular tasks

Changes

  • Added inline documentation in BROKER_TRANSPORT_OPTIONS explaining visibility_timeout interaction with acks_late and known timing gap causes
  • Documented RTD_BUILDS_RETRY_DELAY behavior and added a reference to issue #12472 (Builds: retries sometimes show large gaps in retry timing)
  • Added comments in UpdateDocsTask noting that a warning is logged when the gap tracked via task_executed_at exceeds 10 minutes

The Build.task_executed_at field (from PR #12500) continues to track actual vs scheduled execution time for ongoing debugging.
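
For orientation, a minimal sketch of the settings shape being documented. The 300-second retry delay and the 5-hour visibility timeout are the values discussed in this PR; everything else is an assumption for illustration, not the actual diff:

```python
# Illustrative only: approximate shape of the documented settings.

RTD_BUILDS_RETRY_DELAY = 5 * 60  # 300 seconds (5 minutes) between build retries

BROKER_TRANSPORT_OPTIONS = {
    # With acks_late=True a task is only acknowledged once it finishes, so a
    # dead worker's task is redelivered after visibility_timeout expires.
    # This timeout does not schedule ETA/countdown tasks: those sit in a
    # Redis sorted set that workers poll, so a busy queue can delay pickup
    # well past the 5-minute countdown.
    "visibility_timeout": 5 * 60 * 60,  # 5 hours
}
```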

Original prompt

This section details on the original issue you should resolve

<issue_title>Builds: retries sometimes show large gaps in retry timing</issue_title>
<issue_description>See parent for more background, but the summary of the issue is that concurrency-limited builds don't execute retry tasks on a tight schedule (default is 300s). Instead, the gap between retries is anywhere from 5m to 30m, and I've seen 5-hour gaps.

Build task queue contention might be able to explain away a 10 min or maybe 30 min gap, but not consistent gaps and certainly not a 5 hour gap. To me, this feels like something else is happening here.</issue_description>

<agent_instructions>Look into why this might be happening with our celery configuration. </agent_instructions>

Comments on the Issue (you are @copilot in this section)

<comment_new>@humitos
The big gaps are not because of queue contention; they happen on projects hitting the concurrency limit constantly. When the build is marked to be retried 5 minutes in the future, it's executed at that time, but due to the concurrency limit on the project it's retried again. That happens constantly and then you get these big gaps.

I've only noticed these gaps, but they were always on this type of project.</comment_new>
<comment_new>@agjohnson
Not sure I follow. How does a task get a 30-minute or 5-hour gap between retry tasks then? The task needs to be picked up by a worker at retry time to evaluate the concurrency limit.

I'm not describing the cumulative difference between commands executing or something, I'm describing the time difference between the worker setting the task to retry and actually picking up the retried task.

https://onenr.io/0BQ1J0aJ5wx


https://onenr.io/0Zw0J0YrNjv


These are just the two builds I noted before, there are many others and logging has the data on what is actually happening here.</comment_new>
<comment_new>@humitos

I'm not describing the cumulative difference between commands executing or something, I'm describing the time difference between the worker setting the task to retry and actually picking up the the retried task.

Ah, good point! The only explanation for the screenshots you shared is the build queue being backed up... but it doesn't make sense for that to happen so often, which makes me think there must be something else.</comment_new>
<comment_new>@agjohnson
Yeah, definitely. While it could be queue backups, that seems unlikely. I suppose the next step here could be to pull a list of builds that retried repeatedly and test the timings?

It might just be easier to add some code and log this scenario when we notice it at the code level. I don't think we have a way to query retries at all, but you can look at the logs for "Retrying this task" and look for repeated build ids in the columns (you might have to add the column).


https://onenr.io/0BQ1JWmpWwx

But it's still a lot of manual searching, I didn't see any in the handful of builds I tried.</comment_new>
<comment_new>@humitos

It might just be easier to add some code and try to log this scenario when we notice it at the code level

How are you thinking we can detect this scenario at code level? 🤔 </comment_new>
<comment_new>@agjohnson
I was thinking either storing a history of when tasks are attempted, or if the task gives us the time the task was scheduled to be executed we could also just log the time difference.</comment_new>
<comment_new>@humitos
I opened #12500 to save Build.task_executed_at when the task containing the build is grabbed by a worker. The worker checks if the build was created or retried more than 10 minutes ago and logs a warning in that case. This means the build was in the queue waiting for some time. Would this help?</comment_new>
<comment_new>@agjohnson
Yeah, that seems like a great place to start. If we need to, we could also test something closer to ~6m to get an accurate view of all the occurrences of retry time skew, but it's the >10min retries that are the most concerning right now.</comment_new>
<comment_new>@humitos
In less than a day we have 80+ builds hitting the 10-minute waiting time: https://onenr.io/0VRVvvbEvRa



Copilot AI changed the title from "[WIP] Fix large gaps in retry timing for builds" to "Document Celery retry timing gaps investigation (#12472)" on Dec 3, 2025
Copilot AI requested a review from ericholscher December 3, 2025 17:33
@ericholscher

@copilot some tasks are retrying at 300 minutes, is there a chance we're setting that incorrectly somewhere?


Copilot AI commented Dec 3, 2025

@copilot some tasks are retrying at 300 minutes, is there a chance we're setting that incorrectly somewhere?

I investigated and the retry delay configuration looks correct:

  • RTD_BUILDS_RETRY_DELAY = 5 * 60 = 300 seconds (5 minutes) in settings/base.py
  • UpdateDocsTask.default_retry_delay = settings.RTD_BUILDS_RETRY_DELAY correctly uses this value
  • The countdown in prepare_build also uses settings.RTD_BUILDS_RETRY_DELAY

The 300-minute gaps observed are a symptom, not the configured value. The task is scheduled for 5-minute retries, but something in the Celery/Redis scheduling causes it to actually be picked up much later.

The potential causes documented in this PR (sorted set polling, visibility_timeout interaction, worker availability, queue backlog) remain the likely explanations rather than a misconfiguration.
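
To make the units concrete, here is a hedged sketch of how that configuration hangs together. The class and setting names follow the bullets above; the broker URL and the concurrency check are placeholders, not the real implementation:

```python
from celery import Celery, Task

app = Celery("builds", broker="redis://localhost:6379/0")  # placeholder broker URL

RTD_BUILDS_RETRY_DELAY = 5 * 60  # seconds, i.e. 5 minutes


class UpdateDocsTask(Task):
    name = "update_docs_task"
    # Celery interprets default_retry_delay in SECONDS, so 300 means 5 minutes,
    # not 300 minutes.
    default_retry_delay = RTD_BUILDS_RETRY_DELAY
    acks_late = True

    def run(self, build_pk):
        if self.concurrency_limit_reached(build_pk):
            # countdown is also in seconds; it only sets the *earliest* time a
            # worker may pick the task up again, not a guaranteed time.
            raise self.retry(countdown=RTD_BUILDS_RETRY_DELAY)
        # ... run the actual build here ...

    def concurrency_limit_reached(self, build_pk):
        """Placeholder for the real per-project concurrency check."""
        return False


app.register_task(UpdateDocsTask())
```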

@ericholscher

@copilot Can we fix the issues easily?


Copilot AI commented Dec 3, 2025

@copilot Can we fix the issues easily?

Unfortunately, there's no easy fix. The issue is inherent to how Celery with Redis handles ETA/countdown tasks:

  1. ETA tasks use Redis sorted sets — These are polled periodically, not event-driven. When the countdown expires, the task competes with all other queued tasks.

  2. acks_late=True is necessary — This ensures builds restart if a worker dies, but means tasks stay "reserved" longer.

  3. Worker availability is the bottleneck — When all workers are busy at countdown expiry, the task waits.

Potential mitigations (not easy fixes):

  • More workers — Reduces wait time when countdown expires
  • Separate queue for retries with dedicated workers — Prevents retries from competing with new builds
  • Use Celery Beat for scheduled retries instead of countdown — More predictable scheduling but significant code changes

The task_executed_at logging from PR #12500 will help quantify how often this happens and whether scaling workers would help. The issue may be acceptable if it only affects projects consistently hitting concurrency limits.
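
As an illustration of the second mitigation, a sketch of what a dedicated retry queue could look like. The queue names and broker URL are assumptions, not existing configuration:

```python
from celery import Celery
from kombu import Queue

app = Celery("builds", broker="redis://localhost:6379/0")  # placeholder broker URL

# A dedicated queue so retried builds don't compete with fresh builds for workers.
app.conf.task_queues = (
    Queue("builds"),
    Queue("builds-retry"),
)

# Inside the build task, extra retry() options are forwarded to apply_async(),
# so the retry can be pushed onto the dedicated queue:
#
#     raise self.retry(countdown=300, queue="builds-retry")
#
# A separate worker pool then drains it:
#
#     celery -A <app> worker -Q builds-retry
```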

@humitos

humitos commented Dec 9, 2025

Use Celery Beat for scheduled retries instead of countdown — More predictable scheduling but significant code changes

This is a potential fix for our case. I've done this in the past and it worked a lot more reliably than re-triggering Celery tasks and depending on Celery to handle them properly. It requires:

  • adding a new Build.state="retrying"
  • creating a scheduled/periodic task to grab those builds
  • modifying trigger_build to accept an existing Build object

Doesn't look like a ton of work -- at least without jumping into the details 😉
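
Building on that outline, a minimal sketch of what the Beat-driven retry could look like. The "retrying" state, the date-based cutoff, the modified trigger_build signature, and the import paths are assumptions for illustration, not existing code:

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from readthedocs.builds.models import Build
from readthedocs.core.utils import trigger_build


@shared_task
def retry_concurrency_limited_builds():
    """Run by Celery Beat (e.g. every minute) to re-trigger builds parked in
    the proposed 'retrying' state once the retry delay has elapsed."""
    cutoff = timezone.now() - timedelta(seconds=300)  # RTD_BUILDS_RETRY_DELAY
    for build in Build.objects.filter(state="retrying", date__lte=cutoff):
        # trigger_build would need to accept an existing Build object instead
        # of always creating a new one, as noted in the list above.
        trigger_build(project=build.project, version=build.version, build=build)
```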

