
Conversation


Copilot AI commented Dec 3, 2025

Build retries show unexplained large gaps (30m to 5h) instead of the expected 5-minute delay. This PR documents the investigation findings and potential causes in the Celery configuration.

Investigation Summary

Potential causes for retry timing gaps with Redis broker + acks_late=True:

  1. Redis sorted set polling — ETA/countdown tasks stored in sorted set, polled periodically; high contention delays pickup
  2. visibility_timeout interaction — 5h timeout prevents duplicate task redelivery but doesn't affect ETA scheduling directly
  3. Worker availability — tasks wait when all workers busy at countdown expiry
  4. Queue backlog — ETA tasks compete with regular tasks

Changes

  • Added inline documentation in BROKER_TRANSPORT_OPTIONS explaining visibility_timeout interaction with acks_late and known timing gap causes
  • Documented RTD_BUILDS_RETRY_DELAY behavior and added a reference to issue #12472 (Builds: retries sometimes show large gaps in retry timing)
  • Added comments in UpdateDocsTask noting that a warning is logged when the gap tracked via task_executed_at exceeds 10 minutes

The Build.task_executed_at field (from PR #12500) continues to track actual vs scheduled execution time for ongoing debugging.
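
For orientation, a minimal sketch of the settings shape being documented. The 300-second retry delay and the 5-hour visibility timeout are the values discussed in this PR; everything else is an assumption for illustration, not the actual diff:

```python
# Illustrative only: approximate shape of the documented settings.

RTD_BUILDS_RETRY_DELAY = 5 * 60  # 300 seconds (5 minutes) between build retries

BROKER_TRANSPORT_OPTIONS = {
    # With acks_late=True a task is only acknowledged once it finishes, so a
    # dead worker's task is redelivered after visibility_timeout expires.
    # This timeout does not schedule ETA/countdown tasks: those sit in a
    # Redis sorted set that workers poll, so a busy queue can delay pickup
    # well past the 5-minute countdown.
    "visibility_timeout": 5 * 60 * 60,  # 5 hours
}
```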

Original prompt

This section details on the original issue you should resolve

<issue_title>Builds: retries sometimes show large gaps in retry timing</issue_title>
<issue_description>See parent for more background, but the summary of the issue is that concurrency-limited builds don't execute retry tasks on a tight schedule (default is 300s). Instead, the gap between retries is anywhere from 5m to 30m, and I've seen 5-hour gaps.

Build task queue contention might be able to explain away a 10 min or maybe 30 min gap, but not consistent gaps and certainly not a 5 hour gap. To me, this feels like something else is happening here.</issue_description>

<agent_instructions>Look into why this might be happening with our celery configuration. </agent_instructions>

Comments on the Issue (you are @copilot in this section)

<comment_new>@humitos
The big gaps are not because of queue contention; they happen on projects hitting the concurrency limit constantly. When the build is marked to be retried 5 minutes in the future, it's executed at that time, but due to the concurrency limit on the project it's retried again. That happens constantly and then you get these big gaps.

I've only noticed these gaps, but they were always on this type of project.</comment_new>
<comment_new>@agjohnson
Not sure I follow. How does a task get a 30-minute or 5-hour gap between retry tasks then? The task needs to be picked up by a worker at retry time to evaluate the concurrency limit.

I'm not describing the cumulative difference between commands executing or something, I'm describing the time difference between the worker setting the task to retry and actually picking up the retried task.

https://onenr.io/0BQ1J0aJ5wx


https://onenr.io/0Zw0J0YrNjv


These are just the two builds I noted before, there are many others and logging has the data on what is actually happening here.</comment_new>
<comment_new>@humitos

I'm not describing the cumulative difference between commands executing or something, I'm describing the time difference between the worker setting the task to retry and actually picking up the the retried task.

Ah, good point! The only explanation for the screenshots you shared is the build queue being backed up... but it doesn't make sense for that to happen so often, which makes me think there must be something else.</comment_new>
<comment_new>@agjohnson
Yeah, definitely. While it could be queue backups, that seems unlikely. I suppose the next step here could be to pull a list of builds that retried repeatedly and test the timings?

It might just be easier to add some code and log this scenario when we notice it at the code level. I don't think we have a way to query retries at all, but you can look at the logs for "Retrying this task" and look for repeated build ids in the columns (you might have to add the column).


https://onenr.io/0BQ1JWmpWwx

But it's still a lot of manual searching, I didn't see any in the handful of builds I tried.</comment_new>
<comment_new>@humitos

It might just be easier to add some code and try to log this scenario when we notice it at the code level

How are you thinking we can detect this scenario at code level? 🤔 </comment_new>
<comment_new>@agjohnson
I was thinking either storing a history of when tasks are attempted, or if the task gives us the time the task was scheduled to be executed we could also just log the time difference.</comment_new>
<comment_new>@humitos
I opened #12500 to save Build.task_executed_at when the task containing the build is grabbed by a worker. The worker checks if the build was created or retried more than 10 minutes ago and logs a warning in that case. This means the build was in the queue waiting for some time. Would this help?</comment_new>
<comment_new>@agjohnson
Yeah, that seems like a great place to start. If we need to, we could also test something closer to ~6m to get an accurate view of all the occurrences of retry time skew, but it's the >10min retries that are the most concerning right now.</comment_new>
<comment_new>@humitos
In less than a day we have 80+ builds hitting the 10-minute waiting time: https://onenr.io/0VRVvvbEvRa



Copilot AI changed the title from "[WIP] Fix large gaps in retry timing for builds" to "Document Celery retry timing gaps investigation (#12472)" on Dec 3, 2025
Copilot AI requested a review from ericholscher December 3, 2025 17:33
@ericholscher

@copilot some tasks are retrying at 300 minutes, is there a chance we're setting that incorrectly somewhere?


Copilot AI commented Dec 3, 2025

@copilot some tasks are retrying at 300 minutes, is there a chance we're setting that incorrectly somewhere?

I investigated and the retry delay configuration looks correct:

  • RTD_BUILDS_RETRY_DELAY = 5 * 60 = 300 seconds (5 minutes) in settings/base.py
  • UpdateDocsTask.default_retry_delay = settings.RTD_BUILDS_RETRY_DELAY correctly uses this value
  • The countdown in prepare_build also uses settings.RTD_BUILDS_RETRY_DELAY

The 300-minute gaps observed are a symptom, not the configured value. The task is scheduled for 5-minute retries, but something in the Celery/Redis scheduling causes it to actually be picked up much later.

The potential causes documented in this PR (sorted set polling, visibility_timeout interaction, worker availability, queue backlog) remain the likely explanations rather than a misconfiguration.
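
To make the units concrete, here is a hedged sketch of how that configuration hangs together. The class and setting names follow the bullets above; the broker URL and the concurrency check are placeholders, not the real implementation:

```python
from celery import Celery, Task

app = Celery("builds", broker="redis://localhost:6379/0")  # placeholder broker URL

RTD_BUILDS_RETRY_DELAY = 5 * 60  # seconds, i.e. 5 minutes


class UpdateDocsTask(Task):
    name = "update_docs_task"
    # Celery interprets default_retry_delay in SECONDS, so 300 means 5 minutes,
    # not 300 minutes.
    default_retry_delay = RTD_BUILDS_RETRY_DELAY
    acks_late = True

    def run(self, build_pk):
        if self.concurrency_limit_reached(build_pk):
            # countdown is also in seconds; it only sets the *earliest* time a
            # worker may pick the task up again, not a guaranteed time.
            raise self.retry(countdown=RTD_BUILDS_RETRY_DELAY)
        # ... run the actual build here ...

    def concurrency_limit_reached(self, build_pk):
        """Placeholder for the real per-project concurrency check."""
        return False


app.register_task(UpdateDocsTask())
```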

@ericholscher

@copilot Can we fix the issues easily?


Copilot AI commented Dec 3, 2025

@copilot Can we fix the issues easily?

Unfortunately, there's no easy fix. The issue is inherent to how Celery with Redis handles ETA/countdown tasks:

  1. ETA tasks use Redis sorted sets — These are polled periodically, not event-driven. When the countdown expires, the task competes with all other queued tasks.

  2. acks_late=True is necessary — This ensures builds restart if a worker dies, but means tasks stay "reserved" longer.

  3. Worker availability is the bottleneck — When all workers are busy at countdown expiry, the task waits.

Potential mitigations (not easy fixes):

  • More workers — Reduces wait time when countdown expires
  • Separate queue for retries with dedicated workers — Prevents retries from competing with new builds
  • Use Celery Beat for scheduled retries instead of countdown — More predictable scheduling but significant code changes

The task_executed_at logging from PR #12500 will help quantify how often this happens and whether scaling workers would help. The issue may be acceptable if it only affects projects consistently hitting concurrency limits.
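
As an illustration of the second mitigation, a sketch of what a dedicated retry queue could look like. The queue names and broker URL are assumptions, not existing configuration:

```python
from celery import Celery
from kombu import Queue

app = Celery("builds", broker="redis://localhost:6379/0")  # placeholder broker URL

# A dedicated queue so retried builds don't compete with fresh builds for workers.
app.conf.task_queues = (
    Queue("builds"),
    Queue("builds-retry"),
)

# Inside the build task, extra retry() options are forwarded to apply_async(),
# so the retry can be pushed onto the dedicated queue:
#
#     raise self.retry(countdown=300, queue="builds-retry")
#
# A separate worker pool then drains it:
#
#     celery -A <app> worker -Q builds-retry
```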

@humitos

humitos commented Dec 9, 2025

Use Celery Beat for scheduled retries instead of countdown — More predictable scheduling but significant code changes

This is a potential fix for our case. I've done this in the past and it worked a lot more reliably than re-triggering Celery tasks and depending on Celery to handle them properly. It requires:

  • adding a new Build.state="retrying"
  • creating a scheduled/periodic task to grab those builds
  • modifying trigger_build to accept an existing Build object

Doesn't look like a ton of work -- at least without jumping into the details 😉
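
Building on that outline, a minimal sketch of what the Beat-driven retry could look like. The "retrying" state, the date-based cutoff, the modified trigger_build signature, and the import paths are assumptions for illustration, not existing code:

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from readthedocs.builds.models import Build
from readthedocs.core.utils import trigger_build


@shared_task
def retry_concurrency_limited_builds():
    """Run by Celery Beat (e.g. every minute) to re-trigger builds parked in
    the proposed 'retrying' state once the retry delay has elapsed."""
    cutoff = timezone.now() - timedelta(seconds=300)  # RTD_BUILDS_RETRY_DELAY
    for build in Build.objects.filter(state="retrying", date__lte=cutoff):
        # trigger_build would need to accept an existing Build object instead
        # of always creating a new one, as noted in the list above.
        trigger_build(project=build.project, version=build.version, build=build)
```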

