Document Celery retry timing gaps investigation (#12472) #12631
Conversation
…12472) Co-authored-by: ericholscher <[email protected]>
@copilot some tasks are retrying at 300 minutes, is there a chance we're setting that incorrectly somewhere?
I investigated, and the retry delay configuration looks correct:
The 300-minute gaps observed are the symptom, not the configured value. The task is being scheduled with a 5-minute retry delay, but something in the Celery/Redis scheduling causes it to actually be picked up much later. The potential causes documented in this PR (sorted set polling, visibility_timeout interaction, worker availability, queue backlog) remain the likely explanations rather than a misconfiguration.
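For reference, a minimal sketch (not the actual Read the Docs code) of how a 5-minute retry delay is typically wired into a Celery task. The helpers are hypothetical, and the way `RTD_BUILDS_RETRY_DELAY` is read here is an assumption for illustration only:

```python
from celery import shared_task
from django.conf import settings

# Assumed default of 300 seconds (5 minutes); the setting name comes from
# this PR's description, but how it is consumed here is illustrative.
RETRY_DELAY = getattr(settings, "RTD_BUILDS_RETRY_DELAY", 300)


@shared_task(bind=True, acks_late=True)
def update_docs_task(self, build_pk):
    if project_hit_concurrency_limit(build_pk):  # hypothetical helper
        # Celery only *schedules* the retry ~5 minutes out; when a worker
        # actually picks it up again is where the 30m-5h gaps show up.
        raise self.retry(countdown=RETRY_DELAY)
    run_build(build_pk)  # hypothetical helper
```

In other words, `countdown=300` controls when the task becomes eligible to run again, not when a worker will actually execute it.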
@copilot Can we fix the issues easily?
Unfortunately, there's no easy fix. The issue is inherent to how Celery with Redis handles ETA/countdown tasks.
There are potential mitigations, but no easy fixes.
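For context, the Celery documentation warns that with the Redis broker, countdown/ETA tasks interact badly with `visibility_timeout`: a task that stays unacknowledged past the timeout is redelivered to another worker, and ETA/countdown/retry tasks that reach past that window can be redelivered repeatedly. A sketch of the transport options this PR documents; the value shown is the Redis transport's usual default and is an assumption, not the production configuration:

```python
# Sketch of the broker transport options referenced in this PR.
BROKER_TRANSPORT_OPTIONS = {
    # A task delivered to a worker but not yet acknowledged is put back on
    # the queue after this many seconds. With acks_late=True, a build isn't
    # acknowledged until it finishes, and a countdown/ETA retry stays
    # unacknowledged until its ETA, so retries scheduled (or still running)
    # past this window can be redelivered and reshuffled, producing
    # irregular gaps between the scheduled and actual retry time.
    "visibility_timeout": 3600,  # assumed value (Redis transport default)
}
```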
This is a potential fix for our case. I've done this in the past, and it worked a lot more reliably than re-triggering Celery tasks and depending on Celery to handle them properly. It requires
Doesn't look like a ton of work -- at least without jumping into the details 😉
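The comment above is truncated, but one common shape for this kind of approach, periodically re-checking state instead of relying on Celery countdown retries, is a beat-scheduled task. Everything below (import paths, field names, states, thresholds, the task signature) is a hypothetical illustration, not the proposal from the thread:

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone

# Assumed import paths, for illustration only.
from readthedocs.builds.models import Build
from readthedocs.projects.tasks.builds import update_docs_task


@shared_task(queue="web")
def requeue_stuck_builds():
    """Re-enqueue builds that were triggered but never picked up.

    Hypothetical: runs on a beat schedule every few minutes, so a stuck
    build waits at most one schedule interval instead of depending on the
    broker honoring a countdown/ETA.
    """
    cutoff = timezone.now() - timedelta(minutes=5)
    # The "triggered" state and `date` field are assumptions here.
    stuck = Build.objects.filter(state="triggered", date__lt=cutoff)
    for build in stuck.iterator():
        update_docs_task.delay(build.pk)  # assumed task entry point/signature
```

The trade-off is an extra periodic query against the database versus trusting the broker to deliver delayed tasks on time.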
Build retries show unexplained large gaps (30m-5h) instead of the expected 5m delay. This documents the investigation findings and potential causes in the Celery configuration.
Investigation Summary
Potential causes for retry timing gaps with Redis broker + `acks_late=True`: sorted set polling, `visibility_timeout` interaction, worker availability, and queue backlog.
Changes
- `BROKER_TRANSPORT_OPTIONS`: explaining the `visibility_timeout` interaction with `acks_late` and known timing gap causes
- `RTD_BUILDS_RETRY_DELAY`: behavior and a reference to issue "Builds: retries sometimes show large gaps in retry timing" #12472
- `UpdateDocsTask`: noting that the `task_executed_at` field logs warnings when gaps exceed 10 minutes
The `Build.task_executed_at` field (from PR #12500) continues to track actual vs. scheduled execution time for ongoing debugging.
Original prompt
This section details the original issue you should resolve
<issue_title>Builds: retries sometimes show large gaps in retry timing</issue_title>
<issue_description>See parent for more background, but the summary of the issue is that concurrency-limited builds don't execute retry tasks on a tight schedule (the default is 300s). Instead, the gap between retries ranges anywhere from 5m to 30m, and I've seen 5-hour gaps.
Build task queue contention might be able to explain away a 10 min or maybe 30 min gap, but not consistent gaps and certainly not a 5 hour gap. To me, this feels like something else is happening here.</issue_description>
<agent_instructions>Look into why this might be happening with our celery configuration. </agent_instructions>
Comments on the Issue (you are @copilot in this section)
@humitos The big gaps are not because of queue contention; they happen on projects hitting the concurrency limit constantly. When the build is marked to be retried 5 minutes in the future, it's executed at that time, but due to the concurrency limit on the project, it's retried again. That happens constantly, and then you get these big gaps. I've only noticed these gaps, but they were always on these types of projects.</comment_new>
<comment_new>@agjohnson
Not sure I follow. How does a task get a 30-minute or 5-hour gap in between retry tasks then? The task needs to be picked up by a worker at retry to evaluate the concurrency limit.
I'm not describing the cumulative difference between commands executing or something; I'm describing the time difference between the worker setting the task to retry and actually picking up the retried task.
https://onenr.io/0BQ1J0aJ5wx
https://onenr.io/0Zw0J0YrNjv
These are just the two builds I noted before, there are many others and logging has the data on what is actually happening here.</comment_new>
<comment_new>@humitos
Ah, good point! The only explanation for the screenshot you shared is that the build queue was backed up... but it doesn't make sense for that to happen so often, which makes me think there should be something else.</comment_new>
<comment_new>@agjohnson
Yeah, definitely. While it could be queue backups, that seems unlikely. I suppose the next step here could be to pull a list of builds that retried repeatedly and test the timings?
It might just be easier to add some code and try to log this scenario when we notice it at the code level. I don't think we have a way to query retries at all, but you can look at the logs for `Retrying this task` and look for repeat build ids in the columns (you might have to add the column): https://onenr.io/0BQ1JWmpWwx
But it's still a lot of manual searching; I didn't see any in the handful of builds I tried.</comment_new>
<comment_new>@humitos
How are you thinking we can detect this scenario at the code level? 🤔 </comment_new>
<comment_new>@agjohnson
I was thinking of either storing a history of when tasks are attempted, or, if the task gives us the time it was scheduled to be executed, just logging the time difference.</comment_new>
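A minimal sketch of the second idea, logging the gap between the scheduled and actual execution time, assuming structured logging via structlog. The function and threshold names are hypothetical; PR #12500 (discussed below) implements the real version via `Build.task_executed_at`:

```python
import datetime

import structlog
from django.utils import timezone

log = structlog.get_logger(__name__)

# Hypothetical threshold matching the 10-minute warning discussed below.
RETRY_GAP_WARNING = datetime.timedelta(minutes=10)


def log_retry_gap(build_pk, scheduled_for):
    """Warn when a build starts much later than its scheduled retry time."""
    executed_at = timezone.now()
    gap = executed_at - scheduled_for
    if gap > RETRY_GAP_WARNING:
        log.warning(
            "Build picked up much later than scheduled.",
            build_pk=build_pk,
            scheduled_for=scheduled_for.isoformat(),
            executed_at=executed_at.isoformat(),
            gap_seconds=int(gap.total_seconds()),
        )
```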
<comment_new>@humitos
I opened #12500 to save `Build.task_executed_at` when the task containing the build is grabbed by a worker. The worker checks if the build was created or retried more than 10 minutes ago, and logs a warning in that case. This means the build was waiting in the queue for some time. Would this help?</comment_new>
<comment_new>@agjohnson
Yeah, that seems like a great place to start. If we need to, we could also test something closer to ~6m to get an accurate view of all the occurrences of retry time skew, but it's the >10min retries that are the most concerning right now.</comment_new>
<comment_new>@humitos
In less than a day, we have 80+ builds hitting the 10-minute waiting time: https://onenr.io/0VRVvvbEvRa