[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802
Open
mn-ram wants to merge 3 commits intochaoss:mainfrom
Open
[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802mn-ram wants to merge 3 commits intochaoss:mainfrom
mn-ram wants to merge 3 commits intochaoss:mainfrom
Conversation
…rying When detect_github_repo_move_core found a 301-redirected repo, it raised Retry() which left core_status stuck in COLLECTING. augur_collection_monitor counts COLLECTING rows against max_repo (40). Once all 40 slots were occupied by pending retries, the scheduler dispatched no new work until each retry eventually failed and on_failure reset the status to Error. Fix ping_github_for_repo_move to reset the hook's status to Pending (no prior collection) or Success (prior data exists) and clear the task_id before raising RepoMovedException. Change both detect_github_repo_move_core and detect_github_repo_move_secondary to raise Reject instead of Retry so the slot is freed immediately and the next scheduler cycle picks up the repo under its updated URL without constraint violations. Fixes chaoss#3667 Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
7ee8e1d to
5193257
Compare
MoralCode
requested changes
Mar 27, 2026
Comment on lines
+119
to
+132
| # Reset status so the scheduler re-queues the repo under the new URL. | ||
| status_field = f"{collection_hook}_status" | ||
| task_id_field = f"{collection_hook}_task_id" | ||
| last_collected_field = f"{collection_hook}_data_last_collected" | ||
|
|
||
| statusQuery = session.query(CollectionStatus).filter(CollectionStatus.repo_id == repo.repo_id) | ||
| collectionRecord = execute_session_query(statusQuery, 'one') | ||
| setattr(collectionRecord, task_id_field, None) | ||
| if getattr(collectionRecord, last_collected_field) is not None: | ||
| setattr(collectionRecord, status_field, CollectionState.SUCCESS.value) | ||
| else: | ||
| setattr(collectionRecord, status_field, CollectionState.PENDING.value) | ||
| session.commit() | ||
|
|
Contributor
There was a problem hiding this comment.
I really dont think we should be changing the task status things as a solution to this problem.
the whole reason for calling retry is to cause the task to start over with the updated repo name so that all the downstream tasks are run on the correct/newly updated repository name
When a 301 redirect is detected, the repo URL is updated in the database. Downstream collection tasks can continue under the old URL since GitHub will redirect remaining API requests, and any new collection requests will use the updated URL directly. Removing the retry eliminates the pile-up of COLLECTING slots that blocked augur_collection_monitor from dispatching new work once all max_repo slots were occupied by pending retries. Fixes chaoss#3667 Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
When
detect_github_repo_move_coredetects a 301-redirected repository, it previously raisedRetry(). This leftcore_statusstuck inCOLLECTINGin thecollection_statustable.augur_collection_monitorusesget_active_repo_countto measure how many repos are currently collecting and enforcesmax_repo=40. Once all 40 slots were occupied by pending retries fromdetect_github_repo_move_core, the scheduler stopped dispatching any new Core collection work — confirmed to require manual container restarts to recover.This PR fixes the root cause:
ping_github_for_repo_movenow resets the relevant hook's status (toPendingif no prior data, orSuccessif data has been collected before) and clears the task ID before raisingRepoMovedException, respecting the DB constraints defined inCollectionStatus. Bothdetect_github_repo_move_coreanddetect_github_repo_move_secondaryare changed to raiseRejectinstead ofRetry, freeing the slot immediately. The scheduler then picks up the repo on the next cycle under its updated URL.This PR fixes #3667
Notes for Reviewers
The key constraint at play is
core_data_last_collected_check:NOT (core_data_last_collected IS NOT NULL AND core_status = 'Pending')— can't set Pending if data already collectedNOT (core_task_id IS NOT NULL AND core_status IN ('Pending', 'Success', 'Error'))— must clear task_id before changing statusThe fix handles both cases (first-time collection vs. recollection) to avoid violating either constraint. The
collection_hookparameter already present inping_github_for_repo_moveis used to target the correct status columns, so both core and secondary tasks are handled correctly with no code duplication.Signed commits
Changeset
augur/tasks/github/detect_move/core.py: inping_github_for_repo_move, reset the hook's collection status and clear its task_id atomically before raisingRepoMovedExceptionon a 301 responseaugur/tasks/github/detect_move/tasks.py: replaceraise Retry(e.new_url)withraise Reject(e)in the core task; add consistent try/except to the secondary task; remove unusedRetryimportNotes
The previous
raise Retry(e.new_url)was also semantically wrong: Celery'sRetryexception does not forward positional args to the retried task invocation, so the task would retry with the original (now-stale)repo_gitand crash, eventually triggeringon_failureand anErrorstate. Theretry_errored_reposperiodic task would then reset it toPending— but only once per day, meaning repos with moved URLs could be stuck in Error for up to 24 hours in addition to blocking the scheduler.Related issues/PRs