[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802

Open
mn-ram wants to merge 3 commits into chaoss:main from mn-ram:fix/core-scheduling-blocked-by-move-retries

Conversation

@mn-ram mn-ram commented Mar 27, 2026

Description

When detect_github_repo_move_core detects a 301-redirected repository, it previously raised Retry(). This left core_status stuck in COLLECTING in the collection_status table. augur_collection_monitor uses get_active_repo_count to measure how many repos are currently collecting and enforces max_repo=40. Once all 40 slots were occupied by pending retries from detect_github_repo_move_core, the scheduler stopped dispatching any new Core collection work; in practice, recovery required a manual container restart.
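The stall mechanics can be sketched as a toy model (names and the slot-counting logic here are simplified illustrations, not Augur's actual code):

```python
# Toy model of the scheduler stall: repos stuck in 'Collecting' consume
# all max_repo slots, so nothing new is ever dispatched.
from dataclasses import dataclass

MAX_REPO = 40  # augur_collection_monitor's slot limit, per the PR description

@dataclass
class Repo:
    repo_id: int
    core_status: str  # 'Pending', 'Collecting', 'Success', or 'Error'

def get_active_repo_count(repos):
    # Rows in 'Collecting' count against the slot limit.
    return sum(1 for r in repos if r.core_status == "Collecting")

def dispatchable_slots(repos):
    return MAX_REPO - get_active_repo_count(repos)

# 40 moved repos held in 'Collecting' by pending retries, plus one repo
# that genuinely needs collection:
repos = [Repo(i, "Collecting") for i in range(40)] + [Repo(100, "Pending")]
print(dispatchable_slots(repos))  # 0 -> the Pending repo is never dispatched

# After the fix: statuses are reset before Reject, freeing the slots.
for r in repos[:40]:
    r.core_status = "Pending"
print(dispatchable_slots(repos))  # 40
```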

This PR fixes the root cause: ping_github_for_repo_move now resets the relevant hook's status (to Pending if no prior data, or Success if data has been collected before) and clears the task ID before raising RepoMovedException, respecting the DB constraints defined in CollectionStatus. Both detect_github_repo_move_core and detect_github_repo_move_secondary are changed to raise Reject instead of Retry, freeing the slot immediately. The scheduler then picks up the repo on the next cycle under its updated URL.

This PR fixes #3667

Notes for Reviewers

The key constraint at play is core_data_last_collected_check:

  • NOT (core_data_last_collected IS NOT NULL AND core_status = 'Pending') — can't set Pending if data already collected
  • NOT (core_task_id IS NOT NULL AND core_status IN ('Pending', 'Success', 'Error')) — must clear task_id before changing status

The fix handles both cases (first-time collection vs. recollection) to avoid violating either constraint. The collection_hook parameter already present in ping_github_for_repo_move is used to target the correct status columns, so both core and secondary tasks are handled correctly with no code duplication.
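The interplay of the two constraints can be reproduced on an in-memory SQLite table. This is only an illustrative sketch: the real constraints live in Augur's Postgres collection_status table, and the column names below simply follow the PR description.

```python
# Demonstrate why task_id must be cleared and why 'Pending' is invalid
# once data has been collected, mirroring core_data_last_collected_check.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collection_status (
        repo_id INTEGER PRIMARY KEY,
        core_status TEXT,
        core_task_id TEXT,
        core_data_last_collected TEXT,
        CHECK (NOT (core_data_last_collected IS NOT NULL
                    AND core_status = 'Pending')),
        CHECK (NOT (core_task_id IS NOT NULL
                    AND core_status IN ('Pending', 'Success', 'Error')))
    )
""")
conn.execute(
    "INSERT INTO collection_status VALUES (1, 'Collecting', 'task-1', '2026-03-27')"
)

# Naively resetting to 'Pending' violates the first constraint, because
# core_data_last_collected is already set:
try:
    conn.execute(
        "UPDATE collection_status "
        "SET core_status = 'Pending', core_task_id = NULL WHERE repo_id = 1"
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# The fix's order of operations: clear task_id, then choose the status
# based on whether data was collected before.
conn.execute("""
    UPDATE collection_status
    SET core_task_id = NULL,
        core_status = CASE WHEN core_data_last_collected IS NOT NULL
                           THEN 'Success' ELSE 'Pending' END
    WHERE repo_id = 1
""")
print(conn.execute("SELECT core_status FROM collection_status").fetchone()[0])
```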

Signed commits

  • Yes, I signed my commits.

Changeset

  • augur/tasks/github/detect_move/core.py: in ping_github_for_repo_move, reset the hook's collection status and clear its task_id atomically before raising RepoMovedException on a 301 response
  • augur/tasks/github/detect_move/tasks.py: replace raise Retry(e.new_url) with raise Reject(e) in the core task; add consistent try/except to the secondary task; remove unused Retry import

Notes

The previous raise Retry(e.new_url) was also semantically wrong: Celery's Retry exception does not forward positional args to the retried task invocation, so the task would retry with the original (now-stale) repo_git and crash, eventually triggering on_failure and an Error state. The retry_errored_repos periodic task would then reset it to Pending — but only once per day, meaning repos with moved URLs could be stuck in Error for up to 24 hours in addition to blocking the scheduler.

Related issues/PRs

@mn-ram mn-ram requested a review from sgoggins as a code owner March 27, 2026 20:07
mn-ram added 2 commits March 28, 2026 01:42
…rying

When detect_github_repo_move_core found a 301-redirected repo, it raised
Retry() which left core_status stuck in COLLECTING. augur_collection_monitor
counts COLLECTING rows against max_repo (40). Once all 40 slots were occupied
by pending retries, the scheduler dispatched no new work until each retry
eventually failed and on_failure reset the status to Error.

Fix ping_github_for_repo_move to reset the hook's status to Pending (no prior
collection) or Success (prior data exists) and clear the task_id before raising
RepoMovedException. Change both detect_github_repo_move_core and
detect_github_repo_move_secondary to raise Reject instead of Retry so the slot
is freed immediately and the next scheduler cycle picks up the repo under its
updated URL without constraint violations.

Fixes chaoss#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 7ee8e1d to 5193257 on March 27, 2026 20:12
Comment on lines +119 to +132
# Reset status so the scheduler re-queues the repo under the new URL.
status_field = f"{collection_hook}_status"
task_id_field = f"{collection_hook}_task_id"
last_collected_field = f"{collection_hook}_data_last_collected"

statusQuery = session.query(CollectionStatus).filter(CollectionStatus.repo_id == repo.repo_id)
collectionRecord = execute_session_query(statusQuery, 'one')
setattr(collectionRecord, task_id_field, None)
if getattr(collectionRecord, last_collected_field) is not None:
    setattr(collectionRecord, status_field, CollectionState.SUCCESS.value)
else:
    setattr(collectionRecord, status_field, CollectionState.PENDING.value)
session.commit()

Contributor

I really don't think we should be changing the task status as a solution to this problem.

The whole reason for calling retry is to cause the task to start over with the updated repo name, so that all the downstream tasks are run on the correct, newly updated repository name.

When a 301 redirect is detected, the repo URL is updated in the database.
Downstream collection tasks can continue under the old URL since GitHub
will redirect remaining API requests, and any new collection requests will
use the updated URL directly.

Removing the retry eliminates the pile-up of COLLECTING slots that blocked
augur_collection_monitor from dispatching new work once all max_repo slots
were occupied by pending retries.

Fixes chaoss#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram requested a review from MoralCode March 27, 2026 21:16

Successfully merging this pull request may close these issues.

all Core tasks stop getting scheduled if there are more than 40 repos that need renaming