[detect_move] fix: reset collection status on repo move to prevent scheduler stall#3802

Open
mn-ram wants to merge 3 commits into chaoss:main from mn-ram:fix/core-scheduling-blocked-by-move-retries

Conversation

@mn-ram mn-ram commented Mar 27, 2026

Description

When detect_github_repo_move_core detects a 301-redirected repository, it previously raised Retry(). This left core_status stuck in COLLECTING in the collection_status table. augur_collection_monitor uses get_active_repo_count to measure how many repos are currently collecting and enforces max_repo=40. Once all 40 slots were occupied by pending retries from detect_github_repo_move_core, the scheduler stopped dispatching any new Core collection work; in practice, recovery required a manual container restart.
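The stall mechanics can be sketched as a toy model (names and the slot-counting logic here are simplified illustrations, not Augur's actual code):

```python
# Toy model of the scheduler stall: repos stuck in 'Collecting' consume
# all max_repo slots, so nothing new is ever dispatched.
from dataclasses import dataclass

MAX_REPO = 40  # augur_collection_monitor's slot limit, per the PR description

@dataclass
class Repo:
    repo_id: int
    core_status: str  # 'Pending', 'Collecting', 'Success', or 'Error'

def get_active_repo_count(repos):
    # Rows in 'Collecting' count against the slot limit.
    return sum(1 for r in repos if r.core_status == "Collecting")

def dispatchable_slots(repos):
    return MAX_REPO - get_active_repo_count(repos)

# 40 moved repos held in 'Collecting' by pending retries, plus one repo
# that genuinely needs collection:
repos = [Repo(i, "Collecting") for i in range(40)] + [Repo(100, "Pending")]
print(dispatchable_slots(repos))  # 0 -> the Pending repo is never dispatched

# After the fix: statuses are reset before Reject, freeing the slots.
for r in repos[:40]:
    r.core_status = "Pending"
print(dispatchable_slots(repos))  # 40
```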

This PR fixes the root cause: ping_github_for_repo_move now resets the relevant hook's status (to Pending if no prior data, or Success if data has been collected before) and clears the task ID before raising RepoMovedException, respecting the DB constraints defined in CollectionStatus. Both detect_github_repo_move_core and detect_github_repo_move_secondary are changed to raise Reject instead of Retry, freeing the slot immediately. The scheduler then picks up the repo on the next cycle under its updated URL.

This PR fixes #3667

Notes for Reviewers

The key constraint at play is core_data_last_collected_check:

  • NOT (core_data_last_collected IS NOT NULL AND core_status = 'Pending') — can't set Pending if data already collected
  • NOT (core_task_id IS NOT NULL AND core_status IN ('Pending', 'Success', 'Error')) — must clear task_id before changing status

The fix handles both cases (first-time collection vs. recollection) to avoid violating either constraint. The collection_hook parameter already present in ping_github_for_repo_move is used to target the correct status columns, so both core and secondary tasks are handled correctly with no code duplication.
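The interplay of the two constraints can be reproduced on an in-memory SQLite table. This is only an illustrative sketch: the real constraints live in Augur's Postgres collection_status table, and the column names below simply follow the PR description.

```python
# Demonstrate why task_id must be cleared and why 'Pending' is invalid
# once data has been collected, mirroring core_data_last_collected_check.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collection_status (
        repo_id INTEGER PRIMARY KEY,
        core_status TEXT,
        core_task_id TEXT,
        core_data_last_collected TEXT,
        CHECK (NOT (core_data_last_collected IS NOT NULL
                    AND core_status = 'Pending')),
        CHECK (NOT (core_task_id IS NOT NULL
                    AND core_status IN ('Pending', 'Success', 'Error')))
    )
""")
conn.execute(
    "INSERT INTO collection_status VALUES (1, 'Collecting', 'task-1', '2026-03-27')"
)

# Naively resetting to 'Pending' violates the first constraint, because
# core_data_last_collected is already set:
try:
    conn.execute(
        "UPDATE collection_status "
        "SET core_status = 'Pending', core_task_id = NULL WHERE repo_id = 1"
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# The fix's order of operations: clear task_id, then choose the status
# based on whether data was collected before.
conn.execute("""
    UPDATE collection_status
    SET core_task_id = NULL,
        core_status = CASE WHEN core_data_last_collected IS NOT NULL
                           THEN 'Success' ELSE 'Pending' END
    WHERE repo_id = 1
""")
print(conn.execute("SELECT core_status FROM collection_status").fetchone()[0])
```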

Signed commits

  • Yes, I signed my commits.

Changeset

  • augur/tasks/github/detect_move/core.py: in ping_github_for_repo_move, reset the hook's collection status and clear its task_id atomically before raising RepoMovedException on a 301 response
  • augur/tasks/github/detect_move/tasks.py: replace raise Retry(e.new_url) with raise Reject(e) in the core task; add consistent try/except to the secondary task; remove unused Retry import

Notes

The previous raise Retry(e.new_url) was also semantically wrong: Celery's Retry exception does not forward positional args to the retried task invocation, so the task would retry with the original (now-stale) repo_git and crash, eventually triggering on_failure and an Error state. The retry_errored_repos periodic task would then reset it to Pending — but only once per day, meaning repos with moved URLs could be stuck in Error for up to 24 hours in addition to blocking the scheduler.

Related issues/PRs

@mn-ram mn-ram requested a review from sgoggins as a code owner March 27, 2026 20:07
mn-ram added 2 commits March 28, 2026 01:42
…rying

When detect_github_repo_move_core found a 301-redirected repo, it raised
Retry() which left core_status stuck in COLLECTING. augur_collection_monitor
counts COLLECTING rows against max_repo (40). Once all 40 slots were occupied
by pending retries, the scheduler dispatched no new work until each retry
eventually failed and on_failure reset the status to Error.

Fix ping_github_for_repo_move to reset the hook's status to Pending (no prior
collection) or Success (prior data exists) and clear the task_id before raising
RepoMovedException. Change both detect_github_repo_move_core and
detect_github_repo_move_secondary to raise Reject instead of Retry so the slot
is freed immediately and the next scheduler cycle picks up the repo under its
updated URL without constraint violations.

Fixes chaoss#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram force-pushed the fix/core-scheduling-blocked-by-move-retries branch from 7ee8e1d to 5193257 on March 27, 2026 20:12
Comment on lines +119 to +132
# Reset status so the scheduler re-queues the repo under the new URL.
status_field = f"{collection_hook}_status"
task_id_field = f"{collection_hook}_task_id"
last_collected_field = f"{collection_hook}_data_last_collected"

statusQuery = session.query(CollectionStatus).filter(CollectionStatus.repo_id == repo.repo_id)
collectionRecord = execute_session_query(statusQuery, 'one')
setattr(collectionRecord, task_id_field, None)
if getattr(collectionRecord, last_collected_field) is not None:
    setattr(collectionRecord, status_field, CollectionState.SUCCESS.value)
else:
    setattr(collectionRecord, status_field, CollectionState.PENDING.value)
session.commit()

Contributor

I really don't think we should be changing the task status as a solution to this problem.

The whole reason for calling retry is to cause the task to start over with the updated repo name, so that all the downstream tasks are run on the correct, newly updated repository name.

When a 301 redirect is detected, the repo URL is updated in the database.
Downstream collection tasks can continue under the old URL since GitHub
will redirect remaining API requests, and any new collection requests will
use the updated URL directly.

Removing the retry eliminates the pile-up of COLLECTING slots that blocked
augur_collection_monitor from dispatching new work once all max_repo slots
were occupied by pending retries.

Fixes chaoss#3667

Signed-off-by: mn-ram <235066282+mn-ram@users.noreply.github.com>
@mn-ram mn-ram requested a review from MoralCode March 27, 2026 21:16

Successfully merging this pull request may close these issues.

all Core tasks stop getting scheduled if there are more than 40 repos that need renaming