Improve celery rate limit and concurrency handling #22189
Open
mvdbeek wants to merge 2 commits into galaxyproject:dev
The existing rate-limit implementation re-reserved a new DB timeslot on every retry, causing cascading delays where a single task could push its own execution further and further into the future. Fix this by reserving the timeslot once on first attempt and storing it in a Celery message header, so retries simply wait for their already-reserved slot.

Additionally, introduce a new per-user concurrency limit (`celery_user_concurrency_limit`) that caps how many Celery tasks can execute simultaneously for a single user. This prevents one user from monopolizing all available worker capacity.

Implementation details:
- Rate limit: on first attempt, atomically reserve the next available timeslot in `celery_user_rate_limit` table. Store the reserved time in a `_gxy_rate_limit_scheduled_time` message header. On retry, read the header and wait until the timeslot arrives. Use `max_retries=None` so tasks are never dropped.
- Concurrency limit: before a task starts, count active tasks for the user in a new `celery_user_active_task` tracking table. If at the limit, defer via `task.retry(countdown=5)`. On task completion (success or failure), delete the tracking row via an `after_return` hook. A periodic beat task cleans up stale rows from crashed workers by cross-referencing `inspect().active()`.
- Both features are independently configurable and composable via `GalaxyTaskBeforeStartCombined` which chains hooks in order.
- New DB migration adds the `celery_user_active_task` table.
- Integration tests cover concurrency admission, cleanup, multi-user isolation, and stale row recovery.
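The reserve-once behaviour described in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: `SlotTable` is a hypothetical in-memory stand-in for the `celery_user_rate_limit` table, `before_start` is a plain function rather than a Celery hook, and the header is modelled as an ordinary dict. Only the header name comes from the commit message.

```python
from typing import Dict, Optional

# Header name taken from the commit message; everything else here is illustrative.
HEADER = "_gxy_rate_limit_scheduled_time"


class SlotTable:
    """Hypothetical in-memory stand-in for the celery_user_rate_limit table."""

    def __init__(self, interval: float) -> None:
        self.interval = interval  # assumed: seconds between two slots of one user
        self.last_scheduled: Dict[int, float] = {}

    def reserve(self, user_id: int, now: float) -> float:
        # Next slot is one interval after the user's previous slot, never in the past.
        # A real implementation would do this atomically in the database.
        previous = self.last_scheduled.get(user_id)
        slot = now if previous is None else max(now, previous + self.interval)
        self.last_scheduled[user_id] = slot
        return slot


def before_start(table: SlotTable, user_id: int, headers: dict, now: float) -> Optional[float]:
    """Return None if the task may run now, else the number of seconds to wait."""
    if HEADER not in headers:
        # First attempt: reserve a slot exactly once and remember it in the headers.
        headers[HEADER] = table.reserve(user_id, now)
    # Retries land here with the header already set, so no new slot is reserved.
    wait = headers[HEADER] - now
    return wait if wait > 0 else None
```

Because a retry only reads the stored time, repeated calls with the same `headers` dict leave `table.last_scheduled` untouched, which is the property that avoids the cascading delay.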
…n guide

Add comprehensive admin documentation covering both celery task throttling mechanisms:
- Per-user rate limiting: configuration, two-phase slot reservation design (reserve once, retry until timeslot), DB backend differences (Postgres atomic upsert vs standard SELECT FOR UPDATE), and limitations (clock precision, no priority ordering, slot consumption on failure).
- Per-user concurrency limiting: configuration, admission control flow, after_return cleanup, periodic stale-row recovery via worker inspection, and limitations (retry polling interval, crash recovery window, DB overhead at scale).
- Combined usage: explains that rate limiting runs first (timeslot scheduling) then concurrency limiting (execution gating).
- Administrative operations: SQL recipes for clearing leaked slots and celery CLI commands for purging/revoking tasks.
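The concurrency admission flow, `after_return` cleanup, and stale-row recovery listed above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it uses an in-memory SQLite database, the column layout of `celery_user_active_task` is guessed, the function names `try_admit`, `release`, and `cleanup_stale` are hypothetical, and the set of live task ids is passed in directly instead of being read from `inspect().active()`.

```python
import sqlite3
from typing import Iterable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Table name from the PR; the columns are an assumption for this sketch.
    conn.execute(
        "CREATE TABLE celery_user_active_task ("
        "task_id TEXT PRIMARY KEY, user_id INTEGER NOT NULL)"
    )
    return conn


def try_admit(conn: sqlite3.Connection, task_id: str, user_id: int, limit: int) -> bool:
    """Admission control: track the task and return True if the user is below the limit."""
    if limit <= 0:
        return True  # 0 means the feature is disabled
    with conn:  # commits on success, rolls back on error
        (active,) = conn.execute(
            "SELECT COUNT(*) FROM celery_user_active_task "
            "WHERE user_id = ? AND task_id != ?",
            (user_id, task_id),
        ).fetchone()
        if active >= limit:
            return False  # the caller would defer, e.g. task.retry(countdown=5)
        conn.execute(
            "INSERT OR IGNORE INTO celery_user_active_task (task_id, user_id) VALUES (?, ?)",
            (task_id, user_id),
        )
    return True


def release(conn: sqlite3.Connection, task_id: str) -> None:
    """Cleanup on task completion, whether it succeeded or failed."""
    with conn:
        conn.execute("DELETE FROM celery_user_active_task WHERE task_id = ?", (task_id,))


def cleanup_stale(conn: sqlite3.Connection, live_task_ids: Iterable[str]) -> int:
    """Delete tracking rows for tasks that no worker reports as active."""
    live = set(live_task_ids)
    with conn:
        tracked = [row[0] for row in conn.execute("SELECT task_id FROM celery_user_active_task")]
        stale = [(task_id,) for task_id in tracked if task_id not in live]
        conn.executemany("DELETE FROM celery_user_active_task WHERE task_id = ?", stale)
    return len(stale)
```

The count-then-insert step in this sketch is not safe against concurrent workers on its own; a production version would need locking or an atomic statement on the real database.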
jmchilton (Member) reviewed on Mar 19, 2026 and left a comment:
I'm worried this will be difficult to debug but I'm sure the server getting overwhelmed is much more difficult to debug. This is very impressive.
davelopez (Contributor) reviewed on Mar 20, 2026 and left a comment:
Pretty cool indeed!
Need to run black and maybe make config-rebuild
Summary
- Rate limiting fix: Previously, a new DB timeslot was re-reserved on every attempt. Now the timeslot is reserved once and stored in a Celery message header so retries simply wait for the already-reserved slot.
- Concurrency limiting (`celery_user_concurrency_limit`): Caps how many tasks can run simultaneously for a single user, preventing one user from monopolizing all worker capacity. Uses a `celery_user_active_task` tracking table with admission control in `before_start`, cleanup in `after_return`, and a periodic beat task to reclaim stale slots from crashed workers.

Changes
Rate limiting fix
- Timeslot reserved once and stored in the `_gxy_rate_limit_scheduled_time` message header
- `max_retries=None` so rate-limited tasks are never dropped

Concurrency limiting (new)
- `celery_user_concurrency_limit` (default `0` = disabled)
- `before_start` hook counts active tasks per user; defers via `task.retry(countdown=5)` if at limit
- `after_return` hook deletes tracking row on task completion (success or failure)
- `cleanup_stale_concurrency_slots` beat task (every 5 min) reclaims slots from crashed workers by cross-referencing `inspect().active()`
- `celery_user_active_task` table + Alembic migration

Documentation