
Improve celery rate limit and concurrency handling#22189

Open
mvdbeek wants to merge 2 commits into galaxyproject:dev from
mvdbeek:improve-celery-rate-limit-and-concurrnecy-handling

Conversation

@mvdbeek
Member

@mvdbeek mvdbeek commented Mar 19, 2026

Summary

  • Fix cascading delay bug in per-user Celery rate limiting: The existing implementation re-reserved a new DB timeslot on every retry, causing tasks to push their own execution further into the future with each
    attempt. Now the timeslot is reserved once and stored in a Celery message header so retries simply wait for the already-reserved slot.
  • Add per-user Celery task concurrency limiting (celery_user_concurrency_limit): Caps how many tasks can run simultaneously for a single user, preventing one user from monopolizing all worker capacity. Uses a
    celery_user_active_task tracking table with admission control in before_start, cleanup in after_return, and a periodic beat task to reclaim stale slots from crashed workers.
  • Both features are independently configurable and composable — when both are enabled, rate limiting runs first (timeslot scheduling) then concurrency limiting (execution gating).

Changes

Rate limiting fix

  • Reserve timeslot atomically in DB on first attempt only
  • Store reserved time in _gxy_rate_limit_scheduled_time message header
  • On retry, read header and wait until timeslot arrives (no re-reservation)
  • Use max_retries=None so rate-limited tasks are never dropped
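The reserve-once mechanism above can be sketched as a small helper. This is a minimal sketch, not the PR's implementation: `resolve_scheduled_time`, `reserve_slot`, and `seconds_until` are hypothetical names; only the `_gxy_rate_limit_scheduled_time` header name comes from the PR.

```python
import datetime

HEADER = "_gxy_rate_limit_scheduled_time"


def resolve_scheduled_time(headers, now, reserve_slot):
    """Return this task's reserved execution time, reserving at most once.

    On the first attempt, `reserve_slot` (a stand-in for the atomic DB
    reservation) is called and the result is pinned into the message
    headers; retries reuse the stored value instead of reserving again,
    which is what prevents the cascading-delay bug.
    """
    stored = headers.get(HEADER)
    if stored is not None:
        # Retry path: the slot was already reserved, just reuse it.
        return datetime.datetime.fromisoformat(stored)
    scheduled = reserve_slot(now)  # first attempt: reserve exactly once
    headers[HEADER] = scheduled.isoformat()
    return scheduled


def seconds_until(scheduled, now):
    """Countdown a retry should wait for; 0.0 once the slot has arrived."""
    return max(0.0, (scheduled - now).total_seconds())
```

A retrying task would pass the computed countdown to `task.retry(...)` and forward the unchanged headers so the reserved slot survives redelivery.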

Concurrency limiting (new)

  • New config option: celery_user_concurrency_limit (default 0 = disabled)
  • before_start hook counts active tasks per user; defers via task.retry(countdown=5) if at limit
  • after_return hook deletes tracking row on task completion (success or failure)
  • Periodic cleanup_stale_concurrency_slots beat task (every 5 min) reclaims slots from crashed workers by cross-referencing inspect().active()
  • New celery_user_active_task table + Alembic migration
  • Postgres and standard SQL backends supported
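The admission-control flow can be illustrated with an in-memory stand-in for the `celery_user_active_task` table; the `ConcurrencyGate` class and its method names are illustrative, not Galaxy's API.

```python
class ConcurrencyGate:
    """Per-user concurrency gate; a dict stands in for the tracking table."""

    def __init__(self, limit):
        self.limit = limit   # plays the role of celery_user_concurrency_limit
        self.active = {}     # task_id -> user_id ("tracking rows")

    def try_admit(self, task_id, user_id):
        """before_start: admit if the user is under the limit.

        Returns False when at the limit, in which case the caller would
        defer the task, e.g. via task.retry(countdown=5).
        """
        running = sum(1 for u in self.active.values() if u == user_id)
        if running >= self.limit:
            return False
        self.active[task_id] = user_id
        return True

    def release(self, task_id):
        """after_return: delete the tracking row on success or failure."""
        self.active.pop(task_id, None)
```

In the real implementation the count and insert must be atomic at the DB level (hence the Postgres vs. standard SQL code paths), which a plain dict glosses over.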

Documentation

  • Comprehensive production guide covering configuration, design, DB backend differences, limitations, and admin operations (SQL recipes for clearing leaked slots, celery CLI for purging/revoking tasks)
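The "clearing leaked slots" recipes presumably reduce to deleting tracking rows older than some cutoff. A hedged sketch, using SQLite for illustration: the table name `celery_user_active_task` comes from the PR, but the `created_time` column and function name are assumptions.

```python
import sqlite3


def clear_leaked_slots(conn, cutoff):
    """Delete tracking rows older than `cutoff` (ISO timestamp string).

    Returns the number of rows removed, i.e. the number of reclaimed slots.
    """
    cur = conn.execute(
        "DELETE FROM celery_user_active_task WHERE created_time < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```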

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek added 2 commits March 19, 2026 15:52
The existing rate-limit implementation re-reserved a new DB timeslot on
every retry, causing cascading delays where a single task could push its
own execution further and further into the future. Fix this by reserving
the timeslot once on first attempt and storing it in a Celery message
header, so retries simply wait for their already-reserved slot.

Additionally, introduce a new per-user concurrency limit
(`celery_user_concurrency_limit`) that caps how many Celery tasks can
execute simultaneously for a single user. This prevents one user from
monopolizing all available worker capacity.

Implementation details:
- Rate limit: on first attempt, atomically reserve the next available
  timeslot in `celery_user_rate_limit` table. Store the reserved time
  in a `_gxy_rate_limit_scheduled_time` message header. On retry, read
  the header and wait until the timeslot arrives. Use `max_retries=None`
  so tasks are never dropped.
- Concurrency limit: before a task starts, count active tasks for the
  user in a new `celery_user_active_task` tracking table. If at the
  limit, defer via `task.retry(countdown=5)`. On task completion
  (success or failure), delete the tracking row via an `after_return`
  hook. A periodic beat task cleans up stale rows from crashed workers
  by cross-referencing `inspect().active()`.
- Both features are independently configurable and composable via
  `GalaxyTaskBeforeStartCombined` which chains hooks in order.
- New DB migration adds the `celery_user_active_task` table.
- Integration tests cover concurrency admission, cleanup, multi-user
  isolation, and stale row recovery.
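The hook chaining mentioned above can be sketched in a few lines. This is only an illustration of the ordering idea; the actual `GalaxyTaskBeforeStartCombined` signature and behavior may differ.

```python
class BeforeStartCombined:
    """Chain several before_start hooks; order matters.

    With both features enabled, the rate-limit hook (timeslot scheduling)
    runs before the concurrency hook (execution gating).
    """

    def __init__(self, *hooks):
        self.hooks = hooks

    def __call__(self, task, task_id, args, kwargs):
        for hook in self.hooks:
            hook(task, task_id, args, kwargs)
```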
…n guide

Add comprehensive admin documentation covering both celery task
throttling mechanisms:

- Per-user rate limiting: configuration, two-phase slot reservation
  design (reserve once, retry until timeslot), DB backend differences
  (Postgres atomic upsert vs standard SELECT FOR UPDATE), and
  limitations (clock precision, no priority ordering, slot consumption
  on failure).
- Per-user concurrency limiting: configuration, admission control flow,
  after_return cleanup, periodic stale-row recovery via worker
  inspection, and limitations (retry polling interval, crash recovery
  window, DB overhead at scale).
- Combined usage: explains that rate limiting runs first (timeslot
  scheduling) then concurrency limiting (execution gating).
- Administrative operations: SQL recipes for clearing leaked slots and
  celery CLI commands for purging/revoking tasks.
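The stale-row recovery step amounts to a set difference between tracked rows and what `inspect().active()` reports. A sketch, assuming only the documented shape of Celery's active-task mapping (worker name to a list of task dicts with an `"id"` key); the function name is hypothetical.

```python
def find_stale_slots(tracked_task_ids, worker_active):
    """Return tracked task ids that no live worker reports as running.

    `worker_active` mirrors the return value of celery's
    `app.control.inspect().active()`: {worker_name: [{"id": ...}, ...]}.
    """
    if worker_active is None:
        # No worker answered the inspect broadcast; don't reclaim blindly.
        return []
    running = {t["id"] for tasks in worker_active.values() for t in tasks}
    return [tid for tid in tracked_task_ids if tid not in running]
```

Returning an empty list when inspection fails errs on the side of leaking a slot briefly rather than freeing one for a task that is still running.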
Member

@jmchilton jmchilton left a comment


I'm worried this will be difficult to debug but I'm sure the server getting overwhelmed is much more difficult to debug. This is very impressive.

Contributor

@davelopez davelopez left a comment


Pretty cool indeed!

Need to run `black` and maybe `make config-rebuild`.


Projects

Status: Needs Review


3 participants