
Improve celery rate limit and concurrency handling#22189

Open
mvdbeek wants to merge 2 commits into galaxyproject:dev from
mvdbeek:improve-celery-rate-limit-and-concurrnecy-handling

Conversation

@mvdbeek
Member

@mvdbeek mvdbeek commented Mar 19, 2026

Summary

  • Fix cascading delay bug in per-user Celery rate limiting: The existing implementation re-reserved a new DB timeslot on every retry, causing tasks to push their own execution further into the future with each
    attempt. Now the timeslot is reserved once and stored in a Celery message header so retries simply wait for the already-reserved slot.
  • Add per-user Celery task concurrency limiting (celery_user_concurrency_limit): Caps how many tasks can run simultaneously for a single user, preventing one user from monopolizing all worker capacity. Uses a
    celery_user_active_task tracking table with admission control in before_start, cleanup in after_return, and a periodic beat task to reclaim stale slots from crashed workers.
  • Both features are independently configurable and composable — when both are enabled, rate limiting runs first (timeslot scheduling) then concurrency limiting (execution gating).

Changes

Rate limiting fix

  • Reserve timeslot atomically in DB on first attempt only
  • Store reserved time in _gxy_rate_limit_scheduled_time message header
  • On retry, read header and wait until timeslot arrives (no re-reservation)
  • Use max_retries=None so rate-limited tasks are never dropped
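The reserve-once mechanism above can be sketched as a small helper. This is a minimal sketch, not the PR's implementation: `resolve_scheduled_time`, `reserve_slot`, and `seconds_until` are hypothetical names; only the `_gxy_rate_limit_scheduled_time` header name comes from the PR.

```python
import datetime

HEADER = "_gxy_rate_limit_scheduled_time"


def resolve_scheduled_time(headers, now, reserve_slot):
    """Return this task's reserved execution time, reserving at most once.

    On the first attempt, `reserve_slot` (a stand-in for the atomic DB
    reservation) is called and the result is pinned into the message
    headers; retries reuse the stored value instead of reserving again,
    which is what prevents the cascading-delay bug.
    """
    stored = headers.get(HEADER)
    if stored is not None:
        # Retry path: the slot was already reserved, just reuse it.
        return datetime.datetime.fromisoformat(stored)
    scheduled = reserve_slot(now)  # first attempt: reserve exactly once
    headers[HEADER] = scheduled.isoformat()
    return scheduled


def seconds_until(scheduled, now):
    """Countdown a retry should wait for; 0.0 once the slot has arrived."""
    return max(0.0, (scheduled - now).total_seconds())
```

A retrying task would pass the computed countdown to `task.retry(...)` and forward the unchanged headers so the reserved slot survives redelivery.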

Concurrency limiting (new)

  • New config option: celery_user_concurrency_limit (default 0 = disabled)
  • before_start hook counts active tasks per user; defers via task.retry(countdown=5) if at limit
  • after_return hook deletes tracking row on task completion (success or failure)
  • Periodic cleanup_stale_concurrency_slots beat task (every 5 min) reclaims slots from crashed workers by cross-referencing inspect().active()
  • New celery_user_active_task table + Alembic migration
  • Postgres and standard SQL backends supported
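The admission-control flow can be illustrated with an in-memory stand-in for the `celery_user_active_task` table; the `ConcurrencyGate` class and its method names are illustrative, not Galaxy's API.

```python
class ConcurrencyGate:
    """Per-user concurrency gate; a dict stands in for the tracking table."""

    def __init__(self, limit):
        self.limit = limit   # plays the role of celery_user_concurrency_limit
        self.active = {}     # task_id -> user_id ("tracking rows")

    def try_admit(self, task_id, user_id):
        """before_start: admit if the user is under the limit.

        Returns False when at the limit, in which case the caller would
        defer the task, e.g. via task.retry(countdown=5).
        """
        running = sum(1 for u in self.active.values() if u == user_id)
        if running >= self.limit:
            return False
        self.active[task_id] = user_id
        return True

    def release(self, task_id):
        """after_return: delete the tracking row on success or failure."""
        self.active.pop(task_id, None)
```

In the real implementation the count and insert must be atomic at the DB level (hence the Postgres vs. standard SQL code paths), which a plain dict glosses over.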

Documentation

  • Comprehensive production guide covering configuration, design, DB backend differences, limitations, and admin operations (SQL recipes for clearing leaked slots, celery CLI for purging/revoking tasks)
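The "clearing leaked slots" recipes presumably reduce to deleting tracking rows older than some cutoff. A hedged sketch, using SQLite for illustration: the table name `celery_user_active_task` comes from the PR, but the `created_time` column and function name are assumptions.

```python
import sqlite3


def clear_leaked_slots(conn, cutoff):
    """Delete tracking rows older than `cutoff` (ISO timestamp string).

    Returns the number of rows removed, i.e. the number of reclaimed slots.
    """
    cur = conn.execute(
        "DELETE FROM celery_user_active_task WHERE created_time < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```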

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek added 2 commits March 19, 2026 15:52
The existing rate-limit implementation re-reserved a new DB timeslot on
every retry, causing cascading delays where a single task could push its
own execution further and further into the future. Fix this by reserving
the timeslot once on first attempt and storing it in a Celery message
header, so retries simply wait for their already-reserved slot.

Additionally, introduce a new per-user concurrency limit
(`celery_user_concurrency_limit`) that caps how many Celery tasks can
execute simultaneously for a single user. This prevents one user from
monopolizing all available worker capacity.

Implementation details:
- Rate limit: on first attempt, atomically reserve the next available
  timeslot in `celery_user_rate_limit` table. Store the reserved time
  in a `_gxy_rate_limit_scheduled_time` message header. On retry, read
  the header and wait until the timeslot arrives. Use `max_retries=None`
  so tasks are never dropped.
- Concurrency limit: before a task starts, count active tasks for the
  user in a new `celery_user_active_task` tracking table. If at the
  limit, defer via `task.retry(countdown=5)`. On task completion
  (success or failure), delete the tracking row via an `after_return`
  hook. A periodic beat task cleans up stale rows from crashed workers
  by cross-referencing `inspect().active()`.
- Both features are independently configurable and composable via
  `GalaxyTaskBeforeStartCombined` which chains hooks in order.
- New DB migration adds the `celery_user_active_task` table.
- Integration tests cover concurrency admission, cleanup, multi-user
  isolation, and stale row recovery.
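The hook chaining mentioned above can be sketched in a few lines. This is only an illustration of the ordering idea; the actual `GalaxyTaskBeforeStartCombined` signature and behavior may differ.

```python
class BeforeStartCombined:
    """Chain several before_start hooks; order matters.

    With both features enabled, the rate-limit hook (timeslot scheduling)
    runs before the concurrency hook (execution gating).
    """

    def __init__(self, *hooks):
        self.hooks = hooks

    def __call__(self, task, task_id, args, kwargs):
        for hook in self.hooks:
            hook(task, task_id, args, kwargs)
```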
…n guide

Add comprehensive admin documentation covering both celery task
throttling mechanisms:

- Per-user rate limiting: configuration, two-phase slot reservation
  design (reserve once, retry until timeslot), DB backend differences
  (Postgres atomic upsert vs standard SELECT FOR UPDATE), and
  limitations (clock precision, no priority ordering, slot consumption
  on failure).
- Per-user concurrency limiting: configuration, admission control flow,
  after_return cleanup, periodic stale-row recovery via worker
  inspection, and limitations (retry polling interval, crash recovery
  window, DB overhead at scale).
- Combined usage: explains that rate limiting runs first (timeslot
  scheduling) then concurrency limiting (execution gating).
- Administrative operations: SQL recipes for clearing leaked slots and
  celery CLI commands for purging/revoking tasks.
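The stale-row recovery step amounts to a set difference between tracked rows and what `inspect().active()` reports. A sketch, assuming only the documented shape of Celery's active-task mapping (worker name to a list of task dicts with an `"id"` key); the function name is hypothetical.

```python
def find_stale_slots(tracked_task_ids, worker_active):
    """Return tracked task ids that no live worker reports as running.

    `worker_active` mirrors the return value of celery's
    `app.control.inspect().active()`: {worker_name: [{"id": ...}, ...]}.
    """
    if worker_active is None:
        # No worker answered the inspect broadcast; don't reclaim blindly.
        return []
    running = {t["id"] for tasks in worker_active.values() for t in tasks}
    return [tid for tid in tracked_task_ids if tid not in running]
```

Returning an empty list when inspection fails errs on the side of leaking a slot briefly rather than freeing one for a task that is still running.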
Member

@jmchilton jmchilton left a comment


I'm worried this will be difficult to debug but I'm sure the server getting overwhelmed is much more difficult to debug. This is very impressive.

Contributor

@davelopez davelopez left a comment


Pretty cool indeed!

Need to run `black` and maybe `make config-rebuild`.


Projects

Status: Needs Review


3 participants