fix: set TTL on Celery pidbox reply keys to prevent Redis memory leak #556

@bartmichalak

Description

Problem

We noticed unusually high Redis memory usage on our internal Railway instance: 2 GB for a deployment serving only ~20 users with ~6M data points.
Investigation revealed that 22 orphaned *.reply.celery.pidbox keys accounted for nearly all of it, together holding ~2 GB (up to ~112 MB per key) of base64-encoded Apple Health upload payloads that were never cleaned up.

These keys are created by Celery's remote control mechanism (pidbox). Normally they are deleted once the reply is consumed, but if a worker crashes or restarts first, the keys remain indefinitely with a TTL of -1 (no expiry).
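The orphans are easy to spot programmatically. A minimal redis-py sketch (assuming `REDIS_URL` is set in the environment; the function name is ours) that lists reply keys with no expiry:

```python
import os

def orphaned_pidbox_keys(client):
    """Return *.reply.celery.pidbox keys that have no expiry (TTL == -1)."""
    return [
        key for key in client.scan_iter(match="*.reply.celery.pidbox")
        if client.ttl(key) == -1  # -1 means the key will never expire
    ]

if __name__ == "__main__" and os.environ.get("REDIS_URL"):
    import redis  # redis-py
    client = redis.Redis.from_url(os.environ["REDIS_URL"], decode_responses=True)
    for key in orphaned_pidbox_keys(client):
        print(key)
```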

Reproduction steps

Check overall memory usage:

redis-cli -u $REDIS_URL INFO memory | grep used_memory_human
# used_memory_human:1.97G

Find large keys using a Lua scan (keys > 100KB with size in MB):

redis-cli -u $REDIS_URL EVAL "
local cursor = '0'
local big = {}
local total = 0
repeat
  local result = redis.call('SCAN', cursor, 'COUNT', 500)
  cursor = result[1]
  for i, key in ipairs(result[2]) do
    local mem = redis.call('MEMORY', 'USAGE', key)
    -- MEMORY USAGE returns nil if the key vanished between SCAN and here
    if mem and mem > 100000 then
      local mb = string.format('%.1f', mem / 1048576)
      total = total + mem
      table.insert(big, key .. ' | ' .. mb .. ' MB | ttl:' .. redis.call('TTL', key))
    end
  end
until cursor == '0'
table.insert(big, 'TOTAL: ' .. string.format('%.1f', total / 1048576) .. ' MB across ' .. #big .. ' keys')
return big
" 0

Results on our Railway instance (22 pidbox keys, all with ttl:-1):

0822df03-...reply.celery.pidbox | 112.0 MB | ttl:-1
b6d13c71-...reply.celery.pidbox | 112.0 MB | ttl:-1
67c5f659-...reply.celery.pidbox |  32.0 MB | ttl:-1
... (22 keys total, TOTAL: 2015.8 MB)

All other keys (5,268 celery-task-meta-* + 243 garmin:*) total < 5 MB combined.

Root cause

control_queue_ttl and control_queue_expires are not configured in backend/app/integrations/celery/core.py, so pidbox reply keys have no automatic expiry.

Fix

  1. Add to celery_app.conf.update() in core.py:
    control_queue_ttl=300,
    control_queue_expires=300,
  2. One-time cleanup of existing keys:
    redis-cli -u $REDIS_URL EVAL "
    local cursor = '0'
    local deleted = 0
    repeat
      local result = redis.call('SCAN', cursor, 'MATCH', '*.reply.celery.pidbox', 'COUNT', 100)
      cursor = result[1]
      for i, key in ipairs(result[2]) do
        redis.call('UNLINK', key) -- non-blocking delete for the ~112 MB values (Redis >= 4); use DEL on older servers
        deleted = deleted + 1
      end
    until cursor == '0'
    return 'Deleted ' .. deleted .. ' keys'
    " 0
    
