
DF crashes with --lua_auto_async=true #6510

@heikomat

Description

Describe the bug
Dragonfly crashes when executing scripts with the --lua_auto_async=true flag active.
Without the flag everything runs fine.

To Reproduce

  1. Start dragonfly with this command:
    dragonfly --cache_mode=false --lua_auto_async=true --table_growth_margin 0.1
    
  2. Optionally load the script (SCRIPT LOAD)
  3. Execute the script with EVAL/EVALSHA (example call below)
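
For reference, this is the shape of the call (key names are made up; two keys per account, then the TTL, then one field-blob per account; ␟ = 0x1f):

    redis-cli SCRIPT LOAD "$(cat script.lua)"
    redis-cli EVALSHA <sha1> 4 BR123 IBR123 BR456 IBR456 60 "fieldA␟fieldB" "fieldC"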

Expected behavior
I expect the script to run just like it does without the lua_auto_async flag, only faster.

Environment (please complete the following information):

  • OS: macOS 26.2
  • Kernel: Darwin MacBook-Pro-von-Heiko.local 25.2.0 Darwin Kernel Version 25.2.0: Tue Nov 18 21:07:05 PST 2025; root:xnu-12377.61.12~1/RELEASE_ARM64_T6020 arm64
  • Containerized?: Docker Compose
  • Dragonfly Version: tested on 1.35.1 and 1.36.0

Reproducible Code Snippet
Sorry, I have not yet been able to reduce this to a simple reproducible snippet.

Additional context
I tried manually replacing the couple of redis.call invocations with redis.acall, but acall seems to have a different return value. I was unable to find any documentation on acall and apcall. If I just replace call with apcall, the script errors with:

attempt to perform arithmetic on a function value

for this line:

actually_deleted_rules = actually_deleted_rules + redis.acall("HDEL", account_key, table.unpack(rules_to_delete))
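
To illustrate (this is just my understanding; I could not find docs for acall/apcall):

-- works: redis.call returns the HDEL reply (an integer) synchronously
actually_deleted_rules = actually_deleted_rules + redis.call("HDEL", account_key, table.unpack(rules_to_delete))

-- errors: whatever redis.apcall returns here is not a number
-- (judging by the error message it is a function value, some kind of async handle?)
actually_deleted_rules = actually_deleted_rules + redis.apcall("HDEL", account_key, table.unpack(rules_to_delete))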

By default there is no crash output from DF. When I add the flag --vmodule=main_service=1,set_family=2, I get this crash report:

*** SIGSEGV received at time=1769763295 on cpu 0 ***
PC: @     0xaaaac350f134  (unknown)  util::fb2::detail::Scheduler::Preempt()
    @     0xaaaac39436c4        464  absl::lts_20250512::AbslFailureSignalHandler()
    @     0xffffa12e37a0       4960  (unknown)
    @     0xaaaac2c90570        224  util::fb2::EventCount::await<>()
    @     0xaaaac2d5fcc8        176  dfly::ScriptMgr::UpdateScriptCaches()
    @     0xaaaac2d634dc        224  dfly::ScriptMgr::Insert[abi:cxx11]()
    @     0xaaaac2d18cbc        656  dfly::Service::Eval()
    @     0xaaaac333bff0        208  dfly::CommandId::Invoke()
    @     0xaaaac2d0fc8c         80  dfly::Service::InvokeCmd()
    @     0xaaaac2d10a94        512  dfly::Service::DispatchCommand()
    @     0xaaaac34bb150        416  facade::Connection::DispatchSingle()
    @     0xaaaac34bb61c        304  facade::Connection::ParseRedis()
    @     0xaaaac34bc26c       1632  facade::Connection::IoLoop()
    @     0xaaaac34bc8ec        272  facade::Connection::ConnectionFlow()
    @     0xaaaac34bdb64        368  facade::Connection::HandleRequests()
    @     0xaaaac3520bc4        560  util::ListenerInterface::RunSingleConnection()
    @     0xaaaac35211c0        192  boost::context::detail::fiber_entry<>()

I use ioredis with ioredis.defineCommand. This means it usually uses EVALSHA, except when the script has not been loaded yet, in which case it falls back to EVAL.


This is the script I'm trying to execute. In high-load scenarios it is called with up to 6k accounts and 400-500 entities, resulting in about 2-3 million hash entries being created.

---@diagnostic disable: undefined-global
-- KEYS[1..M] = per-account keys (BR<accountId1>, IBR<accountId1>, BR<accountId2>, IBR<accountId2>, ...)
--
-- ARGV[1] = invalidateInfoTTL (seconds, e.g. "60"; "0" to skip)
-- ARGV[2..] = BR-Cache fields to invalidate, one string per account:
--   format: "field␟field␟field..."
--
-- Return: deletedCount
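--
-- Example call shape (made-up account ids, ␟ = 0x1f):
--   EVAL <script> 4 BR1 IBR1 BR2 IBR2 60 "a␟b" "c"
--   which gives KEYS = { "BR1", "IBR1", "BR2", "IBR2" } and ARGV = { "60", "a␟b", "c" }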

local invalidate_info_ttl = tonumber(ARGV[1])

local separator = string.char(0x1f) -- ChatGPT is very sure that this is the best unit separator ever
local account_count = #KEYS / 2

-- check if the correct number of arguments are provided
if #ARGV ~= 1 + account_count then
  return redis.error_reply("ARGV must be: invalidateInfoTtl, then exactly one blob per account")
end


-- Determine the timestamp that the rules were invalidated at (aka now)
local redis_time = redis.call("TIME")
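-- TIME replies with { seconds, microseconds } as strings;
-- Lua coerces them to numbers in the arithmetic below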
local now_us = redis_time[1] * 1000000 + redis_time[2]

local actually_deleted_rules = 0

-- parses the string "field␟field␟field"
-- into the array [field, field, field]
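-- e.g. parse_fields("a␟b␟c") --> { "a", "b", "c" }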
local function parse_fields(blob)
  local fields = {}
  local blen = #blob
  local pos = 1

  while pos <= blen do
    local next_sep = string.find(blob, separator, pos, true)
    local token_end = next_sep and (next_sep - 1) or blen
    fields[#fields + 1] = string.sub(blob, pos, token_end)
    pos = next_sep and (next_sep + 1) or (blen + 1)
  end

  return fields
end

for i = 1, account_count do
  -- Name of the hash that contains the cached results for the current account
  -- Format: BR<accountId>
  local account_key = KEYS[2*i - 1]
  
  -- Name of the hash that contains the timestamps for recently invalidated
  -- rules for the current account. Format: IBR<accountId>
  local invalidated_account_key = KEYS[2*i]

  -- names of rules to delete for the current account.
  -- Format: "field␟field␟field..."
  local deleted_rules_for_account = ARGV[1 + i] 

  -- For each account: delete the rules and record when the deletion happened
  if deleted_rules_for_account and #deleted_rules_for_account > 0 then
    local rules_to_delete = parse_fields(deleted_rules_for_account)

    -- Fast path: one field
    if #rules_to_delete == 1 then
      actually_deleted_rules = actually_deleted_rules + redis.call("HDEL", account_key, rules_to_delete[1])
      redis.call("HSET", invalidated_account_key, rules_to_delete[1], now_us)
    else
      -- Bulk path: multiple fields
      actually_deleted_rules = actually_deleted_rules + redis.call("HDEL", account_key, table.unpack(rules_to_delete))

      local deletion_timestamps = {}
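      -- build a flat { field1, now_us, field2, now_us, ... } list,
      -- which is the argument layout HSET expects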
      for j = 1, #rules_to_delete do
        local field_index = (j - 1) * 2 + 1
        deletion_timestamps[field_index] = rules_to_delete[j]
        deletion_timestamps[field_index + 1] = now_us
      end

      redis.call("HSET", invalidated_account_key, table.unpack(deletion_timestamps))
    end

    -- The info about when rules were invalidated is only relevant temporarily.
    -- Make sure they expire at some point.
    if invalidate_info_ttl and invalidate_info_ttl > 0 then
      redis.call("EXPIRE", invalidated_account_key, invalidate_info_ttl)
    end
  end
end

return actually_deleted_rules
