Bug: Exponential backoff overflow and unbounded retry delays #271

@jumski

Problem

pgflow's exponential backoff has two critical issues:

  1. Integer overflow crash in SQL at attempt ≥ 31 (with base_delay = 1; larger base delays overflow earlier)
  2. Unbounded retry delays, allowing delays of up to 68 years

Timeline to Overflow

| Attempt | Single Delay | Total Time Elapsed | Status |
|---|---|---|---|
| 1 | 2 sec | 2 sec | ✓ Works |
| 10 | 17 min | 34 min | ✓ Works |
| 20 | 12 days | 24 days | ⚠️ Frozen for weeks |
| 30 | 34 years | 68 years | ⚠️ Unreasonable |
| 31 | - | - | CRASH: integer overflow |
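
The timeline figures can be rechecked with a small TypeScript mirror of the SQL formula. This is only a sketch: `uncappedSqlDelay`, `firstOverflowAttempt` and `totalElapsed` are hypothetical helper names, not pgflow functions, and base_delay = 1 is an assumption inferred from the 2-second delay at attempt 1.

```typescript
// Largest value a PostgreSQL `int` (int4) column or cast can hold.
const INT32_MAX = 2147483647;

// Mirrors floor(base_delay * power(2, attempts_count))::int.
// Throws where PostgreSQL would raise "integer out of range".
function uncappedSqlDelay(baseDelay: number, attempt: number): number {
  const delay = Math.floor(baseDelay * Math.pow(2, attempt));
  if (delay > INT32_MAX) {
    throw new RangeError(`integer out of range at attempt ${attempt}`);
  }
  return delay;
}

// First attempt count at which the ::int cast would fail for a base delay.
function firstOverflowAttempt(baseDelay: number): number {
  let attempt = 0;
  while (Math.floor(baseDelay * Math.pow(2, attempt)) <= INT32_MAX) {
    attempt++;
  }
  return attempt;
}

// Sum of all delays from attempt 1 through lastAttempt, in seconds.
function totalElapsed(baseDelay: number, lastAttempt: number): number {
  let total = 0;
  for (let a = 1; a <= lastAttempt; a++) {
    total += uncappedSqlDelay(baseDelay, a);
  }
  return total;
}

console.log(uncappedSqlDelay(1, 10)); // 1024 s, about 17 minutes
console.log(totalElapsed(1, 10));     // 2046 s, about 34 minutes
console.log(firstOverflowAttempt(1)); // 31
console.log(firstOverflowAttempt(5)); // 29
```

The last line shows that the crash point is not fixed at attempt 31: with a base delay of 5 seconds the cast already fails at attempt 29.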

Industry Comparison

| System | Max Delay Cap | Default Max Attempts | Pattern |
|---|---|---|---|
| Temporal | 100 seconds | | Exponential capped |
| Trigger.dev | 30 seconds | 3 | Exponential capped |
| Graphile Worker | ~6 hours | 24 | exp(least(10, attempt)) |
| Inngest | Not specified | 4 | Exponential with jitter |
| pgflow SQL | NONE ❌ | 3 (unlimited) | 2^attempt uncapped |
| pgflow edge-worker | 300s default, 68 years max ❌ | 50 max | Exponential with cap |

Root Cause

SQL Core (Primary Issue)

Function: pkgs/core/schemas/0030_utilities.sql (lines 21-32)

create or replace function pgflow.calculate_retry_delay(
  base_delay numeric,
  attempts_count int
)
returns int
as $$
  select floor(base_delay * power(2, attempts_count))::int
$$;

Problems:

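  • No exponent cap: power(2, 31) = 2,147,483,648 exceeds the PostgreSQL int maximum (2,147,483,647), so the ::int cast fails with "integer out of range"; a base_delay above 1 overflows at even lower attempt counts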
  • No result cap: delays can grow to years/decades
  • Schema allows unlimited opt_max_attempts (no upper bound)

Used by: the fail_task() function, when a task fails and needs a retry


Edge Worker (Secondary Issue)

Function: pkgs/edge-worker/src/queue/createQueueWorker.ts (lines 91-101)

export function calculateRetryDelay(attempt: number, config: RetryConfig): number {
  switch (config.strategy) {
    case 'fixed':
      return config.baseDelay;
    case 'exponential': {
      const delay = config.baseDelay * Math.pow(2, attempt - 1);
      return Math.min(delay, config.maxDelay ?? 300);
    }
  }
}

Current behavior:

  • ✅ Default maxDelay: 300 seconds (5 minutes) is reasonable
  • ✅ Hard limit of 50 attempts prevents JavaScript overflow
  • ✅ JavaScript handles Infinity gracefully (caps via Math.min)
  • ❌ Validation allows maxDelay up to 2,147,483,647 seconds (68 years!)

Validation: pkgs/edge-worker/src/queue/validateRetryConfig.ts (lines 48-66)

// Prevents overflow
if (config.limit > 50) {
  throw new Error('For exponential strategy, limit must not exceed 50');
}

// But allows absurd maxDelay!
const MAX_POSTGRES_INTERVAL_SECONDS = 2147483647; // 68 years
if (config.maxDelay > MAX_POSTGRES_INTERVAL_SECONDS) {
  throw new Error(`maxDelay must not exceed ${MAX_POSTGRES_INTERVAL_SECONDS} seconds`);
}

Edge-worker delay comparison (baseDelay = 3 seconds):

| Attempt | Default (maxDelay=300) | Unrestricted (maxDelay=2147483647) |
|---|---|---|
| 1 | 3 sec | 3 sec |
| 5 | 48 sec | 48 sec |
| 10 | 300 sec (capped) | 25.6 min |
| 20 | 300 sec (capped) | 18.2 days |
| 30 | 300 sec (capped) | about 51 years |
| 50 | 300 sec (capped) | about 68 years (capped at maxDelay) |
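
The comparison can be recomputed by running the quoted calculateRetryDelay logic for both settings. A minimal sketch: the `RetryConfig` shape below is an assumption (the real type in edge-worker may carry more fields, such as `limit`), and baseDelay = 3 is inferred from the 3-second delay at attempt 1.

```typescript
// Assumed shape of the retry settings; only the fields used here.
interface RetryConfig {
  strategy: 'fixed' | 'exponential';
  baseDelay: number;
  maxDelay?: number;
}

// Same logic as the calculateRetryDelay function quoted in this issue.
function calculateRetryDelay(attempt: number, config: RetryConfig): number {
  switch (config.strategy) {
    case 'fixed':
      return config.baseDelay;
    case 'exponential': {
      const delay = config.baseDelay * Math.pow(2, attempt - 1);
      return Math.min(delay, config.maxDelay ?? 300);
    }
  }
}

// Default cap (300 s) versus the largest maxDelay the validation accepts.
const defaults: RetryConfig = { strategy: 'exponential', baseDelay: 3 };
const unrestricted: RetryConfig = {
  strategy: 'exponential',
  baseDelay: 3,
  maxDelay: 2147483647,
};

// Prints: attempt, delay with default cap, delay with the largest allowed cap.
for (const attempt of [1, 5, 10, 20, 30, 50]) {
  console.log(
    attempt,
    calculateRetryDelay(attempt, defaults),
    calculateRetryDelay(attempt, unrestricted),
  );
}
```

Because Math.min is applied last, even the unrestricted setting never returns more than maxDelay itself, which is 2,147,483,647 seconds (about 68 years).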

Proposed Solution

1. Fix SQL calculate_retry_delay (High Priority)

Replace line 31 in pkgs/core/schemas/0030_utilities.sql:

select least(86400, greatest(0, floor(base_delay * power(2, least(attempts_count, 30)))::bigint))::int

What this does:

  • Caps exponent at 30 → prevents overflow
  • Caps result at 86400 seconds (24 hours) → prevents unbounded delays
  • Maintains exponential backoff until the 24-hour cap is reached (from attempt 17 onward with base_delay = 1)

Fixes 7/9 problems with one line!
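
To see how the two caps interact, here is a TypeScript mirror of the proposed expression. It is a sketch for reasoning about the numbers, not a replacement for the SQL: `cappedSqlDelay` and the two constants are hypothetical names, and JavaScript doubles stand in for PostgreSQL's numeric and bigint types.

```typescript
const MAX_DELAY_SECONDS = 86400; // 24 hours, the proposed result cap
const MAX_EXPONENT = 30;         // the proposed exponent cap

// Mirrors:
//   least(86400, greatest(0,
//     floor(base_delay * power(2, least(attempts_count, 30)))::bigint))::int
function cappedSqlDelay(baseDelay: number, attempt: number): number {
  const exponent = Math.min(attempt, MAX_EXPONENT);
  const raw = Math.floor(baseDelay * Math.pow(2, exponent));
  return Math.min(MAX_DELAY_SECONDS, Math.max(0, raw));
}

console.log(cappedSqlDelay(1, 16));  // 65536, still on the exponential curve
console.log(cappedSqlDelay(1, 17));  // 86400, result cap reached
console.log(cappedSqlDelay(1, 31));  // 86400, no overflow
console.log(cappedSqlDelay(1, 100)); // 86400, no overflow
console.log(cappedSqlDelay(0, 5));   // 0
console.log(cappedSqlDelay(5, 0));   // 5, attempt 0 returns base_delay
```

With base_delay = 1 the result cap takes over at attempt 17, so the exponent cap of 30 mainly acts as overflow protection for the intermediate value.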

2. Update Edge-Worker Validation (Recommended)

Option A: Hard cap at 24 hours (strict)

Update pkgs/edge-worker/src/queue/validateRetryConfig.ts line 64:

const MAX_RETRY_DELAY_SECONDS = 86400; // 24 hours (align with SQL)
if (config.maxDelay > MAX_RETRY_DELAY_SECONDS) {
  throw new Error(`maxDelay must not exceed ${MAX_RETRY_DELAY_SECONDS} seconds (24 hours)`);
}

Option B: Warning + PostgreSQL limit (permissive)

if (config.maxDelay > 86400) {
  console.warn(`maxDelay of ${config.maxDelay}s exceeds recommended maximum of 86400s (24 hours)`);
}
if (config.maxDelay > MAX_POSTGRES_INTERVAL_SECONDS) {
  throw new Error(`maxDelay must not exceed ${MAX_POSTGRES_INTERVAL_SECONDS} seconds`);
}

3. Add Database Constraint (Optional)

In pkgs/core/schemas/0050_tables_definitions.sql:

constraint opt_max_attempts_is_reasonable check (opt_max_attempts >= 0 and opt_max_attempts <= 100)

New Tests Required

SQL Tests

Add pkgs/core/supabase/tests/functions/calculate_retry_delay.test.sql:

  1. Normal exponential growth works (attempts 1, 3, 5 with various base_delay)
  2. Attempt 31 does not crash (overflow prevention)
  3. Attempt 50 does not crash
  4. Attempt 100 does not crash
  5. Attempt 30 caps at 86400 seconds
  6. Attempt 31 caps at 86400 seconds
  7. Large base_delay + high attempt caps at 86400
  8. base_delay=0 returns 0 delay
  9. Attempt 0 returns base_delay (edge case)

Edge-Worker Tests (if validation updated)

Update existing tests in pkgs/edge-worker/tests/:

  1. maxDelay=86400 is accepted (boundary)
  2. maxDelay=86401 is rejected (over limit)
  3. maxDelay=300 works as default
  4. Warning logged when maxDelay > 86400 (if using Option B)
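
The boundary cases for Option A could be checked along these lines. This is a framework-free sketch: `validateMaxDelay` and `throws` are hypothetical stand-ins, and the real tests would call the actual validation in validateRetryConfig.ts using whatever test runner the edge-worker package already uses.

```typescript
const MAX_RETRY_DELAY_SECONDS = 86400; // 24 hours, as in Option A

// Hypothetical stand-in for the maxDelay check in validateRetryConfig.ts.
function validateMaxDelay(maxDelay: number): void {
  if (maxDelay > MAX_RETRY_DELAY_SECONDS) {
    throw new Error(
      `maxDelay must not exceed ${MAX_RETRY_DELAY_SECONDS} seconds (24 hours)`,
    );
  }
}

// Returns true when the callback throws, false otherwise.
function throws(fn: () => void): boolean {
  try {
    fn();
    return false;
  } catch (_e) {
    return true;
  }
}

// One entry per test case listed above (Option B's warning is not covered).
const results = {
  acceptsBoundary: !throws(() => validateMaxDelay(86400)),
  rejectsOverLimit: throws(() => validateMaxDelay(86401)),
  acceptsDefault: !throws(() => validateMaxDelay(300)),
};

console.log(results);
```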

Impact

Before fix:

  • ❌ SQL crashes at attempt 31 with ERROR: integer out of range
  • ❌ Workflows frozen for weeks/years
  • ❌ Failed tasks consume resources indefinitely
  • ❌ No alerts until catastrophic failure
  • ❌ Edge-worker accepts 68-year delays

After fix:

  • ✅ No overflow at any attempt count
  • ✅ Maximum 24-hour retry delay (industry-aligned)
  • ✅ Tasks fail explicitly after reasonable time
  • ✅ Faster failure detection and recovery
  • ✅ Consistent limits between SQL and edge-worker
