Description
Problem
pgflow's exponential backoff has two critical issues:
- Integer overflow crash in SQL at attempt ≥ 31 (with a base_delay of 1 second; larger base delays overflow sooner)
- Unbounded retry delays allowing delays up to 68 years
Timeline to Overflow
| Attempt | Single Delay | Total Time Elapsed | Status |
|---|---|---|---|
| 1 | 2 sec | 2 sec | ✓ Works |
| 10 | 17 min | 34 min | ✓ Works |
| 20 | 12 days | 24 days | |
| 30 | 34 years | 68 years | |
| 31 | - | - | ❌ CRASH: integer overflow |
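The figures in the table can be reproduced with a short sketch. It assumes a base delay of 1 second (inferred from the table values, not stated explicitly) and mirrors the SQL formula `base_delay * 2^attempt`. The names `timelineRow` and `INT32_MAX` are hypothetical and used only for illustration.

```typescript
// Illustrative model of the uncapped formula: delay = baseDelay * 2^attempt.
// Assumption: baseDelay = 1 second, inferred from the timeline table.
const INT32_MAX = 2147483647; // upper bound of PostgreSQL's 4-byte integer

interface TimelineRow {
  attempt: number;
  delaySeconds: number;    // delay scheduled after this attempt fails
  totalSeconds: number;    // sum of the delays for attempts 1..attempt
  overflowsInt32: boolean; // true when the ::int cast would raise an error
}

function timelineRow(attempt: number, baseDelay: number = 1): TimelineRow {
  const delaySeconds = Math.floor(baseDelay * Math.pow(2, attempt));
  let totalSeconds = 0;
  for (let a = 1; a <= attempt; a++) {
    totalSeconds += Math.floor(baseDelay * Math.pow(2, a));
  }
  return {
    attempt,
    delaySeconds,
    totalSeconds,
    overflowsInt32: delaySeconds > INT32_MAX,
  };
}

for (const attempt of [1, 10, 20, 30, 31]) {
  console.log(timelineRow(attempt));
}
```

Attempt 30 stays just under the int32 limit, while attempt 31 is the first value the cast cannot represent.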
Industry Comparison
| System | Max Delay Cap | Default Max Attempts | Pattern |
|---|---|---|---|
| Temporal | 100 seconds | ∞ | Exponential capped |
| Trigger.dev | 30 seconds | 3 | Exponential capped |
| Graphile Worker | ~6 hours | 24 | exp(least(10, attempt)) |
| Inngest | Not specified | 4 | Exponential with jitter |
| pgflow SQL | NONE ❌ | 3 (unlimited) | 2^attempt uncapped |
| pgflow edge-worker | 300s default, 68 years max ❌ | 50 max | Exponential with cap |
Root Cause
SQL Core (Primary Issue)
Function: pkgs/core/schemas/0030_utilities.sql (lines 21-32)

    create or replace function pgflow.calculate_retry_delay(
      base_delay numeric,
      attempts_count int
    )
    returns int
    language sql
    as $$
      select floor(base_delay * power(2, attempts_count))::int
    $$;

Problems:
- No exponent cap: `power(2, 31)` exceeds the PostgreSQL int32 limit (2,147,483,647)
- No result cap: delays can grow to years/decades
- Schema allows unlimited `opt_max_attempts` (no upper bound)
Used by: fail_task() function when task fails and needs retry
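Because the cast fails as soon as `base_delay * 2^attempts_count` exceeds the int32 maximum, the first failing attempt depends on the base delay. The sketch below models that arithmetic in TypeScript rather than running it in PostgreSQL; `firstOverflowAttempt` is a hypothetical helper, and the model ignores floating-point rounding subtleties.

```typescript
// Models floor(base_delay * power(2, attempts_count))::int and reports the
// first attempts_count whose result no longer fits in a 4-byte integer.
// Hypothetical helper for illustration; not part of pgflow.
const INT32_MAX = 2147483647;

function firstOverflowAttempt(baseDelay: number): number {
  if (!(baseDelay > 0)) {
    throw new RangeError("baseDelay must be positive for the delay to grow");
  }
  let attempt = 0;
  while (Math.floor(baseDelay * Math.pow(2, attempt)) <= INT32_MAX) {
    attempt++;
  }
  return attempt;
}

for (const baseDelay of [1, 5, 60]) {
  console.log(`base_delay=${baseDelay}s overflows at attempt ${firstOverflowAttempt(baseDelay)}`);
}
```

With a 1-second base the crash appears at attempt 31, but a 60-second base reaches it at attempt 26, so the threshold in practice can be lower than the headline number.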
Edge Worker (Secondary Issue)
Function: pkgs/edge-worker/src/queue/createQueueWorker.ts (lines 91-101)

    export function calculateRetryDelay(attempt: number, config: RetryConfig): number {
      switch (config.strategy) {
        case 'fixed':
          return config.baseDelay;
        case 'exponential': {
          const delay = config.baseDelay * Math.pow(2, attempt - 1);
          return Math.min(delay, config.maxDelay ?? 300);
        }
      }
    }

Current behavior:
- ✅ Default `maxDelay: 300` seconds (5 minutes) is reasonable
- ✅ Hard limit of 50 attempts prevents JavaScript overflow
- ✅ JavaScript handles `Infinity` gracefully (caps via `Math.min`)
- ❌ Validation allows `maxDelay` up to 2,147,483,647 seconds (68 years!)
Validation: pkgs/edge-worker/src/queue/validateRetryConfig.ts (lines 48-66)

    // Prevents overflow
    if (config.limit > 50) {
      throw new Error('For exponential strategy, limit must not exceed 50');
    }
    // But allows absurd maxDelay!
    const MAX_POSTGRES_INTERVAL_SECONDS = 2147483647; // 68 years
    if (config.maxDelay > MAX_POSTGRES_INTERVAL_SECONDS) {
      throw new Error(`maxDelay must not exceed ${MAX_POSTGRES_INTERVAL_SECONDS} seconds`);
    }

Edge-worker delay comparison:
| Attempt | Default (maxDelay=300) | Unrestricted (maxDelay=2147483647) |
|---|---|---|
| 1 | 3 sec | 3 sec |
| 5 | 48 sec | 48 sec |
| 10 | 300 sec (capped) | 25.6 min |
| 20 | 300 sec (capped) | 18.2 days |
| 30 | 300 sec (capped) | 51 years |
| 50 | 300 sec (capped) | 68 years (capped) |

Values assume baseDelay = 3 seconds and the formula baseDelay * 2^(attempt - 1).
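The two columns follow from evaluating `calculateRetryDelay` with and without a tight cap. The sketch below restates the function in self-contained form and evaluates it, assuming `baseDelay = 3` seconds (inferred from the first rows of the table). The `RetryConfig` shape shown here is a simplified assumption, not the package's actual type.

```typescript
// Simplified, assumed config shape; the real RetryConfig may differ.
type RetryConfig =
  | { strategy: 'fixed'; baseDelay: number }
  | { strategy: 'exponential'; baseDelay: number; maxDelay?: number };

// Same logic as the edge-worker excerpt: exponential growth clamped by maxDelay.
function calculateRetryDelay(attempt: number, config: RetryConfig): number {
  switch (config.strategy) {
    case 'fixed':
      return config.baseDelay;
    case 'exponential': {
      const delay = config.baseDelay * Math.pow(2, attempt - 1);
      return Math.min(delay, config.maxDelay ?? 300);
    }
  }
}

// Assumption: baseDelay = 3 seconds for both configurations.
const defaultConfig: RetryConfig = { strategy: 'exponential', baseDelay: 3 };
const permissiveConfig: RetryConfig = {
  strategy: 'exponential',
  baseDelay: 3,
  maxDelay: 2147483647, // largest value the current validation accepts
};

for (const attempt of [1, 5, 10, 20, 30, 50]) {
  console.log(
    attempt,
    calculateRetryDelay(attempt, defaultConfig),
    calculateRetryDelay(attempt, permissiveConfig),
  );
}
```

Even with `maxDelay` at its permitted maximum, `Math.min` still clamps the result, so the largest delay the edge-worker can produce is 2,147,483,647 seconds (about 68 years).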
Proposed Solution
1. Fix SQL calculate_retry_delay (High Priority)
Replace line 31 in pkgs/core/schemas/0030_utilities.sql:

    select least(86400, greatest(0, floor(base_delay * power(2, least(attempts_count, 30)))::bigint))::int

What this does:
- Caps exponent at 30 and casts to bigint before clamping → prevents int32 overflow
- Caps result at 86400 seconds (24 hours) → prevents unbounded delays
- Preserves exponential growth until the delay reaches the 24-hour cap (through attempt 16 when base_delay is 1 second)
Fixes 7/9 problems with one line!
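To see how the proposed expression behaves across attempt counts, here is a minimal TypeScript model of it. This is a sketch for reasoning about the clamping, not the SQL itself; `cappedRetryDelay`, `MAX_EXPONENT`, and `MAX_DELAY_SECONDS` are hypothetical names.

```typescript
// TypeScript model of:
//   least(86400, greatest(0, floor(base_delay * power(2, least(attempts_count, 30)))))
// Hypothetical names; the real fix lives in the SQL function.
const MAX_EXPONENT = 30;         // keeps the power term bounded
const MAX_DELAY_SECONDS = 86400; // 24 hours

function cappedRetryDelay(baseDelay: number, attemptsCount: number): number {
  const exponent = Math.min(attemptsCount, MAX_EXPONENT);
  const raw = Math.floor(baseDelay * Math.pow(2, exponent));
  // Clamp into [0, MAX_DELAY_SECONDS].
  return Math.min(MAX_DELAY_SECONDS, Math.max(0, raw));
}

for (const attempts of [0, 3, 16, 17, 31, 100]) {
  console.log(`attempts=${attempts} -> ${cappedRetryDelay(1, attempts)}s`);
}
```

With a 1-second base delay the result doubles through attempt 16 (65,536 seconds) and is pinned at 86,400 seconds from attempt 17 onward, including attempts 31, 50, and 100.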
2. Update Edge-Worker Validation (Recommended)
Option A: Hard cap at 24 hours (strict)
Update pkgs/edge-worker/src/queue/validateRetryConfig.ts line 64:

    const MAX_RETRY_DELAY_SECONDS = 86400; // 24 hours (align with SQL)
    if (config.maxDelay > MAX_RETRY_DELAY_SECONDS) {
      throw new Error(`maxDelay must not exceed ${MAX_RETRY_DELAY_SECONDS} seconds (24 hours)`);
    }

Option B: Warning + PostgreSQL limit (permissive)

    if (config.maxDelay > 86400) {
      console.warn(`maxDelay of ${config.maxDelay}s exceeds recommended maximum of 86400s (24 hours)`);
    }
    if (config.maxDelay > MAX_POSTGRES_INTERVAL_SECONDS) {
      throw new Error(`maxDelay must not exceed ${MAX_POSTGRES_INTERVAL_SECONDS} seconds`);
    }

3. Add Database Constraint (Optional)
In pkgs/core/schemas/0050_tables_definitions.sql:

    constraint opt_max_attempts_is_reasonable check (opt_max_attempts >= 0 and opt_max_attempts <= 100)

New Tests Required
SQL Tests
Add pkgs/core/supabase/tests/functions/calculate_retry_delay.test.sql:
- Normal exponential growth works (attempts 1, 3, 5 with various base_delay)
- Attempt 31 does not crash (overflow prevention)
- Attempt 50 does not crash
- Attempt 100 does not crash
- Attempt 30 caps at 86400 seconds
- Attempt 31 caps at 86400 seconds
- Large base_delay + high attempt caps at 86400
- base_delay=0 returns 0 delay
- Attempt 0 returns base_delay (edge case)
Edge-Worker Tests (if validation updated)
Update existing tests in pkgs/edge-worker/tests/:
- maxDelay=86400 is accepted (boundary)
- maxDelay=86401 is rejected (over limit)
- maxDelay=300 works as default
- Warning logged when maxDelay > 86400 (if using Option B)
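The boundary cases for Option A could be exercised roughly as follows. This sketch uses a stand-in validator and plain checks rather than the project's test runner; `validateMaxDelay` and `expectThrows` are hypothetical names, and the real `validateRetryConfig` takes a full config object.

```typescript
// Stand-in for the Option A check; not the actual validateRetryConfig.
const MAX_RETRY_DELAY_SECONDS = 86400; // 24 hours

function validateMaxDelay(maxDelay: number | undefined): void {
  if (maxDelay === undefined) return; // caller falls back to the 300s default
  if (maxDelay > MAX_RETRY_DELAY_SECONDS) {
    throw new Error(
      `maxDelay must not exceed ${MAX_RETRY_DELAY_SECONDS} seconds (24 hours)`,
    );
  }
}

// Small helper: reports whether the callback threw.
function expectThrows(fn: () => void): boolean {
  try {
    fn();
    return false;
  } catch {
    return true;
  }
}

const cases: Array<{ maxDelay: number | undefined; shouldThrow: boolean }> = [
  { maxDelay: 86400, shouldThrow: false },     // boundary is accepted
  { maxDelay: 86401, shouldThrow: true },      // one over the limit is rejected
  { maxDelay: 300, shouldThrow: false },       // documented default
  { maxDelay: undefined, shouldThrow: false }, // unset is allowed
];

for (const { maxDelay, shouldThrow } of cases) {
  const threw = expectThrows(() => validateMaxDelay(maxDelay));
  console.log(`maxDelay=${maxDelay}: ${threw === shouldThrow ? "ok" : "unexpected"}`);
}
```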
Impact
Before fix:
- ❌ SQL crashes at attempt 31 with `ERROR: integer out of range`
- ❌ Workflows frozen for weeks/years
- ❌ Failed tasks consume resources indefinitely
- ❌ No alerts until catastrophic failure
- ❌ Edge-worker accepts 68-year delays
After fix:
- ✅ No overflow at any attempt count
- ✅ Maximum 24-hour retry delay (industry-aligned)
- ✅ Tasks fail explicitly after reasonable time
- ✅ Faster failure detection and recovery
- ✅ Consistent limits between SQL and edge-worker