- Despite backlogs in function scheduling, the overall increased load caused memory on a queue shard (where functions are enqueued after scheduling) to grow, and the executor workers for that shard became saturated.
- Significant memory pressure on shared infrastructure, approaching capacity limits, as work could not be processed fast enough.
- This was further exacerbated by a high volume of programmatic bulk cancellations from customer workspaces. Bulk cancellations can include future timestamps, which require the event stream consumer to match event payload expressions in real time during scheduling.
- Delayed ingestion into our observability pipeline, which meant dashboard event and run data was delayed even when scheduling was still making forward progress. This prevented customers from understanding whether their data was being processed.
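To make the bulk-cancellation cost above concrete, here is a minimal sketch (all names hypothetical, not our actual internals): each pending cancellation with a future timestamp carries a match expression that the event stream consumer must evaluate against every event payload at scheduling time, so the work scales with the number of pending cancellations times the event rate.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical sketch of why future-dated bulk cancellations add scheduling
# cost. Names and structures are illustrative only.

@dataclass
class BulkCancellation:
    function_id: str
    before: datetime   # cancel runs scheduled before this (possibly future) time
    expression: str    # payload match expression, evaluated per event

def matches(cancellation: BulkCancellation, event: dict, scheduled_at: datetime) -> bool:
    """Evaluate one pending cancellation against one event during scheduling."""
    if scheduled_at >= cancellation.before:
        return False
    # Stand-in for a real expression engine (e.g. CEL); eval is for the sketch only.
    return bool(eval(cancellation.expression, {"event": event}))

def should_cancel(pending: list[BulkCancellation], event: dict, scheduled_at: datetime) -> bool:
    # O(len(pending)) expression evaluations for every scheduled run: a burst
    # of bulk cancellations multiplies load on the event stream consumer.
    return any(matches(c, event, scheduled_at) for c in pending)

pending = [
    BulkCancellation(
        function_id="send-digest",
        before=datetime(2026, 4, 16, tzinfo=timezone.utc),  # future timestamp
        expression='event["data"]["user_id"] == "u_123"',
    )
]
evt = {"data": {"user_id": "u_123"}}
print(should_cancel(pending, evt, datetime(2026, 4, 15, 12, 0, tzinfo=timezone.utc)))  # True
```

Because every in-flight cancellation is checked inline on the scheduling path, a spike in bulk cancellations degrades scheduling throughput for everyone on the shard, which is why this work is moving off shared storage.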
Additionally, our system typically sees increased load during nightly periods as customers schedule large workloads to run in off-hours. This pattern is consistent and usually subsides before the daily increase during morning hours in US timezones.
Three factors combined to produce an incident of this size: a bug in our SDK that amplified event volume for a high-throughput customer, insufficient tenant isolation in the state store and event stream layers, and the typical increased nightly load.
## Impact
We have already shipped a number of fixes during and immediately after the incident.
### In progress
- **Review of all system metrics throughout the incident for earlier detection.** The 10-hour gap between the onset of scheduling delays and our detection, once delays increased significantly, was a failure. We are auditing all metrics and determining new metrics and indicators that should have alerted us when slowness started, before the state store shard became saturated.
- **New training, workflows, and automations for faster customer notification via status page.** Even with our current alerting, our customers should have been notified via our status page much faster. We are also improving the visibility of active incidents within our dashboard.
- **Additional tenant isolation across the critical path.** We are actively working on further sharding event stream processing (function scheduling, async step operations, batching) by tenant groups and individual tenants, and isolating the state store further to prevent noisy neighbor issues from degrading scheduling for others. This is now our top engineering priority.
- **Removing batching from the shared state store entirely**, so that batching load is permanently decoupled from run scheduling.
- **Moving cancellations off of unsharded storage**, so that bulk cancellations cannot create slowlogs that affect other workloads.
- **Decoupling the event data ingestion pipeline from the scheduling critical path**, so observability data is not delayed by backlogs in function scheduling.
- **Reviewing and honoring SLA commitments** for contracted users. We will be proactively reaching out to contracted customers.
- **New SLOs and improved alerting on state store CPU, I/O, and per-tenant latency**, so we detect and page on hot-shard conditions well before they cause customer impact.
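The tenant isolation work above can be sketched as a simple shard-assignment scheme (purely illustrative, not our production design): most tenants hash across a shared pool of shards, while known high-throughput tenants are pinned to dedicated shards so a single hot tenant cannot saturate a shard it shares with others.

```python
import hashlib

# Illustrative sketch of tenant-aware sharding (hypothetical names and
# shard counts). A stable hash keeps a tenant on the same shared shard,
# and pinned tenants get dedicated capacity.

SHARED_SHARDS = 8
DEDICATED = {                      # hypothetical pinned high-volume tenants
    "tenant_high_volume": "dedicated-0",
}

def shard_for(tenant_id: str) -> str:
    if tenant_id in DEDICATED:
        return DEDICATED[tenant_id]
    # Stable hash so a tenant always lands on the same shared shard.
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return f"shared-{int(digest, 16) % SHARED_SHARDS}"

print(shard_for("tenant_high_volume"))  # dedicated-0
print(shard_for("tenant_a"))            # a stable shared-N shard
```

With this shape, a burst from a pinned tenant only degrades its own dedicated shard, which is the noisy-neighbor property the isolation work is after.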
- Our team already has a major project underway to migrate our state store database to another technology and to self-host it, giving us additional control over this part of the stack. Our database is currently managed by a cloud vendor, which limits our operational control and speed of recovery.