Commit 2f08d95

Improvements
1 parent c46be97 commit 2f08d95

File tree

1 file changed (+5, -5 lines)

content/blog/2026-04-17-incident-report-for-april-15-2026.md

Lines changed: 5 additions & 5 deletions
@@ -53,12 +53,12 @@ Because this storage layer is shared across the platform, the hot shard caused c
 - Despite backlogs in function scheduling, the overall increased load caused memory on a queue shard (where functions are enqueued after scheduling) to grow, and the executor workers for that shard became saturated.
 - Significant memory pressure on shared infrastructure, approaching capacity limits, as work could not be processed fast enough.
 - Slowed processing of async operations (`step.waitForEvent`, `step.invoke`, `cancelOn`).
-- This was further exacerbated by increased customer programmatic bulk cancellations initiated. Bulk cancellations can include future timestamps which require the event stream consumer to match event payload expressions in realtime during scheduling.
+- This was further exacerbated by a high volume of programmatic bulk cancellations from customer workspaces. Bulk cancellations can include future timestamps, which require the event stream consumer to match event payload expressions in realtime during scheduling.
 - Delayed ingestion into our observability pipeline, which meant dashboard event and run data were delayed even when scheduling was still making forward progress. This prevented customers from understanding whether data was being processed.
 
 Additionally, our system typically sees increased load during nightly periods as customers schedule large workloads to run in off-hours. This is consistent and usually subsides before the daily increase during US morning hours.
 
-Three factors combined to produce an incident of this size: a latent SDK bug that amplified event volume for a high-throughput customer, insufficient tenant isolation in the state store and event stream layers, and the typical increased nightly load.
+Three factors combined to produce an incident of this size: a bug in our SDK that amplified event volume for a high-throughput customer, insufficient tenant isolation in the state store and event stream layers, and the typical increased nightly load.
 
 ## Impact

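The changed bullet above notes that bulk cancellations can carry future timestamps, forcing the event stream consumer to evaluate each cancellation's payload expression against runs in realtime during scheduling. A minimal sketch of why that is per-run work (names and shapes are hypothetical, not Inngest's internals):

```typescript
// Hypothetical sketch of matching pending bulk cancellations against
// newly scheduled runs. Illustrative only; not Inngest's implementation.

interface BulkCancellation {
  functionId: string;
  // Cancel runs started before this timestamp; it may lie in the future.
  startedBefore: number;
  // Optional predicate over the run's triggering event payload.
  ifExpression?: (eventData: Record<string, unknown>) => boolean;
}

interface Run {
  functionId: string;
  startedAt: number;
  eventData: Record<string, unknown>;
}

// Because `startedBefore` can be in the future, a cancellation cannot be
// resolved once up front: every run entering scheduling must be checked
// against all pending cancellations, including their payload expressions.
function shouldCancel(run: Run, pending: BulkCancellation[]): boolean {
  return pending.some(
    (c) =>
      c.functionId === run.functionId &&
      run.startedAt < c.startedBefore &&
      (c.ifExpression === undefined || c.ifExpression(run.eventData))
  );
}

// Example: cancel all "import-csv" runs for tenant 42 started before tomorrow.
const pending: BulkCancellation[] = [
  {
    functionId: "import-csv",
    startedBefore: Date.now() + 24 * 60 * 60 * 1000,
    ifExpression: (d) => d.tenantId === 42,
  },
];

const hit = shouldCancel(
  { functionId: "import-csv", startedAt: Date.now(), eventData: { tenantId: 42 } },
  pending
);
const miss = shouldCancel(
  { functionId: "import-csv", startedAt: Date.now(), eventData: { tenantId: 7 } },
  pending
);
```

With many pending future-dated cancellations, this check runs for every scheduled run, which is why a burst of bulk cancellations adds load to the scheduling path itself.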
@@ -84,15 +84,15 @@ We have already shipped a number of fixes during and immediately after the incid
 
 ### In progress
 
-- **Review of all system metrics throughout the incident for earlier detection.** The 10-hour gap between scheduling delays beginning and our detection when delays increased significantly was a failure. We are performing an audit of all metrics and determine new metrics and indicators that should have alerted us when slowness started, before the state store shard became saturated.
+- **Review of all system metrics throughout the incident for earlier detection.** The 10-hour gap between the start of scheduling delays and our detection, once delays increased significantly, was a failure. We are auditing all metrics and determining new metrics and indicators that should have alerted us when slowness started, before the state store shard became saturated.
 - **New training, workflows, and automations for faster customer notification via our status page.** Even with our current alerting, our customers should have been notified via our customer status page much faster. We are also improving the visibility of active incidents within our dashboard.
 - **Additional tenant isolation across the critical path.** We are actively working on further sharding event stream processing (function scheduling, async step operations, batching) by tenant groups and individual tenants, and on isolating the state store further, to prevent noisy-neighbor issues from degrading scheduling for others. This is now our top engineering priority.
 - **Removing batching from the shared state store entirely**, so that batching load is permanently decoupled from run scheduling.
 - **Moving cancellations off of unsharded storage**, so that bulk cancellations cannot create slowlogs that affect other workloads.
 - **Decoupling the event data ingestion pipeline from the scheduling critical path**, so observability data is not delayed by backups in function scheduling.
-- **Reviewing and honoring SLA commitments** for contracted users. Impacted customers with contracted SLAs are asked to reach out to [our support team](https://support.inngest.com).
+- **Reviewing and honoring SLA commitments** for contracted users. We will be proactively reaching out to contracted customers.
 - **New SLOs and improved alerting on state store CPU, I/O, and per-tenant latency**, so we detect and page on hot-shard conditions well before they cause customer impact.
-- Our team already has a major project under way to migrate our state store database to another technology and self-host this to have additional control over this part of the stack. Our current database is currently managed by a cloud vendor and this limits our operational control and speed of recovery.
+- Our team already has a major project under way to migrate our state store database to another technology and self-host it for additional control over this part of the stack. Our database is currently managed by a cloud vendor, which limits our operational control and speed of recovery.
 
 ## In closing

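The remediation list above centers on sharding the critical path by tenant so a noisy neighbor saturates only its own shard. A common building block for this is a stable hash of the tenant ID; a minimal sketch under that assumption (hypothetical scheme, not Inngest's actual sharding):

```typescript
// Hypothetical sketch of tenant-keyed shard assignment: each tenant maps
// deterministically to one shard, bounding the blast radius of its load.
// Illustrates the general technique only, not Inngest's implementation.

const SHARD_COUNT = 16;

// Stable 32-bit FNV-1a hash of a string.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// The same tenant always lands on the same shard, so one tenant's burst
// saturates a single shard's queue and executors instead of the shared pool.
function shardFor(tenantId: string): number {
  return fnv1a(tenantId) % SHARD_COUNT;
}

const a = shardFor("tenant-a");
const b = shardFor("tenant-a");
const c = shardFor("tenant-b");
```

A simple modulo mapping like this makes rebalancing expensive when the shard count changes; consistent hashing is the usual refinement, but the isolation property shown here is the same.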