Commit c46be97

committed
Improve closing
1 parent 02eb217 commit c46be97

1 file changed

Lines changed: 4 additions & 2 deletions

content/blog/2026-04-17-incident-report-for-april-15-2026.md

@@ -92,11 +92,13 @@ We have already shipped a number of fixes during and immediately after the incid
 - **Decoupling the event data ingestion pipeline from the scheduling critical path**, so observability data is not delayed by backups with function scheduling.
 - **Reviewing and honoring SLA commitments** for contracted users. Impacted customers with contracted SLAs are asked to reach out to [our support team](https://support.inngest.com).
 - **New SLOs and improved alerting on state store CPU, I/O, and per-tenant latency**, so we detect and page on hot-shard conditions well before they cause customer impact.
-- Our team already has a major project under way to migrate our state store database to another technology and self-host this to have additional control over this part of the stack. Our current database is currently managed by a cloud vendor and this limits our operational control.
+- Our team already has a major project under way to migrate our state store database to another technology and self-host this to have additional control over this part of the stack. Our current database is currently managed by a cloud vendor and this limits our operational control and speed of recovery.
 
 ## In closing
 
-We know that many of our customers depend on Inngest as critical infrastructure for quick and timely execution, and a 19-hour scheduling delay is not acceptable. We are sorry for the impact this had on your products and your own customers.
+We know that many of our customers depend on Inngest as critical infrastructure for quick and timely execution, and a 19-hour scheduling delay is not acceptable. In this incident, we failed in time-to-detect, time-to-acknowledge, and time-to-recover. There is no excuse for this and we will improve all of these.
+
+We are deeply sorry for the impact this had on your products and your own customers.
 
 The underlying cause, insufficient tenant isolation with our shared state store, is something we have been working to eliminate, and this incident makes clear that we need to move faster. Tenant isolation, better and earlier alerting on our critical path, and a decoupled data ingestion pipeline are now the top priorities for our systems and platform teams, with work already in progress.
 