
Commit fabd27f

docs: expand journal entry with insights on AWS outage and resilience strategies
- Added detailed reflections on a recent AWS outage, discussing the implications for business resilience and operational strategies.
- Included personal anecdotes and theories related to system reliability and the importance of engineering for outages.
- Enhanced the entry with links to relevant resources and concepts, such as "importance radiation" and the idea of "janky fallbacks" in IT solutions.
1 parent 3dff13e commit fabd27f

File tree

1 file changed

+19
-1
lines changed


journals/2025_10_20.md

Lines changed: 19 additions & 1 deletion
@@ -1,2 +1,20 @@
11
- #Filed
2-
- [[ast-grep]]
2+
- [[ast-grep]]
3+
- [[AstroBuild]] a [[Jamstack]] [[Static Site Generator]]
4+
- "component islands" is a great idea [Islands architecture | Docs](https://docs.astro.build/en/concepts/islands/)
5+
- massive [[AWS]] outage
6+
collapsed:: true
7+
- [AWS Multiple Services Down in us-east-1 | Hacker News](https://news.ycombinator.com/item?id=45640838)
8+
- I have this theory of something I call “importance radiation.”
9+
- An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
10+
- Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
11+
- There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast that with the South Korean government, which seems largely unaffected by the recent loss of an entire building full of machines with no backups.
12+
- I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
13+
- The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around it for a day or two, that's almost certainly easier and cheaper than building a multi-cloud complexity hellscape or dragging it all back on-prem.
14+
- Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through that constraint. When you can just slap another EC2 on the pile, you can run laps around your peers.
15+
- An enduring image for me is from when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks to operate them.
16+
- I'm getting old, but this was the 1980s, not the 1800s.
17+
- In other words, to agree with your point about resilience:
18+
- A lot of the time even some really janky fallbacks will be enough.
19+
- In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy), but they’re expensive and they don’t look good on quarterly slides.
20+
-
