10 changes: 4 additions & 6 deletions _config.yml
@@ -34,7 +34,6 @@ show_excerpts: true
# pages linked on top header
# NOTE: items are rendered in the order here
header_pages:
- pages/stories.md
- pages/learn.md
- pages/meetup.md
- pages/meetup-pune.md
@@ -47,14 +46,13 @@ kramdown:
input: GFM
hard_wrap: false

# posts to /incidents/ url parmalink
# FIXME: find a better
permalink: /:year-:month-:day/:title/
# posts to /stories/ url permalink
permalink: /stories/:title/


# pagination config
paginate: 5
paginate_path: "/page:num/"
paginate: 10
paginate_path: "/stories/page:num/"
excerpt_separator: <!--more-->

# Build settings
3 changes: 2 additions & 1 deletion _includes/header.html
@@ -17,9 +17,10 @@
</label>

<div class="trigger">
<a class="page-link" href="/stories/">Stories</a>
{%- for path in page_paths -%}
{%- assign my_page = site.pages | where: "path", path | first -%}
{%- if my_page.title -%}
{%- if my_page.title and my_page.title != 'Stories' -%}
<a class="page-link" href="{{ my_page.url | relative_url }}">{{ my_page.title | escape }}</a>
{%- endif -%}
{%- endfor -%}
24 changes: 24 additions & 0 deletions _posts/2017-01-31-gitlab-database-incident.md
@@ -0,0 +1,24 @@
---
layout: post
title: "GitLab Database Incident"
date: 2017-01-31
categories: [outage, database, human-error]
tags: [gitlab, database, postgresql, backup, human-error]
company: GitLab
incident_date: 2017-01-31
duration: "18+ hours"
affected_services: ["GitLab.com", "GitLab CI", "GitLab Pages"]
---

On January 31, 2017, a GitLab engineer accidentally deleted the production database directory while troubleshooting performance issues, running `rm -rf` on the wrong server. The incident revealed that all five backup systems had been failing silently, resulting in 18 hours of downtime and 6 hours of data loss.

GitLab's transparent communication during the crisis - including live blogs, real-time updates, and a publicly shared recovery document - turned a disaster into a case study in crisis management and operational transparency.

<!--more-->
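
One durable lesson is that a backup only exists once a restore, or at least a freshness check, has proven it does. The sketch below is a generic, hypothetical monitoring script, not GitLab's actual tooling; the backup directory, file pattern, and thresholds are placeholder assumptions. Wired into an alerting system, a check like this would have flagged a backup job that was silently producing nothing.

```python
#!/usr/bin/env python3
"""Hypothetical backup freshness check -- illustrative sketch, not GitLab's tooling."""
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")   # placeholder location
MAX_AGE_HOURS = 24                           # newest dump must be younger than this
MIN_SIZE_BYTES = 10 * 1024 * 1024            # and not implausibly small

def main() -> int:
    dumps = sorted(BACKUP_DIR.glob("*.sql.gz"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        print("CRITICAL: no backups found at all")
        return 2
    latest = dumps[-1]
    age_hours = (time.time() - latest.stat().st_mtime) / 3600
    size_bytes = latest.stat().st_size
    if age_hours > MAX_AGE_HOURS:
        print(f"CRITICAL: newest backup {latest.name} is {age_hours:.1f}h old")
        return 2
    if size_bytes < MIN_SIZE_BYTES:
        print(f"CRITICAL: newest backup {latest.name} is only {size_bytes} bytes")
        return 2
    print(f"OK: {latest.name} ({age_hours:.1f}h old, {size_bytes} bytes)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```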


## Sources

- [GitLab.com database incident - January 31st 2017](https://about.gitlab.com/blog/2017/02/01/gitlab-dot-com-database-incident/)
- [Postmortem of database outage of January 31](https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/)
- [Google Doc - Live incident response](https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub) (Historical)
16 changes: 16 additions & 0 deletions _posts/2017-04-15-engineering-war-stories.md
@@ -0,0 +1,16 @@
---
layout: post
title: "Engineering War Stories Collection"
date: 2017-04-15
company: "HackerNews Community"
duration: "N/A"
tags: [hackernews, community, engineering, war-stories, collection]
---

A HackerNews thread collecting engineering war stories from the community. Software engineers share their most memorable production incidents, debugging adventures, and lessons learned from system failures.

The thread contains dozens of real-world experiences from engineers across different companies and industries, providing insights into common failure patterns and recovery strategies.

## Sources

- [Ask HN: Tell me an engineering war story from your career](https://news.ycombinator.com/item?id=13987098)
21 changes: 21 additions & 0 deletions _posts/2018-10-21-github-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "GitHub Outage"
date: 2018-10-21
categories: [outage, database, infrastructure]
tags: [github, database, mysql, infrastructure, git]
company: GitHub
incident_date: 2018-10-21
duration: "24+ hours"
affected_services: ["GitHub.com", "Git Operations", "Issues", "Pull Requests"]
---

On October 21, 2018, GitHub experienced a major outage lasting over 24 hours due to network connectivity issues between their East and West Coast data centers. The incident resulted in database inconsistencies and required extensive data reconciliation to restore service.

The outage affected millions of developers worldwide and highlighted the challenges of maintaining consistency in distributed systems during network partitions, leading to significant improvements in GitHub's infrastructure resilience.

<!--more-->

## Sources

- [October 21 post-incident analysis - GitHub Blog](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
21 changes: 21 additions & 0 deletions _posts/2019-06-03-google-cloud-disruption.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Google Cloud June 3 2019 Service Disruption"
date: 2019-06-03
categories: [outage, cloud, networking]
tags: [google-cloud, networking, infrastructure]
company: Google Cloud
incident_date: 2019-06-03
duration: "Several hours"
affected_services: ["Google Cloud Platform", "G Suite", "YouTube"]
---

On June 3, 2019, Google Cloud experienced a widespread service disruption affecting Google Cloud Platform, G Suite, and YouTube. The incident was caused by network congestion in the eastern United States that impacted Google's global infrastructure.

The outage demonstrated how network-level issues can cascade through interconnected cloud services, affecting not just enterprise customers but also consumer services used by millions worldwide.

<!--more-->

## Sources

- [June 3 2019 disruption](https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption)
24 changes: 24 additions & 0 deletions _posts/2019-07-02-cloudflare-regex-outage.md
@@ -0,0 +1,24 @@
---
layout: post
title: "Details of the Cloudflare outage on July 2, 2019"
date: 2019-07-02
categories: [outage, performance, regex]
tags: [cloudflare, regex, cpu, performance, waf]
company: Cloudflare
incident_date: 2019-07-02
duration: "27 minutes"
affected_services: ["Cloudflare CDN", "Web Application Firewall"]
---

On July 2, 2019, Cloudflare deployed a Web Application Firewall rule containing a regex with catastrophic backtracking behavior. The pattern consumed excessive CPU resources, causing global server unresponsiveness and widespread internet disruption for 27 minutes.

The incident demonstrated how a single line of code can have massive global impact. Recovery was achieved by disabling the problematic WAF rule, leading to improvements in regex testing, staged rollouts, and automated circuit breakers for resource-intensive operations.

<!--more-->
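
The failure mode is easy to reproduce outside the WAF. The sketch below is a minimal, hypothetical demonstration using the textbook `(a+)+$` pattern rather than the actual rule, whose problematic core was `.*(?:.*=.*)`: on a non-matching input, a backtracking engine must try exponentially many ways of splitting the string, so each extra character roughly doubles the CPU time spent before it can report "no match".

```python
import re
import time

# Nested quantifiers force a backtracking engine to try exponentially many
# ways of splitting the input before it can conclude there is no match.
PATTERN = re.compile(r"^(a+)+$")

for n in range(16, 25, 2):
    subject = "a" * n + "!"          # the trailing "!" guarantees the match fails
    start = time.perf_counter()
    PATTERN.match(subject)
    elapsed = time.perf_counter() - start
    # time roughly doubles per extra character, so quadruples per printed row
    print(f"n={n:2d}  {elapsed:8.3f}s")
```

Regex engines with linear-time guarantees, such as RE2 or Rust's `regex` crate, rule out this behaviour by construction, and moving to such an engine was among the follow-ups Cloudflare discussed in its post-mortem.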


## Sources

- [Details of the Cloudflare outage on July 2, 2019](https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/)
- [Regex Performance Analysis Tools](https://regex101.com/) - for testing regex patterns
- [Regular Expression Denial of Service (ReDoS)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)
16 changes: 16 additions & 0 deletions _posts/2019-11-18-launchpad-upstream-outage.md
@@ -0,0 +1,16 @@
---
layout: post
title: "Launchpad Upstream Outage Chain Reaction"
date: 2019-11-18
company: "Ubuntu/Canonical"
duration: "Several hours"
tags: [launchpad, ubuntu, upstream, chain-reaction, dependencies]
---

An upstream outage at Launchpad caused a cascading chain of issues across multiple Ubuntu and open-source project infrastructure systems. This incident demonstrates how upstream service failures can trigger widespread downstream effects.

The outage highlighted the interconnected nature of modern software development infrastructure and the importance of understanding dependency chains when building resilient systems.

## Sources

- [Impact of an upstream outage - chain of issues due to launchpad outage](https://github.com/electron0zero/failure-modes/issues/58)
21 changes: 21 additions & 0 deletions _posts/2020-01-23-grafana-labs-gcp-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Grafana Labs GCP Outage"
date: 2020-01-23
categories: [outage, cloud, infrastructure]
tags: [grafana, gcp, persistent-disk, cloud, monitoring]
company: Grafana Labs
incident_date: 2020-01-23
duration: "23 hours"
affected_services: ["Grafana Cloud", "Monitoring Services"]
---

On January 23, 2020, Grafana Labs experienced a 23-hour outage caused by a Google Cloud Platform persistent disk incident. What started as a GCP infrastructure issue snowballed into an extended outage due to cascading failures and recovery complications.

The incident demonstrated how cloud provider issues can compound with application-level problems, teaching important lessons about cloud dependency management and disaster recovery planning.

<!--more-->

## Sources

- [How a GCP Persistent Disk incident snowballed into a 23-hour outage - Grafana Labs](https://grafana.com/blog/2020/01/23/how-a-gcp-persistent-disk-incident-snowballed-into-a-23-hour-outage-and-taught-us-some-important-lessons/)
21 changes: 21 additions & 0 deletions _posts/2020-02-12-destiny-2-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Destiny 2 Outage"
date: 2020-02-12
categories: [outage, gaming, infrastructure]
tags: [bungie, destiny-2, gaming, infrastructure]
company: Bungie
incident_date: 2020-02-12
duration: "Several hours"
affected_services: ["Destiny 2", "Game Servers", "Matchmaking"]
---

On February 12, 2020, Bungie's Destiny 2 experienced a significant outage affecting game servers and matchmaking services. The incident prevented players from accessing the online multiplayer game and highlighted the infrastructure challenges of maintaining always-online gaming services.

Gaming platforms face unique scalability and reliability challenges, as players expect near-zero downtime for their entertainment and social experiences.

<!--more-->

## Sources

- [Destiny 2 Outage Feb 12, 2020](https://www.bungie.net/en/Explore/Detail/News/48723)
21 changes: 21 additions & 0 deletions _posts/2020-02-28-netflix-container-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Netflix Container Infrastructure Incident"
date: 2020-02-28
categories: [outage, containers, infrastructure]
tags: [netflix, containers, kubernetes, infrastructure]
company: Netflix
incident_date: 2020-02-28
duration: "Several hours"
affected_services: ["Netflix Streaming", "Container Infrastructure"]
---

Netflix experienced an incident where containers were taking out nodes in their infrastructure, affecting streaming services. The issue highlighted the challenges of container orchestration at Netflix's massive scale and the potential for container-level issues to impact entire nodes.

The incident demonstrated the complexities of managing containerized workloads in production environments and the importance of proper resource isolation and monitoring.

<!--more-->

## Sources

- [Containers taking out nodes](https://twitter.com/sargun/status/1228495222658613250?s=19)
21 changes: 21 additions & 0 deletions _posts/2020-04-09-github-april-disruptions.md
@@ -0,0 +1,21 @@
---
layout: post
title: "GitHub April 2020 Disruptions"
date: 2020-04-09
categories: [outage, infrastructure, git]
tags: [github, infrastructure, git, distributed-systems]
company: GitHub
incident_date: 2020-04-09
duration: "Several hours"
affected_services: ["GitHub.com", "Git Operations", "Actions", "Pages"]
---

In April 2020, GitHub experienced multiple service disruptions affecting various platform components including Git operations, GitHub Actions, and Pages. The incidents occurred during increased usage due to remote work adoption during the COVID-19 pandemic.

The disruptions highlighted the challenges of maintaining service reliability during unprecedented traffic growth and the critical role of version control platforms in remote development workflows.

<!--more-->

## Sources

- [April 2020 disruptions](https://github.blog/2020-05-22-april-service-disruptions-analysis/)
21 changes: 21 additions & 0 deletions _posts/2020-05-03-algolia-salt-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Salt Incident: May 3rd, 2020 Retrospective and Update"
date: 2020-05-03
categories: [outage, infrastructure, security]
tags: [algolia, salt, configuration-management, security]
company: Algolia
incident_date: 2020-05-03
duration: "Several hours"
affected_services: ["Algolia Search", "Configuration Management"]
---

On May 3, 2020, Algolia experienced a service disruption related to their Salt configuration management system. The incident affected search services and highlighted vulnerabilities in configuration management infrastructure.

The outage demonstrated the critical role of configuration management systems in maintaining service reliability and the potential security implications when these systems are compromised or misconfigured.

<!--more-->

## Sources

- [Salt Incident May 2020](https://blog.algolia.com/salt-incident-may-3rd-2020-retrospective-and-update/)
21 changes: 21 additions & 0 deletions _posts/2020-05-12-slack-may-12-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Slack May 12, 2020 Outage"
date: 2020-05-12
categories: [outage, infrastructure, distributed-systems]
tags: [slack, infrastructure, messaging, distributed-systems]
company: Slack
incident_date: 2020-05-12
duration: "8+ hours"
affected_services: ["Slack Messaging", "File Sharing", "Search"]
---

On May 12, 2020, Slack experienced a major outage caused by HAProxy configuration bugs that prevented proper service discovery with Consul. The incident began with database issues at 8:30 AM Pacific, followed by the main outage at 4:45 PM Pacific, which was resolved by a rolling restart of the HAProxy fleet.

The outage occurred during peak remote work adoption due to COVID-19, highlighting the critical dependency of distributed teams on messaging platforms and the cascading effects of infrastructure configuration issues.

<!--more-->
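
The root cause described in the write-up is load balancer state drifting away from what the service registry knew, until a restart forced resynchronization. As a generic, hypothetical illustration (not Slack's tooling; the Consul address, service name, and expected backend set are placeholder assumptions), a periodic job can compare the instances Consul currently reports as healthy with the backends the proxy is actually routing to:

```python
"""Hypothetical service-discovery drift check -- illustrative sketch only."""
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"                       # placeholder Consul agent address
SERVICE = "webapp"                                     # placeholder service name
PROXY_BACKENDS = {"10.0.0.11:8080", "10.0.0.12:8080"}  # e.g. parsed from the proxy's config

def consul_healthy_instances(service):
    """Return 'address:port' for every instance whose health checks are passing."""
    url = f"{CONSUL}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url, timeout=5) as resp:
        entries = json.load(resp)
    return {f'{e["Service"]["Address"]}:{e["Service"]["Port"]}' for e in entries}

healthy = consul_healthy_instances(SERVICE)
missing = healthy - PROXY_BACKENDS   # healthy per Consul, but the proxy is not routing to them
stale = PROXY_BACKENDS - healthy     # the proxy still routes to them, but Consul no longer reports them healthy
if missing or stale:
    print(f"DRIFT: missing={sorted(missing)} stale={sorted(stale)}")
else:
    print("OK: proxy backends match Consul's healthy set")
```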

## Sources

- [A terrible, horrible, no good, very bad day at Slack - Slack Engineering](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
21 changes: 21 additions & 0 deletions _posts/2020-05-30-algolia-ssl-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "May 30 SSL incident"
date: 2020-05-30
categories: [outage, ssl, security]
tags: [algolia, ssl, certificate, security]
company: Algolia
incident_date: 2020-05-30
duration: "Several hours"
affected_services: ["Algolia Search", "SSL/TLS Services"]
---

On May 30, 2020, Algolia experienced an SSL certificate-related incident that affected their search services. The incident involved certificate management issues that prevented secure connections to Algolia's services.

SSL certificate management remains a common source of outages across the industry, highlighting the importance of automated certificate renewal and monitoring systems.

<!--more-->
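
As a small illustration of that last point, a scheduled check such as the hypothetical sketch below (the hostnames and warning threshold are placeholder assumptions, not Algolia's configuration) can surface an expiring certificate long before clients start failing TLS handshakes:

```python
import socket
import ssl
import time

WARN_DAYS = 21                 # placeholder alerting threshold
HOSTS = ["example.com"]        # placeholder hostnames

def days_until_expiry(hostname, port=443):
    """Return the number of days until the TLS certificate presented by hostname expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

for host in HOSTS:
    days = days_until_expiry(host)
    status = "OK" if days > WARN_DAYS else "WARN"
    print(f"{status}: {host} certificate expires in {days:.0f} days")
```

In practice a check like this typically runs against every public endpoint and feeds an alerting system, alongside automated renewal.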

## Sources

- [SSL incident May 30](https://www.algolia.com/blog/engineering/may-30-ssl-incident/)
21 changes: 21 additions & 0 deletions _posts/2020-07-15-twitter-security-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Twitter Security Incident"
date: 2020-07-15
categories: [security, social-engineering, incident]
tags: [twitter, security, social-engineering, bitcoin, scam]
company: Twitter
incident_date: 2020-07-15
duration: "Several hours"
affected_services: ["Twitter", "High-Profile Accounts"]
---

On July 15, 2020, Twitter experienced a major security incident where attackers used social engineering to gain access to internal tools and compromise high-profile accounts including Barack Obama, Elon Musk, and Bill Gates. The compromised accounts were used to promote a Bitcoin scam.

The incident highlighted vulnerabilities in social engineering attacks against employees with privileged access and led to significant improvements in Twitter's internal security practices and access controls.

<!--more-->

## Sources

- [An update on our security incident - Twitter Blog](https://blog.twitter.com/en_us/topics/company/2020/an-update-on-our-security-incident.html)
16 changes: 16 additions & 0 deletions _posts/2020-07-19-debugging-distributed-system.md
@@ -0,0 +1,16 @@
---
layout: post
title: "Debugging a Misbehaving Distributed System"
date: 2020-07-19
company: "Community Story"
duration: "N/A"
tags: [debugging, distributed-systems, community, networking]
---

A community-shared story about debugging complex distributed system issues, where seemingly unrelated symptoms led to discovering deep architectural problems.

This Twitter thread by Erin provides insights into the detective work required when distributed systems start misbehaving in unexpected ways.

## Sources

- [Original Twitter Thread by Erin](https://twitter.com/erincandescent/status/1281280157073002496)