10 changes: 4 additions & 6 deletions _config.yml
@@ -34,7 +34,6 @@ show_excerpts: true
# pages linked on top header
# NOTE: items are rendered in the order here
header_pages:
- pages/stories.md
- pages/learn.md
- pages/meetup.md
- pages/meetup-pune.md
@@ -47,14 +46,13 @@ kramdown:
input: GFM
hard_wrap: false

# posts to /incidents/ url parmalink
# FIXME: find a better
permalink: /:year-:month-:day/:title/
# posts to /stories/ url permalink
permalink: /stories/:title/


# pagination config
paginate: 5
paginate_path: "/page:num/"
paginate: 10
paginate_path: "/stories/page:num/"
excerpt_separator: <!--more-->

# Build settings
3 changes: 2 additions & 1 deletion _includes/header.html
@@ -17,9 +17,10 @@
</label>

<div class="trigger">
<a class="page-link" href="/stories/">Stories</a>
{%- for path in page_paths -%}
{%- assign my_page = site.pages | where: "path", path | first -%}
{%- if my_page.title -%}
{%- if my_page.title and my_page.title != 'Stories' -%}
<a class="page-link" href="{{ my_page.url | relative_url }}">{{ my_page.title | escape }}</a>
{%- endif -%}
{%- endfor -%}
24 changes: 24 additions & 0 deletions _posts/2017-01-31-gitlab-database-incident.md
@@ -0,0 +1,24 @@
---
layout: post
title: "GitLab Database Incident"
date: 2017-01-31
categories: [outage, database, human-error]
tags: [gitlab, database, postgresql, backup, human-error]
company: GitLab
incident_date: 2017-01-31
duration: "18+ hours"
affected_services: ["GitLab.com", "GitLab CI", "GitLab Pages"]
---

On January 31, 2017, a GitLab engineer accidentally deleted the production database directory while troubleshooting performance issues, running `rm -rf` on the wrong server. The incident revealed that all five backup systems had been failing silently, resulting in 18 hours of downtime and 6 hours of data loss.

GitLab's transparent communication during the crisis - including live blogs, real-time updates, and a publicly shared recovery document - turned a disaster into a case study in crisis management and operational transparency.

<!--more-->
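
One durable lesson is that a backup only exists once a restore, or at least a freshness check, has proven it does. The sketch below is a generic, hypothetical monitoring script, not GitLab's actual tooling; the backup directory, file pattern, and thresholds are placeholder assumptions. Wired into an alerting system, a check like this would have flagged a backup job that was silently producing nothing.

```python
#!/usr/bin/env python3
"""Hypothetical backup freshness check -- illustrative sketch, not GitLab's tooling."""
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")   # placeholder location
MAX_AGE_HOURS = 24                           # newest dump must be younger than this
MIN_SIZE_BYTES = 10 * 1024 * 1024            # and not implausibly small

def main() -> int:
    dumps = sorted(BACKUP_DIR.glob("*.sql.gz"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        print("CRITICAL: no backups found at all")
        return 2
    latest = dumps[-1]
    age_hours = (time.time() - latest.stat().st_mtime) / 3600
    size_bytes = latest.stat().st_size
    if age_hours > MAX_AGE_HOURS:
        print(f"CRITICAL: newest backup {latest.name} is {age_hours:.1f}h old")
        return 2
    if size_bytes < MIN_SIZE_BYTES:
        print(f"CRITICAL: newest backup {latest.name} is only {size_bytes} bytes")
        return 2
    print(f"OK: {latest.name} ({age_hours:.1f}h old, {size_bytes} bytes)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```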


## Sources

- [GitLab.com database incident - January 31st 2017](https://about.gitlab.com/blog/2017/02/01/gitlab-dot-com-database-incident/)
- [Postmortem of database outage of January 31](https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/)
- [Google Doc - Live incident response](https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub) (Historical)
16 changes: 16 additions & 0 deletions _posts/2017-04-15-engineering-war-stories.md
@@ -0,0 +1,16 @@
---
layout: post
title: "Engineering War Stories Collection"
date: 2017-04-15
company: "HackerNews Community"
duration: "N/A"
tags: [hackernews, community, engineering, war-stories, collection]
---

A HackerNews thread collecting engineering war stories from the community. Software engineers share their most memorable production incidents, debugging adventures, and lessons learned from system failures.

The thread contains dozens of real-world experiences from engineers across different companies and industries, providing insights into common failure patterns and recovery strategies.

## Sources

- [Ask HN: Tell me an engineering war story from your career](https://news.ycombinator.com/item?id=13987098)
21 changes: 21 additions & 0 deletions _posts/2018-10-21-github-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "GitHub Outage"
date: 2018-10-21
categories: [outage, database, infrastructure]
tags: [github, database, mysql, infrastructure, git]
company: GitHub
incident_date: 2018-10-21
duration: "24+ hours"
affected_services: ["GitHub.com", "Git Operations", "Issues", "Pull Requests"]
---

On October 21, 2018, GitHub experienced a major outage lasting over 24 hours due to network connectivity issues between their East and West Coast data centers. The incident resulted in database inconsistencies and required extensive data reconciliation to restore service.

The outage affected millions of developers worldwide and highlighted the challenges of maintaining consistency in distributed systems during network partitions, leading to significant improvements in GitHub's infrastructure resilience.

<!--more-->

## Sources

- [October 21 post-incident analysis - GitHub Blog](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
21 changes: 21 additions & 0 deletions _posts/2019-06-03-google-cloud-disruption.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Google Cloud June 3 2019 Service Disruption"
date: 2019-06-03
categories: [outage, cloud, networking]
tags: [google-cloud, networking, infrastructure]
company: Google Cloud
incident_date: 2019-06-03
duration: "Several hours"
affected_services: ["Google Cloud Platform", "G Suite", "YouTube"]
---

On June 3, 2019, Google Cloud experienced a widespread service disruption affecting Google Cloud Platform, G Suite, and YouTube. The incident was caused by network congestion in the eastern United States that impacted Google's global infrastructure.

The outage demonstrated how network-level issues can cascade through interconnected cloud services, affecting not just enterprise customers but also consumer services used by millions worldwide.

<!--more-->

## Sources

- [June 3 2019 disruption](https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption)
24 changes: 24 additions & 0 deletions _posts/2019-07-02-cloudflare-regex-outage.md
@@ -0,0 +1,24 @@
---
layout: post
title: "Details of the Cloudflare outage on July 2, 2019"
date: 2019-07-02
categories: [outage, performance, regex]
tags: [cloudflare, regex, cpu, performance, waf]
company: Cloudflare
incident_date: 2019-07-02
duration: "27 minutes"
affected_services: ["Cloudflare CDN", "Web Application Firewall"]
---

On July 2, 2019, Cloudflare deployed a Web Application Firewall rule containing a regex with catastrophic backtracking behavior. The pattern consumed excessive CPU resources, causing global server unresponsiveness and widespread internet disruption for 27 minutes.

The incident demonstrated how a single line of code can have massive global impact. Recovery was achieved by disabling the problematic WAF rule, leading to improvements in regex testing, staged rollouts, and automated circuit breakers for resource-intensive operations.

<!--more-->
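
The failure mode is easy to reproduce outside the WAF. The sketch below is a minimal, hypothetical demonstration using the textbook `(a+)+$` pattern rather than the actual rule, whose problematic core was `.*(?:.*=.*)`: on a non-matching input, a backtracking engine must try exponentially many ways of splitting the string, so each extra character roughly doubles the CPU time spent before it can report "no match".

```python
import re
import time

# Nested quantifiers force a backtracking engine to try exponentially many
# ways of splitting the input before it can conclude there is no match.
PATTERN = re.compile(r"^(a+)+$")

for n in range(16, 25, 2):
    subject = "a" * n + "!"          # the trailing "!" guarantees the match fails
    start = time.perf_counter()
    PATTERN.match(subject)
    elapsed = time.perf_counter() - start
    # time roughly doubles per extra character, so quadruples per printed row
    print(f"n={n:2d}  {elapsed:8.3f}s")
```

Regex engines with linear-time guarantees, such as RE2 or Rust's `regex` crate, rule out this behaviour by construction, and moving to such an engine was among the follow-ups Cloudflare discussed in its post-mortem.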


## Sources

- [Details of the Cloudflare outage on July 2, 2019](https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/)
- [Regex Performance Analysis Tools](https://regex101.com/) - for testing regex patterns
- [Regular Expression Denial of Service (ReDoS)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)
16 changes: 16 additions & 0 deletions _posts/2019-11-18-launchpad-upstream-outage.md
@@ -0,0 +1,16 @@
---
layout: post
title: "Launchpad Upstream Outage Chain Reaction"
date: 2019-11-18
company: "Ubuntu/Canonical"
duration: "Several hours"
tags: [launchpad, ubuntu, upstream, chain-reaction, dependencies]
---

An upstream outage at Launchpad caused a cascading chain of issues across multiple Ubuntu and open-source project infrastructure systems. This incident demonstrates how upstream service failures can trigger widespread downstream effects.

The outage highlighted the interconnected nature of modern software development infrastructure and the importance of understanding dependency chains when building resilient systems.

## Sources

- [Impact of an upstream outage - chain of issues due to launchpad outage](https://github.com/electron0zero/failure-modes/issues/58)
21 changes: 21 additions & 0 deletions _posts/2020-01-23-grafana-labs-gcp-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Grafana Labs GCP Outage"
date: 2020-01-23
categories: [outage, cloud, infrastructure]
tags: [grafana, gcp, persistent-disk, cloud, monitoring]
company: Grafana Labs
incident_date: 2020-01-23
duration: "23 hours"
affected_services: ["Grafana Cloud", "Monitoring Services"]
---

On January 23, 2020, Grafana Labs experienced a 23-hour outage caused by a Google Cloud Platform persistent disk incident. What started as a GCP infrastructure issue snowballed into an extended outage due to cascading failures and recovery complications.

The incident demonstrated how cloud provider issues can compound with application-level problems, teaching important lessons about cloud dependency management and disaster recovery planning.

<!--more-->

## Sources

- [How a GCP Persistent Disk incident snowballed into a 23-hour outage - Grafana Labs](https://grafana.com/blog/2020/01/23/how-a-gcp-persistent-disk-incident-snowballed-into-a-23-hour-outage-and-taught-us-some-important-lessons/)
21 changes: 21 additions & 0 deletions _posts/2020-02-12-destiny-2-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Destiny 2 Outage"
date: 2020-02-12
categories: [outage, gaming, infrastructure]
tags: [bungie, destiny-2, gaming, infrastructure]
company: Bungie
incident_date: 2020-02-12
duration: "Several hours"
affected_services: ["Destiny 2", "Game Servers", "Matchmaking"]
---

On February 12, 2020, Bungie's Destiny 2 experienced a significant outage affecting game servers and matchmaking services. The incident prevented players from accessing the online multiplayer game and highlighted the infrastructure challenges of maintaining always-online gaming services.

Gaming platforms face unique scalability and reliability challenges, as players expect near-zero downtime for their entertainment and social experiences.

<!--more-->

## Sources

- [Destiny 2 Outage Feb 12, 2020](https://www.bungie.net/en/Explore/Detail/News/48723)
21 changes: 21 additions & 0 deletions _posts/2020-02-28-netflix-container-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Netflix Container Infrastructure Incident"
date: 2020-02-28
categories: [outage, containers, infrastructure]
tags: [netflix, containers, kubernetes, infrastructure]
company: Netflix
incident_date: 2020-02-28
duration: "Several hours"
affected_services: ["Netflix Streaming", "Container Infrastructure"]
---

Netflix experienced an incident where containers were taking out nodes in their infrastructure, affecting streaming services. The issue highlighted the challenges of container orchestration at Netflix's massive scale and the potential for container-level issues to impact entire nodes.

The incident demonstrated the complexities of managing containerized workloads in production environments and the importance of proper resource isolation and monitoring.

<!--more-->

## Sources

- [Containers taking out nodes](https://twitter.com/sargun/status/1228495222658613250?s=19)
21 changes: 21 additions & 0 deletions _posts/2020-04-09-github-april-disruptions.md
@@ -0,0 +1,21 @@
---
layout: post
title: "GitHub April 2020 Disruptions"
date: 2020-04-09
categories: [outage, infrastructure, git]
tags: [github, infrastructure, git, distributed-systems]
company: GitHub
incident_date: 2020-04-09
duration: "Several hours"
affected_services: ["GitHub.com", "Git Operations", "Actions", "Pages"]
---

In April 2020, GitHub experienced multiple service disruptions affecting various platform components including Git operations, GitHub Actions, and Pages. The incidents occurred during increased usage due to remote work adoption during the COVID-19 pandemic.

The disruptions highlighted the challenges of maintaining service reliability during unprecedented traffic growth and the critical role of version control platforms in remote development workflows.

<!--more-->

## Sources

- [April 2020 disruptions](https://github.blog/2020-05-22-april-service-disruptions-analysis/)
21 changes: 21 additions & 0 deletions _posts/2020-05-03-algolia-salt-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Salt Incident: May 3rd, 2020 Retrospective and Update"
date: 2020-05-03
categories: [outage, infrastructure, security]
tags: [algolia, salt, configuration-management, security]
company: Algolia
incident_date: 2020-05-03
duration: "Several hours"
affected_services: ["Algolia Search", "Configuration Management"]
---

On May 3, 2020, Algolia experienced a service disruption related to their Salt configuration management system. The incident affected search services and highlighted vulnerabilities in configuration management infrastructure.

The outage demonstrated the critical role of configuration management systems in maintaining service reliability and the potential security implications when these systems are compromised or misconfigured.

<!--more-->

## Sources

- [Salt Incident May 2020](https://blog.algolia.com/salt-incident-may-3rd-2020-retrospective-and-update/)
21 changes: 21 additions & 0 deletions _posts/2020-05-12-slack-may-12-outage.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Slack May 12, 2020 Outage"
date: 2020-05-12
categories: [outage, infrastructure, distributed-systems]
tags: [slack, infrastructure, messaging, distributed-systems]
company: Slack
incident_date: 2020-05-12
duration: "8+ hours"
affected_services: ["Slack Messaging", "File Sharing", "Search"]
---

On May 12, 2020, Slack experienced a major outage caused by HAProxy configuration bugs that prevented proper service discovery with Consul. The incident began with database issues at 8:30 AM Pacific, followed by the main outage at 4:45 PM Pacific, which was resolved by a rolling restart of the HAProxy fleet.

The outage occurred during peak remote work adoption due to COVID-19, highlighting the critical dependency of distributed teams on messaging platforms and the cascading effects of infrastructure configuration issues.

<!--more-->
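
The root cause described in the write-up is load balancer state drifting away from what the service registry knew, until a restart forced resynchronization. As a generic, hypothetical illustration (not Slack's tooling; the Consul address, service name, and expected backend set are placeholder assumptions), a periodic job can compare the instances Consul currently reports as healthy with the backends the proxy is actually routing to:

```python
"""Hypothetical service-discovery drift check -- illustrative sketch only."""
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"                       # placeholder Consul agent address
SERVICE = "webapp"                                     # placeholder service name
PROXY_BACKENDS = {"10.0.0.11:8080", "10.0.0.12:8080"}  # e.g. parsed from the proxy's config

def consul_healthy_instances(service):
    """Return 'address:port' for every instance whose health checks are passing."""
    url = f"{CONSUL}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url, timeout=5) as resp:
        entries = json.load(resp)
    return {f'{e["Service"]["Address"]}:{e["Service"]["Port"]}' for e in entries}

healthy = consul_healthy_instances(SERVICE)
missing = healthy - PROXY_BACKENDS   # healthy per Consul, but the proxy is not routing to them
stale = PROXY_BACKENDS - healthy     # the proxy still routes to them, but Consul no longer reports them healthy
if missing or stale:
    print(f"DRIFT: missing={sorted(missing)} stale={sorted(stale)}")
else:
    print("OK: proxy backends match Consul's healthy set")
```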

## Sources

- [A terrible, horrible, no good, very bad day at Slack - Slack Engineering](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
21 changes: 21 additions & 0 deletions _posts/2020-05-30-algolia-ssl-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "May 30 SSL incident"
date: 2020-05-30
categories: [outage, ssl, security]
tags: [algolia, ssl, certificate, security]
company: Algolia
incident_date: 2020-05-30
duration: "Several hours"
affected_services: ["Algolia Search", "SSL/TLS Services"]
---

On May 30, 2020, Algolia experienced an SSL certificate-related incident that affected their search services. The incident involved certificate management issues that prevented secure connections to Algolia's services.

SSL certificate management remains a common source of outages across the industry, highlighting the importance of automated certificate renewal and monitoring systems.

<!--more-->
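
As a small illustration of that last point, a scheduled check such as the hypothetical sketch below (the hostnames and warning threshold are placeholder assumptions, not Algolia's configuration) can surface an expiring certificate long before clients start failing TLS handshakes:

```python
import socket
import ssl
import time

WARN_DAYS = 21                 # placeholder alerting threshold
HOSTS = ["example.com"]        # placeholder hostnames

def days_until_expiry(hostname, port=443):
    """Return the number of days until the TLS certificate presented by hostname expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

for host in HOSTS:
    days = days_until_expiry(host)
    status = "OK" if days > WARN_DAYS else "WARN"
    print(f"{status}: {host} certificate expires in {days:.0f} days")
```

In practice a check like this typically runs against every public endpoint and feeds an alerting system, alongside automated renewal.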

## Sources

- [SSL incident May 30](https://www.algolia.com/blog/engineering/may-30-ssl-incident/)
21 changes: 21 additions & 0 deletions _posts/2020-07-15-twitter-security-incident.md
@@ -0,0 +1,21 @@
---
layout: post
title: "Twitter Security Incident"
date: 2020-07-15
categories: [security, social-engineering, incident]
tags: [twitter, security, social-engineering, bitcoin, scam]
company: Twitter
incident_date: 2020-07-15
duration: "Several hours"
affected_services: ["Twitter", "High-Profile Accounts"]
---

On July 15, 2020, Twitter experienced a major security incident where attackers used social engineering to gain access to internal tools and compromise high-profile accounts including Barack Obama, Elon Musk, and Bill Gates. The compromised accounts were used to promote a Bitcoin scam.

The incident highlighted vulnerabilities in social engineering attacks against employees with privileged access and led to significant improvements in Twitter's internal security practices and access controls.

<!--more-->

## Sources

- [An update on our security incident - Twitter Blog](https://blog.twitter.com/en_us/topics/company/2020/an-update-on-our-security-incident.html)
16 changes: 16 additions & 0 deletions _posts/2020-07-19-debugging-distributed-system.md
@@ -0,0 +1,16 @@
---
layout: post
title: "Debugging a Misbehaving Distributed System"
date: 2020-07-19
company: "Community Story"
duration: "N/A"
tags: [debugging, distributed-systems, community, networking]
---

A community-shared story about debugging complex distributed system issues, where seemingly unrelated symptoms led to discovering deep architectural problems.

This Twitter thread by Erin provides insights into the detective work required when distributed systems start misbehaving in unexpected ways.

## Sources

- [Original Twitter Thread by Erin](https://twitter.com/erincandescent/status/1281280157073002496)