Commit 1ca4b80

doc: add web-team IRP (#42)
* doc: add web-team IRP
  Refs: #40
* Update INCIDENT_RESPONSE_PLAN.md
  Co-authored-by: Aviv Keller <[email protected]>
* Update INCIDENT_RESPONSE_PLAN.md
  Co-authored-by: Aviv Keller <[email protected]>
* Update INCIDENT_RESPONSE_PLAN.md
  Co-authored-by: Aviv Keller <[email protected]>
* Update INCIDENT_RESPONSE_PLAN.md
  Co-authored-by: Aviv Keller <[email protected]>
* Update INCIDENT_RESPONSE_PLAN.md
  Co-authored-by: Aviv Keller <[email protected]>

---------

Co-authored-by: Aviv Keller <[email protected]>
1 parent 17e88e1 commit 1ca4b80

1 file changed

+65
-0
lines changed

INCIDENT_RESPONSE_PLAN.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
1+
# Incident Response Plan (IRP)
2+
3+
## Scope
4+
5+
This IRP covers incidents affecting Node.js web properties and supporting services operated by the **@nodejs/web** team.
6+
7+
For a list of covered services and repositories, refer to [PERMISSIONS.md](./PERMISSIONS.md).
8+
9+
## IC & Escalation
10+
11+
* **Incident Commander (IC):** The first `@nodejs/web` member to take charge of the incident.
12+
13+
**Escalation:**
14+
IC → `@nodejs/web-infra` → `@nodejs/web-admins` → `@nodejs/build` (Cloudflare account/zone-critical) and/or `@nodejs/security-wg` (security incidents) → `@nodejs/tsc`.
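
As a rough illustration, the escalation chain can be modeled as an ordered list with two conditional hops. This is only a sketch of one reading of the line above: the team handles come from the plan, while the function name and the two boolean flags are hypothetical.

```python
from typing import List


def escalation_path(cloudflare_critical: bool = False,
                    security_incident: bool = False) -> List[str]:
    """Return the teams to contact after the IC, in order.

    Assumption: @nodejs/build is only pulled in for Cloudflare
    account/zone-critical issues and @nodejs/security-wg only for
    security incidents; @nodejs/tsc is always the last stop.
    """
    path = ["@nodejs/web-infra", "@nodejs/web-admins"]
    if cloudflare_critical:
        path.append("@nodejs/build")
    if security_incident:
        path.append("@nodejs/security-wg")
    path.append("@nodejs/tsc")
    return path


if __name__ == "__main__":
    print(" -> ".join(["IC"] + escalation_path(security_incident=True)))
```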
15+
16+
## Severity Levels & SLAs
17+
18+
* **P0 – Critical user impact** (global outage/defacement/security breach):
19+
20+
* Acknowledge: TBD
21+
22+
* **P1 – Major degradation** (partial outage, broken downloads/docs on a locale/route):
23+
24+
* Acknowledge: TBD
25+
26+
* **P2 – Minor** (noncritical errors, single integration down):
27+
28+
* Acknowledge: TBD
29+
30+
When in doubt, start at higher severity and downgrade later.
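
The "start at higher severity" rule can be sketched in a few lines. The enum and function names below are hypothetical and not part of the plan; the acknowledge SLAs are left out because the plan lists them as TBD.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Severity levels from the plan; a lower number is more severe."""
    P0 = 0  # critical user impact
    P1 = 1  # major degradation
    P2 = 2  # minor


def initial_severity(candidates: Iterable[Severity]) -> Severity:
    """When unsure between several levels, start at the most severe one."""
    return min(candidates)


if __name__ == "__main__":
    # Unsure whether this is a partial outage or a minor error: start at P1.
    print(initial_severity([Severity.P2, Severity.P1]).name)
```

The incident can then be downgraded once the actual impact is understood.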
31+
32+
## Canonical Response Workflow
33+
34+
1. **Declare** severity; assign IC and Comms Lead.
35+
36+
2. **Stabilize users first:**
37+
* Roll back to last good deploy
38+
* If needed, enable Cloudflare “Under Attack” mode or WAF rules, and emergency caching on critical paths.
39+
40+
3. **Communicate:** post an initial status summary and known impact; repeat per SLA. (Use the blog/announcements or org channel as appropriate; precedent: public [post-mortem for March 17 incident](https://nodejs.org/en/blog/announcements/node-js-march-17-incident).)
41+
42+
4. **Contain & eradicate:** revoke keys/tokens, disable compromised deploy hooks, patch, and purge caches safely.
43+
44+
5. **Recover:** redeploy clean artifact, validate, then progressively relax mitigations.
45+
46+
6. **Review:** draft a blameless post-mortem covering impact, root cause, and follow-up engineering actions and process fixes.
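
The six steps above can be tracked with a small checklist object. This is a minimal sketch: the class and method names are hypothetical, and treating the workflow as strictly sequential is a simplification, since communication repeats throughout an incident.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step names follow the numbered workflow in the plan.
STEPS = ["Declare", "Stabilize", "Communicate",
         "Contain & eradicate", "Recover", "Review"]


@dataclass
class IncidentChecklist:
    done: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next pending step, or None when all are complete."""
        return STEPS[len(self.done)] if len(self.done) < len(STEPS) else None

    def complete(self, step: str) -> None:
        """Mark a step complete; reject steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.done.append(step)


if __name__ == "__main__":
    checklist = IncidentChecklist()
    checklist.complete("Declare")
    print(checklist.next_step())
```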
47+
48+
## Common Incidents — What Happens & What They Cause
49+
50+
| Incident | Likely Cause | What users see | Immediate actions | Primary owner |
51+
| ----------------------------------- | ------------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ---------------------------- |
52+
| **Token/secret leak** | Accidental commit or exposed CI logs. | Subsequent unauthorized changes/deploys. | Invalidate in provider; rotate in 1Password; hunt for usage in audit logs; force redeploy clean. | Service owner + Web-Admins. |
53+
| **Expired TLS/SSL certificate** | Missed renewal or misconfigured auto-renew. | Browser warnings (“Connection not secure”), failed API calls. | Renew/redeploy certificate; validate chain; confirm monitoring alerts. | Infra + Build. |
54+
| **Outage due to misconfigured DNS** | Incorrect DNS update or provider outage. | Users can’t reach service; domain not resolving. | Roll back DNS change; verify propagation; coordinate with DNS provider. | Infra + Build. |
55+
| **Compromised admin account** | Phishing or weak MFA. | Unauthorized changes in systems. | Disable account; rotate credentials; audit changes; notify security. | Security WG + Account owner. |
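
For the expired-certificate row, an expiry check of the kind a monitoring alert might run could look like the sketch below. The function names and the 14-day threshold are hypothetical; the plan does not specify how certificates are monitored. The date string uses the `notAfter` format that Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` is expected in the form 'Jan  5 09:34:43 2018 GMT'.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, threshold_days: float = 14) -> bool:
    """True when the certificate expires within the (assumed) threshold."""
    return days_until_expiry(not_after, now) <= threshold_days


if __name__ == "__main__":
    now = datetime(2018, 1, 1, tzinfo=timezone.utc)
    print(needs_renewal("Jan  5 09:34:43 2018 GMT", now))
```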
56+
57+
## Communications
58+
59+
**Internal (private):** `@nodejs/web` or `@nodejs/web-infra` channel/thread; if Cloudflare account action is required, loop in `@nodejs/build`.
60+
61+
**Public (as needed):** short status updates; if user impact was material, publish a brief blog post or addendum to an incident page (precedent: the March 17 incident post-mortem linked in the response workflow).
62+
63+
### Notes on authority & ownership
64+
65+
* Cloudflare account-level actions (e.g., role changes) are coordinated with **@nodejs/build**; Web-Infra holds write or admin access depending on team (`web-infra` vs `web-admins`). Keep this in mind when planning mitigations that require account scope.