|
| 1 | +# Incident Response Plan (IRP) |
| 2 | + |
| 3 | +## Scope |
| 4 | + |
| 5 | +This IRP covers incidents affecting Node.js web properties and supporting services operated by the **@nodejs/web** team. |
| 6 | + |
| 7 | +For a list of covered services and repositories, refer to [PERMISSIONS.md](./PERMISSIONS.md). |
| 8 | + |
| 9 | +## IC & Escalation |
| 10 | + |
| 11 | +* **Incident Commander (IC):** The first `@nodejs/web` member to take charge of the incident. |
| 12 | + |
| 13 | +**Escalation:** |
| 14 | + IC → `@nodejs/web-infra` → `@nodejs/web-admins` → `@nodejs/build` (Cloudflare account/zone-critical) and/or `@nodejs/security-wg` (security incidents) → `@nodejs/tsc`. |
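
The escalation order above can be sketched as a small lookup. This is a minimal illustration, not project tooling: the function name `escalation_path` and its two flags are hypothetical, and skipping the `@nodejs/build` / `@nodejs/security-wg` tier when neither flag applies is an assumption the plan does not state.

```python
from typing import List


def escalation_path(cloudflare_critical: bool = False,
                    security_incident: bool = False) -> List[List[str]]:
    """Return escalation tiers in order; each tier lists who to contact together."""
    tiers: List[List[str]] = [
        ["IC"],
        ["@nodejs/web-infra"],
        ["@nodejs/web-admins"],
    ]

    # The plan lists these two teams as "and/or", so they share one tier.
    conditional: List[str] = []
    if cloudflare_critical:
        conditional.append("@nodejs/build")
    if security_incident:
        conditional.append("@nodejs/security-wg")
    if conditional:  # Assumption: the tier is skipped when neither condition holds.
        tiers.append(conditional)

    tiers.append(["@nodejs/tsc"])
    return tiers
```

For example, `escalation_path(security_incident=True)` places `@nodejs/security-wg` after `@nodejs/web-admins` and before `@nodejs/tsc`.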
| 15 | + |
| 16 | +## Severity Levels & SLAs |
| 17 | + |
| 18 | +* **P0 – Critical user impact** (global outage/defacement/security breach): |
| 19 | + |
| 20 | + * Acknowledge: TBD |
| 21 | + |
| 22 | +* **P1 – Major degradation** (partial outage, broken downloads/docs on a locale/route): |
| 23 | + |
| 24 | + * Acknowledge: TBD |
| 25 | + |
| 26 | +* **P2 – Minor** (noncritical errors, single integration down): |
| 27 | + |
| 28 | + * Acknowledge: TBD |
| 29 | + |
| 30 | +When in doubt, start at higher severity and downgrade later. |
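
The "start high, downgrade later" rule can be expressed as picking the most severe of the plausible levels. The sketch below is illustrative only: `Severity`, `ACK_SLA_MINUTES`, and `initial_severity` are hypothetical names, and defaulting to P0 when no level has been proposed is an assumption. Acknowledgement targets are left unset because the plan still lists them as TBD.

```python
from enum import IntEnum
from typing import Dict, Iterable, Optional


class Severity(IntEnum):
    P0 = 0  # Critical user impact
    P1 = 1  # Major degradation
    P2 = 2  # Minor


# Acknowledgement targets are TBD in the plan, so no values are filled in here.
ACK_SLA_MINUTES: Dict[Severity, Optional[int]] = {level: None for level in Severity}


def initial_severity(plausible: Iterable[Severity]) -> Severity:
    """Pick the most severe plausible level (lowest number)."""
    levels = list(plausible)
    if not levels:
        # Assumption: with no information at all, treat the incident as P0.
        return Severity.P0
    return min(levels)
```

If responders cannot decide between P1 and P2, `initial_severity([Severity.P1, Severity.P2])` returns `Severity.P1`; the incident can be downgraded once the impact is understood.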
| 31 | + |
| 32 | +## Canonical Response Workflow |
| 33 | + |
| 34 | +1. **Declare** severity; assign IC and Comms Lead. |
| 35 | + |
| 36 | +2. **Stabilize users first:** |
| 37 | +  * Roll back to the last known-good deploy. |
| 38 | +  * If needed, enable Cloudflare “Under Attack” mode or WAF rules, plus emergency caching on critical paths. |
| 39 | + |
| 40 | +3. **Communicate:** post an initial status summary and known impact; repeat per SLA. Use the blog/announcements or an org channel as appropriate (precedent: public [post-mortem for March 17 incident](https://nodejs.org/en/blog/announcements/node-js-march-17-incident)). |
| 41 | + |
| 42 | +4. **Contain & eradicate:** revoke keys/tokens, disable compromised deploy hooks, patch, and purge caches safely. |
| 43 | + |
| 44 | +5. **Recover:** redeploy clean artifact, validate, then progressively relax mitigations. |
| 45 | + |
| 46 | +6. **Review:** draft a blameless post-mortem covering impact, root cause, and follow-up engineering actions and process fixes. |
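
The six steps above can be tracked as an ordered checklist. This is a minimal sketch under stated assumptions: the step identifiers and the `IncidentChecklist` class are hypothetical, and treating the list as "earliest unfinished step first" is one reading of "canonical", not something the plan mandates (communication, for instance, repeats throughout an incident).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step identifiers are illustrative labels for the six workflow steps.
WORKFLOW = (
    "declare",
    "stabilize",
    "communicate",
    "contain_eradicate",
    "recover",
    "review",
)


@dataclass
class IncidentChecklist:
    done: List[str] = field(default_factory=list)

    def complete(self, step: str) -> None:
        """Record a finished step; unknown step names are rejected."""
        if step not in WORKFLOW:
            raise ValueError(f"unknown workflow step: {step}")
        if step not in self.done:
            self.done.append(step)

    def next_step(self) -> Optional[str]:
        """Return the earliest step not yet completed, or None when all are done."""
        for step in WORKFLOW:
            if step not in self.done:
                return step
        return None
```

Steps may be completed out of order; `next_step()` always points back at the earliest one still open, so a skipped "stabilize" stays visible.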
| 47 | + |
| 48 | +## Common Incidents — What Happens & What They Cause |
| 49 | + |
| 50 | +| Incident | Likely Cause | What users see | Immediate actions | Primary owner | |
| 51 | +| ----------------------------------- | ------------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ---------------------------- | |
| 52 | +| **Token/secret leak** | Accidental commit or exposed CI logs. | Subsequent unauthorized changes/deploys. | Invalidate in provider; rotate in 1Password; hunt for usage in audit logs; force redeploy clean. | Service owner + Web-Admins. | |
| 53 | +| **Expired TLS/SSL certificate** | Missed renewal or misconfigured auto-renew. | Browser warnings (“Connection not secure”), failed API calls. | Renew/redeploy certificate; validate chain; confirm monitoring alerts. | Infra + Build. | |
| 54 | +| **Outage due to misconfigured DNS** | Incorrect DNS update or provider outage. | Users can’t reach service; domain not resolving. | Roll back DNS change; verify propagation; coordinate with DNS provider. | Infra + Build. | |
| 55 | +| **Compromised admin account** | Phishing or weak MFA. | Unauthorized changes in systems. | Disable account; rotate credentials; audit changes; notify security. | Security WG + Account owner. | |
| 56 | + |
| 57 | +## Communications |
| 58 | + |
| 59 | +**Internal (private):** `@nodejs/web` or `@nodejs/web-infra` channel/thread; if Cloudflare account action is required, loop in `@nodejs/build`. |
| 60 | + |
| 61 | +**Public (as needed):** short status updates; if user impact was material, publish a brief blog post or an addendum to an incident page (the March 17 incident post-mortem is a precedent). |
| 62 | + |
| 63 | +### Notes on authority & ownership |
| 64 | + |
| 65 | +* Cloudflare account-level actions (e.g., role changes) are coordinated with **@nodejs/build**; web team members hold write or admin access depending on their team (`web-infra` vs `web-admins`). Keep this in mind when planning mitigations that require account scope. |