Commit 8afc81e

Add Investigation Guides for Rules
1 parent d36957a · commit 8afc81e

2 files changed · +65 −1 lines changed

rules/cross-platform/reconnaissance_web_server_unusual_spike_in_error_logs.toml

Lines changed: 32 additions & 0 deletions
@@ -17,6 +17,37 @@ interval = "10m"
 language = "esql"
 license = "Elastic License v2"
 name = "Potential Spike in Web Server Error Logs"
+note = """ ## Triage and analysis
+
+> **Disclaimer**:
+> This investigation guide was created using generative AI technology and has been reviewed to improve its accuracy and relevance. While every effort has been made to ensure its quality, we recommend validating the content and adapting it to suit your specific environment and operational needs.
+
+### Investigating Potential Spike in Web Server Error Logs
+
+This detection flags spikes of web server error responses across HTTP/TLS and common server platforms, signaling active scanning or fuzzing that can expose misconfigurations or exploitable paths. A typical pattern is an automated scanner sweeping endpoints like /admin/, /debug/, /.env, /.git, and backup archives while mutating query parameters, producing repeated 404/403 and occasional 500 responses across multiple applications within minutes.
+
+### Possible investigation steps
+
+- Pivot on the noisy client IP(s) to build a minute-by-minute timeline across affected hosts showing request rate, status codes, methods, and top paths to distinguish automated scanning from a localized application failure.
+- Enrich the client with ASN, geolocation, hosting/Tor/proxy reputation, historical sightings, and maintenance windows to quickly decide if it matches a known external scanner or an internal scheduled test.
+- Aggregate the most requested URIs and verbs and look for telltale patterns such as /.env, /.git, backup archives, admin consoles, or unusual verbs like PROPFIND/TRACE, then correlate any 5xx bursts with application and server error logs and recent deploys or config changes.
+- Hunt for follow-on success from the same client by checking for subsequent 200/302s to sensitive paths, authentication events and session creation, or evidence of file writes and suspicious child processes on the web hosts.
+- If traffic traverses a CDN/WAF/load balancer, pivot to those logs to recover true client IPs, review rule matches and throttling, and determine whether similar patterns occurred across multiple edges or regions.
+
+### False positive analysis
+
+- Internal QA or integration tests that systematically crawl application routes after a deployment can generate bursts of 404/403 and occasional 500s from a single client IP, closely resembling active scanning.
+- A transient backend outage or misconfiguration (broken asset paths or auth flows) can cause legitimate traffic to return many errors aggregated under a shared egress IP (NAT), pushing per-IP counts above the threshold without adversary activity.
+
+### Response and remediation
+
+- Immediately block or throttle the noisy client IPs at the WAF/CDN and load balancer by enabling per-IP rate limits and signatures for scanner patterns such as repeated hits to /.env, /.git, /admin, backup archives, or unusual verbs like PROPFIND/TRACE.
+- If errors include concentrated 5xx responses from one web host, drain that node from service behind the load balancer, capture its web and application error logs, and roll back the most recent deploy or config change until error rates normalize.
+- Remove risky exposures uncovered by the scan by denying access to environment files and VCS directories (.env, .git), disabling directory listing, locking down admin consoles, and rejecting unsupported HTTP methods at the web server.
+- Escalate to Incident Response if the same client shifts from errors to successful access on sensitive endpoints (200/302 to /admin, /login, or API keys), if you observe file writes under the webroot or suspicious child processes, or if multiple unrelated clients show the same pattern across regions.
+- Recover service by redeploying known-good builds, re-enabling health checks, running smoke tests against top routes, and restoring normal WAF/CDN policies while keeping a temporary blocklist for the offending IPs.
+- Harden long term by tuning WAF/CDN to auto-throttle bursty 404/403/500 patterns, disabling TRACE/OPTIONS where unused, minimizing verbose error pages, and ensuring logs capture the true client IP via X-Forwarded-For or True-Client-IP headers.
+"""
 risk_score = 21
 rule_id = "6631a759-4559-4c33-a392-13f146c8bcc4"
 severity = "low"
@@ -28,6 +59,7 @@ tags = [
 "Data Source: Apache",
 "Data Source: Apache Tomcat",
 "Data Source: IIS",
+"Resources: Investigation Guide",
 ]
 timestamp_override = "event.ingested"
 type = "esql"

rules/cross-platform/reconnaissance_web_server_unusual_spike_in_error_response_codes.toml

Lines changed: 33 additions & 1 deletion
@@ -2,7 +2,7 @@
 creation_date = "2025/11/19"
 integration = ["network_traffic", "nginx", "apache", "apache_tomcat", "iis"]
 maturity = "production"
-updated_date = "2025/11/19"
+updated_date = "2025/11/24"
 
 [rule]
 author = ["Elastic"]
@@ -17,6 +17,37 @@ interval = "10m"
 language = "esql"
 license = "Elastic License v2"
 name = "Web Server Potential Spike in Error Response Codes"
+note = """ ## Triage and analysis
+
+> **Disclaimer**:
+> This investigation guide was created using generative AI technology and has been reviewed to improve its accuracy and relevance. While every effort has been made to ensure its quality, we recommend validating the content and adapting it to suit your specific environment and operational needs.
+
+### Investigating Web Server Potential Spike in Error Response Codes
+
+This rule detects bursts of 5xx errors (500–504) from GET traffic, highlighting abnormal server behavior that accompanies active scanning or fuzzing and exposes fragile code paths or misconfigured proxies. Attackers sweep common and generated endpoints while mutating query params and headers—path traversal, template syntax, large payloads—to repeatedly force backend exceptions and gateway timeouts, enumerate which routes fail, and pinpoint inputs that leak stack traces or crash components for follow-on exploitation.
+
+### Possible investigation steps
+
+- Plot error rates per minute by server and client around the alert window to confirm the spike, determine scope, and separate a single noisy client from a platform-wide issue.
+- Aggregate the failing URL paths and query strings from the flagged client and look for enumeration sequences, traversal encoding, template injection markers, or oversized inputs indicative of fuzzing.
+- Examine User-Agent, Referer, header mix, and TLS JA3 for generic scanner signatures or reuse across multiple clients, and enrich the originating IP with reputation and hosting-provider attribution.
+- Correlate the timeframe with reverse proxy/WAF/IDS and application error logs or stack traces to identify which routes threw exceptions or timeouts and whether they align with the client’s input patterns.
+- Validate backend and dependency health (upstreams, databases, caches, deployments) to rule out infrastructure regressions, then compare whether only the suspicious client experiences disproportionate failures.
+
+### False positive analysis
+
+- A scheduled deployment or upstream dependency issue can cause normal GET traffic to fail with 502/503/504, and many users egressing through a shared NAT or reverse proxy may be aggregated as one source IP that triggers the spike.
+- An internal health-check, load test, or site crawler running from a single host can rapidly traverse endpoints and induce 500 errors on fragile routes, mimicking scanner-like behavior without malicious intent.
+
+### Response and remediation
+
+- Immediately rate-limit or block the originating client(s) at the edge (reverse proxy/WAF) using the observed source IPs, User-Agent/TLS fingerprints, and the failing URL patterns generating 5xx bursts.
+- Drain the origin upstream(s) showing repeated 500/502/503/504 on the probed routes, roll back the latest deployment or config change for those services, and disable any unstable endpoint or plugin that is crashing under input fuzzing.
+- Restart affected application workers and proxies, purge bad cache entries, re-enable traffic gradually with canary percentage, and confirm normal response rates via synthetic checks against the previously failing URLs.
+- Escalate to Security Operations and Incident Response if 5xx spikes persist after blocking or if error pages expose stack traces, credentials, or admin route disclosures, or if traffic originates from multiple global hosting ASNs.
+- Deploy targeted WAF rules for path traversal and injection markers seen in the URLs, enforce per-IP and per-route rate limits, tighten upstream timeouts/circuit breakers, and replace verbose error pages with generic responses that omit stack details.
+- Add bot management and IP reputation blocking at the CDN/edge, lock down unauthenticated access to admin/debug routes, and instrument alerts that trigger on sustained 5xx bursts per client and per route with automatic edge throttling.
+"""
 risk_score = 21
 rule_id = "6fa3abe3-9cd8-41de-951b-51ed8f710523"
 severity = "low"
@@ -30,6 +61,7 @@ tags = [
 "Data Source: Apache",
 "Data Source: Apache Tomcat",
 "Data Source: IIS",
+"Resources: Investigation Guide",
 ]
 timestamp_override = "event.ingested"
 type = "esql"
