Commit ca97c25
Improve security detection rules documentation for DevOps engineers
This commit implements critical improvements to the 'Create a detection rule'
documentation, based on a comprehensive usability review conducted from the
perspective of a DevOps engineer persona.
Key improvements:
## Resource Planning and Performance (Critical)
- Added 'Resource planning and performance considerations' section with:
  - Detailed resource requirements by rule type (execution time, memory)
  - Capacity planning guidance for running multiple rules
  - Circuit breaker prevention and troubleshooting
  - Guidance on staggering rule activation to prevent a thundering herd
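
The staggering idea can be sketched as follows. This is a minimal illustration, not the procedure from the documentation itself: the function name `stagger_offsets` and the rule names are hypothetical, and it simply assumes that spreading activation times evenly across one scheduling interval is an acceptable way to avoid all rules firing at once.

```python
from datetime import timedelta


def stagger_offsets(rule_names, interval):
    """Spread rule activation times evenly across one scheduling interval.

    Returns a dict mapping each rule name to the delay (timedelta) to wait
    before enabling it, so that executions do not all land on the same tick.
    """
    if not rule_names:
        return {}
    step = interval / len(rule_names)
    return {name: step * i for i, name in enumerate(rule_names)}


# Hypothetical rule names, all scheduled every 5 minutes.
offsets = stagger_offsets(
    ["ssh-failures", "outbound-conn", "threshold-logins", "ml-anomaly"],
    timedelta(minutes=5),
)
for name, delay in offsets.items():
    print(f"{name}: enable at +{int(delay.total_seconds())}s")
```

With four rules on a 5-minute interval, each rule is enabled 75 seconds after the previous one.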
## Rule Type Decision Support (Critical)
- Added 'Understanding rule types' comparison table
- Clear guidance on when to use each rule type
- Query language quick reference (KQL vs Lucene vs EQL vs ES|QL)
- Recommendation: 90% of use cases use Custom Query + KQL
## Infrastructure-Focused Examples (Critical)
- Added practical examples for DevOps use cases:
  - Detect failed SSH login attempts
  - Detect unusual outbound network connections
- Each example includes prerequisites, testing steps, expected behavior, and tuning tips
## Enhanced Scheduling Guidance (Critical)
- Reframed 'Additional look-back time' as CRITICAL not optional
- Explained three failure scenarios (execution delay, ingestion delay, Kibana restarts)
- Added scheduling strategy for multiple rules with load distribution
- Performance-based interval recommendations by rule type
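
Why look-back time matters can be shown with a small calculation. The sketch below assumes that each rule run searches a window of `interval + look_back` ending at the run time; the helper names `query_window` and `has_gap` are hypothetical and only illustrate the execution-delay scenario.

```python
from datetime import datetime, timedelta


def query_window(run_time, interval, look_back):
    """Return the (start, end) time range that a rule run searches."""
    return run_time - (interval + look_back), run_time


def has_gap(prev_run, this_run, interval, look_back):
    """True if some events after the previous run fall outside this run's window."""
    start, _ = query_window(this_run, interval, look_back)
    return start > prev_run


prev_run = datetime(2024, 1, 1, 12, 0)
delayed_run = datetime(2024, 1, 1, 12, 7)  # scheduled for 12:05, ran 2 minutes late
interval = timedelta(minutes=5)

# A 1-minute look-back cannot absorb a 2-minute delay; a 5-minute one can.
print(has_gap(prev_run, delayed_run, interval, timedelta(minutes=1)))
print(has_gap(prev_run, delayed_run, interval, timedelta(minutes=5)))
```

In this example the short look-back leaves the minute between 12:00 and 12:01 unsearched, while the longer look-back overlaps the previous run.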
## Improved Threshold Rule Documentation (Critical)
- Added specific cardinality definitions (low/medium/high risk levels)
- Provided diagnostic query to check cardinality before creating rule
- Explained circuit breaker error messages with exact text users will see
- Step-by-step resolution procedures
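
A cardinality check of this kind can be sketched with the Elasticsearch `cardinality` aggregation. The snippet below only builds the request body; the field `source.ip`, the 24-hour time range, and the helper name `cardinality_check_body` are assumptions for illustration and are not taken from the documentation change itself.

```python
import json


def cardinality_check_body(field, since="now-24h"):
    """Build a search body that counts approximate unique values of `field`.

    `size: 0` skips returning documents, so only the aggregation is computed.
    """
    return {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": since}}},
        "aggs": {"unique_values": {"cardinality": {"field": field}}},
    }


# Hypothetical group-by field for a threshold rule.
body = cardinality_check_body("source.ip")
print(json.dumps(body, indent=2))
```

The printed body can be sent to the `_search` endpoint of the relevant index pattern; the result appears under `aggregations.unique_values.value`.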
## Enhanced ML Rule Warnings (Critical)
- Added comprehensive warning about ML job startup (30-60s delay)
- Resource requirements (2GB RAM per job)
- Baseline period explanation (7-14 days for learning)
- Production deployment best practices
## Max Alerts Per Run Clarification (Critical)
- Explained that rule STOPS processing when limit reached (not just warning)
- Added detection methods for when limit is hit
- Performance impact data (100ms per 100 alerts)
- Decision framework for appropriate values
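
The stop-at-limit behavior and the quoted cost figure can be modeled roughly as below. Both helpers are hypothetical: `apply_max_alerts` only mimics a hard cap on a list, and `estimated_overhead_ms` treats the 100 ms per 100 alerts figure as a simple linear estimate.

```python
def apply_max_alerts(candidates, max_alerts):
    """Keep at most `max_alerts` candidates; report how many were not processed."""
    kept = candidates[:max_alerts]
    dropped = len(candidates) - len(kept)
    return kept, dropped


def estimated_overhead_ms(alert_count, ms_per_100_alerts=100):
    """Linear estimate of alert-creation overhead in milliseconds."""
    return alert_count / 100 * ms_per_100_alerts


kept, dropped = apply_max_alerts(list(range(250)), max_alerts=100)
print(len(kept), dropped)
print(estimated_overhead_ms(len(kept)))
```

A non-zero `dropped` count is the signal that the limit was hit and matching events were left without alerts.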
## Improved Rule Actions/Notifications (Major)
- Clear licensing requirements (Gold+ for Stack, included in Serverless)
- Common notification patterns (severity-based routing, on-call integration)
- Action reliability and failure handling (3 retries, then dropped)
- How to diagnose failed notifications
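
The retry-then-drop behavior can be illustrated with a short sketch. It assumes "3 retries" means one initial attempt plus three retries; `send_with_retries` and `always_fails` are hypothetical names, not part of any Kibana API.

```python
def send_with_retries(send, max_retries=3):
    """Call send() once, then retry up to `max_retries` times on failure.

    Returns (delivered, attempts). If every attempt fails, the notification
    is dropped, so the caller should log or surface the failure.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        try:
            send()
            return True, attempts
        except Exception:
            continue
    return False, attempts


def always_fails():
    # Stand-in for a connector that is unreachable.
    raise ConnectionError("connector unreachable")


delivered, attempts = send_with_retries(always_fails)
print(delivered, attempts)
```

Because a dropped notification produces no further signal, the returned `delivered` flag is what a diagnostic check would need to inspect.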
## Strengthened Response Actions Warning (Critical)
- Elevated to comprehensive warning with real-world failure scenarios
- Three-phase safe deployment process (notifications → manual → limited automation)
- 'Never automate response for' list (prod databases, k8s masters, CI/CD)
- Required safeguards and emergency rollback procedure
## Integrated Troubleshooting Section (Critical)
- Added 'Common issues after creating rules' section at end of document
- Six common problems with diagnosis and solutions:
  - Rule shows Warning status
  - Rule creates zero alerts
  - Too many alerts
  - Gaps in rule execution
  - Actions not sending
  - Performance degradation over time
- Links to additional troubleshooting resources
These changes address the top 7 critical issues and 5 major issues identified
in the usability review, significantly improving the documentation for operations
teams managing security detection rules at scale.
1 parent a15fff0
1 file changed (+489, -20) in solutions/security/detect-and-alert