Commit ca97c25

Improve security detection rules documentation for DevOps engineers
This commit implements critical improvements to the 'Create a detection rule' documentation based on a comprehensive usability review from the DevOps engineer persona perspective.

Key improvements:

## Resource Planning and Performance (Critical)
- Added 'Resource planning and performance considerations' section with:
  - Detailed resource requirements by rule type (execution time, memory)
  - Capacity planning guidance for running multiple rules
  - Circuit breaker prevention and troubleshooting
  - Guidance on staggering rule activation to prevent thundering herd

## Rule Type Decision Support (Critical)
- Added 'Understanding rule types' comparison table
- Clear guidance on when to use each rule type
- Query language quick reference (KQL vs Lucene vs EQL vs ES|QL)
- Recommendation: 90% of use cases use Custom Query + KQL

## Infrastructure-Focused Examples (Critical)
- Added practical examples for DevOps use cases:
  - Detect failed SSH login attempts
  - Detect unusual outbound network connections
- Each example includes prerequisites, testing steps, expected behavior, and tuning tips

## Enhanced Scheduling Guidance (Critical)
- Reframed 'Additional look-back time' as CRITICAL not optional
- Explained three failure scenarios (execution delay, ingestion delay, Kibana restarts)
- Added scheduling strategy for multiple rules with load distribution
- Performance-based interval recommendations by rule type

## Improved Threshold Rule Documentation (Critical)
- Added specific cardinality definitions (low/medium/high risk levels)
- Provided diagnostic query to check cardinality before creating rule
- Explained circuit breaker error messages with exact text users will see
- Step-by-step resolution procedures

## Enhanced ML Rule Warnings (Critical)
- Added comprehensive warning about ML job startup (30-60s delay)
- Resource requirements (2GB RAM per job)
- Baseline period explanation (7-14 days for learning)
- Production deployment best practices

## Max Alerts Per Run Clarification (Critical)
- Explained that rule STOPS processing when limit reached (not just warning)
- Added detection methods for when limit is hit
- Performance impact data (100ms per 100 alerts)
- Decision framework for appropriate values

## Improved Rule Actions/Notifications (Major)
- Clear licensing requirements (Gold+ for Stack, included in Serverless)
- Common notification patterns (severity-based routing, on-call integration)
- Action reliability and failure handling (3 retries, then dropped)
- How to diagnose failed notifications

## Strengthened Response Actions Warning (Critical)
- Elevated to comprehensive warning with real-world failure scenarios
- Three-phase safe deployment process (notifications → manual → limited automation)
- 'Never automate response for' list (prod databases, k8s masters, CI/CD)
- Required safeguards and emergency rollback procedure

## Integrated Troubleshooting Section (Critical)
- Added 'Common issues after creating rules' section at end of document
- Six common problems with diagnosis and solutions:
  - Rule shows Warning status
  - Rule creates zero alerts
  - Too many alerts
  - Gaps in rule execution
  - Actions not sending
  - Performance degradation over time
- Links to additional troubleshooting resources

These changes address the top 7 critical issues and 5 major issues identified in the usability review, significantly improving the documentation for operations teams managing security detection rules at scale.
1 parent a15fff0 commit ca97c25

1 file changed: +489 -20 lines changed
