| 1 | +# KRKN Chaos Templates |
| 2 | + |
| 3 | +This guide covers the KRKN Chaos Template Library, which provides pre-configured chaos scenarios for quick execution and testing. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The KRKN Chaos Template Library offers ready-to-use chaos engineering scenarios that can be easily customized and executed. These templates follow a standardized structure and cover common failure patterns in Kubernetes environments. |
| 8 | + |
| 9 | +## Available Templates |
| 10 | + |
| 11 | +### Core Templates |
| 12 | + |
| 13 | +| Template | Description | Risk Level | Category | |
| 14 | +|----------|-------------|------------|----------| |
| 15 | +| **pod-failure** | Simulates pod crash to test application resiliency | Medium | Availability | |
| 16 | +| **node-failure** | Simulates node failure to test cluster resiliency | High | Availability | |
| 17 | +| **network-latency** | Introduces network latency to test performance | Low | Performance | |
| 18 | +| **cpu-stress** | Applies CPU stress to test performance under load | Medium | Performance | |
| 19 | +| **disk-stress** | Applies disk I/O stress to test storage performance | Medium | Performance | |
| 20 | +| **pod-kill** | Forcefully terminates pods to test recovery | Medium | Availability | |
| 21 | +| **container-restart** | Restarts containers to test container-level recovery | Low | Availability | |
| 22 | +| **vm-outage** | Simulates VM outage for OpenShift Virtualization | High | Availability | |
| 23 | +| **resource-failure** | Simulates Kubernetes resource failures | Medium | Availability | |
| 24 | + |
| 25 | +## Quick Start |
| 26 | + |
| 27 | +### Installation |
| 28 | + |
| 29 | +The template system is included with KRKN. No additional installation is required. |
| 30 | + |
| 31 | +### Listing Available Templates |
| 32 | + |
| 33 | +```bash |
| 34 | +# Using the template manager directly |
| 35 | +python krkn/template_manager.py list |
| 36 | + |
| 37 | +# Using the template wrapper |
| 38 | +python krkn-template list |
| 39 | +``` |
| 40 | + |
| 41 | +### Running a Template |
| 42 | + |
| 43 | +```bash |
| 44 | +# Run a template with default parameters |
| 45 | +python krkn/template_manager.py run pod-failure |
| 46 | + |
| 47 | +# Or using the template wrapper |
| 48 | +python krkn-template run pod-failure |
| 49 | +``` |
| 50 | + |
| 51 | +### Viewing Template Details |
| 52 | + |
| 53 | +```bash |
| 54 | +# Show detailed information about a template |
| 55 | +python krkn/template_manager.py show pod-failure |
| 56 | + |
| 57 | +# Include README content |
| 58 | +python krkn/template_manager.py show pod-failure --show-readme |
| 59 | +``` |
| 60 | + |
| 61 | +## Template Customization |
| 62 | + |
| 63 | +### Parameter Overrides |
| 64 | + |
| 65 | +You can customize templates by overriding parameters: |
| 66 | + |
| 67 | +```bash |
| 68 | +python krkn/template_manager.py run pod-failure \ |
| 69 | + --param name_pattern="^nginx-.*$" \ |
| 70 | + --param namespace_pattern="^production$" \ |
| 71 | + --param kill=2 |
| 72 | +``` |
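To see what an anchored pattern such as `^nginx-.*$` would and would not select, the sketch below applies it to a list of made-up pod names with Python's `re` module. How KRKN applies the pattern internally is not stated in this guide, so the `select_names` helper and its use of `re.search` are assumptions for illustration only.

```python
import re


def select_names(names, name_pattern):
    """Return the names matched by the regex (hypothetical helper, not KRKN code)."""
    compiled = re.compile(name_pattern)
    return [name for name in names if compiled.search(name)]


# Made-up pod names for illustration.
pods = ["nginx-7d9c", "nginx-abc12", "redis-0", "my-nginx-1"]

# The leading ^ excludes "my-nginx-1", which only contains "nginx-" mid-string.
selected = select_names(pods, r"^nginx-.*$")
print(selected)
```

Anchoring both ends, as in `^production$`, keeps the override from also matching names like `production-eu`.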
| 73 | + |
| 74 | +### Common Parameters |
| 75 | + |
| 76 | +Most templates support these common parameters: |
| 77 | + |
| 78 | +- **name_pattern**: Regex pattern for resource names |
| 79 | +- **namespace_pattern**: Regex pattern for namespaces |
| 80 | +- **timeout**: Operation timeout in seconds |
| 81 | +- **recovery_time**: Recovery monitoring duration |
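As a rough illustration of how `--param key=value` overrides could be merged with the defaults declared in a template's `metadata.yaml`, consider the sketch below. It is not the template manager's actual implementation: `parse_params`, `coerce`, `resolve`, and the default values in `specs` are all hypothetical.

```python
def parse_params(pairs):
    """Turn ["key=value", ...] into a dict, splitting on the first '=' only."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        overrides[key] = value
    return overrides


def coerce(value, type_name):
    """Convert a string override to the declared type (string, integer, boolean)."""
    if type_name == "integer":
        return int(value)
    if type_name == "boolean":
        lowered = value.lower()
        if lowered in ("true", "1", "yes"):
            return True
        if lowered in ("false", "0", "no"):
            return False
        raise ValueError(f"not a boolean: {value!r}")
    return value


def resolve(specs, overrides):
    """Apply overrides on top of defaults, rejecting unknown parameter names."""
    known = {spec["name"]: spec for spec in specs}
    unknown = sorted(set(overrides) - set(known))
    if unknown:
        raise KeyError(f"unknown parameters: {unknown}")
    return {
        name: coerce(overrides[name], spec["type"]) if name in overrides else spec["default"]
        for name, spec in known.items()
    }


# Hypothetical parameter specs; the defaults are illustrative only.
specs = [
    {"name": "name_pattern", "type": "string", "default": ".*"},
    {"name": "namespace_pattern", "type": "string", "default": ".*"},
    {"name": "kill", "type": "integer", "default": 1},
]
overrides = parse_params(["name_pattern=^nginx-.*$", "kill=2"])
resolved = resolve(specs, overrides)
print(resolved)
```

Splitting on the first `=` only matters for values that themselves contain `=`, such as a label selector ending in `=`.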
| 82 | + |
| 83 | +## Template Structure |
| 84 | + |
| 85 | +Each template follows this structure: |
| 86 | + |
| 87 | +``` |
| 88 | +templates/chaos-scenarios/ |
| 89 | +└── template-name/ |
| 90 | + ├── scenario.yaml # Main chaos configuration |
| 91 | + ├── metadata.yaml # Template metadata and parameters |
| 92 | + └── README.md # Detailed documentation |
| 93 | +``` |
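The layout above lends itself to a quick completeness check before a template is run or submitted. The following is a minimal sketch, assuming only the three file names shown in the tree; the `missing_files` helper is hypothetical and not part of KRKN.

```python
import tempfile
from pathlib import Path

# File names taken from the directory layout shown above.
REQUIRED_FILES = ("scenario.yaml", "metadata.yaml", "README.md")


def missing_files(template_dir):
    """Return the required files that are absent from a template directory."""
    root = Path(template_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demonstration with a throwaway template that lacks its README.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "pod-failure"
    template.mkdir()
    (template / "scenario.yaml").write_text("# placeholder\n")
    (template / "metadata.yaml").write_text("name: pod-failure\n")
    incomplete = missing_files(template)

print(incomplete)
```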
| 94 | + |
| 95 | +### scenario.yaml |
| 96 | + |
| 97 | +Contains the actual chaos scenario configuration in KRKN format. |
| 98 | + |
| 99 | +### metadata.yaml |
| 100 | + |
| 101 | +Contains template metadata including: |
| 102 | + |
| 103 | +```yaml |
| 104 | +name: template-name |
| 105 | +description: Brief description of the template |
| 106 | +target: kubernetes-pod  # e.g. kubernetes-pod, kubernetes-node, kubernetes-network |
| 107 | +risk_level: medium  # one of: low, medium, high |
| 108 | +category: availability  # one of: availability, performance |
| 109 | +version: "1.0" |
| 110 | +author: KRKN Team |
| 111 | +tags: |
| 112 | + - tag1 |
| 113 | + - tag2 |
| 114 | +estimated_duration: "2-5 minutes" |
| 115 | +dependencies: [] |
| 116 | +parameters: |
| 117 | + - name: parameter_name |
| 118 | + type: string  # one of: string, integer, boolean |
| 119 | + description: Parameter description |
| 120 | + default: default_value |
| 121 | +``` |
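A metadata file of this shape can be sanity-checked once it has been loaded into a dictionary. The sketch below works on a plain `dict` so that it needs no YAML library; which fields are mandatory is an assumption, the allowed values are taken from the listing above, and `validate_metadata` is a hypothetical helper rather than part of KRKN.

```python
# Allowed values taken from the metadata listing above.
ALLOWED_RISK_LEVELS = {"low", "medium", "high"}
ALLOWED_CATEGORIES = {"availability", "performance"}
ALLOWED_TYPES = {"string", "integer", "boolean"}

# Assumption: these fields are treated as mandatory for the purpose of this sketch.
REQUIRED_FIELDS = ("name", "description", "target", "risk_level", "category", "version", "parameters")
REQUIRED_PARAM_FIELDS = ("name", "type", "description", "default")


def validate_metadata(meta):
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in meta]
    if "risk_level" in meta and meta["risk_level"] not in ALLOWED_RISK_LEVELS:
        errors.append(f"invalid risk_level: {meta['risk_level']!r}")
    if "category" in meta and meta["category"] not in ALLOWED_CATEGORIES:
        errors.append(f"invalid category: {meta['category']!r}")
    for index, param in enumerate(meta.get("parameters", [])):
        for key in REQUIRED_PARAM_FIELDS:
            if key not in param:
                errors.append(f"parameters[{index}] missing field: {key}")
        if "type" in param and param["type"] not in ALLOWED_TYPES:
            errors.append(f"parameters[{index}] invalid type: {param['type']!r}")
    return errors


# Illustrative metadata; the description and default for "kill" are made up.
good = {
    "name": "pod-failure",
    "description": "Simulates pod crash to test application resiliency",
    "target": "kubernetes-pod",
    "risk_level": "medium",
    "category": "availability",
    "version": "1.0",
    "parameters": [
        {"name": "kill", "type": "integer", "description": "Number of pods to kill", "default": 1},
    ],
}
bad = dict(good, risk_level="extreme")

print(validate_metadata(good))
print(validate_metadata(bad))
```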
| 122 | + |
| 123 | +### README.md |
| 124 | + |
| 125 | +Comprehensive documentation including: |
| 126 | + |
| 127 | +- Use cases |
| 128 | +- Prerequisites |
| 129 | +- Usage examples |
| 130 | +- Expected behavior |
| 131 | +- Customization options |
| 132 | +- Troubleshooting guide |
| 133 | + |
| 134 | +## Usage Examples |
| 135 | + |
| 136 | +### Pod Failure Testing |
| 137 | + |
| 138 | +```bash |
| 139 | +# Test pod failure with default settings |
| 140 | +python krkn-template run pod-failure |
| 141 | + |
| 142 | +# Target specific application |
| 143 | +python krkn-template run pod-failure \ |
| 144 | + --param name_pattern="^frontend-.*$" \ |
| 145 | + --param namespace_pattern="^production$" |
| 146 | + |
| 147 | +# Kill multiple pods |
| 148 | +python krkn-template run pod-failure \ |
| 149 | + --param kill=3 \ |
| 150 | + --param krkn_pod_recovery_time=180 |
| 151 | +``` |
| 152 | + |
| 153 | +### Network Latency Testing |
| 154 | + |
| 155 | +```bash |
| 156 | +# Add 100ms latency |
| 157 | +python krkn-template run network-latency |
| 158 | + |
| 159 | +# Custom latency settings |
| 160 | +python krkn-template run network-latency \ |
| 161 | + --param latency="200ms" \ |
| 162 | + --param jitter="20ms" \ |
| 163 | + --param duration=120 |
| 164 | +``` |
| 165 | + |
| 166 | +### CPU Stress Testing |
| 167 | + |
| 168 | +```bash |
| 169 | +# Apply 80% CPU load |
| 170 | +python krkn-template run cpu-stress |
| 171 | + |
| 172 | +# High intensity stress |
| 173 | +python krkn-template run cpu-stress \ |
| 174 | + --param cpu-load-percentage=95 \ |
| 175 | + --param duration=300 \ |
| 176 | + --param number-of-nodes=2 |
| 177 | +``` |
| 178 | + |
| 179 | +### Node Failure Testing |
| 180 | + |
| 181 | +```bash |
| 182 | +# Test single node failure |
| 183 | +python krkn-template run node-failure |
| 184 | + |
| 185 | +# Target specific nodes |
| 186 | +python krkn-template run node-failure \ |
| 187 | + --param label_selector="node-role.kubernetes.io/app=" \ |
| 188 | + --param instance_count=1 |
| 189 | +``` |
| 190 | + |
| 191 | +## Best Practices |
| 192 | + |
| 193 | +### Before Running Templates |
| 194 | + |
| 195 | +1. **Test in Non-Production**: Always test templates in development/staging environments first. |
| 196 | +2. **Check Prerequisites**: Ensure all prerequisites are met for the target template. |
| 197 | +3. **Monitor Resources**: Verify sufficient cluster resources are available. |
| 198 | +4. **Backup Data**: Ensure critical data is backed up before running high-risk templates. |
| 199 | + |
| 200 | +### During Execution |
| 201 | + |
| 202 | +1. **Monitor Health**: Watch cluster and application health metrics. |
| 203 | +2. **Check Logs**: Monitor KRKN and application logs for issues. |
| 204 | +3. **Abort if Necessary**: Stop execution if unexpected issues occur. |
| 205 | +4. **Document Results**: Record outcomes and observations. |
| 206 | + |
| 207 | +### After Execution |
| 208 | + |
| 209 | +1. **Verify Recovery**: Ensure all resources have recovered properly. |
| 210 | +2. **Review Logs**: Analyze logs for insights and improvements. |
| 211 | +3. **Update Configurations**: Adjust application configurations based on results. |
| 212 | +4. **Document Learnings**: Record findings for future reference. |
| 213 | + |
| 214 | +## Risk Management |
| 215 | + |
| 216 | +### Risk Levels |
| 217 | + |
| 218 | +- **Low**: Minimal impact, unlikely to cause service disruption |
| 219 | +- **Medium**: May cause temporary service disruption |
| 220 | +- **High**: Can cause significant service disruption |
| 221 | + |
| 222 | +### Safety Measures |
| 223 | + |
| 224 | +1. **Start Small**: Begin with low-risk templates and low intensity settings. |
| 225 | +2. **Gradual Increase**: Slowly increase intensity and complexity. |
| 226 | +3. **Time Restrictions**: Run chaos experiments during maintenance windows. |
| 227 | +4. **Rollback Plans**: Have clear rollback procedures ready. |
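One way to apply the "Start Small" advice is to gate which templates may run in a given environment by their risk level. The sketch below uses the risk levels from the Core Templates table; the gating function itself is hypothetical and is not a feature this guide attributes to KRKN.

```python
# Ordering of the three risk levels described above.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Risk levels copied from the Core Templates table.
TEMPLATE_RISK = {
    "pod-failure": "medium",
    "node-failure": "high",
    "network-latency": "low",
    "cpu-stress": "medium",
    "disk-stress": "medium",
    "pod-kill": "medium",
    "container-restart": "low",
    "vm-outage": "high",
    "resource-failure": "medium",
}


def allowed_templates(max_risk):
    """Return the templates whose risk level does not exceed max_risk, sorted by name."""
    limit = RISK_ORDER[max_risk]
    return sorted(name for name, risk in TEMPLATE_RISK.items() if RISK_ORDER[risk] <= limit)


print(allowed_templates("low"))
```

A first run in a new environment could be limited to `allowed_templates("low")`, with the ceiling raised only after those runs recover cleanly.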
| 228 | + |
| 229 | +## Integration with CI/CD |
| 230 | + |
| 231 | +### GitHub Actions Example |
| 232 | + |
| 233 | +```yaml |
| 234 | +- name: Run Chaos Test |
| 235 | + run: | |
| 236 | + python krkn-template run pod-failure \ |
| 237 | + --param name_pattern="^app-.*$" \ |
| 238 | + --param namespace_pattern="^testing$" |
| 239 | +``` |
| 240 | + |
| 241 | +### Jenkins Pipeline Example |
| 242 | + |
| 243 | +```groovy |
| 244 | +stage('Chaos Test') { |
| 245 | + steps { |
| 246 | + sh 'python krkn-template run network-latency --param latency="50ms"' |
| 247 | + } |
| 248 | +} |
| 249 | +``` |
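When a pipeline step is written in Python rather than shell, the same command line can be assembled as an argument list, which avoids quoting problems with regex values. This is a minimal sketch: `build_command` is a hypothetical helper, and only the `run` subcommand and `--param` flag shown in this guide are used.

```python
import shlex


def build_command(template, params=None, wrapper="krkn-template"):
    """Build the argv list for running a template through the wrapper script."""
    argv = ["python", wrapper, "run", template]
    for key, value in (params or {}).items():
        argv += ["--param", f"{key}={value}"]
    return argv


cmd = build_command("network-latency", {"latency": "50ms"})

# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` so that a non-zero exit status fails the stage; whether the wrapper signals a failed experiment through its exit status is an assumption to verify.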
| 250 | + |
| 251 | +## Troubleshooting |
| 252 | + |
| 253 | +### Common Issues |
| 254 | + |
| 255 | +1. **Template Not Found**: Check the spelling of the template name and the path to the templates directory. |
| 256 | +2. **Permission Denied**: Verify the RBAC permissions of the KRKN service account. |
| 257 | +3. **Resource Not Found**: Ensure target resources exist and are accessible. |
| 258 | +4. **Timeout Errors**: Increase timeout values for slow clusters. |
| 259 | + |
| 260 | +### Debug Mode |
| 261 | + |
| 262 | +Enable debug logging for detailed troubleshooting: |
| 263 | + |
| 264 | +```bash |
| 265 | +python krkn-template run pod-failure --debug |
| 266 | +``` |
| 267 | + |
| 268 | +### Log Locations |
| 269 | + |
| 270 | +- KRKN logs: Console output and report files |
| 271 | +- Application logs: Kubernetes pod logs |
| 272 | +- System logs: Node system logs (if accessible) |
| 273 | + |
| 274 | +## Contributing Templates |
| 275 | + |
| 276 | +### Creating New Templates |
| 277 | + |
| 278 | +1. Create a directory for the template under `templates/chaos-scenarios/` |
| 279 | +2. Add `scenario.yaml`, `metadata.yaml`, and `README.md` |
| 280 | +3. Follow the established structure and naming conventions |
| 281 | +4. Test thoroughly before submitting |
| 282 | + |
| 283 | +### Template Guidelines |
| 284 | + |
| 285 | +- Use descriptive names and clear documentation |
| 286 | +- Include comprehensive parameter descriptions |
| 287 | +- Provide multiple usage examples |
| 288 | +- Include troubleshooting sections |
| 289 | +- Follow KRKN coding standards |
| 290 | + |
| 291 | +## Support |
| 292 | + |
| 293 | +For issues related to the template system: |
| 294 | + |
| 295 | +1. Check the template README files |
| 296 | +2. Review KRKN documentation |
| 297 | +3. Search existing GitHub issues |
| 298 | +4. Create new issues with detailed information |
| 299 | + |
| 300 | +## Integration with Scenarios Hub |
| 301 | + |
| 302 | +The template system is designed to integrate with the [KRKN Scenarios Hub](https://github.com/krkn-chaos/scenarios-hub). Templates can be contributed to the hub for community sharing and collaboration. |