Automating for Failures - Chaos Engineering with Chef

This project was submitted to Automate for Good by Chef

About this Project:

Failures happen in large distributed systems. We recently saw a worldwide outage of Facebook, Instagram, and WhatsApp. Fastly had an outage in the middle of this year, and Google had one at the end of 2020. Failures happen!

But can we identify these failures proactively? Chaos Engineering is widely practiced to address this problem. With Chef, we automate Chaos Engineering experiments that inject turbulent conditions into your system in order to:

Build confidence in your systems
Uncover points of failure
Detect failures proactively

What it does:

Our project, Automating for Failures, uses Chef Infra to automate attacks on systems hosted on AWS, Google Cloud, or your own infrastructure. These attacks create turbulent conditions in your system, such as:

Rebooting servers
Stopping services
Killing running processes
Blocking networks
Simulating a failover in the database
Using all of the server's CPU cores
And more
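
To make one of these conditions concrete, here is a minimal plain-Ruby sketch of what a "kill a running process" attack amounts to at the operating-system level. It is not taken from the project's cookbooks. The helper name `kill_process` is hypothetical, and the victim is a throwaway `sleep` child process, so nothing real is harmed.

```ruby
# Plain-Ruby sketch of a "kill a running process" attack.
# Assumption: this is an illustration only, not the project's recipe code.

# Sends SIGKILL to +pid+, waits for the child to terminate,
# and returns its Process::Status.
def kill_process(pid)
  Process.kill('KILL', pid)
  _, status = Process.wait2(pid)
  status
end

# Start a harmless stand-in for a service process.
victim_pid = Process.spawn('sleep', '30')

status = kill_process(victim_pid)
puts "signaled=#{status.signaled?} signal=#{status.termsig}"
```

In a real experiment the target would be a service process on a test node, and the interesting part is what happens afterwards: whether a supervisor restarts it and whether dependent services notice.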

How we built it:

Our Chaos Attacks Pipeline involves the following components:

Chef Workstation: Develop and test attacks
Managed Chef server: Manage cookbooks, roles, policies, and nodes
Chef nodes: Run chef-client to trigger attacks

Chef + Chaos Engineering:

Chaos Principles (Advanced):

Reference: https://principlesofchaos.org/

Build a Hypothesis around Steady State Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
Minimize Blast Radius

With this project, our goal is to put one of the Chaos Engineering principles into practice with Chef:

Automate Experiments to Run Continuously

High Level Design:

The following layout shows how we structured our cookbooks and recipes for code reuse and cohesion:

Host Level Attacks:

Cookbook: host_attacks
- Recipes: reboot, shutdown, terminate, block_network

Service Level Attacks:

Cookbook: service_attacks
- Recipes: stop, restart, kill

Server Resource Level Attacks:

Cookbook: resource_attacks
- Recipes: CPU, Memory, IO, Disk

MongoDB Attacks:

Cookbook: mongo_attacks
- Recipes: exhaust_connections, stepdown primary, kill primary, kill

Redis Attacks:

Cookbook: redis_attacks
- Recipes: sleep server, saturate IOPS, stop service

Cassandra Attacks: WIP

Cookbook: cassandra_attacks
- Recipes: stop_majority_nodes in 1 DC, stop 1 node, stop all nodes in DC

How to use this project:

Clone this GitHub repo: git clone https://github.com/150astro/chef_autoamte_for_good.git
Upload the cookbooks you're interested in to your Chef server using:
- knife upload cookbooks/chaos_attacks/cookbooks/host_attacks
- knife upload cookbooks/chaos_attacks/cookbooks/service_attacks
- knife upload cookbooks/chaos_attacks/cookbooks/resource_attacks
- knife upload cookbooks/chaos_attacks/cookbooks/redis_attacks
- knife upload cookbooks/chaos_attacks/cookbooks/mongodb_attacks
Bootstrap your nodes to the Chef server if you haven't already:
- knife bootstrap <node_fqdn> -x <host_username> -i <host_key_file_name> --sudo --node-name <node_name_of_your_choice> -r "recipe[starter::default]"

How to test an attack:

Please test the attacks only on your test nodes

Example: Run an attack to reboot your host

Select the cookbook that contains the attack. Example: host_attacks
Choose the attack you want to simulate on the node. Example: reboot
The recipe name is derived from the attack name. Example: for the reboot attack, the recipe is reboot_host.rb
Test the recipe from the node: chef-client -o 'recipe[host_attacks::reboot_host]'
On a successful chef-client run, the node will be rebooted.
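
The steps above boil down to combining a cookbook name and a recipe name into a `chef-client -o` invocation. The following is a small sketch of that composition in plain Ruby. The helper `attack_command` and the `NAME_PATTERN` check are hypothetical names introduced here for illustration and are not part of the project.

```ruby
# Sketch: build the one-off chef-client command for an attack recipe.
# Assumption: cookbook and recipe names use only lowercase letters,
# digits, underscores, and hyphens, so single-quoting them is safe.
NAME_PATTERN = /\A[a-z0-9_-]+\z/

# Returns the command string for running +recipe+ from +cookbook+
# as an override run-list on the node.
def attack_command(cookbook, recipe)
  [cookbook, recipe].each do |name|
    raise ArgumentError, "invalid name: #{name.inspect}" unless name.match?(NAME_PATTERN)
  end
  "chef-client -o 'recipe[#{cookbook}::#{recipe}]'"
end

puts attack_command('host_attacks', 'reboot_host')
# prints: chef-client -o 'recipe[host_attacks::reboot_host]'
```

The same helper yields the CPU attack command when called with `'resource_attacks'` and `'all_cpu_spikes'`. Rejecting unexpected characters up front keeps the generated string safe to paste into a shell.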

Example: Run an attack to consume all CPU cores on a node

Select the cookbook that contains the attack. Example: resource_attacks
Choose the attack you want to simulate on the node. Example: all_cpu_spikes
The recipe name is derived from the attack name. Example: for the all_cpu_spikes attack, the recipe is all_cpu_spikes.rb
Test the recipe from the node: chef-client -o 'recipe[resource_attacks::all_cpu_spikes]'
On a successful chef-client run, all of the node's CPU cores will be used at up to 100% for 60 seconds.
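
For intuition about what such an attack does on the node, here is a minimal plain-Ruby sketch of saturating CPU cores for a fixed duration. It is an assumption about the general technique, not the contents of all_cpu_spikes.rb. The helper `burn_cpu` is a hypothetical name, and the sketch relies on `fork`, so it runs on Unix-like systems only. It forks processes rather than starting threads because threads in the standard Ruby interpreter do not execute Ruby code on several cores at once.

```ruby
require 'etc'

# Forks +workers+ child processes that each spin in a tight loop for
# +seconds+, then waits for all of them.
# Returns the number of workers that exited cleanly.
def burn_cpu(seconds:, workers: Etc.nprocessors)
  pids = Array.new(workers) do
    fork do
      deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
      # Busy-wait until the deadline to keep one core fully occupied.
      nil while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
      exit!(0)
    end
  end

  pids.count do |pid|
    _, status = Process.wait2(pid)
    status.success?
  end
end

# Short demo run on two workers; the attack described above lasts 60 seconds.
finished = burn_cpu(seconds: 0.5, workers: 2)
puts "#{finished} workers finished"
```

Calling `burn_cpu(seconds: 60)` with the default worker count would start one busy loop per core reported by `Etc.nprocessors`, which matches the duration described above. As with the real attacks, run it only on a test node.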
