| 1 | +# Chaos Engineering as a Service (CAAS) |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +Build a unified dashboard to manage and monitor experiments for multiple |
| 6 | +platforms and multiple clusters. |
| 7 | + |
| 8 | +Related Issue: https://github.com/chaos-mesh/chaos-mesh/issues/1462 |
| 9 | + |
| 10 | +## Motivation |
| 11 | + |
| 12 | +The following problems exist with the current chaos-mesh design: |
| 13 | + |
| 14 | +1. The cost of maintenance and change is high, since chaos-daemon and chaosd |
| 15 | + are two separate programs that serve a similar purpose. |
| 16 | + **The goal is to unify the two.** |
| 17 | +2. Observability of experiment results from within the dashboard is poor. |
| 18 | + **The goal is to collect the metrics with Prometheus and show them in the dashboard.** |
| 19 | + |
| 20 | +## Detailed design |
| 21 | + |
| 22 | +### Architecture redesign |
| 23 | + |
| 24 | +Current architecture: https://chaos-mesh.org/docs/overview/architecture |
| 25 | + |
| 26 | + |
| 27 | + |
| 28 | +Unlike in the current architecture, the Chaos Dashboard is truly multi-cluster: |
| 29 | +it can run outside any cluster and manage multiple chaos controllers, and even |
| 30 | +chaosd on physical nodes. |
| 31 | + |
| 32 | +Prometheus is added to collect node metrics, giving better visibility of |
| 33 | +experiment results in the dashboard. |
| 34 | + |
| 35 | +### Unify chaosd and chaos-daemon |
| 36 | + |
| 37 | +chaosd and chaos-daemon serve similar purposes but target different types of |
| 38 | +nodes: physical nodes and Kubernetes worker nodes, respectively. |
| 39 | +That is why their communication mechanisms are completely different: |
| 40 | + |
| 41 | +- chaosd lacks server support, so it is only a CLI for now |
| 42 | +- chaos-daemon communicates with the chaos controller via gRPC |
| 43 | + |
| 44 | +Most of the logic (fault injection, among other things) can therefore be |
| 45 | +factored out into a common library shared by chaosd and chaos-daemon. |
| 46 | +This makes both components easier to maintain and change. |
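
The shared-library idea can be sketched as follows. This is a minimal illustration, not the actual chaos-mesh code: the `Attack` interface, the `NetworkDelay` type, and the `RunFromCLI` and `HandleRPC` functions are all hypothetical names, and the attack body only flips a flag where a real implementation would change system state.

```go
package main

import (
	"errors"
	"fmt"
)

// Attack is a hypothetical interface for the shared fault-injection library.
// Both chaosd and chaos-daemon would depend on it instead of duplicating logic.
type Attack interface {
	Apply() error
	Recover() error
}

// NetworkDelay is an illustrative attack. A real one would configure the
// host's network stack; here it only records whether it is active.
type NetworkDelay struct {
	Device  string
	Latency string
	active  bool
}

// Apply validates the parameters and marks the attack as active.
func (n *NetworkDelay) Apply() error {
	if n.Device == "" {
		return errors.New("device must not be empty")
	}
	n.active = true
	return nil
}

// Recover undoes the attack.
func (n *NetworkDelay) Recover() error {
	n.active = false
	return nil
}

// RunFromCLI shows how a chaosd-style command could call the shared library.
func RunFromCLI(a Attack) string {
	if err := a.Apply(); err != nil {
		return "error: " + err.Error()
	}
	return "attack applied"
}

// HandleRPC shows how a chaos-daemon-style RPC handler could call the same code.
func HandleRPC(a Attack) error {
	return a.Apply()
}

func main() {
	delay := &NetworkDelay{Device: "eth0", Latency: "100ms"}
	fmt.Println(RunFromCLI(delay))
	fmt.Println(HandleRPC(&NetworkDelay{}))
}
```

With this shape, the CLI and the gRPC server are thin front ends, and a new fault type only has to be implemented once.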
| 47 | + |
| 48 | +Server support needs to be added to chaosd so that it listens for |
| 49 | +authenticated requests on a port of the host machine. |
| 50 | + |
| 51 | +### Web Dashboard |
| 52 | + |
| 53 | +With this new dashboard, chaos-mesh will be one step closer to |
| 54 | +making **Chaos Engineering as a Service** possible. End users can manage |
| 55 | +multiple node groups (both Kubernetes and physical) from within the dashboard, |
| 56 | +adding and removing cluster configurations from the UI. |
| 57 | + |
| 58 | +For physical nodes, a URL pointing to the chaosd server needs to be provided, |
| 59 | +along with authentication credentials. For Kubernetes nodes, the user |
| 60 | +needs to provide the kubeconfig of the cluster. |
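
The two kinds of target and their required fields could be modelled roughly as below. This is a sketch under assumptions: the `Target`, `TargetKind`, and `Registry` types and their field names are hypothetical, and the registry keeps data in memory where the real dashboard would need secure persistent storage.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// TargetKind distinguishes the two node groups the dashboard can manage.
type TargetKind string

const (
	KindKubernetes TargetKind = "kubernetes"
	KindPhysical   TargetKind = "physical"
)

// Target is a hypothetical record for one managed cluster or node group.
type Target struct {
	Name        string
	Kind        TargetKind
	Kubeconfig  []byte // required for Kubernetes targets
	ChaosdURL   string // required for physical targets
	Credentials string // required for physical targets
}

// Validate checks that the fields required for the target's kind are present.
func (t Target) Validate() error {
	if t.Name == "" {
		return errors.New("target name must not be empty")
	}
	switch t.Kind {
	case KindKubernetes:
		if len(t.Kubeconfig) == 0 {
			return errors.New("kubernetes target requires a kubeconfig")
		}
	case KindPhysical:
		u, err := url.Parse(t.ChaosdURL)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return errors.New("physical target requires a valid chaosd URL")
		}
		if t.Credentials == "" {
			return errors.New("physical target requires credentials")
		}
	default:
		return fmt.Errorf("unknown target kind %q", t.Kind)
	}
	return nil
}

// Registry is an in-memory stand-in for the dashboard's target storage.
type Registry struct {
	targets map[string]Target
}

func NewRegistry() *Registry {
	return &Registry{targets: make(map[string]Target)}
}

// Add validates the target and stores it, rejecting duplicate names.
func (r *Registry) Add(t Target) error {
	if err := t.Validate(); err != nil {
		return err
	}
	if _, exists := r.targets[t.Name]; exists {
		return fmt.Errorf("target %q already exists", t.Name)
	}
	r.targets[t.Name] = t
	return nil
}

// Remove deletes a target by name; removing an unknown name is a no-op.
func (r *Registry) Remove(name string) {
	delete(r.targets, name)
}

func main() {
	reg := NewRegistry()
	fmt.Println(reg.Add(Target{Name: "prod", Kind: KindKubernetes, Kubeconfig: []byte("apiVersion: v1")}))
	fmt.Println(reg.Add(Target{Name: "bare-metal", Kind: KindPhysical, ChaosdURL: "https://10.0.0.5:31767", Credentials: "token"}))
	fmt.Println(reg.Add(Target{Name: "broken", Kind: KindPhysical}))
}
```

Validating at registration time lets the UI report a missing kubeconfig or chaosd URL before any experiment is scheduled against the target.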
| 61 | + |
| 62 | +The dashboard will also collect Prometheus metrics for better visibility of |
| 63 | +experiments from within the dashboard itself. |
| 64 | + |
| 65 | +## Drawbacks |
| 66 | + |
| 67 | +1. The dashboard will store both authentication credentials and |
| 68 | + kubeconfigs in the DB, which is a security risk unless they are |
| 69 | + handled securely. |
| 70 | + |
| 71 | +## Alternatives |
| 72 | + |
| 73 | +NA |
| 74 | + |
| 75 | +## Unresolved questions |
| 76 | + |
| 77 | +1. How to securely store auth credentials in the dashboard? |
| 78 | + (could refer to GitHub Secrets) |
| 79 | +2. What authentication mechanism to use for chaosd on physical nodes? |