docs: new rfc: chaos engineering as a service

shivanshs9 · shivanshs9 · commit a6ce0c195a79 · 2021-03-02T04:19:09.000+05:30
diff --git a/text/2021-02-26-chaos-engg-as-service.md b/text/2021-02-26-chaos-engg-as-service.md
@@ -0,0 +1,79 @@
+# Chaos Engineering as a Service (CAAS)
+
+## Summary
+
+Build a unified dashboard to manage and monitor experiments for multiple
+platforms and multiple clusters.
+
+Related Issue: https://github.com/chaos-mesh/chaos-mesh/issues/1462
+
+## Motivation
+
+The current chaos-mesh design has the following problems:
+
+1. The cost of maintenance and change is high, since chaos-daemon and chaosd
+   are two separate programs that serve a similar purpose.
+   **The goal is to unify the two.**
+2. Experiment results are poorly observable from within the dashboard.
+   **The goal is to collect metrics with Prometheus and show them in the dashboard.**
+
+## Detailed design
+
+### Architecture redesign
+
+Current architecture: https://chaos-mesh.org/docs/overview/architecture
+
+![New architecture](https://user-images.githubusercontent.com/5793595/106101841-7235d600-6179-11eb-8d57-eadd51ac1e6a.png)
+
+Unlike the current architecture, the Chaos Dashboard in this design is truly
+multi-cluster: it can run outside the cluster and manage multiple chaos
+controllers, and even chaosd on physical nodes.
+
+Prometheus is added to collect node metrics, which gives better visibility
+of experiment results in the dashboard.
+
+### Unify chaosd and chaos-daemon
+
+chaosd and chaos-daemon serve similar purposes but target different types of
+nodes: physical nodes and Kubernetes worker nodes, respectively.
+As a result, their communication mechanisms are completely different:
+
+- chaosd lacks server support and so it's only a CLI for now
+- chaos-daemon communicates with chaos controller via gRPC
+
+Most of the logic (the parts that cause chaos, among other things) can
+therefore be extracted into a common library shared by chaosd and chaos-daemon.
+This makes both components easier to maintain and easier to change.
+
+Server support also needs to be added to chaosd, so that it listens for
+authenticated requests on a port of the host machine.
+
+### Web Dashboard
+
+With this new dashboard, chaos-mesh will be one step closer to making
+**Chaos Engineering as a Service** possible. End users can manage multiple
+node groups (both Kubernetes and physical) from within the dashboard,
+adding and removing cluster configurations from the UI.
+
+For physical nodes, a URL pointing to the chaosd server needs to be provided,
+along with authentication credentials. For Kubernetes nodes, the user
+needs to provide the kubeconfig of the cluster.
+
+The dashboard will also collect Prometheus metrics, giving better visibility
+of each experiment without leaving the dashboard.
+
+## Drawbacks
+
+1. The dashboard will store both authentication credentials and
+   kubeconfigs in its DB, which is a security risk unless they are
+   handled securely.
+
+## Alternatives
+
+NA
+
+## Unresolved questions
+
+1. How should auth credentials be stored securely in the dashboard?
+   (could refer to GitHub Secrets)
+2. What authentication mechanism should chaosd use on physical nodes?