Commit a6ce0c1

docs: new rfc: chaos engineering as a service
1 parent 4666a60 commit a6ce0c1

1 file changed: 79 additions, 0 deletions

@@ -0,0 +1,79 @@
# Chaos Engineering as a Service (CAAS)

## Summary

Build a unified dashboard to manage and monitor experiments for multiple
platforms and multiple clusters.

Related Issue: https://github.com/chaos-mesh/chaos-mesh/issues/1462

## Motivation

The following problems exist with the current chaos-mesh design:

1. The cost of maintenance and change is high, since chaos-daemon and chaosd are
   two separate programs that serve a similar purpose.
   **The goal is to unify the two.**
2. Experiment results are poorly observable from within the dashboard.
   **The goal is to collect metrics with Prometheus and show them in the dashboard.**

## Detailed design

### Architecture redesign

Current architecture: https://chaos-mesh.org/docs/overview/architecture

![New architecture](https://user-images.githubusercontent.com/5793595/106101841-7235d600-6179-11eb-8d57-eadd51ac1e6a.png)

Unlike in the current architecture, Chaos Dashboard is truly multi-cluster:
it can exist outside the cluster and manage multiple chaos controllers,
and even chaosd on physical nodes.

Prometheus is added to collect node metrics, which gives better
visibility of experiment results in the dashboard.

### Unify chaosd and chaos-daemon

chaosd and chaos-daemon serve similar purposes but target different
types of nodes: physical nodes and Kubernetes worker nodes, respectively.
That is why their communication mechanisms are entirely different:

- chaosd lacks server support, so it is only a CLI for now
- chaos-daemon communicates with the chaos controller via gRPC

So most of the logic (the parts that inject chaos, among other things) can be
extracted into a common library shared by chaosd and chaos-daemon.
This makes both components easier to maintain and change.
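One way to picture the common library is a single interface for chaos actions that both binaries import. The sketch below is only an illustration: `Experiment`, `Injector`, `Registry`, and `recordingInjector` are hypothetical names and are not taken from the chaos-mesh or chaosd code bases.

```go
package main

import (
	"errors"
	"fmt"
)

// Experiment describes one chaos action. Hypothetical type, not a chaos-mesh API.
type Experiment struct {
	Kind   string            // e.g. "network-delay"
	Params map[string]string // action-specific parameters
}

// Injector is the shared abstraction: the logic that actually causes chaos
// and undoes it again.
type Injector interface {
	Apply(exp Experiment) error
	Recover(exp Experiment) error
}

// ErrUnknownKind is returned when no injector is registered for a kind.
var ErrUnknownKind = errors.New("unknown experiment kind")

// Registry maps an experiment kind to the injector that handles it.
type Registry struct {
	injectors map[string]Injector
}

func NewRegistry() *Registry {
	return &Registry{injectors: make(map[string]Injector)}
}

func (r *Registry) Register(kind string, inj Injector) {
	r.injectors[kind] = inj
}

// Apply dispatches an experiment to its injector.
func (r *Registry) Apply(exp Experiment) error {
	inj, ok := r.injectors[exp.Kind]
	if !ok {
		return fmt.Errorf("%w: %q", ErrUnknownKind, exp.Kind)
	}
	return inj.Apply(exp)
}

// recordingInjector stands in for a real injector and only records calls.
type recordingInjector struct {
	applied []string
}

func (r *recordingInjector) Apply(exp Experiment) error {
	r.applied = append(r.applied, exp.Kind)
	return nil
}

func (r *recordingInjector) Recover(exp Experiment) error { return nil }

func main() {
	reg := NewRegistry()
	reg.Register("network-delay", &recordingInjector{})

	// A chaosd CLI command and a chaos-daemon gRPC handler would both end up
	// calling reg.Apply; only the transport in front of it differs.
	fmt.Println(reg.Apply(Experiment{Kind: "network-delay"}))
	fmt.Println(reg.Apply(Experiment{Kind: "disk-fill"}))
}
```

With this shape, the transport-specific code in each binary stays thin, and a new chaos type is added in one place.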
47+
48+
Server support needs to be added to chaosd so it listens for authenticated
49+
requests on some port of the host machine.
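As a rough sketch of what "listens for authenticated requests" could look like, the example below wraps an HTTP handler in a static bearer-token check. The authentication mechanism is still an open question in this RFC, so the bearer token, the `/experiments` path, and all function names here are assumptions chosen for illustration only.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// authorized reports whether an Authorization header carries the expected
// bearer token. The comparison is constant-time to avoid leaking the token.
func authorized(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// requireToken rejects requests that do not present the expected token.
func requireToken(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !authorized(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newHandler builds a hypothetical chaosd HTTP API with one endpoint.
func newHandler(token string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/experiments", func(w http.ResponseWriter, r *http.Request) {
		// A real server would decode the request and call the shared chaos library.
		w.WriteHeader(http.StatusAccepted)
	})
	return requireToken(token, mux)
}

// statusFor sends one in-memory request and returns the HTTP status code.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodPost, "/experiments", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newHandler("s3cret")
	fmt.Println("no credentials:", statusFor(h, ""))
	fmt.Println("valid token:", statusFor(h, "Bearer s3cret"))
	// In a deployment the handler would be served with http.ListenAndServe
	// (or ListenAndServeTLS) on the chosen host port.
}
```

The requests in `main` are served in memory so that the sketch terminates; a real chaosd server would bind a port and should sit behind TLS so that the token is not sent in clear text.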

### Web Dashboard

With this more powerful dashboard, chaos-mesh will be one step closer to
making **Chaos Engineering as a Service** possible. End users can manage
multiple node groups (both Kubernetes and physical) from within the dashboard,
adding and removing cluster configurations from the UI.

For physical nodes, a URL pointing to the chaosd server needs to be provided,
along with authentication credentials. For Kubernetes nodes, the user
needs to provide the kubeconfig of the cluster.
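The two kinds of node group could be stored as one record type with kind-specific fields. The following is a minimal sketch under that assumption; the `Target` type, its field names, and the validation rules are hypothetical and not part of any existing dashboard schema.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// TargetKind distinguishes the two node group types described in this RFC.
type TargetKind string

const (
	KindPhysical   TargetKind = "physical"
	KindKubernetes TargetKind = "kubernetes"
)

// Target is one node group registered in the dashboard (hypothetical schema).
type Target struct {
	Name       string
	Kind       TargetKind
	ChaosdURL  string // physical nodes: URL of the chaosd server
	Credential string // physical nodes: authentication credential
	Kubeconfig []byte // Kubernetes nodes: kubeconfig of the cluster
}

// Validate checks that the fields required for the target's kind are present.
func (t Target) Validate() error {
	if t.Name == "" {
		return errors.New("target name is required")
	}
	switch t.Kind {
	case KindPhysical:
		u, err := url.Parse(t.ChaosdURL)
		if err != nil || u.Host == "" || (u.Scheme != "http" && u.Scheme != "https") {
			return fmt.Errorf("target %q: chaosd URL must be an absolute http(s) URL", t.Name)
		}
		if t.Credential == "" {
			return fmt.Errorf("target %q: credential is required", t.Name)
		}
	case KindKubernetes:
		if len(t.Kubeconfig) == 0 {
			return fmt.Errorf("target %q: kubeconfig is required", t.Name)
		}
	default:
		return fmt.Errorf("target %q: unknown kind %q", t.Name, t.Kind)
	}
	return nil
}

func main() {
	physical := Target{
		Name:       "rack-1",
		Kind:       KindPhysical,
		ChaosdURL:  "https://node-1.example.com:8443",
		Credential: "token",
	}
	cluster := Target{Name: "staging", Kind: KindKubernetes}

	fmt.Println(physical.Validate())
	fmt.Println(cluster.Validate())
}
```

Validating at the point where a configuration is added from the UI keeps incomplete entries out of the dashboard's database.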

The dashboard will also collect Prometheus metrics for better visibility of the
experiment from within the dashboard itself.
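A dashboard backend could read these metrics through the Prometheus HTTP API's instant query endpoint (`/api/v1/query`). The sketch below builds such a query URL and parses an instant-vector response. The Prometheus address, the use of node exporter's `node_cpu_seconds_total` metric, and the helper names `queryURL` and `parseSamples` are assumptions for illustration; the sample response body is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// queryURL builds an instant-query URL for the Prometheus HTTP API.
func queryURL(base, promQL string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	u.RawQuery = url.Values{"query": {promQL}}.Encode()
	return u.String(), nil
}

// queryResponse mirrors the parts of an instant-vector response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

// parseSamples returns one value per "instance" label.
func parseSamples(body []byte) (map[string]float64, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("prometheus query failed: status %q", resp.Status)
	}
	out := make(map[string]float64)
	for _, r := range resp.Data.Result {
		s, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected sample value %v", r.Value[1])
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, err
		}
		out[r.Metric["instance"]] = v
	}
	return out, nil
}

func main() {
	// Per-node CPU usage rate; assumes node exporter metrics are scraped.
	q := `sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[1m]))`
	u, err := queryURL("http://prometheus:9090", q)
	fmt.Println(u, err)

	// Made-up response body, shaped like a real instant-vector result.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[` +
		`{"metric":{"instance":"node-1:9100"},"value":[1611830000,"0.42"]}]}}`)
	fmt.Println(parseSamples(body))
}
```

Plotting such samples next to an experiment's start and end times is one way the dashboard could show its effect on a node.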

## Drawbacks

1. The dashboard will store both authentication credentials and
   kubeconfigs in its DB, which is a security risk unless they are
   handled securely.

## Alternatives

NA

## Unresolved questions

1. How to securely store auth credentials in the dashboard?
   (could refer to GitHub Secrets)
2. What authentication mechanism to use for chaosd on physical nodes?
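For the first question, one option worth evaluating is to encrypt credentials and kubeconfigs before they reach the DB. The sketch below uses AES-256-GCM from Go's standard library. It is not how GitHub Secrets works and not a decided design; the function names are hypothetical, and key management (where the key lives and how it is rotated) is deliberately left out.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// encrypt seals plaintext with AES-GCM and prepends the random nonce.
// The key must be 16, 24, or 32 bytes long.
func encrypt(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext to its first argument, so the result is
	// nonce || ciphertext || tag.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt reverses encrypt and fails if the data was tampered with.
func decrypt(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	// For illustration only: a real deployment would load the key from a
	// secret store or KMS rather than generate it at startup.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}

	sealed, err := encrypt(key, []byte("kubeconfig contents"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, sealed)
	fmt.Println(string(plain), err)
}
```

Only the sealed bytes would be written to the DB, so a leaked database alone does not expose the credentials; the key itself still has to be protected separately.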
