116 changes: 116 additions & 0 deletions text/2021-02-26-chaos-engg-as-service.md
@@ -0,0 +1,116 @@
# Chaos Engineering as a Service (CAAS)

## Summary

Build a unified dashboard to manage and monitor experiments for multiple
platforms and multiple clusters.

Related Issue: https://github.com/chaos-mesh/chaos-mesh/issues/1462

## Motivation

The following problems exist with the current chaos-mesh design:

1. Cost of maintenance and change is high, since chaos-daemon and chaosd are
two separate programs that serve a similar purpose.
**The goal is to abstract the common parts of the two into a common library.**
2. Poor observability of experiment results from within the dashboard.
**The goal is to collect metrics with Prometheus and show them in the dashboard.**
Contributor

Maybe collecting the metrics with Prometheus and showing them in the dashboard is just one way to improve observability. 😃

Author

@WangXiangUSTC What do you suggest we could do besides Prometheus integration for this?

Contributor (@WangXiangUSTC, Mar 2, 2021)

Maybe some data from the chaos experiment itself. In fact, I don't have a definite idea yet.

Good examples of metrics from Litmus as mentioned in this blog.

I think we must have metrics for pass/fail/awaited on every chaos operation the client is going to perform. We already have one pass metric, but we can certainly add more around it (for example, per namespace) to make the data more visible and easier to consume.

Contributor

Yeah, I agree with you


## Detailed design

### Architecture redesign

Current architecture: https://chaos-mesh.org/docs/overview/architecture

![New architecture](https://user-images.githubusercontent.com/5793595/106101841-7235d600-6179-11eb-8d57-eadd51ac1e6a.png)

Unlike in the current architecture, the Chaos Dashboard is truly multi-cluster:
it can live outside any cluster and manage multiple chaos controllers,
and even chaosd on physical nodes.

Prometheus is also added to collect node metrics, which gives better
visibility of experiment results in the dashboard.

### Unify chaosd and chaos-daemon

chaosd and chaos-daemon serve similar purposes but target different
types of nodes: physical nodes and Kubernetes worker nodes, respectively.
That is why their communication mechanisms are completely different:

- chaosd lacks server support and so it's only a CLI for now
- chaos-daemon communicates with chaos controller via gRPC

Most of the logic (the code that injects chaos, among other things) can
therefore be moved into a common library shared by chaosd and chaos-daemon.
This makes both components easier to maintain and to change.

Server support needs to be added to chaosd so it listens for authenticated
requests on some port of the host machine.
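
As a rough illustration of the split described above, the sketch below keeps
the fault-injection logic in one shared registry and lets a CLI frontend and a
server frontend build the same request object. Every name in it
(`ChaosRequest`, `register`, `execute`, `from_cli`, `from_rpc`, the
`network-delay` action) is hypothetical and not taken from the chaos-mesh
code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChaosRequest:
    """Transport-independent description of one chaos action (hypothetical)."""
    action: str
    params: Dict[str, str]


# Shared library: maps action names to the functions that would inject the fault.
_ACTIONS: Dict[str, Callable[[Dict[str, str]], str]] = {}


def register(name: str):
    """Decorator that adds an action implementation to the shared registry."""
    def wrap(fn):
        _ACTIONS[name] = fn
        return fn
    return wrap


@register("network-delay")
def network_delay(params: Dict[str, str]) -> str:
    # Placeholder only: a real implementation would configure the delay on the host.
    return f"delay {params['latency']} on {params['device']}"


def execute(req: ChaosRequest) -> str:
    """Single entry point used by every frontend."""
    if req.action not in _ACTIONS:
        raise ValueError(f"unknown chaos action: {req.action}")
    return _ACTIONS[req.action](req.params)


def from_cli(argv: List[str]) -> ChaosRequest:
    """CLI frontend, e.g. ['network-delay', 'latency=100ms', 'device=eth0']."""
    action, *pairs = argv
    return ChaosRequest(action, dict(p.split("=", 1) for p in pairs))


def from_rpc(payload: Dict) -> ChaosRequest:
    """Server frontend: a gRPC or HTTP body already decoded into a dict."""
    return ChaosRequest(payload["action"], dict(payload.get("params", {})))
```

With this shape, adding server support to chaosd would mean adding a second
frontend rather than a second implementation of each chaos action.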

### Authentication & Authorization

#### Chaosd

Chaosd runs on physical nodes outside the Kubernetes cluster, so it is exposed to attacks
from the internet. To prevent misuse, chaosd must accept only authenticated
requests. The simplest secure setup is to use SSL certificates both to encrypt
the request data and to authenticate the caller.
Comment on lines +55 to +58
Contributor

These should be united into one line.

Contributor

There are some similar problems below.

Author

umm.. basically combine to one sentence??

Contributor

to one line


From a communication perspective, the dashboard represents the end user
and so acts as the client, whereas each chaosd instance acts as the server.
The client can be authenticated using the
[SSL Client Authentication](https://aboutssl.org/ssl-tls-client-authentication-how-does-it-works/)
technique.
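
A minimal sketch of what client-certificate authentication could look like
with Python's standard `ssl` module. The function names and all file-path
parameters are assumptions made for illustration; the proposal does not
specify how chaosd or the dashboard would load their certificates.

```python
import ssl
from typing import Optional


def chaosd_server_context(
    server_cert: Optional[str] = None,
    server_key: Optional[str] = None,
    trusted_client_certs: Optional[str] = None,
) -> ssl.SSLContext:
    """TLS context for the chaosd side: rejects clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED makes the handshake fail unless the client presents a
    # certificate that chains to one of the trusted certificates loaded below.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if server_cert and server_key:
        ctx.load_cert_chain(certfile=server_cert, keyfile=server_key)
    if trusted_client_certs:
        ctx.load_verify_locations(cafile=trusted_client_certs)
    return ctx


def dashboard_client_context(
    client_cert: Optional[str] = None,
    client_key: Optional[str] = None,
    chaosd_ca: Optional[str] = None,
) -> ssl.SSLContext:
    """TLS context for the dashboard side: verifies chaosd and presents its own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=chaosd_ca)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert and client_key:
        # The private key stays with the dashboard and is never sent to chaosd.
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

The file arguments are optional here only so the contexts can be built without
real certificates; an actual deployment would always supply them.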

In this setup, the private key of the client certificate is generated and kept with the
dashboard, and the corresponding public key is stored on the chaosd nodes. On every request,
chaosd first verifies the digital signature presented by the client to
authenticate the request.
Comment on lines +66 to +69
Contributor

I have a question: the private key is kept by the client (the dashboard). Doesn't that look strange?

Author

For authentication using certificates, the requesting entity must hold the private key. Since the dashboard will be calling the HTTP endpoints of chaosd, it will hold the private key.


#### Chaos Mesh

By default, requests to Chaos Mesh are authenticated using the provided Kubernetes token.
If needed, requests could be further protected using SSL certificates.

#### Dashboard

In the dashboard, basic username/password authentication can be
implemented, with user data stored in a DB. To implement RBAC
(role-based access control), **roles** can be defined, each comprising the
permissions allowed for that role. Users and roles are related many-to-many,
i.e. a user can have many roles and a role can belong to many users.
Only a user with admin privileges can add or edit users and roles.

To give a role access to particular chaos nodes (whether physical or Kubernetes),
the admin can permit the role to access nodes that carry a particular tag,
which is set in the dashboard.
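
The role and tag model above can be sketched as follows. This is only an
illustration under assumed names: `Role`, `User`, `is_allowed`, and the
permission strings such as `experiment:create` are hypothetical, and a real
dashboard would load users and roles from its DB rather than from in-memory
objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Role:
    name: str
    permissions: Set[str] = field(default_factory=set)  # e.g. {"experiment:create"}
    node_tags: Set[str] = field(default_factory=set)    # tags this role may act on


@dataclass
class User:
    name: str
    roles: Set[str] = field(default_factory=set)        # role names (many-to-many)


def is_allowed(user: User, roles: Dict[str, Role],
               permission: str, node_tags: Set[str]) -> bool:
    """True if any of the user's roles grants the permission on a node with these tags."""
    for role_name in user.roles:
        role = roles.get(role_name)
        if role is None:
            continue  # the user references a role that no longer exists
        if permission in role.permissions and role.node_tags & node_tags:
            return True
    return False


roles = {
    "staging-operator": Role("staging-operator",
                             {"experiment:create", "experiment:view"},
                             {"staging"}),
    "viewer": Role("viewer", {"experiment:view"}, {"staging", "prod"}),
}
alice = User("alice", {"staging-operator", "viewer"})

can_create_staging = is_allowed(alice, roles, "experiment:create", {"staging", "db"})
can_create_prod = is_allowed(alice, roles, "experiment:create", {"prod"})
can_view_prod = is_allowed(alice, roles, "experiment:view", {"prod"})
```

Here `alice` can create experiments on nodes tagged `staging` but only view
those tagged `prod`, because the permission and the tag must come from the
same role.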

### Web Dashboard

With this new dashboard, chaos-mesh will be one step closer to
making **Chaos Engineering as a Service** possible. End users can manage
multiple node groups (both Kubernetes and physical) from within this dashboard,
adding and removing cluster configurations from the UI.

For physical nodes, a URL pointing to the chaosd server needs to be provided,
along with authentication credentials. For Kubernetes nodes, the user
needs to provide the kubeconfig of the cluster.
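
One possible shape for the two kinds of node-group configuration is sketched
below. The class and field names are hypothetical, and requiring an `https`
URL is an assumption that follows from the certificate-based setup proposed
for chaosd, not something the proposal states.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PhysicalNodeGroup:
    """A chaosd server plus the credentials the dashboard uses to call it."""
    name: str
    chaosd_url: str
    client_cert_path: str
    client_key_path: str

    def validate(self) -> None:
        parsed = urlparse(self.chaosd_url)
        # Assumption: chaosd is only reachable over TLS.
        if parsed.scheme != "https" or not parsed.hostname:
            raise ValueError(f"{self.name}: chaosd URL must be an https URL")
        if not self.client_cert_path or not self.client_key_path:
            raise ValueError(f"{self.name}: client certificate and key are required")


@dataclass
class KubernetesNodeGroup:
    """A Kubernetes cluster identified by the kubeconfig the user uploads."""
    name: str
    kubeconfig: str  # raw kubeconfig content

    def validate(self) -> None:
        if not self.kubeconfig.strip():
            raise ValueError(f"{self.name}: kubeconfig must not be empty")


def validation_error(group) -> str:
    """Return the validation message for a group, or '' if it is valid."""
    try:
        group.validate()
        return ""
    except ValueError as exc:
        return str(exc)
```

Validating at registration time lets the UI reject a misconfigured group
before any experiment is scheduled against it.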

The dashboard will also collect Prometheus metrics, giving better visibility of
each experiment from within the dashboard itself.
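
To make the metrics idea concrete, here is a small sketch that counts
experiment outcomes per namespace and renders them in the Prometheus text
exposition format, using only the standard library. The metric name
`chaos_experiments_total`, its labels, and the `ExperimentMetrics` class are
hypothetical; the pass/fail/awaited breakdown follows the suggestion made in
the review discussion.

```python
from collections import Counter

ALLOWED_RESULTS = ("pass", "fail", "awaited")


class ExperimentMetrics:
    METRIC = "chaos_experiments_total"  # hypothetical metric name

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, namespace: str, result: str) -> None:
        """Count one experiment outcome for the given namespace."""
        if result not in ALLOWED_RESULTS:
            raise ValueError(f"unknown result: {result}")
        self._counts[(namespace, result)] += 1

    def render(self) -> str:
        """Render all counters as Prometheus text exposition lines."""
        # Assumes label values contain no quotes, backslashes, or newlines;
        # those would need escaping in a real exporter.
        lines = [
            f"# HELP {self.METRIC} Chaos experiments by namespace and result.",
            f"# TYPE {self.METRIC} counter",
        ]
        for (namespace, result), count in sorted(self._counts.items()):
            lines.append(
                f'{self.METRIC}{{namespace="{namespace}",result="{result}"}} {count}'
            )
        return "\n".join(lines) + "\n"


metrics = ExperimentMetrics()
metrics.record("default", "pass")
metrics.record("default", "pass")
metrics.record("payments", "fail")
exposition = metrics.render()
```

A Prometheus server could scrape this text from an HTTP endpoint, and the
dashboard could then query the per-namespace counts instead of tracking them
itself.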

## Drawbacks

1. The dashboard will store both authentication credentials and
kubeconfigs in the DB, which is a security risk unless they are stored
and handled securely.

## Alternatives

NA

## Unresolved questions

1. How should auth credentials be stored securely in the dashboard?
(could refer to GitHub Secrets)