Commit 0b72d1e

adding doc for 2.9.0 (#187)
Signed-off-by: Adarshkumar14 <[email protected]>
1 parent ceeb413 commit 0b72d1e

280 files changed

+10342
-0
lines changed

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
---
2+
id: architecture-summary
3+
title: Architecture Summary
4+
sidebar_label: Architecture Summary
5+
---
6+
7+
---
8+
<img src={require("../assets/architecture-summary.png").default} alt="Architecture Overview" />
9+
10+
The Litmus architecture can be segregated into two parts:
11+
12+
1. **Control Plane:** Contains the components required for the functioning of Chaos Center, the web-based portal for Litmus.
13+
14+
2. **Execution Plane:** Contains the components required for the injection of chaos in the target resources.
15+
16+
Chaos Center is used to create, schedule, and monitor Chaos Workflows. A Chaos Workflow is a set of chaos experiments arranged in a defined sequence to achieve the desired chaos impact on the target resources upon execution. Users log in to the Chaos Center with valid login credentials and use the interactive web UI to define a chaos workflow that targets multiple aspects of their infrastructure. Once the user creates a Chaos Workflow in the Chaos Center, it is passed on to the Execution Plane. The Execution Plane resides either in the host cluster containing the Control Plane, if the self agent is being used, or in the target cluster, if an external agent is being used. The Execution Plane interprets the Chaos Workflow as a list of steps required for injecting chaos into the target resources. It ensures efficient orchestration of chaos in cloud-native environments using various Kubernetes CRs. Once the Chaos Workflow has executed, the Execution Plane sends the chaos results to the Control Plane for post-processing, using either the built-in monitoring dashboard of Litmus or external observability tools such as Prometheus DB and Grafana dashboards. Litmus also supports automated Chaos Workflow runs through GitOps, so that chaos can be executed as part of a CI/CD pipeline based on a set of defined conditions.
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
---
2+
id: chaos-control-plane
3+
title: Chaos Control Plane
4+
sidebar_label: Chaos Control Plane
5+
---
6+
7+
---
8+
9+
<img src={require("../assets/chaos-control-plane.png").default} alt="Chaos Control Plane" />
10+
11+
Chaos Control Plane consists of micro-services responsible for the functioning of the Chaos Center, the web-based portal that, alongside the CLI, is used to interact with Litmus. The Chaos Control Plane facilitates the creation and scheduling of chaos workflows, system observability during chaos, and post-processing and analysis of experiment results.
12+
13+
## Chaos Control Plane Components
14+
15+
* **Authentication Server:** A Golang micro-service responsible for authenticating and authorizing the requests received from Chaos Center and for managing users along with their projects. It primarily handles user creation, user login, password resets, user information updates, project creation, and other project-related operations.
16+
17+
* **Backend Server:** A GraphQL-based Golang micro-service that serves the requests received from Chaos Center, either by querying the database for the relevant information or by fetching information from the Execution Plane.
18+
19+
* **Database:** A NoSQL MongoDB database micro-service responsible for storing users' information, past workflows, saved workflow templates, user projects, ChaosHubs, and GitOps details, among other information.
20+
21+
* **Chaos Center:** Refers to the interfaces used by Litmus for the creation and scheduling of chaos workflows, system observability during chaos injection, and post-chaos result analysis. It includes:
22+
23+
  * **Web UI:** A React.js-based frontend application micro-service with built-in system observability capabilities and an analytics dashboard. It also lets teams of users collaborate on chaos workflows using role-based user accounts.
24+
25+
  * **Litmusctl:** A command-line tool for managing Litmus Agent Infrastructure components. It can be used to create agents and projects, and to manage multiple Litmus accounts.
26+
27+
  * **Litmus API:** Refers to two different Litmus APIs, namely Litmus Authentication API and Litmus Portal API:
28+
29+
    * **Litmus Authentication API:** Used to authenticate the identity of a user and to perform several user-specific and project-specific tasks, such as creating new users, updating profiles, updating passwords, creating projects, inviting users to a project, and getting project details. It uses the Authentication Server to perform these tasks.
30+
31+
    * **Litmus Portal API:** Provides the command-line and UI experience for managing and monitoring the events around chaos workflows. It uses the Backend Server to perform its functions.
32+
33+
## Standard Chaos Control Plane Flow
34+
35+
1. The user logs in to the ChaosCenter using valid login credentials. A default project is created for the user on initial login. Every user is a part of a project and has a role assigned to them. To schedule a workflow, the user needs to have an Editor or Owner role assigned in the project.
36+
2. The user uploads a Chaos Workflow manifest using the ChaosCenter, which is received by the Backend Server.
37+
3. Backend Server stores the manifest in the Database and also sends it to the Chaos Agent.
38+
4. Chaos Agent uses the Chaos Workflow manifest to inject chaos into the target resources. The steps of the Chaos Workflow execution can be visualized using the ChaosCenter.
39+
5. Chaos Agent returns the results of the chaos experiments that were a part of the workflow back to the Backend Server, along with the experiment logs.
40+
6. Backend Server then sends the chaos experiment results and logs to the ChaosCenter. It also stores the results in the Database for generating post-chaos workflow statistics and information.
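
Steps 1 to 3 of the flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Backend Server code: `BackendServer`, `schedule_workflow`, the in-memory `database` and `agent_queue` lists, and the `Viewer` role are all hypothetical names introduced for the example. Only the Editor and Owner roles come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    VIEWER = "Viewer"  # assumed extra role; the text only names Editor and Owner
    EDITOR = "Editor"
    OWNER = "Owner"


@dataclass
class BackendServer:
    """Hypothetical in-memory stand-in for the Backend Server."""

    database: list = field(default_factory=list)     # manifests stored in the Database
    agent_queue: list = field(default_factory=list)  # manifests sent to the Chaos Agent

    def schedule_workflow(self, role: Role, manifest: str) -> bool:
        # Step 1: only an Editor or Owner may schedule a workflow.
        if role not in (Role.EDITOR, Role.OWNER):
            return False
        # Step 3: store the manifest, then send it to the Chaos Agent.
        self.database.append(manifest)
        self.agent_queue.append(manifest)
        return True
```

A request from a user without a scheduling role is rejected before anything is stored or forwarded, so the Database and the Chaos Agent never see it.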
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
---
2+
id: chaos-execution-plane
3+
title: Chaos Execution Plane
4+
sidebar_label: Chaos Execution Plane
5+
---
6+
7+
---
8+
9+
<img src={require("../assets/chaos-execution-plane.png").default} alt="Chaos Execution Plane" />
10+
11+
Chaos Execution Plane contains the components responsible for orchestrating the chaos injection in the target resources. They are installed either in an external target cluster, if an external agent is being used, or in the host cluster containing the control plane, if a self-agent is being used. The Execution Plane can be further divided into Litmus Agent Infrastructure components and Litmus Backend Execution Infrastructure components.
12+
13+
## Litmus Execution Plane Components
14+
15+
Litmus Agent Infrastructure components help facilitate the chaos injection, manage chaos observability, and enable chaos automation for target resources. These components include:
16+
17+
1. **Workflow Controller:** The Argo Workflow Controller responsible for the creation of Chaos Workflows using the Chaos Workflow CR.
18+
19+
2. **Subscriber:** Serves as the link between the Chaos Execution Plane and the Control Plane. It has a few distinct responsibilities: performing health checks on all the components in the Chaos Execution Plane, creating a Chaos Workflow CR from a Chaos Workflow template, watching for Chaos Workflow events during execution, and sending the chaos workflow result to the Control Plane.
20+
21+
3. **Event Tracker:** An optional component capable of triggering automated chaos workflow runs based on a set of defined conditions for any given resource in the cluster. It is a controller that manages the EventTrackerPolicy CR, which holds the set of defined conditions that the Event Tracker validates. If the current state of the tracked resources matches the state defined in the EventTrackerPolicy CR, the workflow run is triggered. This feature can only be used if GitOps is enabled.
22+
23+
4. **Chaos Exporter:** An optional component that facilitates external observability in Litmus by exporting the chaos metrics generated during the chaos injection as time-series data to the Prometheus DB for processing and analysis.
24+
25+
26+
Litmus Backend Execution Infrastructure components orchestrate the execution of Chaos Workflow in target resources. These components include:
27+
28+
1. **Chaos Workflow CR:** Refers to the Argo Workflow CR, which describes the steps that are executed as a part of the chaos workflow. It is used to define failures during a certain workload condition (such as a given percentage load), multiple (parallel) failures of dependent and independent services, and so on.
29+
30+
2. **ChaosExperiment CR:** Used for defining the low-level execution information for any Litmus chaos experiment as well as to store the various experiment tunables.
31+
32+
3. **ChaosEngine CR:** Used to hold information about how the chaos experiments are executed. It connects an application instance with one or more chaos experiments while allowing the users to specify run-level details.
33+
34+
4. **Chaos Operator:** A Kubernetes custom controller that manages the lifecycle of certain resources or applications with the intent of validating their "desired state". It reconciles the state of the ChaosEngine by performing specific actions upon CRUD of the ChaosEngine. It also defines a secondary resource (the ChaosEngine Runner pod), which it creates and manages to implement the reconcile functions.
35+
36+
<div style={{textAlign: 'center'}}>
37+
<img src={require("../assets/chaos-execution-plane-chaos-operator.png").default} alt="Chaos Operator" />
38+
</div>
39+
40+
5. **ChaosResult CR:** Holds the results of a chaos experiment, such as the ChaosEngine reference, the experiment state, the verdict of the experiment (on completion), and salient application/result attributes. It also acts as a source for metrics collection for observability.
41+
42+
6. **Chaos Runner:** Acts as a bridge between the Chaos Operator and Chaos Experiments. It is a lifecycle manager for the chaos experiments that creates Experiment Jobs for the execution of experiment business logic and monitors the experiment pods (jobs) until completion.
43+
44+
<div style={{textAlign: 'center'}}>
45+
<img src={require("../assets/chaos-execution-plane-chaos-runner.png").default} alt="Chaos Runner" />
46+
</div>
47+
48+
7. **Experiment Jobs:** Refers to the pods that execute the experiment logic. One experiment pod is created per chaos experiment in the workflow.
49+
50+
## Standard Chaos Execution Plane Flow
51+
52+
1. Subscriber receives the Chaos Workflow manifest from the Control Plane and applies the manifest to create a Chaos Workflow CR.
53+
2. Chaos Workflow CRs are tracked by the Argo Workflow Controller. When the Workflow Controller finds a new Chaos Workflow CR, it creates the ChaosExperiment CRs and the ChaosEngine CRs for the chaos experiments that are a part of the workflow.
54+
3. ChaosEngine CRs are tracked by the Chaos Operator. Once a ChaosEngine CR is ready, the Chaos Operator updates the ChaosEngine state to reflect that the particular ChaosEngine is now being executed.
55+
4. For each ChaosEngine resource, a Chaos Runner is created by the Chaos Operator.
56+
5. Chaos Runner first reads the chaos parameters from the ChaosExperiment CR and overrides them with values from the ChaosEngine CR. It then constructs the Experiment Jobs and monitors them until their completion.
57+
6. Experiment Jobs execute the experiment business logic and undertake chaos injection on target resources. Once done, the ChaosResult is updated with the experiment verdict.
58+
7. Chaos Runner then fetches the updated ChaosResult and updates the ChaosEngine status as well as the verdict.
59+
8. Once the ChaosEngine is updated, Subscriber fetches the ChaosEngine details and the ChaosResult and forwards them to Chaos Control Plane.
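
The override behavior in step 5 can be illustrated with a small sketch. It assumes the tunables can be modeled as flat key-value maps. `resolve_tunables` is a hypothetical helper, not part of the Chaos Runner, and the parameter names and values are illustrative only.

```python
def resolve_tunables(experiment_defaults: dict, engine_overrides: dict) -> dict:
    """Start from the ChaosExperiment defaults and let ChaosEngine values win."""
    resolved = dict(experiment_defaults)  # copy so the defaults stay untouched
    resolved.update(engine_overrides)     # ChaosEngine values take precedence
    return resolved


# Illustrative values only.
defaults = {"TOTAL_CHAOS_DURATION": "15", "CHAOS_INTERVAL": "5", "FORCE": "false"}
overrides = {"TOTAL_CHAOS_DURATION": "60"}
resolved = resolve_tunables(defaults, overrides)
```

Here `resolved` keeps `CHAOS_INTERVAL` and `FORCE` from the experiment defaults, while `TOTAL_CHAOS_DURATION` takes the value supplied by the engine.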
60+
61+
It is worth noting that:
62+
- If configured, Chaos Exporter fetches data from the ChaosResult CR and converts it into a time-series format to be consumed by the Prometheus DB.
63+
64+
- An Event Tracker Policy can also be set up as part of the Backend GitOps, where the Backend GitOps Controller tracks a set of specified resources in the target cluster for any change. If any of the tracked resources changes and its resulting state matches the state defined in the Event Tracker Policy, then a pre-defined Chaos Workflow is executed.
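
The policy check described in the last point can be sketched as a simple state comparison. This is a simplified model under stated assumptions: `Condition`, `policy_matches`, and `on_resource_change` are hypothetical names, and the real EventTrackerPolicy CR schema is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Condition:
    """One hypothetical policy condition: a resource field and its expected value."""

    key: str
    expected: str


def policy_matches(state: Dict[str, str], conditions: List[Condition]) -> bool:
    # The policy matches only when every condition holds for the resource state.
    return all(state.get(c.key) == c.expected for c in conditions)


def on_resource_change(
    state: Dict[str, str],
    conditions: List[Condition],
    trigger_workflow: Callable[[], None],
) -> bool:
    # Called after a tracked resource changes; runs the pre-defined workflow on a match.
    if policy_matches(state, conditions):
        trigger_workflow()
        return True
    return False
```

A change that leaves any condition unsatisfied does not trigger a run, so the workflow executes only when the resulting state matches the whole policy.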
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
---
2+
id: chaos-experiment-flow
3+
title: Chaos Experiment Flow
4+
sidebar_label: Chaos Experiment Flow
5+
---
6+
7+
---
8+
9+
<img src={require("../assets/experiment-flow.png").default} alt="Chaos Experiment Flow" />
10+
11+
The experiment execution is triggered upon the creation of a ChaosEngine resource. The ChaosEngine resource interacts with the Chaos Runner, which is created by the Chaos Operator. The Chaos Runner creates Experiment Jobs that execute the experiment business logic. Typically, ChaosEngines are embedded within the 'steps' of a Litmus Chaos Workflow. However, a ChaosEngine can also be created and applied manually, in which case the Chaos Operator reconciles the resource and triggers the experiment execution. Chaos experiments are classified as:
12+
13+
- Kubernetes Experiments
14+
- Pod-Level Chaos
15+
- Node-Level Chaos
16+
- Application Chaos
17+
- Cloud Infrastructure
18+
19+
## Chaos Experiment Flow Steps
20+
21+
1. Chaos experiment execution gets triggered by the Experiment Job.
22+
2. Experiment tunables and low-level execution details are fetched.
23+
3. ChaosResult is initialized and its verdict is set to "Awaited" to indicate that the experiment is currently running.
24+
4. The steady-state condition for the respective experiment is validated. If the condition is not met, the experiment execution is stopped and the ChaosResult verdict is updated to "Fail".
25+
5. Once the steady-state condition is validated, experiment resources are created to facilitate the chaos injection.
26+
6. Chaos injection is performed on the target resources for the specified chaos duration.
27+
7. Chaos injection gets reverted.
28+
8. A post-chaos status check is performed to ensure that the steady state is still maintained.
29+
9. If the check fails, the ChaosEngine and ChaosResult verdicts are updated to "Fail"; otherwise, they are updated to "Pass".
30+
10. Experiment execution ends.
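
The verdict transitions in the steps above can be summarized in a short sketch. It assumes the checks and chaos actions can be modeled as plain callables. `run_experiment` and its parameters are hypothetical and do not reflect the real experiment code.

```python
from typing import Callable, List


def run_experiment(
    pre_check: Callable[[], bool],
    inject: Callable[[], None],
    revert: Callable[[], None],
    post_check: Callable[[], bool],
) -> List[str]:
    """Return the sequence of ChaosResult verdicts for one experiment run."""
    verdicts = ["Awaited"]   # step 3: the experiment is running
    if not pre_check():      # step 4: steady-state validation before chaos
        verdicts.append("Fail")
        return verdicts
    inject()                 # steps 5-6: create resources and inject chaos
    revert()                 # step 7: revert the chaos injection
    # Steps 8-9: the post-chaos check decides the final verdict.
    verdicts.append("Pass" if post_check() else "Fail")
    return verdicts
```

A failed pre-chaos check ends the run before any chaos is injected, which is why `inject` and `revert` are never called on that path.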
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
---
2+
id: chaos-observability-flow-analytics
3+
title: Analytics
4+
sidebar_label: Analytics
5+
---
6+
7+
---
8+
9+
<img src={require("../assets/chaos-observability-flow-analytics.png").default} alt="Chaos Observability Flow Analytics" />
10+
11+
Analytics is an integral part of Chaos Engineering, as it offers key insights that are required to fully understand a system during the chaos and functions as a decision-making tool for improving system resiliency.
12+
13+
In Litmus, workflow run statistics and information are generated after the chaos workflow execution and can be accessed directly from the ChaosCenter.
14+
15+
## Observability Flow for Analytics
16+
1. In the Chaos Execution Plane, the ChaosEngine Details and ChaosResult are fetched by the Chaos Agent.
17+
2. Chaos Agent then forwards them to the Backend Server in the Chaos Control Plane, where they are stored in the Database.
18+
3. In the ChaosCenter, the user specifies the Chaos Workflow Schedule for which the workflow statistics and information are to be fetched.
19+
4. The request for the workflow statistics and information is received by the Backend Server.
20+
5. Backend Server queries the Database for the details of past Workflow Runs.
21+
6. Aggregated workflow statistics based on the ChaosResult verdict and probe success percentage are fetched from the Database by Backend Server.
22+
7. Workflow statistics and information are forwarded to ChaosCenter by Backend Server.
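
The aggregation in step 6 can be sketched as follows, assuming each stored workflow run is a record with a verdict and a probe success percentage. `summarize_runs` and the field names are hypothetical and do not reflect the actual Database schema.

```python
from collections import Counter
from typing import Dict, List


def summarize_runs(runs: List[Dict]) -> Dict:
    """Aggregate past workflow runs by verdict and average probe success."""
    verdicts = Counter(run["verdict"] for run in runs)
    total = len(runs)
    # Avoid dividing by zero when no runs have been recorded yet.
    average_probe_success = (
        sum(run["probe_success_percentage"] for run in runs) / total if total else 0.0
    )
    return {
        "total_runs": total,
        "passed": verdicts["Pass"],
        "failed": verdicts["Fail"],
        "average_probe_success": average_probe_success,
    }
```

`Counter` returns 0 for a verdict that never occurred, so the summary stays well-formed even when every run passed or every run failed.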
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
---
2+
id: chaos-observability-flow-logging
3+
title: Logging
4+
sidebar_label: Logging
5+
---
6+
7+
---
8+
9+
<img src={require("../assets/chaos-observability-flow-logging.png").default} alt="Chaos Observability Flow Logging" />
10+
11+
Logging is a pivotal observability aspect in LitmusChaos, as it allows the user to track the exact system behavior during chaos. The logs can be classified into one of the following:
12+
13+
- **Litmus Checker Logs:** Logs generated as part of the validation for chaos resources that are required to execute a chaos experiment.
14+
- **Experiment Logs:** Logs generated as part of the steps performed during the chaos experiment, including pre-chaos check logs, chaos injection logs, chaos probes logs, and post-chaos check logs.
15+
- **Non-Chaos Workflow Step Logs:** Logs generated as part of the workflow steps that facilitate the execution of the chaos experiment, such as chaos experiment installation step logs, chaos revert step logs, etc.
16+
17+
## Observability Flow for Logging
18+
1. User requests the logs for any particular workflow step using the ChaosCenter.
19+
2. The request for the logs is received by the Backend Server and is forwarded to the Subscriber.
20+
3. The subscriber checks if the workflow step is a Chaos Experiment step or not.
21+
4. If the workflow step is a Chaos Experiment step, the subscriber fetches the Litmus Checker logs and the Chaos Experiment logs from the ChaosEngine CR. Otherwise, the subscriber fetches the logs from the respective workflow step pod.
22+
5. The fetched logs are returned to the Backend Server, which returns them to the ChaosCenter.
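
The branching in steps 3 and 4 can be sketched as below. The dictionaries stand in for the ChaosEngine CR and the workflow step pods. `fetch_logs` and all field names are hypothetical and serve only to show the routing decision.

```python
from typing import Dict, List


def fetch_logs(step: Dict, chaos_engine_logs: Dict, pod_logs: Dict) -> List[str]:
    """Pick the log source for a workflow step, as the subscriber would."""
    if step["is_chaos_experiment"]:
        # Chaos Experiment step: Litmus Checker logs followed by experiment logs.
        entry = chaos_engine_logs[step["name"]]
        return entry["litmus_checker"] + entry["experiment"]
    # Any other step: logs come from the pod that ran the step.
    return pod_logs[step["pod"]]
```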
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
---
2+
id: chaos-observability-flow-monitoring
3+
title: Monitoring
4+
sidebar_label: Monitoring
5+
---
6+
7+
---
8+
9+
Monitoring is key to effectively watching and understanding the state of a system. In Litmus, monitoring takes a two-fold approach:
10+
- Metrics Monitoring
11+
- Events Monitoring
12+
13+
## Metrics Monitoring
14+
15+
<div style={{textAlign: 'center'}}>
16+
<img src={require("../assets/chaos-observability-flow-metrics.png").default} alt="Chaos Observability Flow Metrics" />
17+
</div>
18+
19+
Metrics Monitoring enables users to monitor the chaos metrics generated during chaos injection, which are exported by the Chaos Exporter to be consumed as time-series information.
20+
21+
### Observability Flow for Metrics Monitoring
22+
1. During the event of chaos, the ChaosResult CR and the ChaosEngine CR are fetched by the Chaos Exporter.
23+
2. If the ChaosResult verdict is "Awaited", the Continuous Event Metrics are fetched by the Chaos Exporter. Otherwise, if the verdict is "Pass", "Fail", or "Stopped", the Gauge Metrics are fetched by the Chaos Exporter at a fixed TSDB Scrape Interval.
24+
3. The fetched metrics are then exposed at the Chaos Exporter Kubernetes service.
25+
4. TSDBs consume these metrics and store them as time-series values.
26+
5. APMs and Visualisation Tools query and fetch the chaos metrics from the TSDBs.
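
The verdict-based selection in step 2 can be expressed as a small lookup. This is a sketch only: `metrics_kind` and the returned labels are hypothetical and do not correspond to names in the Chaos Exporter.

```python
def metrics_kind(verdict: str) -> str:
    """Map a ChaosResult verdict to the kind of metrics that gets fetched."""
    if verdict == "Awaited":
        # The experiment is still running: continuous event metrics.
        return "continuous-event"
    if verdict in {"Pass", "Fail", "Stopped"}:
        # The experiment has finished: gauge metrics at the scrape interval.
        return "gauge"
    raise ValueError(f"unknown ChaosResult verdict: {verdict!r}")
```

Raising on an unrecognized verdict is a choice made for this sketch; the text does not say how other values are handled.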
27+
28+
## Events Monitoring
29+
30+
<div style={{textAlign: 'center'}}>
31+
<img src={require("../assets/chaos-observability-flow-events.png").default} alt="Chaos Observability Flow Events" />
32+
</div>
33+
34+
Events Monitoring enables users to monitor the Kubernetes events that are created as part of the orchestration of chaos injection by Litmus. These events occur at different points during the execution of a chaos experiment.
35+
36+
- Like any other Kubernetes events, these events are stored in etcd.
37+
- The ChaosEngine CR events are initiated by the Operator or the Chaos Runner or the Chaos Experiment itself.
38+
- The ChaosResult CR events are initiated by the Chaos Experiment itself.
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
---
2+
id: chaos-observability-flow-overview
3+
title: Overview
4+
sidebar_label: Overview
5+
---
6+
7+
---
8+
9+
Observability in Litmus serves a two-fold cause:
10+
11+
1. To provide the right hooks to APM platforms to enable visualization and understanding of the behavior of applications/microservices under chaotic conditions.
12+
13+
2. To gather, record, and factor in data provided by standard observability frameworks as part of SLO validation in automated chaos experiment runs, the results of which can be stored and analyzed as experiment “verdicts” or “metadata”.
14+
15+
Chaos Observability in Litmus can be sectioned into the following:
16+
1. **[Visualising Chaos Workflow (Visualization)](chaos-observability-flow-visualization.md)**
17+
   - Workflow Execution Graph
18+
2. **[Fetching Logs (Logging)](chaos-observability-flow-logging.md)**
19+
   - Litmus Checker Logs
20+
   - Experiment Logs
21+
   - Non-Chaos Workflow Step Logs
22+
3. **[Monitoring Systems in Real Time During Chaos (Monitoring)](chaos-observability-flow-monitoring.md)**
23+
   - Metrics
24+
   - Events
25+
4. **[Viewing Experiment Verdict and Summary (Summarisation)](chaos-observability-flow-summarisation.md)**
26+
   - Chaos Result
27+
5. **[Post-Chaos Workflow Analytics (Analytics)](chaos-observability-flow-analytics.md)**
28+
   - Workflow Statistics and Information
