|
11 | 11 | - [Contributing](#contributing)
|
12 | 12 | - [Building](#building)
|
13 | 13 | - [Adding a new investigation](#adding-a-new-investigation)
|
| 14 | + - [Graduating an investigation](#graduating-an-investigation) |
14 | 15 | - [Testing locally](#testing-locally)
|
15 | 16 | - [Pre-requirements](#pre-requirements)
|
16 | 17 | - [Running cadctl for an incident ID](#running-cadctl-for-an-incident-id)
|
|
19 | 20 | - [Integrations](#integrations)
|
20 | 21 | - [Templates](#templates)
|
21 | 22 | - [Dashboards](#dashboards)
|
22 | | - - [Deployment](#deployment) |
23 | 23 | - [Boilerplate](#boilerplate)
|
24 | 24 | - [PipelinePruner](#pipelinepruner)
|
25 | 25 | - [Required ENV variables](#required-env-variables)
|
@@ -71,6 +71,29 @@ To add a new alert investigation:
|
71 | 71 | - `investigation.Resources` contains initialized clients for the cluster's AWS environment, OCM, and more. See [Integrations](#integrations)
|
72 | 72 | - Add test objects or scripts used to recreate the alert symptoms to the `pkg/investigations/$INVESTIGATION_NAME/testing/` directory for future use. Be sure to clearly document the testing procedure under the `Testing` section of the investigation-specific README.md file
|
73 | 73 |
|
| 74 | +### Graduating an investigation |
| 75 | + |
| 76 | +New investigations and their remediation steps should be rolled out progressively, graduating through the following stages. |
| 77 | + |
| 78 | +1. **Informing Stage (Read-only):** |
| 79 | + At this stage the investigation is purely informative: it reports through PagerDuty and _**does not perform any write operations**_. Notes are collected throughout the investigation and posted to PagerDuty when it concludes. |
| 80 | + |
| 81 | + **Aim:** Validating the investigation's accuracy and usefulness **without performing any write actions**. |
| 82 | + |
| 83 | + **Validation Criteria:** |
| 84 | + * The investigation successfully carries out each step on its respective incident type, on both staging and production environments. |
| 85 | + * It provides useful information (equivalent to a manual investigation) to SREs through PagerDuty. |
| 86 | + * The investigation should be accompanied by unit tests and/or step-by-step manual tests in the investigation's testing README, including: |
| 87 | + * A clear step-by-step process to manually test the investigation (e.g. cluster setup, other expected conditions). |
| 88 | + |
| 89 | +2. **Actioning Stage (Read/Write):** |
| 90 | + The investigation's remediation capabilities, including **read and write** operations, are performed on all applicable clusters. |
| 91 | + |
| 92 | + **Validation Criteria:** |
| 93 | + * The investigation is verified to conduct remediations on staging as expected. |
| 94 | + * The investigation should be locally tested in staging against a live alert. |
| 95 | + * E2E testing is desired for actioning investigations; the tests should cover the execution of remediative steps as well as verification of their effectiveness. |
| 96 | + |
74 | 97 | ### Integrations
|
75 | 98 |
|
76 | 99 | > **Note:** When writing an investigation, you can use them right away.
|
@@ -180,12 +203,6 @@ Investigation specific documentation can be found in the according investigation
|
180 | 203 |
|
181 | 204 | Grafana dashboard configmaps are stored in the [Dashboards](./dashboards/) directory. See app-interface for further documentation on dashboards.
|
182 | 205 |
|
183 | | -### Deployment |
184 | | - |
185 | | -* [Tekton](./deploy/README.md) -- Installation/configuration of Tekton and triggering pipeline runs. |
186 | | -* [Skip Webhooks](./deploy/skip-webhook/README.md) -- Skipping the eventlistener and creating the pipelinerun directly. |
187 | | -* [Namespace](./deploy/namespace/README.md) -- Allowing the code to ignore the namespace. |
188 | | - |
189 | 206 | ### Boilerplate
|
190 | 207 |
|
191 | 208 | * [Boilerplate](./boilerplate/openshift/osd-container-image/README.md) -- Conventions for OSD containers.
|
@@ -223,4 +240,4 @@ For Red Hat employees, these environment variables can be found in the SRE-P vau
|
223 | 240 |
|
224 | 241 | - `LOG_LEVEL`: refers to the CAD log level, if not set, the default is `info`. See
|
225 | 242 |
|
226 | | -- `CAD_HCM_AI_TOKEN`: required for requests to the ai model |
| 243 | +- `CAD_HCM_AI_TOKEN`: required for requests to the ai model |