`investigation` is a repository of SRE troubleshooting artifacts, created for use with Datadog as the observability platform.
Mermaid diagram of the steps in the investigation process:
```mermaid
flowchart LR;
J((("Start of Investigation")));
A(["Problem Statement"]);
B["Observation"];
D{"Validate Problem Statement"};
C["Analysis"];
E["Troubleshooting"];
F{"Adjust Problem Statement"};
H(["Solution Statement"]);
I((("End of Investigation")));
G{"Problem Solved"};
J-->A;
A-->B;
B-->C;
C-->D;
D-->|Accept| E;
E-->G;
G-->|Yes| H;
H-->I;
D-.->|Reject| F;
F-.->|Yes| A;
F-.->|No| I;
G-.->|No| C;
```
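The flowchart above can also be read as a small state machine. The following is a minimal sketch, not part of this repo's tooling: the `TRANSITIONS` table copies the node labels and edge labels from the diagram, while the helper names `next_step` and `walk` are assumptions made for illustration.

```python
# Transition table mirroring the flowchart: (step, outcome) -> next step.
# The outcome is None for steps that have a single outgoing edge.
TRANSITIONS = {
    ("Start of Investigation", None): "Problem Statement",
    ("Problem Statement", None): "Observation",
    ("Observation", None): "Analysis",
    ("Analysis", None): "Validate Problem Statement",
    ("Validate Problem Statement", "Accept"): "Troubleshooting",
    ("Validate Problem Statement", "Reject"): "Adjust Problem Statement",
    ("Troubleshooting", None): "Problem Solved",
    ("Problem Solved", "Yes"): "Solution Statement",
    ("Problem Solved", "No"): "Analysis",
    ("Adjust Problem Statement", "Yes"): "Problem Statement",
    ("Adjust Problem Statement", "No"): "End of Investigation",
    ("Solution Statement", None): "End of Investigation",
}


def next_step(step, outcome=None):
    """Return the step that follows `step` for the given decision outcome."""
    try:
        return TRANSITIONS[(step, outcome)]
    except KeyError:
        raise ValueError(
            f"no transition from {step!r} with outcome {outcome!r}"
        ) from None


def walk(outcomes):
    """Follow the flowchart from the start, using one outcome per decision node."""
    decisions = iter(outcomes)
    step = "Start of Investigation"
    path = [step]
    while step != "End of Investigation":
        if (step, None) in TRANSITIONS:
            # Single outgoing edge: no decision needed.
            step = next_step(step)
        else:
            # Decision node: consume the next outcome (None if exhausted,
            # which makes next_step raise ValueError).
            step = next_step(step, next(decisions, None))
        path.append(step)
    return path
```

For example, `walk(["Accept", "Yes"])` traces the straight-through path from Problem Statement to Solution Statement, while `walk(["Reject", "No"])` ends the investigation after the Problem Statement is rejected and not adjusted.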
The root of this repo contains the following folders, organized by status, with one subfolder per investigation:
Ongoing investigation artifacts
Contains completed or abandoned investigation folders with artifacts retained for future reference
Each investigation folder contains artifacts organized by process step:
This is a Markdown document in the investigation folder. It contains a short initial statement of the issue that requires investigation.
Note: Each Problem Statement reaches SRE with inherent assumptions and bias.
Objective data that can be used to perform Analysis
- screenshots
- logs
- metrics
- traces
- events
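As one illustration of capturing a metrics Observation, the sketch below builds a request against Datadog's v1 timeseries query endpoint and saves the raw response as an artifact. It assumes the US1 site (`api.datadoghq.com`) and uses only the Python standard library. The function names, the output path, and the example query are hypothetical and not part of this repo.

```python
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: the US1 Datadog site. Other sites use a different hostname.
DATADOG_QUERY_URL = "https://api.datadoghq.com/api/v1/query"


def build_metric_query(query, start, end, api_key, app_key):
    """Build a GET request for a timeseries query between two POSIX timestamps."""
    params = urllib.parse.urlencode(
        {"from": int(start), "to": int(end), "query": query}
    )
    return urllib.request.Request(
        f"{DATADOG_QUERY_URL}?{params}",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
    )


def save_snapshot(request, path):
    """Send the request and store the raw JSON response as an artifact file."""
    with urllib.request.urlopen(request, timeout=30) as response:
        Path(path).write_bytes(response.read())
```

A call such as `save_snapshot(build_metric_query("avg:system.cpu.user{*}", start, end, api_key, app_key), "cpu.json")` would store the response next to the other Observation artifacts. Storing the raw response keeps the Observation objective and leaves interpretation to the Analysis step.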
Opinionated correlation of Observations to perform Validation of the Problem Statement
- written documentation
- advanced notebooks
Active steps to mitigate Validated Observations
- scripts
- configs
This is an update made to the Problem Statement document once an investigation has been completed. Solution Statements should only appear in archived investigation folders.
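The per-investigation layout described above could be scaffolded with a short script. This is a sketch under stated assumptions: the status folder name `active`, the step folder names, and the file name `problem-statement.md` are hypothetical placeholders and should be replaced with the names this repo actually uses.

```python
from pathlib import Path

# Hypothetical step folder names, one per artifact-producing process step.
STEP_FOLDERS = ("observation", "analysis", "troubleshooting")


def scaffold_investigation(repo_root, name, statement, status_folder="active"):
    """Create an investigation folder with step subfolders and a Problem Statement."""
    folder = Path(repo_root) / status_folder / name
    for step in STEP_FOLDERS:
        (folder / step).mkdir(parents=True, exist_ok=True)

    # Hypothetical file name for the Problem Statement document.
    doc = folder / "problem-statement.md"
    if not doc.exists():
        # Never overwrite an existing Problem Statement.
        doc.write_text(
            f"# Problem Statement\n\n{statement.strip()}\n", encoding="utf-8"
        )
    return folder
```

The function is safe to run more than once, because existing folders and an existing Problem Statement are left untouched.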