skills/sre/SKILL.md
37 additions & 7 deletions
@@ -11,7 +11,7 @@ You are an expert SRE. You stay calm under pressure. You stabilize first, debug

 ## Golden Rules

-1. **NEVER GUESS. EVER.** If you don't know, query. If you can't query, ask. Reading code tells you what COULD happen. Only data tells you what DID happen. "I understand the mechanism" is a red flag—you don't until you've proven it with queries.
+1. **NEVER GUESS. EVER.** If you don't know, query. If you can't query, ask. Reading code tells you what COULD happen. Only data tells you what DID happen. "I understand the mechanism" is a red flag—you don't until you've proven it with queries. Using field names or values from memory without running `getschema` and `distinct`/`topk` on the actual dataset IS guessing.

 2. **Follow the data.** Every claim must trace to a query result. Say "the logs show X" not "this is probably X". If you catch yourself saying "so this means..."—STOP. Query to verify.

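The added sentence in rule 1 (field names from memory count as guessing) could be checked mechanically before a query is sent. The following is a minimal sketch, assuming the `getschema` result has already been reduced to a plain list of field names; the function name `unknown_fields` and the sample field names are hypothetical and not part of the skill.

```python
def unknown_fields(schema_fields, query_fields):
    """Return the query fields that are absent from the discovered schema."""
    known = set(schema_fields)
    return [field for field in query_fields if field not in known]


# Hypothetical field list taken from a getschema run, plus a query plan
# that relies on a field name recalled from memory.
schema = ["_time", "service.name", "attributes.custom", "status"]
wanted = ["service.name", "attributes.http.status_code"]

missing = unknown_fields(schema, wanted)
if missing:
    print("Run schema discovery first; unverified fields:", missing)
```

Any non-empty result means the query still rests on an unverified name, so discovery should be repeated before filtering on it.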
@@ -105,10 +105,38 @@ scripts/init

 Follow this loop strictly.

-### A. DISCOVER
-- Review `scripts/init` output
-- Map your mental model to available datasets
-- If you see `['k8s-logs-prod']`, use that—not `['logs']`
+### A. DISCOVER (MANDATORY — DO NOT SKIP)
+
+**Before writing ANY query against a dataset, you MUST discover its schema.** This is not optional. Skipping schema discovery is the #1 cause of lazy, wrong queries.
+
+**Step 1: Identify datasets** — Review `scripts/init` output. Use ONLY dataset names from discovery. If you see `['k8s-logs-prod']`, use that—not `['logs']`.
+
+**Step 2: Get schema** — Run `getschema` on every dataset you plan to query:
+```apl
+['dataset'] | getschema
+```
+
+**Step 3: Discover values of low-cardinality fields** — For fields you plan to filter on (service names, labels, status codes, log levels), enumerate their actual values:
+```apl
+['dataset'] | where _time > ago(15m) | distinct field_name
+['dataset'] | where _time > ago(15m) | summarize count() by field_name | top 20 by count_
+```
+
+**Step 4: Discover map type schemas** — Fields typed as `map[string]` (e.g., `attributes.custom`, `attributes`, `resource`) don't show their keys in `getschema`. You MUST sample them to discover their internal structure:
+```apl
+// Sample 1 raw event to see all map keys
+['dataset'] | where _time > ago(15m) | take 1
+
+// If too wide, project just the map column and sample
+['dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+
+// Discover distinct keys inside a map column
+['dataset'] | where _time > ago(15m) | extend keys = ['attributes.custom'] | mv-expand keys | summarize count() by tostring(keys) | top 20 by count_
+```
+
+**Why this matters:** Map fields (common in OTel traces/spans) contain nested key-value pairs that are invisible to `getschema`. If you query `['attributes.http.status_code']` without first confirming that key exists, you're guessing. The actual field might be `['attributes.http.response.status_code']` or stored inside `['attributes.custom']` as a map key.
+
+**NEVER assume field names inside map types.** Always sample first.

 ### B. CODE CONTEXT
 - **Locate Code:** Find the relevant service in the repository
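The map-key discovery in Step 4 of this hunk can also be cross-checked client-side once a few sampled events are in hand. This is a minimal sketch, assuming the sampled rows have been exported as JSON objects and loaded as Python dicts; the export shape, the function name `count_map_keys`, and the sample values are assumptions for illustration only.

```python
from collections import Counter


def count_map_keys(events, column):
    """Count how often each key appears inside a map-typed column."""
    counts = Counter()
    for event in events:
        value = event.get(column)
        # Skip rows where the map column is missing or null.
        if isinstance(value, dict):
            counts.update(value.keys())
    return counts


# Hypothetical sampled events, e.g. the result of a `take 5` query.
sampled = [
    {"attributes.custom": {"http.response.status_code": 200, "tenant": "a"}},
    {"attributes.custom": {"http.response.status_code": 500}},
    {"attributes.custom": None},
]

key_counts = count_map_keys(sampled, "attributes.custom")
for key, seen in key_counts.most_common(20):
    print(key, seen)
```

Counting keys on the sample shows which names actually occur before any filter is written against them, which is the point of the "always sample first" rule.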
@@ -424,8 +452,10 @@ See `reference/postmortem-template.md` for retrospective format.

 **If `scripts/init` warns of BLOAT:**
 1. **Finish task:** Solve the current incident first
-2. **Request sleep:** "Memory is full. Start a new session with `scripts/sleep` to consolidate."
-3. **Consolidate:** Read raw facts, synthesize into patterns, clean noise
+2. **Request sleep:** "Memory is full. Start a new session with sleep cycle."
+3. **Run packaged sleep:** `scripts/sleep --org axiom` (default is full preset)
+4. **Distill via fixed prompt:** write exactly one incidents/facts/patterns/queries sleep-cycle entry set (use `-v2`/`-v3` if same-day key exists and add `Supersedes`).
+5. **No improvisation:** Use the script output and prompt template; do not invent details.
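Step 4 in this hunk says to use `-v2`/`-v3` when a same-day key already exists. One way to read that rule is sketched below; the diff does not show the real key format, so the `2025-01-15-sleep` style key and the function name `next_entry_key` are purely hypothetical.

```python
def next_entry_key(base_key, existing_keys):
    """Return base_key, or the first free base_key-vN (N >= 2) if it is taken."""
    if base_key not in existing_keys:
        return base_key
    version = 2
    while f"{base_key}-v{version}" in existing_keys:
        version += 1
    return f"{base_key}-v{version}"


# Hypothetical keys already written earlier on the same day.
existing = {"2025-01-15-sleep", "2025-01-15-sleep-v2"}
print(next_entry_key("2025-01-15-sleep", existing))
```

Under this reading, a versioned entry would also carry a `Supersedes` reference to the key it replaces, as the added step requires.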
['dataset'] | extend value = tostring(['attributes']['nested.key'])
 ```

+### Map Type Discovery (CRITICAL for OTel Traces)
+
+Fields typed as `map[string]` in `getschema` (e.g., `attributes`, `attributes.custom`, `resource`, `resource.attributes`) are opaque containers — `getschema` only shows the column name and type `map[string]`, NOT the keys inside. You must discover map contents explicitly.
+
+**Step 1: Identify map columns** — Run `getschema` and look for `map` types:
+```apl
+['traces-dataset'] | getschema
+// Look for: attributes map[string]...
+//           attributes.custom map[string]...
+//           resource map[string]...
+```
+
+**Step 2: Sample raw events** — The fastest way to see actual map keys:
+```apl
+// See full event structure including all map keys
+['traces-dataset'] | where _time > ago(15m) | take 1
+
+// Project just the map column to reduce noise
+['traces-dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+['traces-dataset'] | where _time > ago(15m) | project attributes | take 5
+```
+
+**Step 3: Enumerate distinct keys** — For high-cardinality maps, find what keys exist:
+```apl
+// List keys and their frequency
+['traces-dataset'] | where _time > ago(15m)
+| extend keys = ['attributes.custom']
+| mv-expand keys
+| summarize count() by tostring(keys)
+| top 30 by count_
+```
+
+**Step 4: Access map values in queries** — Use bracket notation:
**WARNING:** Do NOT assume key names inside maps. The same semantic attribute may appear under different keys depending on instrumentation library, OTel SDK version, or custom configuration. Always sample first.
['dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+['dataset'] | where _time > ago(15m) | project attributes | take 5
+```
+
+**Rule:** If your first filter query returns 0 results, run schema discovery before trying another filter.
+
+### Map Type Key Discovery (OTel Traces)
+
+Map columns (`map[string]` type) are common in OTel traces datasets. `getschema` shows the column exists but NOT its internal keys. You must sample to discover them.
+
+```apl
+// Sample map column contents
+['traces'] | where _time > ago(15m) | project ['attributes.custom'] | take 3
+
+// Enumerate all distinct keys in a map column
+['traces'] | where _time > ago(15m)
+| extend keys = ['attributes.custom']
+| mv-expand keys
+| summarize count() by tostring(keys)
+| top 30 by count_
+
+// Access specific map values (use bracket notation)
+['traces'] | where _time > ago(15m)
+| extend status = toint(['attributes.custom']['http.response.status_code']),
+    method = tostring(['attributes']['http.method'])
+```
+
 Ready-to-use APL queries for common investigation scenarios.

 ## Error Analysis
@@ -112,16 +157,10 @@ Ready-to-use APL queries for common investigation scenarios.
 | project _time, request_id, service, uri, status
 ```

-## Schema Discovery
+## General Schema Helpers

 ```apl
-// Get schema with types (Fastest)
-['dataset'] | getschema
-
-// Sample data to see specific fields
-['dataset'] | where _time between (ago(1h) .. now()) | project _time, message, level | take 5
-
-// Top values for a field
+// Top values for any field
 ['dataset'] | where _time between (ago(1h) .. now()) | summarize topk(field, 10)