Commit da55729

sre: sync from gilfoyle@be465fa (#25)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 5807feb commit da55729

5 files changed: +675 -37 lines changed

skills/sre/SKILL.md

Lines changed: 37 additions & 7 deletions
@@ -11,7 +11,7 @@ You are an expert SRE. You stay calm under pressure. You stabilize first, debug
 
 ## Golden Rules
 
-1. **NEVER GUESS. EVER.** If you don't know, query. If you can't query, ask. Reading code tells you what COULD happen. Only data tells you what DID happen. "I understand the mechanism" is a red flag—you don't until you've proven it with queries.
+1. **NEVER GUESS. EVER.** If you don't know, query. If you can't query, ask. Reading code tells you what COULD happen. Only data tells you what DID happen. "I understand the mechanism" is a red flag—you don't until you've proven it with queries. Using field names or values from memory without running `getschema` and `distinct`/`topk` on the actual dataset IS guessing.
 
 2. **Follow the data.** Every claim must trace to a query result. Say "the logs show X" not "this is probably X". If you catch yourself saying "so this means..."—STOP. Query to verify.

@@ -105,10 +105,38 @@ scripts/init
 
 Follow this loop strictly.
 
-### A. DISCOVER
-- Review `scripts/init` output
-- Map your mental model to available datasets
-- If you see `['k8s-logs-prod']`, use that—not `['logs']`
+### A. DISCOVER (MANDATORY — DO NOT SKIP)
+
+**Before writing ANY query against a dataset, you MUST discover its schema.** This is not optional. Skipping schema discovery is the #1 cause of lazy, wrong queries.
+
+**Step 1: Identify datasets** — Review `scripts/init` output. Use ONLY dataset names from discovery. If you see `['k8s-logs-prod']`, use that—not `['logs']`.
+
+**Step 2: Get schema** — Run `getschema` on every dataset you plan to query:
+```apl
+['dataset'] | getschema
+```
+
+**Step 3: Discover values of low-cardinality fields** — For fields you plan to filter on (service names, labels, status codes, log levels), enumerate their actual values:
+```apl
+['dataset'] | where _time > ago(15m) | distinct field_name
+['dataset'] | where _time > ago(15m) | summarize count() by field_name | top 20 by count_
+```
+
+**Step 4: Discover map type schemas** — Fields typed as `map[string]` (e.g., `attributes.custom`, `attributes`, `resource`) don't show their keys in `getschema`. You MUST sample them to discover their internal structure:
+```apl
+// Sample 1 raw event to see all map keys
+['dataset'] | where _time > ago(15m) | take 1
+
+// If too wide, project just the map column and sample
+['dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+
+// Discover distinct keys inside a map column
+['dataset'] | where _time > ago(15m) | extend keys = ['attributes.custom'] | mv-expand keys | summarize count() by tostring(keys) | top 20 by count_
+```
+
+**Why this matters:** Map fields (common in OTel traces/spans) contain nested key-value pairs that are invisible to `getschema`. If you query `['attributes.http.status_code']` without first confirming that key exists, you're guessing. The actual field might be `['attributes.http.response.status_code']` or stored inside `['attributes.custom']` as a map key.
+
+**NEVER assume field names inside map types.** Always sample first.
 
 ### B. CODE CONTEXT
 - **Locate Code:** Find the relevant service in the repository
@@ -424,8 +452,10 @@ See `reference/postmortem-template.md` for retrospective format.
 
 **If `scripts/init` warns of BLOAT:**
 1. **Finish task:** Solve the current incident first
-2. **Request sleep:** "Memory is full. Start a new session with `scripts/sleep` to consolidate."
-3. **Consolidate:** Read raw facts, synthesize into patterns, clean noise
+2. **Request sleep:** "Memory is full. Start a new session with sleep cycle."
+3. **Run packaged sleep:** `scripts/sleep --org axiom` (default is full preset)
+4. **Distill via fixed prompt:** write exactly one incidents/facts/patterns/queries sleep-cycle entry set (use `-v2`/`-v3` if same-day key exists and add `Supersedes`).
+5. **No improvisation:** Use the script output and prompt template; do not invent details.
 
 ---
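Editorial illustration (not part of the commit): the discovery rule added to `SKILL.md` above can be read as a pre-flight check on every planned filter. The sketch below is a minimal Python rendering of that idea under stated assumptions: the `getschema` and `distinct` results have already been fetched and are passed in as plain Python collections, and `check_filters` with its argument shapes is a hypothetical helper, not something the repository is known to contain.

```python
def check_filters(schema_fields, known_values, planned_filters):
    """Return a list of problems with planned filters; an empty list means no guessing.

    schema_fields:   set of field names reported by `getschema` (assumed input)
    known_values:    dict of field name -> set of values seen via `distinct` (assumed input)
    planned_filters: dict of field name -> value the query intends to filter on
    """
    problems = []
    for field, value in planned_filters.items():
        if field not in schema_fields:
            # Field name came from memory, not from the schema.
            problems.append(f"unknown field: {field}")
        elif field not in known_values:
            # Field exists, but its values were never enumerated.
            problems.append(f"values not discovered for: {field}")
        elif value not in known_values[field]:
            # Value was never observed in the sampled window.
            problems.append(f"value {value!r} not seen in: {field}")
    return problems
```

A filter such as `level == "ERROR"` would be flagged here if discovery only showed `error` and `info`, which is the kind of case the zero-results rule in the commit is meant to catch early.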

skills/sre/reference/apl.md

Lines changed: 57 additions & 0 deletions
@@ -45,6 +45,63 @@ axiom-query staging -f /tmp/q.apl
 ['dataset'] | extend value = tostring(['attributes']['nested.key'])
 ```
 
+### Map Type Discovery (CRITICAL for OTel Traces)
+
+Fields typed as `map[string]` in `getschema` (e.g., `attributes`, `attributes.custom`, `resource`, `resource.attributes`) are opaque containers — `getschema` only shows the column name and type `map[string]`, NOT the keys inside. You must discover map contents explicitly.
+
+**Step 1: Identify map columns** — Run `getschema` and look for `map` types:
+```apl
+['traces-dataset'] | getschema
+// Look for: attributes map[string]...
+//           attributes.custom map[string]...
+//           resource map[string]...
+```
+
+**Step 2: Sample raw events** — The fastest way to see actual map keys:
+```apl
+// See full event structure including all map keys
+['traces-dataset'] | where _time > ago(15m) | take 1
+
+// Project just the map column to reduce noise
+['traces-dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+['traces-dataset'] | where _time > ago(15m) | project attributes | take 5
+```
+
+**Step 3: Enumerate distinct keys** — For high-cardinality maps, find what keys exist:
+```apl
+// List keys and their frequency
+['traces-dataset'] | where _time > ago(15m)
+| extend keys = ['attributes.custom']
+| mv-expand keys
+| summarize count() by tostring(keys)
+| top 30 by count_
+```
+
+**Step 4: Access map values in queries** — Use bracket notation:
+```apl
+// Access a specific key inside a map column
+['traces-dataset'] | where _time > ago(15m)
+| extend http_status = toint(['attributes.custom']['http.response.status_code'])
+
+// Filter on map values
+['traces-dataset'] | where _time > ago(15m)
+| where tostring(['attributes.custom']['db.system']) == "redis"
+
+// Multiple map fields
+['traces-dataset'] | where _time > ago(15m)
+| extend method = tostring(['attributes']['http.method']),
+         route = tostring(['attributes']['http.route']),
+         status = toint(['attributes']['http.response.status_code'])
+```
+
+**Common OTel map columns and what they contain:**
+- `attributes` — Span attributes (HTTP method, status, DB queries, custom tags)
+- `attributes.custom` — Non-standard/user-defined span attributes
+- `resource` — Resource attributes (service.name, host, k8s metadata)
+- `resource.attributes` — Additional resource metadata
+
+**WARNING:** Do NOT assume key names inside maps. The same semantic attribute may appear under different keys depending on instrumentation library, OTel SDK version, or custom configuration. Always sample first.
+
 **Common escaped fields in k8s-logs-prod:**
 - `kubernetes.node_labels.karpenter\\.sh/nodepool`
 - `kubernetes.node_labels.nodepool\\.axiom\\.co/name`
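Editorial illustration (not part of the commit): the warning above says the same semantic attribute may live under different keys, so a query should use whichever key the sample actually shows. The Python sketch below works on already-sampled events (for example the rows returned by a `take 5` query) and assumes each event is a dict whose map column is itself a dict. `count_map_keys` and `resolve_key` are hypothetical helper names.

```python
from collections import Counter


def count_map_keys(events, map_column):
    """Count how often each key appears inside `map_column` across sampled events."""
    counts = Counter()
    for event in events:
        value = event.get(map_column)
        # Skip events where the column is absent or null.
        if isinstance(value, dict):
            counts.update(value.keys())
    return counts


def resolve_key(counts, candidates):
    """Return the first candidate key that was actually observed, else None."""
    for key in candidates:
        if counts.get(key, 0) > 0:
            return key
    return None
```

Passing candidates in order of preference, for instance `["http.status_code", "http.response.status_code"]`, yields the key to use in bracket notation, or `None` when more sampling is needed.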

skills/sre/reference/memory-system.md

Lines changed: 12 additions & 2 deletions
@@ -139,10 +139,20 @@ Connection pool exhausted. Found leak in payment handler.
 
 Run after incidents or periodically:
 ```bash
-scripts/sleep # Review recent additions for consolidation
+scripts/sleep # default full preset: clean + share + prompt
+scripts/sleep --org axiom # same full preset, scoped to one org
+scripts/sleep --org axiom --dry-run # analyze + prompt only
 ```
 
-This will dump recent memory additions for you to review and synthesize into patterns.
+Deep sleep phases:
+- `N1 review` recent entries in the selected window.
+- `N2 analysis` entry counts, duplicate keys, and type drift.
+- `N3 apply` deterministic cleanup (keep newest duplicate, drop `Supersedes` targets, normalize `type` in incidents/patterns/queries).
+- `REM share` commit/push org repo changes.
+
+Safety defaults:
+- no mode flags => full preset.
+- `--dry-run` never modifies files and never pushes.
 
 ## Health Check
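Editorial illustration (not part of the commit): the `N3 apply` phase is described as deterministic cleanup that keeps the newest duplicate and drops `Supersedes` targets. The `scripts/sleep` implementation is not shown in this diff, so the Python sketch below only illustrates that rule. The entry shape (`key`, `date`, optional `supersedes`) and the order of the two steps are assumptions, and `type` normalization is left out.

```python
def n3_cleanup(entries):
    """Keep the newest entry per key, then drop entries that another entry supersedes.

    Each entry is a dict with hypothetical fields:
      "key": str, "date": ISO-8601 date string, "supersedes": optional key string
    """
    newest = {}
    for entry in entries:
        current = newest.get(entry["key"])
        # ISO-8601 dates compare correctly as plain strings.
        if current is None or entry["date"] > current["date"]:
            newest[entry["key"]] = entry

    # Keys named by a surviving entry's "supersedes" field are removed.
    superseded = {e["supersedes"] for e in newest.values() if e.get("supersedes")}

    # Sort by key so the output is deterministic regardless of input order.
    return sorted(
        (e for e in newest.values() if e["key"] not in superseded),
        key=lambda e: e["key"],
    )
```

Sorting the result is a design choice made here so that repeated runs over the same input give the same file contents, which is one plausible reading of "deterministic".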

skills/sre/reference/query-patterns.md

Lines changed: 47 additions & 8 deletions
@@ -1,5 +1,50 @@
 # Signal Reading Query Patterns
 
+## Schema & Value Discovery (MANDATORY FIRST STEP)
+
+**Always run schema discovery before writing investigation queries.** Do not guess field names.
+
+```apl
+// Step 1: Get schema with types
+['dataset'] | getschema
+
+// Step 2: Sample raw events to see actual data shape (especially map fields)
+['dataset'] | where _time > ago(15m) | take 1
+
+// Step 3: Discover values of low-cardinality fields you plan to filter on
+['dataset'] | where _time > ago(15m) | distinct ['kubernetes.labels.app']
+['dataset'] | where _time > ago(15m) | summarize count() by ['service.name'] | top 20 by count_
+['dataset'] | where _time > ago(15m) | summarize count() by level | top 10 by count_
+
+// Step 4: Discover keys inside map[string] columns (getschema won't show these)
+// OTel traces datasets commonly have: attributes, attributes.custom, resource
+['dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+['dataset'] | where _time > ago(15m) | project attributes | take 5
+```
+
+**Rule:** If your first filter query returns 0 results, run schema discovery before trying another filter.
+
+### Map Type Key Discovery (OTel Traces)
+
+Map columns (`map[string]` type) are common in OTel traces datasets. `getschema` shows the column exists but NOT its internal keys. You must sample to discover them.
+
+```apl
+// Sample map column contents
+['traces'] | where _time > ago(15m) | project ['attributes.custom'] | take 3
+
+// Enumerate all distinct keys in a map column
+['traces'] | where _time > ago(15m)
+| extend keys = ['attributes.custom']
+| mv-expand keys
+| summarize count() by tostring(keys)
+| top 30 by count_
+
+// Access specific map values (use bracket notation)
+['traces'] | where _time > ago(15m)
+| extend status = toint(['attributes.custom']['http.response.status_code']),
+         method = tostring(['attributes']['http.method'])
+```
+
 Ready-to-use APL queries for common investigation scenarios.
 
 ## Error Analysis
@@ -112,16 +157,10 @@ Ready-to-use APL queries for common investigation scenarios.
 | project _time, request_id, service, uri, status
 ```
 
-## Schema Discovery
+## General Schema Helpers
 
 ```apl
-// Get schema with types (Fastest)
-['dataset'] | getschema
-
-// Sample data to see specific fields
-['dataset'] | where _time between (ago(1h) .. now()) | project _time, message, level | take 5
-
-// Top values for a field
+// Top values for any field
 ['dataset'] | where _time between (ago(1h) .. now()) | summarize topk(field, 10)
 
 // What services exist?