sre: sync from gilfoyle@be465fa (#25)

github-actions[bot] · web-flow · commit da55729c5aa0 · 2026-02-14T23:49:41.000+01:00
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
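
The map-key discovery step this change documents (sample a few events, then see which keys actually occur inside a `map[string]` column) can also be checked client-side once sampled rows are in hand. The following is a minimal sketch, not part of the commit: the helper name `count_map_keys` and the sample rows are hypothetical, and the rows are only assumed to be shaped like the result of projecting a map column and taking a few events.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_map_keys(events: Iterable[Mapping[str, Any]], column: str) -> Counter:
    """Count how often each key appears inside a map-typed column across sampled events."""
    counts: Counter = Counter()
    for event in events:
        value = event.get(column)
        # Skip events where the column is missing, null, or not a map.
        if isinstance(value, dict):
            counts.update(value.keys())
    return counts


# Hypothetical sample rows (assumption: shaped like a projected map column).
sample_events = [
    {"attributes.custom": {"http.response.status_code": 200, "db.system": "redis"}},
    {"attributes.custom": {"http.response.status_code": 500}},
    {"attributes.custom": None},
    {"other": 1},
]

key_counts = count_map_keys(sample_events, "attributes.custom")
for key, n in key_counts.most_common(30):
    print(f"{key}\t{n}")
```

Counting keys rather than whole key/value pairs keeps the result small even when values are high-cardinality, which is what matters when the goal is only to confirm that a key exists before filtering on it.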
diff --git a/skills/sre/SKILL.md b/skills/sre/SKILL.md
@@ -11,7 +11,7 @@ You are an expert SRE. You stay calm under pressure. You stabilize first, debug
 
 ## Golden Rules
 
-1. **NEVER GUESS. EVER.** If you don't know, query. If you can't query, ask. Reading code tells you what COULD happen. Only data tells you what DID happen. "I understand the mechanism" is a red flag—you don't until you've proven it with queries.
+1. **NEVER GUESS. EVER.** If you don't know, query. If you can't query, ask. Reading code tells you what COULD happen. Only data tells you what DID happen. "I understand the mechanism" is a red flag—you don't until you've proven it with queries. Using field names or values from memory without running `getschema` and `distinct`/`topk` on the actual dataset IS guessing.
 
 2. **Follow the data.** Every claim must trace to a query result. Say "the logs show X" not "this is probably X". If you catch yourself saying "so this means..."—STOP. Query to verify.
 
@@ -105,10 +105,38 @@ scripts/init
 
 Follow this loop strictly.
 
-### A. DISCOVER
-- Review `scripts/init` output
-- Map your mental model to available datasets
-- If you see `['k8s-logs-prod']`, use that—not `['logs']`
+### A. DISCOVER (MANDATORY — DO NOT SKIP)
+
+**Before writing ANY query against a dataset, you MUST discover its schema.** This is not optional. Skipping schema discovery is the #1 cause of lazy, wrong queries.
+
+**Step 1: Identify datasets** — Review `scripts/init` output. Use ONLY dataset names from discovery. If you see `['k8s-logs-prod']`, use that—not `['logs']`.
+
+**Step 2: Get schema** — Run `getschema` on every dataset you plan to query:
+```apl
+['dataset'] | getschema
+```
+
+**Step 3: Discover values of low-cardinality fields** — For fields you plan to filter on (service names, labels, status codes, log levels), enumerate their actual values:
+```apl
+['dataset'] | where _time > ago(15m) | distinct field_name
+['dataset'] | where _time > ago(15m) | summarize count() by field_name | top 20 by count_
+```
+
+**Step 4: Discover map type schemas** — Fields typed as `map[string]` (e.g., `attributes.custom`, `attributes`, `resource`) don't show their keys in `getschema`. You MUST sample them to discover their internal structure:
+```apl
+// Sample 1 raw event to see all map keys
+['dataset'] | where _time > ago(15m) | take 1
+
+// If too wide, project just the map column and sample
+['dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+
+// Discover distinct keys inside a map column
+['dataset'] | where _time > ago(15m) | extend keys = ['attributes.custom'] | mv-expand keys | summarize count() by tostring(keys) | top 20 by count_
+```
+
+**Why this matters:** Map fields (common in OTel traces/spans) contain nested key-value pairs that are invisible to `getschema`. If you query `['attributes.http.status_code']` without first confirming that key exists, you're guessing. The actual field might be `['attributes.http.response.status_code']` or stored inside `['attributes.custom']` as a map key.
+
+**NEVER assume field names inside map types.** Always sample first.
 
 ### B. CODE CONTEXT
 - **Locate Code:** Find the relevant service in the repository
@@ -424,8 +452,10 @@ See `reference/postmortem-template.md` for retrospective format.
 
 **If `scripts/init` warns of BLOAT:**
 1. **Finish task:** Solve the current incident first
-2. **Request sleep:** "Memory is full. Start a new session with `scripts/sleep` to consolidate."
-3. **Consolidate:** Read raw facts, synthesize into patterns, clean noise
+2. **Request sleep:** "Memory is full. Start a new session with a sleep cycle."
+3. **Run packaged sleep:** `scripts/sleep --org axiom` (default is full preset)
+4. **Distill via fixed prompt:** Write exactly one incidents/facts/patterns/queries sleep-cycle entry set (use `-v2`/`-v3` if a same-day key exists, and add `Supersedes`).
+5. **No improvisation:** Use the script output and prompt template; do not invent details.
 
 ---
 
diff --git a/skills/sre/reference/apl.md b/skills/sre/reference/apl.md
@@ -45,6 +45,63 @@ axiom-query staging -f /tmp/q.apl
 ['dataset'] | extend value = tostring(['attributes']['nested.key'])
 ```
 
+### Map Type Discovery (CRITICAL for OTel Traces)
+
+Fields typed as `map[string]` in `getschema` (e.g., `attributes`, `attributes.custom`, `resource`, `resource.attributes`) are opaque containers — `getschema` only shows the column name and type `map[string]`, NOT the keys inside. You must discover map contents explicitly.
+
+**Step 1: Identify map columns** — Run `getschema` and look for `map` types:
+```apl
+['traces-dataset'] | getschema
+// Look for: attributes        map[string]...
+//           attributes.custom  map[string]...
+//           resource           map[string]...
+```
+
+**Step 2: Sample raw events** — The fastest way to see actual map keys:
+```apl
+// See full event structure including all map keys
+['traces-dataset'] | where _time > ago(15m) | take 1
+
+// Project just the map column to reduce noise
+['traces-dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+['traces-dataset'] | where _time > ago(15m) | project attributes | take 5
+```
+
+**Step 3: Enumerate distinct keys** — For high-cardinality maps, find what keys exist:
+```apl
+// List keys and their frequency
+['traces-dataset'] | where _time > ago(15m)
+| extend keys = ['attributes.custom']
+| mv-expand keys
+| summarize count() by tostring(keys)
+| top 30 by count_
+```
+
+**Step 4: Access map values in queries** — Use bracket notation:
+```apl
+// Access a specific key inside a map column
+['traces-dataset'] | where _time > ago(15m)
+| extend http_status = toint(['attributes.custom']['http.response.status_code'])
+
+// Filter on map values
+['traces-dataset'] | where _time > ago(15m)
+| where tostring(['attributes.custom']['db.system']) == "redis"
+
+// Multiple map fields
+['traces-dataset'] | where _time > ago(15m)
+| extend method = tostring(['attributes']['http.method']),
+         route = tostring(['attributes']['http.route']),
+         status = toint(['attributes']['http.response.status_code'])
+```
+
+**Common OTel map columns and what they contain:**
+- `attributes` — Span attributes (HTTP method, status, DB queries, custom tags)
+- `attributes.custom` — Non-standard/user-defined span attributes
+- `resource` — Resource attributes (service.name, host, k8s metadata)
+- `resource.attributes` — Additional resource metadata
+
+**WARNING:** Do NOT assume key names inside maps. The same semantic attribute may appear under different keys depending on instrumentation library, OTel SDK version, or custom configuration. Always sample first.
+
 **Common escaped fields in k8s-logs-prod:**
 - `kubernetes.node_labels.karpenter\\.sh/nodepool`
 - `kubernetes.node_labels.nodepool\\.axiom\\.co/name`
diff --git a/skills/sre/reference/memory-system.md b/skills/sre/reference/memory-system.md
@@ -139,10 +139,20 @@ Connection pool exhausted. Found leak in payment handler.
 
 Run after incidents or periodically:
 ```bash
-scripts/sleep    # Review recent additions for consolidation
+scripts/sleep                           # default full preset: clean + share + prompt
+scripts/sleep --org axiom               # same full preset, scoped to one org
+scripts/sleep --org axiom --dry-run     # analyze + prompt only
 ```
 
-This will dump recent memory additions for you to review and synthesize into patterns.
+Deep sleep phases:
+- `N1 review` recent entries in the selected window.
+- `N2 analysis` entry counts, duplicate keys, and type drift.
+- `N3 apply` deterministic cleanup (keep newest duplicate, drop `Supersedes` targets, normalize `type` in incidents/patterns/queries).
+- `REM share` commit/push org repo changes.
+
+Safety defaults:
+- no mode flags => full preset.
+- `--dry-run` never modifies files and never pushes.
 
 ## Health Check
 
diff --git a/skills/sre/reference/query-patterns.md b/skills/sre/reference/query-patterns.md
@@ -1,5 +1,50 @@
 # Signal Reading Query Patterns
 
+## Schema & Value Discovery (MANDATORY FIRST STEP)
+
+**Always run schema discovery before writing investigation queries.** Do not guess field names.
+
+```apl
+// Step 1: Get schema with types
+['dataset'] | getschema
+
+// Step 2: Sample raw events to see actual data shape (especially map fields)
+['dataset'] | where _time > ago(15m) | take 1
+
+// Step 3: Discover values of low-cardinality fields you plan to filter on
+['dataset'] | where _time > ago(15m) | distinct ['kubernetes.labels.app']
+['dataset'] | where _time > ago(15m) | summarize count() by ['service.name'] | top 20 by count_
+['dataset'] | where _time > ago(15m) | summarize count() by level | top 10 by count_
+
+// Step 4: Discover keys inside map[string] columns (getschema won't show these)
+// OTel traces datasets commonly have: attributes, attributes.custom, resource
+['dataset'] | where _time > ago(15m) | project ['attributes.custom'] | take 5
+['dataset'] | where _time > ago(15m) | project attributes | take 5
+```
+
+**Rule:** If your first filter query returns 0 results, run schema discovery before trying another filter.
+
+### Map Type Key Discovery (OTel Traces)
+
+Map columns (`map[string]` type) are common in OTel traces datasets. `getschema` shows the column exists but NOT its internal keys. You must sample to discover them.
+
+```apl
+// Sample map column contents
+['traces'] | where _time > ago(15m) | project ['attributes.custom'] | take 3
+
+// Enumerate all distinct keys in a map column
+['traces'] | where _time > ago(15m)
+| extend keys = ['attributes.custom']
+| mv-expand keys
+| summarize count() by tostring(keys)
+| top 30 by count_
+
+// Access specific map values (use bracket notation)
+['traces'] | where _time > ago(15m)
+| extend status = toint(['attributes.custom']['http.response.status_code']),
+         method = tostring(['attributes']['http.method'])
+```
+
 Ready-to-use APL queries for common investigation scenarios.
 
 ## Error Analysis
@@ -112,16 +157,10 @@ Ready-to-use APL queries for common investigation scenarios.
 | project _time, request_id, service, uri, status
 ```
 
-## Schema Discovery
+## General Schema Helpers
 
 ```apl
-// Get schema with types (Fastest)
-['dataset'] | getschema
-
-// Sample data to see specific fields
-['dataset'] | where _time between (ago(1h) .. now()) | project _time, message, level | take 5
-
-// Top values for a field
+// Top values for any field
 ['dataset'] | where _time between (ago(1h) .. now()) | summarize topk(field, 10)
 
 // What services exist?
diff --git a/skills/sre/scripts/sleep b/skills/sre/scripts/sleep