
k8s-causal-memory

Operational Memory Architecture (OMA) for Kubernetes — an open-source system that captures, stores, and queries causal event chains in Kubernetes clusters, preserving diagnostic context that the platform's native event retention model discards within 90 seconds.


The Problem

When a Kubernetes pod crashes, the platform gives you approximately 90 seconds to capture the evidence before it's overwritten. The LastTerminationState field — which records the exact reason, exit code, and resource context of a container failure — is replaced the moment a new restart cycle begins.

T=0s    OOMKill fires      ← exit_code=137, memory=64Mi, ConfigMap=oom-app-config
T=15s   Pod restarts       ← LastTerminationState overwritten
T=90s   kubectl describe   ← Error: evidence rotated, partial data only
T+5min  On-call arrives    ← kubectl get pod: Error from server (NotFound)

Existing tools — Prometheus, Grafana, ELK — record what happened. None preserve the causal context linking why it happened, which configuration was active, or what the cluster state was at the exact moment of failure.
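
While the evidence still exists, it can be read straight from the API; a minimal sketch using the official Kubernetes Python client (pod name and namespace are placeholders):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder names: substitute the crashing pod in your cluster.
pod = v1.read_namespaced_pod(name="oom-victim", namespace="oma-demo")
for cs in pod.status.container_statuses or []:
    term = cs.last_state.terminated
    if term is not None:
        # Exit code 137 = SIGKILL, the signature of a kernel OOM kill
        print(cs.name, term.reason, term.exit_code, term.finished_at)

This is the window OMA automates: it performs the equivalent capture at event time, before the record is rotated.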


What OMA Captures

Three causal patterns encoded as first-class definitions:

Pattern                          Trigger              What OMA Preserves
P001  OOMKill Chain              Container OOMKilled  Exit code, resource limits, ConfigMaps in effect, node state (frozen at kill time)
P002  ConfigMap Env Var Stale    ConfigMap updated    Content hash delta, changed keys, list of pods still running with old values
P003  ConfigMap Mount Swap       ConfigMap updated    Kubelet symlink swap timestamp, propagation latency measurement
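
The encoders themselves live in collector/patterns/ (Go). Purely as an illustration of the shape of a definition, a hypothetical Python rendering with assumed field names:

from dataclasses import dataclass

@dataclass
class CausalPattern:
    pattern_id: str           # e.g. "P001"
    trigger: str              # event that opens the chain
    evidence: list[str]       # payloads frozen at trigger time
    effect: str               # downstream observable effect
    temporal_window_s: float  # max gap allowed when building causal edges

P001 = CausalPattern(
    pattern_id="P001",
    trigger="ContainerOOMKilled",
    evidence=["exit_code", "resource_limits", "configmap_refs", "node_state"],
    effect="PodRestartCycle",
    temporal_window_s=90.0,
)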

Architecture

OMA comprises four layers:

┌─────────────────────────────────────────────────────┐
│              Kubernetes API Server                   │
│        (Pod / Node / ConfigMap watch streams)        │
└──────────────┬──────────────────┬───────────────────┘
               │                  │
┌──────────────▼──────────────────▼───────────────────┐
│  Layer 1 — Go Collector (collector/)                 │
│  PodWatcher │ NodeWatcher │ ConfigMapWatcher         │
│  Captures events with full payload at moment of      │
│  occurrence → output/events.jsonl                    │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│  Layer 2 — Operational Memory Store (storage/)       │
│  SQLite (WAL mode) — 4 tables:                       │
│  events │ causal_edges │ snapshots │ patterns        │
│  Causal edges built automatically on ingest          │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│  Layer 3 — Query Interface (storage/query.py)        │
│  Q1: causal-chain   "What caused this failure?"      │
│  Q2: pattern-history "Has this happened before?"     │
│  Q3: state-at       "What was the state at time T?"  │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│  Layer 4 — Integration Surface (storage/api.py)      │
│  REST API │ Alert webhooks │ AI diagnosis integrations│
└─────────────────────────────────────────────────────┘
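
To make the Layer 1 → Layer 2 handoff concrete, here is a simplified ingest sketch: events land in SQLite (WAL mode) and each new event is linked to a recent prior event for the same pod inside a temporal window. Column names and the epoch-seconds timestamps are assumptions for brevity; the real schema is storage/schema.sql and the real logic is storage/ingest.py.

import json
import sqlite3

db = sqlite3.connect("oma.db")
db.execute("PRAGMA journal_mode=WAL")  # WAL mode, as used by the store
db.execute("""CREATE TABLE IF NOT EXISTS events(
    id INTEGER PRIMARY KEY, ts REAL, kind TEXT, pod TEXT, payload TEXT)""")
db.execute("""CREATE TABLE IF NOT EXISTS causal_edges(
    src INTEGER, dst INTEGER, confidence REAL)""")

def ingest(event, window_s=90.0):
    """Insert one JSONL event, then link it to the most recent prior event
    for the same pod that falls inside the temporal window."""
    cur = db.execute(
        "INSERT INTO events(ts, kind, pod, payload) VALUES (?,?,?,?)",
        (event["ts"], event["kind"], event["pod"], json.dumps(event)))
    dst = cur.lastrowid
    prior = db.execute(
        """SELECT id FROM events
           WHERE pod=? AND id<? AND ts>=? ORDER BY ts DESC LIMIT 1""",
        (event["pod"], dst, event["ts"] - window_s)).fetchone()
    if prior:
        db.execute("INSERT INTO causal_edges VALUES (?,?,?)", (prior[0], dst, 1.0))
    db.commit()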

Quick Start

Prerequisites

  • Go 1.21+
  • Python 3.11+
  • kubectl configured against a cluster
  • Minikube (for local testing) or any Kubernetes cluster

Build the Collector

cd collector
go build -o bin/collector .

Run a Scenario

Terminal 1 — Start the collector:

./collector/bin/collector --namespace oma-demo --output ./output

Terminal 2 — Run a scenario:

bash scenarios/01-oomkill/trigger.sh

Ingest and Query

cd storage
pip install -r requirements.txt
python ingest.py --events ../output/events.jsonl --snapshots ../output/snapshots.jsonl
python query.py summary
python query.py causal-chain --pod <pod-name> --namespace oma-demo
python query.py pattern-history --pattern P001
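
Before ingesting, the collector output can be sanity-checked directly; a minimal sketch (assumes the collector has already written output/events.jsonl at the repository root):

import json

with open("output/events.jsonl") as f:
    events = [json.loads(line) for line in f if line.strip()]

print(f"{len(events)} events; fields on the first event: {sorted(events[0])}")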

Proof of Concept Results

All results are reproducible from the JSONL files committed in docs/poc-results/. The collector was run on two independent cluster environments.

Environment 1 — Minikube (local, 3 nodes)

Scenario 01: OOMKill Causal Chain (P001)

Events: 30  |  Causal edges: 13  |  Snapshots: 1
Pattern P001: 22 events across 4 restart cycles

Q1 Causal Chain:
  OOMKill  2026-02-27T00:10:48
    Node: opscart-m02
    Limits: cpu=100m  memory=64Mi
    ConfigMaps in effect: ['oom-app-config']
    Exit code: 1  Restart count: 4

⚠ Pattern P001 has fired 22 times — escalate to human review.

Scenario 02: ConfigMap Env Var (P002)

ConfigMapChanged  app-feature-config
  old_hash: 72f628cdff16ed24 → new_hash: 960c779cc1c53b0f
  changed_keys: [feature.flag, db.pool.size, api.timeout.ms, log.level]

Pod status after change:
  config-consumer-env-*: FEATURE_FLAG=disabled  ← STALE (ConfigMap now: enabled)
  config-consumer-env-*: FEATURE_FLAG=disabled  ← STALE
  Restart count: 0  (no restart triggered — this is the bug)

Environment 2 — Azure Kubernetes Service (AKS 1.32.10, 2× Standard_B2s)

Events: 20  |  Causal edges: 8  |  Snapshots: 1
Node: aks-nodepool1-78296979-vmss000000

Q1 Causal Chain:
  OOMKill  2026-03-01T17:19:44
    Node: aks-nodepool1-78296979-vmss000000
    Limits: cpu=100m  memory=64Mi
    ConfigMaps in effect: ['oom-app-config']
    Exit code: 137  Restart count: 3

Raw Causal Edges (8 total, all conf=1.0):
  OOMKill → OOMKillEvidence  (0.27ms gap)
  OOMKill → OOMKillEvidence  (1.09ms gap)
  ... (6 more)

Q3 Point-in-Time Snapshot:
  Pod/oom-victim-68f4d5ffd7-bvpcv (oma-demo)
  Trigger: PodDeleted
  Limits: {'oom-victim': {'cpu': '100m', 'memory': '64Mi'}}
  ConfigMaps: ['oom-app-config']
  Phase: Failed
  ← kubectl returns 404 for this pod. OMA returns full state.

All scenarios on AKS:

Scenario              Events  Key Metric           Result
P001 OOMKill          20      Causal edges         8 (conf=1.0), exit code 137, node aks-nodepool1-78296979-vmss000000
P002 ConfigMap env    2       Hash delta captured  72f628cd → 8ee0c528, 4 keys changed
P003 ConfigMap mount  2       Propagation latency  <30s symlink swap confirmed

Statistical Latency Analysis (30 Runs)

To quantify causal edge construction latency, we ran the P001 OOMKill scenario 30 independent times on Minikube, yielding 242 total causal edges.

The distribution is bimodal, reflecting two structurally distinct edge types:

Edge Class                                        Count  Min       Mean       Max
Intra-cycle (<100ms), same restart cycle          88     0.089ms   0.702ms    2.607ms
Cross-cycle (≥100ms), across restart boundaries   154    903ms     12,708ms   31,454ms
  • Intra-cycle edges: OOMKillEvidence captured within the same restart cycle — sub-millisecond latency confirms synchronous evidence capture before rotation
  • Cross-cycle edges: OOMKillEvidence events linked back to OOMKill events from prior restart cycles — latency reflects actual restart interval timing (10–30s), not processing delay
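
The split itself is a single threshold at 100ms on edge latency; a sketch of the classification (the four sample latencies are illustrative values taken from the runs above):

from statistics import mean

def split_edges(latencies_ms, boundary_ms=100.0):
    intra = [x for x in latencies_ms if x < boundary_ms]   # same restart cycle
    cross = [x for x in latencies_ms if x >= boundary_ms]  # across restart cycles
    for name, xs in (("intra-cycle", intra), ("cross-cycle", cross)):
        if xs:
            print(f"{name}: n={len(xs)} min={min(xs):.3f}ms "
                  f"mean={mean(xs):.3f}ms max={max(xs):.3f}ms")

split_edges([0.27, 1.09, 903.0, 31454.0])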

Run the full breakdown across all 30 runs:

bash scripts/analyze-latency.sh

Stress Evaluation (Concurrent OOMKill Pods)

We deployed 5, 10, and 20 simultaneous crash-looping pods on Minikube for 120 seconds each:

Pods  Events  Events/sec  Edges  Collector RAM  Collector CPU
5     95      0.77        51     7.9 MB         <0.1%
10    175     1.43        90     8.2 MB         <0.1%
20    355     2.86        197    8.8 MB         <0.1%

Event ingestion scales linearly with pod count. Collector memory stays flat at 8–9 MB regardless of load — the streaming JSONL model accumulates no in-memory state.


What kubectl Cannot Do

Capability                      kubectl           OMA
OOMKill evidence after restart  Lost in <90s      Preserved indefinitely
Resource limits at kill time    Lost with pod     Frozen in snapshot
ConfigMap in effect at failure  Not available     Captured with refs
Stale env var detection         Not possible      P002 pattern
State of deleted objects        Error (NotFound)  Q3 state-at query
Causal chain reconstruction     Not possible      Q1 with edges
Pattern recurrence detection    Not possible      Q2 with escalation

Repository Structure

k8s-causal-memory/
├── collector/              # Go Kubernetes event collector
│   ├── main.go
│   ├── watcher/            # Pod, Node, ConfigMap watchers
│   ├── patterns/           # P001, P002, P003 encoders
│   └── emitter/            # JSONL output
├── storage/                # Python storage and query layer
│   ├── schema.sql          # SQLite schema (4 tables)
│   ├── ingest.py           # JSONL → SQLite + causal edge construction
│   ├── query.py            # Q1 / Q2 / Q3 canonical queries
│   └── api.py              # REST API (Layer 4)
├── scenarios/
│   ├── 01-oomkill/         # P001: OOMKill causal chain
│   ├── 02-configmap-env/   # P002: Env var silent misconfiguration
│   └── 03-configmap-mount/ # P003: Volume mount symlink swap
├── scripts/
│   └── analyze-latency.sh  # Bimodal latency breakdown across 30 runs
├── docs/
│   ├── architecture.md
│   └── poc-results/        # Committed JSONL + query outputs (reproducible)
│       ├── 01-oomkill/
│       ├── 02-configmap-env/
│       ├── 03-configmap-mount/
│       ├── aks-final/      # AKS 1.32.10 run
│       ├── latency-stats/  # 30-run statistical latency analysis
│       └── stress-eval/    # 5/10/20 pod concurrent stress evaluation
├── run-latency-stats.sh    # Automates 30-run latency collection
├── run-stress-eval.sh      # Automates stress evaluation
└── save-results.sh         # Preserve run output to docs/poc-results/

Scenarios

Scenario 01: OOMKill (P001)

Deploys a pod with a 64Mi memory limit configured to allocate 128Mi. Captures the full OOMKill causal chain before the 90-second evidence horizon.

bash scenarios/01-oomkill/trigger.sh
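
For reference, an illustrative Python-client equivalent of the scenario's pod (image, command, and names are placeholders, and the oma-demo namespace is assumed to exist; the actual manifest is in scenarios/01-oomkill/):

from kubernetes import client, config

config.load_kube_config()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="oom-victim", namespace="oma-demo"),
    spec=client.V1PodSpec(containers=[client.V1Container(
        name="oom-victim",
        image="python:3.11-slim",
        # Allocate 128Mi against a 64Mi limit: the kernel OOM killer fires.
        command=["python", "-c", "x = bytearray(128 * 1024 * 1024)"],
        resources=client.V1ResourceRequirements(
            limits={"cpu": "100m", "memory": "64Mi"}),
    )]),
)
client.CoreV1Api().create_namespaced_pod(namespace="oma-demo", body=pod)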

Scenario 02: ConfigMap Env Var Stale Config (P002)

Deploys 2 pods consuming a ConfigMap as environment variables. Updates the ConfigMap and proves pods continue running with stale values — zero restarts, zero awareness.

bash scenarios/02-configmap-env/trigger.sh

Scenario 03: ConfigMap Volume Mount Propagation (P003)

Deploys a pod consuming a ConfigMap as a volume mount. Measures kubelet symlink swap propagation latency after a ConfigMap update.

bash scenarios/03-configmap-mount/trigger.sh

Canonical Queries

# Q1: What caused this OOMKill? (causal chain reconstruction)
python query.py causal-chain --pod <pod-name> --namespace oma-demo

# Q2: Has this pattern occurred before? (recurrence detection)
python query.py pattern-history --pattern P001  # or P002, P003

# Q3: What was the cluster state at time T? (point-in-time, even after deletion)
python query.py state-at --kind Pod --name <pod-name> --namespace oma-demo --at "2026-03-01T17:19:44"
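
Under the hood, Q3 reduces to "latest snapshot of this object at or before T"; a simplified sketch against an assumed snapshots table (column names and ISO-8601 text timestamps are assumptions; the real query lives in storage/query.py):

import json
import sqlite3

def state_at(db_path, kind, name, namespace, at_iso):
    db = sqlite3.connect(db_path)
    row = db.execute(
        """SELECT payload FROM snapshots
           WHERE kind=? AND name=? AND namespace=? AND ts<=?
           ORDER BY ts DESC LIMIT 1""",
        (kind, name, namespace, at_iso)).fetchone()
    return json.loads(row[0]) if row else None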

Contributing

Additional pattern encoders, storage backends, and integration adapters are welcome. See CONTRIBUTING.md.

Pattern contributions should follow the existing structure in collector/patterns/ and include:

  • A causal pattern definition (trigger, evidence, effect, temporal windows)
  • A scenario trigger script in scenarios/
  • Expected output in scenarios/<n>/expected-output.json

License

MIT License — see LICENSE.


Built and validated on Minikube and Azure Kubernetes Service 1.32.10.
