all-source-os
diff --git a/.claude/skills/chronos-data-flow-embedded/SKILL.md b/.claude/skills/chronos-data-flow-embedded/SKILL.md
Lines changed: 100 additions & 0 deletions
diff --git a/.claude/skills/chronos-embedded-durability/SKILL.md b/.claude/skills/chronos-embedded-durability/SKILL.md
Lines changed: 150 additions & 0 deletions
diff --git a/.claude/skills/chronos-release/SKILL.md b/.claude/skills/chronos-release/SKILL.md
Lines changed: 25 additions & 5 deletions
diff --git a/Cargo.lock b/Cargo.lock
Lines changed: 52 additions & 2 deletions
@@ -0,0 +1,100 @@
+---
+name: chronos-data-flow-embedded
+description: Test Chronos MCP server with embedded Core backend (CORE_MODE=embedded). Verifies Rustler NIF prerequisites, event ingest/query through embedded backend, schema operations, stats/health, and data persistence across restarts. Triggers on "test embedded data flow", "test embedded", "embedded health", "test embedded backend", "verify embedded", "test nif".
+category: testing
+color: cyan
+displayName: Chronos Embedded Data Flow Test
+---
+
+# Chronos Embedded Data Flow Test
+
+Tests the MCP server running with an embedded Core backend — no separate Core container needed. Core runs in-process via Rustler NIFs (`CORE_MODE=embedded`).
+
+## When invoked:
+
+1. Run the embedded data flow test script:
+   ```bash
+   bash tooling/embedded-data-flow-test/test-embedded-data-flow.sh
+   ```
+
+2. Analyze the output and report:
+   - Whether prerequisites are met (NIF compiled, CoreBackend behaviour, CoreEmbedded module)
+   - Which tests passed, failed, or were skipped
+   - Whether the embedded backend is fully functional or still in development
+   - Root cause analysis for any failures
+
+3. If specific sections were requested, use the appropriate flag:
+   - `--prereqs` — check prerequisites only (no server start)
+   - `--ingest` — event ingestion tests only
+   - `--query` — event querying tests only
+   - `--schema` — schema operations tests only
+   - `--stats` — stats/health tests only
+   - `--persist` — persistence test (restart cycle, opt-in)
+   - `--json` — append for machine-readable output
+
+4. If the user just wants to check implementation status, run with `--prereqs` only.
+
+## Test Coverage
+
+The script tests these embedded data flow paths:
+
+| Test Area | What's Checked |
+|-----------|---------------|
+| **Prerequisites** | MCP dir, mix.exs, Elixir runtime, Rustler NIF source/binary, CoreBackend behaviour, CoreEmbedded module, CORE_MODE config |
+| **Server Start** | Start MCP server with `CORE_MODE=embedded`, MCP initialize handshake |
+| **Event Ingestion** | Ingest single event, ingest batch (3 events) via `ingest_event` tool |
+| **Event Querying** | Query by entity_id, query by event_type, query with limit, state reconstruction |
+| **Schema Ops** | Register schema, list schemas, get schema by subject |
+| **Stats/Health** | get_stats, storage_stats, wal_status, deep health check |
+| **Persistence** | Write 5 events, stop server, restart with same data dir, verify events survived |
+
+## Architecture Context
+
+In embedded mode, the data flow is:
+
+```
+MCP Tool Call → CoreBackend behaviour → CoreEmbedded (Rustler NIF) → Core Rust Engine
+                                                                        ↓
+                                                              WAL + Parquet + DashMap
+```
+
+vs. remote mode (the Docker stack):
+
+```
+MCP Tool Call → CoreBackend behaviour → CoreClient (HTTP) → Core Container (port 3900)
+```
+
+Both backends implement the same `CoreBackend` behaviour (27 callbacks). The embedded test verifies the NIF path works identically to the HTTP path.
+
+## Implementation Status
+
+The embedded backend is being built in phases (see `docs/proposals/MCP_EMBEDDED_BACKEND.md`):
+
+| Phase | Component | Status |
+|-------|-----------|--------|
+| 1 | CoreBackend behaviour | Done |
+| 1 | CoreClient implements @behaviour | Done |
+| 1 | mcp_tools.ex uses state.backend | Not started |
+| 1 | server.ex stores backend in state | Not started |
+| 2 | CoreEmbedded NIF module | Not started |
+| 2 | native/core_nif/ Rustler crate | Not started |
+| 3 | CORE_MODE config switch | Not started |
+| 4 | application.ex conditional children | Not started |
+
+Run `--prereqs` to get a live status check of which components exist.
+
+## Common Failure Patterns
+
+- **"NIF not compiled"**: Expected during development. Run `--prereqs` to verify implementation status.
+- **"Cannot start MCP server"**: The CoreEmbedded module or CORE_MODE config switch isn't implemented yet.
+- **"Events not found after restart"**: WAL/Parquet persistence in the embedded backend isn't wired up — check that `ALLSOURCE_DATA_DIR` is being passed to the Rust engine via NIF.
+- **"Timeout on tool call"**: The NIF is blocking a BEAM scheduler. Declare long-running NIFs with `#[rustler::nif(schedule = "DirtyCpu")]`, or `schedule = "DirtyIo"` for I/O-bound calls such as WAL and Parquet writes.
+
+## Environment Variables
+
+```bash
+# Override defaults
+MCP_DIR=/path/to/mcp-server-elixir \
+DATA_DIR=/tmp/my-test-data \
+bash tooling/embedded-data-flow-test/test-embedded-data-flow.sh
+```
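The override pattern above relies on the script falling back to defaults when a variable is unset. A minimal sketch of how that fallback could look, assuming hypothetical default values (the `DATA_DIR` fallback is invented for illustration; the `MCP_DIR` fallback reuses a path that appears elsewhere in this diff, and the real script may differ):

```shell
# Sketch: resolve test settings, letting caller-provided environment variables win.
# resolve_test_env is a hypothetical helper; both fallback values are assumptions.
resolve_test_env() {
  # ${VAR:-default} keeps an existing non-empty value and otherwise uses the default.
  MCP_DIR="${MCP_DIR:-apps/mcp-server-elixir}"
  DATA_DIR="${DATA_DIR:-/tmp/chronos-embedded-test}"
  export MCP_DIR DATA_DIR
  echo "MCP_DIR=$MCP_DIR"
  echo "DATA_DIR=$DATA_DIR"
}
```

Calling `resolve_test_env` with neither variable set prints the two fallback values; exporting `MCP_DIR` beforehand changes only that line.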
@@ -0,0 +1,150 @@
+---
+name: chronos-embedded-durability
+description: Test AllSource Core embedded durability by running Rust integration tests that verify WAL recovery, crash recovery, and checkpoint correctness. Catches silent-no-op checkpoint bugs like issue #84. Triggers on "test embedded durability", "embedded crash recovery", "test wal recovery", "test embedded persistence", "verify embedded durability", "issue 84".
+category: testing
+color: red
+displayName: Chronos Embedded Durability Test
+---
+
+# Chronos Embedded Durability Test
+
+Tests the `allsource-core` Rust crate's durability guarantees directly — no Docker, no HTTP, no MCP. Exercises the exact code paths that run when `EmbeddedCore::open` is called on a directory with existing WAL/Parquet data.
+
+## Why this exists
+
+Issue [#84](https://github.com/all-source-os/all-source/issues/84) revealed a class of bug where:
+1. WAL recovery loads events into memory correctly
+2. `flush_storage()` returns `Ok(())` as a **silent no-op** (empty Parquet batch)
+3. WAL is truncated unconditionally after "successful" checkpoint
+4. Events exist only in memory → process exit → data loss
+
+The existing Docker durability test (`chronos-durability`) doesn't catch this because it tests the HTTP server path, not the embedded library path. The embedded data flow test uses clean shutdown, which works fine — the bug only manifests on **unclean restart**.
+
+## When invoked:
+
+1. Run the targeted Rust integration tests:
+   ```bash
+   cd apps/core
+   cargo test --test integration_tests test_wal_durability_and_recovery -- --nocapture
+   cargo test --test embedded_core_api --features embedded events_survive -- --nocapture
+   ```
+
+2. If those pass, run the full embedded durability suite:
+   ```bash
+   cd apps/core
+   cargo test --test embedded_core_api --features embedded -- --nocapture
+   cargo test --test integration_tests -- --nocapture
+   ```
+
+3. If the `durability_status()` API is available, run the #84 regression suite:
+   ```bash
+   bash tooling/embedded-durability-test/test-embedded-durability.sh --status
+   ```
+
+4. For the full suite including all scenarios:
+   ```bash
+   bash tooling/embedded-durability-test/test-embedded-durability.sh
+   ```
+
+5. Analyze the output and report:
+   - Which tests passed/failed
+   - Whether the WAL recovery checkpoint bug (#84 pattern) is present
+   - Whether events survive unclean shutdown (drop without `shutdown()`)
+   - Whether Parquet files are actually written during checkpoint-on-open
+
+## Test Matrix
+
+The script and Rust tests cover these scenarios:
+
+| Scenario | What's Tested | Bug #84 Relevant? |
+|----------|--------------|-------------------|
+| **Clean shutdown + reopen** | `shutdown()` → drop → `open()` → query | No (this path works) |
+| **Drop without shutdown + reopen** | drop (no `shutdown()`) → `open()` → query | **YES — this is the #84 path** |
+| **WAL-only recovery** | Write events, skip Parquet flush, reopen | **YES** |
+| **Parquet-only recovery** | Flush to Parquet, delete WAL, reopen | No (different path) |
+| **WAL + Parquet recovery** | Both exist on disk, reopen | Partially |
+| **Checkpoint verification** | After recovery, assert Parquet files exist on disk | **YES — catches silent no-op** |
+| **durability_status() after ingest** | `memory_events > 0, wal_entries > 0, durable == true` | Validates fsync is working |
+| **durability_status() after recovery** | `parquet_pending_batch == memory_events` | **YES — the exact #84 invariant** |
+| **durability_status() warns on memory-only** | `warnings.len() > 0` for dangerous state | Runtime detection of #84 |
+| **durability_status() unclean restart** | Drop → reopen → `durable=true, no warnings` | **End-to-end #84 regression** |
+| **Large volume recovery** | 1000+ events, drop, reopen, verify count | Stress variant of #84 |
+| **Concurrent write + kill** | Spawn writer tasks, kill mid-write, reopen | Edge case |
+
+## Key Assertions (what distinguishes this from other durability tests)
+
+1. **After unclean restart, `query()` returns all events** — not just the ones that were in Parquet before the crash
+2. **After recovery checkpoint, Parquet files exist on disk** — `ls storage/` is not empty
+3. **After recovery checkpoint, WAL can be safely truncated** — events are in Parquet, not just memory
+4. **`flush_storage()` after WAL recovery writes > 0 bytes** — catches the silent no-op
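Assertions 2 and 4 can be approximated from the shell by checking that the storage directory holds at least one non-empty Parquet file. The sketch below assumes files carry a `.parquet` extension and live somewhere under the directory passed in; `assert_parquet_written` is a hypothetical helper, not something the test script is known to define.

```shell
# Sketch: succeed only if DIR contains at least one non-empty *.parquet file.
# The .parquet extension and the directory layout are assumptions.
assert_parquet_written() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "storage directory missing: $dir" >&2
    return 1
  fi
  # -size +0c matches files larger than zero bytes, so an empty placeholder
  # file left behind by a no-op flush does not count.
  found=$(find "$dir" -type f -name '*.parquet' -size +0c | head -n 1)
  if [ -z "$found" ]; then
    echo "no non-empty Parquet file under $dir (possible silent no-op flush)" >&2
    return 1
  fi
  return 0
}
```

A check like this only makes sense after the recovery checkpoint has run and before the WAL is inspected, since it says nothing about whether the WAL was truncated.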
+
+## `durability_status()` API (#84 regression tests)
+
+`EmbeddedCore::durability_status()` is being added in separate work. It returns the internal state
+that caused #84, and the `--status` flag exercises this API through the 5 invariant checks listed after this example payload:
+
+```json
+{
+  "memory_events": 63,
+  "wal_entries": 63,
+  "wal_bytes": 91000,
+  "wal_sequence": 63,
+  "parquet_files": 0,
+  "parquet_bytes": 0,
+  "parquet_pending_batch": 63,
+  "durable": false,
+  "warnings": ["63 events in memory but 0 in Parquet and 0 in WAL — data loss on restart"]
+}
+```
+
+| Test | Invariant Checked | #84 Signal |
+|------|-------------------|------------|
+| `durability_status_after_ingest` | `memory_events > 0 && wal_entries > 0 && durable == true` | If `wal_entries == 0`, fsync isn't working |
+| `durability_status_after_flush` | `parquet_files > 0 && parquet_pending_batch == 0` | Parquet actually wrote to disk |
+| `durability_status_after_recovery` | `memory_events == wal_entries && parquet_pending_batch == memory_events && durable == true` | **The #84 check**: after WAL recovery, events must be in `parquet_pending_batch` before truncation |
+| `durability_status_warns_on_memory_only` | `warnings.len() > 0` when events in memory but not WAL/Parquet | Would have flagged #84 at runtime |
+| `durability_status_survives_unclean_restart` | Drop (no shutdown) → reopen → `durable == true && warnings.is_empty()` | End-to-end #84 regression |
+
+Run with:
+```bash
+bash tooling/embedded-durability-test/test-embedded-durability.sh --status
+```
+
+These tests expect corresponding `#[tokio::test]` functions in `apps/core/tests/embedded_core_api.rs`
+that call `core.durability_status()` and assert the invariants above. The Rust tests are the source of truth;
+this shell script just runs them and reports results.
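For readers who want the recovery invariant spelled out, here is a rough shell restatement of the `durability_status_after_recovery` row. It takes the four field values as plain arguments rather than parsing JSON, and `check_recovery_invariant` is a hypothetical name; the Rust tests remain the source of truth.

```shell
# Sketch: check the post-recovery invariant from the table above.
# Usage: check_recovery_invariant MEMORY_EVENTS WAL_ENTRIES PARQUET_PENDING_BATCH DURABLE
check_recovery_invariant() {
  memory_events="$1"
  wal_entries="$2"
  pending_batch="$3"
  durable="$4"
  # Everything recovered from the WAL must be in memory.
  if [ "$memory_events" -ne "$wal_entries" ]; then
    echo "memory_events ($memory_events) != wal_entries ($wal_entries)" >&2
    return 1
  fi
  # Recovered events must also be queued for Parquet before the WAL is truncated.
  if [ "$pending_batch" -ne "$memory_events" ]; then
    echo "parquet_pending_batch ($pending_batch) != memory_events ($memory_events)" >&2
    return 1
  fi
  if [ "$durable" != "true" ]; then
    echo "durable is $durable, expected true" >&2
    return 1
  fi
  return 0
}
```

With the example payload shown earlier (`durable: false`), this check fails on the last condition, which is the intended behaviour for a non-durable state.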
+
+## Common Failure Patterns
+
+- **"events lost after restart"**: The #84 bug — WAL events loaded into memory but not checkpointed to Parquet before WAL truncation
+- **"storage/ directory empty after recovery"**: Same root cause — `flush_storage()` is a no-op because `current_batch` was never populated from WAL recovery
+- **"test_wal_durability_and_recovery fails"**: The EventStore-level recovery path has the bug
+- **"events_survive_store_restart_via_wal passes but unclean test fails"**: Clean shutdown works (events go through `ingest()` → `append_event()` → `current_batch`), but recovery path doesn't populate `current_batch`
+
+## Relationship to Other Skills
+
+| Skill | Scope | Catches #84? |
+|-------|-------|-------------|
+| `chronos-durability` | Docker container restart (HTTP path) | No |
+| `chronos-data-flow` | Docker stack connectivity | No |
+| `chronos-data-flow-embedded` | MCP embedded backend (clean shutdown) | No |
+| **`chronos-embedded-durability`** | **Rust crate crash recovery (unclean shutdown)** | **Yes** |
+
+## Running Manually
+
+```bash
+# Quick: just the #84-relevant tests
+bash tooling/embedded-durability-test/test-embedded-durability.sh --quick
+
+# durability_status() API regression tests only
+bash tooling/embedded-durability-test/test-embedded-durability.sh --status
+
+# Full: all embedded durability tests
+bash tooling/embedded-durability-test/test-embedded-durability.sh
+
+# Or run cargo tests directly:
+cd apps/core
+cargo test --test integration_tests test_wal_durability -- --nocapture
+cargo test --test embedded_core_api --features embedded durability_status -- --nocapture
+cargo test --test embedded_core_api --features embedded events_survive -- --nocapture
+```
@@ -52,17 +52,37 @@ Check for any other version references that `set-version` might miss:
 
 ### 4. Run CI to green
 
+**IMPORTANT: Run the three quality gates in PARALLEL, not sequentially.**
+
+`make ci` runs Rust → Go → Elixir sequentially, which takes 10+ minutes per iteration. Instead, launch all three as background tasks:
+
+```bash
+# Launch all three in parallel (use run_in_background for each)
+make quality-rust 2>&1 | tail -5      # ~3-4 min (clippy + test + doc)
+make quality-go 2>&1 | tail -5        # ~30 sec
+make quality-elixir-full 2>&1 | tail -10  # ~4-5 min (dialyzer + tests)
+```
+
+Wait for all three to complete, then check results. This cuts CI time from 10+ min to roughly 4-5 min per iteration, bounded by the slowest gate.
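Outside an agent that offers `run_in_background`, the same parallel launch can be sketched in plain shell with `&` and `wait`. `run_parallel` is a hypothetical helper, and the log file names in the usage comment are assumptions. One caveat with the commands as written above: piping to `tail` makes the pipeline report `tail`'s exit status rather than `make`'s, so this sketch redirects to log files instead.

```shell
# Sketch: run each command string concurrently and fail if any of them fails.
run_parallel() {
  pids=""
  failed=0
  for cmd in "$@"; do
    # Each command runs in its own background shell.
    sh -c "$cmd" &
    pids="$pids $!"
  done
  for pid in $pids; do
    # wait PID returns that job's exit status.
    wait "$pid" || failed=1
  done
  return "$failed"
}

# Usage (commented out so the sketch has no side effects when sourced):
# run_parallel \
#   "make quality-rust > rust.log 2>&1" \
#   "make quality-go > go.log 2>&1" \
#   "make quality-elixir-full > elixir.log 2>&1"
```

After it returns, `tail` each log to see the summary lines; a non-zero return means at least one gate failed and only that gate needs re-running.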
+
+**For targeted re-checks after fixes**, only re-run the affected gate:
+- Rust fix → `make quality-rust`
+- Elixir fix → `make quality-elixir-full`
+- Go fix → `make quality-go`
+
+**Pre-fix common issues before running gates** to minimize iterations:
 ```bash
-make ci
+# Always run these before the first CI attempt:
+cargo +nightly fmt --all
+cargo +nightly sort --workspace
+cd apps/mcp-server-elixir && mix format && cd ../query-service && mix format && cd ../..
 ```
 
-If CI fails, fix all issues iteratively:
-- **Rust**: `cargo +nightly fmt`, `cargo +nightly sort`, clippy fixes, doc link fixes
+If CI fails, fix issues and re-run only the failing gate(s):
+- **Rust**: `cargo +nightly fmt --all`, `cargo +nightly sort --workspace`, clippy fixes, doc link fixes
 - **Go**: `gofmt`, golangci-lint fixes
 - **Elixir**: `mix format`, `mix deps.unlock --unused`, credo fixes, test fixes
 
-Re-run `make ci` after each round of fixes until it passes.
-
 ### 5. Commit (single squashed commit)
 
 Stage all changes and create exactly ONE commit: