Commit 7133d44

decebal and claude committed
release: v0.13.1 — embedded Core backend, durability status, CI fixes
- feat: MCP embedded backend via Rustler NIF (CoreBackend behaviour, CoreClient/CoreEmbedded implementations, CORE_MODE=embedded|remote config switch)
- feat: CoreEmbedded.Supervisor lifecycle + SyncWorker for periodic cloud sync
- feat: durability_status API for Core health deep checks
- feat: new MCP tools — wal_status, storage_stats, partition_info, durability_status
- fix: collapsible_if clippy warnings in store.rs and embedded/core.rs
- fix: fold pipeline test assertion for snapshot threshold below check
- chore: parallel CI quality gates in release skill (3x faster)
- docs: proposals, vision doc, roadmaps, embedded data flow test tooling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 2c14c0a commit 7133d44

81 files changed: +7511 -350 lines changed

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
---
2+
name: chronos-data-flow-embedded
3+
description: Test Chronos MCP server with embedded Core backend (CORE_MODE=embedded). Verifies Rustler NIF prerequisites, event ingest/query through embedded backend, schema operations, stats/health, and data persistence across restarts. Triggers on "test embedded data flow", "test embedded", "embedded health", "test embedded backend", "verify embedded", "test nif".
4+
category: testing
5+
color: cyan
6+
displayName: Chronos Embedded Data Flow Test
7+
---
8+
9+
# Chronos Embedded Data Flow Test
10+
11+
Tests the MCP server running with an embedded Core backend — no separate Core container needed. Core runs in-process via Rustler NIFs (`CORE_MODE=embedded`).
12+
13+
## When invoked:
14+
15+
1. Run the embedded data flow test script:
16+
```bash
17+
bash tooling/embedded-data-flow-test/test-embedded-data-flow.sh
18+
```
19+
20+
2. Analyze the output and report:
21+
- Whether prerequisites are met (NIF compiled, CoreBackend behaviour, CoreEmbedded module)
22+
- Which tests passed, failed, or were skipped
23+
- Whether the embedded backend is fully functional or still in development
24+
- Root cause analysis for any failures
25+
26+
3. If specific sections were requested, use the appropriate flag:
27+
- `--prereqs` — check prerequisites only (no server start)
28+
- `--ingest` — event ingestion tests only
29+
- `--query` — event querying tests only
30+
- `--schema` — schema operations tests only
31+
- `--stats` — stats/health tests only
32+
- `--persist` — persistence test (restart cycle, opt-in)
33+
- `--json` — append for machine-readable output
34+
35+
4. If the user just wants to check implementation status, run with `--prereqs` only.
36+
37+
## Test Coverage
38+
39+
The script tests these embedded data flow paths:
40+
41+
| Test Area | What's Checked |
42+
|-----------|---------------|
43+
| **Prerequisites** | MCP dir, mix.exs, Elixir runtime, Rustler NIF source/binary, CoreBackend behaviour, CoreEmbedded module, CORE_MODE config |
44+
| **Server Start** | Start MCP server with `CORE_MODE=embedded`, MCP initialize handshake |
45+
| **Event Ingestion** | Ingest single event, ingest batch (3 events) via `ingest_event` tool |
46+
| **Event Querying** | Query by entity_id, query by event_type, query with limit, state reconstruction |
47+
| **Schema Ops** | Register schema, list schemas, get schema by subject |
48+
| **Stats/Health** | get_stats, storage_stats, wal_status, deep health check |
49+
| **Persistence** | Write 5 events, stop server, restart with same data dir, verify events survived |
50+
51+
## Architecture Context
52+
53+
In embedded mode, the data flow is:
54+
55+
```
56+
MCP Tool Call → CoreBackend behaviour → CoreEmbedded (Rustler NIF) → Core Rust Engine
57+
58+
WAL + Parquet + DashMap
59+
```
60+
61+
vs. remote mode (the Docker stack):
62+
63+
```
64+
MCP Tool Call → CoreBackend behaviour → CoreClient (HTTP) → Core Container (port 3900)
65+
```
66+
67+
Both backends implement the same `CoreBackend` behaviour (27 callbacks). The embedded test verifies the NIF path works identically to the HTTP path.
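The shared-interface idea can be illustrated with a small sketch. The real `CoreBackend` is an Elixir behaviour with 27 callbacks; the Rust trait below is only an analogy, and every name in it (`Embedded`, `Remote`, `select_backend`, the three methods) is a hypothetical stand-in rather than the project's API. Both stand-ins keep events in memory so that the dispatch shape, not the transport, is what the example shows.

```rust
/// Hypothetical, heavily reduced stand-in for the `CoreBackend` behaviour.
pub trait CoreBackend {
    fn mode(&self) -> &'static str;
    fn ingest_event(&mut self, entity_id: &str, event_type: &str);
    fn query_by_entity(&self, entity_id: &str) -> Vec<String>;
}

/// (entity_id, event_type) pairs held in memory for the sketch.
type Events = Vec<(String, String)>;

/// Returns the event types recorded for one entity, in insertion order.
fn matching(events: &Events, entity_id: &str) -> Vec<String> {
    events
        .iter()
        .filter(|(id, _)| id == entity_id)
        .map(|(_, ty)| ty.clone())
        .collect()
}

/// Stand-in for the in-process (NIF) path.
#[derive(Default)]
pub struct Embedded {
    events: Events,
}

/// Stand-in for the HTTP path; a real client would call the Core container instead.
#[derive(Default)]
pub struct Remote {
    events: Events,
}

impl CoreBackend for Embedded {
    fn mode(&self) -> &'static str {
        "embedded"
    }
    fn ingest_event(&mut self, entity_id: &str, event_type: &str) {
        self.events.push((entity_id.to_string(), event_type.to_string()));
    }
    fn query_by_entity(&self, entity_id: &str) -> Vec<String> {
        matching(&self.events, entity_id)
    }
}

impl CoreBackend for Remote {
    fn mode(&self) -> &'static str {
        "remote"
    }
    fn ingest_event(&mut self, entity_id: &str, event_type: &str) {
        self.events.push((entity_id.to_string(), event_type.to_string()));
    }
    fn query_by_entity(&self, entity_id: &str) -> Vec<String> {
        matching(&self.events, entity_id)
    }
}

/// Picks a backend from a CORE_MODE-style value; callers only see the trait.
pub fn select_backend(core_mode: &str) -> Result<Box<dyn CoreBackend>, String> {
    match core_mode {
        "embedded" => Ok(Box::new(Embedded::default())),
        "remote" => Ok(Box::new(Remote::default())),
        other => Err(format!("unsupported CORE_MODE: {other}")),
    }
}
```

Because tool handlers would depend only on the trait, the same test sequence can be run against either backend and the results compared.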
68+
69+
## Implementation Status
70+
71+
The embedded backend is being built in phases (see `docs/proposals/MCP_EMBEDDED_BACKEND.md`):
72+
73+
| Phase | Component | Status |
74+
|-------|-----------|--------|
75+
| 1 | CoreBackend behaviour | Done |
76+
| 1 | CoreClient implements @behaviour | Done |
77+
| 1 | mcp_tools.ex uses state.backend | Not started |
78+
| 1 | server.ex stores backend in state | Not started |
79+
| 2 | CoreEmbedded NIF module | Not started |
80+
| 2 | native/core_nif/ Rustler crate | Not started |
81+
| 3 | CORE_MODE config switch | Not started |
82+
| 4 | application.ex conditional children | Not started |
83+
84+
Run `--prereqs` to get a live status check of which components exist.
85+
86+
## Common Failure Patterns
87+
88+
- **"NIF not compiled"**: Expected during development. Run `--prereqs` to verify implementation status.
89+
- **"Cannot start MCP server"**: The CoreEmbedded module or CORE_MODE config switch isn't implemented yet.
90+
- **"Events not found after restart"**: WAL/Parquet persistence in the embedded backend isn't wired up — check that `ALLSOURCE_DATA_DIR` is being passed to the Rust engine via NIF.
91+
- **"Timeout on tool call"**: The NIF is blocking a BEAM scheduler thread. Ensure every long-running Rustler function is annotated with `#[rustler::nif(schedule = "DirtyCpu")]` (or `schedule = "DirtyIo"` for I/O-bound calls).
92+
93+
## Environment Variables
94+
95+
```bash
96+
# Override defaults
97+
MCP_DIR=/path/to/mcp-server-elixir \
98+
DATA_DIR=/tmp/my-test-data \
99+
bash tooling/embedded-data-flow-test/test-embedded-data-flow.sh
100+
```
Lines changed: 150 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,150 @@
1+
---
2+
name: chronos-embedded-durability
3+
description: Test AllSource Core embedded durability by running Rust integration tests that verify WAL recovery, crash recovery, and checkpoint correctness. Catches silent-no-op checkpoint bugs like issue #84. Triggers on "test embedded durability", "embedded crash recovery", "test wal recovery", "test embedded persistence", "verify embedded durability", "issue 84".
4+
category: testing
5+
color: red
6+
displayName: Chronos Embedded Durability Test
7+
---
8+
9+
# Chronos Embedded Durability Test
10+
11+
Tests the `allsource-core` Rust crate's durability guarantees directly — no Docker, no HTTP, no MCP. Exercises the exact code paths that run when `EmbeddedCore::open` is called on a directory with existing WAL/Parquet data.
12+
13+
## Why this exists
14+
15+
Issue [#84](https://github.com/all-source-os/all-source/issues/84) revealed a class of bug where:
16+
1. WAL recovery loads events into memory correctly
17+
2. `flush_storage()` returns `Ok(())` as a **silent no-op** (empty Parquet batch)
18+
3. WAL is truncated unconditionally after "successful" checkpoint
19+
4. Events exist only in memory → process exit → data loss
20+
21+
The existing Docker durability test (`chronos-durability`) doesn't catch this because it tests the HTTP server path, not the embedded library path. The embedded data flow test uses clean shutdown, which works fine — the bug only manifests on **unclean restart**.
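The four-step failure sequence above can be reproduced with a toy model. This is a minimal sketch, not the `allsource-core` implementation: the `Store` type, its fields, and both recovery functions are hypothetical, and `flush_storage` here returns a written-event count (the real function is described as returning `Ok(())`) so that the guard has something to check.

```rust
/// Toy model of the storage layers involved in #84 (all names are illustrative).
#[derive(Default)]
pub struct Store {
    pub memory: Vec<String>,        // events visible to queries
    pub wal: Vec<String>,           // stand-in for the on-disk WAL
    pub current_batch: Vec<String>, // events waiting to be written to Parquet
    pub parquet: Vec<String>,       // stand-in for Parquet files on disk
}

impl Store {
    /// Builds a store whose only durable state is a WAL with the given events.
    pub fn with_wal(events: &[&str]) -> Self {
        Store {
            wal: events.iter().map(|e| e.to_string()).collect(),
            ..Store::default()
        }
    }

    /// Moves the pending batch into "Parquet" and reports how many events were written.
    pub fn flush_storage(&mut self) -> usize {
        let written = self.current_batch.len();
        self.parquet.append(&mut self.current_batch);
        written
    }

    /// Recovery as described in #84: memory is filled, `current_batch` is not.
    pub fn recover_buggy(&mut self) {
        self.memory = self.wal.clone(); // step 1: WAL loaded into memory
        self.flush_storage();           // step 2: empty batch, silent no-op
        self.wal.clear();               // step 3: unconditional truncation
    }

    /// Guarded recovery: populate the batch, and truncate only after a full write.
    pub fn recover_guarded(&mut self) {
        self.memory = self.wal.clone();
        self.current_batch = self.wal.clone();
        let written = self.flush_storage();
        if written == self.wal.len() {
            self.wal.clear();
        }
    }

    /// Number of events that would still exist after the process exits.
    pub fn durable_events(&self) -> usize {
        self.wal.len() + self.parquet.len()
    }
}
```

In this model the buggy path leaves queries working (memory is populated) while nothing remains on disk, which is why a clean-shutdown test does not expose the problem.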
22+
23+
## When invoked:
24+
25+
1. Run the targeted Rust integration tests:
26+
```bash
27+
cd apps/core
28+
cargo test --test integration_tests test_wal_durability_and_recovery -- --nocapture
29+
cargo test --test embedded_core_api --features embedded events_survive -- --nocapture
30+
```
31+
32+
2. If those pass, run the full embedded durability suite:
33+
```bash
34+
cd apps/core
35+
cargo test --test embedded_core_api --features embedded -- --nocapture
36+
cargo test --test integration_tests -- --nocapture
37+
```
38+
39+
3. If the `durability_status()` API is available, run the #84 regression suite:
40+
```bash
41+
bash tooling/embedded-durability-test/test-embedded-durability.sh --status
42+
```
43+
44+
4. For the full suite including all scenarios:
45+
```bash
46+
bash tooling/embedded-durability-test/test-embedded-durability.sh
47+
```
48+
49+
5. Analyze the output and report:
50+
- Which tests passed/failed
51+
- Whether the WAL recovery checkpoint bug (#84 pattern) is present
52+
- Whether events survive unclean shutdown (drop without `shutdown()`)
53+
- Whether Parquet files are actually written during checkpoint-on-open
54+
55+
## Test Matrix
56+
57+
The script and Rust tests cover these scenarios:
58+
59+
| Scenario | What's Tested | Bug #84 Relevant? |
60+
|----------|--------------|-------------------|
61+
| **Clean shutdown + reopen** | `shutdown()` → drop → `open()` → query | No (this path works) |
62+
| **Drop without shutdown + reopen** | drop (no `shutdown()`) → `open()` → query | **YES — this is the #84 path** |
63+
| **WAL-only recovery** | Write events, skip Parquet flush, reopen | **YES** |
64+
| **Parquet-only recovery** | Flush to Parquet, delete WAL, reopen | No (different path) |
65+
| **WAL + Parquet recovery** | Both exist on disk, reopen | Partially |
66+
| **Checkpoint verification** | After recovery, assert Parquet files exist on disk | **YES — catches silent no-op** |
67+
| **durability_status() after ingest** | `memory > 0, wal > 0, durable=true` | Validates fsync is working |
68+
| **durability_status() after recovery** | `parquet_pending_batch == memory_events` | **YES — the exact #84 invariant** |
69+
| **durability_status() warns on memory-only** | `warnings.len() > 0` for dangerous state | Runtime detection of #84 |
70+
| **durability_status() unclean restart** | Drop → reopen → `durable=true, no warnings` | **End-to-end #84 regression** |
71+
| **Large volume recovery** | 1000+ events, drop, reopen, verify count | Stress variant of #84 |
72+
| **Concurrent write + kill** | Spawn writer tasks, kill mid-write, reopen | Edge case |
73+
74+
## Key Assertions (what distinguishes this from other durability tests)
75+
76+
1. **After unclean restart, `query()` returns all events** — not just the ones that were in Parquet before the crash
77+
2. **After recovery checkpoint, Parquet files exist on disk**: `ls storage/` is not empty
78+
3. **After recovery checkpoint, WAL can be safely truncated** — events are in Parquet, not just memory
79+
4. **`flush_storage()` after WAL recovery writes > 0 bytes** — catches the silent no-op
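Assertion 2 can be checked from a test without shelling out to `ls`. The helper below is a sketch under two assumptions that the source does not confirm: that Parquet files carry a `.parquet` extension, and that they sit directly inside the storage directory (a partitioned layout would need a recursive walk). The function name is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Counts regular files ending in `.parquet` directly inside `storage_dir`.
/// Assumes a flat directory layout; nested partitions are not visited.
pub fn count_parquet_files(storage_dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(storage_dir)? {
        let path = entry?.path();
        let is_parquet = path.extension().map_or(false, |ext| ext == "parquet");
        if path.is_file() && is_parquet {
            count += 1;
        }
    }
    Ok(count)
}
```

A recovery test could assert that this count is greater than zero before allowing the WAL to be treated as safely truncatable.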
80+
81+
## `durability_status()` API (#84 regression tests)
82+
83+
`EmbeddedCore::durability_status()`, which is being added in a separate workstream, returns the exact internal state
84+
that caused #84. The `--status` flag exercises this API through 5 invariant checks. An example status payload:
85+
86+
```json
87+
{
88+
"memory_events": 63,
89+
"wal_entries": 63,
90+
"wal_bytes": 91000,
91+
"wal_sequence": 63,
92+
"parquet_files": 0,
93+
"parquet_bytes": 0,
94+
"parquet_pending_batch": 63,
95+
"durable": false,
96+
"warnings": ["63 events in memory but 0 in Parquet and 0 in WAL — data loss on restart"]
97+
}
98+
```
99+
100+
| Test | Invariant Checked | #84 Signal |
101+
|------|-------------------|------------|
102+
| `durability_status_after_ingest` | `memory_events > 0 && wal_entries > 0 && durable == true` | If `wal_entries == 0`, fsync isn't working |
103+
| `durability_status_after_flush` | `parquet_files > 0 && parquet_pending_batch == 0` | Parquet actually wrote to disk |
104+
| `durability_status_after_recovery` | `memory == wal && parquet_pending_batch == memory && durable == true` | **The #84 check** — after WAL recovery, events must be in `parquet_pending_batch` before truncation |
105+
| `durability_status_warns_on_memory_only` | `warnings.len() > 0` when events in memory but not WAL/Parquet | Would have flagged #84 at runtime |
106+
| `durability_status_survives_unclean_restart` | Drop (no shutdown) → reopen → `durable == true && warnings.is_empty()` | End-to-end #84 regression |
107+
108+
Run with:
109+
```bash
110+
bash tooling/embedded-durability-test/test-embedded-durability.sh --status
111+
```
112+
113+
These tests expect corresponding `#[tokio::test]` functions in `apps/core/tests/embedded_core_api.rs`
114+
that call `core.durability_status()` and assert the invariants above. The Rust tests are the source of truth;
115+
this shell script just runs them and reports results.
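The invariants in the table can be written as plain predicates over the status fields. This is a sketch only: the field names are taken from the JSON sample, but the `DurabilityStatus` struct, the helper names, and in particular the definition of "memory-only" used in `warns_when_memory_only` are assumptions, not the crate's actual types.

```rust
/// Subset of the fields shown in the JSON sample (struct shape is assumed).
#[derive(Debug, Clone, Default)]
pub struct DurabilityStatus {
    pub memory_events: u64,
    pub wal_entries: u64,
    pub parquet_files: u64,
    pub parquet_pending_batch: u64,
    pub durable: bool,
    pub warnings: Vec<String>,
}

/// `durability_status_after_ingest`: events are in memory and in the WAL.
pub fn holds_after_ingest(s: &DurabilityStatus) -> bool {
    s.memory_events > 0 && s.wal_entries > 0 && s.durable
}

/// `durability_status_after_flush`: Parquet was written and nothing is pending.
pub fn holds_after_flush(s: &DurabilityStatus) -> bool {
    s.parquet_files > 0 && s.parquet_pending_batch == 0
}

/// `durability_status_after_recovery`: recovered events are queued for Parquet.
pub fn holds_after_recovery(s: &DurabilityStatus) -> bool {
    s.memory_events == s.wal_entries
        && s.parquet_pending_batch == s.memory_events
        && s.durable
}

/// `durability_status_warns_on_memory_only`: a memory-only state must carry a warning.
/// "Memory-only" is assumed here to mean: events in memory, none in WAL or Parquet.
pub fn warns_when_memory_only(s: &DurabilityStatus) -> bool {
    let memory_only = s.memory_events > 0 && s.wal_entries == 0 && s.parquet_files == 0;
    !memory_only || !s.warnings.is_empty()
}

/// `durability_status_survives_unclean_restart`: durable and free of warnings.
pub fn holds_after_unclean_restart(s: &DurabilityStatus) -> bool {
    s.durable && s.warnings.is_empty()
}
```

Keeping each invariant as its own function means a failing test names the exact condition that broke, rather than reporting a single combined assertion.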
116+
117+
## Common Failure Patterns
118+
119+
- **"events lost after restart"**: The #84 bug — WAL events loaded into memory but not checkpointed to Parquet before WAL truncation
120+
- **"storage/ directory empty after recovery"**: Same root cause — `flush_storage()` is a no-op because `current_batch` was never populated from WAL recovery
121+
- **"test_wal_durability_and_recovery fails"**: The EventStore-level recovery path has the bug
122+
- **"events_survive_store_restart_via_wal passes but unclean test fails"**: Clean shutdown works (events go through `ingest()` → `append_event()` → `current_batch`), but the recovery path doesn't populate `current_batch`
123+
124+
## Relationship to Other Skills
125+
126+
| Skill | Scope | Catches #84? |
127+
|-------|-------|-------------|
128+
| `chronos-durability` | Docker container restart (HTTP path) | No |
129+
| `chronos-data-flow` | Docker stack connectivity | No |
130+
| `chronos-data-flow-embedded` | MCP embedded backend (clean shutdown) | No |
131+
| **`chronos-embedded-durability`** | **Rust crate crash recovery (unclean shutdown)** | **Yes** |
132+
133+
## Running Manually
134+
135+
```bash
136+
# Quick: just the #84-relevant tests
137+
bash tooling/embedded-durability-test/test-embedded-durability.sh --quick
138+
139+
# durability_status() API regression tests only
140+
bash tooling/embedded-durability-test/test-embedded-durability.sh --status
141+
142+
# Full: all embedded durability tests
143+
bash tooling/embedded-durability-test/test-embedded-durability.sh
144+
145+
# Or run cargo tests directly:
146+
cd apps/core
147+
cargo test --test integration_tests test_wal_durability -- --nocapture
148+
cargo test --test embedded_core_api --features embedded durability_status -- --nocapture
149+
cargo test --test embedded_core_api --features embedded events_survive -- --nocapture
150+
```

.claude/skills/chronos-release/SKILL.md

Lines changed: 25 additions & 5 deletions
@@ -52,17 +52,37 @@ Check for any other version references that `set-version` might miss:
5252

5353
### 4. Run CI to green
5454

55+
**IMPORTANT: Run the three quality gates in PARALLEL, not sequentially.**
56+
57+
`make ci` runs Rust → Go → Elixir sequentially, which takes 10+ minutes per iteration. Instead, launch all three as background tasks:
58+
59+
```bash
60+
# Launch all three in parallel (use run_in_background for each)
61+
make quality-rust 2>&1 | tail -5 # ~3-4 min (clippy + test + doc)
62+
make quality-go 2>&1 | tail -5 # ~30 sec
63+
make quality-elixir-full 2>&1 | tail -10 # ~4-5 min (dialyzer + tests)
64+
```
65+
66+
Wait for all three to complete, then check results. This cuts CI time from 10+ min to ~4 min per iteration.
67+
68+
**For targeted re-checks after fixes**, only re-run the affected gate:
69+
- Rust fix → `make quality-rust`
70+
- Elixir fix → `make quality-elixir-full`
71+
- Go fix → `make quality-go`
72+
73+
**Pre-fix common issues before running gates** to minimize iterations:
5574
```bash
56-
make ci
75+
# Always run these before the first CI attempt:
76+
cargo +nightly fmt --all
77+
cargo +nightly sort --workspace
78+
cd apps/mcp-server-elixir && mix format && cd ../query-service && mix format && cd ../..
5779
```
5880

59-
If CI fails, fix all issues iteratively:
60-
- **Rust**: `cargo +nightly fmt`, `cargo +nightly sort`, clippy fixes, doc link fixes
81+
If CI fails, fix issues and re-run only the failing gate(s):
82+
- **Rust**: `cargo +nightly fmt --all`, `cargo +nightly sort --workspace`, clippy fixes, doc link fixes
6183
- **Go**: `gofmt`, golangci-lint fixes
6284
- **Elixir**: `mix format`, `mix deps.unlock --unused`, credo fixes, test fixes
6385

64-
Re-run `make ci` after each round of fixes until it passes.
65-
6686
### 5. Commit (single squashed commit)
6787

6888
Stage all changes and create exactly ONE commit:

Cargo.lock

Lines changed: 52 additions & 2 deletions