|
| 1 | +--- |
| 2 | +name: chronos-embedded-durability |
| 3 | +description: Test AllSource Core embedded durability by running Rust integration tests that verify WAL recovery, crash recovery, and checkpoint correctness. Catches silent-no-op checkpoint bugs like issue #84. Triggers on "test embedded durability", "embedded crash recovery", "test wal recovery", "test embedded persistence", "verify embedded durability", "issue 84". |
| 4 | +category: testing |
| 5 | +color: red |
| 6 | +displayName: Chronos Embedded Durability Test |
| 7 | +--- |
| 8 | + |
| 9 | +# Chronos Embedded Durability Test |
| 10 | + |
| 11 | +Tests the `allsource-core` Rust crate's durability guarantees directly — no Docker, no HTTP, no MCP. Exercises the exact code paths that run when `EmbeddedCore::open` is called on a directory with existing WAL/Parquet data. |
| 12 | + |
| 13 | +## Why this exists |
| 14 | + |
| 15 | +Issue [#84](https://github.com/all-source-os/all-source/issues/84) revealed a class of bug where: |
| 16 | +1. WAL recovery loads events into memory correctly |
| 17 | +2. `flush_storage()` returns `Ok(())` as a **silent no-op** (empty Parquet batch) |
| 18 | +3. WAL is truncated unconditionally after "successful" checkpoint |
| 19 | +4. Events exist only in memory → process exit → data loss |
| 20 | + |
| 21 | +The existing Docker durability test (`chronos-durability`) doesn't catch this because it exercises the HTTP server path, not the embedded library path. The embedded data flow test (`chronos-data-flow-embedded`) uses clean shutdown, which works; the bug manifests only on **unclean restart**. |
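
The four steps above can be illustrated with a toy in-memory model. This is a minimal sketch, not the crate's implementation: `ToyStore`, its fields, and both recovery functions are hypothetical names chosen for illustration, and plain vectors stand in for the on-disk WAL and Parquet files.

```rust
// Toy model of the #84 failure mode. Every name here is hypothetical and does
// not mirror allsource-core internals; Vecs stand in for on-disk state.
#[derive(Default)]
struct ToyStore {
    wal: Vec<String>,           // stands in for the on-disk WAL
    parquet: Vec<String>,       // stands in for on-disk Parquet files
    memory: Vec<String>,        // in-memory events served to queries
    current_batch: Vec<String>, // events waiting to be flushed to Parquet
}

impl ToyStore {
    /// Normal ingest path: the event reaches the WAL, memory, and the pending batch.
    fn ingest(&mut self, event: &str) {
        self.wal.push(event.to_string());
        self.memory.push(event.to_string());
        self.current_batch.push(event.to_string());
    }

    /// Flush the pending batch and return how many events were written.
    /// An empty batch is not an error, which is what makes a no-op silent.
    fn flush_storage(&mut self) -> usize {
        let written = self.current_batch.len();
        self.parquet.append(&mut self.current_batch);
        written
    }

    /// Simulate an unclean restart: only the "disk" state survives.
    fn crash(self) -> ToyStore {
        ToyStore { wal: self.wal, parquet: self.parquet, ..Default::default() }
    }

    /// Buggy recovery: replay into memory only, flush nothing, truncate anyway.
    fn recover_buggy(&mut self) {
        self.memory = self.parquet.iter().chain(self.wal.iter()).cloned().collect();
        let _written = self.flush_storage(); // current_batch is empty, so this writes 0
        self.wal.clear(); // unconditional truncation loses the only durable copy
    }

    /// Guarded recovery: also fill the pending batch, and truncate the WAL
    /// only if the flush wrote every replayed event.
    fn recover_guarded(&mut self) {
        self.memory = self.parquet.iter().chain(self.wal.iter()).cloned().collect();
        self.current_batch = self.wal.clone();
        let written = self.flush_storage();
        if written == self.wal.len() {
            self.wal.clear();
        }
    }

    /// Events visible to queries right now.
    fn query(&self) -> &[String] {
        &self.memory
    }

    /// Whether an event would survive another crash.
    fn on_disk(&self, event: &str) -> bool {
        self.wal.iter().any(|e| e == event) || self.parquet.iter().any(|e| e == event)
    }
}
```

After `ingest` → `crash` → `recover_buggy`, `query()` still returns the events, which is why the bug is easy to miss, but `on_disk` is false for each of them. With `recover_guarded` the same sequence leaves every event in the Parquet stand-in before the WAL is cleared.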
| 22 | + |
| 23 | +## When invoked: |
| 24 | + |
| 25 | +1. Run the targeted Rust integration tests: |
| 26 | + ```bash |
| 27 | + cd apps/core |
| 28 | + cargo test --test integration_tests test_wal_durability_and_recovery -- --nocapture |
| 29 | + cargo test --test embedded_core_api --features embedded events_survive -- --nocapture |
| 30 | + ``` |
| 31 | + |
| 32 | +2. If those pass, run the full embedded durability suite: |
| 33 | + ```bash |
| 34 | + cd apps/core |
| 35 | + cargo test --test embedded_core_api --features embedded -- --nocapture |
| 36 | + cargo test --test integration_tests -- --nocapture |
| 37 | + ``` |
| 38 | + |
| 39 | +3. If the `durability_status()` API is available, run the #84 regression suite: |
| 40 | + ```bash |
| 41 | + bash tooling/embedded-durability-test/test-embedded-durability.sh --status |
| 42 | + ``` |
| 43 | + |
| 44 | +4. For the full suite including all scenarios: |
| 45 | + ```bash |
| 46 | + bash tooling/embedded-durability-test/test-embedded-durability.sh |
| 47 | + ``` |
| 48 | + |
| 49 | +5. Analyze the output and report: |
| 50 | + - Which tests passed/failed |
| 51 | + - Whether the WAL recovery checkpoint bug (#84 pattern) is present |
| 52 | + - Whether events survive unclean shutdown (drop without `shutdown()`) |
| 53 | + - Whether Parquet files are actually written during checkpoint-on-open |
| 54 | + |
| 55 | +## Test Matrix |
| 56 | + |
| 57 | +The script and Rust tests cover these scenarios: |
| 58 | + |
| 59 | +| Scenario | What's Tested | Bug #84 Relevant? | |
| 60 | +|----------|--------------|-------------------| |
| 61 | +| **Clean shutdown + reopen** | `shutdown()` → drop → `open()` → query | No (this path works) | |
| 62 | +| **Drop without shutdown + reopen** | drop (no `shutdown()`) → `open()` → query | **YES — this is the #84 path** | |
| 63 | +| **WAL-only recovery** | Write events, skip Parquet flush, reopen | **YES** | |
| 64 | +| **Parquet-only recovery** | Flush to Parquet, delete WAL, reopen | No (different path) | |
| 65 | +| **WAL + Parquet recovery** | Both exist on disk, reopen | Partially | |
| 66 | +| **Checkpoint verification** | After recovery, assert Parquet files exist on disk | **YES — catches silent no-op** | |
| 67 | +| **durability_status() after ingest** | `memory_events > 0, wal_entries > 0, durable == true` | Validates fsync is working | |
| 68 | +| **durability_status() after recovery** | `parquet_pending_batch == memory_events` | **YES — the exact #84 invariant** | |
| 69 | +| **durability_status() warns on memory-only** | `warnings.len() > 0` for dangerous state | Runtime detection of #84 | |
| 70 | +| **durability_status() unclean restart** | Drop → reopen → `durable=true, no warnings` | **End-to-end #84 regression** | |
| 71 | +| **Large volume recovery** | 1000+ events, drop, reopen, verify count | Stress variant of #84 | |
| 72 | +| **Concurrent write + kill** | Spawn writer tasks, kill mid-write, reopen | Edge case | |
| 73 | + |
| 74 | +## Key Assertions (what distinguishes this from other durability tests) |
| 75 | + |
| 76 | +1. **After unclean restart, `query()` returns all events** — not just the ones that were in Parquet before the crash |
| 77 | +2. **After recovery checkpoint, Parquet files exist on disk** — `ls storage/` is not empty |
| 78 | +3. **After recovery checkpoint, WAL can be safely truncated** — events are in Parquet, not just memory |
| 79 | +4. **`flush_storage()` after WAL recovery writes > 0 bytes** — catches the silent no-op |
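
Assertions 2 and 4 can be checked from outside the crate by inspecting the storage directory. The helper below is a sketch under assumptions: the function names are hypothetical, and it assumes checkpoint output is written as files with a `.parquet` extension directly inside the directory passed in (subdirectories are not searched).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Count the `*.parquet` files directly inside `dir` and sum their sizes.
/// Returns (file_count, total_bytes). The `.parquet` extension is an assumption.
fn parquet_footprint(dir: &Path) -> io::Result<(usize, u64)> {
    let mut files = 0;
    let mut bytes = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_parquet = path.extension().map_or(false, |ext| ext == "parquet");
        if is_parquet && path.is_file() {
            files += 1;
            bytes += entry.metadata()?.len();
        }
    }
    Ok((files, bytes))
}

/// Treat a checkpoint that left zero files or zero bytes as a failure,
/// so a silent no-op flush cannot pass unnoticed.
fn assert_checkpoint_wrote_parquet(storage_dir: &Path) -> io::Result<()> {
    let (files, bytes) = parquet_footprint(storage_dir)?;
    if files == 0 || bytes == 0 {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!(
                "checkpoint wrote nothing: {} parquet files, {} bytes in {}",
                files,
                bytes,
                storage_dir.display()
            ),
        ));
    }
    Ok(())
}
```

A test would call `assert_checkpoint_wrote_parquet` on the storage directory immediately after reopening, before asserting anything about the WAL.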
| 80 | + |
| 81 | +## `durability_status()` API (#84 regression tests) |
| 82 | + |
| 83 | +`EmbeddedCore::durability_status()` is being added in parallel work; it returns the internal counters whose mismatch caused #84. |
| 84 | +The `--status` flag exercises this API through five invariant checks: |
| 85 | + |
| 86 | +```json |
| 87 | +{ |
| 88 | + "memory_events": 63, |
| 89 | + "wal_entries": 63, |
| 90 | + "wal_bytes": 91000, |
| 91 | + "wal_sequence": 63, |
| 92 | + "parquet_files": 0, |
| 93 | + "parquet_bytes": 0, |
| 94 | + "parquet_pending_batch": 63, |
| 95 | + "durable": false, |
| 96 | + "warnings": ["63 events in memory but 0 in Parquet and 0 in WAL — data loss on restart"] |
| 97 | +} |
| 98 | +``` |
| 99 | + |
| 100 | +| Test | Invariant Checked | #84 Signal | |
| 101 | +|------|-------------------|------------| |
| 102 | +| `durability_status_after_ingest` | `memory_events > 0 && wal_entries > 0 && durable == true` | If `wal_entries == 0`, fsync isn't working | |
| 103 | +| `durability_status_after_flush` | `parquet_files > 0 && parquet_pending_batch == 0` | Parquet actually wrote to disk | |
| 104 | +| `durability_status_after_recovery` | `memory_events == wal_entries && parquet_pending_batch == memory_events && durable == true` | **The #84 check**: after WAL recovery, events must be in `parquet_pending_batch` before truncation | |
| 105 | +| `durability_status_warns_on_memory_only` | `warnings.len() > 0` when events in memory but not WAL/Parquet | Would have flagged #84 at runtime | |
| 106 | +| `durability_status_survives_unclean_restart` | Drop (no shutdown) → reopen → `durable == true && warnings.is_empty()` | End-to-end #84 regression | |
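
The invariants in this table can be written as predicates over the status fields. The sketch below is illustrative only: the `DurabilityStatus` struct mirrors the field names of the JSON sample above, but its Rust types and the method names are assumptions, and the real return type of `durability_status()` may differ. The "memory-only" condition is interpreted here as `wal_entries == 0 && parquet_files == 0`.

```rust
/// Hypothetical mirror of the JSON sample; field types are assumed.
#[derive(Debug, Clone, Default)]
struct DurabilityStatus {
    memory_events: u64,
    wal_entries: u64,
    wal_bytes: u64,
    wal_sequence: u64,
    parquet_files: u64,
    parquet_bytes: u64,
    parquet_pending_batch: u64,
    durable: bool,
    warnings: Vec<String>,
}

impl DurabilityStatus {
    /// After ingest: events are in memory and in the WAL, and the store reports durable.
    fn ok_after_ingest(&self) -> bool {
        self.memory_events > 0 && self.wal_entries > 0 && self.durable
    }

    /// After flush: at least one Parquet file exists and nothing is pending.
    fn ok_after_flush(&self) -> bool {
        self.parquet_files > 0 && self.parquet_pending_batch == 0
    }

    /// After WAL recovery: every replayed event is queued for Parquet.
    fn ok_after_recovery(&self) -> bool {
        self.memory_events == self.wal_entries
            && self.parquet_pending_batch == self.memory_events
            && self.durable
    }

    /// Dangerous state: events exist in memory with no WAL or Parquet backing.
    fn memory_only(&self) -> bool {
        self.memory_events > 0 && self.wal_entries == 0 && self.parquet_files == 0
    }

    /// A memory-only state must come with at least one warning.
    fn ok_warns_on_memory_only(&self) -> bool {
        !self.memory_only() || !self.warnings.is_empty()
    }

    /// After drop-without-shutdown and reopen: durable with no warnings.
    fn ok_after_unclean_restart(&self) -> bool {
        self.durable && self.warnings.is_empty()
    }
}
```

Keeping each invariant as a separate predicate lets a failing test name the exact condition that broke rather than reporting one combined assertion.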
| 107 | + |
| 108 | +Run with: |
| 109 | +```bash |
| 110 | +bash tooling/embedded-durability-test/test-embedded-durability.sh --status |
| 111 | +``` |
| 112 | + |
| 113 | +The script expects corresponding `#[tokio::test]` functions in `apps/core/tests/embedded_core_api.rs` |
| 114 | +that call `core.durability_status()` and assert the invariants above. The Rust tests are the source of truth; |
| 115 | +the shell script only runs them and reports results. |
| 116 | + |
| 117 | +## Common Failure Patterns |
| 118 | + |
| 119 | +- **"events lost after restart"**: The #84 bug — WAL events loaded into memory but not checkpointed to Parquet before WAL truncation |
| 120 | +- **"storage/ directory empty after recovery"**: Same root cause — `flush_storage()` is a no-op because `current_batch` was never populated from WAL recovery |
| 121 | +- **"test_wal_durability_and_recovery fails"**: The EventStore-level recovery path has the bug |
| 122 | +- **"events_survive_store_restart_via_wal passes but unclean test fails"**: Clean shutdown works (events go through `ingest()` → `append_event()` → `current_batch`), but recovery path doesn't populate `current_batch` |
| 123 | + |
| 124 | +## Relationship to Other Skills |
| 125 | + |
| 126 | +| Skill | Scope | Catches #84? | |
| 127 | +|-------|-------|-------------| |
| 128 | +| `chronos-durability` | Docker container restart (HTTP path) | No | |
| 129 | +| `chronos-data-flow` | Docker stack connectivity | No | |
| 130 | +| `chronos-data-flow-embedded` | MCP embedded backend (clean shutdown) | No | |
| 131 | +| **`chronos-embedded-durability`** | **Rust crate crash recovery (unclean shutdown)** | **Yes** | |
| 132 | + |
| 133 | +## Running Manually |
| 134 | + |
| 135 | +```bash |
| 136 | +# Quick: just the #84-relevant tests |
| 137 | +bash tooling/embedded-durability-test/test-embedded-durability.sh --quick |
| 138 | + |
| 139 | +# durability_status() API regression tests only |
| 140 | +bash tooling/embedded-durability-test/test-embedded-durability.sh --status |
| 141 | + |
| 142 | +# Full: all embedded durability tests |
| 143 | +bash tooling/embedded-durability-test/test-embedded-durability.sh |
| 144 | + |
| 145 | +# Or run cargo tests directly: |
| 146 | +cd apps/core |
| 147 | +cargo test --test integration_tests test_wal_durability -- --nocapture |
| 148 | +cargo test --test embedded_core_api --features embedded durability_status -- --nocapture |
| 149 | +cargo test --test embedded_core_api --features embedded events_survive -- --nocapture |
| 150 | +``` |