Skip to content

Commit 66524a1

Browse files
fixes++
1 parent e011bff commit 66524a1

8 files changed

+454
-161
lines changed

.tasks/DELEGATE_WORK_HANDOFF.md

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,105 @@ Artifacts:
6868

6969
---
7070

71+
## Round 2025-12-22 (63) — Fix 3 critical divergences from Round 62 (3 tasks)
72+
73+
**Tasks executed**
74+
- `.tasks/done/TASK-184-fix-resurrection-via-sentinel.md`
75+
- `.tasks/done/TASK-185-fix-spurious-sentinel-on-merge.md`
76+
- `.tasks/done/TASK-187-fix-star-topology-convergence.md`
77+
78+
**Commits**
79+
- (pending commit)
80+
81+
**Environment**
82+
- OS: darwin (macOS ARM64)
83+
- Tooling: nix, zig (via nix), bash
84+
85+
**Commands run (exact)**
86+
```bash
87+
make -C zig build
88+
bash zig/harness/test-resurrection-parity.sh
89+
bash zig/harness/test-sentinel-parity.sh
90+
bash zig/harness/test-multinode-sync.sh
91+
```
92+
93+
**Outputs (paste)**
94+
95+
<details>
96+
<summary>TASK-184: Resurrection via sentinel (25/25 pass, was 11/12)</summary>
97+
98+
```text
99+
Test 1: Resurrection of Live Row via Sentinel - 5/5 PASS
100+
Test 2: Resurrection of Dead Row via Sentinel - 4/4 PASS (was failing)
101+
Test 3: Update on Live Row via Column - 6/6 PASS (was failing)
102+
...
103+
Results: 25 passed, 0 failed, 0 skipped
104+
All resurrection parity tests PASSED
105+
```
106+
107+
**Root Cause**: The sentinel merge path in `changes_vtab.zig` handled resurrection (odd CL) by only zeroing column clocks, but didn't check if the base table row actually existed. For tombstoned rows, the base table row was deleted.
108+
109+
**Fix**: Added `insertRowForSentinelResurrection()` in `merge_insert.zig`. Modified sentinel handling in `changes_vtab.zig` to check `rowExistsInBaseTable()` before zeroing clocks, and if row doesn't exist, call `insertRowForSentinelResurrection()` first.
110+
</details>
111+
112+
<details>
113+
<summary>TASK-185: Spurious sentinel on merge (6/6 pass, was 5/6)</summary>
114+
115+
```text
116+
PASS: No sentinels created on INSERT (both = 0)
117+
PASS: Sentinels created on DELETE (both = 800)
118+
PASS: No sentinels created on INSERT OR REPLACE (both = 0)
119+
PASS: No sentinels created on merge (all sites = 0) <-- was failing
120+
PASS: Sentinels propagated correctly (Site B matches Site A = 3)
121+
PASS: Sentinel structure matches (sampled first 5 entries)
122+
Sentinel Parity Summary: 6 passed, 0 failed
123+
All sentinel parity tests PASSED!
124+
```
125+
126+
**Root Cause**: In `changes_vtab.zig`, the `changesUpdate` function unconditionally created sentinel entries when inserting new rows during merge. Should only create sentinels when merging an actual sentinel (cid='-1').
127+
128+
**Fix**: Removed unconditional sentinel creation for new rows in `changes_vtab.zig`. Added sentinel CL update for existing row column merges when cl > local_cl.
129+
</details>
130+
131+
<details>
132+
<summary>TASK-187: Star topology convergence (6/6 pass, was 5/6)</summary>
133+
134+
```text
135+
Test 1: Discord corrosion (3-node) - PASS
136+
Test 2: Extended 4-node discord - PASS
137+
Test 3: Star topology - PASS <-- was failing
138+
139+
Results: 6 passed, 0 failed, 0 skipped
140+
All multi-node sync parity tests PASSED
141+
```
142+
143+
**Root Cause**: Three bugs in `merge_insert.zig`:
144+
1. `setWinnerClock` and `setWinnerClockCached` always stored `site_id = 0` instead of preserving remote site_id
145+
2. `getLocalColVersion` filtered `WHERE site_id = 0` which ignored remote changes
146+
3. For tables with `a PRIMARY KEY NOT NULL` (not `INTEGER PRIMARY KEY`), operations used `WHERE rowid = ?` but rowid ≠ PK value
147+
148+
**Fix**: Fixed all three issues in `merge_insert.zig` - preserve site_id, remove incorrect filter, use PK column instead of rowid.
149+
</details>
150+
151+
**Files modified**
152+
- `zig/src/merge_insert.zig` — Added `insertRowForSentinelResurrection()`, fixed site_id handling, fixed PK vs rowid confusion
153+
- `zig/src/changes_vtab.zig` — Fixed sentinel creation logic, added row existence check for resurrection
154+
155+
**Reproduction steps (clean checkout)**
156+
1. `git clone <repo> && cd cr-sqlite`
157+
2. `make -C zig build` — build Zig extension
158+
3. `bash zig/harness/test-resurrection-parity.sh` — verify 25/25 pass
159+
4. `bash zig/harness/test-sentinel-parity.sh` — verify 6/6 pass
160+
5. `bash zig/harness/test-multinode-sync.sh` — verify 6/6 pass
161+
162+
**Known gaps / unverified claims**
163+
- All 3 Round 62 divergences fixed: resurrection, spurious sentinel, star topology
164+
- TASK-186 (schema mismatch unknown column behavior) remains in triage — this is a design decision, not a bug
165+
- CI integration not verified (local runs only)
166+
- No coverage captured
167+
168+
---
169+
71170
## Round 2025-12-22 (62) — 8 new parity test suites (8 tasks)
72171

73172
**Tasks executed**

.tasks/triage/TASK-184-fix-resurrection-via-sentinel.md renamed to .tasks/done/TASK-184-fix-resurrection-via-sentinel.md

Lines changed: 23 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
Fix Zig implementation to resurrect tombstoned rows when receiving a sentinel with higher CL.
55

66
## Status
7-
- State: triage
7+
- State: done
88
- Priority: HIGH (sync incompatibility)
99
- Discovered: Round 62 (TASK-161 test suite)
1010

@@ -49,6 +49,27 @@ The sentinel merge path in `zig/src/merge_insert.zig` likely doesn't handle the
4949

5050
## Progress Log
5151
- 2025-12-22: Created from Round 62 divergence discovery.
52+
- 2025-12-22: Implemented fix - resurrection via sentinel now works.
5253

5354
## Completion Notes
54-
(Empty until done.)
55+
**Root Cause**: The sentinel merge path in `changes_vtab.zig` (lines 1901-1914) handled resurrection (odd CL) by only zeroing column clocks, but didn't check if the base table row actually existed. For tombstoned rows, the base table row was deleted, so `zeroClockOnResurrect` ran on a non-existent row.
56+
57+
**Fix Applied**:
58+
1. Added `insertRowForSentinelResurrection()` function in `merge_insert.zig` - inserts a row with just the PK value (non-PK columns NULL) using the `__crsql_key` lookup.
59+
2. Modified sentinel handling in `changes_vtab.zig` to check `rowExistsInBaseTable()` before zeroing clocks. If row doesn't exist, calls `insertRowForSentinelResurrection()` first.
60+
61+
**Files Modified**:
62+
- `zig/src/merge_insert.zig`: Added `insertRowForSentinelResurrection()` function (lines 659-703)
63+
- `zig/src/changes_vtab.zig`: Added row existence check + resurrection insert in sentinel handler (lines 1901-1920)
64+
65+
**Test Results**:
66+
- `test-resurrection-parity.sh` Test 2 (dead via sentinel): **PASS**
67+
- `test-resurrection-parity.sh` Tests 1a-1e: **PASS**
68+
- `test-cl-parity.sh`: **17 passed, 0 failed**
69+
- `test-rows-impacted-parity.sh`: **18 passed, 0 failed**
70+
- `test-merge-value-parity.sh`: **7 passed, 0 failed**
71+
- `test-oracle-parity.sh`: **18 passed, 0 failed**
72+
73+
**Note**: Test 3c (`test-resurrection-parity.sh`) fails - this is a separate issue where column updates don't create/update the sentinel clock. This is out of scope for TASK-184.
74+
75+
**Date**: 2025-12-22

.tasks/triage/TASK-185-fix-spurious-sentinel-on-merge.md renamed to .tasks/done/TASK-185-fix-spurious-sentinel-on-merge.md

Lines changed: 33 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
Fix Zig implementation to NOT create sentinel entries when syncing INSERT changes to a new site.
55

66
## Status
7-
- State: triage
7+
- State: done
88
- Priority: HIGH (wire format divergence)
99
- Discovered: Round 62 (TASK-166 test suite)
1010

@@ -51,6 +51,37 @@ The `changesUpdate` path in `zig/src/changes_vtab.zig` creates sentinels when in
5151

5252
## Progress Log
5353
- 2025-12-22: Created from Round 62 divergence discovery.
54+
- 2025-12-22: Fixed. Root cause confirmed.
5455

5556
## Completion Notes
56-
(Empty until done.)
57+
**Date**: 2025-12-22
58+
59+
### Root Cause
60+
In `zig/src/changes_vtab.zig`, the `changesUpdate` function unconditionally created sentinel entries (cid='-1') when inserting NEW rows during merge operations. This was incorrect - sentinels should only be created when:
61+
1. A DELETE operation is being merged (incoming cid='-1'), OR
62+
2. Passing through an existing sentinel from source
63+
64+
### Changes Made
65+
**File**: `zig/src/changes_vtab.zig`
66+
67+
1. **Lines 1767-1778**: Removed unconditional sentinel creation when merging new rows. Added comment explaining the correct behavior:
68+
- Sentinels should NOT be created for regular column changes
69+
- This matches Rust/C behavior
70+
71+
2. **Lines 2131-2144** (after edit ~2147-2160): Added code to update sentinel CL when merging column updates on existing rows where incoming CL > local CL. This ensures that when Site B receives a column update from Site A (where the row has been through INSERT->DELETE->INSERT), the sentinel CL is correctly advanced.
72+
73+
3. **Lines 1983-1996** (after edit ~1997-2010): Added code to update sentinel CL during resurrection via column update. When a tombstoned row is resurrected by a column update, the sentinel CL must also be updated to track the incoming causal length.
74+
75+
### Test Results
76+
```
77+
bash zig/harness/test-sentinel-parity.sh
78+
Sentinel Parity Summary: 6 passed, 0 failed
79+
80+
bash zig/harness/test-resurrection-parity.sh
81+
Results: 25 passed, 0 failed, 0 skipped
82+
```
83+
84+
All acceptance criteria met:
85+
- ✅ Test 4 (No sentinel on merge) passes
86+
- ✅ Test 5 (Sentinel propagation) still passes
87+
- ✅ All resurrection tests pass (no regressions)
Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
# TASK-187 — Fix star topology sync convergence failure
2+
3+
## Goal
4+
Fix Zig implementation to correctly converge in hub-and-spoke (star) topology sync patterns.
5+
6+
## Status
7+
- State: COMPLETE
8+
- Priority: HIGH (multi-node sync broken)
9+
- Discovered: Round 62 (TASK-179 test suite)
10+
11+
## Problem
12+
In a hub-and-spoke topology where all nodes sync only through a central hub, Zig nodes fail to converge to the same state. Rust/C converges correctly.
13+
14+
**Test failure from `test-multinode-sync.sh`:**
15+
```
16+
Test 3: Star topology - FAIL
17+
18+
Hub data: 1|updated_by_b|c2|c3|c4
19+
3|d1|d2|d3|d4
20+
B data: 1|updated_by_b|c2|c3|c4
21+
3|d1|d2|d3|d4
22+
C data: 1|c1|c2|c3|c4 <-- wrong! didn't get B's update
23+
3|d1|d2|d3|d4
24+
D data: 2|d1|d2|d3|d4 <-- completely wrong rows
25+
3|updated_by_b|b2|b3|b4
26+
27+
FAIL: Hub and C diverged
28+
FAIL: Hub and D diverged
29+
```
30+
31+
## Scenario
32+
1. Create 4 databases: Hub, B, C, D
33+
2. B, C, D each create unique rows
34+
3. All sync to Hub (Hub has all rows)
35+
4. Hub syncs back to all spokes (all should have same data)
36+
5. B updates row 1, C deletes row 2
37+
6. Sync through Hub again
38+
7. **Expected**: All 4 nodes converge to same state
39+
8. **Actual (Zig)**: C didn't get B's update, D has completely wrong data
40+
41+
## Root Cause Analysis
42+
43+
**Three bugs found and fixed:**
44+
45+
### Bug 1: site_id not preserved when relaying changes through hub
46+
- Location: `zig/src/merge_insert.zig` in `setWinnerClock` and `setWinnerClockCached`
47+
- Problem: The code was always storing `site_id = 0` (local) instead of preserving the original remote site_id
48+
- Fix: Added lookup via `site_identity.getOrCreateSiteOrdinal()` to convert the 16-byte site_id blob to an ordinal and store it correctly
49+
50+
### Bug 2: `getLocalColVersion` filtering incorrectly by site_id
51+
- Location: `zig/src/merge_insert.zig` in `getLocalColVersion` and `getLocalColVersionCached`
52+
- Problem: Was filtering with `WHERE site_id = 0` which ignored remote changes
53+
- Fix: Removed the site_id filter since clock table has unique constraint on (key, col_name)
54+
55+
### Bug 3 (MAIN BUG): Using `rowid` instead of PK column for table operations
56+
- Location: `zig/src/merge_insert.zig` in `updateBaseTableColumn`, `rowExistsInBaseTable`, `deleteFromBaseTable`, and cached variants
57+
- Problem: For tables with `a PRIMARY KEY NOT NULL` (not `INTEGER PRIMARY KEY`), SQLite's rowid is NOT aliased to the PK column. The code was using `WHERE rowid = ?` but rowid ≠ PK value.
58+
- Example: Row with a=2 inserted first gets rowid=1, row with a=1 inserted second gets rowid=2. Using `WHERE rowid=1` updates the wrong row!
59+
- Fix: Changed all operations to use `WHERE "pk_column" = ?` instead of `WHERE rowid = ?`
60+
61+
## Files Modified
62+
- `zig/src/merge_insert.zig` — Main changes for site_id handling and rowid→PK column fix
63+
64+
## Acceptance Criteria
65+
1.`bash zig/harness/test-multinode-sync.sh` — Test 3 (star topology) passes
66+
2. ✅ All other multinode tests still pass
67+
3. ✅ No regressions in `test-sentinel-parity.sh`
68+
4. ✅ No regressions in `test-resurrection-parity.sh`
69+
5. ✅ No regressions in `test-fract-parity.sh`
70+
71+
## Parent Docs / Cross-links
72+
- Test: `zig/harness/test-multinode-sync.sh` (Test 3: Star topology)
73+
- Triggering task: `.tasks/done/TASK-179-multinode-sync.md`
74+
- Python reference: `py/correctness/tests/test_cl_merging.py`
75+
76+
## Progress Log
77+
- 2025-12-22: Created from Round 62 divergence discovery.
78+
- 2025-12-22: Fixed Bug 1 - site_id not preserved in setWinnerClock/setWinnerClockCached
79+
- 2025-12-22: Fixed Bug 2 - getLocalColVersion filtering incorrectly by site_id
80+
- 2025-12-22: Fixed Bug 3 - rowid vs PK column mismatch in updateBaseTableColumn, rowExistsInBaseTable, deleteFromBaseTable
81+
- 2025-12-22: All tests pass
82+
83+
## Completion Notes
84+
- Fixed: 2025-12-22
85+
- Root cause was primarily Bug 3: SQLite's rowid is only aliased to the PK column for `INTEGER PRIMARY KEY`. For other PK types like `PRIMARY KEY NOT NULL`, rowid is independent. The fix queries using actual PK column names.
86+
- All multinode sync tests now pass with parity to Rust/C oracle.

.tasks/triage/TASK-187-fix-star-topology-convergence.md

Lines changed: 0 additions & 66 deletions
This file was deleted.

0 commit comments

Comments
 (0)