|
| 1 | +# TASK-100: Decide ALTER TABLE new-column clock semantics (backfill vs lazy) |
| 2 | + |
| 3 | +## Status |
| 4 | +- [ ] Planned |
| 5 | +- [ ] Assigned |
| 6 | +- [ ] In Progress |
| 7 | +- [ ] Blocked (reason: ...) |
| 8 | +- [x] Complete (2025-12-20) |
| 9 | + |
| 10 | +## Priority |
| 11 | +high |
| 12 | + |
| 13 | +## Assigned To |
| 14 | +(unassigned) |
| 15 | + |
| 16 | +## Parent Docs / Cross-links |
| 17 | +- Oracle parity test (current evidence): `zig/harness/test-alter-parity.sh` |
| 18 | +- Zig alter implementation: `zig/src/schema_alter.zig` |
| 19 | +- Rust/C alter implementation (reference): `core/rs/core/src/alter.rs` |
| 20 | +- sqlite-cr wrapper (oracle runtime): `nix run github:subtleGradient/sqlite-cr -- ...` |
| 21 | +- Origin task: `.tasks/active/TASK-094-alter-table-history-preservation.md` |
| 22 | +- Gap backlog: `research/zig-cr/92-gap-backlog.md` |
| 23 | + |
| 24 | +## Description |
| 25 | +TASK-094 surfaced an **ALTER TABLE semantic ambiguity**: |
| 26 | + |
| 27 | +When a CRR table undergoes `ALTER TABLE ... ADD COLUMN` (nullable or with DEFAULT), should the system: |
| 28 | + |
| 29 | +1) **Eager backfill** the `__crsql_clock` table with entries for the new column for *all existing rows* (so the new column becomes part of the per-row conflict history immediately), OR |
| 30 | + |
| 31 | +2) **Lazy materialize** clock entries for the new column only when that column is explicitly written (so schema evolution does not fabricate per-row “writes”). |
| 32 | + |
| 33 | +The current oracle (sqlite-cr / Rust/C) behavior observed in TASK-094’s test run: |
| 34 | +- `ADD COLUMN` does **not** create clock entries for the new column. |
| 35 | +- A later `UPDATE` to the new column creates the first clock entry with `col_version = 1`. |
| 36 | + |
| 37 | +The current Zig behavior observed: |
| 38 | +- `ADD COLUMN` **does** backfill clock entries for all existing rows. |
| 39 | +- A later `UPDATE` increments `col_version` (because the row already has a clock entry). |
| 40 | + |
| 41 | +This task is to decide which behavior is the intended contract for Zig (and by extension, what oracle parity tests should enforce). |
| 42 | + |
| 43 | +## Files to Modify |
| 44 | +- `research/zig-cr/92-gap-backlog.md` (record the decided contract) |
| 45 | +- `zig/harness/test-alter-parity.sh` (make expectations match the decided contract) |
| 46 | +- `.tasks/active/TASK-094-alter-table-history-preservation.md` (update acceptance criteria wording) |
| 47 | + |
| 48 | +## Acceptance Criteria |
| 49 | +- [x] A written decision exists in this task's Completion Notes: **Eager backfill** or **Lazy materialize**. |
| 50 | +- [x] Decision includes: |
| 51 | + - the intended meaning of `__crsql_clock` entries |
| 52 | + - impact on `crsql_changes` payload size and db_version evolution |
| 53 | + - expected behavior for `ADD COLUMN DEFAULT ...` on existing rows |
| 54 | +- [ ] `zig/harness/test-alter-parity.sh` assertions reflect the decision (no "failing-by-design"). → TASK-101 |
| 55 | +- [x] `research/zig-cr/92-gap-backlog.md` updated to link to follow-up implementation task(s). |
| 56 | + |
| 57 | +## Progress Log |
| 58 | +### 2025-12-20 |
| 59 | +- Observed divergence via `zig/harness/test-alter-parity.sh`: |
| 60 | + - Rust/sqlite-cr: no clock rows for new column until UPDATE |
| 61 | + - Zig: backfills clock rows on ADD COLUMN |
| 62 | +- Analyzed Rust source: `core/rs/core/src/alter.rs` and `core/rs/core/src/backfill.rs` |
| 63 | +- Key finding: Rust `compact_post_alter` does NOT call `backfill_table` — it only compacts/deletes |
| 64 | +- Key finding: Rust `backfill_missing_columns` is only called during `crsql_as_crr`, NOT during `crsql_commit_alter` |
| 65 | + |
| 66 | +## Completion Notes |
| 67 | + |
| 68 | +### Decision: **LAZY MATERIALIZE** (match Rust/C oracle behavior) |
| 69 | + |
| 70 | +### Recommendation |
| 71 | + |
| 72 | +Zig should adopt **lazy materialize** semantics for `ADD COLUMN`, matching the Rust/C oracle: |
| 73 | +- `crsql_commit_alter` should NOT create clock entries for newly added columns |
| 74 | +- Clock entries should only be created when the column is explicitly written (INSERT or UPDATE) |
| 75 | + |
| 76 | +### Justification |
| 77 | + |
| 78 | +#### 1. Intended Meaning of `__crsql_clock` Entries |
| 79 | + |
| 80 | +A clock entry represents a **write event** — a deliberate modification to a specific cell (row × column). The clock captures: |
| 81 | +- `col_version`: How many times this cell has been written |
| 82 | +- `db_version`: Which logical database version this write occurred at |
| 83 | +- `site_id`: Which node performed the write |
| 84 | + |
| 85 | +**Schema changes are not writes.** Adding a column doesn't represent user intent to set a value; it's a structural change. The column's initial value (NULL or DEFAULT) exists by virtue of the schema definition, not because any user wrote that value. |
| 86 | + |
| 87 | +Creating clock entries for schema changes would: |
| 88 | +- Conflate schema evolution with data modification |
| 89 | +- Create "phantom writes" that no user requested |
| 90 | +- Generate misleading conflict history (col_version=1 for values no one explicitly set) |
| 91 | + |
| 92 | +#### 2. Impact on `crsql_changes` Payload Size and `db_version` |
| 93 | + |
| 94 | +**Eager backfill (current Zig behavior):** |
| 95 | +- `ADD COLUMN` on a table with N rows creates N new clock entries |
| 96 | +- `crsql_changes` returns N extra change records (value=NULL/DEFAULT for each row) |
| 97 | +- `db_version` advances once (via `crsql_db_version()`) |
| 98 | +- Sync payload: O(N) extra records per schema migration |
| 99 | +- For a 1M row table, this means 1M extra change records per new column! |
| 100 | + |
| 101 | +**Lazy materialize (Rust/C oracle behavior):** |
| 102 | +- `ADD COLUMN` creates 0 clock entries |
| 103 | +- `crsql_changes` returns nothing for the new column until explicit writes |
| 104 | +- `db_version` does not advance for schema-only changes |
| 105 | +- Sync payload: zero extra records for schema migration |
| 106 | +- Nodes receiving the same schema change apply it locally; no re-sync needed |
| 107 | + |
| 108 | +The lazy approach aligns with the principle stated in Rust `backfill.rs:100-104`: |
| 109 | +> "We do not grab nextdbversion on migration. The idea is that other nodes will apply the same migration in the future so if they have already seen this node up to the current db version then the migration will place them into the correct state. No need to re-sync post migration." |
| 110 | +
|
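A toy model of the two behaviors compared above, using Python's built-in `sqlite3` module rather than the cr-sqlite extension. The `users_clock` table and every helper name here are hypothetical stand-ins for `__crsql_clock` and the real triggers. The sketch only illustrates the clock row counts and `col_version` values described in this section, assuming each write bumps the version by one.

```python
import sqlite3


def make_db(n_rows):
    """Create an in-memory DB with a `users` table and a simplified clock table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    # Hypothetical, reduced clock schema: one row per (row, column) cell.
    db.execute(
        "CREATE TABLE users_clock ("
        " pk INTEGER, col_name TEXT, col_version INTEGER,"
        " PRIMARY KEY (pk, col_name))"
    )
    db.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, n_rows + 1)],
    )
    return db


def add_status_column(db, eager):
    """Add the column; under eager semantics also backfill one clock row per existing row."""
    db.execute("ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active'")
    if eager:
        db.execute(
            "INSERT INTO users_clock (pk, col_name, col_version)"
            " SELECT id, 'status', 1 FROM users"
        )


def write_status(db, pk, value):
    """Explicit write: bump the cell's col_version, or create it at 1 if absent."""
    db.execute("UPDATE users SET status = ? WHERE id = ?", (value, pk))
    cur = db.execute(
        "UPDATE users_clock SET col_version = col_version + 1"
        " WHERE pk = ? AND col_name = 'status'",
        (pk,),
    )
    if cur.rowcount == 0:
        db.execute("INSERT INTO users_clock VALUES (?, 'status', 1)", (pk,))


def status_clock_rows(db):
    """Number of clock rows that exist for the new column."""
    return db.execute(
        "SELECT count(*) FROM users_clock WHERE col_name = 'status'"
    ).fetchone()[0]


def status_col_version(db, pk):
    """col_version of the new column for one row, or None if no clock row exists."""
    row = db.execute(
        "SELECT col_version FROM users_clock WHERE pk = ? AND col_name = 'status'",
        (pk,),
    ).fetchone()
    return None if row is None else row[0]
```

With five existing rows, `add_status_column(db, eager=True)` leaves five clock rows for `status` and a later `write_status` yields `col_version` 2, whereas `eager=False` leaves none and the first write yields `col_version` 1.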
| 111 | +#### 3. Expected Behavior for `ADD COLUMN DEFAULT ...` |
| 112 | + |
| 113 | +**Scenario:** `ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active'` |
| 114 | + |
| 115 | +- **Eager (Zig):** Creates N clock entries with `value='active'`, `col_version=1` |
| 116 | +- **Lazy (Rust/C):** Creates 0 clock entries |
| 117 | + |
| 118 | +**Why lazy is correct:** |
| 119 | +- Every node running the same migration gets `status='active'` for existing rows |
| 120 | +- No sync is needed — the schema migration IS the write |
| 121 | +- If Node A later sets `status='inactive'` for row 1, THEN a clock entry is created |
| 122 | +- Node B receives the change, compares `col_version=1` against 0 (it has no local entry for that cell), and accepts it |
| 123 | + |
| 124 | +**Why eager is problematic:** |
| 125 | +- Node A creates N clock entries on ALTER |
| 126 | +- Node B receives schema change, also creates N clock entries locally |
| 127 | +- Both have `col_version=1` — no conflict, but redundant sync traffic |
| 128 | +- Or worse: if the two nodes' db_versions differ, the duplicated entries may create spurious conflicts |
| 129 | + |
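The acceptance step in the lazy scenario can be sketched as a comparison in which a missing local clock entry counts as version 0. The function name is hypothetical, and this is a deliberate simplification: how equal versions are resolved is not modeled here.

```python
from typing import Optional


def accepts_incoming(incoming_col_version: int,
                     local_col_version: Optional[int]) -> bool:
    """Return True when an incoming change should replace the local cell.

    Assumption: a cell with no local clock entry (None) behaves as version 0,
    so the first explicit write from a peer (col_version=1) always wins.
    Tie-breaking for equal versions is intentionally left out of this sketch.
    """
    local = 0 if local_col_version is None else local_col_version
    return incoming_col_version > local
```

Under lazy materialization, Node B calls this with `local_col_version=None` for a row it never wrote. Under eager backfill it would already hold version 1, so an incoming version 1 would fall into the unmodeled tie case.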
| 130 | +#### 4. Edge Cases |
| 131 | + |
| 132 | +**New row inserted after ADD COLUMN:** |
| 133 | +- INSERT trigger creates clock entry for the new column (value from DEFAULT or explicit) |
| 134 | +- This is a real write, so clock entry is appropriate |
| 135 | + |
| 136 | +**UPDATE to new column on existing row:** |
| 137 | +- UPDATE trigger creates clock entry for the column |
| 138 | +- `col_version=1` (first write to this cell) |
| 139 | +- This is correct: the first explicit write is version 1 |
| 140 | + |
| 141 | +**Row existed before ALTER, never updated:** |
| 142 | +- No clock entry for the new column |
| 143 | +- `crsql_changes` does not return this column for this row |
| 144 | +- Other nodes apply the same ALTER and have the same state |
| 145 | +- No sync needed |
| 146 | + |
| 147 | +### Follow-up Implementation Task |
| 148 | + |
| 149 | +TASK-101 (`impl-alter-add-column-no-backfill.md`) should modify `zig/src/schema_alter.zig`: |
| 150 | +- Remove `backfillNewColumns(db, table_name_ptr)` call from `crsqlCommitAlterFunc` |
| 151 | +- Or: Only call backfill for rows that didn't exist before (i.e., rows missing from `__crsql_pks`) |
| 152 | + |
| 153 | +The simpler approach is to remove the backfill entirely from `commit_alter`, matching Rust/C exactly. |