Skip to content

Feature: Database Branching & Merge (Git-like workflows for safe experimentation) #17

@hoangsonww

Description

@hoangsonww

Summary

Introduce first-class database branches that allow users to create isolated, lightweight copies of a database (or a schema/table subset), make changes safely, and merge them back using LatticeDB’s existing CRDT/merge policies and bitemporal lineage. This unlocks dev/test workflows, data science sandboxes, blue/green changes, and collaborative modeling—without duplicating terabytes or leaving the engine.


Motivation

  • Safe experimentation: Run migrations, vector-index builds, DP policy changes, or query tuning without risking prod.
  • Reproducible ML/analytics: Branch, transform, train, and later reconcile results.
  • Ops velocity: Blue/green data changes and hotfixes with clear review & rollback paths.
  • Leverage strengths: LatticeDB already has CRDT merge policies, MVCC, and bitemporal history—perfect building blocks for branching.

Proposed Design

1) Core Concepts

  • Branch: A named, immutable base pointer plus a mutable HEAD for writes (copy-on-write at page/extent granularity).
  • Base: Any commit-like snapshot: main@TX=… or timepoint (AS OF '2025-09-22T12:00Z').
  • Merge: 3-way merge: base, target, source. Conflicts resolved via per-column merge policies (e.g., lww, sum_bounded, gset, or custom WASM resolvers).
  • Scope: Whole DB, schema, or table-level branches.

2) SQL / CLI Surface

SQL DDL

-- Create a branch from current main
CREATE BRANCH feature_price_optim FROM main;

-- From a time-travel snapshot
CREATE BRANCH audit_fix FROM main FOR SYSTEM_TIME AS OF TX 12345;

-- Switch session to a branch
SET CURRENT_BRANCH = 'feature_price_optim';

-- Merge feature branch back into main, policy-aware
MERGE BRANCH feature_price_optim INTO main
  WITH (dry_run = true, strategy = 'policy');  -- dry-run shows plan & conflicts

-- Promote after review
MERGE BRANCH feature_price_optim INTO main
  WITH (commit_message = 'Introduce dynamic pricing v2');

-- Show differences
SHOW DIFF BETWEEN feature_price_optim AND main
  WHERE schema = 'sales' AND table = 'orders';

-- Delete/GC
DROP BRANCH feature_price_optim;

CLI

latticedb branch create feature_price_optim --from main
latticedb branch list
latticedb branch diff feature_price_optim main --tables sales.orders
latticedb branch merge feature_price_optim main --dry-run
latticedb branch merge feature_price_optim main --commit-message "pricing v2"
latticedb branch drop feature_price_optim

3) GUI/UX

  • Branch switcher in header (like git branch picker).

  • Diff viewer:

    • DDL diffs (added/removed indexes, policy changes).
    • DML diffs (row counts, sampled row deltas, histogram deltas).
    • Vector index deltas (HNSW/IVF params & metadata).
  • Merge preview with conflict counts per table/column + the policy applied.

  • One-click dry-run → PR-style review → Merge workflow.

  • Badges in query history indicating current branch and base snapshot.

4) Storage/Engine Considerations

  • Copy-on-Write (CoW): Page/extent-level CoW in the Unified Log-Structured Storage to keep branches cheap.
  • WAL tagging: Tag records with {branch_id, tx_id}; reuse existing MVCC to isolate branch snapshots.
  • Checkpointing: Branch HEADs become named checkpoints; merging creates a new checkpoint on target.
  • GC/Retention: Reference counting of pages across branches; policy-driven TTL for stale branches; VACUUM BRANCH.
  • Indexes: Lazy materialization for secondary/B+Tree and ANN indexes; rebuild on first touch or during MERGE ... WITH (prebuild_indexes=true).

5) Security & Governance

  • RLS/CLS inheritance: Policies copy from base; diffs flagged on merge.
  • Auditability: Create/merge/drop events emitted to audit log with actor, rationale, diff summary.
  • Role controls: Grants to create/merge branches, and scoped branches (table/schema).

6) Interactions with Temporal & Lineage

  • Branches record their base transaction/time. On merge, produce a lineage edge capturing 3-way sources and policies applied.
  • lineage_explain() extended to show branch merges as first-class events.

Alternatives Considered

  • Full physical clones (expensive in space/time).
  • External CDC/ETL sandboxes (breaks ACID and lineage continuity).
  • Single-table shadow copies (awkward for cross-table constraints & queries).

Risks & Mitigations

  • Storage bloat: Mitigate with CoW, page ref-counts, and branch TTLs.
  • Merge surprises: Require dry-run by default in GUI; per-table acceptance gates.
  • Index skew: Offer prebuild_indexes option and background index build in branch.

Milestones

  1. MVP (engine + CLI)

    • CREATE/SET/DROP BRANCH
    • Read/write isolation using MVCC + branch_id
    • CoW for data pages
    • Basic MERGE (policy = lww only)
    • SHOW DIFF summaries (row counts, DDL change list)
  2. Policy-Aware Merge

    • Respect existing column merge policies (lww, gset, sum_bounded)
    • WASM resolver hook
    • Detailed conflict reporting
  3. GUI

    • Branch picker, diff viewer, merge preview
    • Dry-run with PR-style review & approvals
  4. Performance & Ops

    • Background index (re)builds on branch
    • VACUUM BRANCH, retention policies, telemetry
    • Audit trail & lineage integration

Acceptance Criteria

  • Branch creation < 1s for 100GB base (no data copy).

  • Branch reads/writes isolated; SELECT on main unaffected by branch mutations.

  • MERGE ... --dry-run outputs:

    • DDL diffs
    • Per-table DML summary (inserts/updates/deletes)
    • Conflict counts per column with policy to apply
  • Successful merge applies CRDT/WASM resolvers correctly (tests included).

  • GUI shows end-to-end flow with clear conflict visualization.


Out of Scope (for now)

  • Cross-cluster branch replication.
  • Multi-target fan-in merges.
  • Binary artifact branching for UDFs beyond current WASM hooks.

Open Questions

  • Should branches be nestable/hierarchical (e.g., feature/a/b) with inherited retention?
  • Need READ FROM branch X JOIN branch Y in a single query for comparisons?
  • Default retention/TTL for inactive branches?

Testing

  • Unit: MVCC with branch_id, CoW semantics, WAL tags.
  • Integration: DDL/DML diffs, policy-aware merges (including WASM).
  • Performance: Space amplification, merge latencies, index rebuild times.
  • Security: RLS/CLS parity across branches, audit event emission.

Labels

enhancement engine gui sql mvcc merge-policies good-first-issue (for CLI/GUI sub-tasks)


Checklist

  • Engine: branch metadata, CoW, WAL tagging
  • SQL: CREATE/SET/DROP/MERGE/SHOW DIFF
  • Policy integration + WASM hook
  • CLI plumbing & docs
  • GUI branch switcher + diff + merge preview
  • GC/retention & VACUUM BRANCH
  • Tests & benchmarks
  • Docs (README + DESIGN/ARCHITECTURE updates)

Metadata

Metadata

Assignees

Labels

documentationImprovements or additions to documentationenhancementNew feature or requestgood first issueGood for newcomershelp wantedExtra attention is neededquestionFurther information is requested

Projects

Status

Backlog

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions