GSoC 2026: SQLite Source Connector proposal #32

Open
PDGGK wants to merge 10 commits into debezium:main from PDGGK:gsoc-2026-sqlite-proposal

@PDGGK PDGGK commented Mar 28, 2026

Draft proposal for the SQLite Source Connector project under JBoss Community / Debezium.

Zulip thread: Zihan - SQLite Source Connector

Code contribution: debezium-platform#309

PoC: sqlite-wal-poc (WAL reader + b-tree page decoder, 11/11 tests passing)

Mentor: Giovanni Panice (@kmos)

PDGGK added 5 commits March 28, 2026 20:02
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
@kmos kmos requested review from kmos and vsantonastaso March 28, 2026 16:42
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Member

@kmos kmos left a comment

Thank you for your contribution. I would suggest that the proposal consider potential drawbacks through a research-oriented approach (for example, evaluate the CDC state of the art in the SQLite context), rather than relying primarily on mentors’ suggestions.

One last thing: the main mentor is @vsantonastaso.


#### Fallback Plan

If transaction-wide reconciliation is not working correctly by the end of Week 4 (Jun 29), invoke the fallback immediately. **Minimum viable product**: Replace raw WAL page decoding with a hybrid approach — use WAL frame monitoring to detect WHICH tables changed, then re-read via JDBC `SELECT`. This sacrifices `before` images and intermediate states but produces valid Debezium events.
Member

Please evaluate another fallback plan that doesn't depend on the WAL; for example, you mentioned the SQLite session extension.

#### Key Design Decisions

- **WAL-based over trigger-based**: WAL reading is passive — no DDL changes to the target database. Mentor confirmed WAL-based approach.
- **Page-level diff over session extension**: The `sqlite3session` API requires explicit table registration. WAL-based change detection works with any SQLite database.
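
The page-level diff named in the second bullet can be illustrated with a minimal sketch. It assumes the b-tree decoder has already turned the old and new images of a leaf page into maps from rowid to a decoded row (represented here as a plain `String`). The class `PageRowDiff` and its `Change` type are hypothetical names for illustration, not part of the proposal or of Debezium; only the op codes `c`, `u`, and `d` follow Debezium's envelope convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class PageRowDiff {

    /** Hypothetical change record; op is "c" (create), "u" (update) or "d" (delete). */
    public static final class Change {
        public final String op;
        public final long rowid;
        public final String before; // null for inserts
        public final String after;  // null for deletes

        Change(String op, long rowid, String before, String after) {
            this.op = op;
            this.rowid = rowid;
            this.before = before;
            this.after = after;
        }
    }

    /**
     * Compares two decoded states of the same page, keyed by rowid.
     * Assumes map values are never null. Results are ordered by rowid.
     */
    public static List<Change> diff(Map<Long, String> before, Map<Long, String> after) {
        List<Change> changes = new ArrayList<>();
        TreeSet<Long> rowids = new TreeSet<>(before.keySet());
        rowids.addAll(after.keySet());
        for (long rowid : rowids) {
            String oldRow = before.get(rowid);
            String newRow = after.get(rowid);
            if (oldRow == null) {
                changes.add(new Change("c", rowid, null, newRow));
            } else if (newRow == null) {
                changes.add(new Change("d", rowid, oldRow, null));
            } else if (!oldRow.equals(newRow)) {
                changes.add(new Change("u", rowid, oldRow, newRow));
            }
            // Identical rows produce no event.
        }
        return changes;
    }
}
```

A per-page diff like this would report a row that moves between pages during b-tree rebalancing as a delete on one page and an insert on another, so the maps would presumably need to cover every page touched by a transaction before events are emitted.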
Member

Does the sqlite3 session API really need explicit table registration? Is that really true?

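```c
/* Create a session on the "main" database. */
sqlite3session_create(db, "main", &pSession);
/* A NULL table name attaches every table in the database. */
sqlite3session_attach(pSession, NULL);
```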


This project builds a Debezium Source Connector for SQLite that reads the Write-Ahead Log (WAL) to detect committed changes, reconstructs row-level events from page-level WAL frames by decoding SQLite's b-tree page format, and emits standard Debezium change events (Envelope with before/after/source/op fields) that flow into Kafka, Pulsar, or any Debezium Server sink.

SQLite's WAL was designed for local concurrency, not replication. Unlike PostgreSQL's logical decoding or MySQL's binlog, SQLite WAL frames are physical page images — not logical row operations. The connector must parse WAL frame headers to identify committed transactions, decode b-tree leaf pages to extract row data, diff page states to determine which rows were inserted, updated, or deleted, and handle edge cases including WITHOUT ROWID tables, overflow pages, and WAL checkpoint/reset cycles.
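
As a rough illustration of the first step in that list (parsing frame headers to identify committed transactions), the sketch below walks a WAL file held in memory and groups frames by commit. It relies on the documented WAL layout: a 32-byte big-endian file header carrying the magic number, page size, and two salts, followed by frames that each start with a 24-byte header whose second field is non-zero only on a commit frame. The class name `WalCommitScanner` is hypothetical, and checksum validation is deliberately left out, so this is not the PoC's reader.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class WalCommitScanner {
    static final int WAL_HEADER_SIZE = 32;
    static final int FRAME_HEADER_SIZE = 24;

    /**
     * Returns one list per committed transaction, holding the page numbers
     * written by its frames. Frames after the last commit frame are ignored.
     * Assumption: checksums are NOT verified here; a real reader must do so.
     */
    public static List<List<Integer>> committedTransactions(byte[] wal) {
        List<List<Integer>> committed = new ArrayList<>();
        if (wal.length < WAL_HEADER_SIZE) {
            return committed;
        }
        ByteBuffer buf = ByteBuffer.wrap(wal); // ByteBuffer is big-endian by default
        int magic = buf.getInt(0);
        if (magic != 0x377f0682 && magic != 0x377f0683) {
            throw new IllegalArgumentException("not a SQLite WAL file");
        }
        int pageSize = buf.getInt(8);
        if (pageSize <= 0) {
            throw new IllegalArgumentException("invalid page size: " + pageSize);
        }
        int salt1 = buf.getInt(16);
        int salt2 = buf.getInt(20);
        int frameSize = FRAME_HEADER_SIZE + pageSize;

        List<Integer> pending = new ArrayList<>();
        for (int off = WAL_HEADER_SIZE; off + frameSize <= wal.length; off += frameSize) {
            // A salt mismatch marks a stale frame left over from before a WAL reset.
            if (buf.getInt(off + 8) != salt1 || buf.getInt(off + 12) != salt2) {
                break;
            }
            pending.add(buf.getInt(off)); // page number written by this frame
            int dbSizeAfterCommit = buf.getInt(off + 4);
            if (dbSizeAfterCommit != 0) { // non-zero only on a commit frame
                committed.add(pending);
                pending = new ArrayList<>();
            }
        }
        return committed;
    }
}
```

Stopping at the first salt mismatch is what lets a reader notice a checkpoint/reset cycle: after a reset the header salts change, so older frames still present in the file no longer match.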
Member

This part sounds more like a limitation or potential failure stemming from SQLite’s WAL architecture. I recommend starting with an abstract that focuses on the primary objective (a source connector for SQLite) and emphasizes the business value it delivers.

Addresses Giovanni's review: proposal now evaluates CDC approaches through
independent research (Litestream, LiteFS, rqlite, Turso, Marmot, sqlite-cdc,
cr-sqlite) rather than relying primarily on mentor suggestions.

Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Author

@PDGGK PDGGK left a comment

Thank you for the feedback. I've updated the proposal with two changes:

  1. Added a "CDC state of the art" analysis evaluating existing approaches in the SQLite ecosystem: Litestream (page-level WAL streaming), LiteFS (FUSE-based page capture), rqlite 9.0 (pre-update hook CDC with Raft indices), Turso/libSQL (native engine-level CDC), Marmot (trigger-based Debezium-format events), sqlite-cdc (trigger-based with 97-127% write overhead benchmarks), and CRDT approaches (cr-sqlite, Corrosion). The analysis positions the WAL-based approach as the only one providing external monitoring + row-level events + zero write overhead simultaneously.

  2. Revised design decision rationale to reflect independent research (tracing through SQL Server's TxLogPosition, PostgreSQL's Lsn, and MySQL/SQL Server schema handling patterns) rather than citing mentor suggestions.

Also noted: @vsantonastaso as primary mentor.

Member

@kmos kmos left a comment

Thank you for your work. If you have not already done so, you may proceed with the official submission

PDGGK added 3 commits March 31, 2026 14:14
Previous version was a condensed summary. This version includes:
- Full WAL parsing code with checksum validation
- B-tree page decoder implementation
- Offset management with salt-based epoch tracking
- Schema evolution via HistorizedRelationalDatabaseSchema
- Snapshot-to-stream handoff design
- CDC state-of-art analysis (Litestream, LiteFS, rqlite, Turso, Marmot)
- Performance benchmarking plan
- Correctness test matrix
- Community collaboration strategy

Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>