GSoC 2026: SQLite Source Connector proposal #32
Signed-off-by: Zihan Dai <99155080+PDGGK@users.noreply.github.com>
kmos
left a comment
Thank you for your contribution. I would suggest that the proposal weigh potential drawbacks through a research-oriented approach (for example, by evaluating the CDC state of the art in the SQLite context) rather than relying primarily on mentors' suggestions.
One last thing: the main mentor is @vsantonastaso.
gsoc/2026/PDGGK/proposal.md
#### Fallback Plan

If transaction-wide reconciliation is not working correctly by the end of Week 4 (Jun 29), invoke the fallback immediately. **Minimum viable product**: replace raw WAL page decoding with a hybrid approach that uses WAL frame monitoring to detect which tables changed, then re-reads those tables via JDBC `SELECT`. This sacrifices `before` images and intermediate states but still produces valid Debezium events.
Review comment: Please evaluate another fallback plan that does not depend on the WAL; for example, you mentioned the SQLite session extension.
gsoc/2026/PDGGK/proposal.md
#### Key Design Decisions

- **WAL-based over trigger-based**: WAL reading is passive and requires no DDL changes to the target database. The mentor confirmed the WAL-based approach.
- **Page-level diff over session extension**: The `sqlite3session` API requires explicit table registration. WAL-based change detection works with any SQLite database.
Review comment: Does the sqlite3session API really require explicit table registration? Passing `NULL` as the table name to `sqlite3session_attach` attaches every table:
sqlite3session_create(db, "main", &pSession);
sqlite3session_attach(pSession, NULL);
gsoc/2026/PDGGK/proposal.md
This project builds a Debezium Source Connector for SQLite that reads the Write-Ahead Log (WAL) to detect committed changes, reconstructs row-level events from page-level WAL frames by decoding SQLite's b-tree page format, and emits standard Debezium change events (Envelope with before/after/source/op fields) that flow into Kafka, Pulsar, or any Debezium Server sink.

SQLite's WAL was designed for local concurrency, not replication. Unlike PostgreSQL's logical decoding or MySQL's binlog, SQLite WAL frames are physical page images, not logical row operations. The connector must parse WAL frame headers to identify committed transactions, decode b-tree leaf pages to extract row data, diff page states to determine which rows were inserted, updated, or deleted, and handle edge cases including WITHOUT ROWID tables, overflow pages, and WAL checkpoint/reset cycles.
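
The first step named above, parsing WAL frame headers to identify committed transactions, could be sketched roughly as follows. This is only an illustration under stated assumptions: it relies on the on-disk layout described in the SQLite file format documentation (a 32-byte WAL header, 24-byte frame headers, big-endian integer fields), the names `WalFrameScanner` and `committedTransactions` are hypothetical and not taken from the proposal or its PoC, and checksum validation is deliberately left out to keep the example short.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the proposal or the PoC. */
final class WalFrameScanner {
    static final int WAL_HEADER_SIZE = 32;
    static final int FRAME_HEADER_SIZE = 24;
    static final int MAGIC_LE = 0x377f0682; // checksums computed over little-endian words
    static final int MAGIC_BE = 0x377f0683; // checksums computed over big-endian words

    /**
     * Returns one list of page numbers per committed transaction found in the
     * given WAL image. Frames after the last commit frame are dropped because
     * they belong to a transaction that has not committed yet.
     * NOTE: frame checksums are NOT verified in this sketch.
     */
    static List<List<Long>> committedTransactions(byte[] wal) {
        List<List<Long>> committed = new ArrayList<>();
        if (wal.length < WAL_HEADER_SIZE) {
            return committed; // empty or truncated WAL: nothing to report
        }
        ByteBuffer buf = ByteBuffer.wrap(wal); // ByteBuffer is big-endian by default
        int magic = buf.getInt(0);
        if (magic != MAGIC_LE && magic != MAGIC_BE) {
            throw new IllegalArgumentException("not a SQLite WAL file");
        }
        int pageSize = buf.getInt(8);
        if (pageSize < 512 || Integer.bitCount(pageSize) != 1) {
            throw new IllegalArgumentException("invalid page size: " + pageSize);
        }
        int salt1 = buf.getInt(16);
        int salt2 = buf.getInt(20);
        int frameSize = FRAME_HEADER_SIZE + pageSize;

        List<Long> pending = new ArrayList<>();
        for (int off = WAL_HEADER_SIZE; off + frameSize <= wal.length; off += frameSize) {
            long pageNo = Integer.toUnsignedLong(buf.getInt(off));
            long dbSizeAfterCommit = Integer.toUnsignedLong(buf.getInt(off + 4));
            // A salt mismatch means the frame is left over from before a WAL reset.
            if (buf.getInt(off + 8) != salt1 || buf.getInt(off + 12) != salt2) {
                break;
            }
            pending.add(pageNo);
            if (dbSizeAfterCommit != 0) { // nonzero only on the commit frame
                committed.add(pending);
                pending = new ArrayList<>();
            }
        }
        return committed;
    }
}
```

In this sketch, a frame whose "database size after commit" field is nonzero closes the current transaction, and scanning stops at the first frame whose salts differ from the header. A real reader would also need to verify the cumulative frame checksums before trusting any frame.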
Review comment: This part reads more like a limitation or potential failure stemming from SQLite's WAL architecture. I recommend starting with an abstract that focuses on the primary objective (a source connector for SQLite) and emphasizes the business value it delivers.
Addresses Giovanni's review: proposal now evaluates CDC approaches through independent research (Litestream, LiteFS, rqlite, Turso, Marmot, sqlite-cdc, cr-sqlite) rather than relying primarily on mentor suggestions.
PDGGK
left a comment
Thank you for the feedback. I've updated the proposal with two changes:
- Added a "CDC state of the art" analysis evaluating existing approaches in the SQLite ecosystem: Litestream (page-level WAL streaming), LiteFS (FUSE-based page capture), rqlite 9.0 (pre-update hook CDC with Raft indices), Turso/libSQL (native engine-level CDC), Marmot (trigger-based Debezium-format events), sqlite-cdc (trigger-based with 97-127% write overhead benchmarks), and CRDT approaches (cr-sqlite, Corrosion). The analysis positions the WAL-based approach as the only one providing external monitoring + row-level events + zero write overhead simultaneously.
- Revised design decision rationale to reflect independent research (tracing through SQL Server's TxLogPosition, PostgreSQL's Lsn, and MySQL/SQL Server schema handling patterns) rather than citing mentor suggestions.
Also noted: @vsantonastaso as primary mentor.
kmos
left a comment
Thank you for your work. If you have not already done so, you may proceed with the official submission.
Previous version was a condensed summary. This version includes:
- Full WAL parsing code with checksum validation
- B-tree page decoder implementation
- Offset management with salt-based epoch tracking
- Schema evolution via HistorizedRelationalDatabaseSchema
- Snapshot-to-stream handoff design
- CDC state-of-the-art analysis (Litestream, LiteFS, rqlite, Turso, Marmot)
- Performance benchmarking plan
- Correctness test matrix
- Community collaboration strategy
Draft proposal for the SQLite Source Connector project under JBoss Community / Debezium.
Zulip thread: Zihan - SQLite Source Connector
Code contribution: debezium-platform#309
PoC: sqlite-wal-poc (WAL reader + b-tree page decoder, 11/11 tests passing)
Mentor: Giovanni Panice (@kmos)