fix: detect WAL changes by mtime #1078

Open
corylanou wants to merge 3 commits into main from codex/issue-1037

@corylanou
Collaborator

Summary

Use WAL mtime in the monitor loop’s cheap change detection and add a regression test for mtime-only WAL changes.

Problem

Issue #1037 reports replication stalling when WAL size and header remain unchanged even though writes continue. SQLite can reuse WAL space, so size/header can stay constant while WAL mtime changes on each write. The current optimization skips Sync() in that case, so replication silently stalls.

Solution

Track WAL mtime alongside size and header. Skip Sync() only if size, header, and mtime are unchanged.
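As a minimal sketch of the check described above (not the actual Litestream code), the comparison could look like the following. The `walState` type, its field names, and the `unchanged` method are hypothetical names introduced here for illustration only.

```go
package main

import (
	"bytes"
	"fmt"
	"time"
)

// walState is a hypothetical snapshot of the cheap-to-read WAL attributes
// that the monitor loop could remember between ticks.
type walState struct {
	size    int64     // WAL file size in bytes
	header  []byte    // raw WAL header bytes
	modTime time.Time // WAL file modification time
}

// unchanged reports whether all three attributes match the previous
// snapshot. Only in that case would it be safe to skip Sync().
func (s walState) unchanged(prev walState) bool {
	return s.size == prev.size &&
		bytes.Equal(s.header, prev.header) &&
		s.modTime.Equal(prev.modTime)
}

func main() {
	hdr := []byte("wal-header")
	t0 := time.Unix(1700000000, 0)

	prev := walState{size: 4152, header: hdr, modTime: t0}
	// Same size and header, but mtime moved: SQLite reused WAL space.
	cur := walState{size: 4152, header: hdr, modTime: t0.Add(time.Second)}

	fmt.Println("skip sync:", cur.unchanged(prev))
}
```

With size and header identical, the mtime difference alone makes `unchanged` return false, so the loop would fall through to `Sync()`.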

Scope

In scope:

  • Add WAL mtime to monitor loop change detection
  • Add a regression test for mtime-only changes

Not in scope:

  • WAL reader/format changes
  • Replica backend changes
  • New metrics/logging

Test Plan

go test -run TestDB_Monitor_ -v ./...

Related

Owner

@benbjohnson benbjohnson left a comment

@corylanou I'm not sure we can rely on mtime as its granularity may be too large. You could have a WAL write, then a sync, and then another WAL write all within the window of the mtime granularity (such that the mtime before the sync and after the sync are the same).

Are we implementing a file watcher on the WAL? I can't remember if that ever made it in there.
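The granularity concern above can be illustrated with a small simulation. This is only a sketch: it models a coarse filesystem timestamp by truncating write times to a fixed granularity, and the `mtimeCollides` helper and the one-second granularity are assumptions made for the example, not measurements of any real filesystem.

```go
package main

import (
	"fmt"
	"time"
)

// mtimeCollides reports whether two writes, given as offsets from an
// arbitrary base time, would be recorded with the same mtime on a
// filesystem whose timestamps have the given granularity (hypothetical model).
func mtimeCollides(firstWrite, secondWrite, granularity time.Duration) bool {
	base := time.Unix(1700000000, 0)
	a := base.Add(firstWrite).Truncate(granularity)
	b := base.Add(secondWrite).Truncate(granularity)
	return a.Equal(b)
}

func main() {
	// WAL write at +100ms, sync at +200ms, second WAL write at +300ms.
	// The sync records the mtime of the first write; the second write
	// lands in the same one-second window and leaves the mtime unchanged.
	collides := mtimeCollides(100*time.Millisecond, 300*time.Millisecond, time.Second)
	fmt.Println("second write invisible to mtime check:", collides)
}
```

When the function returns true, an mtime comparison made after the sync cannot distinguish "no new writes" from "one more write in the same window".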

@corylanou
Collaborator Author

Implemented WAL change detection without relying on filesystem mtime granularity.

Instead of mtime, DB.monitor() now compares the WAL-index (-shm) mxFrame value (plus WAL size & WAL header). mxFrame advances on each commit even when SQLite reuses WAL space, and it avoids missing writes on coarse mtime filesystems.
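For readers unfamiliar with the WAL-index, here is a rough sketch of reading mxFrame from the first bytes of the -shm file. It is not the code in this PR. It assumes the layout given in SQLite's WAL format documentation as I understand it (two 48-byte header copies, with mxFrame as a 32-bit integer at byte offset 16 of each copy), and it assumes a little-endian host, since the WAL-index is stored in native byte order. The `readMxFrame` name and the constants are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	walIndexHdrSize = 48 // assumed size of one WAL-index header copy
	mxFrameOffset   = 16 // assumed offset of mxFrame within a copy
)

// readMxFrame extracts mxFrame from the start of a -shm file image.
// It returns an error if the two header copies differ, which would
// indicate a write in progress and that the caller should retry.
func readMxFrame(shm []byte) (uint32, error) {
	if len(shm) < 2*walIndexHdrSize {
		return 0, errors.New("shm data too short for WAL-index header")
	}
	first := shm[:walIndexHdrSize]
	second := shm[walIndexHdrSize : 2*walIndexHdrSize]
	if !bytes.Equal(first, second) {
		return 0, errors.New("WAL-index header copies differ; retry")
	}
	// Assumption: little-endian host (e.g. amd64, arm64).
	return binary.LittleEndian.Uint32(first[mxFrameOffset:]), nil
}

func main() {
	// Build a fake header image with mxFrame = 42 in both copies.
	shm := make([]byte, 136)
	binary.LittleEndian.PutUint32(shm[mxFrameOffset:], 42)
	binary.LittleEndian.PutUint32(shm[walIndexHdrSize+mxFrameOffset:], 42)

	mx, err := readMxFrame(shm)
	fmt.Println(mx, err)
}
```

A monitor loop could then store the last seen mxFrame next to the WAL size and header and skip `Sync()` only when all three are unchanged.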

Also fixed a related restore/compaction edge case exposed by the integration suite: when SQLite bumps the post-commit DB size by multiple pages but only some of the new pages appear in the WAL, we now encode zero-filled pages for the missing newly-added page numbers so compaction can always rebuild a valid snapshot.
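The zero-fill rule described above might be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `missingPages` and `zeroPage` are hypothetical helpers, and page numbers are taken to be 1-based.

```go
package main

import "fmt"

// missingPages returns the newly added page numbers in the range
// (prevPageN, commitPageN] that have no frame in the WAL.
func missingPages(prevPageN, commitPageN uint32, walPages map[uint32]bool) []uint32 {
	var missing []uint32
	// Iterate in uint64 so the loop cannot wrap around at the uint32 limit.
	for pgno := uint64(prevPageN) + 1; pgno <= uint64(commitPageN); pgno++ {
		if !walPages[uint32(pgno)] {
			missing = append(missing, uint32(pgno))
		}
	}
	return missing
}

// zeroPage returns the zero-filled page content to encode for a missing page.
func zeroPage(pageSize int) []byte {
	return make([]byte, pageSize)
}

func main() {
	// The database grew from 10 to 13 pages, but only page 12 is in the WAL.
	inWAL := map[uint32]bool{12: true}
	for _, pgno := range missingPages(10, 13, inWAL) {
		fmt.Printf("page %d: encode %d zero bytes\n", pgno, len(zeroPage(4096)))
	}
}
```

In this example pages 11 and 13 would be encoded as zero-filled pages, so a later compaction sees every page number up to the commit size.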

Re: your question: there still isn't an fsnotify-based watcher on the WAL itself (only directory watching in cmd/litestream/directory_watcher.go); this remains polling-based via the monitor loop.

Development

Successfully merging this pull request may close these issues.

Writes silently fail to replicate when WAL size unchanged

2 participants