| 1 | +--- |
| 2 | +name: litestream |
| 3 | +description: >- |
| 4 | + Expert knowledge for contributing to Litestream, a standalone disaster recovery |
| 5 | + tool for SQLite. Provides architectural understanding, code patterns, critical |
| 6 | + rules, and debugging procedures for WAL monitoring, LTX replication format, |
| 7 | + storage backend implementation, multi-level compaction, and SQLite page |
| 8 | + management. Use when working with Litestream source code, writing storage |
| 9 | + backends, debugging replication issues, implementing compaction logic, or |
| 10 | + handling SQLite WAL operations. |
| 11 | +license: Apache-2.0 |
| 12 | +metadata: |
| 13 | + author: benbjohnson |
| 14 | + version: "1.0" |
| 15 | + repository: https://github.com/benbjohnson/litestream |
| 16 | +--- |
| 17 | + |
| 18 | +# Litestream Agent Skill |
| 19 | + |
| 20 | +Litestream is a standalone disaster recovery tool for SQLite. It runs as a |
| 21 | +background process, monitors the SQLite WAL (Write-Ahead Log), converts changes |
| 22 | +to immutable LTX files, and replicates them to cloud storage. It uses |
| 23 | +`modernc.org/sqlite` (pure Go, no CGO required). |
| 24 | + |
| 25 | +## Quick Start |
| 26 | + |
| 27 | +```bash |
| 28 | +# Build |
| 29 | +go build -o bin/litestream ./cmd/litestream |
| 30 | + |
| 31 | +# Test (always use race detector) |
| 32 | +go test -race -v ./... |
| 33 | + |
| 34 | +# Code quality |
| 35 | +pre-commit run --all-files |
| 36 | +``` |
| 37 | + |
| 38 | +## Critical Rules |
| 39 | + |
| 40 | +These invariants must never be violated: |
| 41 | + |
| 42 | +### 1. Lock Page at 1GB |
| 43 | + |
| 44 | +SQLite reserves a page at byte offset 0x40000000 (1 GB). Always skip it during |
| 45 | +replication and compaction. The page number varies by page size: |
| 46 | + |
| 47 | +| Page Size | Lock Page Number | |
| 48 | +|-----------|------------------| |
| 49 | +| 4 KB | 262145 | |
| 50 | +| 8 KB | 131073 | |
| 51 | +| 16 KB | 65537 | |
| 52 | +| 32 KB | 32769 | |
| 53 | + |
| 54 | +```go |
| 55 | +lockPgno := ltx.LockPgno(pageSize) |
| 56 | +if pgno == lockPgno { |
| 57 | + continue |
| 58 | +} |
| 59 | +``` |
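
The table values follow directly from the 1 GB offset: divide 0x40000000 by the page size and add one, because SQLite page numbers are 1-based. The sketch below reproduces that arithmetic. `lockPgno` and `pendingByte` are hypothetical names used only for illustration; in Litestream code, keep calling `ltx.LockPgno`.

```go
package main

// pendingByte is the byte offset where SQLite's lock page starts (1 GB).
const pendingByte = 0x40000000

// lockPgno is a hypothetical stand-in for ltx.LockPgno. It returns the
// 1-based number of the page that begins at the 1 GB offset.
func lockPgno(pageSize uint32) uint32 {
	return pendingByte/pageSize + 1
}
```

For example, `lockPgno(4096)` evaluates to 262145, matching the first row of the table.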
| 60 | + |
| 61 | +### 2. LTX Files Are Immutable |
| 62 | + |
| 63 | +Once an LTX file is written, it must never be modified. New changes create new |
| 64 | +files. This guarantees point-in-time recovery integrity. |
| 65 | + |
| 66 | +### 3. Single Replica per Database |
| 67 | + |
| 68 | +Each database replicates to exactly one destination. The Replica component |
| 69 | +manages replication mechanics; database state belongs in the DB layer. |
| 70 | + |
| 71 | +### 4. Read Local Before Remote During Compaction |
| 72 | + |
| 73 | +Cloud storage backends may be eventually consistent, so a file that was just written might not be visible remotely yet. Always read from local disk first: |
| 74 | + |
| 75 | +```go |
| 76 | +f, err := os.Open(db.LTXPath(info.Level, info.MinTXID, info.MaxTXID)) |
| 77 | +if err == nil { |
| 78 | + return f, nil // Use local copy |
| 79 | +} |
| 80 | +return replica.Client.OpenLTXFile(...) // Fall back to remote |
| 81 | +``` |
| 82 | + |
| 83 | +### 5. Preserve Timestamps During Compaction |
| 84 | + |
| 85 | +Set the compacted file's `CreatedAt` to the earliest source file timestamp to |
| 86 | +maintain temporal granularity for point-in-time restoration. |
| 87 | + |
| 88 | +```go |
| 89 | +info.CreatedAt = oldestSourceFile.CreatedAt |
| 90 | +``` |
| 91 | + |
| 92 | +### 6. Use Lock() Not RLock() for Writes |
| 93 | + |
| 94 | +```go |
| 95 | +// CORRECT |
| 96 | +r.mu.Lock() |
| 97 | +defer r.mu.Unlock() |
| 98 | +r.pos = pos |
| 99 | + |
| 100 | +// WRONG - race condition |
| 101 | +r.mu.RLock() |
| 102 | +defer r.mu.RUnlock() |
| 103 | +r.pos = pos |
| 104 | +``` |
| 105 | + |
| 106 | +### 7. Atomic File Operations |
| 107 | + |
| 108 | +Always write to a temp file then rename. Never write directly to the final path. |
| 109 | + |
| 110 | +```go |
| 111 | +tmpFile, err := os.CreateTemp(dir, ".tmp-*") |
| 112 | +// ... write data, sync ... |
| 113 | +os.Rename(tmpFile.Name(), finalPath) |
| 114 | +``` |
| 115 | + |
| 116 | +## Architecture |
| 117 | + |
| 118 | +### System Layers |
| 119 | + |
| 120 | +| Layer | File(s) | Responsibility | |
| 121 | +|---------|--------------------------|-------------------------------------------| |
| 122 | +| App | `cmd/litestream/` | CLI commands, YAML/env config | |
| 123 | +| Store | `store.go` | Multi-DB coordination, compaction | |
| 124 | +| DB | `db.go` | Single DB management, WAL monitoring | |
| 125 | +| Replica | `replica.go` | Replication to one destination | |
| 126 | +| Storage | `*/replica_client.go` | Backend implementations (S3, GCS, etc.) | |
| 127 | + |
| 128 | +Database state logic belongs in the DB layer, not the Replica layer. |
| 129 | + |
| 130 | +### ReplicaClient Interface |
| 131 | + |
| 132 | +All storage backends implement this interface from `replica_client.go`: |
| 133 | + |
| 134 | +```go |
| 135 | +type ReplicaClient interface { |
| 136 | + Type() string |
| 137 | + Init(ctx context.Context) error |
| 138 | + LTXFiles(ctx context.Context, level int, seek ltx.TXID, useMetadata bool) (ltx.FileIterator, error) |
| 139 | + OpenLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, offset, size int64) (io.ReadCloser, error) |
| 140 | + WriteLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, r io.Reader) (*ltx.FileInfo, error) |
| 141 | + DeleteLTXFiles(ctx context.Context, a []*ltx.FileInfo) error |
| 142 | + DeleteAll(ctx context.Context) error |
| 143 | +} |
| 144 | +``` |
| 145 | + |
| 146 | +Key contract details: |
| 147 | +- `OpenLTXFile` must return `os.ErrNotExist` when the file is missing |
| 148 | +- `WriteLTXFile` must set `CreatedAt` from backend metadata or upload time |
| 149 | +- `LTXFiles` with `useMetadata=true` fetches accurate timestamps (for PIT restore) |
| 150 | +- `LTXFiles` with `useMetadata=false` uses fast timestamps (normal operations) |
| 151 | + |
| 152 | +### Lock Ordering |
| 153 | + |
| 154 | +Always acquire locks in this order to prevent deadlocks: |
| 155 | + |
| 156 | +1. `Store.mu` |
| 157 | +2. `DB.mu` |
| 158 | +3. `DB.chkMu` |
| 159 | +4. `Replica.mu` |
| 160 | + |
| 161 | +### Core Components |
| 162 | + |
| 163 | +**DB** (`db.go`): Manages SQLite connection, WAL monitoring, checkpointing, and |
| 164 | +a long-running read transaction for consistency. Key fields: `path`, `db`, `rtx` |
| 165 | +(read transaction), `pageSize`, `notify` channel. |
| 166 | + |
| 167 | +**Replica** (`replica.go`): Tracks replication position (`ltx.Pos` with TXID, |
| 168 | +PageNo, Checksum). One replica per database. |
| 169 | + |
| 170 | +**Store** (`store.go`): Coordinates multiple databases and schedules compaction |
| 171 | +across levels. |
| 172 | + |
| 173 | +## LTX File Format |
| 174 | + |
| 175 | +LTX (Log Transaction) files are immutable, checksummed archives of database |
| 176 | +changes. Structure: |
| 177 | + |
| 178 | +``` |
| 179 | ++------------------+ |
| 180 | +| Header | 100 bytes (magic "LTX1", page size, TXID range, timestamp) |
| 181 | ++------------------+ |
| 182 | +| Page Frames | 4-byte pgno + pageSize bytes data, per page |
| 183 | ++------------------+ |
| 184 | +| Page Index | Binary search index for page lookup |
| 185 | ++------------------+ |
| 186 | +| Trailer | 16 bytes (post-apply checksum, file checksum) |
| 187 | ++------------------+ |
| 188 | +``` |
| 189 | + |
| 190 | +### Naming Convention |
| 191 | + |
| 192 | +``` |
| 193 | +Format: MMMMMMMMMMMMMMMM-NNNNNNNNNNNNNNNN.ltx |
| 194 | +Example: 0000000000000001-0000000000000064.ltx (TXID 1-100) |
| 195 | +``` |
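
The example name is consistent with two 16-digit, zero-padded, lowercase hexadecimal TXIDs (100 decimal is 0x64). A minimal sketch of producing such a name follows. `formatLTXFilename` is a hypothetical helper, and TXIDs are modeled as plain `uint64` values rather than `ltx.TXID`.

```go
package main

import "fmt"

// formatLTXFilename (hypothetical) renders the min and max TXIDs as
// 16-digit zero-padded hex, joined by a dash, with the .ltx suffix.
func formatLTXFilename(minTXID, maxTXID uint64) string {
	return fmt.Sprintf("%016x-%016x.ltx", minTXID, maxTXID)
}
```

Calling `formatLTXFilename(1, 100)` yields `0000000000000001-0000000000000064.ltx`, the example shown above.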
| 196 | + |
| 197 | +### Compaction Levels |
| 198 | + |
| 199 | +``` |
| 200 | +Level 0: /ltx/0000/ Raw LTX files (no compaction) |
| 201 | +Level 1: /ltx/0001/ Compacted periodically |
| 202 | +Level 2: /ltx/0002/ Compacted less frequently |
| 203 | +``` |
| 204 | + |
| 205 | +Default compaction levels: L0 (raw), L1 (30s), L2 (5min), L3 (1h), plus daily |
| 206 | +snapshots. Compaction merges files by deduplicating pages (latest version wins) |
| 207 | +and always skips the lock page. |
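
The merge rule (latest version of each page wins, lock page dropped) can be illustrated with an in-memory sketch. `page` and `mergePages` are hypothetical names; real compaction streams LTX files rather than holding every page in a map.

```go
package main

import "sort"

// pendingByte is the byte offset of SQLite's lock page (1 GB).
const pendingByte = 0x40000000

// page is a hypothetical in-memory page: its number and raw data.
type page struct {
	Pgno uint32
	Data []byte
}

// mergePages merges page sets given in oldest-to-newest order. Later sets
// overwrite earlier ones, the lock page is skipped, and the result is
// sorted by page number.
func mergePages(pageSize uint32, sets ...[]page) []page {
	lockPgno := pendingByte/pageSize + 1
	latest := make(map[uint32]page)
	for _, set := range sets {
		for _, p := range set {
			if p.Pgno == lockPgno {
				continue // never carry the lock page forward
			}
			latest[p.Pgno] = p
		}
	}
	out := make([]page, 0, len(latest))
	for _, p := range latest {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Pgno < out[j].Pgno })
	return out
}
```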
| 208 | + |
| 209 | +## Code Patterns |
| 210 | + |
| 211 | +### DO |
| 212 | + |
| 213 | +- Return errors immediately; let callers decide handling |
| 214 | +- Use `fmt.Errorf("context: %w", err)` for error wrapping |
| 215 | +- Handle database state in the DB layer, not Replica |
| 216 | +- Use `db.verify()` to trigger snapshots (don't reimplement) |
| 217 | +- Test with race detector: `go test -race` |
| 218 | +- Use lazy iterators for `LTXFiles` (paginate, don't load all at once) |
| 219 | + |
| 220 | +### DON'T |
| 221 | + |
| 222 | +- Write data at the 1 GB lock page boundary |
| 223 | +- Modify LTX files after creation |
| 224 | +- Put database state logic in the Replica layer |
| 225 | +- Use `RLock()` when writing shared state |
| 226 | +- Write directly to final file paths (use temp + rename) |
| 227 | +- Ignore context cancellation in long operations |
| 228 | +- Return generic errors instead of `os.ErrNotExist` for missing files |
| 229 | + |
| 230 | +## Specialized Knowledge Areas |
| 231 | + |
| 232 | +Load reference files on demand based on the task: |
| 233 | + |
| 234 | +| Task | Reference File | |
| 235 | +|-----------------------------------|-----------------------------------------| |
| 236 | +| Understanding system design | `references/ARCHITECTURE.md` | |
| 237 | +| Writing or reviewing code | `references/PATTERNS.md` | |
| 238 | +| Working with LTX files | `references/LTX_FORMAT.md` | |
| 239 | +| WAL monitoring or page operations | `references/SQLITE_INTERNALS.md` | |
| 240 | +| Implementing storage backends | `references/REPLICA_CLIENT_GUIDE.md` | |
| 241 | +| Writing or debugging tests | `references/TESTING_GUIDE.md` | |
| 242 | + |
| 243 | +## Common Debugging Procedures |
| 244 | + |
| 245 | +### Replication Not Working |
| 246 | + |
| 247 | +1. Verify WAL mode: `PRAGMA journal_mode` must return `wal` |
| 248 | +2. Check monitor interval and that the monitor goroutine is running |
| 249 | +3. Confirm `db.notify` channel is being signaled on WAL changes |
| 250 | +4. Check replica position: `replica.Pos()` should advance with writes |
| 251 | +5. Look for `os.ErrNotExist` from `OpenLTXFile` (file not replicated yet) |
| 252 | + |
| 253 | +### Large Database Issues (>1 GB) |
| 254 | + |
| 255 | +1. Verify lock page is being skipped: check `ltx.LockPgno(pageSize)` |
| 256 | +2. Test with multiple page sizes (4K, 8K, 16K, 32K) |
| 257 | +3. Run with databases both smaller and larger than 1 GB |
| 258 | +4. Ensure page iteration loops include the `continue` guard for lock page |
| 259 | + |
| 260 | +### Compaction Problems |
| 261 | + |
| 262 | +1. Confirm local L0 files exist before compaction reads them |
| 263 | +2. Check that `CreatedAt` timestamps are preserved (earliest source) |
| 264 | +3. Verify compaction level intervals in `Store.levels` |
| 265 | +4. Look for eventual consistency issues if reading from remote storage |
| 266 | + |
| 267 | +### Storage Backend Issues |
| 268 | + |
| 269 | +1. Return `os.ErrNotExist` for missing files (not generic errors) |
| 270 | +2. Support partial reads via `offset`/`size` in `OpenLTXFile` |
| 271 | +3. Handle context cancellation in all methods |
| 272 | +4. Test concurrent operations with `-race` flag |
| 273 | +5. For eventually consistent backends, add retry logic with backoff |
| 274 | + |
| 275 | +## Contribution Guidelines |
| 276 | + |
| 277 | +### What's Accepted |
| 278 | + |
| 279 | +- Bug fixes and patches (welcome) |
| 280 | +- Documentation improvements |
| 281 | +- Small code improvements and performance optimizations |
| 282 | +- Security vulnerability reports (report privately) |
| 283 | + |
| 284 | +### Discuss First |
| 285 | + |
| 286 | +- Feature requests: open an issue before implementing |
| 287 | +- Large changes: discuss approach in an issue first |
| 288 | + |
| 289 | +### Pre-Submit Checklist |
| 290 | + |
| 291 | +- [ ] Read relevant docs from the reference table above |
| 292 | +- [ ] Follow patterns in `references/PATTERNS.md` |
| 293 | +- [ ] Run `go test -race -v ./...` |
| 294 | +- [ ] Run `pre-commit run --all-files` |
| 295 | +- [ ] For page iteration: test with >1 GB databases |
| 296 | +- [ ] Show investigation evidence in PR (see CONTRIBUTING.md) |
| 297 | + |
| 298 | +## Testing |
| 299 | + |
| 300 | +```bash |
| 301 | +# Full test suite with race detection |
| 302 | +go test -race -v ./... |
| 303 | + |
| 304 | +# Specific areas |
| 305 | +go test -race -v -run TestReplica_Sync ./... |
| 306 | +go test -race -v -run TestDB_Sync ./... |
| 307 | +go test -race -v -run TestStore_CompactDB ./... |
| 308 | + |
| 309 | +# Coverage |
| 310 | +go test -coverprofile=coverage.out ./... |
| 311 | +go tool cover -html=coverage.out |
| 312 | +``` |
| 313 | + |
| 314 | +Key testing areas: |
| 315 | +- Lock page handling with >1 GB databases and multiple page sizes |
| 316 | +- Race conditions in position updates, WAL monitoring, and checkpointing |
| 317 | +- Eventual consistency in storage backend operations |
| 318 | +- Atomic file operations and cleanup on error paths |
| 319 | + |
| 320 | +## Environment Validation |
| 321 | + |
| 322 | +Run `scripts/validate-setup.sh` to verify that your environment is correctly |
| 323 | +configured for Litestream development. |