-
Notifications
You must be signed in to change notification settings - Fork 324
fix(vfs): critical bug fixes and comprehensive testing infrastructure #835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
corylanou
wants to merge
16
commits into
main
Choose a base branch
from
vfs-integration-tests
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Owner
One issue we may have with the existing implementation is that I'm not sure if continuous read transactions will block updates indefinitely. It's not something we need to fix before release but it's good to know. The VFS is not meant for high volume or high performance usage so it's unlikely to be a big issue. |
Add extensive integration tests covering high-load scenarios, concurrent reads/writes, long-running soak tests, and large result set sorting. Includes temp file support infrastructure for SQLite operations. Tests: - TestVFS_HighLoadConcurrentReads: 8 concurrent readers with continuous write operations (inserts/updates/deletes) at 50ms sync intervals - TestVFS_LongRunningSoak: Sustained 5-10 minute test with 2 writers and 4 readers at aggressive 75ms intervals - TestVFS_SortingLargeResultSet: 25,000 row sorting test with PRAGMA temp_store=FILE to verify temp file handling Implementation: - Add localTempFile implementation for SQLite temp/transient files - Add VFS temp file management (tracking, deletion, access control) - Add test helpers for data seeding and replica synchronization - Use build tag 'soak' for long-running tests Addresses comprehensive VFS testing requirements including edge cases, performance under load, and SQLite temp file operations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add extensive test coverage for VFS functionality including: Integration tests (cmd/litestream-vfs/main_test.go): - High-load concurrent read/write scenarios with 8+ concurrent readers - Long-running transaction stress tests validating snapshot isolation - Overlapping transaction commit storms with rapid BEGIN/COMMIT cycles - Storage backend failure injection (timeout, server errors, partial reads) - Multiple page size support (512B-65KB) with data integrity validation - Polling thread recovery from transient backend failures - Rapid update coalescing with millisecond-level synchronization - Poll interval edge cases (fast 5ms and slow 200ms intervals) - Initial snapshot blocking behavior Unit tests (vfs_lock_test.go): - SQLite lock state machine validation (NONE→SHARED→RESERVED→EXCLUSIVE) - Pending index race conditions with concurrent index updates - Index memory leak detection ensuring bounded growth - Auto-vacuum shrinking with commit size reduction - Database header journal mode rewriting to DELETE - Lock page boundary handling at 1GB offset for all page sizes - Temp file lifecycle stress with 400 concurrent operations - Temp file collision handling and delete-on-close cleanup - Temp directory exhaustion error propagation - Context cancellation propagation to blocked polling threads Production improvements: - Add full temp file support in VFS for SQLite temp databases/journals - Fix VFS.Close() to properly cancel context and wait for goroutines - Add FetchLTXHeader() helper for retrieving LTX file metadata Documentation: - Add comprehensive VFS test plan (docs/VFS_TEST_PLAN.md) - Document 34 test scenarios with 62% completion (21/34 tests) - Track known issues including hardcoded page size assumption These tests exercise critical VFS functionality including transaction isolation, concurrent access patterns, failure recovery, and SQLite-specific behaviors like lock pages and temp file management. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…modes Expand VFS test coverage from 65% to 91% with 9 new integration tests targeting production edge cases: Integration tests (main_test.go): - TestVFS_CacheMissStorm: Behavior under cache pressure (5KB cache) - TestVFS_NetworkLatencySensitivity: Replica behavior with network latency - TestVFS_ConcurrentConnectionScaling: 32 concurrent connections under load - TestVFS_PartialLTXUpload: Recovery from incomplete LTX file uploads - TestVFS_S3EventualConsistency: Behavior with eventual consistency delays - TestVFS_FileDescriptorBudget: FD management under ulimit constraints - TestVFS_PageIndexOOM: Handling OOM during page index loading - TestVFS_PageIndexCorruptionRecovery: Recovery from corrupted index data - BenchmarkVFS_LargeDatabase: Performance benchmarking on 20k row database Specialized test suites (require build tags): - chaos_test.go: Random 5% failure injection with 16 concurrent readers - fuzz_test.go: Fuzzing harness for read pattern validation - stress_test.go: Race detector stress test with 64 readers Test infrastructure improvements: - Enhanced error detection for sqlite3.Error types and "SQL logic error" - New mock clients: latencyReplicaClient, eventualConsistencyClient, fdLimitedReplicaClient, oomPageIndexClient, corruptingPageIndexClient - Added waitForTableRowCount() helper for table-specific synchronization - Timing adjustments in existing tests for improved reliability Unit tests (vfs_lock_test.go): - TestVFSFile_CorruptedPageIndexRecovery: Corrupted index data handling Documentation: - Update VFS_TEST_PLAN.md progress tracking (31/34 tests, 91% complete) - Add Ben's guidance note on high-concurrency test expectations These tests validate production readiness by exercising critical failure scenarios: network conditions, resource exhaustion, data corruption, and eventual consistency behaviors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…ed error handling Merges Ben's dual-level polling architecture (PR #837) with resilient error handling for production stability: - Add retry logic to ReadAt() with exponential backoff (6 attempts, 15ms base delay) - Implement isRetryablePageError() for transient failures (EOF, context timeout, NotExist) - Preserve L0/L1 dual polling with maxTXID1 tracking and gap detection - Add commit rollback detection via baseCommit/replaceIndex logic - Extend isBusyError() to handle "converting NULL to int" scan errors - Fix chaos_test.go to retry on busy errors in rows.Scan() and rows.Err() paths Tests passing: - TestVFS_PageIndexCorruptionRecovery - TestVFS_MultiplePageSizes - TestVFS_FuzzSeedCorpus - TestVFS_S3EventualConsistency - TestVFS_FileDescriptorBudget - TestVFS_LongRunningTxnStress - TestVFS_ChaosEngineering (with chaos tag) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Replace wrapper-based failure injection with direct error injection mechanism for cleaner, more deterministic testing. Changes: - Add per-read context markers (IsVFSPageFetchContext, PageFetchFileName) - Add InjectNextVFSReadError() for test failure injection - Add pendingReplace flag to handle full index replacement during locks - Simplify TestVFS_StorageFailureInjection and TestVFS_PartialLTXUpload - Fix TestVFSFile_NonContiguousTXIDError to verify gap behavior - Add TestVFSFile_PendingIndexReplacementRemovesStalePages The new injection mechanism eliminates the need for wrapper clients, making tests more focused and easier to understand. Context markers enable future observability features. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Replace global pageReadFailures map and InjectNextVFSReadError() with VFSReadInterceptor interface. This eliminates global mutable state and enables per-VFS instance interceptors for improved testability and concurrent usage. Changes: - Add VFSReadInterceptor interface for observing page reads - Add SetReadInterceptor() methods on VFS and VFSFile - Remove global pageReadFailures map and associated mutex - Update tests to use local vfsReadErrorInjector instances 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add extensive VFS test coverage using wrapper-based failure injection. Test infrastructure: - testVFS wrapper with per-database error injection - injectingFile for ReadAt() interception - faultInjectingReplicaClient for storage-level failures Test coverage (31 integration tests): - Concurrent access (100+ readers, 32 connections) - Stress scenarios (high load, 20K rows) - Failure injection (storage errors, partial reads) - Edge cases (long txns, empty DBs) - L1 compaction support Additional test suites: - Chaos tests (5% random failures) - Fuzz tests (random read patterns) - Stress tests (memory pressure) - Lock unit tests (state machine) Production enhancements: - Dual L0/L1 polling - Resilient page fetch (retry logic) - Temp file support - Context helpers for testing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Replace context-based error injection with direct InjectNextVFSReadError() function. The new approach stores failures in a global map keyed by database path, eliminating the need to thread context through ReplicaClient calls. This simplifies the testing infrastructure by removing: - faultInjectingReplicaClient wrapper - Context value propagation through OpenLTXFile - 68 lines of test scaffolding The new approach uses a single injection point in VFSFile.ReadAt() that checks for injected errors before each page read. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Refactor Delete() error handling for clearer control flow - Ignore ErrNotExist when removing temp files from disk - Seed maxTXID1 during Open() based on restore plan for correct dual-polling - Add tests for Open() position seeding and Delete() edge cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Replace counter-based lock tracking with actual lock type storage in localTempFile. CheckReservedLock() now accurately reflects whether the lock is at Reserved level or higher, rather than just checking if any lock operation occurred. Add TestLocalTempFileLocking to validate lock state transitions and CheckReservedLock behavior. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add missing documentation comment for the FetchLTXHeader helper function to improve code clarity. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Move all VFS unit tests from vfs_lock_test.go into vfs_test.go for better organization alongside the integration tests. Inline the localTempFile implementation directly in vfs.go using sync/atomic for lock state tracking, removing the need for vfs_temp_file.go. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
ad1c100 to
3428aad
Compare
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
4843b99 to
3e3e01a
Compare
Removes contextWithPageFetch, IsVFSPageFetchContext, and PageFetchFileName which are no longer needed. Simplifies ReadAt to use f.ctx directly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Fix two critical VFS bugs: 1. Temp file deletion: Track temp file names separately from paths to tolerate SQLite cleanup of already-deleted files. SQLite may attempt to delete temp/journal files multiple times; we now return success (os.ErrNotExist) for known temp files even if already unlinked, preventing SQLITE_IOERR_DELETE errors while still protecting the main replica from deletion. 2. Cache page offset: Use actual negotiated page size instead of hardcoded 4096 bytes when calculating offsets in cached pages. Previous implementation panicked on sub-4KB pages and returned corrupt data on larger page sizes. Changes: - Add tempNames map to remember all temp files ever created - Refactor Delete() to distinguish missing temp files from protected files - Add wasTempFileName() and unregisterTempFile() helpers - Fix ReadAt cache hit to use pageSize instead of magic number 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
… directories Previously, temp files like "foo/shared.db" and "bar/shared.db" would collide because canonicalTempName() used filepath.Base(), discarding directory information. This caused the second file to overwrite or interfere with the first. Changes: - Use filepath.Clean() instead of filepath.Base() in canonicalTempName to preserve full path information - Add tempFilenameFromCanonical() to generate unique filesystem names by hashing the canonical path and appending to basename (e.g., "shared.db-1234567890abcdef") - Update trackTempFile() to accept pre-computed canonical name - Add test TestVFS_TempFileSameBasenameDifferentDirs verifying unique paths and independent lifecycle management This ensures temp files with identical basenames but different parent directories are stored and tracked separately, preventing data corruption and registration conflicts. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
benbjohnson
approved these changes
Nov 27, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Fix critical VFS bugs that prevented production use and add comprehensive testing infrastructure to ensure reliability.
Critical Bug Fixes
Page Size Support
ReadAt()cache offset calculation (vfs.go:517)Temp File Handling
Lock State Tracking
CheckReservedLock()to report true when file holds at least a reserved lock (vfs.go:625-678)localTempFilehelper uses atomic lock-type storage (vfs.go:252-321), butVFSFileuses direct state checkingNetwork Resilience
Commit Rollback Detection
pendingReplacelogic to detect database shrinkage (vfs.go:300-320)FileSize()returns accurate values after vacuuming, trimmed pages disappear correctlyTesting Infrastructure
New Test Files
cmd/litestream-vfs/main_test.go- Integration tests expanded to 31 scenarios (+1874 lines)vfs_test.go- Unit tests for VFS internals (987 lines)cmd/litestream-vfs/chaos_test.go- Chaos engineering with network failure injection (230 lines)cmd/litestream-vfs/fuzz_test.go- Fuzzing harness for replica read patterns (162 lines)cmd/litestream-vfs/stress_test.go- Race detector stress tests (96 lines)cmd/litestream-vfs/vfs_soak_test.go- Long-running stability tests (140 lines)docs/VFS_TEST_PLAN.md- Comprehensive test plan with 34 scenarios (1068 lines)Test Coverage
Production Readiness
These fixes enable VFS to:
🤖 Generated with Claude Code