fix(restore): properly propagate compaction errors during restore #858
base: main
Conversation
Add validation in Restore() to check that LTX files have valid sizes before attempting to open and compact them. Files with size less than the LTX header size (100 bytes) now return a clear error message identifying the problematic file instead of the cryptic "decode database: decode header: EOF" error.

Fixes #848

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
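As a sketch of the defensive check this PR describes (the helper name and signature below are hypothetical, not litestream's actual code), a file shorter than the 100-byte LTX header can be rejected up front with a descriptive error:

```go
// Sketch of the defensive size check; validateLTXSize is a hypothetical
// helper, not litestream's real API.
package main

import "fmt"

// HeaderSize mirrors the 100-byte LTX header size the PR checks against.
const HeaderSize = 100

// validateLTXSize rejects files too small to contain a valid LTX header,
// so callers get a clear error naming the file instead of a later
// "decode header: EOF" from the decoder.
func validateLTXSize(name string, size int64) error {
	if size < HeaderSize {
		return fmt.Errorf("ltx file %q size %d is smaller than header size %d; file is truncated or corrupt",
			name, size, HeaderSize)
	}
	return nil
}

func main() {
	if err := validateLTXSize("0000000000000001-0000000000000001.ltx", 0); err != nil {
		fmt.Println(err)
	}
}
```

The check costs one stat per file and turns a cryptic mid-restore failure into an actionable message.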
This can't be a real fix.
With litestream v3 everything works fine (I downgraded for now).
@smerschjohann No, this is not the fix. I talked with @corylanou offline about it and decided that a length check is probably a good idea but not the root cause.
Ah, I see. I just wanted to mention it. Thanks for looking into it.
Thanks for the feedback @smerschjohann, and you raise the key question: why are files ending up broken in the first place? You're right - this PR is defensive (a better error message), not a root-cause fix. As Ben mentioned, that's intentional while we investigate further.

My Investigation So Far

The error in question is the cryptic `decode database: decode header: EOF` reported during restore.
The restore flow:

```mermaid
flowchart TB
    A["1. CalcRestorePlan()"] --> B["2. LTXFiles()<br/><i>list files in storage</i>"]
    B --> C["3. OpenLTXFile()<br/><i>open each file</i>"]
    C --> D["4. ltx.NewCompactor()<br/><i>read headers from readers</i>"]
    D --> E["5. DecodeDatabaseTo()<br/><i>write to disk</i>"]
    E --> F[("Restored DB")]
```
Files examined
Potential causes I'm considering
Related: PR #807 — I fixed …

Key difference from v0.3.x: …
To Help Diagnose

@smerschjohann - Since you can reproduce this consistently and you're asking the right question (why are files broken), could you provide:

1. File listing when the error occurs:

   ```shell
   gsutil ls -la gs://your-bucket/path/
   ```

   This will confirm if files are actually 0 bytes or suspiciously small (< 100 bytes).

2. Debug logs during replication (before you stop litestream):

   ```shell
   # Run litestream with debug logging
   litestream replicate -log-level debug -config litestream.yml
   ```

   This might show what's happening when files are being written.

3. Debug logs during restore:

   ```shell
   litestream restore -log-level debug -o test.db gs://your-bucket/path
   ```

   This will show which specific file it fails on.

For @benbjohnson: Any thoughts on other areas worth diving deeper into?
There is no -log-level flag.
OK, found out that it is `LOG_LEVEL=DEBUG`.
@smerschjohann Thanks for those logs - they're really helpful. So the compaction shows …

Can you run these to dump the actual file content?

```shell
# Check if the file has valid LTX header (should start with "LTX1" / 4c 54 58 31)
gsutil cat gs://BUCKET_NAME/test/ltx/1/0000000000000001-0000000000000001.ltx | xxd | head -10

# Or download and check locally
gsutil cp gs://BUCKET_NAME/test/ltx/1/0000000000000001-0000000000000001.ltx /tmp/test.ltx
ls -la /tmp/test.ltx
xxd /tmp/test.ltx | head -20
```

If the local litestream cache still exists (run from your test directory):

```shell
cd /home/simon/tmp/test
xxd .database.db-litestream/ltx/1/0000000000000001-0000000000000001.ltx | head -20
```

If the hex dump shows valid LTX headers (…
The files look good. Maybe this line from the replicate command is also relevant?
…failures

- Add detailed restore plan logging showing each file's level, TXID range, and size
- Fix compaction error capture (was being swallowed due to variable scoping)
- Improve decode error message to include first file info for debugging EOF errors
- Add warning when restore plan starts from non-TXID-1 (seeded replica detection)
- Add GCS-specific fail-fast for truncated files (< header size)
- Add debug logging when opening LTX files from GCS

These changes help diagnose issue #858 where GCS restores fail with "decode database: decode header: EOF" by providing better visibility into the restore plan and catching truncated files early.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
I just checked out your branch and tried with it, and it works - multiple times. When I use the executable from the release page, it does not work. Currently trying to figure out if it is one of the static build flags.
I successfully dissected the issue, and this commit fixes it: 5cb6fd6. Can you create another release?
Fix variable scoping bug where compaction errors were being lost. The error from c.Compact() was captured in an if-statement scope, but the outer `err` variable (nil on NewCompactor success) was passed to CloseWithError, causing compaction failures to result in misleading EOF errors instead of the actual error.

Note: The root cause of issue #858 (GCS restore EOF errors) was already fixed in PR #807 (commit 5cb6fd6), which translates size=0 to length=-1 for GCS range reads. This PR fixes a secondary bug discovered during investigation.

Related: #807

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
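The scoping bug described above can be reproduced in miniature. The code below is an illustrative reconstruction, not litestream's actual code: the inner `err` declared in the if-statement shadows the outer one, so the pipe is closed with a stale `nil` and the reader sees a plain EOF instead of the compaction error.

```go
// Illustrative reconstruction of the shadowing bug and its fix.
package main

import (
	"errors"
	"fmt"
	"io"
)

var errCompact = errors.New("compaction failed: checksum mismatch")

// compact stands in for c.Compact() and always fails here.
func compact() error { return errCompact }

func restoreBuggy(pw *io.PipeWriter) {
	var err error // nil after a successful "NewCompactor"
	if err := compact(); err != nil {
		_ = err // shadowed: the real error never escapes this block
	}
	pw.CloseWithError(err) // outer err is still nil -> reader sees bare EOF
}

func restoreFixed(pw *io.PipeWriter) {
	var err error
	if cerr := compact(); cerr != nil {
		err = cerr // assign to the outer variable instead of shadowing
	}
	pw.CloseWithError(err) // reader now sees the real compaction error
}

// readAll drains the pipe and reports the error the reader observes.
func readAll(run func(*io.PipeWriter)) error {
	pr, pw := io.Pipe()
	go run(pw)
	_, err := io.ReadAll(pr)
	return err
}

func main() {
	fmt.Println("buggy:", readAll(restoreBuggy)) // buggy: <nil>
	fmt.Println("fixed:", readAll(restoreFixed)) // fixed: compaction failed: checksum mismatch
}
```

Because `PipeWriter.CloseWithError(nil)` behaves like `Close()`, the reader's `io.ReadAll` finishes with no error at all in the buggy path, which is exactly the misleading "EOF, nothing wrong here" symptom.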
Force-pushed from 51ddcea to e4c9b89
Thanks for the git bisect work @smerschjohann! That confirms it - commit 5cb6fd6 (PR #807) is indeed the fix for your issue.

Why we couldn't reproduce: We were testing on the …

Good news: A new release is coming soon - we're preparing to release the VFS functionality, and that release will include the fix from #807.

This PR (#858) started as an investigation into your issue, but since #807 already fixed the root cause, we've repurposed it to fix a secondary bug we discovered during the investigation (compaction errors weren't being properly propagated).
Summary
Fix variable scoping bug where compaction errors were being lost during restore operations.
The Bug
The error from `c.Compact()` was captured in an if-statement scope, but the outer `err` variable (which was `nil` after a successful `NewCompactor`) was passed to `CloseWithError`. This caused compaction failures to result in misleading EOF errors instead of the actual error message.

Relation to Issue #858
Note: The root cause of the GCS restore EOF errors reported in issue #858 was already fixed in PR #807 (commit 5cb6fd6), which translates `size=0` to `length=-1` for GCS range reads.

This PR fixes a secondary bug discovered during the investigation of #858. The user confirmed via git bisect that #807 resolves their issue.
Test Plan
Related: #807
🤖 Generated with Claude Code