Skip to content

Commit d7290d7

Browse files
authored
Fix some missing durable writes in CheckpointBuilder (#5080)
This hopefully fixes a crash we're seeing in the field, with a signature like: `std::runtime_error("Corrupt checkpoint file /var/lib/stellar/buckets/history/ledger/ledger-0398c97f.xdr.dirty, ends on ledger 60344665, LCL is 60344667")` The fix is to change a couple `writeOne` calls to `durableWriteOne` calls in the `CheckpointBuilder`. As far as I can tell their absence was just an oversight during the multiple iterations of development of the original `CheckpointBuilder` PR #4446 -- initially the PR didn't have `durableWriteOne`, and then later after some discussion about durability guarantees it gained that path, but only 2 of the 4 cases of `writeOne` in the code got updated to use it. By not doing durable writes here, it's possible for core to lose a suffix of a dirty checkpoint file, which in turn can violate the safety invariant that the dirty checkpoint files are always _ahead_ of the sqlite-committed LCL. The `CheckpointBuilder` detects this invariant violation on startup and crashes with the message above. Why this started to manifest only in v25 is a _bit_ unclear, but there are a few possibilities. As a statistical sort of thing, the machine that failed in the field might just have got unlucky: its OS failed to write a buffer either on a crash or other non-graceful shutdown condition. Alternatively it _might_ be due to the fact that we removed some other unrelated fsync calls in general recently, for example in #4952, which might have coalesced with the flushed-but-not-fsync'ed `CheckpointBuilder` files. It's hard to be sure about anything with fsync! But I think this change ought to at least improve the odds of `CheckpointBuilder` maintaining its invariants. (The removed `flush` call after the previous `writeOne` is intentional -- flushing the stream buffer happens as part of `durableWriteOne`)
2 parents ff86d0d + 5a618d0 commit d7290d7

File tree

1 file changed

+2
-3
lines changed

1 file changed

+2
-3
lines changed

src/history/CheckpointBuilder.cpp

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -170,8 +170,7 @@ CheckpointBuilder::appendLedgerHeader(LedgerHeader const& header,
170170
LedgerHeaderHistoryEntry lhe;
171171
lhe.header = header;
172172
lhe.hash = xdrSha256(header);
173-
mLedgerHeaders->writeOne(lhe);
174-
mLedgerHeaders->flush();
173+
mLedgerHeaders->durableWriteOne(lhe);
175174
}
176175

177176
namespace
@@ -289,7 +288,7 @@ CheckpointBuilder::cleanup(uint32_t lcl)
289288
ft.localPath_nogz_dirty(), lcl);
290289
break;
291290
}
292-
out.writeOne(entry);
291+
out.durableWriteOne(entry);
293292
}
294293
catch (xdr::xdr_runtime_error const& e)
295294
{

0 commit comments

Comments
 (0)