Skip to content

Commit 05381a4

Browse files
authored
utils: remove unnecessary fsync in durable_rename() (#9686)
## Problem WAL segment fsyncs significantly affect WAL ingestion throughput. `durable_rename()` is used when initializing every 16 MB segment, and issues 3 fsyncs of which 1 was unnecessary. ## Summary of changes Remove an fsync in `durable_rename` which is unnecessary with Linux and ext4 (which we currently use). This improves WAL ingestion throughput by up to 23% with large appends on my MacBook.
1 parent cef1658 commit 05381a4

File tree

1 file changed

+20
-11
lines changed

1 file changed

+20
-11
lines changed

libs/utils/src/crashsafe.rs

Lines changed: 20 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -123,15 +123,27 @@ pub async fn fsync_async_opt(
123123
Ok(())
124124
}
125125

126-
/// Like postgres' durable_rename, renames file issuing fsyncs do make it
127-
/// durable. After return, file and rename are guaranteed to be persisted.
126+
/// Like postgres' durable_rename, renames a file and issues fsyncs to make it durable. After
127+
/// returning, both the file and rename are guaranteed to be persisted. Both paths must be on the
128+
/// same file system.
128129
///
129-
/// Unlike postgres, it only does fsyncs to 1) file to be renamed to make
130-
/// contents durable; 2) its directory entry to make rename durable 3) again to
131-
/// already renamed file, which is not required by standards but postgres does
132-
/// it, let's stick to that. Postgres additionally fsyncs newpath *before*
133-
/// rename if it exists to ensure that at least one of the files survives, but
134-
/// current callers don't need that.
130+
/// Unlike postgres, it only fsyncs 1) the file to make contents durable, and 2) the directory to
131+
/// make the rename durable. This sequence ensures the target file will never be incomplete.
132+
///
133+
/// Postgres also:
134+
///
135+
/// * Fsyncs the target file, if it exists, before the rename, to ensure either the new or existing
136+
/// file survives a crash. Current callers don't need this as it should already be fsynced if
137+
/// durability is needed.
138+
///
139+
/// * Fsyncs the file after the rename. This can be required with certain OSes or file systems (e.g.
140+
/// NFS), but not on Linux with most common file systems like ext4 (which we currently use).
141+
///
142+
/// An audit of 8 other databases found that none fsynced the file after a rename:
143+
/// <https://github.com/neondatabase/neon/pull/9686#discussion_r1837180535>
144+
///
145+
/// eBPF probes confirmed that this is sufficient with ext4, XFS, and ZFS, but possibly not Btrfs:
146+
/// <https://github.com/neondatabase/neon/pull/9686#discussion_r1837926218>
135147
///
136148
/// virtual_file.rs has similar code, but it doesn't use vfs.
137149
///
@@ -149,9 +161,6 @@ pub async fn durable_rename(
149161
// Time to do the real deal.
150162
tokio::fs::rename(old_path.as_ref(), new_path.as_ref()).await?;
151163

152-
// Postgres'ish fsync of renamed file.
153-
fsync_async_opt(new_path.as_ref(), do_fsync).await?;
154-
155164
// Now fsync the parent
156165
let parent = match new_path.as_ref().parent() {
157166
Some(p) => p,

0 commit comments

Comments
 (0)