Commit 30abbca

Retry prim_file:rename on eacces error
On Windows, when creating a snapshot (renaming a checkpoint), prim_file:rename occasionally failed with `eacces`. It is not clear why this happens. It appears that Windows may not release the file handle immediately after the file is closed, so a close followed by a rename in quick succession may sometimes return an error. With this commit, we retry once after 20ms. So far, in our testing, the error has never occurred on the second attempt (with a 10ms delay, it still failed every now and then). We could have separate macros for eagain and eacces, but I went with a shared one. One retry seems reasonable in both cases.
1 parent 5adba50 commit 30abbca
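
The retry-once behaviour described in the commit message can also be sketched as a plain function instead of a macro. This is only an illustration under stated assumptions: the module name `retry_sketch` and the function `retry_once/1,2` are hypothetical and do not appear in the commit, and the debug logging is left out.

```erlang
-module(retry_sketch).
-export([retry_once/1, retry_once/2]).

%% Hypothetical helper, not part of the commit.
%% Runs Op (a zero-arity fun). If it returns {error, eagain} or
%% {error, eacces}, sleeps and runs Op exactly one more time,
%% returning whatever the second attempt returns.
retry_once(Op) ->
    %% 20ms is the delay chosen in the commit message.
    retry_once(Op, 20).

retry_once(Op, DelayMs)
  when is_function(Op, 0), is_integer(DelayMs), DelayMs >= 0 ->
    case Op() of
        {error, E} when E =:= eagain; E =:= eacces ->
            timer:sleep(DelayMs),
            Op();
        Res ->
            %% Success or any other error: no retry.
            Res
    end.
```

A caller would wrap the operation in a fun, for example `retry_sketch:retry_once(fun() -> file:rename(Src, Dst) end)`. Unlike the macro, this form evaluates the operation through a closure rather than by textual substitution.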

2 files changed: 12 additions, 8 deletions
src/ra_file.erl

Lines changed: 11 additions & 7 deletions
@@ -9,14 +9,14 @@
 
 -include("ra.hrl").
 
--define(HANDLE_EAGAIN(Op),
+-define(RETRY_ON_ERROR(Op),
         case Op of
-            {error, eagain} ->
-                ?DEBUG("EAGAIN during file operation, retrying once in 10ms...", []),
-                timer:sleep(10),
+            {error, E} when E =:= eagain orelse E =:= eacces ->
+                ?DEBUG("Error `~p` during file operation, retrying once in 20ms...", [E]),
+                timer:sleep(20),
                 case Op of
                     {error, eagain} = Err ->
-                        ?DEBUG("EAGAIN again during file operation", []),
+                        ?DEBUG("Error `~p` again during file operation", [E]),
                         Err;
                     Res ->
                         Res
@@ -26,8 +26,12 @@
         end).
 
 -export([
-         sync/1
+         sync/1,
+         rename/2
         ]).
 
 sync(Fd) ->
-    ?HANDLE_EAGAIN(file:sync(Fd)).
+    ?RETRY_ON_ERROR(file:sync(Fd)).
+
+rename(Src, Dst) ->
+    ?RETRY_ON_ERROR(prim_file:rename(Src, Dst)).
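
In the macro as shown in this diff, the inner `case` names only `{error, eagain}`. A second `eacces` is still returned to the caller through the `Res` clause, but without the "again" debug line. A hypothetical variant that treats both reasons the same way on the retry might look like the sketch below. The names `retry_macro_sketch` and `RETRY_ON_ERROR_SKETCH` are assumptions for illustration, `logger:debug/2` stands in for the `?DEBUG` macro from `ra.hrl`, and `file:rename/2` stands in for `prim_file:rename/2`.

```erlang
-module(retry_macro_sketch).
-export([rename/2]).

%% Hypothetical variant, not part of the commit: the retry clause
%% reuses the same guard as the first attempt, so a repeated eacces
%% is logged just like a repeated eagain.
-define(RETRY_ON_ERROR_SKETCH(Op),
        case Op of
            {error, E} when E =:= eagain orelse E =:= eacces ->
                logger:debug("Error `~p` during file operation, "
                             "retrying once in 20ms...", [E]),
                timer:sleep(20),
                case Op of
                    {error, E2} = Err when E2 =:= eagain orelse
                                           E2 =:= eacces ->
                        logger:debug("Error `~p` again during "
                                     "file operation", [E2]),
                        Err;
                    Res ->
                        Res
                end;
            Res ->
                Res
        end).

%% Stand-in for ra_file:rename/2 using the public file module.
rename(Src, Dst) ->
    ?RETRY_ON_ERROR_SKETCH(file:rename(Src, Dst)).
```

The return value is the same in both versions. The only difference is which failures on the second attempt produce a debug message.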

src/ra_snapshot.erl

Lines changed: 1 addition & 1 deletion
@@ -417,7 +417,7 @@ promote_checkpoint(PromotionIdx,
             %% sync the checkpoint before promoting it
             %% into a snapshot.
             ok = Mod:sync(Checkpoint),
-            ok = prim_file:rename(Checkpoint, Snapshot),
+            ok = ra_file:rename(Checkpoint, Snapshot),
             Self ! {ra_log_event,
                     {snapshot_written,
                      {Idx, Term}, snapshot}}
