
Commit 51bd956

fdmanana authored and kdave committed
btrfs: fix deadlock due to page faults during direct IO reads and writes
If we do a direct IO read or write when the buffer given by the user is memory mapped to the file range we are going to do IO on, we end up in a deadlock. This is triggered by the new test case generic/647 from fstests.

For a direct IO read we get a trace like this:

  [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
  [967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1
  [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000
  [967.875992] Call Trace:
  [967.875999] __schedule+0x3ca/0xe10
  [967.876015] schedule+0x43/0xe0
  [967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
  [967.876109] ? do_wait_intr_irq+0xb0/0xb0
  [967.876118] lock_extent_bits+0x37/0x90 [btrfs]
  [967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
  [967.876184] ? extent_readahead+0xa7/0x530 [btrfs]
  [967.876214] extent_readahead+0x32d/0x530 [btrfs]
  [967.876253] ? lru_cache_add+0x104/0x220
  [967.876255] ? kvm_sched_clock_read+0x14/0x40
  [967.876258] ? sched_clock_cpu+0xd/0x110
  [967.876263] ? lock_release+0x155/0x4a0
  [967.876271] read_pages+0x86/0x270
  [967.876274] ? lru_cache_add+0x125/0x220
  [967.876281] page_cache_ra_unbounded+0x1a3/0x220
  [967.876291] filemap_fault+0x626/0xa20
  [967.876303] __do_fault+0x36/0xf0
  [967.876308] __handle_mm_fault+0x83f/0x15f0
  [967.876322] handle_mm_fault+0x9e/0x260
  [967.876327] __get_user_pages+0x204/0x620
  [967.876332] ? get_user_pages_unlocked+0x69/0x340
  [967.876340] get_user_pages_unlocked+0xd3/0x340
  [967.876349] internal_get_user_pages_fast+0xbca/0xdc0
  [967.876366] iov_iter_get_pages+0x8d/0x3a0
  [967.876374] bio_iov_iter_get_pages+0x82/0x4a0
  [967.876379] ? lock_release+0x155/0x4a0
  [967.876387] iomap_dio_bio_actor+0x232/0x410
  [967.876396] iomap_apply+0x12a/0x4a0
  [967.876398] ? iomap_dio_rw+0x30/0x30
  [967.876414] __iomap_dio_rw+0x29f/0x5e0
  [967.876415] ? iomap_dio_rw+0x30/0x30
  [967.876420] ? lock_acquired+0xf3/0x420
  [967.876429] iomap_dio_rw+0xa/0x30
  [967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs]
  [967.876460] new_sync_read+0x118/0x1a0
  [967.876472] vfs_read+0x128/0x1b0
  [967.876477] __x64_sys_pread64+0x90/0xc0
  [967.876483] do_syscall_64+0x3b/0xc0
  [967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae
  [967.876490] RIP: 0033:0x7fb6f2c038d6
  [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
  [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
  [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
  [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
  [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
  [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000

This happens because at btrfs_dio_iomap_begin() we lock the extent range and return with it locked - we only unlock in the endio callback, at end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after iomap calls the btrfs_dio_iomap_begin() callback, it triggers the page faults that result in reading the pages, through the readahead callback btrfs_readahead(), and through there we end up attempting to lock the same extent range again (or a subrange of what we locked before), resulting in the deadlock.

For a direct IO write, the scenario is a bit different, and it results in a trace like this:

  [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
  [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
  [1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1
  [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000
  [1330.351906] Call Trace:
  [1330.351913] __schedule+0x3ca/0xe10
  [1330.351930] schedule+0x43/0xe0
  [1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
  [1330.352020] ? do_wait_intr_irq+0xb0/0xb0
  [1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
  [1330.352064] ? extent_readahead+0xa7/0x530 [btrfs]
  [1330.352094] extent_readahead+0x32d/0x530 [btrfs]
  [1330.352133] ? lru_cache_add+0x104/0x220
  [1330.352135] ? kvm_sched_clock_read+0x14/0x40
  [1330.352138] ? sched_clock_cpu+0xd/0x110
  [1330.352143] ? lock_release+0x155/0x4a0
  [1330.352151] read_pages+0x86/0x270
  [1330.352155] ? lru_cache_add+0x125/0x220
  [1330.352162] page_cache_ra_unbounded+0x1a3/0x220
  [1330.352172] filemap_fault+0x626/0xa20
  [1330.352176] ? filemap_map_pages+0x18b/0x660
  [1330.352184] __do_fault+0x36/0xf0
  [1330.352189] __handle_mm_fault+0x1253/0x15f0
  [1330.352203] handle_mm_fault+0x9e/0x260
  [1330.352208] __get_user_pages+0x204/0x620
  [1330.352212] ? get_user_pages_unlocked+0x69/0x340
  [1330.352220] get_user_pages_unlocked+0xd3/0x340
  [1330.352229] internal_get_user_pages_fast+0xbca/0xdc0
  [1330.352246] iov_iter_get_pages+0x8d/0x3a0
  [1330.352254] bio_iov_iter_get_pages+0x82/0x4a0
  [1330.352259] ? lock_release+0x155/0x4a0
  [1330.352266] iomap_dio_bio_actor+0x232/0x410
  [1330.352275] iomap_apply+0x12a/0x4a0
  [1330.352278] ? iomap_dio_rw+0x30/0x30
  [1330.352292] __iomap_dio_rw+0x29f/0x5e0
  [1330.352294] ? iomap_dio_rw+0x30/0x30
  [1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs]
  [1330.352339] new_sync_write+0x11f/0x1b0
  [1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
  [1330.352354] vfs_write+0x292/0x3c0
  [1330.352359] __x64_sys_pwrite64+0x90/0xc0
  [1330.352365] do_syscall_64+0x3b/0xc0
  [1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae
  [1330.352372] RIP: 0033:0x7f4b0a580986
  [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
  [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
  [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
  [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
  [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
  [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent range unlocked, but later when the page faults are triggered and we try to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where we find the ordered extent for our write, created by the iomap callback btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us deadlock since we can't complete the ordered extent without reading the pages (the iomap code only submits the bio after the pages are faulted in).

Fix this by setting the nofault attribute of the given iov_iter and retrying the direct IO read/write if we get an -EFAULT error returned from iomap. For reads, also disable page faults completely, because when we read from a hole or a prealloc extent we can still trigger page faults due to the call to iov_iter_zero() done by iomap - at the moment, it is oblivious to the value of the ->nofault attribute of an iov_iter. We also need to keep track of the number of bytes written or read, and pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
This depends on the iov_iter and iomap changes introduced in commit c03098d ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
1 parent c03098d commit 51bd956

File tree

1 file changed (+123 lines, -16 lines)


fs/btrfs/file.c

Lines changed: 123 additions & 16 deletions
@@ -1912,16 +1912,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
+	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	loff_t pos;
 	ssize_t written = 0;
 	ssize_t written_buffered;
+	size_t prev_left = 0;
 	loff_t endbyte;
 	ssize_t err;
 	unsigned int ilock_flags = 0;
-	struct iomap_dio *dio = NULL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -1964,23 +1965,80 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
-	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0, 0);
+	/*
+	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
+	 * calls generic_write_sync() (through iomap_dio_complete()), because
+	 * that results in calling fsync (btrfs_sync_file()) which will try to
+	 * lock the inode in exclusive/write mode.
+	 */
+	if (is_sync_write)
+		iocb->ki_flags &= ~IOCB_DSYNC;
 
-	btrfs_inode_unlock(inode, ilock_flags);
+	/*
+	 * The iov_iter can be mapped to the same file range we are writing to.
+	 * If that's the case, then we will deadlock in the iomap code, because
+	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
+	 * an ordered extent, and after that it will fault in the pages that the
+	 * iov_iter refers to. During the fault in we end up in the readahead
+	 * pages code (starting at btrfs_readahead()), which will lock the range,
+	 * find that ordered extent and then wait for it to complete (at
+	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
+	 * obviously the ordered extent can never complete as we didn't submit
+	 * yet the respective bio(s). This always happens when the buffer is
+	 * memory mapped to the same file range, since the iomap DIO code always
+	 * invalidates pages in the target file range (after starting and waiting
+	 * for any writeback).
+	 *
+	 * So here we disable page faults in the iov_iter and then retry if we
+	 * got -EFAULT, faulting in the pages before the retry.
+	 */
+again:
+	from->nofault = true;
+	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   IOMAP_DIO_PARTIAL, written);
+	from->nofault = false;
 
-	if (IS_ERR_OR_NULL(dio)) {
-		err = PTR_ERR_OR_ZERO(dio);
-		if (err < 0 && err != -ENOTBLK)
-			goto out;
-	} else {
-		written = iomap_dio_complete(dio);
+	/* No increment (+=) because iomap returns a cumulative value. */
+	if (err > 0)
+		written = err;
+
+	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
+		const size_t left = iov_iter_count(from);
+		/*
+		 * We have more data left to write. Try to fault in as many as
+		 * possible of the remainder pages and retry. We do this without
+		 * releasing and locking again the inode, to prevent races with
+		 * truncate.
+		 *
+		 * Also, in case the iov refers to pages in the file range of the
+		 * file we want to write to (due to a mmap), we could enter an
+		 * infinite loop if we retry after faulting the pages in, since
+		 * iomap will invalidate any pages in the range early on, before
+		 * it tries to fault in the pages of the iov. So we keep track of
+		 * how much was left of iov in the previous EFAULT and fallback
+		 * to buffered IO in case we haven't made any progress.
+		 */
+		if (left == prev_left) {
+			err = -ENOTBLK;
+		} else {
+			fault_in_iov_iter_readable(from, left);
+			prev_left = left;
+			goto again;
+		}
 	}
 
-	if (written < 0 || !iov_iter_count(from)) {
-		err = written;
+	btrfs_inode_unlock(inode, ilock_flags);
+
+	/*
+	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
+	 * the fsync (call generic_write_sync()).
+	 */
+	if (is_sync_write)
+		iocb->ki_flags |= IOCB_DSYNC;
+
+	/* If 'err' is -ENOTBLK then it means we must fallback to buffered IO. */
+	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
-	}
 
 buffered:
 	pos = iocb->ki_pos;
@@ -2005,7 +2063,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
 				 endbyte >> PAGE_SHIFT);
 out:
-	return written ? written : err;
+	return err < 0 ? err : written;
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -3659,6 +3717,8 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
 static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t prev_left = 0;
+	ssize_t read = 0;
 	ssize_t ret;
 
 	if (fsverity_active(inode))
@@ -3668,10 +3728,57 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
+again:
+	/*
+	 * This is similar to what we do for direct IO writes, see the comment
+	 * at btrfs_direct_write(), but we also disable page faults in addition
+	 * to disabling them only at the iov_iter level. This is because when
+	 * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
+	 * which can still trigger page faults despite having set ->nofault
+	 * to true on our 'to' iov_iter.
+	 *
+	 * The difference to direct IO writes is that we deadlock when trying
+	 * to lock the extent range in the inode's tree during the page reads
+	 * triggered by the fault in (while for writes it is due to waiting for
+	 * our own ordered extent). This is because for direct IO reads,
+	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
+	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
+	 */
+	pagefault_disable();
+	to->nofault = true;
 	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			   0, 0);
+			   IOMAP_DIO_PARTIAL, read);
+	to->nofault = false;
+	pagefault_enable();
+
+	/* No increment (+=) because iomap returns a cumulative value. */
+	if (ret > 0)
+		read = ret;
+
+	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+		const size_t left = iov_iter_count(to);
+
+		if (left == prev_left) {
+			/*
+			 * We didn't make any progress since the last attempt,
+			 * fallback to a buffered read for the remainder of the
+			 * range. This is just to avoid any possibility of looping
+			 * for too long.
+			 */
+			ret = read;
+		} else {
+			/*
+			 * We made some progress since the last retry or this is
+			 * the first time we are retrying. Fault in as many pages
+			 * as possible and retry.
+			 */
+			fault_in_iov_iter_writeable(to, left);
+			prev_left = left;
+			goto again;
+		}
+	}
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
-	return ret;
+	return ret < 0 ? ret : read;
 }
 
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
