Skip to content

Commit 87be949

Browse files
committed
Merge tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong: "The highlight of this is a batch of fixes for the online metadata checking code as we start the loooong march towards merging online repair. I aim to merge that in time for the 2023 LTS. There are also a large number of data corruption and race condition fixes in this patchset. Most notably fixed are write() calls to unwritten extents racing with writeback, which required some late(r than I prefer) code changes to iomap to support the necessary revalidations. I don't really like iomap changes going in past -rc4, but Dave and I have been working on it long enough that I chose to push it for 6.2 anyway. There are also a number of other subtle problems fixed, including the log racing with inode writeback to write inodes with incorrect link count to disk; file data mapping corruptions as a result of incorrect lock cycling when attaching dquots; refcount metadata corruption if one actually manages to share a block 2^32 times; and the log clobbering cow staging extents if they were formerly metadata blocks. Summary: - Fix a race condition w.r.t. percpu inode free counters - Fix a broken error return in xfs_remove - Print FS UUID at mount/unmount time - Numerous fixes to the online fsck code - Fix inode locking inconsistency problems when dealing with realtime metadata files - Actually merge pull requests so that we capture the cover letter contents - Fix a race between rebuilding VFS inode state and the AIL flushing inodes that could cause corrupt inodes to be written to the filesystem - Fix a data corruption problem resulting from a write() to an unwritten extent racing with writeback started on behalf of memory reclaim changing the extent state - Add debugging knobs so that we can test iomap invalidation - Fix the blockdev pagecache contents being stale after unmounting the filesystem, leading to spurious xfs_db errors and corrupt metadumps - Fix a file mapping corruption bug due to ilock cycling when attaching dquots to a file during delalloc reservation - Fix a refcount btree corruption problem due to the refcount adjustment code not handling MAXREFCOUNT correctly, resulting in unnecessary record splits - Fix COW staging extent alloctions not being classified as USERDATA, which results in filestreams being ignored and possible data corruption if the allocation was filled from the AGFL and the block buffer is still being tracked in the AIL - Fix new duplicated includes - Fix a race between the dquot shrinker and dquot freeing that could cause a UAF" * tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (50 commits) xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING xfs: Remove duplicated include in xfs_iomap.c xfs: invalidate xfs_bufs when allocating cow extents xfs: get rid of assert from xfs_btree_islastblock xfs: estimate post-merge refcounts correctly xfs: hoist refcount record merge predicates xfs: fix super block buf log item UAF during force shutdown xfs: wait iclog complete before tearing down AIL xfs: attach dquots to inode before reading data/cow fork mappings xfs: shut up -Wuninitialized in xfsaild_push xfs: use memcpy, not strncpy, to format the attr prefix during listxattr xfs: invalidate block device page cache during unmount xfs: add debug knob to slow down write for fun xfs: add debug knob to slow down writeback for fun xfs: drop write error injection is unfixable, remove it xfs: use iomap_valid method to detect stale cached iomaps iomap: write iomap validity checks xfs: xfs_bmap_punch_delalloc_range() should take a byte range iomap: buffered write failure should not truncate the page cache xfs,iomap: move delalloc punching to iomap ...
2 parents c7020e1 + 52f31ed commit 87be949

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

49 files changed

+1317
-313
lines changed

fs/iomap/buffered-io.c

Lines changed: 253 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -584,7 +584,7 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
584584
return iomap_read_inline_data(iter, folio);
585585
}
586586

587-
static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
587+
static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
588588
size_t len, struct folio **foliop)
589589
{
590590
const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
@@ -618,6 +618,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
618618
status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
619619
goto out_no_page;
620620
}
621+
622+
/*
623+
* Now we have a locked folio, before we do anything with it we need to
624+
* check that the iomap we have cached is not stale. The inode extent
625+
* mapping can change due to concurrent IO in flight (e.g.
626+
* IOMAP_UNWRITTEN state can change and memory reclaim could have
627+
* reclaimed a previously partially written page at this index after IO
628+
* completion before this write reaches this file offset) and hence we
629+
* could do the wrong thing here (zero a page range incorrectly or fail
630+
* to zero) and corrupt data.
631+
*/
632+
if (page_ops && page_ops->iomap_valid) {
633+
bool iomap_valid = page_ops->iomap_valid(iter->inode,
634+
&iter->iomap);
635+
if (!iomap_valid) {
636+
iter->iomap.flags |= IOMAP_F_STALE;
637+
status = 0;
638+
goto out_unlock;
639+
}
640+
}
641+
621642
if (pos + len > folio_pos(folio) + folio_size(folio))
622643
len = folio_pos(folio) + folio_size(folio) - pos;
623644

@@ -773,6 +794,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
773794
status = iomap_write_begin(iter, pos, bytes, &folio);
774795
if (unlikely(status))
775796
break;
797+
if (iter->iomap.flags & IOMAP_F_STALE)
798+
break;
776799

777800
page = folio_file_page(folio, pos >> PAGE_SHIFT);
778801
if (mapping_writably_mapped(mapping))
@@ -832,6 +855,231 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
832855
}
833856
EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
834857

858+
/*
859+
* Scan the data range passed to us for dirty page cache folios. If we find a
860+
* dirty folio, punch out the preceeding range and update the offset from which
861+
* the next punch will start from.
862+
*
863+
* We can punch out storage reservations under clean pages because they either
864+
* contain data that has been written back - in which case the delalloc punch
865+
* over that range is a no-op - or they have been read faults in which case they
866+
* contain zeroes and we can remove the delalloc backing range and any new
867+
* writes to those pages will do the normal hole filling operation...
868+
*
869+
* This makes the logic simple: we only need to keep the delalloc extents only
870+
* over the dirty ranges of the page cache.
871+
*
872+
* This function uses [start_byte, end_byte) intervals (i.e. open ended) to
873+
* simplify range iterations.
874+
*/
875+
static int iomap_write_delalloc_scan(struct inode *inode,
876+
loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
877+
int (*punch)(struct inode *inode, loff_t offset, loff_t length))
878+
{
879+
while (start_byte < end_byte) {
880+
struct folio *folio;
881+
882+
/* grab locked page */
883+
folio = filemap_lock_folio(inode->i_mapping,
884+
start_byte >> PAGE_SHIFT);
885+
if (!folio) {
886+
start_byte = ALIGN_DOWN(start_byte, PAGE_SIZE) +
887+
PAGE_SIZE;
888+
continue;
889+
}
890+
891+
/* if dirty, punch up to offset */
892+
if (folio_test_dirty(folio)) {
893+
if (start_byte > *punch_start_byte) {
894+
int error;
895+
896+
error = punch(inode, *punch_start_byte,
897+
start_byte - *punch_start_byte);
898+
if (error) {
899+
folio_unlock(folio);
900+
folio_put(folio);
901+
return error;
902+
}
903+
}
904+
905+
/*
906+
* Make sure the next punch start is correctly bound to
907+
* the end of this data range, not the end of the folio.
908+
*/
909+
*punch_start_byte = min_t(loff_t, end_byte,
910+
folio_next_index(folio) << PAGE_SHIFT);
911+
}
912+
913+
/* move offset to start of next folio in range */
914+
start_byte = folio_next_index(folio) << PAGE_SHIFT;
915+
folio_unlock(folio);
916+
folio_put(folio);
917+
}
918+
return 0;
919+
}
920+
921+
/*
922+
* Punch out all the delalloc blocks in the range given except for those that
923+
* have dirty data still pending in the page cache - those are going to be
924+
* written and so must still retain the delalloc backing for writeback.
925+
*
926+
* As we are scanning the page cache for data, we don't need to reimplement the
927+
* wheel - mapping_seek_hole_data() does exactly what we need to identify the
928+
* start and end of data ranges correctly even for sub-folio block sizes. This
929+
* byte range based iteration is especially convenient because it means we
930+
* don't have to care about variable size folios, nor where the start or end of
931+
* the data range lies within a folio, if they lie within the same folio or even
932+
* if there are multiple discontiguous data ranges within the folio.
933+
*
934+
* It should be noted that mapping_seek_hole_data() is not aware of EOF, and so
935+
* can return data ranges that exist in the cache beyond EOF. e.g. a page fault
936+
* spanning EOF will initialise the post-EOF data to zeroes and mark it up to
937+
* date. A write page fault can then mark it dirty. If we then fail a write()
938+
* beyond EOF into that up to date cached range, we allocate a delalloc block
939+
* beyond EOF and then have to punch it out. Because the range is up to date,
940+
* mapping_seek_hole_data() will return it, and we will skip the punch because
941+
* the folio is dirty. THis is incorrect - we always need to punch out delalloc
942+
* beyond EOF in this case as writeback will never write back and covert that
943+
* delalloc block beyond EOF. Hence we limit the cached data scan range to EOF,
944+
* resulting in always punching out the range from the EOF to the end of the
945+
* range the iomap spans.
946+
*
947+
* Intervals are of the form [start_byte, end_byte) (i.e. open ended) because it
948+
* matches the intervals returned by mapping_seek_hole_data(). i.e. SEEK_DATA
949+
* returns the start of a data range (start_byte), and SEEK_HOLE(start_byte)
950+
* returns the end of the data range (data_end). Using closed intervals would
951+
* require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
952+
* the code to subtle off-by-one bugs....
953+
*/
954+
static int iomap_write_delalloc_release(struct inode *inode,
955+
loff_t start_byte, loff_t end_byte,
956+
int (*punch)(struct inode *inode, loff_t pos, loff_t length))
957+
{
958+
loff_t punch_start_byte = start_byte;
959+
loff_t scan_end_byte = min(i_size_read(inode), end_byte);
960+
int error = 0;
961+
962+
/*
963+
* Lock the mapping to avoid races with page faults re-instantiating
964+
* folios and dirtying them via ->page_mkwrite whilst we walk the
965+
* cache and perform delalloc extent removal. Failing to do this can
966+
* leave dirty pages with no space reservation in the cache.
967+
*/
968+
filemap_invalidate_lock(inode->i_mapping);
969+
while (start_byte < scan_end_byte) {
970+
loff_t data_end;
971+
972+
start_byte = mapping_seek_hole_data(inode->i_mapping,
973+
start_byte, scan_end_byte, SEEK_DATA);
974+
/*
975+
* If there is no more data to scan, all that is left is to
976+
* punch out the remaining range.
977+
*/
978+
if (start_byte == -ENXIO || start_byte == scan_end_byte)
979+
break;
980+
if (start_byte < 0) {
981+
error = start_byte;
982+
goto out_unlock;
983+
}
984+
WARN_ON_ONCE(start_byte < punch_start_byte);
985+
WARN_ON_ONCE(start_byte > scan_end_byte);
986+
987+
/*
988+
* We find the end of this contiguous cached data range by
989+
* seeking from start_byte to the beginning of the next hole.
990+
*/
991+
data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
992+
scan_end_byte, SEEK_HOLE);
993+
if (data_end < 0) {
994+
error = data_end;
995+
goto out_unlock;
996+
}
997+
WARN_ON_ONCE(data_end <= start_byte);
998+
WARN_ON_ONCE(data_end > scan_end_byte);
999+
1000+
error = iomap_write_delalloc_scan(inode, &punch_start_byte,
1001+
start_byte, data_end, punch);
1002+
if (error)
1003+
goto out_unlock;
1004+
1005+
/* The next data search starts at the end of this one. */
1006+
start_byte = data_end;
1007+
}
1008+
1009+
if (punch_start_byte < end_byte)
1010+
error = punch(inode, punch_start_byte,
1011+
end_byte - punch_start_byte);
1012+
out_unlock:
1013+
filemap_invalidate_unlock(inode->i_mapping);
1014+
return error;
1015+
}
1016+
1017+
/*
1018+
* When a short write occurs, the filesystem may need to remove reserved space
1019+
* that was allocated in ->iomap_begin from it's ->iomap_end method. For
1020+
* filesystems that use delayed allocation, we need to punch out delalloc
1021+
* extents from the range that are not dirty in the page cache. As the write can
1022+
* race with page faults, there can be dirty pages over the delalloc extent
1023+
* outside the range of a short write but still within the delalloc extent
1024+
* allocated for this iomap.
1025+
*
1026+
* This function uses [start_byte, end_byte) intervals (i.e. open ended) to
1027+
* simplify range iterations.
1028+
*
1029+
* The punch() callback *must* only punch delalloc extents in the range passed
1030+
* to it. It must skip over all other types of extents in the range and leave
1031+
* them completely unchanged. It must do this punch atomically with respect to
1032+
* other extent modifications.
1033+
*
1034+
* The punch() callback may be called with a folio locked to prevent writeback
1035+
* extent allocation racing at the edge of the range we are currently punching.
1036+
* The locked folio may or may not cover the range being punched, so it is not
1037+
* safe for the punch() callback to lock folios itself.
1038+
*
1039+
* Lock order is:
1040+
*
1041+
* inode->i_rwsem (shared or exclusive)
1042+
* inode->i_mapping->invalidate_lock (exclusive)
1043+
* folio_lock()
1044+
* ->punch
1045+
* internal filesystem allocation lock
1046+
*/
1047+
int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
1048+
struct iomap *iomap, loff_t pos, loff_t length,
1049+
ssize_t written,
1050+
int (*punch)(struct inode *inode, loff_t pos, loff_t length))
1051+
{
1052+
loff_t start_byte;
1053+
loff_t end_byte;
1054+
int blocksize = i_blocksize(inode);
1055+
1056+
if (iomap->type != IOMAP_DELALLOC)
1057+
return 0;
1058+
1059+
/* If we didn't reserve the blocks, we're not allowed to punch them. */
1060+
if (!(iomap->flags & IOMAP_F_NEW))
1061+
return 0;
1062+
1063+
/*
1064+
* start_byte refers to the first unused block after a short write. If
1065+
* nothing was written, round offset down to point at the first block in
1066+
* the range.
1067+
*/
1068+
if (unlikely(!written))
1069+
start_byte = round_down(pos, blocksize);
1070+
else
1071+
start_byte = round_up(pos + written, blocksize);
1072+
end_byte = round_up(pos + length, blocksize);
1073+
1074+
/* Nothing to do if we've written the entire delalloc extent */
1075+
if (start_byte >= end_byte)
1076+
return 0;
1077+
1078+
return iomap_write_delalloc_release(inode, start_byte, end_byte,
1079+
punch);
1080+
}
1081+
EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
1082+
8351083
static loff_t iomap_unshare_iter(struct iomap_iter *iter)
8361084
{
8371085
struct iomap *iomap = &iter->iomap;
@@ -856,6 +1104,8 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
8561104
status = iomap_write_begin(iter, pos, bytes, &folio);
8571105
if (unlikely(status))
8581106
return status;
1107+
if (iter->iomap.flags & IOMAP_F_STALE)
1108+
break;
8591109

8601110
status = iomap_write_end(iter, pos, bytes, bytes, folio);
8611111
if (WARN_ON_ONCE(status == 0))
@@ -911,6 +1161,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
9111161
status = iomap_write_begin(iter, pos, bytes, &folio);
9121162
if (status)
9131163
return status;
1164+
if (iter->iomap.flags & IOMAP_F_STALE)
1165+
break;
9141166

9151167
offset = offset_in_folio(folio, pos);
9161168
if (bytes > folio_size(folio) - offset)

fs/iomap/iter.c

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,12 +7,28 @@
77
#include <linux/iomap.h>
88
#include "trace.h"
99

10+
/*
11+
* Advance to the next range we need to map.
12+
*
13+
* If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
14+
* processed - it was aborted because the extent the iomap spanned may have been
15+
* changed during the operation. In this case, the iteration behaviour is to
16+
* remap the unprocessed range of the iter, and that means we may need to remap
17+
* even when we've made no progress (i.e. iter->processed = 0). Hence the
18+
* "finished iterating" case needs to distinguish between
19+
* (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
20+
* need to remap the entire remaining range.
21+
*/
1022
static inline int iomap_iter_advance(struct iomap_iter *iter)
1123
{
24+
bool stale = iter->iomap.flags & IOMAP_F_STALE;
25+
1226
/* handle the previous iteration (if any) */
1327
if (iter->iomap.length) {
14-
if (iter->processed <= 0)
28+
if (iter->processed < 0)
1529
return iter->processed;
30+
if (!iter->processed && !stale)
31+
return 0;
1632
if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
1733
return -EIO;
1834
iter->pos += iter->processed;
@@ -33,6 +49,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter)
3349
WARN_ON_ONCE(iter->iomap.offset > iter->pos);
3450
WARN_ON_ONCE(iter->iomap.length == 0);
3551
WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
52+
WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE);
3653

3754
trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
3855
if (iter->srcmap.type != IOMAP_HOLE)

fs/xfs/libxfs/xfs_bmap.c

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4058,7 +4058,7 @@ xfs_bmap_alloc_userdata(
40584058
* the busy list.
40594059
*/
40604060
bma->datatype = XFS_ALLOC_NOBUSY;
4061-
if (whichfork == XFS_DATA_FORK) {
4061+
if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
40624062
bma->datatype |= XFS_ALLOC_USERDATA;
40634063
if (bma->offset == 0)
40644064
bma->datatype |= XFS_ALLOC_INITIAL_USER_DATA;
@@ -4551,7 +4551,8 @@ xfs_bmapi_convert_delalloc(
45514551
* the extent. Just return the real extent at this offset.
45524552
*/
45534553
if (!isnullstartblock(bma.got.br_startblock)) {
4554-
xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
4554+
xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
4555+
xfs_iomap_inode_sequence(ip, flags));
45554556
*seq = READ_ONCE(ifp->if_seq);
45564557
goto out_trans_cancel;
45574558
}
@@ -4599,7 +4600,8 @@ xfs_bmapi_convert_delalloc(
45994600
XFS_STATS_INC(mp, xs_xstrat_quick);
46004601

46014602
ASSERT(!isnullstartblock(bma.got.br_startblock));
4602-
xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
4603+
xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
4604+
xfs_iomap_inode_sequence(ip, flags));
46034605
*seq = READ_ONCE(ifp->if_seq);
46044606

46054607
if (whichfork == XFS_COW_FORK)

fs/xfs/libxfs/xfs_btree.h

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -556,7 +556,6 @@ xfs_btree_islastblock(
556556
struct xfs_buf *bp;
557557

558558
block = xfs_btree_get_block(cur, level, &bp);
559-
ASSERT(block && xfs_btree_check_block(cur, block, level, bp) == 0);
560559

561560
if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
562561
return block->bb_u.l.bb_rightsib == cpu_to_be64(NULLFSBLOCK);

0 commit comments

Comments
 (0)