
Commit b86629c

riteshharjani authored and tytso committed
ext4: Add multi-fsblock atomic write support with bigalloc
EXT4 supports the bigalloc feature, which allows the FS to work in units of clusters (groups of blocks) rather than individual blocks. This patch adds atomic write support for bigalloc so that systems with bs = ps can also create an FS using:

  mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

With bigalloc, ext4 can support multi-fsblock atomic writes. We will have to adjust ext4's atomic write unit max value to the cluster size. This can then support atomic writes of any size in the range [blocksize, clustersize]. This patch adds the required changes to enable multi-fsblock atomic write support using bigalloc in the next patch.

In this patch, for block allocation we first query the underlying region of the requested range by calling ext4_map_blocks(). Here are the various cases which we then handle depending upon the underlying mapping type:

1. If the underlying region for the entire requested range is a mapped extent, then we don't call ext4_map_blocks() to allocate anything. We don't even need to start the jbd2 txn in this case.

2. For an append write case, we create a mapped extent.

3. If the underlying region is entirely a hole, then we create an unwritten extent for the requested range.

4. If the underlying region is a larger unwritten extent, then we split the extent into 2 unwritten extents of the required size.

5. If the underlying region has any type of mixed mapping, then we call ext4_map_blocks() in a loop to zero out the unwritten and hole regions within the requested range. This then provides a single mapped-extent mapping for the requested range.

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping of the requested range is not entirely a hole, an unwritten extent, or a fully mapped extent. That is, if the underlying region contains a mix of hole(s), unwritten extent(s), and mapped extent(s), we use this loop to ensure that all the short mappings are zeroed out. This guarantees that the entire requested range becomes a single, uniformly mapped extent. It is OK to do so because we know this is being done on a bigalloc-enabled filesystem where the block bitmap represents the entire cluster unit.

Note that having a single contiguous underlying region of type mapped, unwritten, or hole is not a problem. The reason to avoid writing on top of a mixed-mapping region is that atomic writes require all-or-nothing semantics for the userspace pwritev2 request: if a crash or a sudden poweroff occurs at any point during the write, the region undergoing the atomic write should read back as either complete old data or complete new data, never a mix of both. So we first convert any mixed-mapping region to a single contiguous mapped extent before any data gets written to it. Normally, the FS only converts unwritten extents to written at the end of the write, in the ->end_io() call. If we allowed writes over a mixed mapping and a sudden power-off happened in between, we would end up reading a mix of new data (over the mapped extents) and old data (over the unwritten extents), because the unwritten-to-written conversion never went through. Allocating a single contiguous block mapping first avoids such torn writes.

Acked-by: Darrick J. Wong <[email protected]>
Co-developed-by: Ojaswin Mujoo <[email protected]>
Signed-off-by: Ojaswin Mujoo <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <[email protected]>
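For context, here is a minimal userspace sketch of how an application could issue such an atomic write via pwritev2() with RWF_ATOMIC. This is not part of the patch; it assumes glibc/kernel headers new enough (v6.11+) to expose RWF_ATOMIC and the STATX_WRITE_ATOMIC statx fields, with fallback defines taken from the uapi values.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040		/* uapi value, in case libc lacks it */
#endif
#ifndef STATX_WRITE_ATOMIC
#define STATX_WRITE_ATOMIC 0x00010000	/* uapi value, in case libc lacks it */
#endif

int main(int argc, char **argv)
{
	struct statx stx;
	struct iovec iov;
	void *buf;
	size_t len;
	int fd;

	if (argc != 2)
		return 1;
	/* RWF_ATOMIC currently requires direct I/O */
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0 || statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx))
		return 1;

	/* With this series, the max unit should report the cluster size. */
	printf("atomic write unit min/max: %u/%u\n",
	       (unsigned int)stx.stx_atomic_write_unit_min,
	       (unsigned int)stx.stx_atomic_write_unit_max);

	len = stx.stx_atomic_write_unit_max;	/* e.g. 16384 with -C 16384 */
	if (posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 0xab, len);
	iov.iov_base = buf;
	iov.iov_len = len;

	/* All-or-nothing: after a crash this range reads all-old or all-new */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != (ssize_t)len)
		perror("pwritev2");
	free(buf);
	close(fd);
	return 0;
}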
1 parent 5bb12b1 commit b86629c

4 files changed: +299 -5 lines changed

fs/ext4/ext4.h

Lines changed: 2 additions & 0 deletions

@@ -3751,6 +3751,8 @@ extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
 			  loff_t len);
 extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
 					  loff_t offset, ssize_t len);
+extern int ext4_convert_unwritten_extents_atomic(handle_t *handle,
+			struct inode *inode, loff_t offset, ssize_t len);
 extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
 					     ext4_io_end_t *io_end);
 extern int ext4_map_blocks(handle_t *handle, struct inode *inode,

fs/ext4/extents.c

Lines changed: 87 additions & 0 deletions

@@ -4796,6 +4796,93 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	return ret;
 }
 
+/*
+ * This function converts a range of blocks to written extents. The caller of
+ * this function will pass the start offset and the size. All unwritten
+ * extents within this range will be converted to written extents.
+ *
+ * This function is called from the direct IO end io callback function for
+ * atomic writes, to convert the unwritten extents after IO is completed.
+ *
+ * Note that the requirement for atomic writes is that all conversion should
+ * happen atomically in a single fs journal transaction. We mainly only
+ * allocate unwritten extents either on a hole or on a pre-existing unwritten
+ * extent range in ext4_map_blocks_atomic_write(). The only case where we can
+ * have multiple unwritten extents in a range [offset, offset+len) is when
+ * there is a split unwritten extent between two leaf nodes which was cached
+ * in the extent status cache during ext4_iomap_alloc() time. That will allow
+ * ext4_map_blocks_atomic_write() to return the unwritten extent range w/o
+ * going into the slow path. That means we might need a loop for conversion
+ * of this unwritten extent split across leaf blocks within a single journal
+ * transaction. An extent split across leaf nodes is a rare case, but let's
+ * still handle that to meet the requirements of multi-fsblock atomic writes.
+ *
+ * Returns 0 on success.
+ */
+int ext4_convert_unwritten_extents_atomic(handle_t *handle, struct inode *inode,
+					  loff_t offset, ssize_t len)
+{
+	unsigned int max_blocks;
+	int ret = 0, ret2 = 0, ret3 = 0;
+	struct ext4_map_blocks map;
+	unsigned int blkbits = inode->i_blkbits;
+	unsigned int credits = 0;
+	int flags = EXT4_GET_BLOCKS_IO_CONVERT_EXT;
+
+	map.m_lblk = offset >> blkbits;
+	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
+
+	if (!handle) {
+		/*
+		 * TODO: An optimization can be added later by having an extent
+		 * status flag e.g. EXTENT_STATUS_SPLIT_LEAF. If we query that
+		 * it can tell if the extent in the cache is a split extent.
+		 * But for now let's assume pextents as 2 always.
+		 */
+		credits = ext4_meta_trans_blocks(inode, max_blocks, 2);
+	}
+
+	if (credits) {
+		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			return ret;
+		}
+	}
+
+	while (ret >= 0 && ret < max_blocks) {
+		map.m_lblk += ret;
+		map.m_len = (max_blocks -= ret);
+		ret = ext4_map_blocks(handle, inode, &map, flags);
+		if (ret != max_blocks)
+			ext4_msg(inode->i_sb, KERN_INFO,
+				 "inode #%lu: block %u: len %u: "
+				 "split block mapping found for atomic write, "
+				 "ret = %d",
+				 inode->i_ino, map.m_lblk,
+				 map.m_len, ret);
+		if (ret <= 0)
+			break;
+	}
+
+	ret2 = ext4_mark_inode_dirty(handle, inode);
+
+	if (credits) {
+		ret3 = ext4_journal_stop(handle);
+		if (unlikely(ret3))
+			ret2 = ret3;
+	}
+
+	if (ret <= 0 || ret2)
+		ext4_warning(inode->i_sb,
+			     "inode #%lu: block %u: len %u: "
+			     "returned %d or %d",
+			     inode->i_ino, map.m_lblk,
+			     map.m_len, ret, ret2);
+
+	return ret > 0 ? ret2 : ret;
+}
+
 /*
  * This function convert a range of blocks to written extents
 * The caller of this function will pass the start offset and the size.
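As an aside, EXT4_MAX_BLOCKS(len, offset, blkbits) above computes how many filesystem blocks the byte range [offset, offset + len) spans. A tiny userspace model of that calculation (a sketch of the macro's effect, not the ext4 macro itself; max_blocks() is a made-up name):

#include <stdio.h>

/* Number of blocks covering [offset, offset + len), for len > 0 */
static unsigned int max_blocks(long long len, long long offset, int blkbits)
{
	long long first = offset >> blkbits;		/* first block index */
	long long last = (offset + len - 1) >> blkbits;	/* last block index */

	return (unsigned int)(last - first + 1);
}

int main(void)
{
	/* An 8 KiB write starting 512 bytes into a 4 KiB block spans 3 blocks */
	printf("%u\n", max_blocks(8192, 4096 + 512, 12));	/* prints 3 */
	return 0;
}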

fs/ext4/file.c

Lines changed: 6 additions & 1 deletion

@@ -377,7 +377,12 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
 	loff_t pos = iocb->ki_pos;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
+
+	if (!error && size && (flags & IOMAP_DIO_UNWRITTEN) &&
+	    (iocb->ki_flags & IOCB_ATOMIC))
+		error = ext4_convert_unwritten_extents_atomic(NULL, inode, pos,
+							      size);
+	else if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
 		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
 	if (error)
 		return error;

fs/ext4/inode.c

Lines changed: 204 additions & 4 deletions

@@ -3467,20 +3467,180 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	}
 }
 
+static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
+			struct inode *inode, struct ext4_map_blocks *map)
+{
+	ext4_lblk_t m_lblk = map->m_lblk;
+	unsigned int m_len = map->m_len;
+	unsigned int mapped_len = 0, m_flags = 0;
+	ext4_fsblk_t next_pblk;
+	bool check_next_pblk = false;
+	int ret = 0;
+
+	WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
+
+	/*
+	 * This is a slow path in case of mixed mapping. We use the
+	 * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure any unwritten
+	 * or hole regions within the requested range are zeroed out, so that
+	 * we can return a single contiguous mapped extent.
+	 */
+	m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+
+	do {
+		ret = ext4_map_blocks(handle, inode, map, m_flags);
+		if (ret < 0 && ret != -ENOSPC)
+			goto out_err;
+		/*
+		 * This should never happen, but let's return an error code to
+		 * avoid an infinite loop in here.
+		 */
+		if (ret == 0) {
+			ret = -EFSCORRUPTED;
+			ext4_warning_inode(inode,
+				"ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
+				m_flags, ret);
+			goto out_err;
+		}
+		/*
+		 * With bigalloc we should never get ENOSPC nor discontiguous
+		 * physical extents.
+		 */
+		if ((check_next_pblk && next_pblk != map->m_pblk) ||
+				ret == -ENOSPC) {
+			ext4_warning_inode(inode,
+				"Non-contiguous allocation detected: expected %llu, got %llu, "
+				"or ext4_map_blocks() returned out of space ret: %d",
+				next_pblk, map->m_pblk, ret);
+			ret = -EFSCORRUPTED;
+			goto out_err;
+		}
+		next_pblk = map->m_pblk + map->m_len;
+		check_next_pblk = true;
+
+		mapped_len += map->m_len;
+		map->m_lblk += map->m_len;
+		map->m_len = m_len - mapped_len;
+	} while (mapped_len < m_len);
+
+	/*
+	 * We might have done some work in the above loop, so we need to query
+	 * the start of the physical extent, based on the original m_lblk and
+	 * m_len. Let's also ensure we were able to allocate the required
+	 * range for the mixed mapping case.
+	 */
+	map->m_lblk = m_lblk;
+	map->m_len = m_len;
+	map->m_flags = 0;
+
+	ret = ext4_map_blocks(handle, inode, map,
+			      EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
+	if (ret != m_len) {
+		ext4_warning_inode(inode,
+			"allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
+			m_lblk, m_len, ret);
+		ret = -EINVAL;
+	}
+	return ret;
+
+out_err:
+	/* reset map before returning an error */
+	map->m_lblk = m_lblk;
+	map->m_len = m_len;
+	map->m_flags = 0;
+	return ret;
+}
+
+/*
+ * ext4_map_blocks_atomic_write: Helper routine to ensure the entire requested
+ * range in @map [lblk, lblk + len) is one single contiguous extent with no
+ * mixed mappings.
+ *
+ * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
+ * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
+ * physical extent for the requested range does not have a single contiguous
+ * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
+ * In that case we will loop over the requested range to allocate and zero
+ * out the unwritten / holes in between, to get a single mapped extent from
+ * [m_lblk, m_lblk + m_len). Note that this is only possible because we know
+ * this can be called only on a bigalloc enabled filesystem where the
+ * underlying cluster is already allocated. This avoids allocating
+ * discontiguous extents in the slow path due to multiple calls to
+ * ext4_map_blocks().
+ * The slow path is mostly a non-performance-critical path, so it should be
+ * ok to loop using ext4_map_blocks() with appropriate flags to allocate &
+ * zero the underlying short holes/unwritten extents within the requested
+ * range.
+ */
+static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
+				struct ext4_map_blocks *map, int m_flags,
+				bool *force_commit)
+{
+	ext4_lblk_t m_lblk = map->m_lblk;
+	unsigned int m_len = map->m_len;
+	int ret = 0;
+
+	WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
+
+	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (ret < 0 || ret == m_len)
+		goto out;
+	/*
+	 * This is a mixed mapping case where we were not able to allocate
+	 * a single contiguous extent. In that case let's reset the requested
+	 * mapping and call the slow path.
+	 */
+	map->m_lblk = m_lblk;
+	map->m_len = m_len;
+	map->m_flags = 0;
+
+	/*
+	 * The slow path means we have a mixed mapping, so we will need to
+	 * force a txn commit.
+	 */
+	*force_commit = true;
+	return ext4_map_blocks_atomic_write_slow(handle, inode, map);
+out:
+	return ret;
+}
+
 static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 			    unsigned int flags)
 {
 	handle_t *handle;
 	u8 blkbits = inode->i_blkbits;
 	int ret, dio_credits, m_flags = 0, retries = 0;
+	bool force_commit = false;
 
 	/*
 	 * Trim the mapping request to the maximum value that we can map at
 	 * once for direct I/O.
 	 */
 	if (map->m_len > DIO_MAX_BLOCKS)
 		map->m_len = DIO_MAX_BLOCKS;
-	dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+
+	/*
+	 * Journal credits estimation for atomic writes. We call
+	 * ext4_map_blocks() to find if there could be a mixed mapping. If
+	 * yes, then let's assume the no. of pextents required can be m_len,
+	 * i.e. every alternate block can be an unwritten extent or a hole.
+	 */
+	if (flags & IOMAP_ATOMIC) {
+		unsigned int orig_mlen = map->m_len;
+
+		ret = ext4_map_blocks(NULL, inode, map, 0);
+		if (ret < 0)
+			return ret;
+		if (map->m_len < orig_mlen) {
+			map->m_len = orig_mlen;
+			dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
+							     map->m_len);
+		} else {
+			dio_credits = ext4_chunk_trans_blocks(inode,
+							      map->m_len);
+		}
+	} else {
+		dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+	}
 
 retry:
 	/*
@@ -3511,7 +3671,11 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (flags & IOMAP_ATOMIC)
+		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
+						   &force_commit);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);
 
 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
@@ -3525,6 +3689,22 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 
+	/*
+	 * Force commit the current transaction if the allocation spans a mixed
+	 * mapping range. This ensures any pending metadata updates (like
+	 * unwritten to written extents conversion) in this range are in
+	 * consistent state with the file data blocks, before performing the
+	 * actual write I/O. If the commit fails, the whole I/O must be aborted
+	 * to prevent any possible torn writes.
+	 */
+	if (ret > 0 && force_commit) {
+		int ret2;
+
+		ret2 = ext4_force_commit(inode->i_sb);
+		if (ret2)
+			return ret2;
+	}
+
 	return ret;
 }
 
@@ -3535,6 +3715,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
+	unsigned int orig_mlen;
 
 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3548,6 +3729,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_lblk = offset >> blkbits;
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+	orig_mlen = map.m_len;
 
 	if (flags & IOMAP_WRITE) {
 		/*
@@ -3558,8 +3740,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 */
 		if (offset + length <= i_size_read(inode)) {
 			ret = ext4_map_blocks(NULL, inode, &map, 0);
-			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
-				goto out;
+			/*
+			 * For atomic writes the entire requested length
+			 * should be mapped.
+			 */
+			if (map.m_flags & EXT4_MAP_MAPPED) {
+				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
+				    (flags & IOMAP_ATOMIC && ret >= orig_mlen))
+					goto out;
+			}
+			map.m_len = orig_mlen;
 		}
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	} else {
@@ -3580,6 +3770,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	 */
 	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
 
+	/*
+	 * Before returning to iomap, let's ensure the allocated mapping
+	 * covers the entire requested length for atomic writes.
+	 */
+	if (flags & IOMAP_ATOMIC) {
+		if (map.m_len < (length >> blkbits)) {
+			WARN_ON_ONCE(1);
+			return -EINVAL;
+		}
+	}
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 
 	return 0;
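To make the fast-path/slow-path split above easier to follow, here is a small userspace model of the classification the commit message describes (an illustrative sketch, not ext4 code; the enum and helper names are made up): a range whose underlying mapping is uniformly hole, unwritten, or mapped can be handled in one go, while any mixed mapping is diverted to the zero-out slow path.

#include <stdbool.h>
#include <stdio.h>

enum ext_state { HOLE, UNWRITTEN, MAPPED };

/* Returns true if the range needs the slow (zero-out) path */
static bool needs_zero_out(const enum ext_state *blks, int n)
{
	for (int i = 1; i < n; i++)
		if (blks[i] != blks[0])
			return true;	/* mixed mapping: slow path */
	return false;			/* uniform mapping: fast path */
}

int main(void)
{
	enum ext_state uniform[] = { UNWRITTEN, UNWRITTEN, UNWRITTEN, UNWRITTEN };
	enum ext_state mixed[] = { MAPPED, HOLE, UNWRITTEN, MAPPED };

	printf("uniform range -> %s\n",
	       needs_zero_out(uniform, 4) ? "slow path" : "fast path");
	printf("mixed range   -> %s\n",
	       needs_zero_out(mixed, 4) ? "slow path" : "fast path");
	return 0;
}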
