Commit 4f984fe
Merge patch series "fallocate: introduce FALLOC_FL_WRITE_ZEROES flag"
Zhang Yi <[email protected]> says:

Currently, we can use the fallocate command to quickly create a pre-allocated file. However, on most filesystems, such as ext4 and XFS, fallocate creates pre-allocated blocks in an unwritten state, and the FALLOC_FL_ZERO_RANGE flag behaves similarly. The extents must be converted to a written state when the user later writes data into the range, which can trigger numerous metadata changes and consequent journal I/O. This can lead to significant write amplification and performance degradation in synchronous write mode. Therefore, we need a method to create a pre-allocated file with written extents that can be used for pure overwriting.

At the moment, the only method available is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth; we must pre-allocate files in advance and cannot add pre-allocated files while user business services are running.

Fortunately, with the development and increasingly wide use of flash-based storage devices, we can efficiently write zeroes to SSDs using the unmap write zeroes command if the device does not write physical zeroes to the media. For example, if a SCSI SSD supports the UNMAP bit or an NVMe SSD supports the DEAC bit [1], the write zeroes command does not write actual data to the device; instead, the zeroed range is converted to a deallocated state, which is fast and consumes almost no disk write bandwidth. Consequently, this feature gives us a faster method for creating pre-allocated files with written extents and zeroed data. However, please note that this is a best-effort optimization rather than a mandatory requirement; some devices may partially fall back to writing physical zeroes due to factors such as receiving unaligned commands.

This series implements this by:

1. Introducing a new feature, BLK_FEAT_WRITE_ZEROES_UNMAP, in the block device queue limit features, which indicates whether the storage device explicitly supports the unmap write zeroes command. The driver should set this flag if the attached disk supports this command.

2. Introducing a queue limit flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED, along with a corresponding sysfs entry, /sys/block/<disk>/queue/write_zeroes_unmap. Users can query the support status of the unmap write zeroes operation and disable this operation if the write zeroes operation is very slow.

3. Introducing a new flag, FALLOC_FL_WRITE_ZEROES, to fallocate. Filesystems that support this operation should allocate written extents and issue zeroes to the specified range of the device. For local block device filesystems, this operation depends on the unmap write zeroes operation of the underlying block device; it returns -EOPNOTSUPP if the device does not enable the unmap write zeroes operation.

This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and the BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for the SCSI, NVMe, and device-mapper drivers, and adds FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw block devices. Any comments are welcome.

I've tested performance with this series on an ext4 filesystem on my machine with an Intel Xeon Gold 6248R CPU and a 7TB KCD61LUL7T68 NVMe SSD, which supports the unmap write zeroes command with the Deallocated state and the DEAC bit. Feel free to give it a try.

0. Ensure the NVMe device supports the WRITE_ZEROES command.

   $ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
   8388608
   $ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
   dlfeat  : 25
     [4:4] : 0x1  Guard Field of Deallocated Logical Blocks is set to CRC of The Value Read
     [3:3] : 0x1  Deallocate Bit in the Write Zeroes Command is Supported
     [2:0] : 0x1  Bytes Read From a Deallocated Logical Block and its Metadata are 0x00

1. Compare 'dd' and fallocate with unmap write zeroes; the latter is significantly faster. Create 1GB and 10GB zeroed files.

   $ dd if=/dev/zero of=foo bs=2M count=$count oflag=direct
   $ time fallocate -w -l $size bar

   #1G   dd: 0.5s   FALLOC_FL_WRITE_ZEROES: 0.17s
   #10G  dd: 5.0s   FALLOC_FL_WRITE_ZEROES: 1.7s

2. Run fio overwrite and fallocate with unmap write zeroes simultaneously; fallocate has little impact on write bandwidth and only slightly affects write latency.

   a) Test bandwidth costs.

      $ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
            -numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
            -fallocate=none -overwrite=1 -group_reporting -name=bw_test

      Without background zero range:
        bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40
      With background zero range:
        bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20

   b) Test write latency costs.

      $ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
            -numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
            -fallocate=none -overwrite=1 -group_reporting -name=lat_test

      Without background zero range:
        lat (nsec): min=9269, max=71635, avg=9840.65
      With background zero range:
        lat (usec): min=9, max=982, avg=11.03

3. Compare overwriting a pre-allocated unwritten file and a written file in O_DSYNC mode. Writing to a file with written extents is much faster.

   # First mkfs and create a test file according to the three cases below,
   # and then run fio.
   $ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
         -rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
         -runtime=20 -fallocate=none -group_reporting -name=test

   unwritten file:               IOPS=20.1k, BW=78.7MiB/s
   unwritten file + fast_commit: IOPS=42.9k, BW=167MiB/s
   written file:                 IOPS=98.8k, BW=386MiB/s

* patches from https://lore.kernel.org/[email protected]:
  ext4: add FALLOC_FL_WRITE_ZEROES support
  block: add FALLOC_FL_WRITE_ZEROES support
  block: factor out common part in blkdev_fallocate()
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  dm: clear unmap write zeroes limits when disabling write zeroes
  scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
  nvmet: set WZDS and DRB if device enables unmap write zeroes operation
  nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
  block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits

Link: https://lore.kernel.org/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
2 parents e04c78d + f4265b8 commit 4f984fe

File tree

14 files changed: +212 −44 lines changed
Documentation/ABI/stable/sysfs-block — 33 additions & 0 deletions

@@ -778,6 +778,39 @@ Description:
 		0, write zeroes is not supported by the device.
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes
+Date:		January 2025
+Contact:	Zhang Yi <[email protected]>
+Description:
+		[RO] This file indicates whether a device supports zeroing data
+		in a specified block range without incurring the cost of
+		physically writing zeroes to the media for each individual
+		block. If this parameter is set to write_zeroes_max_bytes, the
+		device implements a zeroing operation which opportunistically
+		avoids writing zeroes to media while still guaranteeing that
+		subsequent reads from the specified block range will return
+		zeroed data. This operation is a best-effort optimization; a
+		device may fall back to physically writing zeroes to the media
+		due to other factors such as misalignment or being asked to
+		clear a block range smaller than the device's internal
+		allocation unit. If this parameter is set to 0, the device may
+		have to write zeroes to each logical block of the media during
+		a zeroing operation.
+
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_bytes
+Date:		January 2025
+Contact:	Zhang Yi <[email protected]>
+Description:
+		[RW] While write_zeroes_unmap_max_hw_bytes is the hardware
+		limit for the device, this setting is the software limit.
+		Since the unmap write zeroes operation is a best-effort
+		optimization, some devices may still physically write zeroes
+		to the media, so the speed of this operation is not
+		guaranteed. Writing a value of '0' to this file disables this
+		operation. Otherwise, this parameter should be equal to
+		write_zeroes_unmap_max_hw_bytes.
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020

block/blk-settings.c — 18 additions & 2 deletions

@@ -50,6 +50,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_sectors = UINT_MAX;
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_hw_wzeroes_unmap_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	lim->max_hw_zone_append_sectors = UINT_MAX;
 	lim->max_user_discard_sectors = UINT_MAX;
 }
@@ -333,6 +335,12 @@ int blk_validate_limits(struct queue_limits *lim)
 	if (!lim->max_segments)
 		lim->max_segments = BLK_MAX_SEGMENTS;
 
+	if (lim->max_hw_wzeroes_unmap_sectors &&
+	    lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors)
+		return -EINVAL;
+	lim->max_wzeroes_unmap_sectors = min(lim->max_hw_wzeroes_unmap_sectors,
+			lim->max_user_wzeroes_unmap_sectors);
+
 	lim->max_discard_sectors =
 		min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors);
 
@@ -418,10 +426,11 @@ int blk_set_default_limits(struct queue_limits *lim)
 {
 	/*
 	 * Most defaults are set by capping the bounds in blk_validate_limits,
-	 * but max_user_discard_sectors is special and needs an explicit
-	 * initialization to the max value here.
+	 * but these limits are special and need an explicit initialization to
+	 * the max value here.
 	 */
 	lim->max_user_discard_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	return blk_validate_limits(lim);
 }
 
@@ -708,6 +717,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					  b->max_write_zeroes_sectors);
+	t->max_user_wzeroes_unmap_sectors =
+			min(t->max_user_wzeroes_unmap_sectors,
+			    b->max_user_wzeroes_unmap_sectors);
+	t->max_hw_wzeroes_unmap_sectors =
+			min(t->max_hw_wzeroes_unmap_sectors,
+			    b->max_hw_wzeroes_unmap_sectors);
 	t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors,
 					    b->max_hw_zone_append_sectors);

block/blk-sysfs.c — 26 additions & 0 deletions

@@ -161,6 +161,8 @@ static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_wzeroes_unmap_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_wzeroes_unmap_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_zone_append_sectors)
@@ -205,6 +207,24 @@ static int queue_max_discard_sectors_store(struct gendisk *disk,
 	return 0;
 }
 
+static int queue_max_wzeroes_unmap_sectors_store(struct gendisk *disk,
+		const char *page, size_t count, struct queue_limits *lim)
+{
+	unsigned long max_zeroes_bytes, max_hw_zeroes_bytes;
+	ssize_t ret;
+
+	ret = queue_var_store(&max_zeroes_bytes, page, count);
+	if (ret < 0)
+		return ret;
+
+	max_hw_zeroes_bytes = lim->max_hw_wzeroes_unmap_sectors << SECTOR_SHIFT;
+	if (max_zeroes_bytes != 0 && max_zeroes_bytes != max_hw_zeroes_bytes)
+		return -EINVAL;
+
+	lim->max_user_wzeroes_unmap_sectors = max_zeroes_bytes >> SECTOR_SHIFT;
+	return 0;
+}
+
 static int
 queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count,
 		struct queue_limits *lim)
@@ -514,6 +534,10 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_hw_bytes");
+QUEUE_LIM_RW_ENTRY(queue_max_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -662,6 +686,8 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_max_hw_wzeroes_unmap_sectors_entry.attr,
+	&queue_max_wzeroes_unmap_sectors_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,

block/fops.c — 25 additions & 19 deletions

@@ -841,7 +841,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 #define BLKDEV_FALLOC_FL_SUPPORTED \
 	(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
-	 FALLOC_FL_ZERO_RANGE)
+	 FALLOC_FL_ZERO_RANGE | FALLOC_FL_WRITE_ZEROES)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
@@ -850,11 +850,19 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	struct block_device *bdev = I_BDEV(inode);
 	loff_t end = start + len - 1;
 	loff_t isize;
+	unsigned int flags;
 	int error;
 
 	/* Fail if we don't recognize the flags. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
+	/*
+	 * Don't allow writing zeroes if the device does not enable the
+	 * unmap write zeroes operation.
+	 */
+	if ((mode & FALLOC_FL_WRITE_ZEROES) &&
+	    !bdev_write_zeroes_unmap_sectors(bdev))
+		return -EOPNOTSUPP;
 
 	/* Don't go off the end of the device. */
 	isize = bdev_nr_bytes(bdev);
@@ -877,34 +885,32 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	inode_lock(inode);
 	filemap_invalidate_lock(inode->i_mapping);
 
-	/*
-	 * Invalidate the page cache, including dirty pages, for valid
-	 * de-allocate mode calls to fallocate().
-	 */
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
 	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOUNMAP);
+		flags = BLKDEV_ZERO_NOUNMAP;
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOFALLBACK);
+		flags = BLKDEV_ZERO_NOFALLBACK;
+		break;
+	case FALLOC_FL_WRITE_ZEROES:
+		flags = 0;
 		break;
 	default:
 		error = -EOPNOTSUPP;
+		goto fail;
 	}
 
+	/*
+	 * Invalidate the page cache, including dirty pages, for valid
+	 * de-allocate mode calls to fallocate().
+	 */
+	error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+	if (error)
+		goto fail;
+
+	error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+				     len >> SECTOR_SHIFT, GFP_KERNEL, flags);
 fail:
 	filemap_invalidate_unlock(inode->i_mapping);
 	inode_unlock(inode);

drivers/md/dm-table.c — 3 additions & 1 deletion

@@ -2065,8 +2065,10 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		limits->discard_alignment = 0;
 	}
 
-	if (!dm_table_supports_write_zeroes(t))
+	if (!dm_table_supports_write_zeroes(t)) {
 		limits->max_write_zeroes_sectors = 0;
+		limits->max_hw_wzeroes_unmap_sectors = 0;
+	}
 
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;

drivers/nvme/host/core.c — 11 additions & 9 deletions

@@ -2420,22 +2420,24 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	else
 		lim.write_stream_granularity = 0;
 
-	ret = queue_limits_commit_update(ns->disk->queue, &lim);
-	if (ret) {
-		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
-		goto out;
-	}
-
-	set_capacity_and_notify(ns->disk, capacity);
-
 	/*
 	 * Only set the DEAC bit if the device guarantees that reads from
 	 * deallocated data return zeroes. While the DEAC bit does not
 	 * require that, it must be a no-op if reads from deallocated data
 	 * do not return zeroes.
 	 */
-	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) {
 		ns->head->features |= NVME_NS_DEAC;
+		lim.max_hw_wzeroes_unmap_sectors = lim.max_write_zeroes_sectors;
+	}
+
+	ret = queue_limits_commit_update(ns->disk->queue, &lim);
+	if (ret) {
+		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
+		goto out;
+	}
+
+	set_capacity_and_notify(ns->disk, capacity);
 	set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
 	set_bit(NVME_NS_READY, &ns->flags);
 	blk_mq_unfreeze_queue(ns->disk->queue, memflags);

drivers/nvme/target/io-cmd-bdev.c — 4 additions & 0 deletions

@@ -46,6 +46,10 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Set WZDS and DRB if device supports unmapped write zeroes */
+	if (bdev_write_zeroes_unmap_sectors(bdev))
+		id->dlfeat = (1 << 3) | 0x1;
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)

drivers/scsi/sd.c — 5 additions & 0 deletions

@@ -1141,6 +1141,11 @@ static void sd_config_write_same(struct scsi_disk *sdkp,
 out:
 	lim->max_write_zeroes_sectors =
 		sdkp->max_ws_blocks * (logical_block_size >> SECTOR_SHIFT);
+
+	if (sdkp->zeroing_mode == SD_ZERO_WS16_UNMAP ||
+	    sdkp->zeroing_mode == SD_ZERO_WS10_UNMAP)
+		lim->max_hw_wzeroes_unmap_sectors =
+			lim->max_write_zeroes_sectors;
 }
 
 static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
