Commit 278c7d9

Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fallocate updates from Christian Brauner:

"fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwritten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may lead to significant write amplification and performance degradation in synchronous write mode.

At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth.

Now that more and more flash-based storage devices are available, it is possible to efficiently write zeroes to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UNMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device; instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth.

This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and adds the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.

fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes"

* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ext4: add FALLOC_FL_WRITE_ZEROES support
  block: add FALLOC_FL_WRITE_ZEROES support
  block: factor out common part in blkdev_fallocate()
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  dm: clear unmap write zeroes limits when disabling write zeroes
  scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
  nvmet: set WZDS and DRB if device enables unmap write zeroes operation
  nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
  block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
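
For illustration only, here is a minimal userspace sketch of how an application might request the new mode and fall back to FALLOC_FL_ZERO_RANGE when it is unavailable. The helper name `prepare_for_overwrite` is hypothetical, and the value 0x80 for FALLOC_FL_WRITE_ZEROES is an assumption about the uapi header for builds against older headers; check <linux/falloc.h> on the target system.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Assumption: value used by the uapi header once this series is merged. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif
#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10
#endif

/*
 * Hypothetical helper: zero [offset, offset + len) ahead of pure overwrites.
 * Returns 1 if FALLOC_FL_WRITE_ZEROES succeeded, 0 if the call fell back to
 * FALLOC_FL_ZERO_RANGE, and -1 with errno set on any other failure.
 */
static int prepare_for_overwrite(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, len) == 0)
		return 1;

	/* Older kernel, unsupported filesystem, or no unmap write zeroes. */
	if (errno != EOPNOTSUPP)
		return -1;

	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) == 0)
		return 0;
	return -1;
}
```

A return value of 0 means the range reads back as zeroes but may still be in an unwritten state, so the first overwrite can still incur the extent conversion described above.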
2 parents 0c4ec4a + 4f984fe commit 278c7d9

14 files changed, 212 insertions(+), 44 deletions(-)
Documentation/ABI/stable/sysfs-block

Lines changed: 33 additions & 0 deletions
@@ -778,6 +778,39 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes
+Date:		January 2025
+Contact:	Zhang Yi <[email protected]>
+Description:
+		[RO] This file indicates whether a device supports zeroing data
+		in a specified block range without incurring the cost of
+		physically writing zeroes to the media for each individual
+		block. If this parameter is set to write_zeroes_max_bytes, the
+		device implements a zeroing operation which opportunistically
+		avoids writing zeroes to media while still guaranteeing that
+		subsequent reads from the specified block range will return
+		zeroed data. This operation is a best-effort optimization, a
+		device may fall back to physically writing zeroes to the media
+		due to other factors such as misalignment or being asked to
+		clear a block range smaller than the device's internal
+		allocation unit. If this parameter is set to 0, the device may
+		have to write each logical block media during a zeroing
+		operation.
+
+
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_bytes
+Date:		January 2025
+Contact:	Zhang Yi <[email protected]>
+Description:
+		[RW] While write_zeroes_unmap_max_hw_bytes is the hardware limit
+		for the device, this setting is the software limit. Since the
+		unmap write zeroes operation is a best-effort optimization, some
+		devices may still physically writing zeroes to media. So the
+		speed of this operation is not guaranteed. Writing a value of
+		'0' to this file disables this operation. Otherwise, this
+		parameter should be equal to write_zeroes_unmap_max_hw_bytes.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
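
As a rough illustration of how userspace could consume these attributes, the sketch below reads `write_zeroes_unmap_max_bytes` and treats a non-zero value as "enabled". The helper names are hypothetical; only the attribute file name comes from the documentation above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one unsigned decimal value from a sysfs-style
 * attribute file. Returns 0 on success, -1 if the file cannot be opened or
 * does not start with a number.
 */
static int read_queue_attr(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the software limit is non-zero (operation enabled), 0 if it
 * is zero (disabled or unsupported), and -1 if the attribute is unreadable.
 * queue_dir is a directory such as "/sys/block/<disk>/queue".
 */
static int unmap_write_zeroes_enabled(const char *queue_dir)
{
	char path[512];
	unsigned long long bytes;

	snprintf(path, sizeof(path), "%s/write_zeroes_unmap_max_bytes",
		 queue_dir);
	if (read_queue_attr(path, &bytes) != 0)
		return -1;
	return bytes != 0;
}
```

On kernels without this series the attribute does not exist, so callers should treat -1 as "unknown" rather than as an error in the device.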

block/blk-settings.c

Lines changed: 18 additions & 2 deletions
@@ -50,6 +50,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_sectors = UINT_MAX;
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_hw_wzeroes_unmap_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	lim->max_hw_zone_append_sectors = UINT_MAX;
 	lim->max_user_discard_sectors = UINT_MAX;
 }
@@ -333,6 +335,12 @@ int blk_validate_limits(struct queue_limits *lim)
 	if (!lim->max_segments)
 		lim->max_segments = BLK_MAX_SEGMENTS;
 
+	if (lim->max_hw_wzeroes_unmap_sectors &&
+	    lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors)
+		return -EINVAL;
+	lim->max_wzeroes_unmap_sectors = min(lim->max_hw_wzeroes_unmap_sectors,
+			lim->max_user_wzeroes_unmap_sectors);
+
 	lim->max_discard_sectors =
 		min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors);
 
@@ -418,10 +426,11 @@ int blk_set_default_limits(struct queue_limits *lim)
 {
 	/*
 	 * Most defaults are set by capping the bounds in blk_validate_limits,
-	 * but max_user_discard_sectors is special and needs an explicit
-	 * initialization to the max value here.
+	 * but these limits are special and need an explicit initialization to
+	 * the max value here.
 	 */
 	lim->max_user_discard_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	return blk_validate_limits(lim);
 }
 
@@ -708,6 +717,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_user_wzeroes_unmap_sectors =
+			min(t->max_user_wzeroes_unmap_sectors,
+			    b->max_user_wzeroes_unmap_sectors);
+	t->max_hw_wzeroes_unmap_sectors =
+			min(t->max_hw_wzeroes_unmap_sectors,
+			    b->max_hw_wzeroes_unmap_sectors);
+
 	t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors,
 					b->max_hw_zone_append_sectors);
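
The rule added to blk_validate_limits() can be modelled in userspace roughly as follows. This is a simplified sketch, not kernel code: `struct wz_limits` and `wz_validate` are hypothetical names that mirror only the fields touched by the hunk.

```c
#include <errno.h>

/* Userspace model of the queue_limits fields involved in this check. */
struct wz_limits {
	unsigned int max_write_zeroes_sectors;
	unsigned int max_hw_wzeroes_unmap_sectors;
	unsigned int max_user_wzeroes_unmap_sectors;
	unsigned int max_wzeroes_unmap_sectors;	/* derived by validation */
};

/*
 * The hardware unmap limit must be either 0 (unsupported) or exactly the
 * write zeroes limit; the effective limit is the smaller of the hardware
 * and user limits.
 */
static int wz_validate(struct wz_limits *lim)
{
	unsigned int hw = lim->max_hw_wzeroes_unmap_sectors;
	unsigned int user = lim->max_user_wzeroes_unmap_sectors;

	if (hw && hw != lim->max_write_zeroes_sectors)
		return -EINVAL;
	lim->max_wzeroes_unmap_sectors = hw < user ? hw : user;
	return 0;
}
```

Because the user limit defaults to UINT_MAX, the effective limit equals the hardware limit until an administrator writes 0 to the sysfs file.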

block/blk-sysfs.c

Lines changed: 26 additions & 0 deletions
@@ -161,6 +161,8 @@ static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_wzeroes_unmap_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_wzeroes_unmap_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_zone_append_sectors)
@@ -205,6 +207,24 @@ static int queue_max_discard_sectors_store(struct gendisk *disk,
 	return 0;
 }
 
+static int queue_max_wzeroes_unmap_sectors_store(struct gendisk *disk,
+		const char *page, size_t count, struct queue_limits *lim)
+{
+	unsigned long max_zeroes_bytes, max_hw_zeroes_bytes;
+	ssize_t ret;
+
+	ret = queue_var_store(&max_zeroes_bytes, page, count);
+	if (ret < 0)
+		return ret;
+
+	max_hw_zeroes_bytes = lim->max_hw_wzeroes_unmap_sectors << SECTOR_SHIFT;
+	if (max_zeroes_bytes != 0 && max_zeroes_bytes != max_hw_zeroes_bytes)
+		return -EINVAL;
+
+	lim->max_user_wzeroes_unmap_sectors = max_zeroes_bytes >> SECTOR_SHIFT;
+	return 0;
+}
+
 static int
 queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count,
 		struct queue_limits *lim)
@@ -514,6 +534,10 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_hw_bytes");
+QUEUE_LIM_RW_ENTRY(queue_max_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -662,6 +686,8 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_max_hw_wzeroes_unmap_sectors_entry.attr,
+	&queue_max_wzeroes_unmap_sectors_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
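
The store handler accepts only two values: 0, or exactly the hardware limit in bytes. A userspace model of that acceptance rule might look like the sketch below; `wz_unmap_store` is a hypothetical name, and the explicit widening cast before the shift is a choice made here, not something taken from the kernel code.

```c
#include <errno.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/*
 * Model of the sysfs store rule: the requested byte count must be 0
 * (disable) or equal to the hardware limit. On success the user limit is
 * updated in sectors; on failure it is left untouched.
 */
static int wz_unmap_store(unsigned long requested_bytes,
			  unsigned int hw_sectors,
			  unsigned int *user_sectors)
{
	unsigned long hw_bytes = (unsigned long)hw_sectors << SECTOR_SHIFT;

	if (requested_bytes != 0 && requested_bytes != hw_bytes)
		return -EINVAL;

	*user_sectors = (unsigned int)(requested_bytes >> SECTOR_SHIFT);
	return 0;
}
```

In other words the attribute behaves as an on/off switch rather than a tunable size, which matches the documentation text "Otherwise, this parameter should be equal to write_zeroes_unmap_max_hw_bytes".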

block/fops.c

Lines changed: 25 additions & 19 deletions
@@ -844,7 +844,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
-		 FALLOC_FL_ZERO_RANGE)
+		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_WRITE_ZEROES)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
@@ -853,11 +853,19 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	struct block_device *bdev = I_BDEV(inode);
 	loff_t end = start + len - 1;
 	loff_t isize;
+	unsigned int flags;
 	int error;
 
 	/* Fail if we don't recognize the flags. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
+	/*
+	 * Don't allow writing zeroes if the device does not enable the
+	 * unmap write zeroes operation.
+	 */
+	if ((mode & FALLOC_FL_WRITE_ZEROES) &&
+	    !bdev_write_zeroes_unmap_sectors(bdev))
+		return -EOPNOTSUPP;
 
 	/* Don't go off the end of the device. */
 	isize = bdev_nr_bytes(bdev);
@@ -880,34 +888,32 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	inode_lock(inode);
 	filemap_invalidate_lock(inode->i_mapping);
 
-	/*
-	 * Invalidate the page cache, including dirty pages, for valid
-	 * de-allocate mode calls to fallocate().
-	 */
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
 	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOUNMAP);
+		flags = BLKDEV_ZERO_NOUNMAP;
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOFALLBACK);
+		flags = BLKDEV_ZERO_NOFALLBACK;
+		break;
+	case FALLOC_FL_WRITE_ZEROES:
+		flags = 0;
 		break;
 	default:
 		error = -EOPNOTSUPP;
+		goto fail;
 	}
 
+	/*
+	 * Invalidate the page cache, including dirty pages, for valid
+	 * de-allocate mode calls to fallocate().
+	 */
+	error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+	if (error)
+		goto fail;
+
+	error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+				     len >> SECTOR_SHIFT, GFP_KERNEL, flags);
  fail:
 	filemap_invalidate_unlock(inode->i_mapping);
 	inode_unlock(inode);
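
After the refactor, the switch only selects zeroout flags and the truncate/zeroout calls are shared. The mode-to-flags mapping can be sketched in isolation as below. All names and numeric values here are local stand-ins for illustration (in particular 0x80 for the write-zeroes mode and the two zeroout flag bits are assumptions), not the kernel's definitions.

```c
#include <errno.h>

/* Local stand-ins for the fallocate mode bits (values assumed). */
#define FL_KEEP_SIZE	0x01
#define FL_PUNCH_HOLE	0x02
#define FL_ZERO_RANGE	0x10
#define FL_WRITE_ZEROES	0x80

/* Local stand-ins for the BLKDEV_ZERO_* flags (values illustrative). */
#define ZERO_NOUNMAP	(1u << 0)
#define ZERO_NOFALLBACK	(1u << 1)

/*
 * Model of the refactored switch: each accepted mode combination picks
 * the flags for the single shared zeroout call; anything else is rejected.
 */
static int falloc_mode_to_zeroout_flags(int mode, unsigned int *flags)
{
	switch (mode) {
	case FL_ZERO_RANGE:
	case FL_ZERO_RANGE | FL_KEEP_SIZE:
		*flags = ZERO_NOUNMAP;
		return 0;
	case FL_PUNCH_HOLE | FL_KEEP_SIZE:
		*flags = ZERO_NOFALLBACK;
		return 0;
	case FL_WRITE_ZEROES:
		*flags = 0;	/* unmap allowed, fallback allowed */
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Note that the write-zeroes mode is matched on its own, so in this model combining it with the keep-size bit falls through to the default case.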

drivers/md/dm-table.c

Lines changed: 3 additions & 1 deletion
@@ -2065,8 +2065,10 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		limits->discard_alignment = 0;
 	}
 
-	if (!dm_table_supports_write_zeroes(t))
+	if (!dm_table_supports_write_zeroes(t)) {
 		limits->max_write_zeroes_sectors = 0;
+		limits->max_hw_wzeroes_unmap_sectors = 0;
+	}
 
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;

drivers/nvme/host/core.c

Lines changed: 11 additions & 9 deletions
@@ -2408,22 +2408,24 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	else
 		lim.write_stream_granularity = 0;
 
-	ret = queue_limits_commit_update(ns->disk->queue, &lim);
-	if (ret) {
-		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
-		goto out;
-	}
-
-	set_capacity_and_notify(ns->disk, capacity);
-
 	/*
 	 * Only set the DEAC bit if the device guarantees that reads from
 	 * deallocated data return zeroes. While the DEAC bit does not
 	 * require that, it must be a no-op if reads from deallocated data
 	 * do not return zeroes.
 	 */
-	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) {
 		ns->head->features |= NVME_NS_DEAC;
+		lim.max_hw_wzeroes_unmap_sectors = lim.max_write_zeroes_sectors;
+	}
+
+	ret = queue_limits_commit_update(ns->disk->queue, &lim);
+	if (ret) {
+		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
+		goto out;
+	}
+
+	set_capacity_and_notify(ns->disk, capacity);
 	set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
 	set_bit(NVME_NS_READY, &ns->flags);
 	blk_mq_unfreeze_queue(ns->disk->queue, memflags);

drivers/nvme/target/io-cmd-bdev.c

Lines changed: 4 additions & 0 deletions
@@ -46,6 +46,10 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Set WZDS and DRB if device supports unmapped write zeroes */
+	if (bdev_write_zeroes_unmap_sectors(bdev))
+		id->dlfeat = (1 << 3) | 0x1;
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
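
The NVMe host and target hunks are two sides of the same DLFEAT encoding: the target advertises `(1 << 3) | 0x1`, and the host enables the unmap limit only when it sees exactly that pattern in the low bits. A small sketch of the two expressions follows; the function names are hypothetical, and returning 0 for the unsupported case is an assumption (the target hunk simply leaves the field unchanged).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Host-side test from the diff: bits 2:0 must equal 001b (reads of
 * deallocated blocks return zeroes) and bit 3 must be set (write zeroes
 * may deallocate).
 */
static bool dlfeat_allows_unmap_write_zeroes(uint8_t dlfeat)
{
	return (dlfeat & 0x7) == 0x1 && (dlfeat & (1 << 3)) != 0;
}

/*
 * Target-side value from the diff when the backing device supports unmap
 * write zeroes. Assumption: 0 stands in for "field left unset".
 */
static uint8_t target_dlfeat(bool bdev_supports_unmap_zeroes)
{
	return bdev_supports_unmap_zeroes ? (uint8_t)((1 << 3) | 0x1) : 0;
}
```

With these definitions, a value produced by `target_dlfeat(true)` passes the host check, while a value with only one of the two fields set does not.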

drivers/scsi/sd.c

Lines changed: 5 additions & 0 deletions
@@ -1141,6 +1141,11 @@ static void sd_config_write_same(struct scsi_disk *sdkp,
 out:
 	lim->max_write_zeroes_sectors =
 		sdkp->max_ws_blocks * (logical_block_size >> SECTOR_SHIFT);
+
+	if (sdkp->zeroing_mode == SD_ZERO_WS16_UNMAP ||
+	    sdkp->zeroing_mode == SD_ZERO_WS10_UNMAP)
+		lim->max_hw_wzeroes_unmap_sectors =
+				lim->max_write_zeroes_sectors;
 }
 
 static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
