Commit 7dbec0b

Merge tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:

 - a new dm-pcache target for read/write caching on persistent memory
 - fix typos in docs
 - misc small refactoring
 - mark dm-error with DM_TARGET_PASSES_INTEGRITY
 - dm-request-based: fix NULL pointer dereference and quiesce_depth out of sync
 - dm-linear: optimize REQ_PREFLUSH
 - dm-vdo: return error on corrupted metadata
 - dm-integrity: support asynchronous hash interface

* tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits)
  dm raid: use proper md_ro_state enumerators
  dm-integrity: prefer synchronous hash interface
  dm-integrity: enable asynchronous hash interface
  dm-integrity: rename internal_hash
  dm-integrity: add the "offset" argument
  dm-integrity: allocate the recalculate buffer with kmalloc
  dm-integrity: introduce integrity_kmap and integrity_kunmap
  dm-integrity: replace bvec_kmap_local with kmap_local_page
  dm-integrity: use internal variable for digestsize
  dm vdo: return error on corrupted metadata in start_restoring_volume functions
  dm vdo: Update code to use mem_is_zero
  dm: optimize REQ_PREFLUSH with data when using the linear target
  dm-pcache: use int type to store negative error codes
  dm: fix "writen"->"written"
  dm-pcache: cleanup: fix coding style report by checkpatch.pl
  dm-pcache: remove ctrl_lock for pcache_cache_segment
  dm: fix NULL pointer dereference in __dm_suspend()
  dm: fix queue start/stop imbalance under suspend/load/resume races
  dm-pcache: add persistent cache target in device-mapper
  dm error: mark as DM_TARGET_PASSES_INTEGRITY
  ...

2 parents 2ccb4d2 + 55dcfdf commit 7dbec0b

39 files changed, 5829 insertions(+), 181 deletions(-)

Documentation/admin-guide/device-mapper/delay.rst

Lines changed: 4 additions & 4 deletions
@@ -3,7 +3,7 @@ dm-delay
 ========
 
 Device-Mapper's "delay" target delays reads and/or writes
-and/or flushs and optionally maps them to different devices.
+and/or flushes and optionally maps them to different devices.
 
 Arguments::
 
@@ -18,7 +18,7 @@ Table line has to either have 3, 6 or 9 arguments:
 to write and flush operations on optionally different write_device with
 optionally different sector offset
 
-9: same as 6 arguments plus define flush_offset and flush_delay explicitely
+9: same as 6 arguments plus define flush_offset and flush_delay explicitly
 on/with optionally different flush_device/flush_offset.
 
 Offsets are specified in sectors.
@@ -40,15 +40,15 @@ Example scripts
 #!/bin/sh
 #
 # Create mapped device delaying write and flush operations for 400ms and
-# splitting reads to device $1 but writes and flushs to different device $2
+# splitting reads to device $1 but writes and flushes to different device $2
 # to different offsets of 2048 and 4096 sectors respectively.
 #
 dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400"
 
 ::
 #!/bin/sh
 #
-# Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms
+# Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms
 # onto the same backing device at offset 0 sectors.
 #
 dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333"

Documentation/admin-guide/device-mapper/dm-pcache.rst

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+dm-pcache — Persistent Cache
+=================================
+
+*Author: Dongsheng Yang <[email protected]>*
+
+This document describes *dm-pcache*, a Device-Mapper target that lets a
+byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
+high-performance, crash-persistent cache in front of a slower block
+device. The code lives in `drivers/md/dm-pcache/`.
+
+Quick feature summary
+=====================
+
+* *Write-back* caching (only mode currently supported).
+* *16 MiB segments* allocated on the pmem device.
+* *Data CRC32* verification (optional, per cache).
+* Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX
+== 2`) and protected with CRC+sequence numbers.
+* *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism
+* Pure *DAX path* I/O – no extra BIO round-trips
+* *Log-structured write-back* that preserves backend crash-consistency
+
+
+Constructor
+===========
+
+::
+
+pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]
+
+========================= ====================================================
+``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…).
+All metadata *and* cached blocks are stored here.
+
+``backing_dev`` The slow block device to be cached.
+
+``cache_mode`` Optional, Only ``writeback`` is accepted at the
+moment.
+
+``data_crc`` Optional, default to ``false``
+
+* ``true`` – store CRC32 for every cached entry
+and verify on reads
+* ``false`` – skip CRC (faster)
+========================= ====================================================
+
+Example
+-------
+
+.. code-block:: shell
+
+dmsetup create pcache_sdb --table \
+"0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"
+
+The first time a pmem device is used, dm-pcache formats it automatically
+(super-block, cache_info, etc.).
+
+
+Status line
+===========
+
+``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:
+
+::
+
+<sb_flags> <seg_total> <cache_segs> <segs_used> \
+<gc_percent> <cache_flags> \
+<key_head_seg>:<key_head_off> \
+<dirty_tail_seg>:<dirty_tail_off> \
+<key_tail_seg>:<key_tail_off>
+
+Field meanings
+--------------
+
+=============================== =============================================
+``sb_flags`` Super-block flags (e.g. endian marker).
+
+``seg_total`` Number of physical *pmem* segments.
+
+``cache_segs`` Number of segments used for cache.
+
+``segs_used`` Segments currently allocated (bitmap weight).
+
+``gc_percent`` Current GC high-water mark (0-90).
+
+``cache_flags`` Bit 0 – DATA_CRC enabled
+Bit 1 – INIT_DONE (cache initialised)
+Bits 2-5 – cache mode (0 == WB).
+
+``key_head`` Where new key-sets are being written.
+
+``dirty_tail`` First dirty key-set that still needs
+write-back to the backing device.
+
+``key_tail`` First key-set that may be reclaimed by GC.
+=============================== =============================================
+
+
+Messages
+========
+
+*Change GC trigger*
+
+::
+
+dmsetup message <dev> 0 gc_percent <0-90>
+
+
+Theory of operation
+===================
+
+Sub-devices
+-----------
+
+==================== =========================================================
+backing_dev Any block device (SSD/HDD/loop/LVM, etc.).
+cache_dev DAX device; must expose direct-access memory.
+==================== =========================================================
+
+Segments and key-sets
+---------------------
+
+* The pmem space is divided into *16 MiB segments*.
+* Each write allocates space from a per-CPU *data_head* inside a segment.
+* A *cache-key* records a logical range on the origin and where it lives
+inside pmem (segment + offset + generation).
+* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
+and are themselves crash-safe (CRC).
+* The pair *(key_tail, dirty_tail)* delimit clean/dirty and live/dead ksets.
+
+Write-back
+----------
+
+Dirty keys are queued into a tree; a background worker copies data
+back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
+upper layers forces an immediate metadata commit.
+
+Garbage collection
+------------------
+
+GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
+from *key_tail*, frees segments whose every key has been invalidated, and
+advances *key_tail*.
+
+CRC verification
+----------------
+
+If ``data_crc is enabled`` dm-pcache computes a CRC32 over every cached data
+range when it is inserted and stores it in the on-media key. Reads
+validate the CRC before copying to the caller.
+
+
+Failure handling
+================
+
+* *pmem media errors* – all metadata copies are read with
+``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation.
+* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
+dm-pcache retries internally (request deferral).
+* *System crash* – on attach, the driver replays ksets from *key_tail* to
+rebuild the in-core trees; every segment’s generation guards against
+use-after-free keys.
+
+
+Limitations & TODO
+==================
+
+* Only *write-back* mode; other modes planned.
+* Only FIFO cache invalidate; other (LRU, ARC...) planned.
+* Table reload is not supported currently.
+* Discard planned.
+
+
+Example workflow
+================
+
+.. code-block:: shell
+
+# 1. Create devices
+dmsetup create pcache_sdb --table \
+"0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"
+
+# 2. Put a filesystem on top
+mkfs.ext4 /dev/mapper/pcache_sdb
+mount /dev/mapper/pcache_sdb /mnt
+
+# 3. Tune GC threshold to 80 %
+dmsetup message pcache_sdb 0 gc_percent 80
+
+# 4. Observe status
+watch -n1 'dmsetup status pcache_sdb'
+
+# 5. Shutdown
+umount /mnt
+dmsetup remove pcache_sdb
+
+
+``dm-pcache`` is under active development; feedback, bug reports and patches
+are very welcome!
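
Reviewer note: the status fields, the ``cache_flags`` bit layout and the GC condition documented in dm-pcache.rst can be read together. The sketch below is one possible way to do that from a shell and is not part of the commit. It assumes that `dmsetup status` prefixes the nine documented fields with `<start> <length> <target>`, that the numeric fields are printed in a form shell arithmetic accepts, and that the documented GC formula uses integer division. The helper names and the sample status line are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names and the sample status line are hypothetical.

# Decode cache_flags per the documented layout:
#   bit 0 = DATA_CRC, bit 1 = INIT_DONE, bits 2-5 = cache mode (0 == WB).
decode_cache_flags() {
    flags=$1
    data_crc=$(( flags & 1 ))
    init_done=$(( (flags >> 1) & 1 ))
    mode=$(( (flags >> 2) & 15 ))
    echo "data_crc=$data_crc init_done=$init_done cache_mode=$mode"
}

# Documented GC condition: segs_used >= seg_total * gc_percent / 100.
# Integer division is an assumption; the document does not state the rounding.
gc_should_run() {
    segs_used=$1
    seg_total=$2
    gc_percent=$3
    threshold=$(( seg_total * gc_percent / 100 ))
    if [ "$segs_used" -ge "$threshold" ]; then
        echo yes
    else
        echo no
    fi
}

# Summarise one status line. Assumes the "<start> <length> <target>" prefix
# precedes the nine documented fields.
summarize_status() {
    set -- $1      # split the line into positional parameters
    shift 3        # drop <start> <length> <target>
    echo "segments: $4 used of $2 total ($3 for cache), gc_percent=$5"
    decode_cache_flags "$6"
    echo "key_head=$7 dirty_tail=$8 key_tail=$9"
    echo "gc_due=$(gc_should_run "$4" "$2" "$5")"
}

# Example with a made-up status line:
summarize_status "0 2097152 pcache 0 64 63 52 80 3 5:4096 3:0 2:8192"
```

On a live system the argument would be the output of `dmsetup status pcache_sdb` instead of the made-up line; whether the kernel prints each field in decimal should be checked against the driver before relying on the arithmetic.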

Documentation/admin-guide/device-mapper/index.rst

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Device Mapper
 dm-integrity
 dm-io
 dm-log
+dm-pcache
 dm-queue-length
 dm-raid
 dm-service-time

Documentation/admin-guide/device-mapper/vdo.rst

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 .. SPDX-License-Identifier: GPL-2.0-only
 
+======
 dm-vdo
 ======
 

MAINTAINERS

Lines changed: 8 additions & 0 deletions
@@ -7133,6 +7133,14 @@ S: Maintained
 F: Documentation/admin-guide/device-mapper/vdo*.rst
 F: drivers/md/dm-vdo/
 
+DEVICE-MAPPER PCACHE TARGET
+M: Dongsheng Yang <[email protected]>
+M: Zheng Gu <[email protected]>
+
+S: Maintained
+F: Documentation/admin-guide/device-mapper/dm-pcache.rst
+F: drivers/md/dm-pcache/
+
 DEVLINK
 M: Jiri Pirko <[email protected]>
 

drivers/md/Kconfig

Lines changed: 2 additions & 0 deletions
@@ -688,4 +688,6 @@ config DM_AUDIT
 
 source "drivers/md/dm-vdo/Kconfig"
 
+source "drivers/md/dm-pcache/Kconfig"
+
 endif # MD

drivers/md/Makefile

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
 obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
 obj-$(CONFIG_DM_VERITY) += dm-verity.o
 obj-$(CONFIG_DM_VDO) += dm-vdo/
+obj-$(CONFIG_DM_PCACHE) += dm-pcache/
 obj-$(CONFIG_DM_CACHE) += dm-cache.o
 obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
 obj-$(CONFIG_DM_EBS) += dm-ebs.o

drivers/md/dm-bufio.c

Lines changed: 5 additions & 5 deletions
@@ -1337,7 +1337,7 @@ static void use_bio(struct dm_buffer *b, enum req_op op, sector_t sector,
 	char *ptr;
 	unsigned int len;
 
-	bio = bio_kmalloc(1, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOWARN);
+	bio = bio_kmalloc(1, GFP_NOWAIT);
 	if (!bio) {
 		use_dmio(b, op, sector, n_sectors, offset, ioprio);
 		return;
@@ -1601,18 +1601,18 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 * dm-bufio is resistant to allocation failures (it just keeps
 	 * one buffer reserved in cases all the allocations fail).
 	 * So set flags to not try too hard:
-	 * GFP_NOWAIT: don't wait; if we need to sleep we'll release our
-	 * mutex and wait ourselves.
+	 * GFP_NOWAIT: don't wait and don't print a warning in case of
+	 * failure; if we need to sleep we'll release our mutex
+	 * and wait ourselves.
 	 * __GFP_NORETRY: don't retry and rather return failure
 	 * __GFP_NOMEMALLOC: don't use emergency reserves
-	 * __GFP_NOWARN: don't print a warning in case of failure
 	 *
 	 * For debugging, if we set the cache size to 1, no new buffers will
 	 * be allocated.
 	 */
 	while (1) {
 		if (dm_bufio_cache_size_latch != 1) {
-			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC);
 			if (b)
 				return b;
 		}

drivers/md/dm-cache-policy-smq.c

Lines changed: 1 addition & 1 deletion
@@ -590,7 +590,7 @@ static int h_init(struct smq_hash_table *ht, struct entry_space *es, unsigned in
 	nr_buckets = roundup_pow_of_two(max(nr_entries / 4u, 16u));
 	ht->hash_bits = __ffs(nr_buckets);
 
-	ht->buckets = vmalloc(array_size(nr_buckets, sizeof(*ht->buckets)));
+	ht->buckets = vmalloc_array(nr_buckets, sizeof(*ht->buckets));
 	if (!ht->buckets)
 		return -ENOMEM;
 

drivers/md/dm-core.h

Lines changed: 2 additions & 0 deletions
@@ -162,6 +162,7 @@ struct mapped_device {
 #define DMF_SUSPENDED_INTERNALLY 7
 #define DMF_POST_SUSPENDING 8
 #define DMF_EMULATE_ZONE_APPEND 9
+#define DMF_QUEUE_STOPPED 10
 
 static inline sector_t dm_get_size(struct mapped_device *md)
 {
@@ -291,6 +292,7 @@ struct dm_io {
 	struct dm_io *next;
 	struct dm_stats_aux stats_aux;
 	blk_status_t status;
+	bool requeue_flush_with_data;
 	atomic_t io_count;
 	struct mapped_device *md;
 
