Commit 4f9cda4

yangdongsheng authored and kawasaki committed
dm-pcache: initial dm-pcache target
Add the top-level integration pieces that make the new persistent-memory
cache target usable from device-mapper:

* Documentation
  - `Documentation/admin-guide/device-mapper/dm-pcache.rst` explains the
    design, table syntax, status fields and runtime messages.

* Core target implementation
  - `dm_pcache.c` and `dm_pcache.h` register the `"pcache"` DM target,
    parse constructor arguments, create workqueues, and forward bios to
    the cache core added in earlier patches.
  - Supports flush/FUA, status reporting, and a “gc_percent” message.
  - Does not support discard currently.
  - Does not support table reload for a live target currently.

* Device-mapper tables now accept lines like
    pcache <pmem_dev> <backing_dev> writeback <true|false>

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
1 parent 490dc36 commit 4f9cda4

8 files changed: 681 additions & 0 deletions
Documentation/admin-guide/device-mapper/dm-pcache.rst

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC and sequence
  numbers.
* *Multi-tree indexing* (one radix tree per CPU backend) for high pmem
  parallelism.
* Pure *DAX path* I/O: no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.
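
The crash-safety bullet above says each metadata structure exists in two
copies guarded by a CRC and a sequence number. A common way to use such a
pair is to trust the valid copy with the highest sequence number. The sketch
below assumes that rule; it is not taken from the dm-pcache sources, and the
function name ``pick_meta_copy`` is hypothetical.

.. code-block:: shell

   # Pick which of two metadata copies to trust (assumed selection rule).
   # Arguments: <crc_ok0> <seq0> <crc_ok1> <seq1>, where crc_okN is 1 if
   # that copy's CRC verified and 0 otherwise. Prints the chosen index and
   # fails if neither copy is valid. Sequence wrap-around is ignored.
   pick_meta_copy() {
       ok0=$1 seq0=$2 ok1=$3 seq1=$4
       if [ "$ok0" -eq 1 ] && [ "$ok1" -eq 1 ]; then
           if [ "$seq1" -gt "$seq0" ]; then echo 1; else echo 0; fi
       elif [ "$ok0" -eq 1 ]; then
           echo 0
       elif [ "$ok1" -eq 1 ]; then
           echo 1
       else
           return 1
       fi
   }

   pick_meta_copy 1 7 1 8    # both valid, copy 1 is newer: prints 1
   pick_meta_copy 1 7 0 9    # copy 1 failed its CRC: prints 0

With this rule an update interrupted halfway leaves one older but intact
copy to fall back on.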

-------------------------------------------------------------------------------

Constructor
===========

::

   pcache <cache_dev> <backing_dev> <cache_mode> <data_crc>

========================= ====================================================
``cache_dev``             Any DAX-capable block device (``/dev/pmem0``…).
                          All metadata *and* cached blocks are stored here.

``backing_dev``           The slow block device to be cached.

``cache_mode``            Only ``writeback`` is accepted at the moment.

``data_crc``              ``true``  – store CRC32 for every cached entry and
                          verify on reads
                          ``false`` – skip CRC (faster)
========================= ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb writeback true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
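
When the table line is generated by a script, a small helper can check the
arguments before ``dmsetup`` sees them. This is a minimal sketch and not part
of dm-pcache: the function name ``pcache_table`` is hypothetical, and it only
enforces the two constraints stated above (``writeback`` is the only accepted
mode, ``data_crc`` is ``true`` or ``false``).

.. code-block:: shell

   # Hypothetical helper: print a dm-pcache table line.
   # Usage: pcache_table <sectors> <cache_dev> <backing_dev> <data_crc>
   pcache_table() {
       sectors=$1 cache_dev=$2 backing_dev=$3 data_crc=$4
       case "$data_crc" in
           true|false) ;;
           *) echo "data_crc must be true or false" >&2; return 1 ;;
       esac
       # Only writeback is accepted at the moment, so the mode is fixed.
       echo "0 $sectors pcache $cache_dev $backing_dev writeback $data_crc"
   }

   pcache_table 2048 /dev/pmem0 /dev/sdb true

On a real system the sector count would come from ``blockdev --getsz`` and
the printed line would be passed to ``dmsetup create <name> --table``.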

-------------------------------------------------------------------------------

Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>
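
For scripting, the nine fields can be split into named variables. The
following is a sketch under assumptions: the function name
``parse_pcache_status`` is hypothetical, the sample values are invented, and
the input is only the target-specific part of the line (``dmsetup status``
prints ``<start> <length> <target>`` in front of it).

.. code-block:: shell

   # Split the pcache status fields into shell variables.
   parse_pcache_status() {
       # Unquoted on purpose: word splitting yields one argument per field.
       set -- $1
       [ $# -eq 9 ] || { echo "expected 9 fields, got $#" >&2; return 1; }
       sb_flags=$1 seg_total=$2 cache_segs=$3 segs_used=$4
       gc_percent=$5 cache_flags=$6
       # The last three fields are "<seg>:<off>" pairs.
       key_head_seg=${7%%:*}   key_head_off=${7#*:}
       dirty_tail_seg=${8%%:*} dirty_tail_off=${8#*:}
       key_tail_seg=${9%%:*}   key_tail_off=${9#*:}
   }

   # Invented sample values, for illustration only.
   parse_pcache_status "0 64 60 12 70 3 2:4096 1:0 0:8192"
   echo "used $segs_used of $cache_segs cache segments, GC at $gc_percent%"

On a live device the argument could be produced with
``dmsetup status pcache_sdb | cut -d' ' -f4-``.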

Field meanings
--------------

=============================== =============================================
``sb_flags``                    Super-block flags (e.g. endian marker).

``seg_total``                   Number of physical *pmem* segments.

``cache_segs``                  Number of segments used for cache.

``segs_used``                   Segments currently allocated (bitmap weight).

``gc_percent``                  Current GC high-water mark (0-90).

``cache_flags``                 Bit 0 – DATA_CRC enabled
                                Bit 1 – INIT_DONE (cache initialised)
                                Bits 2-5 – cache mode (0 == WB).

``key_head``                    Where new key-sets are being written.

``dirty_tail``                  First dirty key-set that still needs
                                write-back to the backing device.

``key_tail``                    First key-set that may be reclaimed by GC.
=============================== =============================================
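
The ``cache_flags`` bit layout in the table can be decoded with plain shell
arithmetic. This sketch assumes the value is available as a number that the
shell can evaluate; the function name ``decode_cache_flags`` is hypothetical.

.. code-block:: shell

   # Decode cache_flags using the documented layout:
   # bit 0 = DATA_CRC, bit 1 = INIT_DONE, bits 2-5 = cache mode.
   decode_cache_flags() {
       flags=$1
       data_crc=$(( $flags & 1 ))
       init_done=$(( ($flags >> 1) & 1 ))
       mode=$(( ($flags >> 2) & 15 ))   # 4-bit field, 0 == write-back
       echo "data_crc=$data_crc init_done=$init_done mode=$mode"
   }

   decode_cache_flags 3    # prints: data_crc=1 init_done=1 mode=0

A value of 3 therefore describes an initialised write-back cache with data
CRC enabled.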

-------------------------------------------------------------------------------

Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
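
Because the accepted range is 0-90, a script may want to reject other values
before sending the message. The wrapper below is a hypothetical sketch (the
name ``pcache_gc_cmd`` is not part of any tool); it prints the ``dmsetup``
command instead of running it, so it can be tried without a pcache device.

.. code-block:: shell

   # Validate the percentage, then print the dmsetup command to run.
   pcache_gc_cmd() {
       dev=$1 pct=$2
       case "$pct" in
           ''|*[!0-9]*) echo "gc_percent must be an integer" >&2; return 1 ;;
       esac
       if [ "$pct" -gt 90 ]; then
           echo "gc_percent must be in 0-90" >&2
           return 1
       fi
       echo "dmsetup message $dev 0 gc_percent $pct"
   }

   pcache_gc_cmd pcache_sdb 80

To apply the change for real, the printed command can be executed as root.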

-------------------------------------------------------------------------------

Theory of operation
===================

Sub-devices
-----------

==================== =========================================================
backing_dev          Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev            DAX device; must expose direct-access memory.
==================== =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets.
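
The 16 MiB segment size makes capacity and addressing easy to estimate. The
arithmetic below is a simplified sketch: it assumes a flat layout that starts
at offset 0 and ignores whatever space the driver reserves for metadata, so
real ``seg_total`` values may differ. The function names are hypothetical.

.. code-block:: shell

   SEG_SIZE=$(( 16 * 1024 * 1024 ))   # 16 MiB, from the list above

   # Number of whole segments that fit in a region of <bytes> bytes.
   segs_in() { echo $(( $1 / $SEG_SIZE )); }

   # Map a byte offset to "<segment>:<offset-in-segment>".
   seg_locate() { echo "$(( $1 / $SEG_SIZE )):$(( $1 % $SEG_SIZE ))"; }

   segs_in $(( 1024 * 1024 * 1024 ))     # 1 GiB region: prints 64
   seg_locate $(( 20 * 1024 * 1024 ))    # byte 20 MiB: prints 1:4194304

The ``<segment>:<offset>`` form mirrors how the status line reports
``key_head``, ``dirty_tail`` and ``key_tail``.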

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
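
The trigger condition can be checked from a script using values taken from
the status line. This is a sketch of the documented inequality only; integer
(truncating) division is an assumption, and the function name
``gc_should_start`` is hypothetical.

.. code-block:: shell

   # Succeed (exit status 0) when
   #   segs_used >= seg_total * gc_percent / 100
   # Arguments: <segs_used> <seg_total> <gc_percent>
   gc_should_start() {
       segs_used=$1 seg_total=$2 gc_percent=$3
       [ "$segs_used" -ge $(( $seg_total * $gc_percent / 100 )) ]
   }

   # With 64 segments and gc_percent 70 the threshold is 64*70/100 = 44.
   gc_should_start 44 64 70 && echo "GC would start"
   gc_should_start 43 64 70 || echo "below threshold"

Under this formula a ``gc_percent`` of 0 makes the condition true at all
times.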

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.
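
The store-then-verify pattern described here can be illustrated in user
space. The sketch below uses POSIX ``cksum`` as a stand-in checksum; it is
not the CRC32 variant the kernel uses, and the function names are
hypothetical. Only the flow (checksum at insert, compare at read) is meant
to match.

.. code-block:: shell

   # Stand-in checksum: POSIX cksum, not the kernel's CRC32.
   crc_of() { printf '%s' "$1" | cksum | cut -d' ' -f1; }

   # "Insert": remember the checksum next to the data.
   data="cached block contents"
   stored_crc=$(crc_of "$data")

   # "Read": recompute and compare before handing data to the caller.
   read_checked() {
       if [ "$(crc_of "$1")" = "$2" ]; then
           printf '%s\n' "$1"
       else
           echo "CRC mismatch" >&2
           return 1
       fi
   }

   read_checked "$data" "$stored_crc"
   read_checked "corrupted contents" "$stored_crc" || echo "read rejected"

The first call prints the data; the second is rejected because the
recomputed checksum no longer matches the stored one.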

-------------------------------------------------------------------------------

Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts
  initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.

-------------------------------------------------------------------------------

Limitations & TODO
==================

* Only *write-back* mode; other modes planned.
* Only FIFO cache invalidation; other policies (LRU, ARC...) planned.
* Table reload is not supported currently.
* Discard is not supported currently; support is planned.

-------------------------------------------------------------------------------

Example workflow
================

.. code-block:: shell

   # 1. Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb writeback true"

   # 2. Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3. Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4. Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5. Shutdown
   umount /mnt
   dmsetup remove pcache_sdb

-------------------------------------------------------------------------------

``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!

MAINTAINERS

Lines changed: 9 additions & 0 deletions
@@ -6946,6 +6946,15 @@ S: Maintained
 F: Documentation/admin-guide/device-mapper/vdo*.rst
 F: drivers/md/dm-vdo/
 
+DEVICE-MAPPER PCACHE TARGET
+M: Dongsheng Yang <dongsheng.yang@linux.dev>
+M: Zheng Gu <cengku@gmail.com>
+R: Linggang Zeng <linggang.linux@gmail.com>
+L: dm-devel@lists.linux.dev
+S: Maintained
+F: Documentation/admin-guide/device-mapper/dm-pcache.rst
+F: drivers/md/dm-pcache/
+
 DEVLINK
 M: Jiri Pirko <jiri@resnulli.us>
 L: netdev@vger.kernel.org

drivers/md/Kconfig

Lines changed: 2 additions & 0 deletions
@@ -659,4 +659,6 @@ config DM_AUDIT
 
 source "drivers/md/dm-vdo/Kconfig"
 
+source "drivers/md/dm-pcache/Kconfig"
+
 endif # MD

drivers/md/Makefile

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
 obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
 obj-$(CONFIG_DM_VERITY) += dm-verity.o
 obj-$(CONFIG_DM_VDO) += dm-vdo/
+obj-$(CONFIG_DM_PCACHE) += dm-pcache/
 obj-$(CONFIG_DM_CACHE) += dm-cache.o
 obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
 obj-$(CONFIG_DM_EBS) += dm-ebs.o

drivers/md/dm-pcache/Kconfig

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
config DM_PCACHE
	tristate "Persistent cache for Block Device (Experimental)"
	depends on BLK_DEV_DM
	depends on DEV_DAX
	help
	  PCACHE provides a mechanism to use persistent memory (e.g., CXL persistent memory,
	  DAX-enabled devices) as a high-performance cache layer in front of
	  traditional block devices such as SSDs or HDDs.

	  PCACHE is implemented as a kernel module that integrates with the block
	  layer and supports direct access (DAX) to persistent memory for low-latency,
	  byte-addressable caching.

	  Note: This feature is experimental and should be tested thoroughly
	  before use in production environments.

	  If unsure, say 'N'.

drivers/md/dm-pcache/Makefile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
dm-pcache-y := dm_pcache.o cache_dev.o segment.o backing_dev.o cache.o cache_gc.o cache_writeback.o cache_segment.o cache_key.o cache_req.o

obj-m += dm-pcache.o
