Commit 68cd449

Enable ext4 support for per-file/directory dax operations
This adds the same per-file/per-directory DAX support for ext4 as was done for xfs, now that we finally have consensus over what the interface should be.
2 parents 6b8ed62 + 15ee656 commit 68cd449

18 files changed, 350 additions and 60 deletions

Documentation/filesystems/dax.txt

Lines changed: 139 additions & 3 deletions
@@ -20,8 +20,144 @@ Usage
 If you have a block device which supports DAX, you can make a filesystem
 on it as usual. The DAX code currently only supports files with a block
 size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem. When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them
+is different.
+
+Enabling DAX on ext2
+-----------------------------
+
+When mounting the filesystem, use the "-o dax" option on the command line or
+add 'dax' to the options in /etc/fstab. This works to enable DAX on all files
+within the filesystem. It is equivalent to the '-o dax=always' behavior below.
+
+
+Enabling DAX on xfs and ext4
+----------------------------
+
+Summary
+-------
+
+ 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
+    the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
+    about this access mode.
+
+ 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
+    files and directories. This advisory flag can be set or cleared at any
+    time, but doing so does not immediately affect the S_DAX state.
+
+ 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
+    be inherited by all regular files and subdirectories that are subsequently
+    created in this directory. Files and subdirectories that exist at the time
+    this flag is set or cleared on the parent directory are not modified by
+    this modification of the parent directory.
+
+ 4. There exist dax mount options which can override FS_XFLAG_DAX in the
+    setting of the S_DAX flag. Given underlying storage which supports DAX the
+    following hold:
+
+    "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
+
+    "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
+
+    "-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX."
+
+    "-o dax" is a legacy option which is an alias for "dax=always".
+    This may be removed in the future so "-o dax=always" is
+    the preferred method for specifying this behavior.
+
+    NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
+    the same even when the filesystem is mounted with a dax option. However,
+    in-core inode state (S_DAX) will be overridden until the filesystem is
+    remounted with dax=inode and the inode is evicted from kernel memory.
+
+ 5. The S_DAX policy can be changed via:
+
+    a) Setting the parent directory FS_XFLAG_DAX as needed before files are
+       created
+
+    b) Setting the appropriate dax="foo" mount option
+
+    c) Changing the FS_XFLAG_DAX flag on existing regular files and
+       directories. This has runtime constraints and limitations that are
+       described in 6) below.
+
+ 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
+    the change in behaviour for existing regular files may not occur
+    immediately. If the change must take effect immediately, the administrator
+    needs to:
+
+    a) stop the application so there are no active references to the data set
+       the policy change will affect
+
+    b) evict the data set from kernel caches so it will be re-instantiated when
+       the application is restarted. This can be achieved by:
+
+       i. drop-caches
+       ii. a filesystem unmount and mount cycle
+       iii. a system reboot
+
+
+Details
+-------
+
+There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
+and the other is a volatile flag indicating the active state of the feature
+(S_DAX).
+
+FS_XFLAG_DAX is preserved within the filesystem. This persistent config
+setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
+(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'.
+
+New files and directories automatically inherit FS_XFLAG_DAX from
+their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
+directory creation time can be used to set a default behavior for an entire
+sub-tree.
+
+To clarify inheritance, here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io -c 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+dax: a,e
+no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io -c 'chattr +x' a
+mkdir -p a/b/c/d
+
+dax: a,b,c,d
+no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io -c 'chattr +x' c
+mkdir a/b/c/d
+
+dax: c,d
+no dax: a,b
+
+
+The current enabled state (S_DAX) is set when a file inode is instantiated in
+memory by the kernel. It is set based on the underlying media support, the
+value of FS_XFLAG_DAX and the filesystem's dax mount option.
+
+statx can be used to query S_DAX. NOTE that only regular files will ever have
+S_DAX set and therefore statx will never indicate that S_DAX is set on
+directories.
+
+Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
+if the underlying media does not support dax and/or the filesystem is
+overridden with a mount option.
+


 Implementation Tips for Block Driver Writers
@@ -94,7 +230,7 @@ sysadmins have an option to restore the lost data from a prior backup/inbuilt
 redundancy in the following ways:

 1. Delete the affected file, and restore from a backup (sysadmin route):
-   This will free the file system blocks that were being used by the file,
+   This will free the filesystem blocks that were being used by the file,
    and the next time they're allocated, they will be zeroed first, which
    happens through the driver, and will clear bad sectors.
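
Note on the dax.txt hunk above: the mount-option rules in item 4 of the Summary can be modelled as a small pure function. This is a user-space sketch for illustration only, not kernel code. The names `dax_mode`, `file_info` and `effective_s_dax()` are hypothetical and do not appear in the commit; the logic only restates what the documentation text says (regular files only, supporting media required, then the `dax=` option decides).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the dax= mount options described in dax.txt. */
enum dax_mode { DAX_MODE_INODE, DAX_MODE_NEVER, DAX_MODE_ALWAYS };

/* Hypothetical summary of the per-file inputs to the decision. */
struct file_info {
	bool is_regular;   /* only regular files ever get S_DAX */
	bool fs_xflag_dax; /* persistent FS_XFLAG_DAX on the inode */
};

/* Returns whether S_DAX would be set when the inode is instantiated. */
bool effective_s_dax(enum dax_mode mode, bool media_supports_dax,
		     const struct file_info *f)
{
	assert(f != NULL);

	if (!media_supports_dax || !f->is_regular)
		return false;

	switch (mode) {
	case DAX_MODE_NEVER:
		return false;           /* ignore FS_XFLAG_DAX */
	case DAX_MODE_ALWAYS:
		return true;            /* ignore FS_XFLAG_DAX */
	case DAX_MODE_INODE:
		return f->fs_xflag_dax; /* follow FS_XFLAG_DAX (default) */
	}
	return false;
}
```

With `dax=inode`, a regular file that has FS_XFLAG_DAX set yields true and one without it yields false; a directory yields false under every mode, which matches the statement that statx never reports S_DAX on directories.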

Documentation/filesystems/ext4/verity.rst

Lines changed: 3 additions & 0 deletions
@@ -39,3 +39,6 @@ is encrypted as well as the data itself.

 Verity files cannot have blocks allocated past the end of the verity
 metadata.
+
+Verity and DAX are not compatible and attempts to set both of these flags
+on a file will fail.

drivers/block/loop.c

Lines changed: 3 additions & 3 deletions
@@ -634,8 +634,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)

 static inline void loop_update_dio(struct loop_device *lo)
 {
-	__loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
-			lo->use_dio);
+	__loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) |
+				lo->use_dio);
 }

 static void loop_reread_partitions(struct loop_device *lo,
@@ -1028,7 +1028,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
 		blk_queue_write_cache(lo->lo_queue, true, false);

-	if (io_is_direct(lo->lo_backing_file) && inode->i_sb->s_bdev) {
+	if ((lo->lo_backing_file->f_flags & O_DIRECT) && inode->i_sb->s_bdev) {
 		/* In case of direct I/O, match underlying block size */
 		unsigned short bsize = bdev_logical_block_size(
 			inode->i_sb->s_bdev);
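
Note on the loop.c hunks above: both '+' lines now test `f_flags & O_DIRECT` on the backing file directly instead of calling `io_is_direct()`. A rough user-space analogue of that test, using `fcntl(2)` with `F_GETFL`, is sketched below. `fd_is_direct()` and `fd_is_direct_example()` are hypothetical names, and the example assumes `/dev/null` exists and can be opened.

```c
#define _GNU_SOURCE /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd was opened with O_DIRECT, 0 if not, -1 if fcntl() fails. */
int fd_is_direct(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return (flags & O_DIRECT) ? 1 : 0;
}

/* Usage example: a plain open() without O_DIRECT must report 0. */
void fd_is_direct_example(void)
{
	int fd = open("/dev/null", O_RDONLY);

	assert(fd >= 0);
	assert(fd_is_direct(fd) == 0);
	close(fd);
}
```

The check looks only at the open flags, which is the property the new loop.c code relies on; whether the inode is in DAX mode plays no part in it.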

fs/dcache.c

Lines changed: 19 additions & 0 deletions
@@ -647,6 +647,10 @@ static inline bool retain_dentry(struct dentry *dentry)
 		if (dentry->d_op->d_delete(dentry))
 			return false;
 	}
+
+	if (unlikely(dentry->d_flags & DCACHE_DONTCACHE))
+		return false;
+
 	/* retain; LRU fodder */
 	dentry->d_lockref.count--;
 	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
@@ -656,6 +660,21 @@ static inline bool retain_dentry(struct dentry *dentry)
 	return true;
 }

+void d_mark_dontcache(struct inode *inode)
+{
+	struct dentry *de;
+
+	spin_lock(&inode->i_lock);
+	hlist_for_each_entry(de, &inode->i_dentry, d_u.d_alias) {
+		spin_lock(&de->d_lock);
+		de->d_flags |= DCACHE_DONTCACHE;
+		spin_unlock(&de->d_lock);
+	}
+	inode->i_state |= I_DONTCACHE;
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL(d_mark_dontcache);
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.

fs/ext4/ext4.h

Lines changed: 20 additions & 7 deletions
@@ -426,28 +426,33 @@ struct flex_groups {
 #define EXT4_VERITY_FL			0x00100000 /* Verity protected inode */
 #define EXT4_EA_INODE_FL		0x00200000 /* Inode used for large EA */
 /* 0x00400000 was formerly EXT4_EOFBLOCKS_FL */
+
+#define EXT4_DAX_FL			0x02000000 /* Inode is DAX */
+
 #define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
 #define EXT4_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
 #define EXT4_CASEFOLD_FL		0x40000000 /* Casefolded directory */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */

-#define EXT4_FL_USER_VISIBLE		0x705BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE		0x604BC0FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE		0x725BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE		0x624BC0FF /* User modifiable flags */

 /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
 #define EXT4_FL_XFLAG_VISIBLE		(EXT4_SYNC_FL | \
 					 EXT4_IMMUTABLE_FL | \
 					 EXT4_APPEND_FL | \
 					 EXT4_NODUMP_FL | \
 					 EXT4_NOATIME_FL | \
-					 EXT4_PROJINHERIT_FL)
+					 EXT4_PROJINHERIT_FL | \
+					 EXT4_DAX_FL)

 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
 			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
 			   EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
 			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
-			   EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL)
+			   EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL |\
+			   EXT4_DAX_FL)

 /* Flags that are appropriate for regular files (all but dir-specific ones). */
 #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL | EXT4_CASEFOLD_FL |\
@@ -459,6 +464,10 @@ struct flex_groups {
 /* The only flags that should be swapped */
 #define EXT4_FL_SHOULD_SWAP (EXT4_HUGE_FILE_FL | EXT4_EXTENTS_FL)

+/* Flags which are mutually exclusive to DAX */
+#define EXT4_DAX_MUT_EXCL (EXT4_VERITY_FL | EXT4_ENCRYPT_FL |\
+			   EXT4_JOURNAL_DATA_FL)
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 ext4_mask_flags(umode_t mode, __u32 flags)
 {
@@ -499,6 +508,7 @@ enum {
 	EXT4_INODE_VERITY	= 20,	/* Verity protected inode */
 	EXT4_INODE_EA_INODE	= 21,	/* Inode used for large EA */
 /* 22 was formerly EXT4_INODE_EOFBLOCKS */
+	EXT4_INODE_DAX		= 25,	/* Inode is DAX */
 	EXT4_INODE_INLINE_DATA	= 28,	/* Data in inode. */
 	EXT4_INODE_PROJINHERIT	= 29,	/* Create with parents projid */
 	EXT4_INODE_CASEFOLD	= 30,	/* Casefolded directory */
@@ -1135,9 +1145,9 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
 #ifdef CONFIG_FS_DAX
-#define EXT4_MOUNT_DAX			0x00200	/* Direct Access */
+#define EXT4_MOUNT_DAX_ALWAYS		0x00200	/* Direct Access */
 #else
-#define EXT4_MOUNT_DAX			0
+#define EXT4_MOUNT_DAX_ALWAYS		0
 #endif
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
@@ -1180,6 +1190,8 @@ struct ext4_inode_info {
 						      blocks */
 #define EXT4_MOUNT2_HURD_COMPAT		0x00000004 /* Support HURD-castrated
 						      file systems */
+#define EXT4_MOUNT2_DAX_NEVER		0x00000008 /* Do not allow Direct Access */
+#define EXT4_MOUNT2_DAX_INODE		0x00000010 /* For printing options only */

 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
 						specified journal checksum */
@@ -1991,6 +2003,7 @@ static inline bool ext4_has_incompat_features(struct super_block *sb)
  */
 #define EXT4_FLAGS_RESIZING	0
 #define EXT4_FLAGS_SHUTDOWN	1
+#define EXT4_FLAGS_BDEV_IS_DAX	2

 static inline int ext4_forced_shutdown(struct ext4_sb_info *sbi)
 {
@@ -2704,7 +2717,7 @@ extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
 extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
-extern void ext4_set_inode_flags(struct inode *);
+extern void ext4_set_inode_flags(struct inode *, bool init);
 extern int ext4_alloc_da_blocks(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
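
Note on the ext4.h hunks above: the new constants can be cross-checked with plain bit arithmetic. The sketch below is a stand-alone user-space file, not part of the commit. It copies `EXT4_DAX_FL`, `EXT4_VERITY_FL`, the enum value 25 and the old and new mask values from the diff. The values of `EXT4_ENCRYPT_FL` and `EXT4_JOURNAL_DATA_FL` are not shown in the diff and are assumed from the kernel's fs/ext4/ext4.h. `dax_flags_conflict()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the hunks above. */
#define EXT4_VERITY_FL		0x00100000u
#define EXT4_DAX_FL		0x02000000u
#define EXT4_INODE_DAX		25

/* Assumed from the kernel's fs/ext4/ext4.h; not shown in this diff. */
#define EXT4_ENCRYPT_FL		0x00000800u
#define EXT4_JOURNAL_DATA_FL	0x00004000u

#define EXT4_DAX_MUT_EXCL	(EXT4_VERITY_FL | EXT4_ENCRYPT_FL | \
				 EXT4_JOURNAL_DATA_FL)

/* The flag value is bit 25, matching the new enum entry. */
static_assert(EXT4_DAX_FL == (1u << EXT4_INODE_DAX), "DAX flag is bit 25");

/* Both user masks change by exactly the DAX bit. */
static_assert((0x705BDFFFu | EXT4_DAX_FL) == 0x725BDFFFu, "visible mask");
static_assert((0x604BC0FFu | EXT4_DAX_FL) == 0x624BC0FFu, "modifiable mask");

/* Hypothetical helper: true if DAX is combined with an excluded flag. */
bool dax_flags_conflict(uint32_t flags)
{
	return (flags & EXT4_DAX_FL) && (flags & EXT4_DAX_MUT_EXCL);
}
```

Applying the same kind of check to the mount flags in the `-1180,6 +1190,8` hunk shows `EXT4_MOUNT2_DAX_NEVER` and `EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM` both listed as 0x00000008, which may be worth confirming against the tree.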

fs/ext4/ialloc.c

Lines changed: 1 addition & 1 deletion
@@ -1116,7 +1116,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	ei->i_block_group = group;
 	ei->i_last_alloc_group = ~0;

-	ext4_set_inode_flags(inode);
+	ext4_set_inode_flags(inode, true);
 	if (IS_DIRSYNC(inode))
 		ext4_handle_sync(handle);
 	if (insert_inode_locked(inode) < 0) {

fs/ext4/inode.c

Lines changed: 20 additions & 6 deletions
@@ -4406,9 +4406,11 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 		!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
 }

-static bool ext4_should_use_dax(struct inode *inode)
+static bool ext4_should_enable_dax(struct inode *inode)
 {
-	if (!test_opt(inode->i_sb, DAX))
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	if (test_opt2(inode->i_sb, DAX_NEVER))
 		return false;
 	if (!S_ISREG(inode->i_mode))
 		return false;
@@ -4420,14 +4422,21 @@ static bool ext4_should_use_dax(struct inode *inode)
 		return false;
 	if (ext4_test_inode_flag(inode, EXT4_INODE_VERITY))
 		return false;
-	return true;
+	if (!test_bit(EXT4_FLAGS_BDEV_IS_DAX, &sbi->s_ext4_flags))
+		return false;
+	if (test_opt(inode->i_sb, DAX_ALWAYS))
+		return true;
+
+	return ext4_test_inode_flag(inode, EXT4_INODE_DAX);
 }

-void ext4_set_inode_flags(struct inode *inode)
+void ext4_set_inode_flags(struct inode *inode, bool init)
 {
 	unsigned int flags = EXT4_I(inode)->i_flags;
 	unsigned int new_fl = 0;

+	WARN_ON_ONCE(IS_DAX(inode) && init);
+
 	if (flags & EXT4_SYNC_FL)
 		new_fl |= S_SYNC;
 	if (flags & EXT4_APPEND_FL)
@@ -4438,8 +4447,13 @@ void ext4_set_inode_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		new_fl |= S_DIRSYNC;
-	if (ext4_should_use_dax(inode))
+
+	/* Because of the way inode_set_flags() works we must preserve S_DAX
+	 * here if already set. */
+	new_fl |= (inode->i_flags & S_DAX);
+	if (init && ext4_should_enable_dax(inode))
 		new_fl |= S_DAX;
+
 	if (flags & EXT4_ENCRYPT_FL)
 		new_fl |= S_ENCRYPTED;
 	if (flags & EXT4_CASEFOLD_FL)
@@ -4653,7 +4667,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		 * not initialized on a new filesystem. */
 	}
 	ei->i_flags = le32_to_cpu(raw_inode->i_flags);
-	ext4_set_inode_flags(inode);
+	ext4_set_inode_flags(inode, true);
 	inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
 	ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
 	if (ext4_has_feature_64bit(sb))
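
Note on the inode.c hunks above: S_DAX is now decided once, when the inode is instantiated (`init == true`), and the documentation in this commit says the result is visible to user space as `STATX_ATTR_DAX`. A minimal sketch of such a query follows, assuming a glibc that provides the `statx()` wrapper. `file_dax_active()` is a hypothetical name, and the fallback value for `STATX_ATTR_DAX` is an assumption taken from the kernel's uapi header for toolchains whose headers predate it.

```c
#define _GNU_SOURCE /* exposes statx() and struct statx on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000 /* assumed value for older headers */
#endif

/*
 * Returns 1 if the kernel reports the file as currently in DAX mode,
 * 0 if it does not, and -1 if statx() itself fails.
 */
int file_dax_active(const char *path)
{
	struct statx stx;

	assert(path != NULL);

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
		return -1;
	return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}
```

A result of 0 does not distinguish "DAX is off" from "this kernel does not report the attribute", so the sketch is only meaningful on kernels that implement `STATX_ATTR_DAX`.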
