
Commit 7180f8d

mjguzik authored and brauner committed
vfs: add rcu-based find_inode variants for iget ops
This avoids one inode hash lock acquire in the common case on inode
creation, in effect significantly reducing contention. On the stock
kernel said lock is typically taken twice:

1. once to check if the inode happens to already be present
2. once to add it to the hash

The back-to-back lock/unlock pattern is known to degrade performance
significantly, which is further exacerbated if the hash is heavily
populated (long chains to walk, extending hold time). Arguably hash
sizing and the hashing algorithm need to be revisited, but that's
beyond the scope of this patch.

With the acquire from step 1 eliminated via an RCU lookup, throughput
increases significantly at the scale of 20 cores (benchmark results at
the bottom).

As it happens the hash already supports RCU-based operation, but
lookups on inode insertion did not take advantage of it.

This of course has its limits as the global lock is still a bottleneck.
There was a patchset posted which introduced fine-grained locking[1]
but it appears stalled. Apart from that, doubt was expressed whether a
handrolled hash implementation is appropriate to begin with, suggesting
replacement with rhashtables. Nobody committed to carrying [1] across
the finish line or implementing anything better, thus the bandaid
below.

iget_locked consumers (notably ext4) get away without any changes
because the inode comparison method is built in. iget5_locked consumers
pass a custom callback. Since removing the locking adds new problems
(the inode can be changing under the callback), it's not safe to assume
all filesystems happen to cope. Thus iget5_locked_rcu gets added,
requiring manual conversion of interested filesystems.

In order to reduce code duplication, find_inode and find_inode_fast
grow an argument indicating whether the inode hash lock is held, which
is passed down in case sleeping is necessary. They always
rcu_read_lock(), which is redundant but harmless; doing it
conditionally would reduce readability for no real gain that I can see.
RCU-like restrictions were already put on the callbacks due to the hash
spinlock being held.

Benchmarking:

There is a real cache-busting workload scanning millions of files in
parallel (it's a backup appliance), where the initial lookup is
guaranteed to fail, resulting in the two lock acquires on the stock
kernel (and one with the patch at hand). Implemented below is a
synthetic benchmark providing the same behavior. [I shall note the
workload is not running on Linux; instead it was causing trouble
elsewhere. The benchmark below was used while addressing said problems
and was found to adequately represent the real workload.]

Total real time fluctuates by 1-2s.

With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB RAM vm:

ext4 (needed mkfs.ext4 -N 24000000):
before: 3.77s user 890.90s system 1939% cpu 46.118 total
after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)

That's 20 million files to visit, while the machine can only cache
about 15 million at a time (obtained from the ext4_inode_cache object
count in /proc/slabinfo). Since each terminal inode is only visited
once per run, this amounts to a 0% hit ratio for the dentry cache and
the inode hash table (there are, however, hits for the intermediate
directories). On repeated runs the kernel caches the last ~15 mln,
meaning there are ~5 mln uncached inodes which are going to be visited
first, evicting the previously cached state as it happens.
Lack of hits can be trivially verified with bpftrace, like so:

bpftrace -e 'kretprobe:find_inode_fast { @[kstack(), retval != 0] = count(); }'\
 -c "/bin/sh walktrees /testfs 20"

Best run more than once.

Expected results after "warmup":
[snip]
@[
    __ext4_iget+275
    ext4_lookup+224
    __lookup_slow+130
    walk_component+219
    link_path_walk.part.0.constprop.0+614
    path_lookupat+62
    filename_lookup+204
    vfs_statx+128
    vfs_fstatat+131
    __do_sys_newfstatat+38
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
, 1]: 20000
@[
    __ext4_iget+275
    ext4_lookup+224
    __lookup_slow+130
    walk_component+219
    path_lookupat+106
    filename_lookup+204
    vfs_statx+128
    vfs_fstatat+131
    __do_sys_newfstatat+38
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
, 1]: 20000000

That is 20 million calls for the initial lookup and 20 million after
allocating a new inode, all of them failing to return a value != 0
(i.e., they are returning NULL -- no match found).

Of course aborting the benchmark in the middle and starting it again
(or messing with the state in other ways) is going to alter these
results.

Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
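For readers who don't want to fetch the tarball, here is a minimal sketch of the kind of walker the benchmark describes. This is an illustrative approximation, not the actual contents of fstree.tgz; the tree paths and thread handling are assumptions made for the example.

/*
 * Illustrative walker: N threads, each recursively stat(2)-ing every
 * entry under its own directory tree (e.g. /testfs/0, /testfs/1, ...).
 *
 * Build with: cc -O2 -pthread walker.c -o walker
 * Run as:     ./walker /testfs/0 /testfs/1 ...   (one tree per thread)
 */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void walk(const char *path)
{
	char child[4096];
	struct dirent *de;
	DIR *dir = opendir(path);

	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		struct stat st;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		/* The stat is the point: a cold file forces an inode lookup. */
		if (lstat(child, &st) == -1)
			continue;
		if (S_ISDIR(st.st_mode))
			walk(child);
	}
	closedir(dir);
}

static void *worker(void *arg)
{
	walk(arg);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[64];
	int i, n = argc - 1 > 64 ? 64 : argc - 1;

	for (i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, worker, argv[i + 1]);
	for (i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Each terminal file is visited exactly once per run, which is what produces the 0% hash-table hit ratio discussed above.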
1 parent 1613e60 commit 7180f8d

2 files changed: +86, -18 lines

fs/inode.c

Lines changed: 80 additions & 17 deletions
@@ -886,36 +886,45 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 	return freed;
 }
 
-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct inode *inode, bool locked);
 /*
  * Called with the inode lock held.
  */
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct inode *inode = NULL;
 
+	if (locked)
+		lockdep_assert_held(&inode_hash_lock);
+	else
+		lockdep_assert_not_held(&inode_hash_lock);
+
+	rcu_read_lock();
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(inode, locked);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
 			spin_unlock(&inode->i_lock);
+			rcu_read_unlock();
 			return ERR_PTR(-ESTALE);
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -924,29 +933,39 @@ static struct inode *find_inode(struct super_block *sb,
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_head *head, unsigned long ino,
+				bool locked)
 {
 	struct inode *inode = NULL;
 
+	if (locked)
+		lockdep_assert_held(&inode_hash_lock);
+	else
+		lockdep_assert_not_held(&inode_hash_lock);
+
+	rcu_read_lock();
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(inode, locked);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
 			spin_unlock(&inode->i_lock);
+			rcu_read_unlock();
 			return ERR_PTR(-ESTALE);
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -1161,7 +1180,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 
 again:
 	spin_lock(&inode_hash_lock);
-	old = find_inode(inode->i_sb, head, test, data);
+	old = find_inode(inode->i_sb, head, test, data, true);
 	if (unlikely(old)) {
 		/*
 		 * Uhhuh, somebody else created the same inode under us.
@@ -1245,6 +1264,48 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 }
 EXPORT_SYMBOL(iget5_locked);
 
+/**
+ * iget5_locked_rcu - obtain an inode from a mounted file system
+ * @sb: super block of file system
+ * @hashval: hash value (usually inode number) to get
+ * @test: callback used for comparisons between inodes
+ * @set: callback used to initialize a new struct inode
+ * @data: opaque data pointer to pass to @test and @set
+ *
+ * This is equivalent to iget5_locked, except the @test callback must
+ * tolerate the inode not being stable, including being mid-teardown.
+ */
+struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
+		int (*test)(struct inode *, void *),
+		int (*set)(struct inode *, void *), void *data)
+{
+	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode *inode, *new;
+
+again:
+	inode = find_inode(sb, head, test, data, false);
+	if (inode) {
+		if (IS_ERR(inode))
+			return NULL;
+		wait_on_inode(inode);
+		if (unlikely(inode_unhashed(inode))) {
+			iput(inode);
+			goto again;
+		}
+		return inode;
+	}
+
+	new = alloc_inode(sb);
+	if (new) {
+		new->i_state = 0;
+		inode = inode_insert5(new, hashval, test, set, data);
+		if (unlikely(inode != new))
+			destroy_inode(new);
+	}
+	return inode;
+}
+EXPORT_SYMBOL_GPL(iget5_locked_rcu);
+
 /**
  * iget_locked - obtain an inode from a mounted file system
  * @sb: super block of file system
@@ -1263,9 +1324,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	inode = find_inode_fast(sb, head, ino, false);
 	if (inode) {
 		if (IS_ERR(inode))
 			return NULL;
@@ -1283,7 +1342,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 
 	spin_lock(&inode_hash_lock);
 	/* We released the lock, so.. */
-	old = find_inode_fast(sb, head, ino);
+	old = find_inode_fast(sb, head, ino, true);
 	if (!old) {
 		inode->i_ino = ino;
 		spin_lock(&inode->i_lock);
@@ -1419,7 +1478,7 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 	struct inode *inode;
 
 	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
+	inode = find_inode(sb, head, test, data, true);
 	spin_unlock(&inode_hash_lock);
 
 	return IS_ERR(inode) ? NULL : inode;
@@ -1474,7 +1533,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
 	struct inode *inode;
again:
 	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
+	inode = find_inode_fast(sb, head, ino, true);
 	spin_unlock(&inode_hash_lock);
 
 	if (inode) {
@@ -2235,17 +2294,21 @@ EXPORT_SYMBOL(inode_needs_sync);
  * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
  * will DTRT.
  */
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct inode *inode, bool locked)
 {
 	wait_queue_head_t *wq;
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	rcu_read_unlock();
+	if (locked)
+		spin_unlock(&inode_hash_lock);
 	schedule();
 	finish_wait(wq, &wait.wq_entry);
-	spin_lock(&inode_hash_lock);
+	if (locked)
+		spin_lock(&inode_hash_lock);
+	rcu_read_lock();
 }
 
 static __initdata unsigned long ihash_entries;

include/linux/fs.h

Lines changed: 6 additions & 1 deletion
@@ -3029,7 +3029,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *),
 		void *data);
-extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
+struct inode *iget5_locked(struct super_block *, unsigned long,
+		int (*test)(struct inode *, void *),
+		int (*set)(struct inode *, void *), void *);
+struct inode *iget5_locked_rcu(struct super_block *, unsigned long,
+		int (*test)(struct inode *, void *),
+		int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
 extern struct inode *find_inode_nowait(struct super_block *,
 				unsigned long,
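As a rough illustration of the "manual conversion of interested filesystems" the commit message mentions, here is a minimal sketch of a hypothetical caller switching from iget5_locked to iget5_locked_rcu. The filesystem and names (examplefs_iget, examplefs_test, examplefs_set) are invented for the example and are not part of this commit; only a caller whose @test reads fields that stay meaningful while the inode is unstable can convert this simply.

/* Hypothetical sketch, not from this commit. */
#include <linux/err.h>
#include <linux/fs.h>

static int examplefs_test(struct inode *inode, void *data)
{
	/*
	 * With iget5_locked_rcu() this runs without inode_hash_lock, so it
	 * must tolerate an inode that is not stable (e.g. mid-teardown).
	 * Comparing a key held in the VFS inode itself keeps that simple.
	 */
	return inode->i_ino == *(unsigned long *)data;
}

static int examplefs_set(struct inode *inode, void *data)
{
	inode->i_ino = *(unsigned long *)data;
	return 0;
}

struct inode *examplefs_iget(struct super_block *sb, unsigned long ino)
{
	struct inode *inode;

	/* Previously: iget5_locked(sb, ino, examplefs_test, examplefs_set, &ino); */
	inode = iget5_locked_rcu(sb, ino, examplefs_test, examplefs_set, &ino);
	if (!inode)
		return ERR_PTR(-ENOMEM);
	if (!(inode->i_state & I_NEW))
		return inode;

	/* ... read the on-disk inode and fill in fields here ... */

	unlock_new_inode(inode);
	return inode;
}

A filesystem whose @test dereferences fs-private state that may change or be torn down concurrently would need extra care (or should stay on iget5_locked), which is why the conversion is opt-in rather than automatic.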
