Commit 522544f

Merge tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:

 - Poisoned extents can now be moved: this lets us handle bitrotted data
   without deleting it. For now, reading from poisoned extents only returns
   -EIO: in the future we'll have an API for specifying "read this data even
   if there were bitflips".

 - Incompatible features may now be enabled at runtime, via
   "opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
   toggle it back - option changes via the sysfs interface are persistent.

 - Various changes to support deployable disk images:

    - RO mounts now use less memory

    - Images may be stripped of alloc info, particularly useful for slimming
      them down if they will primarily be mounted RO. Alloc info will be
      automatically regenerated on first RW mount, and this is quite fast

    - Filesystem images generated with 'bcachefs image' will be
      automatically resized the first time they're mounted on a larger
      device

      The images 'bcachefs image' generates with compression enabled have
      been comparable in size to those generated by squashfs and erofs - but
      you get a full RW capable filesystem

 - Major error message improvements for btree node reads, data reads, and
   elsewhere. We now build up a single error message that lists all the
   errors encountered, actions taken to repair, and success/failure of the
   IO.

   This extends to other error paths that may kick off other actions, e.g.
   scheduling recovery passes: actions we took because of an error are
   included in that error message, with grouping/indentation so we can see
   what caused what.

 - New option, 'rebalance_on_ac_only'. Does exactly what the name suggests,
   quite handy with background compression.

 - Repair/self healing:

    - We can now kick off recovery passes and run them in the background if
      we detect errors. Currently, this is just used by code that walks
      backpointers. We now also check for missing backpointers at runtime
      and run check_extents_to_backpointers if required.

      The messy 6.14 upgrade left missing backpointers for some users, and
      this will correct that automatically instead of requiring a manual
      fsck - some users noticed this as copygc spinning and not making
      progress.

      In the future, as more recovery passes come online, we'll be able to
      repair and recover from nearly anything - except for unreadable btree
      nodes, and that's why you're using replication, of course - without
      shutting down the filesystem.

    - There's a new recovery pass, for checking the rebalance_work btree,
      which tracks extents that rebalance will process later.

 - Hardening:

    - Close the last known hole in btree iterator/btree locking assertions:
      path->should_be_locked paths must stay locked until the end of the
      transaction. This shook out a few bugs, including a performance issue
      that was causing unnecessary path_upgrade transaction restarts.

 - Performance:

    - Faster snapshot deletion: this is an incompatible feature, as it
      requires new sentinel values, for safety.

      Snapshot deletion no longer has to do a full metadata scan, it now
      just scans the inodes btree: if an extent/dirent/xattr is present for
      a given snapshot ID, we already require that an inode be present with
      that same snapshot ID.

      If/when users hit scalability limits again (ridiculously huge
      filesystems with lots of inodes, and many sparse snapshots), let me
      know - the next step will be to add an index from snapshot ID ->
      inode number, which won't be too hard.

    - Faster device removal: the "scan for pointers to this device" no
      longer does a full metadata scan, instead it walks backpointers. Like
      fast snapshot deletion this is another incompat feature: it also
      requires a new sentinel value, because we don't want to reuse these
      device IDs until after a fsck.

    - We're now coalescing redundant accounting updates prior to
      transaction commit, taking some pressure off the journal. Shortly
      we'll also be doing multiple extent updates in a transaction in the
      main write path, which combined with the previous should drastically
      cut down on the amount of metadata updates we have to journal.

    - Stack usage improvements: all allocator state has been moved off the
      stack

 - Debug improvements:

    - Enumerated refcounts: the debug code previously used for filesystem
      write refs is now a small library, and used for other heavily used
      refcounts. Different users of a refcount are enumerated, making it
      much easier to debug refcount issues.

    - Async object debugging: there's a new kconfig option that makes
      various async objects (different types of bios, data updates, write
      ops, etc.) visible in debugfs, and it should be fast enough to leave
      on in production.

    - Various sets of assertions no longer require CONFIG_BCACHEFS_DEBUG,
      instead they're controlled by module parameters and static keys,
      meaning users won't need to compile custom kernels as often to help
      debug issues.

    - bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
      option). With it on you can check the btree_transaction_stats in
      debugfs to see the bch2_trans_kmalloc() calls a transaction did when
      it used the most memory.
* tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
  bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
  bcachefs: Fix btree_iter_next_node() for new locking asserts
  bcachefs: Ensure we don't use a blacklisted journal seq
  bcachefs: Small check_fix_ptr fixes
  bcachefs: Fix opts.recovery_pass_last
  bcachefs: Fix allocate -> self healing path
  bcachefs: Fix endianness in casefold check/repair
  bcachefs: Path must be locked if trans->locked && should_be_locked
  bcachefs: Simplify bch2_path_put()
  bcachefs: Plumb btree_trans for more locking asserts
  bcachefs: Clear trans->locked before unlock
  bcachefs: Clear should_be_locked before unlock in key_cache_drop()
  bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
  bcachefs: Give out new path if upgrade fails
  bcachefs: Fix btree_path_get_locks when not doing trans restart
  bcachefs: btree_node_locked_type_nowrite()
  bcachefs: Kill bch2_path_put_nokeep()
  bcachefs: bch2_journal_write_checksum()
  bcachefs: Reduce stack usage in data_update_index_update()
  bcachefs: bch2_trans_log_str()
  ...
2 parents 8fdabcd + 9caea92 commit 522544f

File tree: 141 files changed (+7129, -3542 lines)


Documentation/filesystems/bcachefs/casefolding.rst (18 additions, 0 deletions)

@@ -88,3 +88,21 @@ This would fail if negative dentry's were cached.

 This is slightly suboptimal, but could be fixed in future with some vfs work.
+
+References
+----------
+
+(from Peter Anvin, on the list)
+
+It is worth noting that Microsoft has basically declared their
+"recommended" case folding (upcase) table to be permanently frozen (for
+new filesystem instances in the case where they use an on-disk
+translation table created at format time.) As far as I know they have
+never supported anything other than 1:1 conversion of BMP code points,
+nor normalization.
+
+The exFAT specification enumerates the full recommended upcase table,
+although in a somewhat annoying format (basically a hex dump of
+compressed data):
+
+https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
Documentation/filesystems/bcachefs/future/idle_work.rst (new file, 78 additions)

Idle/background work classes design doc:

Right now, our behaviour at idle isn't ideal, it was designed for servers
that would be under sustained load, to keep pending work at a "medium"
level, to let work build up so we can process it in more efficient batches,
while also giving headroom for bursts in load.

But for desktops or mobile - scenarios where work is less sustained and
power usage is more important - we want to operate differently, with a
"rush to idle" so the system can go to sleep. We don't want to be dribbling
out background work while the system should be idle.

The complicating factor is that there are a number of background tasks,
which form a hierarchy (or a digraph, depending on how you divide it up) -
one background task may generate work for another.

Thus proper idle detection needs to model this hierarchy.

- Foreground writes
- Page cache writeback
- Copygc, rebalance
- Journal reclaim

When we implement idle detection and rush to idle, we need to be careful
not to disturb too much the existing behaviour that works reasonably well
when the system is under sustained load (or perhaps improve it in the case
of rebalance, which currently does not actively attempt to let work batch
up).

SUSTAINED LOAD REGIME
---------------------

When the system is under continuous load, we want these jobs to run
continuously - this is perhaps best modelled with a P/D controller, where
they'll be trying to keep a target value (i.e. fragmented disk space,
available journal space) roughly in the middle of some range.

The goal under sustained load is to balance our ability to handle load
spikes without running out of x resource (free disk space, free space in
the journal), while also letting some work accumulate to be batched (or
become unnecessary).

For example, we don't want to run copygc too aggressively, because then it
will be evacuating buckets that would have become empty (been overwritten
or deleted) anyways, and we don't want to wait until we're almost out of
free space because then the system will behave unpredictably - suddenly
we're doing a lot more work to service each write and the system becomes
much slower.
IDLE REGIME
-----------

When the system becomes idle, we should start flushing our pending work
quicker so the system can go to sleep.

Note that the definition of "idle" depends on where in the hierarchy a
task is - a task should start flushing work more quickly when the task
above it has stopped generating new work.

e.g. rebalance should start flushing more quickly when page cache
writeback is idle, and journal reclaim should only start flushing more
quickly when both copygc and rebalance are idle.

It's important to let work accumulate when more work is still incoming and
we still have room, because flushing is always more efficient if we let it
batch up. New writes may overwrite data before rebalance moves it, and
tasks may be generating more updates for the btree nodes that journal
reclaim needs to flush.

On idle, how much work we do at each interval should be proportional to
the length of time we have been idle for. If we're idle only for a short
duration, we shouldn't flush everything right away; the system might wake
up and start generating new work soon, and flushing immediately might end
up doing a lot of work that would have been unnecessary if we'd allowed
things to batch more.

To summarize, we will need:

- A list of classes for background tasks that generate work, which will
  include one "foreground" class.
- Tracking for each class - "Am I doing work, or have I gone to sleep?"
- And each class should check the class above it when deciding how much
  work to issue.

Documentation/filesystems/bcachefs/index.rst (7 additions, 0 deletions)

@@ -29,3 +29,10 @@ At this moment, only a few of these are described here.

    casefolding
    errorcodes
+
+Future design
+-------------
+.. toctree::
+   :maxdepth: 1
+
+   future/idle_work

fs/bcachefs/Kconfig (8 additions, 0 deletions)

@@ -103,6 +103,14 @@ config BCACHEFS_PATH_TRACEPOINTS
 	  Enable extra tracepoints for debugging btree_path operations; we don't
 	  normally want these enabled because they happen at very high rates.

+config BCACHEFS_TRANS_KMALLOC_TRACE
+	bool "Trace bch2_trans_kmalloc() calls"
+	depends on BCACHEFS_FS
+
+config BCACHEFS_ASYNC_OBJECT_LISTS
+	bool "Keep async objects on fast_lists for debugfs visibility"
+	depends on BCACHEFS_FS && DEBUG_FS
+
 config MEAN_AND_VARIANCE_UNIT_TEST
 	tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
 	depends on KUNIT

fs/bcachefs/Makefile (4 additions, 0 deletions)

@@ -35,11 +35,13 @@ bcachefs-y := \
 	disk_accounting.o \
 	disk_groups.o \
 	ec.o \
+	enumerated_ref.o \
 	errcode.o \
 	error.o \
 	extents.o \
 	extent_update.o \
 	eytzinger.o \
+	fast_list.o \
 	fs.o \
 	fs-ioctl.o \
 	fs-io.o \
@@ -97,6 +99,8 @@ bcachefs-y := \
 	varint.o \
 	xattr.o

+bcachefs-$(CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS) += async_objs.o
+
 obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST) += mean_and_variance_test.o

 # Silence "note: xyz changed in GCC X.X" messages
