---
title: "Optimizing My Disk Usage Program"
date: "2025-08-07"
tags: ["rust"]
description: "Increasing performance by reducing thread scheduling overhead and lock contention."
---

In my previous post, [Maybe the Fastest Disk Usage Program on macOS](https://healeycodes.com/maybe-the-fastest-disk-usage-program-on-macos), I wrote about [dumac](https://github.com/healeycodes/dumac), a very fast alternative to `du -sh`. It uses a macOS-specific syscall, [getattrlistbulk](https://man.freebsd.org/cgi/man.cgi?query=getattrlistbulk&sektion=2&manpath=macOS+13.6.5), to be much faster than the next leading disk usage program.

I received some great technical feedback in the [Lobsters thread](https://lobste.rs/s/ddphh5/maybe_fastest_disk_usage_program_on_macos). After implementing some of the suggestions, I was able to increase performance by ~28% on my [large benchmark](https://github.com/healeycodes/dumac/blob/a2901c6867f194be73f92486826f33d3cf7658cb/setup_benchmark.py#L90).

```text
hyperfine --warmup 3 --min-runs 5 './before temp/deep' './after temp/deep'
Benchmark 1: ./before temp/deep
Time (mean ± σ): 910.4 ms ± 10.1 ms [User: 133.4 ms, System: 3888.5 ms]
Range (min … max): 894.5 ms … 920.0 ms 5 runs

Benchmark 2: ./after temp/deep
Time (mean ± σ): 711.9 ms ± 10.5 ms [User: 73.9 ms, System: 2705.6 ms]
Range (min … max): 700.1 ms … 725.0 ms 5 runs

Summary
./after temp/deep ran
1.28 ± 0.02 times faster than ./before temp/deep
```

The main performance gain came from reducing thread scheduling overhead; smaller gains came from optimizing access to the inode hash-set shards.

## Better Parallelism

The previous version of `dumac` used [Tokio](https://crates.io/crates/tokio) to spawn a task for each directory.

```rust
// Process subdirectories concurrently
if !dir_info.subdirs.is_empty() {
    let futures: Vec<_> = dir_info.subdirs.into_iter()
        .map(|subdir| {
            let subdir_path = Path::new(&root_dir)
                .join(&subdir)
                .to_string_lossy()
                .to_string();
            tokio::spawn(async move {
                calculate_size(subdir_path).await
            })
        })
        .collect();

    // Collect all results
    for future in futures {
        match future.await {
            Ok(Ok(size)) => total_size += size,
            // ..
        }
    }
}
```

And then in the middle of this task, calling `getattrlistbulk` required a blocking call:

```rust
// In the middle of the async task
// ..

let dir_info = tokio::task::spawn_blocking({
    let root_dir = root_dir.clone();
    move || get_dir_info(&root_dir) // Calls getattrlistbulk
}).await.map_err(|_| "task join error".to_string())??;
```

Tokio runs many tasks on a few threads by swapping the running task at each `.await`. Blocking threads are spawned on demand to avoid blocking the core threads which are handling tasks. I learned this _after_ shipping the first version when I read [CPU-bound tasks and blocking code](https://docs.rs/tokio/latest/tokio/index.html#cpu-bound-tasks-and-blocking-code).

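Here's a rough sketch of that model (not dumac's code; the thread counts are illustrative). The multi-threaded runtime keeps a small, fixed set of worker threads for async tasks, while `spawn_blocking` ships work to a separate pool whose threads are created on demand (capped at 512 by default):

```rust
use std::time::Duration;

fn main() {
    // Worker and blocking thread counts here are illustrative, not dumac's settings.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)          // core threads that drive async tasks
        .max_blocking_threads(512)  // cap on the on-demand blocking pool
        .enable_all()
        .build()
        .unwrap();

    runtime.block_on(async {
        // This closure runs on a blocking-pool thread, so it can't stall the
        // worker threads that are swapping async tasks at each .await.
        let blocks = tokio::task::spawn_blocking(|| {
            std::thread::sleep(Duration::from_millis(10)); // stand-in for a blocking syscall
            42
        })
        .await
        .unwrap();
        assert_eq!(blocks, 42);
    });
}
```
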
Spawning a new thread for each syscall is unnecessary overhead. Additionally, there aren't many opportunities to use non-blocking I/O since `getattrlistbulk` is blocking. Opening the file descriptor on the directory could be made async with something like [AsyncFd](https://docs.rs/tokio/latest/tokio/io/unix/struct.AsyncFd.html), but it's very quick and isn't the bottleneck.

To reiterate the problem: I'm using Tokio as a thread pool, paying for its task scheduling without benefiting from it, and worse, I'm creating lots of new threads (one per directory).

YogurtGuy saw this issue and suggested using [Rayon](https://crates.io/crates/rayon) instead of Tokio:

> As it stands, each thread must operate on a semaphore resource to limit concurrency. If this was limited directly via the number of threads, each thread could perform operations without trampling on each other. […] In addition, each call to `spawn_blocking` appears to involve inter-thread communication. Work-stealing would allow each thread to create and consume work without communication.

By using Rayon, I can re-use threads from a thread pool, and avoid creating a new thread for each `getattrlistbulk` call. Using a work-stealing design, I can recursively call `calculate_size`:

```rust
// Calculate total size recursively using rayon work-stealing
pub fn calculate_size(root_dir: String) -> Result<i64, String> {
    // ..

    // Process subdirectories in parallel
    let subdir_size = if !dir_info.subdirs.is_empty() {
        dir_info
            .subdirs

            // Parallel iterator.
            // The recursive calculate_size calls are scheduled with
            // very little overhead!
            .into_par_iter()
            .map(|subdir| {
                let subdir_path = Path::new(&root_dir)
                    .join(&subdir)
                    .to_string_lossy()
                    .to_string();
                calculate_size(subdir_path)
            })
            .map(|result| match result {
                Ok(size) => size,
                Err(e) => {
                    eprintln!("dumac: {}", e);
                    0
                }
            })
            .sum()
    } else {
        0
    };

    // ..
```

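For completeness, here's a sketch of how the recursive `calculate_size` can be driven from Rayon's global pool (the `main` below and the pool size are illustrative, not dumac's actual setup):

```rust
fn main() {
    // Optional: pin the global pool size. By default Rayon creates one worker
    // thread per logical core and reuses them for every parallel iterator.
    rayon::ThreadPoolBuilder::new()
        .num_threads(8) // illustrative value
        .build_global()
        .expect("failed to build global thread pool");

    match calculate_size("temp/deep".to_string()) {
        Ok(blocks) => println!("{} blocks", blocks),
        Err(e) => eprintln!("dumac: {}", e),
    }
}
```
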
I benchmarked and profiled the two approaches (Tokio tasks with many blocking calls, and Rayon work-stealing) to see what changes I could observe.

Using macOS's Instruments, I checked that `dumac` was now using a fixed number of threads:

![Comparing the system call trace view by thread](threads.png)

Additionally, the number of syscalls has been halved. Although the *count* of syscalls is not a perfect proxy for performance, in this case, it suggests we've achieved the simplicity we're after.

The macOS syscalls related to Tokio's thread scheduling are greatly reduced. Also, not pictured, the number of context switches was reduced by ~80% (1.2M -> 235k).

![Comparing the system call trace view by syscall](syscalls.png)

And finally, the most important result is the large benchmark.

```text
hyperfine --warmup 3 --min-runs 5 './before temp/deep' './after temp/deep'
Benchmark 1: ./before temp/deep
Time (mean ± σ): 901.3 ms ± 47.1 ms [User: 125.5 ms, System: 3394.6 ms]
Range (min … max): 821.6 ms … 942.8 ms 5 runs

Benchmark 2: ./after temp/deep
Time (mean ± σ): 731.6 ms ± 20.6 ms [User: 76.5 ms, System: 2681.7 ms]
Range (min … max): 717.4 ms … 767.1 ms 5 runs

Summary
./after temp/deep ran
1.23 ± 0.07 times faster than ./before temp/deep
```

Why is it faster? Because we're creating and managing fewer threads, and we're waiting on fewer syscalls that are unrelated to the core work we're doing.

## Reducing Inode Lock Contention

YogurtGuy had [another point](https://lobste.rs/s/ddphh5/maybe_fastest_disk_usage_program_on_macos#c_fkix5i) on how `dumac` deduplicates inodes. In order to accurately report disk usage, hard links are deduplicated by their underlying inode. This means that, while our highly concurrent program is running, we need to read and write a shared data structure from all running threads.

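The simplest shared structure would be a single set behind a single mutex, which every thread has to queue up on. A sketch (not dumac's code):

```rust
use std::collections::HashSet;
use std::sync::{LazyLock, Mutex};

// One global lock: every thread that discovers an inode funnels through it,
// so under heavy parallelism the mutex itself becomes the bottleneck.
static SEEN_INODES_NAIVE: LazyLock<Mutex<HashSet<u64>>> =
    LazyLock::new(|| Mutex::new(HashSet::new()));

fn check_and_add_inode_naive(inode: u64, blocks: i64) -> i64 {
    let mut seen = SEEN_INODES_NAIVE.lock().unwrap();
    if seen.insert(inode) { blocks } else { 0 }
}
```
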
I chose a sharded hash-set to reduce lock contention. Rather than a single hash-set with a single mutex, there are `128` hash-sets with `128` mutexes. The inode (a `u64`) is moduloed by `128` to find the hash-set that needs to be locked and accessed:

```rust
// Global sharded inode set for hardlink deduplication
static SEEN_INODES: LazyLock<[Mutex<HashSet<u64>>; SHARD_COUNT]> =
    LazyLock::new(|| std::array::from_fn(|_| Mutex::new(HashSet::new())));

fn shard_for_inode(inode: u64) -> usize {
    (inode % SHARD_COUNT as u64) as usize
}

// Returns the blocks to add (blocks if newly seen, 0 if already seen)
fn check_and_add_inode(inode: u64, blocks: i64) -> i64 {
    let shard_idx = shard_for_inode(inode);
    let shard = &SEEN_INODES[shard_idx];

    let mut seen = shard.lock();
    if seen.insert(inode) {
        blocks // Inode was newly added, count the blocks
    } else {
        0 // Inode already seen, don't count
    }
}
```

This would be the right approach if inodes were distributed randomly across the `u64` space but, as [YogurtGuy points out](https://lobste.rs/s/ddphh5/maybe_fastest_disk_usage_program_on_macos#c_fkix5i), they are not:

> I like the sharded hash-set approach, but I think the implementation may have some inefficiencies. First, the [shard is chosen by `inode % SHARD_COUNT`](https://github.com/healeycodes/dumac/blob/152dad272ae3e1c73ecaead23341fb32392729ee/src/main.rs#L42). Inodes tend to be sequential, so while this choice of shard will distribute requests across multiple hash sets, it will also increase contention, since a single directory will have its inodes distributed across every hash set. I wonder if `(inode >> K) % SHARD_COUNT` might therefore lead to better performance for some values of `K`. Especially if each thread batched its requests to each hash set.

I looked at the data and saw that they were right.

```rust
// Inside calculate_size
// ..

all_inodes.sort();
println!("dir_info: {:?}, all_inodes: {:?}", dir_info, all_inodes);

// They're sequential!
// subdirs: [] }, all_inodes: [50075095, 50075096, 50075097, 50075098, 50075099, 50075100, 50075101, 50075102, 50075103, 50075104, 50075105, 50075106, 50075107, 50075108, 50075109, 50075110, 50075111, 50075112, 50075113, 50075114, 50075115, 50075116, 50075117, 50075118, 50075119, 50075120, 50075121, 50075122, 50075123, 50075124, 50075125, 50075126, 50075127, 50075128, 50075129, 50075130, 50075131, 50075132, 50075133, 50075134, 50075135, 50075136, 50075137, 50075138, 50075139, 50075140, 50075141, 50075142, 50075143, 50075144, 50075145, 50075146, 50075147, 50075148, 50075149, 50075150, 50075151, 50075152, 50075153, 50075154, 50075155, 50075156, 50075157, 50075158, 50075159, 50075160, 50075161, 50075162, 50075163, 50075164, 50075165, 50075166, 50075167, 50075168, 50075169, 50075170, 50075171, 50075172, 50075173, 50075174, 50075175, 50075176, 50075177, 50075178, 50075179, 50075180, 50075181, 50075182, 50075183, 50075184, 50075185, 50075186, 50075187, 50075188, 50075189, 50075190, 50075191, 50075192, 50075193, 50075194]
```

Each directory I spot-checked seemed to have sequential inodes. Inodes are created at the filesystem level (not at the directory level) and, since my benchmark files were created in quick succession, each directory's files have roughly sequential inodes.

This is true for many real-life cases. The contents of a directory are often written at the same time (e.g. `npm i`).

The reason this is a problem is that when we modulo by `128`, we don't necessarily reduce the chance of lock collisions. Recall that we're trying to take our inodes and shard them across many hash-sets. This is what that looks like:

```text
# dir1 handled by thread1

50075095 % 128 = 87
50075096 % 128 = 88
50075097 % 128 = 89
50075098 % 128 = 90
50075099 % 128 = 91
```

But if there's a separate directory with inodes starting at an entirely different point, say `65081175`, they can map to the same hash-sets:

```text
# dir2 handled by thread2

65081175 % 128 = 87 # same values as above
65081176 % 128 = 88
65081177 % 128 = 89
65081178 % 128 = 90
65081179 % 128 = 91
```

In the worst case, if thread1 and thread2 run at the same time, they could fight for the lock on each of the entries they are handling! You can imagine how many threads iterating over directories could hit this case.

I tested for contention while running my benchmark:

```rust
if shard.is_locked() {
    println!("shard is locked");
}
```

And found the average like so:

```bash
avg=$(sum=0; for i in {1..15}; do c=$(./target/release/dumac temp/deep | grep -c "shard is locked"); sum=$((sum + c)); done; echo "scale=2; $sum / 15" | bc); echo "Average: $avg"
# Average: 176.66
```

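(The `is_locked` check implies `parking_lot`'s `Mutex`; the standard library's mutex doesn't expose it, and its `lock()` would need an `unwrap`.) As a sketch, assuming the probe sits just before the lock is taken in `check_and_add_inode`:

```rust
// Sketch only: the exact placement of the probe is an assumption.
fn check_and_add_inode(inode: u64, blocks: i64) -> i64 {
    let shard = &SEEN_INODES[shard_for_inode(inode)];

    // If another thread currently holds this shard's lock, count it as a
    // collision; the benchmark greps and averages these lines.
    if shard.is_locked() {
        println!("shard is locked");
    }

    let mut seen = shard.lock();
    if seen.insert(inode) { blocks } else { 0 }
}
```
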
Since directories are handled one at a time (per thread), it would be ideal if the shard were per directory, but this isn't possible. We can get close by removing some of the least significant bits, say, the bottom `8`.

```text
# dir1 handled by thread1

50075095 >> 8 = 195605 # this block of 256 entries shares a key
50075096 >> 8 = 195605
50075097 >> 8 = 195605
50075098 >> 8 = 195605
50075099 >> 8 = 195605

# dir2 handled by thread2
65081175 >> 8 = 254223 # this separate block of 256 entries shares a different key
65081176 >> 8 = 254223
65081177 >> 8 = 254223
65081178 >> 8 = 254223
65081179 >> 8 = 254223
```

I tested this idea using my benchmark:

```rust
fn shard_for_inode(inode: u64) -> usize {
    ((inode >> 8) % SHARD_COUNT as u64) as usize
}

// avg=$(sum=0; for i in {1..15}; do c=$(./target/release/dumac temp/deep | grep -c "shard is locked"); sum=$((sum + c)); done; echo "scale=2; $sum / 15" | bc); echo "Average: $avg"
// Average: 4.66
```

The average number of lock collisions dropped dramatically, from `176.66` to `4.66`. I was surprised at this result. I didn't expect the decrease to be so great.

I also tested with some hash functions that avalanche sequential integers, but the results were comparable to my original idea with just the modulo.

So, what is so special about `inode >> 8`?

[APFS](https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf) hands out IDs sequentially, so the bottom bits change fastest. With `inode % 128`, a thread scanning a directory tree in creation order spreads that directory's inodes across every shard, so concurrent threads keep colliding on the same shards.

Shifting by `8` groups `256` consecutive inodes into the same shard. This improves temporal locality and cuts down cross-thread lock contention in our multithreaded crawl, even though the distribution across shards stays roughly flat over the whole disk.

The ideal number of inodes to group together depends on how many inodes were created at the same time per directory. Since that's not possible to work out ahead of time, I'll stick with `256` (my gut says that most directories have fewer than `256` files) and keep shifting by `8` bits.

As a result, the reduced lock contention improved the performance of the large benchmark by ~5%.

I just pushed up these performance improvements to `dumac`'s [repository](https://github.com/healeycodes/dumac). Let me know if you have any technical feedback!