
Commit bfe1900

Merge pull request #60 from healeycodes/post/dumac
Add dumac post
2 parents a9ad292 + 79ef598 commit bfe1900

File tree

6 files changed: +380 -0 lines changed

data/projects.ts

Lines changed: 6 additions & 0 deletions

@@ -32,6 +32,12 @@ export default [
       "A chess engine with alpha-beta pruning, piece-square tables, and move ordering.",
     to: "/building-my-own-chess-engine",
   },
+  {
+    name: "dumac",
+    link: "https://github.com/healeycodes/dumac",
+    desc: "Very fast alternative to 'du -sh' for macOS.",
+    to: "/maybe-the-fastest-disk-usage-program-on-macos",
+  },
   {
     name: "queuedle",
     link: "https://queuedle.com",
Lines changed: 374 additions & 0 deletions

@@ -0,0 +1,374 @@
---
title: "Maybe the Fastest Disk Usage Program on macOS"
date: "2025-07-31"
tags: ["rust"]
description: "A very fast du -sh clone for macOS."
---

I set out to write the fastest `du -sh` clone on macOS and I think I've done it. On a large benchmark, [dumac](https://github.com/healeycodes/dumac) is 6.4x faster than `du` and 2.58x faster than [diskus](https://github.com/sharkdp/diskus) with a warm disk cache.

The odds were certainly in my favor as diskus does not use macOS-specific syscalls and instead uses standard POSIX APIs. As I'll go on to explain, I used [tokio tasks](https://docs.rs/tokio/latest/tokio/task/) and [getattrlistbulk](https://man.freebsd.org/cgi/man.cgi?query=getattrlistbulk&sektion=2&manpath=macOS+13.6.5) to be faster than the current crop of `du -sh` clones that run on macOS.

## The Challenge

My benchmark is a directory with 12 levels, 100 small files per level, with a branching factor of two — 4095 directories, 409500 files.
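
The post doesn't include the script that generates this tree, so here's a minimal sketch of one way to build something with the same shape. The directory and file names, and the ~4KB file size (chosen to land near the 1.6G total shown below), are my assumptions.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical generator for a tree with the benchmark's shape:
// 12 levels, branching factor 2, 100 small files per directory.
// Names and the 4KB file size are assumptions.
fn build_tree(dir: &Path, depth: u32) -> std::io::Result<()> {
    fs::create_dir_all(dir)?;
    for i in 0..100 {
        let mut f = fs::File::create(dir.join(format!("file_{i}")))?;
        f.write_all(&[0u8; 4096])?; // ~4KB per file
    }
    if depth > 1 {
        for b in 0..2 {
            build_tree(&dir.join(format!("dir_{b}")), depth - 1)?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // 2^12 - 1 = 4095 directories, 4095 * 100 = 409,500 files
    build_tree(Path::new("./deep"), 12)
}
```
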
`du -sh` (disk usage, summarize, human-readable) works out the total size of a directory by traversing all the files and subdirectories. It must list every file and retrieve each file's size (in blocks) to sum the total.

On Unix-like systems, a directory listing (via `readdir` or the `fts` family of traversal functions) only provides filenames and inode numbers. It doesn't provide file sizes. So `du` needs to call `lstat` on every single file and hardlink it comes across.

In my benchmark this means making 4k+ syscalls for the directories and 400k+ syscalls for the files.

The traditional `du` (from GNU coreutils or BSD) is typically single-threaded and processes one file at a time. It doesn't use multiple CPU cores or overlap I/O operations, so the work is handled sequentially.
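
To make that concrete, here's a rough sketch of that sequential strategy in Rust (an illustration, not `du`'s actual source): list each directory, make one `lstat`-style metadata call per entry, and recurse.

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// A sketch of the sequential du-style strategy: one metadata syscall
// per entry, processed one at a time.
fn dir_size(dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let meta = path.symlink_metadata()?; // lstat(): doesn't follow symlinks
        if meta.is_dir() {
            total += dir_size(&path)?;
        }
        total += meta.blocks() * 512; // size on disk, in bytes (512-byte blocks)
    }
    Ok(total)
}
```

Every file still costs its own metadata syscall, and this sketch skips the hardlink deduplication that `du` performs.
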
On my Apple M1 Pro, the CPU is not saturated and the majority of the time is spent waiting on each sequential syscall.

```bash
time du -sh ./deep
1.6G ./deep
du -sh ./deep 0.04s user 1.08s system 43% cpu 2.570 total
```

The performance of disk usage programs depends on the filesystem and the underlying hardware. Benchmarks for these projects are usually done on Linux with a cold or warm disk cache. For the cold runs, the disk cache is cleared between each run.

I couldn't find a reliable way of clearing the disk cache on macOS. However, on macOS with modern Apple hardware, I found that the performance of disk usage programs with a warm disk cache strongly correlates with cold disk cache performance. So to make my life easier, warm disk cache results are the only thing I'm measuring and comparing.
## Concurrency

Previously, when I wrote about [beating the performance of grep with Go](https://healeycodes.com/beating-grep-with-go), I found that just adding goroutines was enough to outperform the stock `grep` that comes with macOS.

My first attempt to write a faster `du -sh` in Go did not go so well. I expected that my quick prototype, focused on the core problem of traversing and summing the block size, would be faster.
```go
var sem = make(chan struct{}, 16)

func handleDir(rootDir string, ch chan int64) {
	size := int64(0)

	// open()
	dir, err := os.Open(rootDir)
	if err != nil {
		panic(err)
	}
	defer dir.Close()

	// readdir()
	files, err := dir.Readdir(0)
	if err != nil {
		panic(err)
	}

	for _, file := range files {
		sem <- struct{}{}
		if file.IsDir() {
			childCh := make(chan int64)
			go handleDir(filepath.Join(rootDir, file.Name()), childCh)
			childSize := <-childCh
			size += childSize
		} else {
			// stat()
			size += file.Sys().(*syscall.Stat_t).Blocks * 512
		}
		<-sem
	}
	ch <- size
}
```
It was slower than `du -sh`. It took twice as long on my benchmark.

```text
./goroutines temp/deep 0.30s user 3.12s system 68% cpu 4.987 total
```
I ran a System Trace with macOS's Instruments to see what it was doing.

![A list of system calls made by ./goroutines.](syscalls.png)

The high number of `lstat64` calls is expected. That's fetching the attributes of each file. The `open` and `close` calls are also expected, roughly one per directory.

The number of `getdirentries64` calls is twice the number of directories in my benchmark. This is because it's designed to read directory entries into a buffer until it gets an empty result. This is the syscall that Go's `Readdir` uses under the hood on macOS.

The other syscalls here are related to the scheduling of goroutines and channels. I tried a few different designs (and different-sized semaphores on the I/O) but none of it affected performance that much.

I ran a CPU profile with pprof and saw that the majority of the time was spent doing the syscalls I saw above.

![A Go CPU profile.](go-profile.png)

My understanding at this point was that there is an inherent system resource cost to getting this information out, with some bandwidth/contention limitations, and some per-syscall Go overhead too.

I went looking for a more efficient method of getting this information out of the kernel without making a syscall for each file.
## getattrlistbulk

macOS has a syscall called [getattrlistbulk(2)](https://man.freebsd.org/cgi/man.cgi?query=getattrlistbulk&sektion=2&manpath=macOS+13.6.5) which allows you to read multiple directory entries and their metadata in one go. It's like a combined "readdir + stat" that returns a batch of file names along with requested attributes like file type, size, etc.

Instead of calling `stat` for every file, one `getattrlistbulk` call can return dozens or hundreds of entries at once. This means far fewer syscalls for my benchmark! With ~100 files per directory and a large buffer, each directory needs an `open`, a couple of bulk calls, and a `close`, so roughly 16k syscalls in total instead of the 400k+ `lstat` calls from before.

First you open a directory (one `open` call) and then make a `getattrlistbulk` call, which retrieves as many entries as fit in the passed buffer (128KB was optimal in my testing), along with the attributes you need to do the same work as `du -sh`: name, inode, type, and size. You loop and call it again until the directory is fully read (it returns `0` when done).

I found some background on this syscall in a [mailing list](https://lists.apple.com/archives/filesystem-dev/2014/Dec/msg00004.html):

> Also note that as of Yosemite, we have added a new API: getattrlistbulk(2), which is like getdirentriesattr(), but supported in VFS for all filesystems. getdirentriesattr() is now deprecated.
>
> The main advantage of the bulk call is that we can return results in most cases without having to create a vnode in-kernel, which saves on I/O: HFS+ on-disk layout is such that all of the directory entries in a given directory are clustered together and we can get multiple directory entries from the same cached on-disk blocks.
My first attempt at wiring up this syscall used CGO. I wrote a C function that took a directory file descriptor and called `getattrlistbulk` in a loop until it had all the file info, and then returned the list of files and their attributes to Go.

```c
// File info structure to track size and inode
typedef struct {
    long blocks;
    uint64_t inode;
} file_info_t;

// Directory information and subdirectory names
typedef struct {
    file_info_t *files;
    int file_count;
    char **subdirs;
    int subdir_count;
} dir_info_t;

// Get directory info (called from Go)
dir_info_t* get_dir_info(int dirfd) {
    struct attrlist attrList = {0};

    // Describe what we want back
    attrList.bitmapcount = ATTR_BIT_MAP_COUNT;
    attrList.commonattr = ATTR_CMN_RETURNED_ATTRS | ATTR_CMN_NAME | ATTR_CMN_ERROR | ATTR_CMN_OBJTYPE | ATTR_CMN_FILEID;
    attrList.fileattr = ATTR_FILE_ALLOCSIZE;

    // Set buffer size (affects number of calls required)
    char attrBuf[128 * 1024];

    int file_capacity = INITIAL_FILE_CAPACITY;
    file_info_t *files = (file_info_t *)malloc(file_capacity * sizeof(file_info_t));
    int file_count = 0;

    int subdir_capacity = INITIAL_SUBDIR_CAPACITY;
    char **subdirs = (char **)malloc(subdir_capacity * sizeof(char*));
    int subdir_count = 0;

    for (;;) {
        int retcount = getattrlistbulk(dirfd, &attrList, attrBuf, sizeof(attrBuf), 0);
        if (retcount == 0) {
            break;
        }

        char *entry = attrBuf;
        for (int i = 0; i < retcount; i++) {
            // .. parsing code
```
The Go side of this looks quite similar to my initial prototype. The C types are automatically generated on the Go side in my editor and during `go build`. To turn the C pointers into a Go slice or struct, you use `unsafe` functions.

You also then need to free the C memory by calling into CGO again.
```go
package main

// #cgo CFLAGS: -O3 -march=native -Wall -flto
// #include "lib.h"
import "C"

// ..

func handleDir(rootDir string) int64 {
	dir, err := os.Open(rootDir)
	// ..

	info := C.get_dir_info(C.int(dir.Fd()))

	// Free the C objects' memory
	defer C.free_dir_info(info)

	size := int64(0)

	// Process files
	if info.file_count > 0 {
		files := (*[1 << 30]C.file_info_t)(unsafe.Pointer(info.files))[:info.file_count:info.file_count]

		for _, file := range files {
			size += int64(file.blocks) * 512
		}
	}

	// Process subdirectories recursively
	if info.subdir_count > 0 {
		var wg sync.WaitGroup
		var totalSize int64
		subdirs := (*[1 << 30]*C.char)(unsafe.Pointer(info.subdirs))[:info.subdir_count:info.subdir_count]

		for i, subdir := range subdirs {
			wg.Add(1)
			go func(index int) {
				defer wg.Done()
				childSize := handleDir(filepath.Join(rootDir, C.GoString(subdir)))
				atomic.AddInt64(&totalSize, childSize)
			}(i)
		}
		wg.Wait()

		size += atomic.LoadInt64(&totalSize)
	}

	return size
}
```
The results were great. My Go program was now ~3x faster than `du -sh` for my benchmark.

```text
./cgo temp/deep 0.07s user 3.70s system 443% cpu 0.850 total
```

I ran another CPU profile with pprof and saw that the majority of the time was spent in CGO, running my fairly optimized C code.

![A Go CPU profile.](cgo-profile.png)

This was where I wanted to spend CPU time: setting up the `getattrlistbulk` call, making it, and parsing the result. I optimized the buffer size (the 128KB mentioned above) and also tuned some of the memory allocations, but then I hit a dead end. I couldn't get any faster without changing the overall design.

I had spent a few hours to get this far. I knew that using CGO was suboptimal here because of the [overhead of calling into C from Go](https://groups.google.com/g/golang-dev/c/XSkrp1_FdiU). I've seen some sources suggest that the cost is 40ns for trivial calls (which would only add ~0.16ms across my 4095 directories). Along with some extra allocations on the Go side that I was having trouble getting rid of, I suspected the real per-directory overhead to be higher.

I wanted to go all the way and see how fast I could push the performance here. So I ported my program to Rust.
## Rust

By using Rust, I avoid the overhead of crossing between runtimes. I'm making the same C function call underneath, `libc::getattrlistbulk`, but through a zero-cost binding. My program still calls into the kernel many times, but without bouncing back and forth between Go and C for each directory.

I also needed a new concurrency primitive to stand in for Go's goroutines. I picked [tokio tasks](https://docs.rs/tokio/latest/tokio/task/) as I understand them to be fairly analogous: lightweight, multiplexed green threads that are cheap to spawn.
```rust
// Calculate total size recursively
fn calculate_size(root_dir: String) -> Pin<Box<dyn Future<Output = Result<i64, String>> + Send>> {
    Box::pin(async move {
        // Get directory contents
        let dir_info = tokio::task::spawn_blocking({
            let root_dir = root_dir.clone();

            // Make the libc::getattrlistbulk call
            move || get_dir_info(&root_dir)
        }).await.map_err(|_| "task join error".to_string())??;

        let mut total_size = 0i64;

        // Process files in this directory, deduplicating by inode
        for file in &dir_info.files {
            total_size += check_and_add_inode(file.inode, file.blocks);
        }

        // Process subdirectories concurrently with limiting
        if !dir_info.subdirs.is_empty() {
            let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT));

            let futures: Vec<_> = dir_info.subdirs.into_iter()
                .map(|subdir| {
                    let semaphore = semaphore.clone();
                    let subdir_path = Path::new(&root_dir).join(&subdir).to_string_lossy().to_string();
                    tokio::spawn(async move {
                        let _permit = semaphore.acquire().await.unwrap();
                        calculate_size(subdir_path).await
                    })
                })
                .collect();

            // Collect all results
            for future in futures {
                match future.await {
                    Ok(Ok(size)) => total_size += size,
                    // ..
                }
            }
        }

        Ok(total_size)
    })
}
```
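
As an aside, a boxed recursive future like this needs to be driven from a tokio runtime. The post doesn't show dumac's `main`, but a hypothetical entry point might look like this (the argument handling and human-readable output formatting are my guesses):

```rust
// Hypothetical entry point; dumac's real main() and output formatting differ.
#[tokio::main]
async fn main() {
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    match calculate_size(root).await {
        Ok(bytes) => println!("{}", bytes), // real output would be human-readable, like du -sh
        Err(e) => eprintln!("error: {}", e),
    }
}
```
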
Similar to my original Go prototype, it's a recursive traversal. I experimented with some concurrency limiting to avoid thrashing system resources and the overhead of context switching and thread contention. It didn't yield much of a performance improvement, maybe 10% or so. I settled on a semaphore of 64.

`du -sh` deduplicates hardlinks, so while working concurrently, every time we sum the block size of a file we need to lock and check whether we've seen the inode before.

I used a sharded inode set for this to lower the lock contention overhead.
```rust
// Sharded inode tracking
const SHARD_COUNT: usize = 128;

// Global sharded inode set for hardlink deduplication
static SEEN_INODES: LazyLock<[Mutex<HashSet<u64>>; SHARD_COUNT]> = LazyLock::new(|| {
    std::array::from_fn(|_| Mutex::new(HashSet::new()))
});
```
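
The `check_and_add_inode` helper called in `calculate_size` isn't shown in the post. A minimal sketch of how it could work against this sharded set follows; the modulo sharding and the return-bytes-or-zero contract are my assumptions.

```rust
// Hypothetical sketch of the dedup helper used in calculate_size above.
// Returns the file's on-disk size in bytes the first time an inode is seen,
// and 0 for subsequent hardlinks to the same inode.
fn check_and_add_inode(inode: u64, blocks: i64) -> i64 {
    let shard = &SEEN_INODES[(inode % SHARD_COUNT as u64) as usize];
    let mut seen = shard.lock().unwrap();
    if seen.insert(inode) {
        blocks * 512
    } else {
        0
    }
}
```

With 128 shards, two tasks only contend on a lock when their inodes land in the same shard, which lines up with locking accounting for only ~1.5% of the flamegraph below.
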
In my final benchmark results, I've included diskus — a delightfully simple (and fast!) `du -sh` clone. It doesn't use macOS native APIs so it's a bit of an unfair comparison but it is the fastest `du -sh` clone I could find aside from my final Rust program, which I've called dumac.

```text
hyperfine --warmup 3 --min-runs 3 'du -sh temp/deep' 'diskus temp/deep' './goroutines temp/deep' './cgo temp/deep' './target/release/dumac temp/deep'
Benchmark 1: du -sh temp/deep
  Time (mean ± σ):      3.330 s ±  0.220 s    [User: 0.040 s, System: 1.339 s]
  Range (min … max):    3.115 s …  3.554 s    3 runs

Benchmark 2: diskus temp/deep
  Time (mean ± σ):      1.342 s ±  0.068 s    [User: 0.438 s, System: 7.728 s]
  Range (min … max):    1.272 s …  1.408 s    3 runs

Benchmark 3: ./goroutines temp/deep
  Time (mean ± σ):      6.810 s ±  0.010 s    [User: 0.290 s, System: 3.380 s]
  Range (min … max):    6.799 s …  6.816 s    3 runs

Benchmark 4: ./cgo temp/deep
  Time (mean ± σ):     564.6 ms ±  19.5 ms    [User: 51.1 ms, System: 2634.2 ms]
  Range (min … max):   542.6 ms … 591.0 ms    5 runs

Benchmark 5: ./target/release/dumac temp/deep
  Time (mean ± σ):     521.0 ms ±  24.1 ms    [User: 114.4 ms, System: 2424.5 ms]
  Range (min … max):   493.2 ms … 560.6 ms    6 runs

Summary
  ./target/release/dumac temp/deep ran
    1.08 ± 0.06 times faster than ./cgo temp/deep
    2.58 ± 0.18 times faster than diskus temp/deep
    6.39 ± 0.52 times faster than du -sh temp/deep
   13.07 ± 0.61 times faster than ./goroutines temp/deep
```
I was able to remove ~43ms by moving from Go/CGO to Rust.

If my program is a success, it's because I've tried to reduce the cost of all the work that isn't the unavoidable syscalls.

In the flamegraph below, syscalls account for ~91% of the time, locking for the inode deduplication is ~1.5%, and the remaining time is tokio's scheduling overhead.

![The result of cargo flamegraph](flamegraph.png)
## Further Reading

I have [a version](https://github.com/healeycodes/dumac/blob/main/previousattamps/gosyscall/main.go) where I called `getattrlistbulk` from within Go like:
```go
syscall.RawSyscall6(
	SYS_GETATTRLISTBULK,
	uintptr(fd),
	uintptr(unsafe.Pointer(&attrList)),
	uintptr(unsafe.Pointer(&attrBuf[0])),
	uintptr(len(attrBuf)),
	0, // options
	0, // unused
)
```
But it was actually slower than using CGO.

The [dumac repository](https://github.com/healeycodes/dumac) contains the source code for [my attempts](https://github.com/healeycodes/dumac/tree/main/previousattamps) as I iterated towards my final Rust program.

I was inspired by the work behind [dut](https://codeberg.org/201984/dut) ([Show HN: Dut – a fast Linux disk usage calculator](https://codeberg.org/201984/dut)), which is sadly Linux-only. I'm fairly confident that some of the ideas behind it could be ported to macOS native APIs and surpass the performance of my program, especially when it comes to reducing the overhead of tokio's scheduler.

There are also some fast disk usage programs that show useful, and in some cases interactive, terminal output, like [dust](https://github.com/bootandy/dust) and [dua](https://github.com/Byron/dua-cli).

A fairly up-to-date blog post on the performance of reading directories on macOS is [Performance considerations when reading directories on macOS](https://blog.tempel.org/2019/04/dir-read-performance.html).

0 commit comments
