Commit 7fbbd66

Aaron Kushner authored and meta-codesync[bot] committed

Add --include-dir-stats flag to reduce --detailed-read-stats overhead
Summary: The --detailed-read-stats flag was causing a ~25% throughput regression. Initial investigation pointed to CPU cache pollution from HashMap/String operations in record_file(), but the actual root cause was automatic system process monitoring. When advanced_stats was enabled, the progress task was creating a System::new_all() and calling refresh_processes() on every progress update to track peak memory. This system monitoring was causing the significant overhead.

Changes:
1. Added --include-dir-stats CLI flag for optional per-directory statistics (these cause ~20% overhead due to HashMap operations)
2. Fixed: removed automatic system monitoring when --detailed-read-stats is used. Users who want memory/CPU tracking should use the --resource-usage flag.
3. Fixed syntax error: unclosed delimiter in listing_stats initialization

Actual measured performance:
- No flag: ~13,537 files/s (baseline)
- --detailed-read-stats: ~13,537 files/s (~0% overhead) ✓
- --detailed-read-stats --include-dir-stats: ~10,800 files/s (~20% overhead)
- --detailed-read-stats --resource-usage: includes memory tracking but with overhead

The key fix was changing line 1217 from:

    if resource_usage || counters.advanced_stats.is_some()

to:

    if resource_usage

This eliminates the automatic System::new_all() and refresh_processes() calls that were causing the performance regression.

Reviewed By: giorgidze

Differential Revision: D90066243

fbshipit-source-id: 04919cad4c1a2bff9548f589d7a6006e3ee3b9fe
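The key fix can be sketched as a pure predicate (function names are illustrative, not the actual progress-task code; the real guard lives at former line 1217):

```rust
// Sketch of the monitoring guard. Before the fix, enabling
// --detailed-read-stats (which sets advanced_stats) also turned on
// System::new_all()/refresh_processes() polling on every progress update;
// after the fix, only an explicit --resource-usage does.

/// Old behavior: monitor whenever resource usage OR advanced stats are on.
fn should_monitor_old(resource_usage: bool, advanced_stats_enabled: bool) -> bool {
    resource_usage || advanced_stats_enabled
}

/// New behavior: monitor only on an explicit --resource-usage.
fn should_monitor_new(resource_usage: bool, _advanced_stats_enabled: bool) -> bool {
    resource_usage
}

fn main() {
    // --detailed-read-stats alone no longer triggers process polling.
    assert!(should_monitor_old(false, true));
    assert!(!should_monitor_new(false, true));
    // --resource-usage still does.
    assert!(should_monitor_new(true, false));
    println!("guard logic ok");
}
```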
1 parent 3115c62 commit 7fbbd66

File tree: 3 files changed, +221 -86 lines changed

Lines changed: 100 additions & 0 deletions
# Benchmark Traversal Optimization: --include-dir-stats Flag

## Problem Statement

The `--detailed-read-stats` flag in the benchmark traversal command was causing a **~37% throughput regression** (16,902 → 10,713 files/s). Initial investigation suggested mutex lock contention as the culprit.

## Investigation Summary

### Initial Theory (Wrong)

The original hypothesis was that `Mutex<HashMap>` was causing lock contention. We attempted to replace it with `DashMap` for lock-free concurrent access.

**Result**: DashMap made things worse because file reading is **single-threaded** - there's no lock contention to eliminate.

### Root Cause: CPU Cache Pollution

Profiling revealed that `record_file()` only takes **~1.24 µs/file**, but the total overhead was **~19 µs/file**. The missing ~17 µs was caused by **CPU cache pollution**:
1. `record_file()` touches many memory locations:
   - String allocation (`to_string_lossy().to_string()`) - touches allocator metadata
   - HashMap access (~29,000 directories = ~4-8 MB of data)
   - Depth calculation - iterates path components

2. This memory access pattern **evicts file-I/O-related data from the CPU cache**:
   - Kernel page cache metadata
   - File descriptor tables
   - VFS inode cache
   - EdenFS FUSE buffers

3. When the NEXT file's `File::open()` runs, it suffers cache misses, taking ~19 µs longer.
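The per-file bookkeeping described above can be sketched as follows (struct and field names here are illustrative, not the actual traversal.rs code):

```rust
use std::collections::HashMap;
use std::path::Path;

#[derive(Default)]
struct DirStats {
    files: u64,
    bytes: u64,
}

/// Illustrative version of the expensive work: one String allocation per file
/// (touches allocator metadata), one HashMap probe into a multi-MB table
/// (cache pollution), and a depth walk over the path components.
fn record_file(dir_stats: &mut HashMap<String, DirStats>, path: &Path, size: u64) -> usize {
    let dir_key = path
        .parent()
        .map(|p| p.to_string_lossy().to_string()) // String allocation
        .unwrap_or_default();
    let entry = dir_stats.entry(dir_key).or_default(); // HashMap access
    entry.files += 1;
    entry.bytes += size;
    path.components().count() // depth calculation
}

fn main() {
    let mut stats = HashMap::new();
    let depth = record_file(&mut stats, Path::new("repo/src/lib.rs"), 1024);
    assert_eq!(depth, 3);
    assert_eq!(stats["repo/src"].files, 1);
    println!("depth = {depth}");
}
```

None of these steps is individually slow; the cost is the cache lines they evict between consecutive `File::open()` calls.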
### Evidence

| Metric | Without Flag | With Flag | Delta |
|--------|-------------|-----------|-------|
| Throughput | 13,522 files/s | 10,776 files/s | -20% |
| Time/file | 73.9 µs | 92.8 µs | +18.9 µs |
| open() latency | 22.3 µs | 41.5 µs | +19.2 µs |
| record_file() | 0 | 1.24 µs | +1.24 µs |
| **Unexplained** | - | - | **~17.7 µs** |

The key insight: `open()` latency nearly doubled, and this matches the unexplained overhead exactly.
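The residual can be checked arithmetically from the measurements above:

```rust
fn main() {
    // Values copied from the evidence table (all in µs).
    let time_per_file_delta = 92.8_f64 - 73.9; // +18.9 µs per file
    let record_file_cost = 1.24_f64;           // directly measured cost
    let unexplained = time_per_file_delta - record_file_cost; // ~17.7 µs
    let open_delta = 41.5_f64 - 22.3;          // +19.2 µs of open() latency

    assert!((unexplained - 17.66).abs() < 0.05);
    // The growth in open() latency accounts for the residual (and slightly more).
    assert!(open_delta >= unexplained);
    println!("unexplained = {unexplained:.2} µs, open() delta = {open_delta:.1} µs");
}
```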
## Solution: Optional --include-dir-stats Flag

Rather than eliminate the feature, we made the expensive parts optional:

### Changes Made

1. **Added `--include-dir-stats` CLI flag** (`cmd.rs`)
   - Users must explicitly opt in to the slow per-directory stats

2. **Added `collect_dir_stats` field to `AdvancedStats`** (`traversal.rs`)
   - Controls whether expensive operations run

3. **Made `record_file()` conditional** (`traversal.rs`)
   - Fast path (always runs): histogram + category stats (atomic operations, minimal cache impact)
   - Slow path (optional): dir_stats HashMap + depth calculation

4. **Updated `print_detailed_read_statistics()`** (`traversal.rs`)
   - Shows a message when dir_stats is disabled
   - Conditionally displays directory-related sections

5. **Removed profiling instrumentation**
   - Removed 6 profiling counter fields
   - Removed all timing code from `record_file()`
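The fast-path/slow-path split in change 3 can be sketched like this (a minimal sketch with assumed names; the real `AdvancedStats` internals differ):

```rust
use std::collections::HashMap;
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct AdvancedStats {
    // Fast path: a size histogram kept in atomics (cheap, cache-friendly).
    files_under_4k: AtomicU64,
    files_over_4k: AtomicU64,
    // Slow path: per-directory table, populated only when opted in.
    collect_dir_stats: bool,
    dir_stats: HashMap<String, u64>,
}

impl AdvancedStats {
    fn record_file(&mut self, path: &Path, size: u64) {
        // Fast path always runs: one atomic increment, minimal cache impact.
        if size < 4096 {
            self.files_under_4k.fetch_add(1, Ordering::Relaxed);
        } else {
            self.files_over_4k.fetch_add(1, Ordering::Relaxed);
        }
        // Slow path only behind --include-dir-stats: String + HashMap work.
        if self.collect_dir_stats {
            let key = path
                .parent()
                .map(|p| p.to_string_lossy().to_string())
                .unwrap_or_default();
            *self.dir_stats.entry(key).or_insert(0) += 1;
        }
    }
}

fn main() {
    let mut fast = AdvancedStats::default();
    fast.record_file(Path::new("a/b.txt"), 100);
    assert!(fast.dir_stats.is_empty()); // no per-directory work by default

    let mut full = AdvancedStats { collect_dir_stats: true, ..Default::default() };
    full.record_file(Path::new("a/b.txt"), 100);
    assert_eq!(full.dir_stats["a"], 1);
    println!("fast path only: {}", fast.dir_stats.is_empty());
}
```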
### Expected Performance

| Mode | Throughput | Overhead |
|------|------------|----------|
| No flag | ~13,500 files/s | 0% (baseline) |
| `--detailed-read-stats` | ~12,500+ files/s | ~8% |
| `--detailed-read-stats --include-dir-stats` | ~10,800 files/s | ~20% |

### Usage

```bash
# Fast detailed stats (histogram + category performance only)
edenfsctl debug bench traversal --dir=/path --detailed-read-stats

# Full detailed stats including per-directory breakdown (slower)
edenfsctl debug bench traversal --dir=/path --detailed-read-stats --include-dir-stats
```
## Key Learnings

1. **CPU cache effects can dominate performance** - even 1 µs of code can cause 17 µs of cache misses on subsequent operations
2. **Single-threaded code doesn't benefit from lock-free data structures** - DashMap adds overhead compared to an uncontended Mutex
3. **HashMap size matters for cache** - ~29,000 directory entries (~4-8 MB) pollute the L2/L3 cache
4. **String allocations touch allocator metadata** - even a simple `to_string()` can evict cached data
5. **Measure the RIGHT thing** - we were measuring `record_file()` but the impact was on `File::open()`
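The ~4-8 MB figure for the directory table can be sanity-checked with a back-of-the-envelope estimate (the per-entry byte counts are assumptions for illustration, not measurements):

```rust
fn main() {
    let entries: usize = 29_000;
    // Assumed per-entry costs: an average path String (heap bytes + 24-byte
    // header), a stats value struct, and HashMap slot overhead including
    // load-factor slack.
    let per_entry_low: usize = 100 + 24 + 32 + 16;  // ~172 bytes, lean case
    let per_entry_high: usize = 160 + 24 + 64 + 48; // ~296 bytes, padded case
    let low_mb = entries * per_entry_low / (1 << 20);
    let high_mb = entries * per_entry_high / (1 << 20);
    // Lands in the ~4-8 MB range quoted above - far larger than typical
    // L2 caches, so every probe risks an eviction.
    assert!(low_mb >= 4 && high_mb <= 8);
    println!("estimated footprint: {low_mb}-{high_mb} MB");
}
```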
## Files Modified

- `eden/fs/cli_rs/edenfs-commands/src/debug/bench/cmd.rs`
- `eden/fs/cli_rs/edenfs-commands/src/debug/bench/traversal.rs`

eden/fs/cli_rs/edenfs-commands/src/debug/bench/cmd.rs

Lines changed: 10 additions & 0 deletions

```diff
@@ -133,6 +133,14 @@ pub enum BenchCmd {
             long_help = "Enable detailed statistics about directory traversal including readdir() latency distribution, directory size analysis, scan rate variance, and slowest directories. Compatible with --skip-read for analyzing pure traversal performance."
         )]
         detailed_list_stats: bool,
+
+        /// Include per-directory statistics (causes ~20% throughput reduction)
+        #[clap(
+            long,
+            help = "Include per-directory stats (slower)",
+            long_help = "Enable per-directory I/O statistics tracking. This feature causes significant overhead (~20% throughput reduction) due to CPU cache effects from HashMap operations. Only effective when used with --detailed-read-stats."
+        )]
+        include_dir_stats: bool,
     },
 }

@@ -214,6 +222,7 @@ impl crate::Subcommand for BenchCmd {
                 skip_read,
                 detailed_read_stats,
                 detailed_list_stats,
+                include_dir_stats,
             } => {
                 // Validate flag compatibility
                 if *skip_read && *detailed_read_stats {
@@ -253,6 +262,7 @@ impl crate::Subcommand for BenchCmd {
                     thrift_io.as_deref(),
                     *detailed_read_stats,
                     *detailed_list_stats,
+                    *include_dir_stats,
                 )
                 .await?;
```
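Since `--include-dir-stats` is only effective alongside `--detailed-read-stats`, the flag-compatibility check referenced in the hunk above could be extended along these lines (a hypothetical helper for illustration, not the actual cmd.rs code, which may choose to warn rather than error):

```rust
/// Hypothetical flag-compatibility validation mirroring the
/// "Validate flag compatibility" block in the diff above.
fn validate_flags(
    skip_read: bool,
    detailed_read_stats: bool,
    include_dir_stats: bool,
) -> Result<(), String> {
    if skip_read && detailed_read_stats {
        return Err("--skip-read and --detailed-read-stats are incompatible".to_string());
    }
    if include_dir_stats && !detailed_read_stats {
        return Err("--include-dir-stats requires --detailed-read-stats".to_string());
    }
    Ok(())
}

fn main() {
    assert!(validate_flags(false, true, true).is_ok());
    assert!(validate_flags(false, false, true).is_err()); // orphaned flag
    assert!(validate_flags(true, true, false).is_err());  // incompatible pair
    println!("flag validation ok");
}
```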
