perf: record hash in benchmark log and ability to diff against it

wincent · wincent · commit dddab8a97fd4 · 2025-08-08T12:57:46.000+02:00
Say I try three things in sequence with these hashes:

- deadbeef
- feedface
- abcd0123

Normally, running the benchmark compares against the latest run only. But say I'm working on "abcd0123" and want to know the delta against "deadbeef"; now I can use `BASE=deadbeef bin/benchmarks/matcher.lua`.

If there is no such hash in the log, it reports an error. If there is a hash, but it's "dirty", it prints a warning that it's using a dirty hash (unless you explicitly asked for the dirty hash, eg. with `BASE=deadbeef-dirty`).

Note: the examples above are 8-character hashes. In reality, this system always stores 40-character hashes. If your `BASE` value is shorter, the code does a prefix match.

So, let's show this in action.

First of all, I went back looking at changes to `lua/wincent/commandt/private/benchmark.lua` to see how far back I could easily go without running into significant changes to the benchmarking script. The last meaningful change was in 70f5c20 ("refactor(lua): rationalize our external symbols", 2022-07-29), meaning we can go back around three years without having to manually edit the file. With a bit more work, I could do the edits anyway, but we can't go much further back than that without running into a major change which would be painful to deal with: e91f67c ("refactor(lua): extract wincent.commandt.private.benchmark", 2022-07-16).

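To make the `BASE` lookup order described above concrete, here is a toy model of it in shell. It is only a sketch: the real logic is `find_baseline()` in `benchmark.lua`, and the log contents, the 10-character hashes, and the `resolve_base` helper name are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical log: one hash per line, oldest first. Real entries are
# 40-character hashes; these are shortened so the prefix match is visible.
LOG='deadbeef00
feedface00-dirty
abcd012300'

# resolve_base BASE: print the matching log entry, or fail with an error.
resolve_base() {
  base=$1
  case $base in
    *-dirty) want_dirty=1; prefix=${base%-dirty} ;;
    *)       want_dirty=0; prefix=$base ;;
  esac
  match=''
  fallback=''
  # Later lines overwrite earlier ones, so the most recent entry wins.
  for entry in $LOG; do
    case $entry in
      "$prefix"*-dirty)
        if [ "$want_dirty" = 1 ]; then match=$entry; else fallback=$entry; fi ;;
      "$prefix"*)
        if [ "$want_dirty" = 0 ]; then match=$entry; fi ;;
    esac
  done
  if [ -n "$match" ]; then
    echo "$match"
  elif [ -n "$fallback" ]; then
    # Clean hash requested, but only a dirty run exists for it.
    echo "Warning: Using dirty version of hash $prefix" >&2
    echo "$fallback"
  else
    echo "Error: Could not find baseline hash \"$base\" in benchmark logs" >&2
    return 1
  fi
}

resolve_base dead                          # prints deadbeef00
resolve_base feed                          # warns on stderr, prints feedface00-dirty
resolve_base feed-dirty                    # prints feedface00-dirty, no warning
resolve_base 1234 || echo "lookup failed"  # error on stderr, then "lookup failed"
```

The point of the two-tier result (`match` versus `fallback`) is that a clean run is always preferred over a dirty one for the same hash, even if the dirty run is more recent.
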
With this info in mind, go back and find perf-related commits in the time interval:

    git oneline --grep perf --reverse --since 2022-07-29

Here are some of the possibly interesting ones to look at:

    a91c298 perf: speed up workers by processing chunks of consecutive haystacks (12 months ago)
    7e6f158 perf: try compiling with `-Ofast` instead of `-O3` (12 months ago)
    48b5ee5 perf: add more compiler flags (11 months ago)
    6033ca8 perf: avoid redundant merge (5 months ago)
    3cd40d7 perf: limit based on available height as opposed to configured height (6 weeks ago)
    4b98334 perf: use voodoo coding to eek out minor performance increase (6 weeks ago)
    906385e perf: use faster downcasing (6 weeks ago)
    19eae40 perf: avoid some pointer indirection (6 weeks ago)
    817208d perf: avoid more pointer indirection (6 weeks ago)
    9715dbc refactor: do case conversions consistently (6 weeks ago)
    bc8fe12 perf: avoid repeated case conversions (5 weeks ago)
    8338f98 perf: avoid repeated pointer traversals (5 weeks ago)
    8dfda23 perf: avoid some more pointer traversals (5 weeks ago)
    56954c5 perf: record hash in benchmark log and ability to diff against it (HEAD -> main) (42 minutes ago)

plus of course the oldest one that is accessible to us before the first "breaking" change to the benchmark, which we'll use as a baseline:

    75cc367 feat(lua): teach the help finder to open in vertical splits, tabs etc (3 years ago)

There might be other intermediate commits that impact perf, but if so, they don't contain the word "perf", so they don't show up in the list above and we'll be skipping them for the purposes of this exercise.

Anyway, we go back and run the benchmarks at these commits, starting with the oldest one and working forward. Note that I had to pass `-C` to `make` at the beginning because initially there was no suitable top-level `Makefile` for us to use.

    (git co 75cc367 &&
      make -C lua/wincent/commandt/lib clean &&
      make -C lua/wincent/commandt/lib &&
      git co main -- lua/wincent/commandt/private/benchmark.lua &&
      TIMES=5 bin/benchmarks/matcher.lua &&
      git reset HEAD &&
      git co .)

    (git co a91c298 &&
      make -C lua/wincent/commandt/lib clean &&
      make -C lua/wincent/commandt/lib &&
      git co main -- lua/wincent/commandt/private/benchmark.lua &&
      TIMES=5 bin/benchmarks/matcher.lua &&
      git reset HEAD &&
      git co .)

    (git co 4b98334 &&
      make clean &&
      make &&
      git co main -- lua/wincent/commandt/private/benchmark.lua &&
      TIMES=5 bin/benchmarks/matcher.lua &&
      git reset HEAD &&
      git co .)

    (git co 8dfda23 &&
      make clean &&
      make &&
      git co main -- lua/wincent/commandt/private/benchmark.lua &&
      TIMES=5 bin/benchmarks/matcher.lua &&
      git reset HEAD &&
      git co .)

With the historical benchmark data now seeded, we can try some different `BASE` values and observe how the performance delta grows as we go farther back in time.

First up, start with the current `HEAD` and no `BASE`. That is, we should see current performance, and no meaningful delta (ie. no `p` value). Indeed, we see exactly that when we run it twice in a row:

    TIMES=5 bin/benchmarks/matcher.lua # ie. no BASE

    Summary of cpu time and (wall time):
                          best     avg      sd      +/-    p    (best)     (avg)      (sd)      +/-    p
    pathological       0.20211 0.21416 0.03242  [-0.1%]      (0.20212) (0.21416) (0.03242)  [-0.1%]
    command-t          0.14975 0.15900 0.03848  [+0.4%]      (0.14975) (0.15900) (0.03849)  [+0.4%]
    chromium (subset)  1.17113 1.17970 0.01456  [-0.4%]      (0.26319) (0.26443) (0.00197)  [-1.0%]
    chromium (whole)   0.90269 0.90448 0.00444  [-2.3%]      (0.10272) (0.10495) (0.00279)  [-8.9%]
    big (400k)         1.35537 1.35903 0.00908  [-0.4%]      (0.14771) (0.15109) (0.00506)  [+0.4%]
    total              3.78496 3.81636 0.05603  [-0.8%]      (0.86743) (0.89363) (0.06559)  [-1.2%]

Now we compare against 2025-07-03 and again see no significant difference because there has been no perf-related work since then:

    BASE=8dfda23 TIMES=5 bin/benchmarks/matcher.lua

    Summary of cpu time and (wall time):
                          best     avg      sd      +/-    p    (best)     (avg)      (sd)      +/-    p
    pathological       0.20217 0.21286 0.02449  [-0.4%]      (0.20217) (0.21286) (0.02448)  [-0.4%]
    command-t          0.14968 0.15818 0.03143  [-1.3%]      (0.14968) (0.15818) (0.03144)  [-1.3%]
    chromium (subset)  1.18562 1.19020 0.01601  [+0.2%]      (0.26808) (0.27069) (0.00804)  [+1.6%]
    chromium (whole)   0.89829 0.90328 0.01043  [+0.3%]      (0.10158) (0.10262) (0.00231)  [-0.3%]
    big (400k)         1.35498 1.36317 0.01556  [+0.1%]      (0.14805) (0.14954) (0.00344)  [-1.6%]
    total              3.80532 3.82769 0.04708  [+0.1%]      (0.87382) (0.89390) (0.04786)  [-0.1%]

Now we go back to 2025-06-24 and see our first significant perf change:

    BASE=4b98334 TIMES=5 bin/benchmarks/matcher.lua

    Summary of cpu time and (wall time):
                          best     avg      sd      +/-    p    (best)     (avg)      (sd)      +/-    p
    pathological       0.20301 0.21354 0.02899  [-6.2%] 0.05 (0.20301) (0.21354) (0.02898)  [-6.2%] 0.05
    command-t          0.15010 0.15802 0.03212 [-10.7%] 0.05 (0.15010) (0.15802) (0.03212) [-10.7%] 0.05
    chromium (subset)  1.17620 1.18015 0.01067 [-11.0%] 0.05 (0.25964) (0.26566) (0.00807)  [-6.2%] 0.05
    chromium (whole)   0.90050 0.90376 0.00371 [-17.7%] 0.05 (0.10576) (0.10661) (0.00270) [-13.3%] 0.05
    big (400k)         1.35552 1.35942 0.00706 [-17.6%] 0.05 (0.14950) (0.15033) (0.00155) [-16.5%] 0.05
    total              3.79207 3.81489 0.06748 [-14.7%] 0.05 (0.86998) (0.89416) (0.06213)  [-9.6%] 0.05

Going back to 2024-08-13, we see a bigger change:

    BASE=a91c298 TIMES=5 bin/benchmarks/matcher.lua

    Summary of cpu time and (wall time):
                          best     avg      sd      +/-    p    (best)     (avg)      (sd)      +/-    p
    pathological       0.20270 0.21348 0.02811  [-9.7%] 0.05 (0.20270) (0.21348) (0.02810)  [-9.7%] 0.05
    command-t          0.15069 0.15926 0.03167  [-8.7%]      (0.15069) (0.15931) (0.03165)  [-8.6%]
    chromium (subset)  1.13067 1.16924 0.04352 [-17.2%] 0.05 (0.25553) (0.26593) (0.01224)  [-9.5%] 0.05
    chromium (whole)   0.89872 0.90251 0.00634 [-29.0%] 0.05 (0.10030) (0.10405) (0.00689) [-25.0%] 0.05
    big (400k)         1.35960 1.36199 0.00417 [-27.8%] 0.05 (0.15053) (0.15393) (0.00614) [-23.3%] 0.05
    total              3.75276 3.80648 0.08109 [-23.0%] 0.05 (0.87232) (0.89669) (0.06312) [-13.6%] 0.05

Finally, we go all the way back to 2022-07-29 and see an even bigger change, as expected:

    BASE=75cc367 TIMES=5 bin/benchmarks/matcher.lua

    Summary of cpu time and (wall time):
                          best     avg      sd      +/-    p    (best)     (avg)      (sd)      +/-    p
    pathological       0.20291 0.22017 0.05556  [-2.7%]      (0.20291) (0.22017) (0.05555)  [-2.7%]
    command-t          0.14949 0.15046 0.00143 [-20.9%] 0.05 (0.14949) (0.15046) (0.00143) [-20.9%] 0.05
    chromium (subset)  1.17581 1.17860 0.00856 [-32.5%] 0.05 (0.26507) (0.26827) (0.00483) [-19.7%] 0.05
    chromium (whole)   0.90010 0.90297 0.00423 [-61.9%] 0.05 (0.10188) (0.10636) (0.00576) [-52.5%] 0.05
    big (400k)         1.35858 1.36409 0.01234 [-65.3%] 0.05 (0.14769) (0.15255) (0.00860) [-59.0%] 0.05
    total              3.79522 3.81629 0.05523 [-49.0%] 0.05 (0.87816) (0.89781) (0.05850) [-26.3%] 0.05

Note that when reading these tables it's important to focus on the % changes, not the absolute values, because the absolute values in every table are just the current performance numbers, measured again on each run.

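Since the tables only print current timings plus a % change, it can help to turn a % change back into an implied baseline. The sketch below does that arithmetic with `awk`. It assumes the delta is computed as (current - baseline) / baseline on the average time, which this message does not confirm, and the `implied_baseline` helper is made up for illustration. The resulting numbers are back-of-the-envelope estimates, not measurements.

```shell
#!/bin/sh
# implied_baseline CURRENT PERCENT
# Assumes PERCENT = (current - baseline) / baseline * 100.
implied_baseline() {
  awk -v cur="$1" -v pct="$2" 'BEGIN {
    ratio = 1 + pct / 100 # current / baseline
    printf "baseline ~ %.2f, speed-up ~ %.2fx\n", cur / ratio, 1 / ratio
  }'
}

# "total" row against BASE=75cc367: cpu avg 3.81629 at -49.0%.
implied_baseline 3.81629 -49.0 # prints: baseline ~ 7.48, speed-up ~ 1.96x

# Same row, wall time: avg 0.89781 at -26.3%.
implied_baseline 0.89781 -26.3 # prints: baseline ~ 1.22, speed-up ~ 1.36x
```

Under that assumption, -49.0% CPU time reads as "roughly twice as fast as three years ago".
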
In summary, the time deltas that we see are:

- Baseline (2025-08-08): n/a
- From 2025-08-08 to 2022-07-29: -49.0% (CPU time) and -26.3% (wall time)
- From 2025-08-08 to 2024-08-13: -23.0% (CPU time) and -13.6% (wall time)
- From 2025-08-08 to 2025-06-24: -14.7% (CPU time) and -9.6% (wall time)
- From 2025-08-08 to 2025-07-03: +0.1% (CPU time) and -0.1% (wall time)

(But as noted above, that last row is noise only, with no `p` value significance.)

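The four seeding invocations shown earlier differ only in the commit and in whether `make` needs `-C`, so for anyone repeating the exercise they could be collapsed into a helper along these lines. This is a sketch, not what I actually ran: the `run` and `seed` helpers and the `DRY_RUN` switch are made up, and `git co` is kept as written in the original commands, so it depends on that alias existing locally.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# seed COMMIT [MAKE_ARGS...]
# Build the library at COMMIT, then benchmark it using main's benchmark.lua.
seed() {
  commit=$1
  shift # remaining arguments are extra `make` flags, eg. -C <dir>
  run git co "$commit" &&
    run make "$@" clean &&
    run make "$@" &&
    run git co main -- lua/wincent/commandt/private/benchmark.lua &&
    run env TIMES=5 bin/benchmarks/matcher.lua &&
    run git reset HEAD &&
    run git co .
}

# Oldest first, so the log is seeded in chronological order.
seed 75cc367 -C lua/wincent/commandt/lib
seed a91c298 -C lua/wincent/commandt/lib
seed 4b98334
seed 8dfda23
```

Defaulting to a dry run is deliberate: the real sequence checks out old commits and resets the work tree, so it is worth eyeballing the printed commands before running with `DRY_RUN=0`.
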
diff --git a/lua/wincent/commandt/private/benchmark.lua b/lua/wincent/commandt/private/benchmark.lua
@@ -11,6 +11,97 @@ local lib = require('wincent.commandt.private.lib')
 
 lib.epoch() -- Force eager loading of C library.
 
+local function get_git_hash()
+  local handle = io.popen('command git rev-parse HEAD 2>/dev/null')
+  if not handle then
+    return nil
+  end
+  local hash = handle:read('*line')
+  handle:close()
+
+  if not hash or hash == '' then
+    return nil
+  end
+
+  -- Check if work tree is dirty
+  handle = io.popen('command git status --porcelain 2>/dev/null')
+  if not handle then
+    return hash
+  end
+
+  local status = handle:read('*all')
+  handle:close()
+
+  if status and status ~= '' then
+    return hash .. '-dirty'
+  else
+    return hash
+  end
+end
+
+local function is_dirty(hash)
+  return string.sub(hash, -string.len('-dirty')) == '-dirty'
+end
+
+local function find_baseline(log, base_input)
+  if not base_input then
+    return nil
+  end
+
+  -- Parse BASE (eg. "deadbeef" or "deadbeef-dirty").
+  local base_hash = base_input
+  local dirty_base = false
+  if base_input:sub(-6) == '-dirty' then
+    base_hash = base_input:sub(1, -7)
+    dirty_base = true
+  end
+
+  -- Scan backwards through log entries.
+  for i = #log, 1, -1 do
+    local entry = log[i]
+    if entry.hash then
+      -- Look for exact match.
+      if entry.hash == base_input then
+        return entry
+      end
+
+      -- Look for prefix match.
+      local dirty_entry = is_dirty(entry.hash)
+      local entry_hash = dirty_entry and entry.hash:sub(1, -7) or entry.hash
+
+      if entry_hash:sub(1, #base_hash) == base_hash then
+        if dirty_base and dirty_entry then
+          -- User wants dirty match and we found one.
+          return entry
+        elseif not dirty_base and not dirty_entry then
+          -- User wants clean match and we found one.
+          return entry
+        end
+      end
+    end
+  end
+
+  -- Scan again looking for fallback dirty match.
+  if not dirty_base then
+    for i = #log, 1, -1 do
+      local entry = log[i]
+      if entry.hash and is_dirty(entry.hash) then
+        local entry_hash = entry.hash:sub(1, -7)
+        if entry_hash:sub(1, #base_hash) == base_hash then
+          print('Warning: Using dirty version of hash ' .. base_hash)
+          return entry
+        end
+      end
+    end
+  end
+
+  return nil
+end
+
+local function red(text)
+  return '\027[31m' .. text .. '\027[0m'
+end
+
 local reduce = function(list, initial, cb)
   local acc = initial
   for i, value in ipairs(list) do
@@ -240,8 +331,22 @@ local function benchmark(options)
   local ok, log = pcall(require, options.log)
   log = ok and log or {}
 
+  local base_hash = os.getenv('BASE')
+  local previous
+
+  if base_hash then
+    previous = find_baseline(log, base_hash)
+    if not previous then
+      print(red('Error: Could not find baseline hash "' .. base_hash .. '" in benchmark logs'))
+      os.exit(1)
+    end
+  else
+    previous = log[#log]
+  end
+
   local results = {
     when = os.date(),
+    hash = get_git_hash(),
     timings = {},
   }
 
@@ -306,8 +411,6 @@ local function benchmark(options)
     end
   end
 
-  local previous = log[#log]
-
   for label, metrics in pairs(results.timings) do
     metrics['cpu (best)'] = math.min(unpack(metrics.cpu))
     metrics['wall (best)'] = math.min(unpack(metrics.wall))