Add BENCHMARKING.md file and more comments

kov · kov · commit 21e7a0748d1f · 2024-11-02T11:55:29.000-03:00
Add links to the original paper, and an explanation of the overall
design to the implementation file.
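
For readers new to the algorithm, here is a minimal sketch of the idea the new
engine.rs comment describes: keep only the furthest point reached on each
diagonal instead of a full m * n matrix. It is a forward-only search that
returns just the edit distance, not the double-ended search with heuristics
used by the engine. The function name `shortest_edit_distance` is hypothetical
and does not exist in src/engine.rs.

```rust
/// Illustrative only: number of insertions plus deletions in the shortest
/// edit script that turns `a` into `b`, using Myers' greedy forward search.
/// Hypothetical helper; not taken from src/engine.rs.
fn shortest_edit_distance<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let n = a.len() as isize;
    let m = b.len() as isize;
    let max = n + m;
    // v[k + offset] is the furthest x reached so far on diagonal k = x - y.
    // Roughly 2 * (m + n) slots, plus padding so k - 1 and k + 1 stay in bounds.
    let offset = max + 1;
    let mut v = vec![0isize; (2 * max + 3) as usize];

    for d in 0..=max {
        // With exactly d edits, only diagonals -d, -d + 2, ..., d are reachable.
        let mut k = -d;
        while k <= d {
            let idx = (k + offset) as usize;
            let mut x = if k == -d || (k != d && v[idx - 1] < v[idx + 1]) {
                v[idx + 1] // step down from diagonal k + 1: insert an item of b
            } else {
                v[idx - 1] + 1 // step right from diagonal k - 1: delete an item of a
            };
            let mut y = x - k;
            // Follow the run of equal items (the 'snake') at no extra cost.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[idx] = x;
            if x >= n && y >= m {
                return d as usize;
            }
            k += 2;
        }
    }
    unreachable!("the end point is always reached with at most n + m edits")
}
```

The outer loop is what makes very different inputs expensive: the number of
diagonals visited grows with every additional edit, which is the cost the
heuristics in the engine are meant to bound.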
diff --git a/BENCHMARKING.md b/BENCHMARKING.md
@@ -0,0 +1,63 @@
+# Benchmarking diff
+
+The engine used by our diff tool tries to balance execution time with patch
+quality. It implements the Myers algorithm with a few heuristics which are also
+used by GNU diff to avoid pathological cases.
+
+The original paper can be found here:
+- https://link.springer.com/article/10.1007/BF01840446
+
+Currently, not all tricks used by GNU diff are adopted by our implementation.
+For instance, GNU diff will isolate lines that exist in only one of the files
+and exclude them from the diffing process. It also post-processes the edits to
+produce more cohesive hunks. Combined, these two techniques should make it
+produce better patches for large files which are very different.
+
+Run `cargo build --release` before benchmarking after you make a change!
+
+## How to benchmark
+
+It is recommended that you use the 'hyperfine' tool to run your benchmarks. This
+is an example of how to run a comparison with GNU diff:
+
+```
+> hyperfine -N -i --warmup 2 --output=pipe 'diff t/huge t/huge.3'
+'./target/release/diffutils diff t/huge t/huge.3'
+Benchmark 1: diff t/huge t/huge.3
+  Time (mean ± σ):     136.3 ms ±   3.0 ms    [User: 88.5 ms, System: 17.9 ms]
+  Range (min … max):   131.8 ms … 144.4 ms    21 runs
+
+  Warning: Ignoring non-zero exit code.
+
+Benchmark 2: ./target/release/diffutils diff t/huge t/huge.3
+  Time (mean ± σ):      74.4 ms ±   1.0 ms    [User: 47.6 ms, System: 24.9 ms]
+  Range (min … max):    72.9 ms …  77.1 ms    41 runs
+
+  Warning: Ignoring non-zero exit code.
+
+Summary
+  ./target/release/diffutils diff t/huge t/huge.3 ran
+    1.83 ± 0.05 times faster than diff t/huge t/huge.3
+>
+```
+
+As you can see, you should provide both commands you want to compare in a single
+invocation of 'hyperfine', each as a single argument, so use quotes. These are
+the relevant parameters:
+
+- -N: avoids using a shell as intermediary to run the command
+- -i: ignores non-zero exit code, which diff uses to mean files differ
+- --warmup 2: 2 runs before measuring, warms up I/O cache for large files
+- --output=pipe: disable any potential optimizations based on output destination
+
+## Inputs
+
+Performance will vary based on several factors, the main ones being:
+
+- how large the files being compared are
+- how different the files being compared are
+- how large and far between sequences of equal lines are
+
+When looking at performance improvements, it is important to test both small and
+large (tens of MBs) files that have few differences, many differences, or are
+completely different, to cover all of the potential pathological cases.
diff --git a/src/engine.rs b/src/engine.rs
@@ -3,6 +3,44 @@
 // For the full copyright and license information, please view the LICENSE-*
 // files that was distributed with this source code.
 
+// This engine implements the Myers diff algorithm, which uses a double-ended
+// diagonal search to identify the longest common subsequence (LCS) between two
+// collections. The original paper can be found here:
+//
+// https://link.springer.com/article/10.1007/BF01840446
+//
+// Unlike a naive LCS implementation, which covers all possible combinations,
+// the Myers algorithm gradually expands the search space, and only encodes
+// the furthest progress made by each diagonal rather than storing each step
+// of the search on a matrix.
+//
+// This makes it a lot more memory-efficient, as it only needs 2 * (m + n)
+// positions to represent the state of the search, where m and n are the number
+// of items in the collections being compared, whereas the naive LCS requires
+// m * n positions.
+//
+// The downside is that it is more compute-intensive than the naive method when
+// searching through very different files. This may lead to unacceptable run
+// time in pathological cases (large, completely different files), so heuristics
+// are often used to bail on the search if it gets too costly and/or a good enough
+// subsequence has been found.
+//
+// We implement 3 main heuristics that are also used by GNU diff:
+//
+// 1. if we found a large enough common subsequence (also known as a 'snake')
+// and have searched for a while, we return that one
+//
+// 2. if we have searched for a significant chunk of the collections (with a
+// minimum of 4096 iterations, so we cover easy cases fully) and have not found
+// one, we use whatever we have, even if it is a small snake or no snake at all
+//
+// 3. we keep track of the overall cost of the various searches that are done
+// over the course of the divide and conquer strategy, and if that becomes too
+// large we give up on trying to find long similarities altogether
+//
+// This last heuristic could be improved significantly in the future if we
+// implement an optimization that separates items that appear in only one
+// collection and removes them from the diffing process, like GNU diff does.
 use std::fmt::Debug;
 use std::ops::{Index, IndexMut, RangeInclusive};