Commit 21e7a07

Add BENCHMARKING.md file and more comments
Add links to the original paper, and an explanation of the overall design to the implementation file.
1 parent 7485daf commit 21e7a07

2 files changed, +101 -0 lines changed

BENCHMARKING.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# Benchmarking diff

The engine used by our diff tool tries to balance execution time with patch
quality. It implements the Myers algorithm with a few heuristics which are also
used by GNU diff to avoid pathological cases.

The original paper can be found here:
- https://link.springer.com/article/10.1007/BF01840446

Currently, not all tricks used by GNU diff are adopted by our implementation.
For instance, GNU diff isolates lines that exist in only one of the files and
excludes them from the diffing process. It also post-processes the edits to
produce more cohesive hunks. Combined, these two techniques should let GNU diff
produce better patches for large files which are very different.
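
The idea of isolating lines that exist in only one file can be sketched roughly as follows. This is a simplified illustration, not GNU diff's actual logic (which is more involved) and not something our engine does today; the function name `partition_by_presence` is hypothetical.

```rust
use std::collections::HashSet;

/// Splits the indices of `lines` into two groups:
/// - `kept`: lines that also occur somewhere in `other`, which still need
///   to go through the diff engine
/// - `isolated`: lines that occur nowhere in `other`, which can only ever
///   be reported as insertions or deletions
///
/// NOTE: hypothetical helper for illustration only.
fn partition_by_presence(lines: &[&str], other: &[&str]) -> (Vec<usize>, Vec<usize>) {
    // Every distinct line of the other file, for O(1) membership checks.
    let known: HashSet<&str> = other.iter().copied().collect();

    let mut kept = Vec::new();
    let mut isolated = Vec::new();
    for (i, line) in lines.iter().enumerate() {
        if known.contains(*line) {
            kept.push(i);
        } else {
            isolated.push(i);
        }
    }
    (kept, isolated)
}
```

Running the search only over the `kept` indices of both files shrinks the problem the most in exactly the case that is costly for the Myers algorithm: large files with little in common.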

After every change, run `cargo build --release` before benchmarking!

## How to benchmark

It is recommended that you use the 'hyperfine' tool to run your benchmarks. This
is an example of how to run a comparison with GNU diff:

```
> hyperfine -N -i --warmup 2 --output=pipe 'diff t/huge t/huge.3' \
    './target/release/diffutils diff t/huge t/huge.3'
Benchmark 1: diff t/huge t/huge.3
  Time (mean ± σ):     136.3 ms ±   3.0 ms    [User: 88.5 ms, System: 17.9 ms]
  Range (min … max):   131.8 ms … 144.4 ms    21 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: ./target/release/diffutils diff t/huge t/huge.3
  Time (mean ± σ):      74.4 ms ±   1.0 ms    [User: 47.6 ms, System: 24.9 ms]
  Range (min … max):    72.9 ms …  77.1 ms    41 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./target/release/diffutils diff t/huge t/huge.3 ran
    1.83 ± 0.05 times faster than diff t/huge t/huge.3
>
```

As the example shows, pass both commands you want to compare to a single
invocation of 'hyperfine', each as a single quoted argument. These are the
relevant parameters:

- `-N`: runs the command directly, without a shell as intermediary
- `-i`: ignores non-zero exit codes, which diff uses to signal that files differ
- `--warmup 2`: performs 2 runs before measuring, which warms up the I/O cache
  for large files
- `--output=pipe`: disables any potential optimizations based on the output
  destination

## Inputs

Performance will vary based on several factors, the main ones being:

- how large the files being compared are
- how different the files being compared are
- how long the runs of equal lines are, and how far apart they are

When looking at performance improvements, it is important to test both small
and large (tens of MBs) files that have few differences, many differences, or
are completely different, so that all of the potential pathological cases are
covered.
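
One way to get such a range of inputs is to generate them. The following is a minimal sketch, assuming synthetic line-based files are good enough for the measurement; the names `Lcg`, `base_lines` and `mutate` are hypothetical and not part of this repository. It uses a tiny built-in pseudo-random generator so that no extra crate is needed and the output is reproducible.

```rust
/// Tiny deterministic pseudo-random generator (a linear congruential
/// generator), used only so the sketch needs no external crate.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The high bits of an LCG are better distributed than the low ones.
        self.0 >> 33
    }
}

/// Builds `count` distinct lines to serve as the original file.
fn base_lines(count: usize) -> Vec<String> {
    (0..count).map(|i| format!("line {}", i)).collect()
}

/// Returns a copy of `lines` in which roughly `percent`% of the lines are
/// replaced by text that does not occur in the original.
/// 0 yields an identical copy, 100 a completely different file.
fn mutate(lines: &[String], percent: u64, seed: u64) -> Vec<String> {
    let mut rng = Lcg(seed);
    lines
        .iter()
        .enumerate()
        .map(|(i, line)| {
            if rng.next_u64() % 100 < percent {
                format!("changed {}", i)
            } else {
                line.clone()
            }
        })
        .collect()
}
```

Each variant can be written out with `std::fs::write(path, lines.join("\n"))` and then passed to 'hyperfine'. Note that this sketch scatters changes uniformly, so it does not control the third factor above (how long the runs of equal lines are and how far apart they sit); covering that would need a generator that changes lines in blocks.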

src/engine.rs

Lines changed: 38 additions & 0 deletions
@@ -3,6 +3,44 @@
// For the full copyright and license information, please view the LICENSE-*
// files that was distributed with this source code.

// This engine implements the Myers diff algorithm, which uses a double-ended
// diagonal search to identify the longest common subsequence (LCS) between two
// collections. The original paper can be found here:
//
// https://link.springer.com/article/10.1007/BF01840446
//
// Unlike a naive LCS implementation, which covers all possible combinations,
// the Myers algorithm gradually expands the search space, and only records
// the furthest progress made on each diagonal rather than storing each step
// of the search in a matrix.
//
// This makes it a lot more memory-efficient, as it only needs 2 * (m + n)
// positions to represent the state of the search, where m and n are the number
// of items in the collections being compared, whereas the naive LCS requires
// m * n positions.
//
// The downside is that it is more compute-intensive than the naive method when
// searching through very different files. This may lead to unacceptable run
// times in pathological cases (large, completely different files), so
// heuristics are often used to bail out of the search if it gets too costly
// and/or a good enough subsequence has been found.
//
// We implement 3 main heuristics that are also used by GNU diff:
//
// 1. if we have found a large enough common subsequence (also known as a
//    'snake') and have searched for a while, we return that one
//
// 2. if we have searched a significant chunk of the collections (with a
//    minimum of 4096 iterations, so we cover easy cases fully) and have not
//    found one, we use whatever we have, even if it is a small snake or no
//    snake at all
//
// 3. we keep track of the overall cost of the various searches that are done
//    over the course of the divide and conquer strategy, and if that becomes
//    too large we give up on trying to find long similarities altogether
//
// This last heuristic could be improved significantly in the future if we
// implement an optimization that separates items that appear in only one of
// the collections and removes them from the diffing process, like GNU diff
// does.
use std::fmt::Debug;
use std::ops::{Index, IndexMut, RangeInclusive};
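
To make the "furthest progress per diagonal" idea from the comment above concrete, here is a minimal sketch of the basic greedy forward search described in the Myers paper. It is deliberately not the engine in this file: it searches from one end only, computes just the length of the shortest edit script, and has none of the heuristics. The function name `shortest_edit_distance` is hypothetical.

```rust
/// Returns the length of the shortest edit script (insertions plus
/// deletions) that turns `a` into `b`.
///
/// Illustration only: forward-only greedy search, no heuristics.
fn shortest_edit_distance<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let n = a.len() as isize;
    let m = b.len() as isize;
    let max = n + m;

    // v[k + offset] is the furthest x reached so far on diagonal k = x - y.
    // This is the whole search state: about 2 * (m + n) entries rather than
    // an m * n matrix.
    let offset = max + 1;
    let mut v = vec![0isize; (2 * max + 3) as usize];

    // d is the number of edits spent so far; the search space grows with it.
    for d in 0..=max {
        let mut k = -d;
        while k <= d {
            let idx = (k + offset) as usize;
            // Extend whichever neighbouring diagonal got further: step down
            // from diagonal k + 1 (insertion) or right from k - 1 (deletion).
            let mut x = if k == -d || (k != d && v[idx - 1] < v[idx + 1]) {
                v[idx + 1]
            } else {
                v[idx - 1] + 1
            };
            let mut y = x - k;
            // Follow the snake: equal items along the diagonal cost nothing.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[idx] = x;
            if x >= n && y >= m {
                return d as usize;
            }
            k += 2;
        }
    }
    // Not reached: d = n + m always suffices (delete all of a, insert all of b).
    max as usize
}
```

For two completely different inputs the outer loop runs all the way to `d = m + n`, and each round visits more diagonals than the last. That quadratic growth is the pathological case the heuristics described above are meant to cut short.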
