Commit 74d2fad
diff: use custom Myers-based engine to improve performance

Until now we have used the engine provided by the 'diff' crate to do
the actual diffing. It implements a Longest Common Subsequence algorithm
that explores all potential offsets between the two collections.
This produces high-quality diffs, but requires a huge amount of memory
for the left x right lines matrix, which makes it unable to process
big files (~36MB):

> ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar
memory allocation of 2202701222500 bytes failed
fish: Job 1, './target/release/diffutils diff…' terminated by signal SIGABRT (Abort)

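To illustrate where the quadratic memory comes from, here is a minimal sketch of the textbook full-table LCS computation. It is not the 'diff' crate's actual code; the function name `lcs_len_full_table` is hypothetical, and the sketch only returns the LCS length rather than a diff.

```rust
/// Length of the longest common subsequence of `left` and `right`,
/// computed with the textbook dynamic program over a full table.
/// Illustrative only: this is an assumption about the general technique,
/// not the code of the 'diff' crate.
fn lcs_len_full_table<T: PartialEq>(left: &[T], right: &[T]) -> usize {
    let (m, n) = (left.len(), right.len());
    let width = n + 1;
    // One cell per (left prefix, right prefix) pair: (m + 1) * (n + 1) cells.
    // This single allocation is what grows quadratically with the input.
    let mut table = vec![0usize; (m + 1) * width];
    for i in 1..=m {
        for j in 1..=n {
            table[i * width + j] = if left[i - 1] == right[j - 1] {
                // Matching lines extend the best result of both shorter prefixes.
                table[(i - 1) * width + (j - 1)] + 1
            } else {
                // Otherwise keep the better of dropping one line from either side.
                table[(i - 1) * width + j].max(table[i * width + (j - 1)])
            };
        }
    }
    table[m * width + n]
}
```

With two inputs of around a million lines each, the table alone needs on the order of 10^12 cells, which is consistent with the scale of the failed allocation above.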
The author of the 'diff' crate has begun an implementation of the Myers
algorithm, which will be offered as an alternative to the full LCS one,
but has not made any progress on merging it for months, and has not
been responsive.
It probably makes sense for us to have our own engine in any case,
so that we can evolve it along with the tool and make any adjustments
or apply any heuristics we decide could be helpful for matching GNU
diff's behavior or performance.

The Myers algorithm is a more efficient way of computing the LCS, as it
only uses a couple of vectors with 2 * (m + n) positions, rather than
the m * n positions used by the full LCS matrix, where m and n are the
number of lines of each file.
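
As a rough illustration of why the memory footprint is linear, below is a minimal sketch of the basic greedy variant of the Myers algorithm. It is not the engine added by this commit: the name `myers_edit_distance` is hypothetical, it uses a single vector instead of a couple, and it only returns the length of the shortest edit script (insertions plus deletions) instead of producing a diff.

```rust
/// Number of insertions plus deletions in a shortest edit script turning
/// `a` into `b`, using the basic greedy Myers search.
/// Sketch under stated assumptions; not the engine from this commit.
fn myers_edit_distance<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let (n, m) = (a.len() as isize, b.len() as isize);
    let max = n + m;
    if max == 0 {
        return 0;
    }
    // v[k + offset] holds the furthest x reached so far on diagonal k = x - y.
    // Only 2 * (m + n) + 1 entries are needed, never an m * n table.
    let offset = max;
    let mut v = vec![0isize; (2 * max + 1) as usize];
    for d in 0..=max {
        let mut k = -d;
        while k <= d {
            let idx = (k + offset) as usize;
            // Step down from diagonal k + 1 (insertion) or right from
            // diagonal k - 1 (deletion), whichever reaches further.
            let mut x = if k == -d || (k != d && v[idx - 1] < v[idx + 1]) {
                v[idx + 1]
            } else {
                v[idx - 1] + 1
            };
            let mut y = x - k;
            // Follow the run of equal elements along the diagonal for free.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[idx] = x;
            if x >= n && y >= m {
                return d as usize;
            }
            k += 2;
        }
    }
    unreachable!("an edit script of length n + m always exists")
}
```

The outer loop runs once per edit, so nearly identical inputs finish after a handful of iterations, while very different inputs approach (m + n) iterations over increasingly many diagonals.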

With this new engine we outperform GNU diff significantly when comparing
those two big files, which are largely equal with changes at the top and
bottom, while producing the exact same diff and using almost exactly the
same amount of memory at the peak:

Benchmark 1: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar
  Time (mean ± σ):     105.0 ms ±   2.5 ms    [User: 62.2 ms, System: 41.6 ms]
  Range (min … max):   101.7 ms … 111.3 ms    28 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: diff test-data/huge-base test-data/huge-very-similar
  Time (mean ± σ):      1.119 s ±  0.003 s    [User: 1.068 s, System: 0.044 s]
  Range (min … max):    1.115 s …  1.126 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar ran
   10.66 ± 0.26 times faster than diff test-data/huge-base test-data/huge-very-similar

It is not all rosy, however. Without heuristics we suffer on files
which are very different, especially if they are large, but even if
they are small. Diffing two ~36MB and completely different files may
take tens of minutes, but at least it works. This is where our ability
to add custom heuristics is helpful, though: we can avoid some of the
most pathological cases. Those come in the next couple of commits.

Benchmark 1: ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3
  Time (mean ± σ):       6.5 ms ±   0.3 ms    [User: 5.5 ms, System: 0.8 ms]
  Range (min … max):     6.1 ms …   8.0 ms    435 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: diff test-data/LGPL2.1 test-data/GPL3
  Time (mean ± σ):       1.5 ms ±   0.1 ms    [User: 1.1 ms, System: 0.3 ms]
  Range (min … max):     1.4 ms …   4.1 ms    1968 runs

  Warning: Ignoring non-zero exit code.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 3: ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3
  Time (mean ± σ):       2.1 ms ±   0.2 ms    [User: 1.2 ms, System: 0.8 ms]
  Range (min … max):     1.8 ms …   2.9 ms    1435 runs

  Warning: Ignoring non-zero exit code.

Summary
  diff test-data/LGPL2.1 test-data/GPL3 ran
    1.42 ± 0.17 times faster than ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3
    4.35 ± 0.43 times faster than ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3

It is worth pointing out as well that the reason GNU diff is outperformed
in that best-case scenario is that it does a lot more work to enable
other optimizations we do not implement, such as hashing each line and
separating out those that appear in only one of the files. That work
adds up on big files, but allows GNU diff to outperform us by a similar
factor when the files differ not just by rearranged lines, but by
having completely different lines.

1 parent 82cdac6 commit 74d2fad
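
The line-hashing preprocessing mentioned in the commit message can be sketched roughly as follows. This is a simplified illustration of the idea, not GNU diff's actual implementation and not part of this commit; the function name `lines_with_partner` is hypothetical. A hash map counts how often each line occurs on each side, so lines that never occur in the other file can be marked as changed up front and kept out of the expensive comparison.

```rust
use std::collections::HashMap;

/// For every line of `left` and `right`, report whether the same line
/// occurs anywhere in the other file. Lines flagged `false` can only be
/// pure deletions (left) or pure insertions (right).
/// Hypothetical helper sketching the idea; not GNU diff's code.
fn lines_with_partner<'a>(left: &[&'a str], right: &[&'a str]) -> (Vec<bool>, Vec<bool>) {
    // Hashing each line once gives (occurrences in left, occurrences in right).
    let mut counts: HashMap<&'a str, (usize, usize)> = HashMap::new();
    for &line in left {
        counts.entry(line).or_default().0 += 1;
    }
    for &line in right {
        counts.entry(line).or_default().1 += 1;
    }
    let left_flags: Vec<bool> = left.iter().map(|&l| counts[l].1 > 0).collect();
    let right_flags: Vec<bool> = right.iter().map(|&l| counts[l].0 > 0).collect();
    (left_flags, right_flags)
}
```

When two files share almost no lines, nearly every flag is `false` and very little is left for the diff engine to compare, which matches the case where GNU diff is described as coming out ahead.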
9 files changed: +581 -23 lines changed