Commit 74d2fad

diff: use custom Myers-based engine to improve performance
Until now we have used the engine provided by the 'diff' crate to do
the actual diffing. It implements a Longest Common Subsequence
algorithm that explores all potential offsets between the two
collections.

This produces high quality diffs, but has the downside of requiring a
huge amount of memory for the left x right lines matrix, which makes
it unable to process big files (~36MB):

    > ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar
    memory allocation of 2202701222500 bytes failed
    fish: Job 1, './target/release/diffutils diff…' terminated by signal SIGABRT (Abort)

The author has begun an implementation of the Myers algorithm, which
will be offered as an alternative to the full LCS one, but has not
made any progress on merging it for months, and has not been
responsive.

It probably makes sense for us to have our own engine, in any case,
so that we can evolve it along with the tool and make any adjustments
or apply any heuristics we decide could be helpful for matching GNU
diff's behavior or performance.

The Myers algorithm is a more efficient implementation of LCS, as it
only uses a couple vectors with 2 * (m + n) positions, rather than
the m * n positions used by the full LCS matrix, where m and n are
the number of lines of each file.

With this new engine we outperform GNU diff significantly when
comparing those two big files that are largely equal, with changes at
the top and bottom while producing the exact same diff and using
almost exactly the same amount of memory at the peak:

    Benchmark 1: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar
      Time (mean ± σ):     105.0 ms ±   2.5 ms    [User: 62.2 ms, System: 41.6 ms]
      Range (min … max):   101.7 ms … 111.3 ms    28 runs

      Warning: Ignoring non-zero exit code.

    Benchmark 2: diff test-data/huge-base test-data/huge-very-similar
      Time (mean ± σ):      1.119 s ±  0.003 s    [User: 1.068 s, System: 0.044 s]
      Range (min … max):    1.115 s …  1.126 s    10 runs

      Warning: Ignoring non-zero exit code.

    Summary
      ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar ran
       10.66 ± 0.26 times faster than diff test-data/huge-base test-data/huge-very-similar

It's not all flowers, however. Without heuristics we suffer on files
which are very different, especially if they are large, but even if
they are small. Diffing two ~36MB and completely different files may
take tens of minutes - but it at least works.

This is where our ability to add custom heuristics is helpful, though
- we can avoid some of the most pathological cases. Those come on the
next couple commits.

    Benchmark 1: ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3
      Time (mean ± σ):       6.5 ms ±   0.3 ms    [User: 5.5 ms, System: 0.8 ms]
      Range (min … max):     6.1 ms …   8.0 ms    435 runs

      Warning: Ignoring non-zero exit code.

    Benchmark 2: diff test-data/LGPL2.1 test-data/GPL3
      Time (mean ± σ):       1.5 ms ±   0.1 ms    [User: 1.1 ms, System: 0.3 ms]
      Range (min … max):     1.4 ms …   4.1 ms    1968 runs

      Warning: Ignoring non-zero exit code.
      Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

    Benchmark 3: ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3
      Time (mean ± σ):       2.1 ms ±   0.2 ms    [User: 1.2 ms, System: 0.8 ms]
      Range (min … max):     1.8 ms …   2.9 ms    1435 runs

      Warning: Ignoring non-zero exit code.

    Summary
      diff test-data/LGPL2.1 test-data/GPL3 ran
        1.42 ± 0.17 times faster than ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3
        4.35 ± 0.43 times faster than ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3

It is worth pointing out as well that the reason GNU diff is
outperformed in that best case scenario is because it does a lot more
work to enable other optimizations we do not implement such as
hashing each line and separating out those that only appear on one of
the files.

That work adds up on big files, but allows GNU diff to outperform by
a similar factor when the files are not just different by rearranging
lines, but by having completely different lines.
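
For readers unfamiliar with the approach, here is a minimal sketch of the greedy forward pass of Myers' algorithm. It is not the engine added by this commit (that code is not shown on this page); the function name `shortest_edit_len` is made up for illustration, and it only computes the number of insertions plus deletions rather than the edits themselves. It does show why the working memory is a vector indexed by diagonal, with about 2 * (m + n) positions, instead of an m * n matrix.

```rust
/// Illustrative only: number of insertions + deletions needed to turn `a`
/// into `b`, found with the greedy forward pass of Myers' O(ND) algorithm.
/// The name and signature are hypothetical, not taken from the commit.
fn shortest_edit_len<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let n = a.len() as isize;
    let m = b.len() as isize;
    let max = n + m;
    if max == 0 {
        return 0;
    }
    // v[k + offset] stores the furthest x reached so far on diagonal
    // k = x - y. Diagonals range over -max..=max, hence 2 * max + 1 slots.
    let offset = max;
    let mut v = vec![0isize; (2 * max + 1) as usize];

    for d in 0..=max {
        // Only diagonals -d, -d + 2, ..., d are reachable with d edits.
        let mut k = -d;
        while k <= d {
            let idx = (k + offset) as usize;
            // Pick the neighbouring diagonal that got further.
            let mut x = if k == -d || (k != d && v[idx - 1] < v[idx + 1]) {
                v[idx + 1] // move down: take a line from `b` (insertion)
            } else {
                v[idx - 1] + 1 // move right: drop a line from `a` (deletion)
            };
            let mut y = x - k;
            // Follow the "snake": consume equal lines for free.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[idx] = x;
            if x >= n && y >= m {
                return d as usize;
            }
            k += 2;
        }
    }
    unreachable!("an edit script of length n + m always exists")
}
```

On the classic example from Myers' paper, `ABCABBA` against `CBABAC`, this returns 5. As a side note on the quadratic cost, the failed allocation quoted above, 2202701222500 bytes, is exactly 1484150 * 1484150, which fits a table with one dimension per input file.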
1 parent 82cdac6 commit 74d2fad

9 files changed

+581
-23
lines changed

Cargo.lock

Lines changed: 150 additions & 5 deletions

Cargo.toml

Lines changed: 8 additions & 2 deletions
@@ -16,11 +16,12 @@ path = "src/main.rs"
 
 [dependencies]
 chrono = "0.4.38"
-diff = "0.1.13"
 itoa = "1.0.11"
 regex = "1.10.4"
 same-file = "1.0.6"
 unicode-width = "0.2.0"
+tracing = "0.1.40"
+tracing-subscriber = { version = "0.3", features = ["env-filter"] }
 
 [dev-dependencies]
 pretty_assertions = "1.4.0"
@@ -42,6 +43,11 @@ ci = ["github"]
 # The installers to generate for each app
 installers = []
 # Target platforms to build apps for (Rust target-triple syntax)
-targets = ["aarch64-apple-darwin", "x86_64-apple-darwin", "x86_64-unknown-linux-gnu", "x86_64-pc-windows-msvc"]
+targets = [
+    "aarch64-apple-darwin",
+    "x86_64-apple-darwin",
+    "x86_64-unknown-linux-gnu",
+    "x86_64-pc-windows-msvc",
+]
 # Publish jobs to run in CI
 pr-run-mode = "plan"

src/context_diff.rs

Lines changed: 5 additions & 4 deletions
@@ -6,6 +6,7 @@
 use std::collections::VecDeque;
 use std::io::Write;
 
+use crate::engine::{self, Edit};
 use crate::params::Params;
 use crate::utils::do_write_line;
 use crate::utils::get_modification_time;
@@ -77,9 +78,9 @@ fn make_diff(
     // Rust only allows allocations to grow to isize::MAX, and this is bigger than that.
     let mut expected_lines_change_idx: usize = !0;
 
-    for result in diff::slice(&expected_lines, &actual_lines) {
+    for result in engine::diff(&expected_lines, &actual_lines) {
         match result {
-            diff::Result::Left(str) => {
+            Edit::Delete(str) => {
                 if lines_since_mismatch > context_size && lines_since_mismatch > 0 {
                     results.push(mismatch);
                     mismatch = Mismatch::new(
@@ -101,7 +102,7 @@ fn make_diff(
                 line_number_expected += 1;
                 lines_since_mismatch = 0;
             }
-            diff::Result::Right(str) => {
+            Edit::Insert(str) => {
                 if lines_since_mismatch > context_size && lines_since_mismatch > 0 {
                     results.push(mismatch);
                     mismatch = Mismatch::new(
@@ -132,7 +133,7 @@ fn make_diff(
                 line_number_actual += 1;
                 lines_since_mismatch = 0;
             }
-            diff::Result::Both(str, _) => {
+            Edit::Keep(str) => {
                 expected_lines_change_idx = !0;
                 // if one of them is missing a newline and the other isn't, then they don't actually match
                 if (line_number_actual > actual_lines_count)

src/ed_diff.rs

Lines changed: 5 additions & 4 deletions
@@ -5,6 +5,7 @@
 
 use std::io::Write;
 
+use crate::engine::{self, Edit};
 use crate::params::Params;
 use crate::utils::do_write_line;
 
@@ -71,21 +72,21 @@ fn make_diff(expected: &[u8], actual: &[u8], stop_early: bool) -> Result<Vec<Mis
         return Err(DiffError::MissingNL);
     }
 
-    for result in diff::slice(&expected_lines, &actual_lines) {
+    for result in engine::diff(&expected_lines, &actual_lines) {
         match result {
-            diff::Result::Left(str) => {
+            Edit::Delete(str) => {
                 if !mismatch.actual.is_empty() {
                     results.push(mismatch);
                     mismatch = Mismatch::new(line_number_expected, line_number_actual);
                 }
                 mismatch.expected.push(str.to_vec());
                 line_number_expected += 1;
             }
-            diff::Result::Right(str) => {
+            Edit::Insert(str) => {
                 mismatch.actual.push(str.to_vec());
                 line_number_actual += 1;
             }
-            diff::Result::Both(_str, _) => {
+            Edit::Keep(_str) => {
                 line_number_expected += 1;
                 line_number_actual += 1;
                 if !mismatch.actual.is_empty() || !mismatch.expected.is_empty() {
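
The hunks above swap `diff::Result::{Left, Right, Both}` for `Edit::{Delete, Insert, Keep}`, but the definition of `Edit` lives in a file not shown here. The sketch below assumes a plausible shape for it (a borrowing enum with one payload per variant) purely to illustrate how callers advance their line counters; the enum definition and the helper `advance_line_numbers` are hypothetical, not the commit's code.

```rust
/// Assumed shape of the engine's edit type; only the variant names
/// (`Delete`, `Insert`, `Keep`) come from the diff above.
#[derive(Debug, PartialEq)]
enum Edit<'a, T> {
    /// Line present only in the expected (left) input.
    Delete(&'a T),
    /// Line present only in the actual (right) input.
    Insert(&'a T),
    /// Line present in both inputs.
    Keep(&'a T),
}

/// Hypothetical helper mirroring the counter updates in `make_diff`:
/// a deletion advances the expected line number, an insertion advances
/// the actual one, and a kept line advances both.
fn advance_line_numbers<T>(edits: &[Edit<'_, T>], start: usize) -> (usize, usize) {
    let (mut expected, mut actual) = (start, start);
    for edit in edits {
        match edit {
            Edit::Delete(_) => expected += 1,
            Edit::Insert(_) => actual += 1,
            Edit::Keep(_) => {
                expected += 1;
                actual += 1;
            }
        }
    }
    (expected, actual)
}
```

One visible difference from the old API is that `Keep` carries a single value where `diff::Result::Both` carried two, which is why the call sites drop the ignored second binding.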
