-
Notifications
You must be signed in to change notification settings - Fork 394
LCS Algorithms
Finding a minimal set of changes between two files is a well-researched problem, often described as finding the 'longest common subsequence' (LCS) or 'shortest edit script'.
See also discussion here: https://github.com/mitsuhiko/similar/issues/44#issuecomment-1412178765
Wu, Sun, Udi Manber, Eugene Wimberly Myers and Webb Miller. "An O(NP) Sequence Comparison Algorithm." Inf. Process. Lett. 35 (1990): 317-323.
https://doi.org/10.1016/0020-0190(90)90035-V
This is the algorithm used in difftastic, and is also used in Subversion's diff.
Myers, Eugene Wimberly. "An O(ND) difference algorithm and its variations." Algorithmica 1 (1986): 251-266.
https://doi.org/10.1007/BF01840446
This algorithm is widely implemented, including in the diff
crate. There's a great introduction here.
GNU diff has a bunch of heuristics to avoid pathological cases that can occur in a simple implementation. The imara-diff crate implements these in Rust.
Google's go-cmp
also has an interesting
variant
that optimises for performance over minimal diffs.
https://doi.org/10.1145/359581.359603
The original 1970s diff
tool used the Hunt-McIlroy algorithm
(sometimes called the Hunt–Szymanski algorithm).