Description
To avoid potential copyright issues and reduce the size of distributions, one may want to use delta files for mass distribution. A delta file is useless on its own, but it can be turned into a useful file when combined with its intended companion file on the target computer.
Say we have file B, derived from source file A. We match derivative file B against source file A and build a diff file that references matching data in A instead of storing that data again. We then distribute only the diff file, which is rebuilt into a fully functional file B against source file A, assuming A exists on the target computer.
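The round trip described above can be sketched in a few lines. This is only an illustration in Python, not the proposed tool: the delta format (a list of "copy" and "insert" operations) and the names make_delta and apply_delta are assumptions made for this example, and the matching is delegated to the standard library's difflib, which is far too slow for large binaries.

```python
from difflib import SequenceMatcher

def make_delta(source: bytes, target: bytes) -> list:
    """Describe target as copies from source plus literal inserts (hypothetical format)."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes already present in A: store only offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that exist only in B: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete": bytes that exist only in A, nothing to store.
    return ops

def apply_delta(source: bytes, ops: list) -> bytes:
    """Rebuild the target from the source file and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += source[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

file_a = b"The quick brown fox jumps over the lazy dog."
file_b = b"The quick red fox jumps over the very lazy dog."
delta = make_delta(file_a, file_b)
assert apply_delta(file_a, delta) == file_b
```

The delta carries literal bytes only for the parts of B that are missing from A, which is why it is useless without A.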
Algorithm
If we were to implement a custom tool for this, it would definitely need to be implemented in C++ for speed. It can make heavy use of threads to speed up binary matching by delegating sections of a file to different threads. It can then perform fuzzy binary matching in appropriate steps to try to find suitable matches between the files. The search can start with small patterns from one file looked up in the other, then expand left and right once a starter pattern is found. If several matches overlap in the same range, the biggest one wins. This algorithm can then be tweaked for performance.
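A minimal single-threaded sketch of the seed-and-extend idea follows, written in Python for brevity rather than the C++ proposed above. The seed length of 4 bytes, the function names, and the (source_offset, target_offset, length) tuple layout are assumptions made for this illustration. It performs exact matching only; fuzzy matching and threading are left out.

```python
from collections import defaultdict

SEED = 4  # hypothetical starter pattern length in bytes

def index_seeds(source: bytes, seed: int = SEED) -> dict:
    """Map every seed-sized chunk of the source to the offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(source) - seed + 1):
        table[source[i:i + seed]].append(i)
    return table

def extend(source: bytes, target: bytes, s: int, t: int, seed: int) -> tuple:
    """Grow a seed match at source[s:] / target[t:] to the left and to the right."""
    left = 0
    while (s - left > 0 and t - left > 0
           and source[s - left - 1] == target[t - left - 1]):
        left += 1
    right = seed
    while (s + right < len(source) and t + right < len(target)
           and source[s + right] == target[t + right]):
        right += 1
    return s - left, t - left, left + right

def find_matches(source: bytes, target: bytes, seed: int = SEED) -> list:
    """Return non-overlapping (source_offset, target_offset, length) matches."""
    table = index_seeds(source, seed)
    candidates = set()
    for t in range(len(target) - seed + 1):
        for s in table.get(target[t:t + seed], ()):
            candidates.add(extend(source, target, s, t, seed))
    # The biggest matches win: take the longest first and skip any
    # candidate that overlaps an already chosen range in the target.
    chosen = []
    for s, t, n in sorted(candidates, key=lambda m: -m[2]):
        if all(t + n <= ct or t >= ct + cn for _, ct, cn in chosen):
            chosen.append((s, t, n))
    return sorted(chosen, key=lambda m: m[1])

source = b"ABCDEFGHIJKLMNOP"
target = b"xxABCDEFGHyyKLMNOPzz"
print(find_matches(source, target))
```

In this sketch an overlapping shorter match is dropped entirely rather than trimmed to fit, which is one of the places the algorithm could be tweaked. Splitting the target scan into ranges would be the natural place to introduce threads.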
Asking ChatGPT for existing solutions
begin question
Does a tool exist that can binary match any two files A and B, so that if B is a revision of A, it builds a new smaller file C with sufficient information, to be able to recreate file B with files A and C?
end question
ChatGPT reply
begin reply
Yes, there are tools designed to perform this kind of binary diff and patch operation. These tools compare two files, A and B, and generate a smaller file C (often called a delta or patch) that contains the differences between A and B. Using A and the patch file C, one can reconstruct B. Here are some of the tools that perform this function:
bsdiff:
- Description: bsdiff is a binary diff tool that generates binary patches. It is efficient and particularly well-suited for large binaries.
- Reconstruction: You can use bspatch to apply the patch and recreate file B from A and C.
- Command Example:
bsdiff fileA fileB fileC
bspatch fileA fileC fileB_reconstructed
xdelta:
- Description: xdelta is another binary diff and patch tool that works well with binary files and is often used for compressing data by only storing changes between files.
- Reconstruction: The xdelta command can apply the delta to the original file.
- Command Example:
xdelta3 -e -s fileA fileB fileC
xdelta3 -d -s fileA fileC fileB_reconstructed
zsync:
- Description: zsync is primarily used for synchronizing files over a network and is very efficient in terms of bandwidth usage. It creates a small metadata file that can be used to update the original file to the new version.
- Reconstruction: Typically used in network file synchronization, but can also be used locally.
- Command Example:
zsyncmake fileB -o fileC
zsync fileC
Rsync:
- Description: While rsync is commonly used for remote synchronization, it can also be used locally to generate and apply binary diffs.
- Reconstruction: rsync uses a rolling checksum algorithm to efficiently transfer changes.
- Command Example:
rsync --only-write-batch=fileC fileA fileB
rsync --read-batch=fileC fileB_reconstructed
Conclusion
These tools are highly effective in scenarios where you need to minimize the storage or transmission size of file updates, such as in software distribution, backup systems, and version control systems. Each tool has its own strengths, so the choice of tool may depend on your specific requirements, such as the size of the files, the typical size of changes, and the environment in which you're working.
end reply