 /*
  * Idea here is very simple.
  *
- * We have total of (sz-N+1) N-byte overlapping sequences in buf whose
- * size is sz. If the same N-byte sequence appears in both source and
- * destination, we say the byte that starts that sequence is shared
- * between them (i.e. copied from source to destination).
+ * Almost all data we are interested in are text, but sometimes we have
+ * to deal with binary data. So we cut them into chunks delimited by
+ * LF byte, or 64-byte sequence, whichever comes first, and hash them.
  *
- * For each possible N-byte sequence, if the source buffer has more
- * instances of it than the destination buffer, that means the
- * difference are the number of bytes not copied from source to
- * destination. If the counts are the same, everything was copied
- * from source to destination. If the destination has more,
- * everything was copied, and destination added more.
+ * For each of those chunks, if the source buffer has more instances
+ * of it than the destination buffer, that means the difference is
+ * the number of bytes not copied from source to destination. If the
+ * counts are the same, everything was copied from source to
+ * destination. If the destination has more, everything was copied,
+ * and destination added more.
  *
  * We are doing an approximation so we do not really have to waste
  * memory by actually storing the sequence. We just hash them into
  * somewhere around 2^16 hashbuckets and count the occurrences.
- *
- * The length of the sequence is arbitrarily set to 8 for now.
  */

 /* Wild guess at the initial hash size */
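
To make the new heuristic concrete, here is a minimal C sketch of the chunk-and-count idea the revised comment describes: cut the buffer into chunks that end at an LF byte or at 64 bytes, whichever comes first, hash each chunk into roughly 2^16 buckets, and compare per-bucket byte counts between source and destination. The HASHBASE value, the djb2-style hash, and the function names are illustrative stand-ins, not git's actual implementation.

#include <stddef.h>

#define HASHBASE 65537 /* a prime near 2^16, i.e. ~2^16 hashbuckets */

/*
 * Cut buf (sz bytes) into chunks that end at an LF byte or at 64
 * bytes, whichever comes first, and accumulate each chunk's byte
 * length in its hash bucket.  The djb2-style hash is a stand-in;
 * git's real hash differs.
 */
static void count_chunk_bytes(const unsigned char *buf, size_t sz,
                              size_t counts[HASHBASE])
{
	size_t i = 0;

	while (i < sz) {
		unsigned long hash = 5381;
		size_t len = 0;

		while (i < sz && len < 64) {
			unsigned char c = buf[i++];
			len++;
			hash = hash * 33 + c;
			if (c == '\n')
				break;
		}
		counts[hash % HASHBASE] += len;
	}
}

/*
 * Approximate the number of bytes in src that were not copied to
 * dst: per bucket, source bytes beyond what the destination also
 * has are counted as "not copied".
 */
static size_t bytes_not_copied(const size_t src[HASHBASE],
                               const size_t dst[HASHBASE])
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < HASHBASE; i++)
		if (src[i] > dst[i])
			total += src[i] - dst[i];
	return total;
}

A caller would zero two HASHBASE-sized arrays (about half a megabyte each, so calloc is sensible), run count_chunk_bytes over the source and destination buffers, and pass both arrays to bytes_not_copied. Hash collisions can only make two distinct chunks look shared, which is fine for an approximation.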