Skip to content

Deflate output not deterministic when the compressor is reused with deflateReset #459

@QrczakMK

Description

@QrczakMK

Deflate output may depend on previous contents of a Window.

Scenario 1: compress data A.

Scenario 2: compress data B, deflateReset, compress data A.

Sample data A ends with "\201\201\201\201\201\000\201\201\260\201\201\201\201\201\201\201\201\201" (C++ literal syntax; I can include the whole data here if needed, but reproduction program would require too much work to isolate).

The difference in compressed outputs for compression level 6 (levels 2..6 are affected, with slightly different outputs):

 ...
 inflate:         length 5
 inflate:         distance 9
 inflate:         length 4
-inflate:         distance 5
+inflate:         distance 13
 inflate:         end of block
 inflate:   check matches trailer

Both outputs are correct, but the output should be deterministic.


Scenarios 1 and 2 start diverging here:

let scan_val = u64::from_ne_bytes(
core::slice::from_raw_parts(scan_start, 8)
.try_into()
.unwrap(),
);

In scenario 1, scan_val == 0x0000000081818181, where the 0x81's are real bytes, and the 0x00's are - now I am guessing - garbage bytes from internal state outside the copy of the real data. They are not supposed to affect the output, because they will be verified later.

Trying a match at cur_match == 111 (distance 13), where match_val == 0x8181818181818181. This yields len == 4.

Looking for a better match. At cur_match == 103 (distance 5), match_val == 0xb081810081818181. This yields len == 5, because the 0x00 in the input data happens to match the meaningless 0x00 after the data! The match looks better. It is checked to exceed input data though:

/* Do not look for matches beyond the end of the input. */
if len > lookahead {
return (lookahead, match_start);
}

The search stops, because a yet better match is impossible. The truncated match is emitted, with length 4 at distance 5.

In scenario 2, at this point scan_val == 0x9898989881818181. This also contains 4 garbage bytes, but this time they come from an unrelated compression session.

The first match is as before.

Looking for a better match. This time the 0x00 does not match, and len == 4 again. The match is not better, so the best match so far is still cur_match == 111 (distance 13).

Looking for a better match further. At cur_match == 102 (distance 4), match_val == 0x8181008181818181. This is still not better.

There are no more matches to try. The first match at distance 13 is emitted.


Possible fix:

--- a/google3/third_party/rust/zlib_rs/v0_5/src/deflate/longest_match.rs
+++ b/google3/third_party/rust/zlib_rs/v0_5/src/deflate/longest_match.rs
@@ -254,8 +254,9 @@ fn longest_match_help<const SLOW: bool>(
         if len > best_len {
             match_start = cur_match - match_offset;
 
-            /* Do not look for matches beyond the end of the input. */
-            if len > lookahead {
+            // Do not look for better matches if the current match reaches or exceeds
+            // the end of the input.
+            if len >= lookahead {
                 return (lookahead, match_start);
             }
             best_len = len;

Another possible fix:

--- a/google3/third_party/rust/zlib_rs/v0_5/src/deflate/longest_match.rs
+++ b/google3/third_party/rust/zlib_rs/v0_5/src/deflate/longest_match.rs
@@ -254,9 +254,10 @@ fn longest_match_help<const SLOW: bool>(
         if len > best_len {
             match_start = cur_match - match_offset;
 
-            /* Do not look for matches beyond the end of the input. */
+            // Do not look for matches beyond the end of the input. This is necessary
+            // to make deflate deterministic.
             if len > lookahead {
-                return (lookahead, match_start);
+                len = lookahead;
             }
             best_len = len;
             if best_len >= nice_match as usize {

They yield different outputs in that case.

The first fix retains the opportunity for an early exit. The second fix matches the C implementation more closely: https://github.com/madler/zlib/blob/ecbaf031f81ddfcff200dcfd052df48c9047f3cf/deflate.c#L1415-L1418

I did not verify whether the scenario in C otherwise matches the scenario in Rust, but the logic can be easily seen to avoid this non-determinism.

Please decide which fix is better, or propose another one. I am currently verifying that they do not appear to break any our code.

Could you prepare a pull request please?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions