Commit 661d75a

BUG: Snappy checksum check (#2252)
# Rationale for this change

The `SnappyCodec.decompress()` method has a bug where the CRC32 checksum is extracted from the compressed data **after** the data has already been truncated to remove the checksum. This results in reading the wrong 4 bytes for checksum validation, causing the CRC32 check to fail incorrectly.

**Root Cause:** In the current implementation:

1. `data = data[0:-4]` removes the last 4 bytes (checksum) from the data
2. `checksum = data[-4:]` then tries to get the checksum from the already-truncated data
3. This means `checksum` contains the wrong bytes (last 4 bytes of compressed data instead of the actual checksum)

**Solution:** Extract the checksum **before** truncating the data:

```python
checksum = data[-4:]  # store checksum before truncating data
data = data[0:-4]  # remove checksum from the data
```

This ensures data integrity checks work correctly for snappy-compressed Avro data.

# Are these changes tested?

The fix resolves the logical error in the checksum extraction order. Existing tests should pass, and any snappy-compressed data with valid checksums will now decompress successfully instead of failing with "Checksum failure" errors.

The change is minimal and only reorders two existing lines of code, making it low-risk.

# Are there any user-facing changes?

**Yes** - This is a bug fix that improves functionality:

- **Before:** Snappy-compressed Avro data would fail to decompress with "Checksum failure" errors even when the data and checksum were valid
- **After:** Snappy-compressed Avro data with valid checksums will decompress correctly

This fix resolves data integrity validation issues for users working with snappy-compressed Avro files. No API changes are introduced.
1 parent 58e5ad6 commit 661d75a

1 file changed: +2 -2 lines changed

pyiceberg/avro/codecs/snappy_codec.py

Lines changed: 2 additions & 2 deletions
@@ -51,9 +51,9 @@ def compress(data: bytes) -> tuple[bytes, int]:
     @staticmethod
     def decompress(data: bytes) -> bytes:
         # Compressed data includes a 4-byte CRC32 checksum
-        data = data[0:-4]
+        checksum = data[-4:]  # store checksum before truncating data
+        data = data[0:-4]  # remove checksum from the data
         uncompressed = snappy.decompress(data)
-        checksum = data[-4:]
         SnappyCodec._check_crc32(uncompressed, checksum)
         return uncompressed
 
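
The effect of the reordering can be reproduced without the `snappy` dependency. The sketch below is illustrative only and is not pyiceberg code: it swaps in `zlib` from the standard library as a stand-in compressor, and `frame`, `unframe_buggy`, `unframe_fixed`, and `_check_crc32` are hypothetical helpers. It assumes the block layout described in the comment above, compressed bytes followed by a 4-byte CRC32 of the uncompressed data, and assumes the checksum is stored big-endian.

```python
import struct
import zlib


def _check_crc32(uncompressed: bytes, checksum: bytes) -> None:
    # Hypothetical helper: compare the stored checksum with a fresh CRC32
    # of the uncompressed bytes (assumed big-endian, 4 bytes).
    expected = struct.pack(">I", zlib.crc32(uncompressed) & 0xFFFFFFFF)
    if checksum != expected:
        raise ValueError("Checksum failure")


def frame(payload: bytes) -> bytes:
    # Stand-in for compress(): zlib replaces snappy here (an assumption made
    # so the sketch runs with the standard library only).
    crc = struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)
    return zlib.compress(payload) + crc


def unframe_buggy(data: bytes) -> bytes:
    # Old order: truncate first, then read the "checksum" from what is left.
    data = data[0:-4]
    uncompressed = zlib.decompress(data)
    checksum = data[-4:]  # last 4 bytes of the compressed body, not the CRC32
    _check_crc32(uncompressed, checksum)
    return uncompressed


def unframe_fixed(data: bytes) -> bytes:
    # New order: read the trailing checksum before truncating.
    checksum = data[-4:]
    data = data[0:-4]
    uncompressed = zlib.decompress(data)
    _check_crc32(uncompressed, checksum)
    return uncompressed
```

With a valid frame, `unframe_fixed(frame(payload))` returns the payload, while `unframe_buggy(frame(payload))` is expected to raise `ValueError`, because the bytes it compares against are the tail of the compressed body.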
