Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions Lib/zipfile/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -270,8 +270,7 @@ def _EndRecData64(fpin, offset, endrec):
if diskno != 0 or disks > 1:
raise BadZipFile("zipfiles that span multiple disks are not supported")

# Assume no 'zip64 extensible data'
fpin.seek(offset - sizeEndCentDir64Locator - sizeEndCentDir64, 2)
fpin.seek(reloff, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think struct.unpack returns relative to the struct, and that will be at offset - sizeEndCentDir64Locator (where it was read from). SEEK_SET / 0 isn't right as reloff is a relative position from where the struct was read, not an absolute position in the file.

Copy link
Author

@VladRassokhin VladRassokhin Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the variable is named reloff and the spec states "relative offset of the zip64 end of central directory record" without specifying relative to what, in reality it's offset from the beginning of the file. See code which writes it

stringEndArchive64Locator, 0, pos2, 1)

Also, the same in the libzip code: https://github.com/nih-at/libzip/blob/d0ebf7fa268ae2e59e575cb3a72e6bc259e3fdd8/lib/zip_open.c#L853

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmaloney wdyt on renaming reloff, worth changing to e.g. eocd_offset?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely would be clearer to me, but not sure it's worth the extra noise in the diff though / additional lines changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offsets are complicated by data that may precede the start of the zip content which is why some of the tests are failing. reloff is the number of bytes from the start of the first local file header in the zip which may not be the actual start of the file. I can't think of a good way of directly computing the start of the zip64 end of record block if there's data preceding the start of the zip part of the file. Might have to do a bit of a search.

    # Assume no 'zip64 extensible data'
    fpin.seek(offset - sizeEndCentDir64Locator - sizeEndCentDir64, 2)
    data = fpin.read(sizeEndCentDir64)
    if len(data) != sizeEndCentDir64:
        return endrec

    sig, sz, create_version, read_version, disk_num, disk_dir, \
        dircount, dircount2, dirsize, diroffset = \
        struct.unpack(structEndArchive64, data)
    if sig != stringEndArchive64:
        loc_pos = fpin.tell()
        # Seek to the earliest possible eocd64 start
        fpin.seek(reloff, 0)
        # Read all the data between here and the eocd64 locator
        data = fpin.read(loc_pos - reloff)
        start = data.rfind(stringEndArchive64)
        if start >= 0 and len(data) - start > sizeEndCentDir64:
            sig, sz, create_version, read_version, disk_num, disk_dir, \
                dircount, dircount2, dirsize, diroffset = \
                struct.unpack(structEndArchive64, data[start:start+sizeEndCentDir64])
            if sig != stringEndArchive64:
                return endrec
        else:
            return endrec

My code needs improvement as data = fpin.read(loc_pos - reloff) might read a substantial amount of data if there's a big blob of data before the zip. It would also be a good idea to check that the sizes of the offsets are consistent with regards to:

  • the zip64 end locator is actually sz bytes from the position just after the sz field
  • The position of the stringEndArchive64 signature found using rfind is the same position as the stringEndArchive64 signature that is found directly after the central directory.

data = fpin.read(sizeEndCentDir64)
if len(data) != sizeEndCentDir64:
return endrec
Expand All @@ -281,6 +280,8 @@ def _EndRecData64(fpin, offset, endrec):
if sig != stringEndArchive64:
return endrec

size_zip64_tail_records = sz + 12 + sizeEndCentDir64Locator

# Update the original endrec using data from the ZIP64 record
endrec[_ECD_SIGNATURE] = sig
endrec[_ECD_DISK_NUMBER] = disk_num
Expand All @@ -289,6 +290,8 @@ def _EndRecData64(fpin, offset, endrec):
endrec[_ECD_ENTRIES_TOTAL] = dircount2
endrec[_ECD_SIZE] = dirsize
endrec[_ECD_OFFSET] = diroffset
# Adjust location for Zip64 extension structures
endrec[_ECD_LOCATION] -= size_zip64_tail_records
return endrec


Expand Down Expand Up @@ -1453,9 +1456,6 @@ def _RealGetContents(self):

# "concat" is zero, unless zip was concatenated to another file
concat = endrec[_ECD_LOCATION] - size_cd - offset_cd
if endrec[_ECD_SIGNATURE] == stringEndArchive64:
# If Zip64 extension structures are present, account for them
concat -= (sizeEndCentDir64 + sizeEndCentDir64Locator)

if self.debug > 2:
inferred = concat + offset_cd
Expand Down
Loading