Skip to content

Conversation

@danifus
Copy link
Owner

@danifus danifus commented Jul 26, 2019

No description provided.

danifus and others added 29 commits July 26, 2019 19:02
Replace masking with integers directly with the new global variables.
Easier than writing out `flags | mask` each time.
** This commit changes the __init__ signature of ZipExtFile **

- ZipExtFile is now exclusively responsible for the following segments:
    [local file header]
    [encryption header]
    [file data]
    [data descriptor]
- It is responsible for initialising any decryptors too.
** This undoes the previous __init__ method change a few commits ago **
The code to select compressors and decompressors has been moved to
subclasses to allow subclasses to extend this process.

Also adds a method around _check_compression in ZipFile for a similar
purpose.
This allows these classes which are used inside ZipFile to be
overridden in ZipFile subclasses without having to duplicate and alter
any method which contains references to them.
** This changes the default content of the `extra` field in the local
header to be empty **

Previously, if a file was opened via a ZipInfo instance that had data in
the `extra` field, we may have erroneously left the previous values
there while appending any new or modified values after the existing
content.

This behaviour differs to that of writing the central header `extra`
field where we check that a zip64 entry is not already present and
remove it if it is present (via `_strip_extra`). All other extra fields
are copied across in this instance (which may not be correct either).
** Changes the behaviour of zip64 extra data handling as it now works
when a diskno field is present where there is only 1 or 2 other fields present **
- We now move an index over the extra fields rather than rewriting each
time an extra block was read.
- Methods that handle the extra data now just take the length and
payload bytes.
- This creates a hook for subclasses to add addtional integrity checks
after the file has been read.
This makes all writing of files (directories are handled differently)
contained within this class. The local file header often gets rewritten
when closing the file item to fix up compressed size and someother
things.

One of the tests needed a slight adjustment so `StoredTestsWithSourceFile`
would pass when testing broken files. This doesn't change the behaviour
of writing files. `StoredTestsWithSourceFile.test_writing_errors()`
would fail as OSError wasn't being raised in the
`_ZipWriteFile.close()` (in addition to where `stop == count` would
indicate OSError should have been raised).
Still not as fast as the module level decrypt approach prior to fixing
the seeking bug.

From some basic profiling, if we use a coroutine to encapsulate
`decrypt()`, we can get speeds slightly faster than the original
approach. It is a question of if we want that additional complexity.
To enable subclasses of the classes defined in the zipfile module to
alter the contents of the written zipfile, the methods responsible for
encoding the local file header, central directory and end of file
records have been refactored into the following pattern:

- A method collects the parameters to be encoded, a method encodes those
  parameters to a struct and a method that ties those two methods
  together.

The `get_*_params()` methods can be overridden to alter the params to be
written and implement new features defined in the zip spec.

The separate methods for encoding the structs (`_encode_*()`) also act
as a sanity check that all the required parameters have been supplied
and no unknown parameters are present.
A previous change in the zipfile refactor changeset defaulted the extra
data to be encoded in the local file header to be empty bytes.  This was
because different content may appear in the local file extra data
compared to the central directory extra data (different zip64 fields for
instance).

If opening a file from a ZipInfo instance, the extra data is initialised
with data read from the central directory.

On reflection, the zip64 difference is the only difference between the
two encodings I know of and we can account for that by stripping and
rewritting the zip64 content.

Prior to this changeset the zip64 section was not stripped in the local
file header which may have led to multiple zip64 sections appearing in
files written after being opened with a ZipInfo instance which had zip64
data in its extra data.
The signature of `open()` remains unchanged but _open_to_write() and
_open_to_read() can take kwargs now. This will enable subclasses to be
able to pass additional arguments to `open()`, to pass through to
`_open_to_write()` and `_open_to_read()` without having to duplicate the
contents of `open()`.
While we still raise an error if a password is supplied when trying to
write, this will help people subclass ZipFile and add encryption
functionality.
Small unification of how compress_size is counted when compression
method is ZIP_STORED.
This clean up fixes some short-comings identified when implementing the
AES code used to show the utility of this refactor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants