- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 33.2k
gh-117779: Fix reading duplicated entries in zipfile by name #129254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-117779: Fix reading duplicated entries in zipfile by name #129254
Conversation
| I wonder how far should we backport this change. #110016 was a security fix and was backported to 3.8. | 
c683cf7    to
    b8a9cea      
    Compare
  
    | 
 I didn't expect it either. And I suspect the creation of those duplicate entries to be a bug, but they do clearly exist in the wild. 
 I'm slightly unsure about this fix though. For one, it breaks the workaround we added to  IMO it would be admittedly more complex but also more consistent -- and avoid any errors processing these ZIP files -- to explicitly handle entries with duplicate offsets: letting all of them have the same  (edited for clarity) | 
| Sorry for the delay in responding. I experimented with different solutions and wrote new tests. To me, this is a ZIP bomb. Even if it does not allow to fill the disk space by decompressing, it still spend CPU time for uncompressing the same data. It may consume a lot of memory if the decompressed entries are store in a list instead of a dict. But since ZIP archives with duplicated entries encountered in practice, created not by malicious, but just bad software, perhaps we could lower the severity level from error to warning. Warnings can be converted to exceptions, if you want to make your program safer. @gpshead, what is your opinion on this? | 
| I like the warning idea. | 
| A warning seems reasonable, yeah. But I dislike the "except the last one" edge case where it still breaks, surely that can be avoided? | 
| IMO a warning indeed makes perfect sense but I think entries with the same  | 
| Warnings or exceptions are emitted for some abnormal ZIP files from ancient times, but it is not possible to detect all anomalies, and even if it is possible, it is not always practical. For example, in a "quote-overlap" zip bomb, the most nested file can be read without error. It is not a large issue if one of entries with the same offset can be read without error or warning. For consistency with how non-overlapped files with the same name are handled, it should be the last entry. | 
| Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13. | 
…ythonGH-129254) (cherry picked from commit 0f04f24) Co-authored-by: Serhiy Storchaka <[email protected]>
…ythonGH-129254) (cherry picked from commit 0f04f24) Co-authored-by: Serhiy Storchaka <[email protected]>
| GH-132263 is a backport of this pull request to the 3.13 branch. | 
| GH-132264 is a backport of this pull request to the 3.12 branch. | 
…H-129254) (GH-132264) (cherry picked from commit 0f04f24) Co-authored-by: Serhiy Storchaka <[email protected]>
…H-129254) (GH-132263) (cherry picked from commit 0f04f24) Co-authored-by: Serhiy Storchaka <[email protected]>
| zipf.read('a') | ||
|  | ||
| @requires_zlib() | ||
| def test_full_overlap_different_names2(self): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have given this function a different name, something that distinguishes it from the non-2 version... or at the very least add a docstring explaining what it's asserting. Future readers are going to have a hard time discerning the motivations for these two tests based on null vs. 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are very similar, the only difference is what central directory name matches the local header name -- the first or the second. I hope it will be clear from the comments.
| data = ( | ||
| b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xa0lH\x05\xe2\x1e' | ||
| b'8\xbb\x10\x00\x00\x00\t\x04\x00\x00\x01\x00\x00\x00b\xed' | ||
| b'\xc0\x81\x08\x00\x00\x00\xc00\xd6\xfbK\\d\x0b`P' | ||
| b'K\x01\x02\x14\x00\x14\x00\x00\x00\x08\x00\xa0lH\x05\xe2' | ||
| b'\x1e8\xbb\x10\x00\x00\x00\t\x04\x00\x00\x01\x00\x00\x00\x00' | ||
| b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00aPK' | ||
| b'\x01\x02\x14\x00\x14\x00\x00\x00\x08\x00\xa0lH\x05\xe2\x1e' | ||
| b'8\xbb\x10\x00\x00\x00\t\x04\x00\x00\x01\x00\x00\x00\x00\x00' | ||
| b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00bPK\x05' | ||
| b'\x06\x00\x00\x00\x00\x02\x00\x02\x00^\x00\x00\x00/\x00\x00' | ||
| b'\x00\x00\x00' | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, binary data like this should be accompanied by an explanation of what it represents and ideally how it was constructed. Should this test start failing, how will someone know what this zip file represents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll add comments. Please see #137152.
The data was created manually. I created simple ZIP files using the zip command, and then edited them -- removed or moved the data, changed file names, updated offsets. I tested how it worked with unpatched zipfile (all was read successfully).
Only the data for test_quoted_overlap was generated by a script provided in the original report, and then simplified.
| I'm only just now getting around to reviewing this PR. I don't have enough context to make a good opinion about the choice and definitely no reason to revive the conversation now that it's merged. | 
Uh oh!
There was an error while loading. Please reload this page.