Skip to content

Conversation

0xCCF4
Copy link

@0xCCF4 0xCCF4 commented Jun 21, 2024

When reading a zipfile from within a zipfile, the inside zipfile stream is not seekable. Given the case that we are dealing with large files, it might be unfeasible to buffer the whole nested file to RAM. It may be better to stream the zip file instead. Relying on the local file header and data descriptors to determine the entries of the zip file. Completely ignoring the central directory at the file end.

Current project state:

  • There exists a method for streaming a non-seekable zip file stream.
  • This method does not handle data descriptors
  • Zip files with a data descriptor can not be processed by this function

My contribution:

  • Changed this method to allow parsing a zip file with data descriptors
  • Added tests for this feature

Current state:

  • Prepare the library for reading without requiring a Take object
  • Change the CRC checker to allow setting the CRC to check in the end
  • Security considerations regarding this zip reading approach
  • Add a custom reader the looks ahead for a data descriptor and then terminates the reading of a zip file content

Security Considerations:
Reading a zip archive that way without knowing the file size in advance may result in parsing inconsistency. Since the file content of a single file may be attacker controlled, we must assume that an attacker may craft a file that contains a sequence looking like a data descriptor. After that an attacker might put the header of a new zip file entry.
Parsing the file in streaming fashion will return two files. Parsing with the central directory information will results in just one file read. This should be regarded as security risk. Still, allowing to read zip files with data descriptors in a streamed fashion is useful for some applications. I therefore introduced the types UntrustedValue and MaybeUnstrusted. See https://github.com/0xCCF4/zip2/blob/f6b5da958028c5cb90d6de7b5b56a378de52fc95/src/result.rs#L110
If a library user would like to use this streaming functionality this security risk must be explicitly accepted.

Related issues:
#162

@0xCCF4 0xCCF4 marked this pull request as draft June 21, 2024 07:42
@0xCCF4 0xCCF4 changed the title draft: Add means to read a zip file part by part draft: feat: Interpret data descriptors when reading zip file from (read, nonseek) stream Jun 24, 2024
@0xCCF4 0xCCF4 marked this pull request as ready for review June 24, 2024 12:19
@0xCCF4 0xCCF4 changed the title draft: feat: Interpret data descriptors when reading zip file from (read, nonseek) stream feat: Interpret data descriptors when reading zip file from (read, nonseek) stream Jun 24, 2024
@0xCCF4
Copy link
Author

0xCCF4 commented Jun 24, 2024

The changes kind of exploded 🙈 😅 - but now it finally works. Looking forward to a review.
Best regards

Copy link
Member

@Pr0methean Pr0methean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decompressing readers only exist when the feature flag for their compression method is enabled. I may have more comments later.

Copy link
Member

@Pr0methean Pr0methean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems mostly functionally correct, but I have some performance concerns.

src/read.rs Outdated
}

let limit_reader = (reader as &'a mut dyn Read).take(result.compressed_size);
fn read_a_byte(&mut self) -> io::Result<Option<u8>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading one byte at a time like this won't be efficient. Instead, read a block into the look-ahead buffer and use memchr::memmem::find to check for a data descriptor signature. (Be sure to leave size_of::<Magic>() bytes in the buffer, in case the descriptor signature straddles two blocks!)

src/read.rs Outdated
Comment on lines 1771 to 1785
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer)
== spec::Magic::DATA_DESCRIPTOR_SIGNATURE
{
// potentially found a data descriptor
// check if size matches
let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) {
Ok(data_descriptor) => data_descriptor,
Err(_) => return None,
};
let data_descriptor_size = data_descriptor.compressed_size;

if data_descriptor_size == self.number_read_total_actual as u32 {
return Some(data_descriptor);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interpret already checks whether the magic matches.

Suggested change
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer)
== spec::Magic::DATA_DESCRIPTOR_SIGNATURE
{
// potentially found a data descriptor
// check if size matches
let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) {
Ok(data_descriptor) => data_descriptor,
Err(_) => return None,
};
let data_descriptor_size = data_descriptor.compressed_size;
if data_descriptor_size == self.number_read_total_actual as u32 {
return Some(data_descriptor);
}
}
let Ok(data_descriptor) = ZipDataDescriptor::interpret(&self.look_ahead_buffer) else {
return None;
};
let data_descriptor_size = data_descriptor.compressed_size;
if data_descriptor_size == self.number_read_total_actual as u32 {
Some(data_descriptor)
} else {
None
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the CRC32 here as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs some architectural changes. The function above operates on the compressed zip file. The CRC is calculated from the uncompressed data. With the current architecture checking CRC in this function is not possible.

Currently I have no good idea to solve this. Still, the CRC is checked, so if an "data descriptor" occurs in the stream - Which is only the case if the file size in the data descriptor is correct - the CRC is likely wrong. In this case only the first half/or part of the file is returned to the user, but then an error is thrown when the CRC check fails; reaching the stream EOF.

@Pr0methean Pr0methean added Please address review comments Some review comments are still open. and removed review in progress labels Jul 6, 2024
@0xCCF4
Copy link
Author

0xCCF4 commented Jul 8, 2024

Review todos:

  • Simple review suggestions
  • Convert Vec to VecDequeue
  • Optimize read_a_byte to read more than a byte at once 😄
  • Check CRC once a potential data descriptor is found

@Pr0methean Pr0methean added the Please fix rebase conflicts Please rebase this PR against master and fix the conflicts. label Jul 27, 2024
0xCCF4 and others added 4 commits September 10, 2025 16:06
…reader: &mut S) that reads the next zipfile entry from a stream by potential parsing the data descriptor

This is an alternative method to read a zip file. If possible, use the ZipArchive functions as some information will be missing when reading this manner.

This method extends the existing read_zipfile_from_stream method when the stream is seekable.

This method is superior to ZipArchive in the special case that the underlying stream implementation must buffer all seen/read data to provide seek back support and the memory consumption must be kept small. This could be the case when reading a ZipFile B.zip nested within a zip file A.zip, that is stored on disk. Since A is seekable when stored as file, A.zip can be read using ZipArchive. Be B a zip file stored in A, then we can read B's content using a decompressing reader. The problem is that the decompressing reader will not be seekable (due to decompression). When one would like to still read contents of B.zip without extracting A to disk, the file B.zip must be buffered in RAM. When using ZipArchive to read B.zip from RAM, the whole B.zip file must be buffered to RAM because the central directory of B.zip is located at the end of the file B.zip. This method will read B.zip from the start of the file and returning the first file entry found in B.zip. After the execution of this function and the ZipFile return value is dropped, the reader will be positioned at the end of the file entry. Since this function will never seek back to before the initial position of the stream when the function was called, the underlying stream implementation may discard, after dropping ZipFile, all buffered data before the current position of the stream. Summarizing: In given scenario, this method must not buffer the whole B.zip file to RAM, but only the first file entry.

# Conflicts:
#	src/read.rs
@Pr0methean Pr0methean force-pushed the feature-read-from-seekable-stream branch from 2ee9c18 to a0ced19 Compare September 10, 2025 23:48
let block = read_local_fileblock(reader)?;
let block = match block {
Some(block) => block,
None => return Ok(None),

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
)?,
}))
// we can't look for a data descriptor since we can't seek back
// TODO: provide a method that buffers the read data and allows seeking back

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
@Pr0methean
Copy link
Member

@0xCCF4 I've rebased this against master, but some tests are failing in CI. Could you please take a look?

@Pr0methean Pr0methean added Please fix failing tests Tests are failing with this change; please fix them. and removed Please address review comments Some review comments are still open. Please fix rebase conflicts Please rebase this PR against master and fix the conflicts. labels Sep 10, 2025
@0xCCF4
Copy link
Author

0xCCF4 commented Sep 11, 2025

@Pr0methean, I fixed the deflate CI test.

The other failing tests are:

  • read::stream::test::file_and_dir_predicates
  • read::stream::test::test_can_create_destination
  • read::stream::test::zip_read_streaming
  • read::test::zip_read_streaming

The first zip file entry is read successfully, the second one fails in each case.
I suppose the problem is that between entries, some random bytes are scattered around, and one has to search for the next magic before trying to read the next entry.

At this time, I unfortunately do not have the capacity to look into this further.

@Pr0methean Pr0methean marked this pull request as draft September 11, 2025 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Please fix failing tests Tests are failing with this change; please fix them.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants