ARJ: correct byte accounting and truncation errors #723
Conversation
In the current implementation:

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

This PR introduces:

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**
  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.
* An assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.
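The endianness point can be illustrated with a minimal stdlib-only sketch (not the actual Commons Compress code, and `EndianUtils` itself is not used here): `DataInputStream` reads multi-byte values big-endian, so every little-endian ARJ field previously needed a bespoke helper along these lines.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class LittleEndianDemo {
    // ARJ stores multi-byte fields little-endian; DataInputStream is big-endian
    // only, so a hand-rolled helper like this was needed for every field width.
    static int readLittleEndianUInt16(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        if (b0 < 0 || b1 < 0) {
            throw new EOFException("Truncated 16-bit field");
        }
        return (b1 << 8) | b0; // low byte first
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {0x2b, 0x01};
        // DataInputStream interprets the same two bytes big-endian:
        int bigEndian = new DataInputStream(new ByteArrayInputStream(bytes)).readUnsignedShort();
        int littleEndian = readLittleEndianUInt16(new ByteArrayInputStream(bytes));
        System.out.println(bigEndian);    // 0x2b01 == 11009
        System.out.println(littleEndian); // 0x012b == 299
    }
}
```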
src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java
The static import makes it harder to distinguish calls that need to count bytes from those that do not.
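To make this concern concrete (a hypothetical sketch; `countedRead` and the field names are illustrative, not the PR's actual helpers): when a counting helper is statically imported, `countedRead(in)` and a plain `in.read()` look alike at call sites, yet only one of them updates the tally behind `getBytesRead()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CountingDemo {
    private long bytesRead;

    // Hypothetical counted read: every byte consumed is added to the tally,
    // keeping getBytesRead() in sync with the stream position. A statically
    // imported version of this is hard to tell apart from an uncounted read.
    int countedRead(InputStream in) throws IOException {
        int b = in.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        CountingDemo demo = new CountingDemo();
        InputStream in = new ByteArrayInputStream(new byte[]{1, 2, 3});
        while (demo.countedRead(in) >= 0) { /* drain the stream */ }
        System.out.println(demo.getBytesRead()); // 3
    }
}
```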
Hi @ppkarwasz
I left a few comments.
src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java
src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java
Hi @ppkarwasz
I replied to your question.
src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java
> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

## Motivation

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

## What this changes

* All archive `InputStream` constructors now declare `throws IOException`.
* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.
* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate. For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
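The caller impact can be sketched with a stand-in class (hypothetical names; the real constructors are those of the `ArchiveInputStream` subclasses): code that previously compiled without any handler must now catch, or declare, `IOException`.

```java
import java.io.IOException;
import java.io.InputStream;

public class ConstructorThrowsDemo {
    // Hypothetical stand-in for an archive stream whose constructor now
    // declares IOException (names are illustrative, not the real API).
    static class FakeArchiveInputStream {
        FakeArchiveInputStream(InputStream in) throws IOException {
            if (in.read() < 0) { // eager validation, e.g. reading a magic header
                throw new IOException("empty stream");
            }
        }
    }

    public static void main(String[] args) {
        try {
            new FakeArchiveInputStream(InputStream.nullInputStream());
        } catch (IOException e) { // callers must now handle IOException here
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```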
Hi @ppkarwasz
Merged #731, please resolve conflicts here. TY!
I resolved the conflicts and sorted the methods alphabetically.
@ppkarwasz TY, merged!
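The truncation behavior at the heart of this PR can be simulated with plain `java.io` (a sketch assuming a stream cut off mid-header; `0x60 0xEA` is the ARJ signature): `readFully` raises `EOFException` when the stream ends before a fixed-size field is complete, which is the exception propagated from `getNextEntry()`/`read()`.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncationDemo {
    public static void main(String[] args) throws IOException {
        // Simulate an archive truncated mid-header: only 3 of the 4 bytes
        // needed for the signature plus basic-header length are present.
        byte[] truncated = {0x60, (byte) 0xea, 0x05};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated));
        byte[] header = new byte[4];
        try {
            in.readFully(header); // throws EOFException: stream ends early
        } catch (EOFException e) {
            System.out.println("truncated archive detected");
        }
    }
}
```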