ARJ: correct byte accounting and truncation errors #723

ppkarwasz · 2025-10-06T13:03:59Z

In the current implementation:

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

This PR introduces:

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**
  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.
* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.
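
To illustrate the byte-accounting and truncation ideas above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `CountingLittleEndianReader` and its method names are hypothetical, and it uses only the JDK rather than `EndianUtils`. It shows one way to route every little-endian read through a single counting primitive, so that the byte count cannot drift and truncation always surfaces as an `EOFException`.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration only; not the commons-compress implementation. */
final class CountingLittleEndianReader {
    private final InputStream in;
    private long bytesRead;

    CountingLittleEndianReader(InputStream in) {
        this.in = in;
    }

    /** Number of bytes successfully consumed so far. */
    long getBytesRead() {
        return bytesRead;
    }

    /** Reads one byte and counts it; truncation is reported as EOFException. */
    int readUnsignedByte() throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("Truncated input after " + bytesRead + " bytes");
        }
        bytesRead++;
        return b;
    }

    /** Reads a 16-bit little-endian value (low byte first). */
    int readUnsignedShortLE() throws IOException {
        int lo = readUnsignedByte();
        int hi = readUnsignedByte();
        return lo | (hi << 8);
    }

    /** Reads a 32-bit little-endian value as an unsigned long. */
    long readUnsignedIntLE() throws IOException {
        long lo = readUnsignedShortLE();
        long hi = readUnsignedShortLE();
        return lo | (hi << 16);
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes: a 16-bit field followed by a 32-bit field, both little-endian.
        byte[] data = {0x60, (byte) 0xEA, 0x04, 0x03, 0x02, 0x01};
        CountingLittleEndianReader reader =
                new CountingLittleEndianReader(new ByteArrayInputStream(data));
        System.out.printf("short = 0x%04X%n", reader.readUnsignedShortLE());
        System.out.printf("int   = 0x%08X%n", reader.readUnsignedIntLE());
        // After full consumption the counter equals the input length.
        System.out.println("consumed all bytes: " + (reader.getBytesRead() == data.length));
        try {
            reader.readUnsignedByte(); // nothing left: simulates a truncated archive
        } catch (EOFException e) {
            System.out.println("truncation reported: " + e.getMessage());
        }
    }
}
```

Because the counter only advances inside `readUnsignedByte()`, a failed read never inflates it, which is the property the "equals the archive size after full consumption" assertion relies on in this sketch.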


src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java

The static import makes it harder to distinguish calls that need to count bytes from those that do not.

garydgregory

Hi @ppkarwasz
I left a few comments.

src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java

garydgregory

Hi @ppkarwasz
I replied to your question.

src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java

> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

## Motivation

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

## What this changes

* All archive `InputStream` constructors now declare `throws IOException`.
* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.
* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
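
As a rough sketch of what eager validation in a constructor can look like, the example below reads and checks a global signature at construction time and declares `throws IOException`. The class `EagerSignatureInputStream` is hypothetical and is not the `ArArchiveInputStream` from the PR; the choice of a plain `IOException` for a mismatched signature is an assumption made for illustration. Only the AR global signature `!<arch>\n` is taken from the format itself.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stream for illustration only; not the commons-compress implementation. */
final class EagerSignatureInputStream extends FilterInputStream {
    /** Global signature at the start of an AR archive. */
    static final byte[] AR_SIGNATURE = "!<arch>\n".getBytes(StandardCharsets.US_ASCII);

    /**
     * Reads and validates the signature immediately, so callers see I/O
     * problems at construction time instead of on the first entry read.
     */
    EagerSignatureInputStream(InputStream in) throws IOException {
        super(in);
        byte[] actual = new byte[AR_SIGNATURE.length];
        int off = 0;
        while (off < actual.length) {
            int n = in.read(actual, off, actual.length - off);
            if (n < 0) {
                // Truncation is propagated as a plain EOFException (an IOException).
                throw new EOFException(
                        "Truncated signature: got " + off + " of " + actual.length + " bytes");
            }
            off += n;
        }
        if (!Arrays.equals(actual, AR_SIGNATURE)) {
            // Assumption: a generic IOException; a real library may use a more specific type.
            throw new IOException("Invalid archive signature");
        }
    }

    public static void main(String[] args) {
        byte[] truncated = "!<ar".getBytes(StandardCharsets.US_ASCII);
        // The checked exception forces callers to handle failure where the stream is created.
        try (InputStream s = new EagerSignatureInputStream(new ByteArrayInputStream(truncated))) {
            System.out.println("opened");
        } catch (IOException e) {
            System.out.println("failed at construction: " + e.getMessage());
        }
    }
}
```

The source incompatibility mentioned in the caution note shows up in `main`: once the constructor declares `throws IOException`, existing callers that construct the stream outside a `try` block or a method with a suitable `throws` clause stop compiling until they add one.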


garydgregory

Hi @ppkarwasz
Merged #731, please resolve conflicts here. TY!

ppkarwasz · 2025-10-16T19:04:32Z

@garydgregory,

I resolved the conflicts and sorted the methods alphabetically.

garydgregory · 2025-10-17T12:08:29Z

@ppkarwasz TY, merged!

ppkarwasz commented Oct 6, 2025

src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java

ppkarwasz added 3 commits October 6, 2025 15:49

fix: failing legacy test

824465b

fix: checkstyle error

7f80ae2

fix: remove EndianUtils static import

54a209e

The static import makes it harder to distinguish calls that need to count bytes from those that do not.

garydgregory requested changes Oct 11, 2025

src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java (outdated)

src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java

ppkarwasz mentioned this pull request Oct 12, 2025

ARJ: strict header validation and selfExtracting option #728

Open

Merge remote-tracking branch 'apache/commons_io_2_21_0' into fix/arj-…

5720fa3

…length-computation

garydgregory requested changes Oct 15, 2025

src/main/java/org/apache/commons/compress/archivers/arj/ArjArchiveInputStream.java (outdated)

ppkarwasz mentioned this pull request Oct 16, 2025

Source breaking change: Declare IOException on archive InputStream constructors #731

Merged

garydgregory requested changes Oct 16, 2025

ppkarwasz added 4 commits October 16, 2025 20:01

Merge remote-tracking branch 'apache/commons_io_2_21_0' into fix/arj-…

64103e4

…length-computation

Fix failing test

d729f4e

Sort methods

e62420f

Remove unused method

6a510ea

ppkarwasz requested a review from garydgregory October 16, 2025 21:00

garydgregory merged commit 2e84319 into commons_io_2_21_0 Oct 17, 2025
16 checks passed

garydgregory deleted the fix/arj-length-computation branch October 17, 2025 12:08
