Skip to content

Conversation

ppkarwasz
Copy link
Contributor

In the current implementation:

  • getBytesRead() could drift from the actual archive size after a full read.
  • Exceptions on truncation errors were inconsistent or missing.
  • DataInputStream (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

This PR introduces:

  • Accurate byte accounting: count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. getBytesRead() now matches the archive length at end-of-stream.

  • Consistent truncation handling:

    • Truncation in the main (archive) header, read during construction, now throws an ArchiveException wrapping an EOFException (cause preserved).
    • Truncation in file headers or file data is propagated as a plain EOFException from getNextEntry()/read().
  • Endianness refactor: replace DataInputStream with EndianUtils, removing several bespoke helpers and making intent explicit.

  • Add assertion that getBytesRead() equals the archive size after full consumption.

  • Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**

  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.

* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.
The static import makes it harder to distinguish calls that need to count bytes from those that do not.
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz
I left a few comments.

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz
I replied to your question.

ppkarwasz added a commit that referenced this pull request Oct 16, 2025
> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

## Motivation

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

## What this changes

* All archive `InputStream` constructors now declare `throws IOException`.

* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.

* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
ppkarwasz added a commit that referenced this pull request Oct 16, 2025
> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

* All archive `InputStream` constructors now declare `throws IOException`.

* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.

* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
garydgregory pushed a commit that referenced this pull request Oct 16, 2025
> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

* All archive `InputStream` constructors now declare `throws IOException`.

* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.

* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz
Merged #731, please resolve conflicts here. TY!

@ppkarwasz
Copy link
Contributor Author

@garydgregory,

I resolved the conflicts and sorted the methods alphabetically.

@garydgregory garydgregory merged commit 2e84319 into commons_io_2_21_0 Oct 17, 2025
16 checks passed
@garydgregory
Copy link
Member

@ppkarwasz TY, merged!

@garydgregory garydgregory deleted the fix/arj-length-computation branch October 17, 2025 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants