
Provide access to a wrapper that validates that bytes being read are valid UTF-8 #947

Draft
dralley wants to merge 8 commits into tafia:master from dralley:assume-utf8

Conversation

@dralley
Collaborator

@dralley dralley commented Feb 22, 2026

This is a small stepping stone towards #158

Comment thread src/errors.rs
fn from(error: IoError) -> Error {
Self::Io(Arc::new(error))
match error.kind() {
IoErrorKind::InvalidData => Self::Encoding(error.downcast::<EncodingError>().expect(
Collaborator Author

Apparently this was stabilized in 1.79: https://doc.rust-lang.org/std/io/struct.Error.html#method.downcast

So we would need to bump MSRV.

Collaborator

If you rebase your PR for whatever reason, it would be good to add this note to the commit that bumps the MSRV, so we can quickly find the reason later if required

Collaborator Author

@dralley dralley May 2, 2026

Sure, but on the other hand it's just 1.79, probably 90% of the (actively maintained) ecosystem has a newer MSRV than that already.

IMO the benefit is marginal when the pull request is one click away from the commit, and the commit itself is only one commit away from the commit that is "the reason".

I could just note that in the commit message.

Comment thread tests/encodings.rs Outdated
Comment thread src/encoding.rs Outdated
}
}

impl<R: Read> Read for Utf8ValidatingReader<R> {
Collaborator Author

@dralley dralley Feb 22, 2026

I need to do more testing and review on this implementation myself as well, I don't trust it yet.
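The core trick described in the doc comment can be sketched as follows (a hypothetical helper, not the PR's actual code): `Utf8Error` distinguishes "truncated mid-character" from "genuinely invalid" input.

```rust
use std::str;

/// Returns how many leading bytes form complete, valid UTF-8 sequences,
/// or Err(()) if the input contains a genuinely invalid sequence.
/// (Hypothetical helper, for illustration only.)
fn valid_utf8_prefix(bytes: &[u8]) -> Result<usize, ()> {
    match str::from_utf8(bytes) {
        Ok(_) => Ok(bytes.len()),
        // error_len() == None means the data merely ends in the middle of a
        // character (more bytes may arrive later); Some(_) means the bytes
        // can never become valid UTF-8 no matter what follows.
        Err(e) => match e.error_len() {
            None => Ok(e.valid_up_to()),
            Some(_) => Err(()),
        },
    }
}

fn main() {
    assert_eq!(valid_utf8_prefix(b"ab"), Ok(2));
    // 0xC3 starts a 2-byte sequence: incomplete, not invalid.
    assert_eq!(valid_utf8_prefix(b"ab\xC3"), Ok(2));
    // 0xFF can never appear in UTF-8.
    assert_eq!(valid_utf8_prefix(b"ab\xFF"), Err(()));
}
```

A validating reader can buffer the incomplete tail and retry it once more bytes arrive, which is exactly the boundary handling being tested here.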

@codecov-commenter

codecov-commenter commented Feb 22, 2026


Codecov Report

❌ Patch coverage is 87.16730% with 135 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.16%. Comparing base (a759d65) to head (e3ade48).
⚠️ Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
benches/encoding.rs 0.00% 72 Missing ⚠️
src/encoding.rs 95.09% 47 Missing ⚠️
src/reader/ns_reader.rs 0.00% 6 Missing ⚠️
src/reader/buffered_reader.rs 0.00% 4 Missing ⚠️
src/reader/mod.rs 25.00% 3 Missing ⚠️
src/events/attributes.rs 0.00% 2 Missing ⚠️
src/errors.rs 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #947      +/-   ##
==========================================
+ Coverage   55.08%   58.16%   +3.07%     
==========================================
  Files          44       45       +1     
  Lines       16911    18655    +1744     
==========================================
+ Hits         9316    10851    +1535     
- Misses       7595     7804     +209     
Flag Coverage Δ
unittests 58.16% <87.16%> (+3.07%) ⬆️


Comment thread src/reader/mod.rs
Collaborator

@Mingun Mingun left a comment

It seems to me that, rather than implementing our own decoding stream, we could use encoding_rs_io or a similar library to do the conversion.

I also just created #948 to show where we could change the reader from one that guesses the encoding to one that transparently decodes. It seems to me that honest decoding should be introduced at the level of the slightly higher-level XmlReader.

Comment thread src/encoding.rs Outdated
Comment thread src/encoding.rs Outdated
Comment thread src/reader/mod.rs Outdated
Comment thread src/encoding.rs
@dralley dralley force-pushed the assume-utf8 branch 2 times, most recently from c2dcb4b to a6fe9e5 on February 22, 2026 18:59
Comment thread src/reader/state.rs
let event = BytesDecl::from_start(BytesStart::wrap(content, 3, self.decoder()));

// TODO: once we can assume that the parser is operating on UTF-8, then we can throw
// an error here if we see a non-UTF-8 encoding... if encoding/decoding is not enabled.
Collaborator Author

We can't assume this yet because it's still possible to create non-validating readers under `#[cfg(not(feature = "encoding"))]`.

@dralley dralley force-pushed the assume-utf8 branch 3 times, most recently from f302282 to 2a1bbb0 on February 22, 2026 19:46
@dralley
Collaborator Author

dralley commented Feb 22, 2026

@Mingun Ignore for the time being the exact details of the Utf8ValidatingReader implementation - I already mentioned it still needs some additional work.

Architecturally, do you have any objections to the direction of this work? The general approach, how it's layered, etc.

Or the (long-term) goal of taking all decoding, encoding detection and/or validation, and BOM detection and/or stripping (and maybe EOL normalization) OUT of the parser itself and into a pre-processor, provided doing so either improves performance or has minimal cost?

It seems to me that, rather than implementing our own decoding stream, we could use encoding_rs_io or a similar library to do the conversion.

Yes, I agree, that's the path I was going down on the initial implementation. But assuming we want to continue keeping a separate encoding feature, the non-encoding implementation can still provide the same guarantees with validation, and that then simplifies everything written on top. You could get rid of the API divergence between encoding and not(encoding), safely move to a str / String based API, etc.

edit: probably this path could also be implemented with encoding_rs, it would just require carrying that dependency always. I am flexible on the details

I also just created #948 to show where we could change the reader from one that guesses the encoding to one that transparently decodes. It seems to me that honest decoding should be introduced at the level of the slightly higher-level XmlReader.

I still think the better approach is to just push all preprocessing down underneath the parsing code, and take advantage of the simplifications that makes possible. And the ability to be able to parse UTF-16.

Yes, maybe you can skip some of the validation that way, but there is also a cost to running the validation (and maybe allocations) many times over small buffers instead of once over a large buffer. Not to mention the additional internal complexity of Decoder infecting so many different object types (which also increases the size of all of those structs), even if it is otherwise hidden from the user.

@dralley
Collaborator Author

dralley commented Feb 22, 2026

Feature-wise, this PR is now complete. It's just a matter of improving the quality of Utf8ValidatingReader and deciding whether the new constructors are how we want it to be exposed.

Global decoding, improving APIs, etc. -- all of that is for future PRs. It should be able to be bolted onto this infrastructure though.

Comment thread src/encoding.rs Outdated

/// A reader wrapper that ensures only valid UTF-8 bytes are read.
///
/// This reader uses [`str::from_utf8()`] and [`Utf8Error::valid_up_to()`] to validate
Collaborator Author

@dralley dralley Feb 22, 2026

I'd like to also try simdutf8.

Probably this could also be done with encoding_rs. If you are open to requiring encoding_rs always, then we could ditch this implementation completely. I just figured that we probably wouldn't want to do that.

@dralley dralley force-pushed the assume-utf8 branch 3 times, most recently from eb13980 to 65aae52 on February 26, 2026 06:27
@dralley
Collaborator Author

dralley commented Feb 26, 2026

I'm working on a more efficient implementation that assumes R: BufRead. In the meantime, could you please give your opinion on the overall architecture? Everything except for the Utf8ValidatingReader itself is ready for review.

@dralley dralley force-pushed the assume-utf8 branch 4 times, most recently from f193173 to d0c4a1f on March 2, 2026 05:05
@dralley dralley marked this pull request as ready for review March 2, 2026 05:14
@dralley dralley requested a review from Mingun March 4, 2026 23:38
@dralley
Collaborator Author

dralley commented Mar 22, 2026

@Mingun Please take a look at my comments when you get a moment, I'd like to move this work along.

As mentioned I remain pretty flexible on the implementation (e.g. if you'd prefer to just use encoding_rs for everything) and basic API (I don't like the from_*_validating() constructors much, it was just convenient to get testing going without committing to API breakage)

@Mingun
Collaborator

Mingun commented Mar 23, 2026

I will try to find time to review this and the other latest issues / PRs at the end of the week.

Collaborator

@Mingun Mingun left a comment

Architecturally, do you have any objections to the direction of this work? The general approach, how it's layered, etc.

Architecturally, I suggest that the parser internals still work with [u8], since the parser does not require UTF-8 guarantees to work. We only need the guarantee of an XML-compatible encoding1, such as the single-byte encodings that encoding_rs supports. On the other hand, users usually work with UTF-8 text and would like an API that operates on str / String rather than [u8] / Vec<u8>. I think this is quite achievable, but we will have to introduce duplicate structures for the event types or somehow parametrize the existing ones.

We can achieve correctness now2 by substituting an R into Reader<R> that either transcodes the readable data to UTF-8 on the fly or, as in the case of this PR, checks that the data is correctly encoded UTF-8 (which is also a kind of zero-cost transcoding).

In this sense, I agree with the Utf8ValidatingReader approach implemented here. To further develop and provide the str-API, I assume that one should move towards implementing the new methods in

impl<R> Reader<Utf8ValidatingReader<R>> {
}

which would call the existing [u8]-based methods and convert the results to str (using unsafe methods, since we will have already checked that the input is UTF-8).
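A rough sketch of that idea (all type and method names here are hypothetical stand-ins, not quick-xml's real API):

```rust
use std::str;

// Hypothetical stand-ins for the real types, just to show the layering.
struct Utf8ValidatingReader<R> {
    inner: R,
}

struct Reader<R> {
    source: R,
}

impl<R> Reader<Utf8ValidatingReader<R>> {
    /// A str-returning method available only when the reader validates UTF-8.
    /// The [u8]-based machinery underneath stays untouched.
    fn decode<'a>(&self, bytes: &'a [u8]) -> &'a str {
        // SAFETY: every byte that reached the parser passed through
        // Utf8ValidatingReader, so it is guaranteed to be valid UTF-8.
        unsafe { str::from_utf8_unchecked(bytes) }
    }
}

fn main() {
    let reader = Reader {
        source: Utf8ValidatingReader { inner: () },
    };
    assert_eq!(reader.decode(b"<root/>"), "<root/>");
}
```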

Or the (long-term) goal of taking all decoding, encoding detection and/or validation, and BOM detection and/or stripping (and maybe EOL normalization) OUT of the parser itself and into a pre-processor, provided doing so either improves performance or has minimal cost?

This is probably closer to my vision, but more research is needed here. For example, we want to report offsets in units of the original byte source, and not in units of the source after normalization/decoding, which will be hidden from the original user.

I still think the better approach is to just push all preprocessing down underneath the parsing code, and take advantage of the simplifications that makes possible. And the ability to be able to parse UTF-16.

Yes, #948 allows exactly that. If the first event is an XML declaration (and the XML standard says that the XML declaration must be the very first thing; even empty lines before it are not allowed, which, by the way, allows you to detect UTF-16-like encodings without a BOM), then we can extract the encoding from it and replace the reader in EntityReader with another one that decodes in large chunks before parsing (since a Box<dyn BufRead> is stored there, the type will not change).

edit: probably this path could also be implemented with encoding_rs, it would just require carrying that dependency always. I am flexible on the details

In theory, we could use encoding_rs_io or encoding_rs_rw if we submit a PR there making their dependency on encoding_rs optional; the only operation supported in the absence of encoding_rs would be checking that the input is UTF-8, using standard library methods.


I chose the Request changes variant because I feel that at least the following can be changed:

  • use BufReader::with_capacity instead of ChunkedReader
  • use BufRead in Utf8BytesReader
  • detect_encoding could be extracted to separate PR
  • I think that the new constructors are redundant now. We can always add them later if required

Footnotes

  1. That is, those in which XML markup characters have the same codes as in ASCII / UTF-8

  2. Of course, currently we must know the encoding in advance. If we want to change it after reading the XML declaration or BOM, then we will need to change the Reader from which we read. I believe this will be easier to do after the introduction of a new XML reader in #948


Comment thread src/encoding.rs
Comment on lines +762 to +792
/// Helper reader that returns data in fixed-size chunks
struct ChunkedReader<'a> {
data: &'a [u8],
pos: usize,
chunk_size: usize,
}

impl<'a> ChunkedReader<'a> {
fn new(data: &'a [u8], chunk_size: usize) -> Self {
Self {
data,
pos: 0,
chunk_size,
}
}
}

impl<'a> Read for ChunkedReader<'a> {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
if self.pos >= self.data.len() {
return Ok(0);
}
let len = self
.chunk_size
.min(buf.len())
.min(self.data.len() - self.pos);
buf[..len].copy_from_slice(&self.data[self.pos..self.pos + len]);
self.pos += len;
Ok(len)
}
}
Collaborator

That is not necessary; you can use BufReader::with_capacity and the effect will be the same.

Collaborator Author

@dralley dralley May 2, 2026

I don't think so: BufReader has various tricks to bypass its internal buffer entirely when that buffer is smaller than the requested read.

Thus reads are not necessarily capped at "capacity" bytes

https://doc.rust-lang.org/src/std/io/buffered/bufreader.rs.html#339-342
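This is easy to demonstrate (a sketch of the behavior being described, with illustrative sizes): when the internal buffer is empty and the destination is at least as large as the capacity, BufReader::read goes straight to the inner reader.

```rust
use std::io::{BufReader, Cursor, Read};

fn main() {
    let data = vec![b'x'; 100];
    // Internal buffer capacity of 10 bytes...
    let mut reader = BufReader::with_capacity(10, Cursor::new(data));

    // ...but a 50-byte destination: the internal buffer is empty and the
    // request is >= capacity, so BufReader bypasses its buffer entirely
    // and reads directly from the inner Cursor.
    let mut buf = [0u8; 50];
    let n = reader.read(&mut buf).unwrap();
    assert_eq!(n, 50); // NOT capped at the 10-byte capacity
}
```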

Comment thread src/encoding.rs
Comment on lines +909 to +916
loop {
let mut buf = [0u8; 10];
let n = reader.read(&mut buf).unwrap();
if n == 0 {
break;
}
result.extend_from_slice(&buf[..n]);
}
Collaborator

You could use a stricter check here: we expect that only two calls to read are needed to get the data, and that the 3rd call returns 0.

Collaborator Author

@dralley dralley May 2, 2026

Utf8ValidatingReader internally buffers until it can return a valid sequence. So one single call to {Utf8ValidatingReader}.read() ought to successfully return the 2 bytes, but that maps to two read() calls on the inner ChunkedReader. And then I suppose a 2nd read() of the Utf8ValidatingReader would map to a single read() call on the internal ChunkedReader. So 2 external calls = 3 internal calls.

There's no way to read the intermediate results from ChunkedReader in the test. But we could store a call_count to assert on the number of internal read() calls, potentially.

Collaborator

I just meant unrolling the loop manually. That should be possible, because we expect that on some iteration n will be 0, and the number of iterations should be constant.
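For example, the test loop could be unrolled along these lines (a sketch using a plain Cursor as a stand-in for the wrapped reader, with illustrative byte counts):

```rust
use std::io::{Cursor, Read};

fn main() {
    // Stand-in for Utf8ValidatingReader wrapping a ChunkedReader.
    let mut reader = Cursor::new(b"hello world!".to_vec());
    let mut result = Vec::new();
    let mut buf = [0u8; 10];

    // First read: a full 10-byte chunk.
    let n = reader.read(&mut buf).unwrap();
    assert_eq!(n, 10);
    result.extend_from_slice(&buf[..n]);

    // Second read: the 2 remaining bytes.
    let n = reader.read(&mut buf).unwrap();
    assert_eq!(n, 2);
    result.extend_from_slice(&buf[..n]);

    // Final read: EOF, asserted explicitly instead of breaking out of a loop.
    assert_eq!(reader.read(&mut buf).unwrap(), 0);
    assert_eq!(result, b"hello world!");
}
```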

Comment thread src/encoding.rs
Comment on lines +929 to +936
loop {
let mut buf = [0u8; 10];
let n = reader.read(&mut buf).unwrap();
if n == 0 {
break;
}
result.extend_from_slice(&buf[..n]);
}
Collaborator

Same here: 3 + 1 calls

Comment thread src/encoding.rs
Comment on lines +949 to +956
loop {
let mut buf = [0u8; 10];
let n = reader.read(&mut buf).unwrap();
if n == 0 {
break;
}
result.extend_from_slice(&buf[..n]);
}
Collaborator

Same here: 4 + 1 calls

Comment thread src/encoding.rs
Comment on lines +529 to +535
if !self.encoding_checked {
self.encoding_checked = true;

let available = self.inner.fill_buf()?;
// detect_encoding uses starts_with, so patterns longer than the
// available data simply won't match — no length guard needed.
if let Some(detected) = detect_encoding(available) {
Collaborator

Isn't encoding_checked set too early? If the first returned buffer doesn't give us enough bytes to detect the encoding, we will not detect it at all. That is possible even for a BufReader with a large buffer, because fill_buf may keep returning a small buffer until it is consumed. That was the reason for #939

Collaborator Author

I had tried to be clear that the specific details of the implementation were not ready for review, just the overall concept. I wanted agreement on the general direction before putting in the additional hours needed to polish, which it definitely needs.

But, yes that's true.

Comment thread src/encoding.rs
/// Returns the expected total number of bytes in a UTF-8 character given its first byte
/// (2, 3, or 4). Used to determine how many continuation bytes are needed to complete a
/// pending incomplete sequence.
fn utf8_char_width(first_byte: u8) -> usize {
Collaborator

May be const?

Suggested change
fn utf8_char_width(first_byte: u8) -> usize {
const fn utf8_char_width(first_byte: u8) -> usize {

Also, it seems that using comparisons would be faster (as seen in utf8_width):
https://godbolt.org/z/oxsjY8nja
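A comparison-based const version might look like this (a sketch; the handling of invalid lead bytes is an assumption, since the PR's function is only called on lead bytes):

```rust
/// Expected total length in bytes of a UTF-8 character, given its lead byte.
/// A chain of comparisons rather than a lookup table; invalid lead bytes
/// (continuation bytes 0x80..=0xBF, and 0xF5..) are the caller's
/// responsibility in this sketch.
const fn utf8_char_width(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1 // ASCII
    } else if first_byte < 0xE0 {
        2 // lead bytes 0xC2..=0xDF
    } else if first_byte < 0xF0 {
        3 // lead bytes 0xE0..=0xEF
    } else {
        4 // lead bytes 0xF0..=0xF4
    }
}

fn main() {
    assert_eq!(utf8_char_width(b'a'), 1);
    assert_eq!(utf8_char_width(0xC3), 2); // lead byte of 'é'
    assert_eq!(utf8_char_width(0xE2), 3); // lead byte of '€'
    assert_eq!(utf8_char_width(0xF0), 4); // lead byte of most emoji
}
```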

Comment thread src/encoding.rs
/// misreported as invalid UTF-8.
///
/// The caller must ensure that `bytes[..index]` contains valid UTF-8 data.
fn floor_char_boundary(bytes: &[u8], index: usize) -> usize {
Collaborator

If it is possible to make it const, add const.
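It can indeed be written as a const fn. A sketch (relying, like the original, on the caller's guarantee that `bytes[..index]` is valid UTF-8):

```rust
/// Largest index <= `index` that lies on a UTF-8 character boundary.
/// const-compatible: walks backwards past continuation bytes (those
/// matching the bit pattern 10xxxxxx), clamping `index` to the length.
const fn floor_char_boundary(bytes: &[u8], index: usize) -> usize {
    let mut i = if index > bytes.len() { bytes.len() } else { index };
    while i > 0 && i < bytes.len() && (bytes[i] & 0xC0) == 0x80 {
        i -= 1;
    }
    i
}

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3
    assert_eq!(floor_char_boundary(s.as_bytes(), 2), 1); // inside 'é'
    assert_eq!(floor_char_boundary(s.as_bytes(), 3), 3); // already a boundary
    assert_eq!(floor_char_boundary(s.as_bytes(), 99), 6); // clamped to len
}
```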

Comment thread src/encoding.rs
Comment on lines +629 to +642
match std::str::from_utf8(available) {
Ok(_) => {
// All available bytes are valid UTF-8. Copy as many complete characters as
// fit in buf. We must land on a character boundary to avoid consuming a
// partial character from the BufRead — otherwise the next fill_buf() would
// start with orphaned continuation bytes, causing a false validation error.
let len = floor_char_boundary(available, buf.len());
if len == 0 {
return Ok(0);
}
buf[..len].copy_from_slice(&available[..len]);
self.inner.consume(len);
return Ok(len);
}
Collaborator

You validate some input multiple times here. Imagine that available is 8 KiB but buf is only 4 bytes. You will validate 8192 bytes but consume only 1..=4. Then you validate 8188..=8191 bytes (which were already checked) and consume only 1..=4, and so on.

Limit available to buf.len() and validate only that part. That may be slower, because you will then validate in chunks whose size depends on the output buffer. Maybe a better approach would be to:

  • maintain the length of already validated data
  • return them without checking when requested not more than that length
  • when new data coming, validate only them
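The suggestion above could be sketched as a small helper that remembers how far validation has progressed (name and shape are hypothetical; handling of truly invalid sequences is omitted here):

```rust
use std::str;

/// Extends the count of known-valid UTF-8 bytes at the front of `buf`,
/// checking only the not-yet-validated tail. Each byte is therefore
/// validated at most once, regardless of the output buffer size.
fn extend_validated(buf: &[u8], validated: usize) -> usize {
    match str::from_utf8(&buf[validated..]) {
        Ok(_) => buf.len(),
        Err(e) => validated + e.valid_up_to(),
    }
}

fn main() {
    // 8 KiB of ASCII followed by the lead byte of a 2-byte character.
    let mut buf = vec![b'a'; 8192];
    buf.push(0xC3);

    // First pass validates everything except the dangling lead byte...
    let validated = extend_validated(&buf, 0);
    assert_eq!(validated, 8192);

    // ...and once the continuation byte arrives, only the tail is rechecked.
    buf.push(0xA9); // completes 'é'
    assert_eq!(extend_validated(&buf, validated), 8194);
}
```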

Comment thread src/encoding.rs
Comment on lines +486 to +489
/// Small buffer for incomplete UTF-8 sequences at BufRead boundaries.
/// At most 3 bytes (the start of a 2, 3, or 4-byte sequence).
pending: [u8; 3],
pending_len: u8,
Collaborator

It seems that if you pull this code out into a separate auxiliary struct, the algorithm will become clearer. The same approach is used in encoding_rs_rw.
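Such an auxiliary struct might look like this (names are hypothetical, modeled loosely on the idea rather than on encoding_rs_rw's actual code):

```rust
/// Carries an incomplete UTF-8 sequence across fill_buf() boundaries.
/// At most 3 bytes can be pending (a 4-byte sequence missing its last byte).
#[derive(Default)]
struct PendingUtf8 {
    bytes: [u8; 3],
    len: u8,
}

impl PendingUtf8 {
    /// Stashes one more byte of the incomplete sequence.
    fn push(&mut self, byte: u8) {
        self.bytes[self.len as usize] = byte;
        self.len += 1;
    }

    /// The bytes stashed so far, to be re-validated once more data arrives.
    fn as_slice(&self) -> &[u8] {
        &self.bytes[..self.len as usize]
    }

    /// Called once the sequence has been completed and emitted.
    fn clear(&mut self) {
        self.len = 0;
    }
}

fn main() {
    let mut pending = PendingUtf8::default();
    pending.push(0xE2); // first two bytes of '€' (0xE2 0x82 0xAC)
    pending.push(0x82);
    assert_eq!(pending.as_slice(), &[0xE2u8, 0x82][..]);
    pending.clear();
    assert!(pending.as_slice().is_empty());
}
```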

Collaborator Author

I don't think it ends up being simpler, to be honest. Minimal difference, more code.

@dralley
Collaborator Author

dralley commented Apr 20, 2026

Alright, looking at this again now (spent some time on another project)

@dralley dralley force-pushed the assume-utf8 branch 2 times, most recently from 7219da6 to 770fc11 on May 2, 2026 02:08
@dralley dralley marked this pull request as draft May 2, 2026 02:08
@dralley dralley force-pushed the assume-utf8 branch 5 times, most recently from 1131c0c to d4a5517 on May 2, 2026 03:01
dralley added 2 commits May 1, 2026 23:02
It does not currently do any decoding, but this provides a place where
the validating and decoding functionality can be abstracted.
@dralley dralley force-pushed the assume-utf8 branch 3 times, most recently from f5c94ea to 86a73d5 on May 2, 2026 03:30
dralley added 2 commits May 1, 2026 23:32
Needed for: Error::downcast() used in the next commit
Do all of the plumbing necessary to return EncodingError directly from
Utf8ValidatingReader using IoError::InvalidData + error downcasting.

The Utf8 variant of EncodingError now holds an error enum, as we cannot
create instances of Utf8Error ourselves.
dralley added 2 commits May 1, 2026 23:36
In cases where the input is sufficiently short and doesn't contain
invalid sequences, Utf8ValidatingReader was unable to detect the input
as being not-UTF-8

We now call detect_encoding() during the first read() so that it can
more effectively raise the appropriate errors. Doing this (and BOM
stripping) upstream of the parser makes it possible to eliminate this
responsibility from the parser, once it can be relied upon on all code
paths.
It simplifies the implementation to just require BufRead.
dralley added 2 commits May 1, 2026 23:44
Utf8BytesReader now performs BOM stripping and returns only UTF-8 bytes
on all codepaths.
Keeping this around temporarily until I figure out what to do with it

Added constructors that include native utf-8 validation

The goal is to adopt this functionality into the standard constructors,
but backwards compatibility is tricky - this gives more room to
experiment first.

Reader::from_reader_validating()
Reader::from_file_validating()
NsReader::from_reader_validating()
NsReader::from_file_validating()

(when "encoding" feature is not enabled)
@dralley
Collaborator Author

dralley commented May 2, 2026

@Mingun If you think it would be worthwhile, I'm open to splitting this PR up slightly and only providing DecodingReader for now. It's much simpler than the Utf8ValidatingReader implementation and approximately as fast, and would at least ship something useful for the next release. Obviously it's still an opt-in feature.

@Mingun
Collaborator

Mingun commented May 2, 2026

Yes, if you think the PR can be split into isolated parts, I think it would be worthwhile. You can still keep this PR; just rebase it over the other, smaller PRs as required.
