Optimize `std::str::Chars::next` and `std::str::Chars::next_back` #142038

Kmeakin · 2025-06-04T19:08:45Z

Before/after for next: https://godbolt.org/z/Yb9TGc4va
Before/after for next_back: https://godbolt.org/z/v6x7GWsj1

std::sys_common::wtf8::Wtf8CodePoints will also benefit from this, since it uses the same next_code_point and next_code_point_reverse functions internally.

I also added tests for all codepoints in the range 0..=char::MAX (including surrogates, which can only appear in WTF-8), so the new implementations have been exhaustively tested.
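For readers who want to reproduce the idea outside the standard library's test suite, here is a minimal sketch of an exhaustive round-trip check using only the public API. It is not the test added in this PR: `char::from_u32` rejects surrogates, so this version covers only valid scalar values, and the WTF-8 surrogate cases (which need the internal decoders) are not shown. The function name `check_all_scalar_values` is hypothetical.

```rust
/// Round-trips every Unicode scalar value through `Chars::next` and
/// `Chars::next_back`, and returns how many values were checked.
/// (Hypothetical helper for illustration, not the test from the PR.)
fn check_all_scalar_values() -> u32 {
    let mut checked = 0;
    let mut buf = [0u8; 4];
    for cp in 0..=u32::from(char::MAX) {
        // `from_u32` returns `None` for surrogates (0xD800..=0xDFFF); skip them.
        let Some(ch) = char::from_u32(cp) else { continue };
        let s: &str = ch.encode_utf8(&mut buf);

        // Forward decoding yields the char and then exhausts the iterator.
        let mut fwd = s.chars();
        assert_eq!(fwd.next(), Some(ch));
        assert_eq!(fwd.next(), None);

        // Backward decoding must agree with forward decoding.
        let mut rev = s.chars();
        assert_eq!(rev.next_back(), Some(ch));
        assert_eq!(rev.next_back(), None);

        checked += 1;
    }
    checked
}
```

Calling `check_all_scalar_values()` panics on the first mismatch; otherwise it returns the number of scalar values it visited.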

rustbot · 2025-06-04T19:08:50Z

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Kmeakin · 2025-07-03T21:47:27Z

ping @scottmcm ?

library/alloctests/tests/str.rs

scottmcm · 2025-07-04T00:20:21Z

So you know, you can make diff views in godbolt: https://godbolt.org/z/Thn1bf9qG

library/core/src/str/validations.rs

scottmcm

The general structure here does make sense to me, but overall I feel like it removed a bunch of helpers and constants unnecessarily. Not having utf8_first_byte, sure, but this ends up repeating the X << 6 | (Y & 0x3F) in a bunch of places, so keeping the utf8_acc_cont_byte to do that would make sense to me. The standard library is always compiled with optimizations, and the MIR inliner will inline it, so there's no reason to avoid the function call. Having the u32::from in there would also make the two functions more similar, since now the forward one is using as u32 in a different line instead with no obvious reason why they should differ.
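To make the point about the helper concrete, here is a small sketch in which the `X << 6 | (Y & 0x3F)` step lives in one function and is reused for every continuation byte. `utf8_acc_cont_byte` and `CONT_MASK` are re-implemented locally under the names used in this thread, and the exact definitions in `library/core` may differ. `decode_sequence` is a hypothetical function that exists only for this illustration and does not validate its input beyond the lead byte and length.

```rust
/// Mask selecting the six payload bits of a UTF-8 continuation byte.
const CONT_MASK: u8 = 0b0011_1111;

/// Shifts the accumulator left by six bits and appends the payload of one
/// continuation byte. Local stand-in for the helper discussed above.
#[inline]
const fn utf8_acc_cont_byte(ch: u32, byte: u8) -> u32 {
    (ch << 6) | (byte & CONT_MASK) as u32
}

/// Decodes one UTF-8 sequence from the start of `bytes` (hypothetical
/// illustration; returns `None` for an empty or too-short slice, or for a
/// first byte that cannot start a sequence).
fn decode_sequence(bytes: &[u8]) -> Option<u32> {
    let (&first, rest) = bytes.split_first()?;
    // Number of continuation bytes implied by the lead byte, plus its payload.
    let (n_cont, init) = match first {
        0x00..=0x7F => (0, u32::from(first)),
        0xC0..=0xDF => (1, u32::from(first & 0x1F)),
        0xE0..=0xEF => (2, u32::from(first & 0x0F)),
        0xF0..=0xF7 => (3, u32::from(first & 0x07)),
        _ => return None, // stray continuation byte or invalid lead byte
    };
    let conts = rest.get(..n_cont)?;
    // Every continuation byte goes through the same helper.
    Some(conts.iter().fold(init, |acc, &b| utf8_acc_cont_byte(acc, b)))
}
```

With the helper in place, the shift-and-mask expression appears exactly once, which is the readability benefit argued for above.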

Kmeakin · 2025-07-07T23:51:16Z

> The general structure here does make sense to me, but overall I feel like it removed a bunch of helpers and constants unnecessarily. Not having utf8_first_byte, sure, but this ends up repeating the X << 6 | (Y & 0x3F) in a bunch of places, so keeping the utf8_acc_cont_byte to do that would make sense to me. The standard library is always compiled with optimizations, and the MIR inliner will inline it, so there's no reason to avoid the function call. Having the u32::from in there would also make the two functions more similar, since now the forward one is using as u32 in a different line instead with no obvious reason why they should differ.

I could not get LLVM to produce the movzx even with various combinations of assume and disjoint_bitor. I'll file an issue against LLVM instead of putting micro-optimizations like that in Rust.
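As background on why a disjoint-OR hint is applicable here at all: `c << 6` always has its low six bits clear, and `b & 0x3F` can only set those six bits, so the two operands never share a set bit. The sketch below checks that property with ordinary stable code. It does not use the `disjoint_bitor` intrinsic itself, and `combine_is_disjoint` is a hypothetical name.

```rust
/// Checks that combining an accumulator with a continuation-byte payload
/// never overlaps bits, so OR, ADD and XOR all produce the same value.
fn combine_is_disjoint(acc: u32, byte: u8) -> bool {
    let hi = acc << 6; // low six bits are zero after the shift
    let lo = u32::from(byte & 0x3F); // only the low six bits can be set
    (hi & lo) == 0 && (hi | lo) == hi.wrapping_add(lo) && (hi | lo) == (hi ^ lo)
}
```

The function returns `true` for every accumulator and every byte, which is the invariant a disjoint-OR hint communicates to the optimizer.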

Kmeakin · 2025-07-16T00:53:45Z

r? @scottmcm

rustbot · 2025-07-16T00:53:48Z

Requested reviewer is already assigned to this pull request.

Please choose another assignee.

scottmcm · 2025-08-29T17:02:55Z

r? libs

tgross35

From an implementation standpoint this looks good, but the unsafety isn't encapsulated correctly. This should be a pretty easy fix.

Could you update the godbolt links in the top post after this change?

library/core/src/str/validations.rs

tgross35 · 2025-09-05T06:03:10Z

library/core/src/str/validations.rs

+    // SAFETY: `bytes` produces a UTF-8-like string
+    let mut next_byte = || unsafe {
+        let b = *bytes.next().unwrap_unchecked();
+        assume(utf8_is_cont_byte(b));
+        b
+    };
+
+    // SAFETY: `bytes` produces a UTF-8-like string
+    let combine = |c: u32, b: u8| unsafe { disjoint_bitor(c << 6, u32::from(b & CONT_MASK)) };


These preconditions don't match up; by this API it is "safe" but completely unsound to call next_byte() 5 times on a 4 byte codepoint. And the safety comments don't cover the precondition.

Instead, this could probably be a function:

#[inline]
unsafe fn advance_mask<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> u32 {
    // next_byte followed by combine
}

to ensure that unsafety is encapsulated. This applies to both the forward and backward versions.

Kmeakin · 2025-09-10

The closures don't escape the body of the function, so we shouldn't need to worry about them being used unsoundly.

tgross35 · 2025-09-10

Potential use isn't really an issue; the problem is that the calls to next_byte() look innocently safe, but that isn't accurate.
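One way to picture the encapsulation being asked for is a sketch in which the unchecked read sits inside an `unsafe fn` with a documented precondition, so every call site needs its own `unsafe` block and justification. This is an illustration, not the code that ended up in the PR. The helper name `push_cont_byte` and its extra accumulator parameter are assumptions that differ from the `advance_mask` signature suggested above, plain `|` stands in for the `disjoint_bitor` intrinsic, and the `assume` call is omitted.

```rust
const CONT_MASK: u8 = 0b0011_1111;

/// Reads one continuation byte from `bytes` and appends its payload to `acc`.
///
/// # Safety
/// The next item of `bytes` must exist and must be a UTF-8 continuation byte.
#[inline]
unsafe fn push_cont_byte<'a, I: Iterator<Item = &'a u8>>(acc: u32, bytes: &mut I) -> u32 {
    // SAFETY: the caller guarantees that another byte is available.
    let b = unsafe { *bytes.next().unwrap_unchecked() };
    (acc << 6) | u32::from(b & CONT_MASK)
}

/// Decodes the first code point of `s` (hypothetical safe wrapper).
fn first_code_point(s: &str) -> Option<u32> {
    let mut bytes = s.as_bytes().iter();
    let x = *bytes.next()?;
    if x < 0x80 {
        return Some(u32::from(x)); // ASCII: no continuation bytes
    }
    // SAFETY (all blocks below): `s` is valid UTF-8, so a lead byte that
    // announces n continuation bytes is followed by exactly n of them.
    if x < 0xE0 {
        return Some(unsafe { push_cont_byte(u32::from(x & 0x1F), &mut bytes) });
    }
    if x < 0xF0 {
        let ch = unsafe { push_cont_byte(u32::from(x & 0x0F), &mut bytes) };
        return Some(unsafe { push_cont_byte(ch, &mut bytes) });
    }
    let ch = unsafe { push_cont_byte(u32::from(x & 0x07), &mut bytes) };
    let ch = unsafe { push_cont_byte(ch, &mut bytes) };
    Some(unsafe { push_cont_byte(ch, &mut bytes) })
}
```

Compared with a closure that hides the `unsafe` block, a fifth call on a four-byte sequence would now have to be written inside an explicit `unsafe` block, which makes the violated precondition visible in review.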

library/alloctests/tests/str.rs

tgross35 · 2025-09-05T06:19:19Z

I'll rerun after the above change but let's get a baseline

@bors2 try
@rust-timer queue

rust-bors · 2025-09-05T08:35:17Z

☀️ Try build successful (CI)
Build commit: 38eb248 (38eb24858adfeb14a237f88695c147e618d258bc, parent: 91edc3ebccc4daa46c20a93f4709862376da1fdd)

rust-timer · 2025-09-05T10:21:56Z

Finished benchmarking commit (38eb248): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @rustbot label: +perf-regression-triaged. If not, please fix the regressions and do another perf run. If its results are neutral or positive, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.1%, 0.4%] | 10 |
| Regressions ❌ (secondary) | 0.9% | [0.1%, 1.6%] | 13 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.2% | [0.1%, 0.4%] | 10 |

Max RSS (memory usage)

Results (primary 4.8%, secondary 2.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 4.8% | [3.6%, 6.0%] | 2 |
| Regressions ❌ (secondary) | 3.7% | [2.7%, 4.7%] | 4 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.5% | [-2.5%, -2.5%] | 1 |
| All ❌✅ (primary) | 4.8% | [3.6%, 6.0%] | 2 |

Cycles

Results (secondary 3.7%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 3.7% | [3.4%, 4.0%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

Results (primary 0.0%, secondary -0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.4%] | 15 |
| Regressions ❌ (secondary) | 0.1% | [0.0%, 0.1%] | 6 |
| Improvements ✅ (primary) | -0.1% | [-0.2%, -0.0%] | 10 |
| Improvements ✅ (secondary) | -0.1% | [-0.2%, -0.1%] | 38 |
| All ❌✅ (primary) | 0.0% | [-0.2%, 0.4%] | 25 |

Bootstrap: 468.177s -> 466.627s (-0.33%)
Artifact size: 390.48 MiB -> 390.47 MiB (-0.00%)

tgross35 · 2025-09-05T18:46:48Z

Well, those results are a bit interesting. From the first godbolt link in the top post I'm not really sure why we're showing regressions; the prelude is identical, the fastest path starting at LBB0_3 is still 9 instructions to the return, but the other two paths look like they do actually gain an instruction.

Cc the asm analyzer expert @hanna-kruppe who could probably provide some more insight.

hanna-kruppe · 2025-09-05T19:25:39Z

At least one benchmark that's gotten slower (unicode-normalization) uses chars() itself, so the regression might mean that rustc takes longer to compile the new implementation (this shows up in leaf crates because next_code_point and everything leading up to it is #[inline]). If this was a regression from char-based iteration in rustc executing more slowly, I would expect slowdowns across many more benchmarks.

tgross35 · 2025-09-10T20:55:05Z

Since the perf job doesn't really show anything useful here, do you have local benchmarks indicating an improvement?

Codepoints only range over 0..=0x10FFFF, so we can exhaustively test all of them.

By reordering some operations, we can expose some opportunities for common subexpression elimination (CSE). Also convert the series of nested `if` branches to early returns, which IMO makes the code clearer.

Comparison of assembly before and after for `next_code_point`: https://godbolt.org/z/9Te84YzhK

Comparison of assembly before and after for `next_code_point_reverse`: https://godbolt.org/z/fTx1a7oz1
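To illustrate the early-return shape on the reverse direction, here is a safe sketch that decodes the last code point of a string by walking its bytes back to front. It is written from scratch for this note and is not the PR's `next_code_point_reverse`. The names `last_code_point` and `is_cont_byte` are hypothetical, and it uses checked iteration (`?`) rather than the unchecked reads in the real implementation.

```rust
const CONT_MASK: u8 = 0b0011_1111;

/// True for bytes of the form 0b10xx_xxxx (UTF-8 continuation bytes).
fn is_cont_byte(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

/// Decodes the last code point of `s`, reading bytes back to front and
/// returning as soon as the lead byte is found.
fn last_code_point(s: &str) -> Option<u32> {
    let mut bytes = s.as_bytes().iter();
    let w = *bytes.next_back()?;
    if w < 0x80 {
        return Some(u32::from(w)); // ASCII
    }
    // `w` is a continuation byte; keep its payload and look further back.
    let w = u32::from(w & CONT_MASK);
    let z = *bytes.next_back()?;
    if !is_cont_byte(z) {
        return Some((u32::from(z & 0x1F) << 6) | w); // two-byte sequence
    }
    let zw = (u32::from(z & CONT_MASK) << 6) | w;
    let y = *bytes.next_back()?;
    if !is_cont_byte(y) {
        return Some((u32::from(y & 0x0F) << 12) | zw); // three-byte sequence
    }
    let yzw = (u32::from(y & CONT_MASK) << 12) | zw;
    let x = *bytes.next_back()?;
    Some((u32::from(x & 0x07) << 18) | yzw) // four-byte sequence
}
```

Each branch exits as soon as it has a complete value, and the partial results `w`, `zw`, and `yzw` are computed once and reused by the longer cases.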

rustbot · 2025-09-15T21:51:00Z

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

Kmeakin · 2025-09-15T23:48:24Z

Benchmark results

Very strange, I need to investigate why the regression is so large on x86.

AArch64 (Apple M3)

# Before
    str::iter::chars_sum::emoji     127817.70ns/iter  +/- 5261.26
    str::iter::chars_sum::en        205930.23ns/iter  +/- 9895.86
    str::iter::chars_sum::ru        128701.38ns/iter  +/- 3786.70
    str::iter::chars_sum::zh        127595.83ns/iter  +/- 4139.78
    str::iter::chars_sum_rev::emoji 118793.73ns/iter  +/- 2567.38
    str::iter::chars_sum_rev::en    203907.30ns/iter  +/- 9240.01
    str::iter::chars_sum_rev::ru    118621.43ns/iter  +/- 6272.55
    str::iter::chars_sum_rev::zh    120471.43ns/iter +/- 17688.21

# After
    str::iter::chars_sum::emoji     127281.27ns/iter +/- 4162.60
    str::iter::chars_sum::en        184459.38ns/iter +/- 5338.10
    str::iter::chars_sum::ru        129938.55ns/iter +/- 5699.32
    str::iter::chars_sum::zh        128248.26ns/iter +/- 3094.75
    str::iter::chars_sum_rev::emoji 138365.27ns/iter +/- 9381.55
    str::iter::chars_sum_rev::en    192812.52ns/iter +/- 9568.14
    str::iter::chars_sum_rev::ru    137198.26ns/iter +/- 4194.93
    str::iter::chars_sum_rev::zh    137196.53ns/iter +/- 4287.87

x86_64 (AMD Ryzen 9 9950X)

# Before
    str::iter::chars_sum::emoji      86071.50ns/iter  +/- 1779.54
    str::iter::chars_sum::en        111593.85ns/iter +/- 13602.43
    str::iter::chars_sum::ru         85952.98ns/iter  +/- 2205.90
    str::iter::chars_sum::zh         85990.57ns/iter  +/- 2480.61
    str::iter::chars_sum_rev::emoji  85084.79ns/iter  +/- 2116.92
    str::iter::chars_sum_rev::en    141257.73ns/iter  +/- 1072.84
    str::iter::chars_sum_rev::ru     85227.11ns/iter  +/- 1724.26
    str::iter::chars_sum_rev::zh     85248.70ns/iter  +/- 2139.76

# After
    str::iter::chars_sum::emoji      98827.02ns/iter  +/- 5825.00
    str::iter::chars_sum::en        112401.29ns/iter +/- 19397.71
    str::iter::chars_sum::ru         98462.94ns/iter   +/- 984.64
    str::iter::chars_sum::zh         98610.26ns/iter  +/- 1061.37
    str::iter::chars_sum_rev::emoji  91778.54ns/iter  +/- 8640.72
    str::iter::chars_sum_rev::en    141549.49ns/iter +/- 52848.65
    str::iter::chars_sum_rev::ru     92033.68ns/iter  +/- 1760.72
    str::iter::chars_sum_rev::zh     91682.21ns/iter  +/- 1702.64
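To read the tables above as relative changes, a small helper like the following can be used. The rows are copied from the reported numbers, and the selection of rows and the names `percent_change`, `ROWS`, and `report` are just for illustration. The reported `+/-` spreads are sizeable for some rows, so small differences should be treated with caution.

```rust
/// Relative change from `before` to `after` in percent (positive = slower).
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// A few (name, before, after) rows copied from the tables above, in ns/iter.
const ROWS: [(&str, f64, f64); 4] = [
    ("aarch64 chars_sum::en", 205930.23, 184459.38),
    ("aarch64 chars_sum_rev::emoji", 118793.73, 138365.27),
    ("x86_64 chars_sum::emoji", 86071.50, 98827.02),
    ("x86_64 chars_sum_rev::emoji", 85084.79, 91778.54),
];

/// Formats one line per row with a signed percentage.
fn report() -> Vec<String> {
    ROWS.iter()
        .map(|(name, before, after)| {
            format!("{name}: {:+.1}%", percent_change(*before, *after))
        })
        .collect()
}
```

Printing the lines returned by `report()` gives a quick view of which cases got faster and which got slower on each machine.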

rustbot assigned scottmcm Jun 4, 2025

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 4, 2025

scottmcm reviewed Jul 4, 2025

library/alloctests/tests/str.rs (outdated)

scottmcm reviewed Jul 4, 2025

library/core/src/str/validations.rs (outdated)

scottmcm requested changes Jul 4, 2025

rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 4, 2025

Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 26b614c to 54a699b Compare July 7, 2025 23:48

Kmeakin requested a review from scottmcm August 9, 2025 20:54

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 9, 2025

rustbot assigned tgross35 and unassigned scottmcm Aug 29, 2025

tgross35 requested changes Sep 5, 2025

rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 5, 2025

rust-bors bot added a commit that referenced this pull request Sep 5, 2025

Auto merge of #142038 - Kmeakin:km/optimize-str-chars-iterator, r=<try>

38eb248

Optimize `std::str::Chars::next` and `std::str::Chars::next_back`

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 5, 2025

rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Sep 5, 2025

Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 54a699b to 3069ce8 Compare September 10, 2025 16:48

Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 3069ce8 to 652813e Compare September 10, 2025 17:01

Kmeakin added 3 commits September 15, 2025 21:00

Add exhaustive tests for next_code_point and next_code_point_reverse

48d0413

Codepoints only range over 0..=0x10FFFF, so we can exhaustively test all of them.

Add benchmarks for char iterators

2c4b068

Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 652813e to 43c4909 Compare September 15, 2025 21:50
