Kmeakin
Contributor

@Kmeakin Kmeakin commented Jun 4, 2025

Before/after for next: https://godbolt.org/z/Yb9TGc4va
Before/after for next_back: https://godbolt.org/z/v6x7GWsj1

std::sys_common::wtf8::Wtf8CodePoints will also benefit from this, since it uses the same next_code_point and next_code_point_reverse functions internally.

I also added tests for all codepoints in the range 0..=char::MAX (including surrogates, which can only appear in WTF-8), so the new implementations have been exhaustively tested.
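For readers who want to reproduce the idea outside the standard library, here is a minimal sketch of such an exhaustive round-trip check using only public, stable APIs. The function name `check_all_scalar_values` is hypothetical and this is not the test added by the PR; in particular, surrogates cannot be stored in a `char` or `str`, so this version skips them, whereas the PR covers them through the WTF-8 internals.

```rust
/// Round-trips every Unicode scalar value through `Chars::next` and
/// `Chars::next_back`, returning how many values were checked.
fn check_all_scalar_values() -> u32 {
    let mut checked = 0;
    let mut buf = [0u8; 4];
    for cp in 0..=u32::from(char::MAX) {
        // `char::from_u32` rejects the surrogate range 0xD800..=0xDFFF,
        // which cannot appear in a valid `str`.
        let Some(c) = char::from_u32(cp) else { continue };
        let s: &str = c.encode_utf8(&mut buf);

        // Forward decoding must yield exactly this one character.
        let mut fwd = s.chars();
        assert_eq!(fwd.next(), Some(c));
        assert_eq!(fwd.next(), None);

        // Backward decoding must agree with forward decoding.
        let mut rev = s.chars();
        assert_eq!(rev.next_back(), Some(c));
        assert_eq!(rev.next_back(), None);

        checked += 1;
    }
    checked
}
```

Calling `check_all_scalar_values()` panics on the first mismatch and otherwise returns the number of scalar values visited, which is the full range minus the 0x800 surrogates.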

@rustbot
Collaborator

rustbot commented Jun 4, 2025

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 4, 2025
@Kmeakin
Contributor Author

Kmeakin commented Jul 3, 2025

ping @scottmcm ?

@scottmcm
Member

scottmcm commented Jul 4, 2025

So you know, you can make diff views in godbolt: https://godbolt.org/z/Thn1bf9qG

Member

@scottmcm scottmcm left a comment

The general structure here does make sense to me, but overall I feel like it removed a bunch of helpers and constants unnecessarily. Not having `utf8_first_byte`, sure, but this ends up repeating the `X << 6 | (Y & 0x3F)` in a bunch of places, so keeping `utf8_acc_cont_byte` to do that would make sense to me. The standard library is always compiled with optimizations, and the MIR inliner will inline it, so there's no reason to avoid the function call. Having the `u32::from` in there would also make the two functions more similar, since the forward one now uses `as u32` on a different line, with no obvious reason why they should differ.
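To make the suggestion concrete, here is a sketch of what keeping such a helper could look like. The name `utf8_acc_cont_byte` and the expression `X << 6 | (Y & 0x3F)` come from the comment above; the exact body, the `CONT_MASK` value, and the `decode_three` caller are assumptions written for illustration, not the code in the PR.

```rust
/// Mask selecting the six payload bits of a UTF-8 continuation byte.
/// (Assumed value, matching the `0x3F` in the review comment.)
const CONT_MASK: u8 = 0b0011_1111;

/// Shifts the accumulated code point left by six bits and appends the
/// payload of one continuation byte.
#[inline]
fn utf8_acc_cont_byte(ch: u32, byte: u8) -> u32 {
    (ch << 6) | u32::from(byte & CONT_MASK)
}

/// Hypothetical caller: decodes a three-byte sequence by applying the
/// helper twice. Assumes `bytes` is a well-formed three-byte sequence.
fn decode_three(bytes: [u8; 3]) -> u32 {
    // A three-byte leading byte carries four payload bits.
    let init = u32::from(bytes[0] & 0x0F);
    utf8_acc_cont_byte(utf8_acc_cont_byte(init, bytes[1]), bytes[2])
}
```

With the helper in place, each decoding path spells out only its leading-byte mask, and the shift-and-mask step lives in one spot.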

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 4, 2025
@Kmeakin Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 26b614c to 54a699b Compare July 7, 2025 23:48
@Kmeakin
Contributor Author

Kmeakin commented Jul 7, 2025

> The general structure here does make sense to me, but overall I feel like it removed a bunch of helpers and constants unnecessarily. Not having `utf8_first_byte`, sure, but this ends up repeating the `X << 6 | (Y & 0x3F)` in a bunch of places, so keeping `utf8_acc_cont_byte` to do that would make sense to me. The standard library is always compiled with optimizations, and the MIR inliner will inline it, so there's no reason to avoid the function call. Having the `u32::from` in there would also make the two functions more similar, since the forward one now uses `as u32` on a different line, with no obvious reason why they should differ.

I could not get LLVM to produce the `movzx`, even with various combinations of `assume` and `disjoint_bitor`. I'll file an issue against LLVM instead of putting micro-optimizations like that in Rust.

@Kmeakin
Contributor Author

Kmeakin commented Jul 16, 2025

r? @scottmcm

@rustbot
Collaborator

rustbot commented Jul 16, 2025

Requested reviewer is already assigned to this pull request.

Please choose another assignee.

@Kmeakin Kmeakin requested a review from scottmcm August 9, 2025 20:54
@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 9, 2025
@scottmcm
Member

r? libs

@rustbot rustbot assigned tgross35 and unassigned scottmcm Aug 29, 2025
Contributor

@tgross35 tgross35 left a comment

From an implementation standpoint this looks good, but the unsafety isn't encapsulated correctly. This should be a pretty easy fix.

Could you update the godbolt links in the top post after this change?

Comment on lines 29 to 37
// SAFETY: `bytes` produces a UTF-8-like string
let mut next_byte = || unsafe {
    let b = *bytes.next().unwrap_unchecked();
    assume(utf8_is_cont_byte(b));
    b
};

// SAFETY: `bytes` produces a UTF-8-like string
let combine = |c: u32, b: u8| unsafe { disjoint_bitor(c << 6, u32::from(b & CONT_MASK)) };
Contributor

These preconditions don't match up; by this API it is "safe" but completely unsound to call next_byte() 5 times on a 4 byte codepoint. And the safety comments don't cover the precondition.

Instead, this could probably be a function:

#[inline]
unsafe fn advance_mask<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> u32 {
    // next_byte followed by combine
}

to ensure that unsafety is encapsulated. This applies to both the forward and backward versions.
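One way to read this suggestion is sketched below on stable Rust. The names `push_cont_byte` and `decode_two_byte_prefix` are hypothetical, the accumulator parameter is an assumption (the outline above does not show one), and `std::hint::assert_unchecked` plus a plain `|` stand in for the unstable `assume` and `disjoint_bitor` intrinsics used in the PR. The point is only that the precondition moves into an `unsafe fn` contract, so every call site has to be wrapped in `unsafe` and justified.

```rust
use std::hint::assert_unchecked;

const CONT_MASK: u8 = 0b0011_1111;

/// Reads one continuation byte and folds it into `acc`.
///
/// # Safety
/// The next item of `bytes` must exist and must be a UTF-8 continuation
/// byte (`0b10xx_xxxx`).
#[inline]
unsafe fn push_cont_byte<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I, acc: u32) -> u32 {
    // SAFETY: the caller guarantees that another byte is available.
    let b = unsafe { *bytes.next().unwrap_unchecked() };
    // SAFETY: the caller guarantees that `b` is a continuation byte.
    unsafe { assert_unchecked(b & 0b1100_0000 == 0b1000_0000) };
    (acc << 6) | u32::from(b & CONT_MASK)
}

/// Decodes the two-byte sequence at the start of `s`, if there is one.
fn decode_two_byte_prefix(s: &str) -> Option<u32> {
    let mut bytes = s.as_bytes().iter();
    let first = *bytes.next()?;
    if first & 0b1110_0000 != 0b1100_0000 {
        return None; // not a two-byte leading byte
    }
    // SAFETY: `s` is valid UTF-8 and `first` is a two-byte leading byte,
    // so exactly one continuation byte follows.
    Some(unsafe { push_cont_byte(&mut bytes, u32::from(first & 0x1F)) })
}
```

Unlike the closure version, a fifth call on a four-byte sequence can no longer be written without an `unsafe` block whose safety comment would have to claim something false.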

Contributor Author

The closures don't escape the body of the function, so we shouldn't need to worry about them being used unsoundly.

Contributor

Potential use isn't really an issue; the problem is that the calls to next_byte() look innocently safe, but that isn't accurate.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 5, 2025
@tgross35
Contributor

tgross35 commented Sep 5, 2025

I'll rerun after the above change but let's get a baseline

@bors2 try
@rust-timer queue

rust-bors bot added a commit that referenced this pull request Sep 5, 2025
Optimize `std::str::Chars::next` and `std::str::Chars::next_back`
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 5, 2025
@rust-bors

rust-bors bot commented Sep 5, 2025

☀️ Try build successful (CI)
Build commit: 38eb248 (38eb24858adfeb14a237f88695c147e618d258bc, parent: 91edc3ebccc4daa46c20a93f4709862376da1fdd)

@rust-timer
Collaborator

Finished benchmarking commit (38eb248): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @rustbot label: +perf-regression-triaged. If not, please fix the regressions and do another perf run. If its results are neutral or positive, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.1%, 0.4%] | 10 |
| Regressions ❌ (secondary) | 0.9% | [0.1%, 1.6%] | 13 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.2% | [0.1%, 0.4%] | 10 |

Max RSS (memory usage)

Results (primary 4.8%, secondary 2.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 4.8% | [3.6%, 6.0%] | 2 |
| Regressions ❌ (secondary) | 3.7% | [2.7%, 4.7%] | 4 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.5% | [-2.5%, -2.5%] | 1 |
| All ❌✅ (primary) | 4.8% | [3.6%, 6.0%] | 2 |

Cycles

Results (secondary 3.7%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 3.7% | [3.4%, 4.0%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

Results (primary 0.0%, secondary -0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.4%] | 15 |
| Regressions ❌ (secondary) | 0.1% | [0.0%, 0.1%] | 6 |
| Improvements ✅ (primary) | -0.1% | [-0.2%, -0.0%] | 10 |
| Improvements ✅ (secondary) | -0.1% | [-0.2%, -0.1%] | 38 |
| All ❌✅ (primary) | 0.0% | [-0.2%, 0.4%] | 25 |

Bootstrap: 468.177s -> 466.627s (-0.33%)
Artifact size: 390.48 MiB -> 390.47 MiB (-0.00%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Sep 5, 2025
@tgross35
Contributor

tgross35 commented Sep 5, 2025

Well, those results are a bit interesting. From the first godbolt link in the top post I'm not really sure why we're showing regressions; the prelude is identical, the fastest path starting at LBB0_3 is still 9 instructions to the return, but the other two paths look like they do actually gain an instruction.

Cc the asm analyzer expert @hanna-kruppe who could probably provide some more insight.

@hanna-kruppe
Contributor

At least one benchmark that's gotten slower (unicode-normalization) uses chars() itself, so the regression might mean that rustc takes longer to compile the new implementation (this shows up in leaf crates because next_code_point and everything leading up to it is #[inline]). If this was a regression from char-based iteration in rustc executing more slowly, I would expect slowdowns across many more benchmarks.

@Kmeakin Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 54a699b to 3069ce8 Compare September 10, 2025 16:48
@Kmeakin Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 3069ce8 to 652813e Compare September 10, 2025 17:01
@tgross35
Contributor

Since the perf job doesn't really show anything useful here, do you have local benchmarks indicating an improvement?

Codepoints only range over 0..=0x10FFFF, so we can exhaustively test
all of them.

By reordering some operations, we can expose some opportunities for
CSE. Also convert the series of nested `if` branches to early returns,
which IMO makes the code clearer.

Comparison of assembly before and after for `next_code_point`:
https://godbolt.org/z/9Te84YzhK

Comparison of assembly before and after for `next_code_point_reverse`:
https://godbolt.org/z/fTx1a7oz1
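
As an illustration of the early-return shape described in this commit message, here is a safe sketch of a forward decoder. It is not the code from the PR: the name `next_code_point_sketch` is hypothetical, it uses checked `next()?` calls instead of the unchecked reads under review, and it assumes its input is already well-formed UTF-8 rather than validating it.

```rust
/// Decodes the next code point from `bytes`, returning as soon as the
/// sequence length is known. Returns `None` when the input is empty or
/// ends mid-sequence. Performs no validation.
fn next_code_point_sketch<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Append the six payload bits of a continuation byte to `acc`.
    let cont = |acc: u32, b: u8| (acc << 6) | u32::from(b & 0x3F);

    let b0 = *bytes.next()?;
    if b0 < 0x80 {
        return Some(u32::from(b0)); // one byte: ASCII
    }

    let b1 = *bytes.next()?;
    if b0 < 0xE0 {
        return Some(cont(u32::from(b0 & 0x1F), b1)); // two bytes
    }

    let b2 = *bytes.next()?;
    if b0 < 0xF0 {
        return Some(cont(cont(u32::from(b0 & 0x0F), b1), b2)); // three bytes
    }

    let b3 = *bytes.next()?;
    Some(cont(cont(cont(u32::from(b0 & 0x07), b1), b2), b3)) // four bytes
}
```

Each length class exits on its own line, so the one-byte path never touches the continuation-byte logic and the longer paths read top to bottom.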
@Kmeakin Kmeakin force-pushed the km/optimize-str-chars-iterator branch from 652813e to 43c4909 Compare September 15, 2025 21:50
@rustbot
Collaborator

rustbot commented Sep 15, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@Kmeakin
Contributor Author

Kmeakin commented Sep 15, 2025

Benchmark results

Very strange; I need to investigate why the regression is so large on x86.

AArch64 (Apple M3)

# Before
    str::iter::chars_sum::emoji     127817.70ns/iter  +/- 5261.26
    str::iter::chars_sum::en        205930.23ns/iter  +/- 9895.86
    str::iter::chars_sum::ru        128701.38ns/iter  +/- 3786.70
    str::iter::chars_sum::zh        127595.83ns/iter  +/- 4139.78
    str::iter::chars_sum_rev::emoji 118793.73ns/iter  +/- 2567.38
    str::iter::chars_sum_rev::en    203907.30ns/iter  +/- 9240.01
    str::iter::chars_sum_rev::ru    118621.43ns/iter  +/- 6272.55
    str::iter::chars_sum_rev::zh    120471.43ns/iter +/- 17688.21

# After
    str::iter::chars_sum::emoji     127281.27ns/iter +/- 4162.60
    str::iter::chars_sum::en        184459.38ns/iter +/- 5338.10
    str::iter::chars_sum::ru        129938.55ns/iter +/- 5699.32
    str::iter::chars_sum::zh        128248.26ns/iter +/- 3094.75
    str::iter::chars_sum_rev::emoji 138365.27ns/iter +/- 9381.55
    str::iter::chars_sum_rev::en    192812.52ns/iter +/- 9568.14
    str::iter::chars_sum_rev::ru    137198.26ns/iter +/- 4194.93
    str::iter::chars_sum_rev::zh    137196.53ns/iter +/- 4287.87

x86_64 (AMD Ryzen 9 9950X)

# Before
    str::iter::chars_sum::emoji      86071.50ns/iter  +/- 1779.54
    str::iter::chars_sum::en        111593.85ns/iter +/- 13602.43
    str::iter::chars_sum::ru         85952.98ns/iter  +/- 2205.90
    str::iter::chars_sum::zh         85990.57ns/iter  +/- 2480.61
    str::iter::chars_sum_rev::emoji  85084.79ns/iter  +/- 2116.92
    str::iter::chars_sum_rev::en    141257.73ns/iter  +/- 1072.84
    str::iter::chars_sum_rev::ru     85227.11ns/iter  +/- 1724.26
    str::iter::chars_sum_rev::zh     85248.70ns/iter  +/- 2139.76

# After
    str::iter::chars_sum::emoji      98827.02ns/iter  +/- 5825.00
    str::iter::chars_sum::en        112401.29ns/iter +/- 19397.71
    str::iter::chars_sum::ru         98462.94ns/iter   +/- 984.64
    str::iter::chars_sum::zh         98610.26ns/iter  +/- 1061.37
    str::iter::chars_sum_rev::emoji  91778.54ns/iter  +/- 8640.72
    str::iter::chars_sum_rev::en    141549.49ns/iter +/- 52848.65
    str::iter::chars_sum_rev::ru     92033.68ns/iter  +/- 1760.72
    str::iter::chars_sum_rev::zh     91682.21ns/iter  +/- 1702.64
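
To put the raw timings above on a common scale, a small sketch like the following computes the relative change for a few of the rows. The helper names `percent_change` and `report` are hypothetical, and the sample values are copied from the results above; the `+/-` spread reported by the harness is not modeled, so rows whose difference is smaller than their noise should be read with caution.

```rust
/// Relative change from `before` to `after`, in percent.
/// Positive values mean the "after" run was slower.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Selected (name, before, after) timings in ns/iter, copied from the
/// benchmark results in this comment.
const SAMPLES: [(&str, f64, f64); 4] = [
    ("aarch64 chars_sum::en", 205930.23, 184459.38),
    ("aarch64 chars_sum_rev::emoji", 118793.73, 138365.27),
    ("x86_64 chars_sum::emoji", 86071.50, 98827.02),
    ("x86_64 chars_sum_rev::emoji", 85084.79, 91778.54),
];

/// Formats one line per sample, with a signed percentage.
fn report() -> Vec<String> {
    SAMPLES
        .iter()
        .map(|(name, before, after)| {
            format!("{name}: {:+.1}%", percent_change(*before, *after))
        })
        .collect()
}
```

On these rows the sketch shows the AArch64 `en` forward case improving by roughly 10%, while the multi-byte cases regress by roughly 8% to 16% depending on the platform and direction.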

Labels

perf-regression Performance regression. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.

7 participants