Contributor

okaneco commented Oct 7, 2025

  • Refactor the current functionality into a helper function
  • Use as_chunks to encourage auto-vectorization in the optimized chunk processing function
  • Add a codegen test checking for vectorization and no panicking
  • Add benches for eq_ignore_ascii_case

The optimized function is initially only enabled for x86_64 which has sse2 as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation.

Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.

Benchmarks: cases below 16 bytes are unaffected, while all cases above 16 bytes show sizeable improvements.

before:
    str::eq_ignore_ascii_case::bench_large_str_eq         4942.30ns/iter +/- 48.20
    str::eq_ignore_ascii_case::bench_medium_str_eq         632.01ns/iter +/- 16.87
    str::eq_ignore_ascii_case::bench_str_17_bytes_eq        16.28ns/iter  +/- 0.45
    str::eq_ignore_ascii_case::bench_str_31_bytes_eq        35.23ns/iter  +/- 2.28
    str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq       7.56ns/iter  +/- 0.22
    str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq    2.64ns/iter  +/- 0.06
after:
    str::eq_ignore_ascii_case::bench_large_str_eq         611.63ns/iter +/- 28.29
    str::eq_ignore_ascii_case::bench_medium_str_eq         77.10ns/iter +/- 19.76
    str::eq_ignore_ascii_case::bench_str_17_bytes_eq        3.49ns/iter  +/- 0.39
    str::eq_ignore_ascii_case::bench_str_31_bytes_eq        3.50ns/iter  +/- 0.27
    str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq      7.27ns/iter  +/- 0.09
    str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq   2.60ns/iter  +/- 0.05

Refactor the current functionality into a helper function
Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
Add a codegen test
Add benches for `eq_ignore_ascii_case`

The optimized function is initially only enabled for x86_64 which has `sse2` as
part of its baseline, but none of the code is platform specific. Other
platforms with SIMD instructions may also benefit from this implementation.

Performance improvements only manifest for slices of 16 bytes or longer, so the
optimized path is gated behind a length check for greater than or equal to 16.

rustbot added the S-waiting-on-review, T-compiler, and T-libs labels Oct 7, 2025

Collaborator

rustbot commented Oct 7, 2025

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Refactor the eq check into an inner function for reuse in tail checking

Rather than fall back to the simple implementation for tail handling,
load the last 16 bytes to take advantage of vectorization. This doesn't
seem to negatively impact check time even when the remainder count is low.

Contributor Author

okaneco commented Oct 7, 2025

I've pushed a commit that avoids falling back to scalar checking for the remainder.

We reload the last 16 bytes of the slices if there's a remainder, which improves the 31 byte case and doesn't seem to regress the 17 byte case.

scalar tail handling
    ascii::eq_ignore_ascii_case::bench_long_str_eq          54.75ns/iter +/- 1.51
    ascii::eq_ignore_ascii_case::bench_str_17_bytes_eq       4.77ns/iter +/- 0.12
    ascii::eq_ignore_ascii_case::bench_str_31_bytes_eq      23.00ns/iter +/- 4.56
    ascii::eq_ignore_ascii_case::bench_str_of_8_bytes_eq     7.61ns/iter +/- 0.16
    ascii::eq_ignore_ascii_case::bench_str_under_8_bytes_eq  2.61ns/iter +/- 0.07
load last 16 bytes of the slice, newest commit
    ascii::eq_ignore_ascii_case::bench_long_str_eq          51.60ns/iter +/- 5.28
    ascii::eq_ignore_ascii_case::bench_str_17_bytes_eq       3.62ns/iter +/- 0.54
    ascii::eq_ignore_ascii_case::bench_str_31_bytes_eq       3.56ns/iter +/- 0.27
    ascii::eq_ignore_ascii_case::bench_str_of_8_bytes_eq     7.79ns/iter +/- 1.01
    ascii::eq_ignore_ascii_case::bench_str_under_8_bytes_eq  2.73ns/iter +/- 0.05

let (other_chunks, _) = other.as_chunks::<N>();

// Branchless check to encourage auto-vectorization
const fn eq_ignore_ascii_inner(lhs: &[u8; N], rhs: &[u8; N]) -> bool {
Contributor

When I copied this code into Compiler Explorer with -C opt_level=3, the call to eq_ignore_ascii_inner did not get inlined. I would suggest marking this function #[inline(always)] and adding a CHECK-NOT: call to the codegen test.

Contributor Author

I added the annotation and filecheck adaptation in a5ba248

Add #[inline(always)] to inner function and check not for filecheck test