How to cache string representation of ASCII bytes? #112315

Peter-Juhasz · 2025-02-09T07:39:20Z

Peter-Juhasz
Feb 9, 2025

I would like to save new string allocations for common patterns while reading from a stream/pipe where these strings occur as an ASCII-encoded stream of bytes.

Example use case: read common header names from HTTP/1.

First approach
Build a FrozenDictionary<byte[], string> from the Strings I want to cache as an end-to-end mapping, and compare the ReadOnlySpan<byte> during reads before allocating a new String.

But it turns out:

- FrozenDictionary<byte[], ...> does not support alternate lookup for ReadOnlySpan<byte>
- Even if it did, is there any comparer I could use? At least I could build one using:
  - There is Ascii.EqualsIgnoreCase for equality comparison
  - But how do I implement GetHashCode for the byte representation which can ignore casing?

Second approach
Use a FrozenSet<string> with StringComparer.OrdinalIgnoreCase to store the strings I want to cache.

But this approach requires an intermediate step of converting bytes to chars, possibly using Ascii.ToUtf16 in a temporary buffer, which seems unnecessary.
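For reference, a minimal sketch of what this second approach could look like, assuming .NET 9 (where FrozenSet<string> built with StringComparer.OrdinalIgnoreCase can be queried through an alternate lookup keyed by ReadOnlySpan<char>). The class name `HeaderNameInterner`, the list of known strings, and the 128-byte stack limit are made up for illustration.

```csharp
using System;
using System.Buffers;
using System.Collections.Frozen;
using System.Text;

// Hypothetical helper for illustration; not part of .NET.
public static class HeaderNameInterner
{
    // Illustrative set of strings worth caching.
    private static readonly FrozenSet<string> Known =
        new[] { "Content-Type", "Content-Length", "Date" }
            .ToFrozenSet(StringComparer.OrdinalIgnoreCase);

    public static string GetString(ReadOnlySpan<byte> ascii)
    {
        const int MaxStack = 128; // assumed upper bound for cacheable names

        if (ascii.Length <= MaxStack)
        {
            // The intermediate step mentioned above: widen bytes to chars in a temporary buffer.
            Span<char> chars = stackalloc char[MaxStack];
            if (Ascii.ToUtf16(ascii, chars, out int written) == OperationStatus.Done)
            {
                ReadOnlySpan<char> key = chars[..written];
                var lookup = Known.GetAlternateLookup<ReadOnlySpan<char>>();

                // Return the cached instance on a hit; allocate only on a miss.
                return lookup.TryGetValue(key, out string? cached) ? cached : new string(key);
            }
        }

        // Too long or not pure ASCII: fall back to a plain allocation.
        return Encoding.ASCII.GetString(ascii);
    }
}
```

On a hit the returned string is the cached instance (with the cached casing), so no new string is allocated; the cost is the widening copy plus the hash and compare.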

Third approach
Use Ascii.EqualsIgnoreCase(readSpan, "MyString"u8) to compare manually, but while it would work for a few, it doesn't scale.

What would you recommend for solving this problem? Thank you.

EgorBo · 2025-02-09T08:16:35Z

EgorBo
Feb 9, 2025
Collaborator

> But how do I implement GetHashCode for the byte representation which can ignore casing?

Assuming it's ASCII only and mostly letters, you can just unconditionally apply | 0x20 to each byte (e.g. via some vectorized way) and calculate hashcode for the resulting Span. Alternatively, you can just copy-paste GetNonRandomizedHashCodeOrdinalIgnoreCase impl from dotnet/runtime (without the ICU fallback if you're sure it's ASCII only).

If the alternate lookup is not working, as a trivial workaround - use map [Frozen]Dictionary<int, string[]> ? where int is for your custom hashcode
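A rough sketch of both suggestions combined, under a few assumptions: the type `AsciiStringCache` is hypothetical, the hash is a simple scalar FNV-1a (not the runtime's internal implementation and not vectorized), and the input is assumed to be ASCII. Because `| 0x20` also folds a few non-letter pairs together (for example `@` and a backtick), the final equality check is what guarantees correctness.

```csharp
using System;
using System.Collections.Frozen;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; not part of .NET.
public sealed class AsciiStringCache
{
    // Custom hash code -> all cached strings sharing that hash.
    private readonly FrozenDictionary<int, string[]> _buckets;

    public AsciiStringCache(IEnumerable<string> knownAsciiStrings)
    {
        _buckets = knownAsciiStrings
            .GroupBy(s => Hash(Encoding.ASCII.GetBytes(s)))
            .ToFrozenDictionary(g => g.Key, g => g.ToArray());
    }

    // FNV-1a over the bytes with bit 0x20 forced on, so 'A' and 'a' hash identically.
    public static int Hash(ReadOnlySpan<byte> ascii)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in ascii)
            {
                hash = (hash ^ (uint)(b | 0x20)) * 16777619;
            }
            return (int)hash;
        }
    }

    public string GetOrCreate(ReadOnlySpan<byte> ascii)
    {
        if (_buckets.TryGetValue(Hash(ascii), out string[]? candidates))
        {
            foreach (string candidate in candidates)
            {
                // Rule out hash collisions with a real case-insensitive comparison.
                if (Ascii.EqualsIgnoreCase(ascii, candidate))
                {
                    return candidate;
                }
            }
        }

        // Not a known string: allocate as usual.
        return Encoding.ASCII.GetString(ascii);
    }
}
```

Usage would be along the lines of `var cache = new AsciiStringCache(new[] { "Content-Type", "Content-Length" });` followed by `cache.GetOrCreate(headerNameBytes)`, where `headerNameBytes` stands for whatever span was read from the pipe.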

3 replies

Peter-Juhasz Feb 9, 2025
Author

> Assuming it's ASCII only and mostly letters, you can just unconditionally apply | 0x20 to each byte (e.g. via some vectorized way) and calculate hashcode for the resulting Span.

So, you suggest I could convert the bytes to same case first and then calculate the hash code? There are APIs Ascii.ToLowerInPlace/ToUpperInPlace and HashCode.AddBytes to do this, it could work.

> [...] use map [Frozen]Dictionary<int, string[]> ? where int is for your custom hashcode

And to work around the alternate-lookup limitation, you suggest using the dictionary only for the hash code lookup, and then doing the equality check manually with Ascii.EqualsIgnoreCase once I have a match (to make sure it is not a collision)? (There is a tiny chance that some runs would have one or more hash collisions, so those runs would not fully benefit from the caching, but I think that is fine.)
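To make the first idea concrete, here is a sketch of a case-insensitive hash built only from public APIs. It uses Ascii.ToLower (the copying variant, so the input buffer is left untouched) and HashCode.AddBytes. The class name `AsciiIgnoreCaseHash` and the 64-byte chunk size are made up for illustration. Note that HashCode is seeded randomly per process, so these values are only meaningful within a single run.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper for illustration; not part of .NET.
public static class AsciiIgnoreCaseHash
{
    public static int Compute(ReadOnlySpan<byte> ascii)
    {
        // Lower-case into a small stack buffer, chunk by chunk, so that
        // inputs of any length are handled without a heap allocation.
        Span<byte> buffer = stackalloc byte[64];
        var hash = new HashCode();

        while (!ascii.IsEmpty)
        {
            ReadOnlySpan<byte> chunk = ascii[..Math.Min(ascii.Length, buffer.Length)];

            if (Ascii.ToLower(chunk, buffer, out int written) != OperationStatus.Done)
            {
                throw new ArgumentException("Input is not pure ASCII.", nameof(ascii));
            }

            hash.AddBytes(buffer[..written]);
            ascii = ascii[chunk.Length..];
        }

        return hash.ToHashCode();
    }
}
```

The chunk boundaries depend only on the input length, so two inputs that differ only in casing are always fed to HashCode in the same way and therefore produce the same value.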

EgorBo Feb 9, 2025
Collaborator

> Ascii.ToLowerInPlace/ToUpperInPlace and HashCode.AddBytes to do this, it could work.

If you care about performance, you might want to avoid them and stick to GetNonRandomizedHashCodeOrdinalIgnoreCase, because the hash code doesn't have to be precise. Normally case is changed only for letters, but for a hash code it's fine to just apply | 0x20 to all bytes.

Peter-Juhasz Feb 9, 2025
Author

Thanks for pointing to the internal implementation, it is super useful for learning. But I would prefer to stay within the APIs .NET provides, for two reasons: a) I don't want to introduce unsafe code, and b) I don't have the mental capacity right now to fully understand the unsafe code (which deals with UTF-16 strings) and adapt it to the ASCII case. I just wanted to make a "straightforward" optimization from the pieces we already have and keep the scope at that level.

This is my draft:
https://gist.github.com/Peter-Juhasz/e2fabb3173859d34d2c61074cb3b31f3

neon-sunset · 2025-02-09T12:03:46Z

neon-sunset
Feb 9, 2025

> I would like to save new string allocations for common patterns while reading from a stream/pipe where these strings occur as an ASCII-encoded stream of bytes.

The lookup and caching overall will be more expensive than just doing a transient allocation which dies in Gen 0. You cannot cheat this: the hash code calculation, the subsequent indexing into the intern table, and the comparison will cost more CPU time than just taking a slice and dealing with it as is, even if you use something like GxHash, and especially if you normalize the casing.

If you can enforce the lifetime of the buffer that observes network input, you can simply take a span out of it and use that instead. Alternatively, you can hand-roll a simple bytestring-like primitive that implements the standard comparison operations. It really comes down to the use case, but I wanted to point out that complicated caching/interning strategies which require case conversion, hashing, and then a lookup have a non-trivial cost, often greater than plain allocations.
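As an illustration of what such a hand-rolled primitive might look like (a sketch only; `AsciiToken` is a made-up name and this is not how U8String does it), a small struct can wrap a slice of the receive buffer and provide ASCII case-insensitive equality and hashing without creating a string until one is really needed. It is only valid for as long as the underlying buffer is, and it assumes the bytes are ASCII.

```csharp
using System;
using System.Text;

// Hypothetical type for illustration; not part of .NET.
public readonly struct AsciiToken : IEquatable<AsciiToken>
{
    // Slice of the caller's buffer; the caller must keep that buffer alive.
    private readonly ReadOnlyMemory<byte> _bytes;

    public AsciiToken(ReadOnlyMemory<byte> bytes) => _bytes = bytes;

    public ReadOnlySpan<byte> Span => _bytes.Span;

    // ASCII case-insensitive equality against another token or raw bytes.
    public bool Equals(AsciiToken other) => Ascii.EqualsIgnoreCase(Span, other.Span);

    public bool Equals(ReadOnlySpan<byte> other) => Ascii.EqualsIgnoreCase(Span, other);

    public override bool Equals(object? obj) => obj is AsciiToken token && Equals(token);

    // FNV-1a with bit 0x20 forced on, so casing does not affect the hash.
    public override int GetHashCode()
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Span)
            {
                hash = (hash ^ (uint)(b | 0x20)) * 16777619;
            }
            return (int)hash;
        }
    }

    // Allocates; call only when a real string is required.
    public override string ToString() => Encoding.ASCII.GetString(Span);
}
```

Because it implements IEquatable<AsciiToken> and GetHashCode consistently, such a token could be used directly as a dictionary key, which sidesteps the alternate-lookup question at the price of managing buffer lifetime.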

The most important choice here is whether to transcode the input upfront or keep the data in ASCII/UTF-8 form, and then which container is appropriate for that.

There is some "preliminary" work that I've done to research this in the past so feel free to steal any ideas from here:

https://github.com/U8String/U8String/blob/main/Sources/U8String/Comparers/U8AsciiIgnoreCaseComparer.cs
https://github.com/U8String/U8String/blob/main/Sources/U8String/CaseConverters/U8AsciiCaseConverter.cs

(the ascii-case-insensitive hashcode implementation there is reasonably efficient but can be done better)

Also, given how fast ASCII case conversion is, it's probably best to just normalize header names/values, for example, rather than deal with comparison/hashing complexity. With that said, you have to be mindful of the cost and the frequency of repeated strings in the incoming data for the interning to be worth it. If most strings are new/unique, trying to look something up first is a permanent overhead.

If you can make whatever consumes the strings consume ASCII byte spans instead, that would be the best; if not, at least char spans, with pooled or stack-allocated temporary buffers. This will help with memory churn. Otherwise, you might as well just ensure the data is allocated and transcoded only once, and then focus on winning performance elsewhere in the application. Data transcoding, unless all the application does is ingest text and send it over a different transport, is a surprisingly small part of the overall profile, even in high-throughput applications.

P.S.: be very careful with ConditionalWeakTable. It can come at a steep GC cost if entries are added and then die quickly; I do not recommend it 😅

2 replies

Peter-Juhasz Feb 9, 2025
Author

Thanks for your valuable insights!

> The lookup and caching overall will be more expensive than just doing a transient allocation which dies in Gen 0.

You are right, we need to measure whether the "optimized" implementation is actually better overall (including GC cost).

> https://github.com/U8String/U8String/blob/main/Sources/U8String/Comparers/U8AsciiIgnoreCaseComparer.cs
> https://github.com/U8String/U8String/blob/main/Sources/U8String/CaseConverters/U8AsciiCaseConverter.cs

I'll take a look at the links, thanks!

> If you can make whatever consumes the strings consume ASCII byte spans instead, that would be the best [...]

Our use case is a low-level HTTP client implementation, where we read the same header names (and often the same header values as well) in raw byte form from each response, but need to pass those values to higher-level APIs that use a regular string representation. This is where the idea came from: if we need to construct and allocate the same ~20 strings from the same bytes when processing each response, how could we cache and reuse them?

MihaZupan Feb 9, 2025
Collaborator

If the use case is getting header name/value strings out of bytes, you can look at Kestrel and HttpClient implementations for inspiration. Often you can narrow down the possible well-known candidates with just a couple of comparisons instead of a hash-based lookup.
For example, this is how HttpClient does it for header names (runtime/src/libraries/System.Net.Http/src/System/Net/Http/Headers/KnownHeaders.cs, line 160 in 34ec4f5):

private static KnownHeader? GetCandidate<T>(T key)
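A simplified sketch of the "narrow down with a couple of comparisons" idea, not the actual HttpClient code: switch on the length (and, where several names share a length, on one distinguishing byte) to pick at most one candidate, then confirm it with a single Ascii.EqualsIgnoreCase call. The class name `KnownHeaderNames` and the handful of header names are chosen only for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not the real KnownHeaders implementation.
public static class KnownHeaderNames
{
    public static string GetString(ReadOnlySpan<byte> name)
    {
        // Step 1: pick at most one candidate using cheap checks only.
        string? candidate = name.Length switch
        {
            // Three known names have length 4; the lower-cased first byte tells them apart.
            4 => (name[0] | 0x20) switch
            {
                'd' => "Date",
                'e' => "ETag",
                'h' => "Host",
                _ => null,
            },
            6 => "Server",
            12 => "Content-Type",
            14 => "Content-Length",
            _ => null,
        };

        // Step 2: one full comparison confirms or rejects the candidate.
        return candidate is not null && Ascii.EqualsIgnoreCase(name, candidate)
            ? candidate
            : Encoding.ASCII.GetString(name);
    }
}
```

With around 20 well-known names, this avoids computing a hash over the whole input: a known name costs one length switch, possibly one byte check, and one comparison.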
