How to cache string representation of ASCII bytes? #112315
-
Assuming it's ASCII only and mostly letters, you can just unconditionally apply If the alternate lookup is not working, as a trivial workaround - use map |
-
The lookup and caching overall will be more expensive than just doing a transient allocation which dies in Gen 0. You cannot cheat this: hashcode calculation, the subsequent indexing into the intern table, and the comparison will cost more CPU time than just taking a slice and dealing with it as is, even if you use something like GxHash, and especially where you normalize the casing.

If you can enforce the lifetime of the buffer that observes network input, you can simply take a span out of it and use that instead. Alternatively, you can hand-roll a simple bytestring-ish primitive which implements the standard comparison operations.

It really comes down to the use case, but I just wanted to point out that complicated caching/interning strategies which require case conversion, hashing and then a lookup will have non-trivial cost, often greater than plain allocations. The most important choice here is whether to transcode the input upfront or keep the data in ASCII/UTF-8 form, and then which container is appropriate for that.

There is some "preliminary" work that I've done to research this in the past, so feel free to steal any ideas from here: https://github.com/U8String/U8String/blob/main/Sources/U8String/Comparers/U8AsciiIgnoreCaseComparer.cs (the ASCII case-insensitive hashcode implementation there is reasonably efficient but can be done better).

Also, given how fast ASCII case conversion is, it's probably best to just normalize e.g. header names/values rather than deal with comparison/hashing complexity.

With that said, you have to be mindful of the cost and the frequency of repeated strings in the incoming data for the interning to be worth it. If most strings will be new/unique, trying to look something up first is a permanent overhead. If you can make whatever consumes the strings consume ASCII byte spans instead, that would be best; if not, at least char spans with pooled or stack-allocated temporary buffers. This will help with memory churn. Otherwise you might as well just ensure the data is allocated and transcoded only once, and then focus on winning performance elsewhere in the application. Data transcoding, unless all the application does is ingest text and then send it over a different transport, is a surprisingly small part of the overall profile even in high-throughput applications.

p.s.: be very careful with |
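The char-span route mentioned above (transcode into a stack-allocated buffer, then look up before allocating) could be sketched roughly as follows. This is only an illustration, assuming .NET 9 / C# 13 for the alternate lookup API; `KnownHeaders`, the 64-character limit and the sample header names are hypothetical and not from the discussion.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: returns a cached string for well-known names,
// or allocates a new one when the input is not in the set.
static class KnownHeaders
{
    // Sample entries only; StringComparer.OrdinalIgnoreCase supports
    // alternate lookup by ReadOnlySpan<char> on .NET 9.
    private static readonly HashSet<string> Names =
        new(StringComparer.OrdinalIgnoreCase) { "Content-Length", "Content-Type", "Host" };

    public static string GetString(ReadOnlySpan<byte> ascii)
    {
        // Assumption: cached names are short, so long inputs skip the lookup.
        if (ascii.Length > 64)
            return Encoding.ASCII.GetString(ascii);

        // Temporary buffer on the stack, no heap allocation for the lookup key.
        Span<char> chars = stackalloc char[64];
        if (Ascii.ToUtf16(ascii, chars, out int written) != OperationStatus.Done)
            return Encoding.ASCII.GetString(ascii); // input was not pure ASCII

        ReadOnlySpan<char> key = chars[..written];
        var lookup = Names.GetAlternateLookup<ReadOnlySpan<char>>();

        // Hit: reuse the cached instance. Miss: allocate exactly once.
        return lookup.TryGetValue(key, out string? cached) ? cached : new string(key);
    }
}
```

As the reply points out, whether this beats a plain short-lived allocation depends on how often the input actually repeats, so it is worth measuring before adopting it.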
-
I would like to save new string allocations for common patterns while reading from a stream/pipe where these strings occur as an ASCII-encoded stream of bytes.
Example use case: read common header names from HTTP/1.
First approach
Build a `FrozenDictionary<byte[], string>` from the `String`s I want to cache as an end-to-end mapping, and compare the `ReadOnlySpan<byte>` during reads before allocating a new `String`.
But it turns out:
- `FrozenDictionary<byte[], ...>` does not support alternate lookup for `ReadOnlySpan<byte>`
- `Ascii.EqualsIgnoreCase` for equality comparison
- `GetHashCode` for the byte representation which can ignore casing?

Second approach
Use a `FrozenSet<string>` with `StringComparer.OrdinalIgnoreCase` to store the strings I want to cache.
But this approach requires an intermediate step of converting `byte`s to `char`s, possibly using `Ascii.ToUtf16` in a temporary buffer, which seems unnecessary.

Third approach
Use `Ascii.EqualsIgnoreCase(readSpan, "MyString"u8)` to compare manually, but while it would work for a few strings, it doesn't scale.

What is your recommendation for solving this problem? Thank you.
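For reference, the kind of case-insensitive byte hashing asked about in the first approach might look like the sketch below: string keys with a custom comparer that also accepts `ReadOnlySpan<byte>` as an alternate key. It assumes .NET 9 / C# 13 (`IAlternateEqualityComparer`, `GetAlternateLookup`) and ASCII-only keys; `AsciiIgnoreCaseComparer`, `HeaderNameCache`, the FNV-1a hash and the sample names are hypothetical choices for illustration, not something taken from the discussion.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical comparer. Assumes every key is ASCII-only.
sealed class AsciiIgnoreCaseComparer :
    IEqualityComparer<string>,
    IAlternateEqualityComparer<ReadOnlySpan<byte>, string>
{
    public static readonly AsciiIgnoreCaseComparer Instance = new();

    public bool Equals(string? x, string? y) =>
        ReferenceEquals(x, y) ||
        (x is not null && y is not null && Ascii.EqualsIgnoreCase(x.AsSpan(), y.AsSpan()));

    // FNV-1a over case-folded UTF-16 code units.
    public int GetHashCode(string s)
    {
        uint h = 2166136261;
        foreach (char c in s) h = (h ^ Fold(c)) * 16777619;
        return (int)h;
    }

    public bool Equals(ReadOnlySpan<byte> alternate, string other) =>
        Ascii.EqualsIgnoreCase(alternate, other.AsSpan());

    // Same hash over case-folded bytes, so ASCII bytes and chars agree.
    public int GetHashCode(ReadOnlySpan<byte> alternate)
    {
        uint h = 2166136261;
        foreach (byte b in alternate) h = (h ^ Fold(b)) * 16777619;
        return (int)h;
    }

    public string Create(ReadOnlySpan<byte> alternate) => Encoding.ASCII.GetString(alternate);

    // Maps 'A'..'Z' to 'a'..'z' and leaves everything else unchanged.
    private static uint Fold(uint c) => (c >= 'A' && c <= 'Z') ? (c | 0x20u) : c;
}

static class HeaderNameCache
{
    // Sample entries only.
    private static readonly HashSet<string> Names =
        new(AsciiIgnoreCaseComparer.Instance) { "Content-Length", "Content-Type", "Host" };

    public static string GetOrCreate(ReadOnlySpan<byte> ascii)
    {
        var lookup = Names.GetAlternateLookup<ReadOnlySpan<byte>>();
        return lookup.TryGetValue(ascii, out string? cached)
            ? cached
            : Encoding.ASCII.GetString(ascii);
    }
}
```

This avoids the intermediate `byte` to `char` conversion from the second approach, at the cost of hashing every input before the lookup.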