Commit b69c49d

Docs: UTF-8 Fold & Search with Perf numbers
150x improvement over PyICU `icu.StringSearch` baseline
1 parent 48a8ccb commit b69c49d

1 file changed

+164
-10
lines changed

README.md

Lines changed: 164 additions & 10 deletions
@@ -61,6 +61,38 @@ __Who is this for?__
6161
<th align="center" width="25%">Python</th>
6262
<th align="center" width="25%">StringZilla</th>
6363
</tr>
64+
<!-- Unicode case-folding -->
65+
<tr>
66+
<td colspan="4" align="center">Unicode case-folding, expanding characters like <code>ß</code> → <code>ss</code></td>
67+
</tr>
68+
<tr>
69+
<td align="center">⚪</td>
70+
<td align="center">⚪</td>
71+
<td align="center">
72+
<code>.casefold</code><br/>
73+
<span style="color:#ABABAB;">x86:</span> <b>0.4</b> GB/s
74+
</td>
75+
<td align="center">
76+
<code>sz.utf8_case_fold</code><br/>
77+
<span style="color:#ABABAB;">x86:</span> <b>1.3</b> GB/s
78+
</td>
79+
</tr>
80+
<!-- Unicode case-insensitive search -->
81+
<tr>
82+
<td colspan="4" align="center">Unicode case-insensitive substring search</td>
83+
</tr>
84+
<tr>
85+
<td align="center">⚪</td>
86+
<td align="center">⚪</td>
87+
<td align="center">
88+
<code>icu.StringSearch</code><br/>
89+
<span style="color:#ABABAB;">x86:</span> <b>0.02</b> GB/s
90+
</td>
91+
<td align="center">
92+
<code>utf8_case_insensitive_find</code><br/>
93+
<span style="color:#ABABAB;">x86:</span> <b>3.0</b> GB/s
94+
</td>
95+
</tr>
6496
<!-- Substrings, normal order -->
6597
<tr>
6698
<td colspan="4" align="center">find the first occurrence of a random word from text, ≅ 5 bytes long</td>
@@ -155,7 +187,7 @@ __Who is this for?__
155187
</tr>
156188
<!-- Random Generation -->
157189
<tr>
158-
<td colspan="4" align="center">Random string from a given alphabet, 20 bytes long <sup>5</sup></td>
190+
<td colspan="4" align="center">Random string from a given alphabet, 20 bytes long <sup>3</sup></td>
159191
</tr>
160192
<tr>
161193
<td align="center">
@@ -203,7 +235,7 @@ __Who is this for?__
203235
</tr>
204236
<!-- Sorting -->
205237
<tr>
206-
<td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>6</sup></td>
238+
<td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>4</sup></td>
207239
</tr>
208240
<tr>
209241
<td align="center">
@@ -235,7 +267,7 @@ __Who is this for?__
235267
<td align="center">⚪</td>
236268
<td align="center">⚪</td>
237269
<td align="center">
238-
via <code>NLTK</code> <sup>3</sup> and <code>CuDF</code><br/>
270+
via <code>NLTK</code> <sup>5</sup> and <code>CuDF</code><br/>
239271
<span style="color:#ABABAB;">x86:</span> <b>1,615,306</b> &centerdot;
240272
<span style="color:#ABABAB;">arm:</span> <b>1,349,980</b> &centerdot;
241273
<span style="color:#ABABAB;">cuda:</span> <b>6,532,411,354</b> CUPS
@@ -255,7 +287,7 @@ __Who is this for?__
255287
<td align="center">⚪</td>
256288
<td align="center">⚪</td>
257289
<td align="center">
258-
via <code>biopython</code> <sup>4</sup><br/>
290+
via <code>biopython</code> <sup>6</sup><br/>
259291
<span style="color:#ABABAB;">x86:</span> <b>575,981,513</b> &centerdot;
260292
<span style="color:#ABABAB;">arm:</span> <b>436,350,732</b> CUPS
261293
</td>
@@ -285,16 +317,16 @@ To inspect collision resistance and distribution shapes for our hashers, see __[
285317
> For CUDA benchmarks, the Nvidia H100 GPUs were used.
286318
> <sup>1</sup> Unlike other libraries, LibC requires strings to be NULL-terminated.
287319
> <sup>2</sup> Six whitespaces in the ASCII set are: ` \t\n\v\f\r`. Python's and other standard libraries have specialized functions for those.
288-
> <sup>3</sup> Most Python libraries for strings are also implemented in C.
289-
> <sup>4</sup> Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
290-
> <sup>5</sup> All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
320+
> <sup>3</sup> All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
291321
> The C++ STL and StringZilla benchmarks used a 64-bit [Mersenne Twister][faq-mersenne-twister] as the generator.
292322
> For C, C++, and StringZilla, an in-place update of the string was used.
293323
> In Python every string had to be allocated as a new object, which makes it less fair.
294-
> <sup>6</sup> Contrary to the popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
324+
> <sup>4</sup> Contrary to the popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
295325
> That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation.
296326
> The latter is very common in database engines and is most similar to `numpy.argsort`.
297327
> The current StringZilla solution can be at least 4x faster without loss of generality.
328+
> <sup>5</sup> Most Python libraries for strings are also implemented in C.
329+
> <sup>6</sup> Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
298330
299331
[faq-mersenne-twister]: https://en.wikipedia.org/wiki/Mersenne_Twister
300332

@@ -521,6 +553,30 @@ OpenSSL (powering `hashlib`) has faster Assembly kernels, but StringZilla avoids
521553
- OpenSSL-backed `hashlib.sha256`: 12.6s
522554
- StringZilla end-to-end: 4.0s — __3× faster!__
523555

556+
### Unicode Case-Folding and Case-Insensitive Search
557+
558+
StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
559+
Unlike most libraries, which can only lower-case the ASCII letters of the English alphabet, StringZilla covers over 1M codepoints.
560+
The case-folding API expects the output buffer to be at least 3× the size of the input, to accommodate worst-case character expansions.
561+
562+
```python
563+
import stringzilla as sz
564+
565+
sz.utf8_case_fold('HELLO') # b'hello'
566+
sz.utf8_case_fold('Straße') # b'strasse' — ß (1 char) expands to "ss" (2 chars)
567+
sz.utf8_case_fold('eﬃcient') # b'efficient': the ﬃ ligature (1 char) expands to "ffi" (3 chars)
568+
```
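
The 3× bound can be sanity-checked without StringZilla. The sketch below uses only Python's built-in `str.casefold`, which follows the Unicode data bundled with the interpreter rather than StringZilla's own tables, so it is an independent illustration and not a description of the library's internals. The helper name `worst_case_fold_expansion` is hypothetical.

```python
def worst_case_fold_expansion():
    """Scan every Unicode scalar value and return (ratio, codepoint) for the
    largest UTF-8 growth caused by full case folding."""
    worst_ratio, worst_cp = 1.0, 0
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        before = len(ch.encode("utf-8"))
        after = len(ch.casefold().encode("utf-8"))
        if after / before > worst_ratio:
            worst_ratio, worst_cp = after / before, cp
    return worst_ratio, worst_cp


ratio, cp = worst_case_fold_expansion()
print(f"worst expansion: {ratio:.1f}x at U+{cp:04X}")
```

The factor of three is reached by characters such as U+0390, which takes 2 bytes in UTF-8 and folds to three 2-byte codepoints (6 bytes), so a destination buffer of three times the input length is enough for any input.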
569+
570+
The case-insensitive search returns the byte offset of the match, handling expansions correctly.
571+
572+
```python
573+
import stringzilla as sz
574+
575+
sz.utf8_case_insensitive_find('Der große Hund', 'GROSSE') # 4: finds "große" at byte offset 4
576+
sz.utf8_case_insensitive_find('Straße', 'STRASSE') # 0 — ß matches "SS"
577+
sz.utf8_case_insensitive_find('eﬃcient', 'EFFICIENT') # 0: the ﬃ ligature matches "FFI"
578+
```
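
To make "handling expansions correctly" concrete, here is a slow reference sketch in pure Python. It is not StringZilla's algorithm: it folds the haystack one character at a time with `str.casefold` and records which folded positions correspond to character boundaries in the original UTF-8 bytes. The function name `casefold_find` and the rule that a match must start and end on an original character boundary are assumptions made for this illustration.

```python
def casefold_find(haystack: str, needle: str):
    """Return (byte_offset, byte_length) of the first case-insensitive match
    in the UTF-8 encoding of `haystack`, or None if there is none."""
    folded_needle = needle.casefold()
    boundary = {0: 0}  # folded index -> byte offset in the original haystack
    folded_parts = []
    folded_pos = byte_pos = 0
    for ch in haystack:
        folded = ch.casefold()
        folded_parts.append(folded)
        folded_pos += len(folded)
        byte_pos += len(ch.encode("utf-8"))
        boundary[folded_pos] = byte_pos
    folded_haystack = "".join(folded_parts)

    start = folded_haystack.find(folded_needle)
    while start != -1:
        end = start + len(folded_needle)
        # Reject matches that begin or end inside an expansion such as ß -> "ss"
        if start in boundary and end in boundary:
            return boundary[start], boundary[end] - boundary[start]
        start = folded_haystack.find(folded_needle, start + 1)
    return None


print(casefold_find("Der große Hund", "GROSSE"))  # (4, 6)
print(casefold_find("Straße", "STRASSE"))         # (0, 7)
print(casefold_find("Straße", "se"))              # None
```

The last call returns `None` because the only candidate match would start in the middle of the folded `ß`. Whether such partial matches should be reported is a design decision, and this sketch simply rejects them.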
579+
524580
### Collection-Level Operations
525581

526582
Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices with minimal memory footprint.
@@ -932,6 +988,46 @@ auto b = "some string"_sv; // sz::string_view
932988

933989
[stl-literal]: https://en.cppreference.com/w/cpp/string/basic_string_view/operator%22%22sv
934990

991+
### Unicode Case-Folding and Case-Insensitive Search
992+
993+
StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
994+
Unlike most libraries, which can only lower-case the ASCII letters of the English alphabet, StringZilla covers over 1M codepoints.
995+
The case-folding API expects the output buffer to be at least 3× the size of the input, to accommodate worst-case character expansions.
996+
997+
```c
998+
char source[] = "Straße"; // German: "Street"
999+
char destination[64]; // Must be at least 3x source length
1000+
sz_size_t result_len = sz_utf8_case_fold(source, strlen(source), destination);
1001+
// destination now contains "strasse" (7 bytes), result_len = 7
1002+
```
1003+
1004+
The case-insensitive search API returns a pointer to the start of the first matching character in the haystack, or `NULL` if not found.
1005+
It outputs the length of the matched haystack substring in bytes, and accepts a metadata structure to speed up repeated searches for the same needle.
1006+
1007+
```c
1008+
sz_utf8_case_insensitive_needle_metadata_t metadata = {0}; // `{0}` zero-initializes portably; `{}` requires C23
1009+
sz_size_t match_length;
1010+
sz_cptr_t match = sz_utf8_case_insensitive_find(
1011+
haystack, haystack_len,
1012+
needle, needle_len,
1013+
&metadata, // Reuse for queries with the same needle
1014+
&match_length // Output: bytes consumed in haystack
1015+
);
1016+
```
1017+
1018+
The same functionality is available in C++:
1019+
1020+
```cpp
1021+
namespace sz = ashvardanian::stringzilla;
1022+
1023+
sz::string_view text = "Hello World"; // Single search
1024+
auto [offset, length] = text.utf8_case_insensitive_find("HELLO");
1025+
1026+
sz::utf8_case_insensitive_needle pattern("hello"); // Repeated searches with pre-compiled pattern
1027+
for (auto const& haystack : haystacks)
1028+
auto match = haystack.utf8_case_insensitive_find(pattern);
1029+
```
1030+
9351031
### Similarity Scores
9361032
9371033
StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas/stringzillas.h` header.
@@ -1647,6 +1743,44 @@ let digest = hasher.digest();
16471743
let mac = sz::hmac_sha256(b"secret", b"Hello, world!");
16481744
```
16491745

1746+
1747+
### Unicode Case-Folding and Case-Insensitive Search
1748+
1749+
StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
1750+
Unlike most libraries, which can only lower-case the ASCII letters of the English alphabet, StringZilla covers over 1M codepoints.
1751+
The case-folding API expects the output buffer to be at least 3× the size of the input, to accommodate worst-case character expansions.
1752+
1753+
```rust
1754+
use stringzilla::stringzilla as sz;
1755+
1756+
let source = "Straße"; // German: "Street"
1757+
let mut dest = [0u8; 64]; // Must be at least 3x source length
1758+
let len = sz::utf8_case_fold(source, &mut dest);
1759+
assert_eq!(&dest[..len], b"strasse"); // ß (2 bytes) → "ss" (2 bytes)
1760+
```
1761+
1762+
The case-insensitive search returns `Some((offset, matched_length))` or `None`.
1763+
The `matched_length` may differ from needle length due to expansions.
1764+
1765+
```rust
1766+
use stringzilla::stringzilla::{utf8_case_insensitive_find, Utf8CaseInsensitiveNeedle};
1767+
1768+
// Single search — ß (C3 9F) matches "SS"
1769+
if let Some((offset, len)) = utf8_case_insensitive_find("Straße", "STRASSE") {
1770+
assert_eq!(offset, 0);
1771+
assert_eq!(len, 7); // "Straße" is 7 bytes
1772+
}
1773+
1774+
// Repeated searches with pre-compiled needle metadata
1775+
let needle = Utf8CaseInsensitiveNeedle::new(b"STRASSE");
1776+
for haystack in &["Straße", "STRASSE", "strasse"] {
1777+
if let Some((offset, len)) = utf8_case_insensitive_find(haystack, &needle) {
1778+
println!("Found at byte {} with length {}", offset, len);
1779+
}
1780+
}
1781+
```
1782+
1783+
16501784
### Similarity Scores
16511785

16521786
StringZilla exposes high-performance, batch-oriented similarity via the `szs` module.
@@ -2329,7 +2463,7 @@ Very small inputs fall back to insertion sort.
23292463
- Average time complexity: O(n log n)
23302464
- Worst-case time complexity: quadratic (due to QuickSort), mitigated in practice by 3‑way partitioning and the n‑gram staging
23312465

2332-
### Unicode, UTF-8, and Wide Characters
2466+
### Unicode 17, UTF-8, and Wide Characters
23332467

23342468
Most StringZilla operations are byte-level, so they work well with ASCII and UTF-8 content out of the box.
23352469
In some cases, like edit-distance computation, the result of byte-level evaluation and character-level evaluation may differ.
@@ -2339,10 +2473,30 @@ In some cases, like edit-distance computation, the result of byte-level evaluati
23392473

23402474
Java, JavaScript, Python 2, C#, and Objective-C, however, use wide characters (`wchar`) - two byte long codes, instead of the more reasonable fixed-length UTF-32 or variable-length UTF-8.
23412475
This leads [to all kinds of offset-counting issues][wide-char-offsets] when facing four-byte long Unicode characters.
2342-
So consider transcoding with [simdutf](https://github.com/simdutf/simdutf), if you are coming from such environments.
2476+
StringZilla uses proper 32-bit "runes" to represent unpacked Unicode codepoints, ensuring correct results in all operations.
2477+
Moreover, it implements the Unicode 17.0 standard, which practically no other library besides ICU and PCRE2 does, and it does so with performance that is one or more orders of magnitude better.
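
The offset-counting issue with two-byte wide characters can be reproduced with the Python standard library alone. The sketch below is only an illustration of the encodings involved and does not call StringZilla. It measures the same string in codepoints, UTF-16 code units, UTF-8 bytes, and 32-bit runes, and then locates the letter `b` in each representation.

```python
text = "a\U0001F600b"  # U+1F600 needs 4 bytes in UTF-8 and a surrogate pair in UTF-16

codepoints = len(text)
utf16_units = len(text.encode("utf-16-le")) // 2
utf8_bytes = len(text.encode("utf-8"))
utf32_runes = len(text.encode("utf-32-le")) // 4

# Offset of "b" as seen by each representation
offset_codepoints = text.index("b")
prefix = text[:offset_codepoints]
offset_utf16 = len(prefix.encode("utf-16-le")) // 2
offset_utf8 = len(prefix.encode("utf-8"))

print(codepoints, utf16_units, utf8_bytes, utf32_runes)  # 3 4 6 3
print(offset_codepoints, offset_utf16, offset_utf8)      # 2 3 5
```

Only the 32-bit rune count agrees with the number of codepoints, which is why offsets taken from a UTF-16 environment cannot be reused as codepoint or byte offsets without conversion.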
23432478

23442479
[wide-char-offsets]: https://josephg.com/blog/string-length-lies/
23452480

2481+
### Case-Folding and Case-Insensitive Search
2482+
2483+
StringZilla provides Unicode-aware case-insensitive substring search that handles the full complexity of Unicode case folding.
2484+
This includes multi-character expansions:
2485+
2486+
| Character | Codepoint | UTF-8 Bytes | Case-Folds To | Result Bytes |
2487+
| --------- | --------- | ----------- | ------------- | ------------ |
2488+
| `ß` | U+00DF | C3 9F | `ss` | 73 73 |
2489+
| `ﬃ` | U+FB03 | EF AC 83 | `ffi` | 66 66 69 |
2490+
| `İ` | U+0130 | C4 B0 | `i` + `◌̇` | 69 CC 87 |
2491+
2492+
The search returns byte offsets and lengths in the original haystack, correctly handling length differences.
2493+
For example, searching for `"STRASSE"` (7 bytes) in `"Straße"` (7 bytes: 53 74 72 61 C3 9F 65) succeeds because both case-fold to `"strasse"`.
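
The byte sequences quoted in this section can be cross-checked with Python's built-in `str.casefold`. This is a minimal sketch that relies on the interpreter's Unicode tables rather than on StringZilla, and the helper names `hex_bytes` and `fold_report` are hypothetical.

```python
def hex_bytes(text: str) -> str:
    """UTF-8 encoding of `text` as space-separated uppercase hex bytes."""
    return text.encode("utf-8").hex(" ").upper()


def fold_report(ch: str):
    """Return (source bytes, case-folded string, case-folded bytes) for one character."""
    folded = ch.casefold()
    return hex_bytes(ch), folded, hex_bytes(folded)


for ch in ("\u00DF", "\uFB03", "\u0130"):
    print(f"U+{ord(ch):04X}", *fold_report(ch))

print(hex_bytes("Straße"))                            # 53 74 72 61 C3 9F 65
print("STRASSE".casefold() == "Straße".casefold())   # True
```

Each printed row should line up with the corresponding row of the table, and the final comparison confirms that both spellings fold to the same string.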
2494+
2495+
Note that Turkish `İ` and ASCII `I` are distinct: `İstanbul` case-folds to `i̇stanbul` (with combining dot), while `ISTANBUL` case-folds to `istanbul` (without).
2496+
They will not match each other: this is the default, locale-independent Unicode case folding, and the Turkic-specific mappings (`İ` → `i`, `I` → `ı`) are not applied.
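
The dotted and dotless behavior can be observed with Python's `str.casefold`, which applies the same default Unicode folding. The `turkic_casefold` helper below is a hypothetical application-level tailoring, shown only to contrast the two rule sets. It is not part of StringZilla or of the Python standard library.

```python
dotted = "\u0130stanbul".casefold()  # "İstanbul" -> "i" + U+0307 + "stanbul"
plain = "ISTANBUL".casefold()        # -> "istanbul"

print(dotted == plain)               # False
print(len(dotted), len(plain))       # 9 8


def turkic_casefold(text: str) -> str:
    """Hypothetical Turkic tailoring: İ -> i and I -> ı, then default folding."""
    return text.replace("\u0130", "i").replace("I", "\u0131").casefold()


print(turkic_casefold("\u0130stanbul"))  # istanbul
print(turkic_casefold("ISTANBUL"))       # ıstanbul (with dotless ı)
```

Under either rule set the two spellings fold to different strings, so an application that wants them to compare equal has to normalize its input before searching.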
2497+
2498+
For wide-character environments (Java, JavaScript, Python 2, C#), consider transcoding with [simdutf](https://github.com/simdutf/simdutf).
2499+
23462500
## Dynamic Dispatch
23472501

23482502
Due to the high-level of fragmentation of SIMD support in different CPUs, StringZilla uses the names of select Intel and ARM CPU generations for its backends.
