@@ -285,16 +317,16 @@ To inspect collision resistance and distribution shapes for our hashers, see __[
285
317
> For CUDA benchmarks, the Nvidia H100 GPUs were used.
286
318
> <sup>1</sup> Unlike other libraries, LibC requires strings to be NULL-terminated.
287
319
> <sup>2</sup> Six whitespaces in the ASCII set are: ` \t\n\v\f\r`. Python's and other standard libraries have specialized functions for those.
288
-
> <sup>3</sup> Most Python libraries for strings are also implemented in C.
289
-
> <sup>4</sup> Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
290
-
> <sup>5</sup> All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
320
+
> <sup>3</sup> All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
291
321
> The C++ STL and StringZilla benchmarks used a 64-bit [Mersenne Twister][faq-mersenne-twister] as the generator.
292
322
> For C, C++, and StringZilla, an in-place update of the string was used.
293
323
> In Python every string had to be allocated as a new object, which makes it less fair.
294
-
> <sup>6</sup> Contrary to popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
324
+
> <sup>4</sup> Contrary to popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
295
325
> That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation.
296
326
> The latter is very common in database engines and is most similar to `numpy.argsort`.
297
327
> The current StringZilla solution can be at least 4x faster without loss of generality.
328
+
> <sup>5</sup> Most Python libraries for strings are also implemented in C.
329
+
> <sup>6</sup> Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
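
The "sorted order" permutation mentioned in the sorting note can be illustrated with plain Python. This is a minimal standard-library sketch of the concept, not StringZilla's implementation; the helper name `argsort_strings` is hypothetical.

```python
def argsort_strings(strings):
    """Return the indices that would sort `strings`, similar to `numpy.argsort`."""
    # Sort the positions by the string each one refers to. `sorted` is stable,
    # so equal strings keep their original relative order.
    return sorted(range(len(strings)), key=strings.__getitem__)


names = ["banana", "apple", "cherry"]
order = argsort_strings(names)
print(order)                      # [1, 0, 2]
print([names[i] for i in order])  # ['apple', 'banana', 'cherry']
```

The permutation leaves the input untouched, which is why it suits database-style workloads where several columns must be reordered by one string key.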
@@ -521,6 +553,30 @@ OpenSSL (powering `hashlib`) has faster Assembly kernels, but StringZilla avoids
521
553
- OpenSSL-backed `hashlib.sha256`: 12.6s
522
554
- StringZilla end-to-end: 4.0s — __3× faster!__
523
555
556
+
### Unicode Case-Folding and Case-Insensitive Search
557
+
558
+
StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
559
+
Unlike most libraries, which can only lower-case the ASCII-represented English alphabet, StringZilla covers 1M+ codepoints.
560
+
The case-folding API expects the output buffer to be at least 3× larger than the input, to accommodate worst-case character expansion.
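
To see where a 3× bound can come from, the sketch below measures UTF-8 growth under Unicode full case folding. It uses Python's built-in `str.casefold` as a stand-in, not StringZilla's API, and the helper name `utf8_expansion` is hypothetical.

```python
def utf8_expansion(text):
    """Return (input_bytes, folded_bytes) under Unicode full case folding."""
    folded = text.casefold()
    return len(text.encode("utf-8")), len(folded.encode("utf-8"))


# "Straße" folds to "strasse": two-byte "ß" becomes two one-byte letters.
print(utf8_expansion("Straße"))  # (7, 7)

# U+0390 (Greek iota with dialytika and tonos) is 2 bytes in UTF-8 and folds
# to three 2-byte codepoints: U+03B9 U+0308 U+0301.
print(utf8_expansion("\u0390"))  # (2, 6)
```

A string made up only of such characters triples in size, so a smaller output buffer could overflow.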
### Unicode Case-Folding and Case-Insensitive Search
992
+
993
+
StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
994
+
Unlike most libraries, which can only lower-case the ASCII-represented English alphabet, StringZilla covers 1M+ codepoints.
995
+
The case-folding API expects the output buffer to be at least 3× larger than the input, to accommodate worst-case character expansion.
996
+
997
+
```c
998
+
char source[] = "Straße"; // German: "Street"
999
+
char destination[64]; // Must be at least 3x source length
### Unicode Case-Folding and Case-Insensitive Search
1748
+
1749
+
StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
1750
+
Unlike most libraries, which can only lower-case the ASCII-represented English alphabet, StringZilla covers 1M+ codepoints.
1751
+
The case-folding API expects the output buffer to be at least 3× larger than the input, to accommodate worst-case character expansion.
1752
+
1753
+
```rust
1754
+
use stringzilla::stringzilla as sz;
1755
+
1756
+
let source = "Straße"; // German: "Street"
1757
+
let mut dest = [0u8; 64]; // Must be at least 3x source length
println!("Found at byte {} with length {}", offset, len);
1779
+
}
1780
+
}
1781
+
```
1782
+
1783
+
1650
1784
### Similarity Scores
1651
1785
1652
1786
StringZilla exposes high-performance, batch-oriented similarity via the `szs` module.
@@ -2329,7 +2463,7 @@ Very small inputs fall back to insertion sort.
2329
2463
- Average time complexity: O(n log n)
2330
2464
- Worst-case time complexity: quadratic (due to QuickSort), mitigated in practice by 3‑way partitioning and the n‑gram staging
2331
2465
2332
-
### Unicode, UTF-8, and Wide Characters
2466
+
### Unicode 17, UTF-8, and Wide Characters
2333
2467
2334
2468
Most StringZilla operations are byte-level, so they work well with ASCII and UTF-8 content out of the box.
2335
2469
In some cases, like edit-distance computation, the result of byte-level evaluation and character-level evaluation may differ.
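
The difference is easy to reproduce with a textbook Levenshtein distance. The sketch below is a plain-Python illustration, not StringZilla's implementation; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over any two sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


left, right = "na\u00efve", "naive"  # "naïve" with a precomposed U+00EF
print(levenshtein(left, right))                                  # 1 (characters)
print(levenshtein(left.encode("utf-8"), right.encode("utf-8")))  # 2 (bytes)
```

The single character `ï` occupies two bytes in UTF-8, so replacing it with one-byte `i` costs a substitution plus a deletion at the byte level.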
@@ -2339,10 +2473,30 @@ In some cases, like edit-distance computation, the result of byte-level evaluati
2339
2473
2340
2474
Java, JavaScript, Python 2, C#, and Objective-C, however, use wide characters (`wchar`), two-byte code units, instead of the more reasonable fixed-length UTF-32 or variable-length UTF-8.
2341
2475
This leads [to all kinds of offset-counting issues][wide-char-offsets] when facing four-byte-long Unicode characters.
2342
-
So consider transcoding with [simdutf](https://github.com/simdutf/simdutf), if you are coming from such environments.
2476
+
StringZilla uses proper 32-bit "runes" to represent unpacked Unicode codepoints, ensuring correct results in all operations.
2477
+
Moreover, it implements the Unicode 17.0 standard, being practically the only library besides ICU and PCRE2 to do so, but with order(s) of magnitude better performance.
The search returns byte offsets and lengths in the original haystack, correctly handling length differences.
2493
+
For example, searching for `"STRASSE"` (7 bytes) in `"Straße"` (7 bytes: 53 74 72 61 C3 9F 65) succeeds because both case-fold to `"strasse"`.
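
The byte offset and length semantics can be modeled with a deliberately naive reference search. This sketch relies on Python's `str.casefold` and a hypothetical helper `find_case_insensitive`; it shows what the returned numbers mean, not how StringZilla computes them.

```python
def find_case_insensitive(haystack, needle):
    """Return (byte_offset, byte_length) of the first case-folded match, or None."""
    folded_needle = needle.casefold()
    for start in range(len(haystack)):
        for end in range(start + 1, len(haystack) + 1):
            folded = haystack[start:end].casefold()
            if folded == folded_needle:
                # Report positions in UTF-8 bytes of the original haystack.
                byte_offset = len(haystack[:start].encode("utf-8"))
                byte_length = len(haystack[start:end].encode("utf-8"))
                return byte_offset, byte_length
            if len(folded) >= len(folded_needle):
                break  # a longer slice can only fold to a longer string
    return None


print(find_case_insensitive("Die Straße", "STRASSE"))  # (4, 7)
```

The match covers six characters of the haystack but seven bytes, and the reported length refers to the haystack, not to the needle.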
2494
+
2495
+
Note that Turkish `İ` and ASCII `I` are distinct: `İstanbul` case-folds to `i̇stanbul` (with combining dot), while `ISTANBUL` case-folds to `istanbul` (without).
2496
+
They will not match each other: this is the default, locale-independent Unicode full case folding, which does not apply the Turkic-specific mappings.
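
The same folding results can be checked with Python's built-in `str.casefold`, used here only as an independent reference for Unicode case folding rather than as StringZilla's API.

```python
dotted = "\u0130stanbul"  # starts with U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
plain = "ISTANBUL"

# U+0130 folds to "i" followed by U+0307 (COMBINING DOT ABOVE).
print(dotted.casefold() == "i\u0307stanbul")  # True
print(plain.casefold() == "istanbul")         # True

# The extra combining mark keeps the two folded strings distinct.
print(dotted.casefold() == plain.casefold())  # False
```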
2497
+
2498
+
For wide-character environments (Java, JavaScript, Python 2, C#), consider transcoding with [simdutf](https://github.com/simdutf/simdutf).
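
The offset-counting problem in wide-character environments shows up as soon as a string contains a codepoint outside the Basic Multilingual Plane. The following standard-library sketch uses a hypothetical helper `offsets_of` to compare the three ways of counting.

```python
def offsets_of(text, needle):
    """Position of `needle` in code points, UTF-16 code units, and UTF-8 bytes."""
    index = text.index(needle)  # Python 3 strings index by code point
    prefix = text[:index]
    return {
        "code_points": index,
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,
        "utf8_bytes": len(prefix.encode("utf-8")),
    }


# U+1F600 needs a surrogate pair in UTF-16 and four bytes in UTF-8.
print(offsets_of("a\U0001F600b", "b"))
# {'code_points': 2, 'utf16_units': 3, 'utf8_bytes': 5}
```

An offset produced in one of these units has to be converted before it is used to index a buffer held in another.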
2499
+
2346
2500
## Dynamic Dispatch
2347
2501
2348
2502
Due to the high level of fragmentation of SIMD support in different CPUs, StringZilla uses the names of select Intel and ARM CPU generations for its backends.