44
55![StringWars Thumbnail](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringWa.rs.jpg?raw=true)
66
7- There are many **great** libraries for string processing!
7+ There are many __great__ libraries for string processing!
88Mostly, of course, written in Assembly, C, and C++, but some in Rust as well.
99
10- Where Rust decimates C and C++, is the **simplicity** of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples across native crates and their Python bindings.
10+ Where Rust decimates C and C++, is the __simplicity__ of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples across native crates and their Python bindings.
1111So, to accelerate the development of the [`StringZilla`](https://github.com/ashvardanian/StringZilla) C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my and the community's most beloved Rust projects, like:
1212
1313- [`memchr`](https://github.com/BurntSushi/memchr) for substring search.
@@ -24,7 +24,7 @@ Notably, I also favor modern hardware with support for a wider range SIMD instru
2424
2525> [!IMPORTANT]
2626> The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method.
27- > Most of them were obtained on Intel Sapphire Rapids **(SPR)** and Granite Rapids **(GNR)** CPUs and Nvidia Hopper-based **H100** and Blackwell-based **RTX 6000** Pro GPUs, using Rust with `-C target-cpu=native` optimization flag.
27+ > Most of them were obtained on Intel Sapphire Rapids __(SPR)__ and Granite Rapids __(GNR)__ CPUs and Nvidia Hopper-based __H100__ and Blackwell-based __RTX 6000__ Pro GPUs, using Rust with `-C target-cpu=native` optimization flag.
2828> To replicate the results, please refer to the [Replicating the Results](#replicating-the-results) section below.
2929
3030## Benchmarks at a Glance
@@ -50,7 +50,34 @@ xxhash.xxh3_64 █████▋ 0.04 ██████
5050
5151See [bench_hash.md](bench_hash.md) for details
5252
53- ### Substring Search
53+ ### Case-Insensitive UTF-8 Search
54+
55+ Unicode-aware case-insensitive search with full case folding (ß↔SS, σ↔ς).
56+ Throughput searching across ~100 MB multilingual corpora:
57+
58+ ```
59+ Rust:
60+ English German
61+ stringzilla ████████████████████ 12.79 ████████████████████ 10.67 GB/s
62+ icu ▏ 0.08 ▏ 0.08 GB/s
63+
64+ Russian Korean
65+ stringzilla ████████████████████ 7.12 ████████████████████ 35.10 GB/s
66+ icu ▏ 0.14 ▏ 0.23 GB/s
67+
68+ Python:
69+ English German
70+ stringzilla ████████████████████ 5.61 ████████████████████ 6.08 GB/s
71+ regex ██▋ 0.77 ███ 0.90 GB/s
72+
73+ Russian Korean
74+ stringzilla ████████████████████ 5.70 ████████████████████ 20.05 GB/s
75+ regex ████████ 2.30 ████▋ 4.59 GB/s
76+ ```
77+
78+ See [bench_unicode.md](bench_unicode.md) for details
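
To make "full case folding" concrete, here is a minimal Python sketch that uses only the standard library's `str.casefold`. The helper name `casefold_find` is hypothetical and is not part of StringZilla, ICU, or `regex`. It is a naive fold-then-search baseline, not how any of the benchmarked libraries work. The index it returns refers to the folded text, which can differ from the offset in the original string whenever folding changes length (ß becomes "ss").

```python
def casefold_find(haystack: str, needle: str) -> int:
    """Naive case-insensitive search: fold both sides, then search exactly.

    Hypothetical helper for illustration only. The returned index points
    into the folded haystack, not into the original string.
    """
    return haystack.casefold().find(needle.casefold())


# Full case folding expands "ß" to "ss" and maps final sigma "ς" to "σ",
# so these pairs compare equal after folding.
print("Straße".casefold() == "STRASSE".casefold())
print(casefold_find("Die Straße ist lang", "STRASSE"))
print(casefold_find("ΟΔΥΣΣΕΥΣ", "οδυσσευς"))
```

Folding the whole haystack up front costs an extra pass and an allocation, which is one reason dedicated implementations can be much faster than this baseline.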
79+
80+ ### Exact Substring Search
5481
5582Substring search is offloaded to C's ` memmem ` or ` strstr ` in most languages, but SIMD-optimized implementations can do better.
5683Throughput on long lines:
@@ -93,13 +120,13 @@ Different scripts stress UTF-8 differently: Korean has 3-byte Hangul with single
93120Throughput on AMD Zen5 Turin:
94121
95122```
96- English Arabic
97123Newline splitting:
124+ English Arabic
98125stringzilla ████████████████ 15.45 ████████████████████ 18.34 GB/s
99126stdlib ██ 1.90 ██ 1.82 GB/s
100127
101- English Korean
102128Whitespace splitting:
129+ English Korean
103130stringzilla ████████████████████ 0.82 ████████████████████ 1.88 GB/s
104131stdlib ██████████████████▊ 0.77 ██████████▍ 0.98 GB/s
105132icu::WhiteSpace ██▋ 0.11 █▌ 0.15 GB/s
@@ -108,8 +135,8 @@ icu::WhiteSpace ██▋ 0.11 █▌
108135Case folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian) plus Chinese for reference:
109136
110137```
111- English 16x German 6x
112138Case folding:
139+ English 16x German 6x
113140stringzilla ████████████████████ 7.53 ████████████████████ 2.59 GB/s
114141stdlib ██▌ 0.48 ███▎ 0.43 GB/s
115142
@@ -148,7 +175,7 @@ pyarrow.sort █████▌ 62.17 M cmp/s
148175list.sort ████▏ 47.06 M cmp/s
149176```
150177
151- GPU: `cudf` on H100 reaches **9,463 M cmp/s** on short words.
178+ GPU: `cudf` on H100 reaches __9,463 M cmp/s__ on short words.
152179
153180See [bench_sequence.md](bench_sequence.md) for details
154181
@@ -178,8 +205,8 @@ It's computationally expensive with O(n\*m) complexity, but GPUs and multi-core
178205Levenshtein distance on ~ 1,000 byte lines (MCUPS = Million Cell Updates Per Second):
179206
180207```
181- 1 Core 1 Socket
182208Rust:
209+ 1 Core 1 Socket
183210bio::levenshtein █▏ 823
184211rapidfuzz ████████████████████ 14,316
185212stringzilla (384x GNR) ██████████████████▎ 13,084 ████████████████████ 3,084,270 MCUPS
@@ -195,8 +222,8 @@ Converting variable-length strings into fixed-length sketches (like Min-Hashing)
195222Throughput on ~ 1,000 byte lines:
196223
197224```
198- 1 Core 1 Socket
199225Rust:
226+ 1 Core 1 Socket
200227pc::MinHash ████████████████████ 3.16
201228stringzilla (384x GNR) ███▏ 0.51 ███████████████▍ 302.30 MB/s
202229stringzilla (H100) ████████████████████ 392.37 MB/s
@@ -325,11 +352,11 @@ The Cohere Wikipedia dataset provides pre-processed JSONL files for different la
325352This may be the optimal dataset for relative comparison of UTF-8 decoding and matching engines in each individual environment.
326353Not all Wikipedia languages are available, but the following have been selected specifically:
327354
328- - **Chinese (zh)**: 3-byte CJK characters, rare 1-byte punctuation
329- - **Korean (ko)**: 3-byte Hangul syllables, frequent 1-byte punctuation
330- - **Arabic (ar)**: 2-byte Arabic script, with regular 1-byte punctuation
331- - **French (fr)**: Mixed 1-2 byte Latin with high diacritic density
332- - **English (en)**: Mostly 1-byte ASCII baseline
355+ - __Chinese (zh)__: 3-byte CJK characters, rare 1-byte punctuation
356+ - __Korean (ko)__: 3-byte Hangul syllables, frequent 1-byte punctuation
357+ - __Arabic (ar)__: 2-byte Arabic script, with regular 1-byte punctuation
358+ - __French (fr)__: Mixed 1-2 byte Latin with high diacritic density
359+ - __English (en)__: Mostly 1-byte ASCII baseline
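
The byte widths listed above can be checked with a short Python sketch. The sample characters are arbitrary picks for illustration, one per language, and are not taken from the dataset.

```python
# One representative character per language, written as escapes so the
# exact code points are unambiguous. These picks are illustrative only.
samples = {
    "zh": "\u4e2d",  # CJK ideograph
    "ko": "\ud55c",  # Hangul syllable
    "ar": "\u0639",  # Arabic letter
    "fr": "\u00e9",  # Latin small e with acute (precomposed)
    "en": "e",       # plain ASCII
}


def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


widths = {lang: utf8_width(ch) for lang, ch in samples.items()}
print(widths)
```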
333360
334361To download and decompress one file from each language:
335362
@@ -353,13 +380,13 @@ Files are XZ-compressed plain text with documents separated by double-newlines.
353380
354381| Workload | Relevant Scripts | Best Test Languages |
355382| --------------------------- | --------------------------------- | ---------------------------------------------------- |
356- | **Case Folding** | Latin, Cyrillic, Greek, Armenian | Turkish (I/i), German (ss->SS), Greek, Russian |
357- | **Normalization** | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic |
358- | **Whitespace Tokenization** | Most scripts except CJK/Thai | English, Russian, Arabic vs. Chinese, Japanese, Thai |
359- | **Grapheme Clusters** | Indic, Thai, Khmer, Myanmar | Thai, Tamil, Myanmar, Khmer |
360- | **RTL Handling** | Arabic, Hebrew | Arabic, Hebrew, Persian |
383+ | __Case Folding__ | Latin, Cyrillic, Greek, Armenian | Turkish (I/i), German (ss->SS), Greek, Russian |
384+ | __Normalization__ | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic |
385+ | __Whitespace Tokenization__ | Most scripts except CJK/Thai | English, Russian, Arabic vs. Chinese, Japanese, Thai |
386+ | __Grapheme Clusters__ | Indic, Thai, Khmer, Myanmar | Thai, Tamil, Myanmar, Khmer |
387+ | __RTL Handling__ | Arabic, Hebrew | Arabic, Hebrew, Persian |
361388
362- **Bicameral scripts** with various case folding rules:
389+ __Bicameral scripts__ with various case folding rules:
363390
364391``` bash
365392curl -fL https://data.statmt.org/cc-100/en.txt.xz | xz -d > cc100_en.txt # 82 GB - English
@@ -379,7 +406,7 @@ curl -fL https://data.statmt.org/cc-100/pt.txt.xz | xz -d > cc100_pt.txt #
379406curl -fL https://data.statmt.org/cc-100/it.txt.xz | xz -d > cc100_it.txt # 7.8 GB - Italian
380407```
381408
382- **Unicameral scripts** without case folding, but with other normalization/segmentation challenges:
409+ __Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:
383410
384411``` bash
385412curl -fL https://data.statmt.org/cc-100/ar.txt.xz | xz -d > cc100_ar.txt # 5.4 GB - Arabic (RTL)
@@ -406,7 +433,7 @@ The [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/)
406433Each tar.gz contains ` *-sentences.txt ` (tab-separated ` id\tsentence ` ), ` *-words.txt ` (frequencies), and co-occurrence files.
407434Standard sizes: 10K, 30K, 100K, 300K, 1M sentences. Check for newer years at the download page.
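
For readers who prefer to post-process the archives in Python rather than with `cut -f2`, the following is a minimal sketch of extracting the sentence column from `id\tsentence` rows. The function name `sentences_from_tsv` and the two sample rows are made up for illustration. Unlike `cut`, this sketch skips lines that contain no tab instead of echoing them.

```python
import io


def sentences_from_tsv(lines):
    """Yield the second tab-separated field of each `id\\tsentence` row.

    Hypothetical helper for illustration. Blank lines and lines without
    a tab separator are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[1]


# Made-up sample rows in the `id\tsentence` layout described above.
sample = io.StringIO("1\tHello world.\n2\tGuten Tag.\n")
print(list(sentences_from_tsv(sample)))
```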
408435
409- **Bicameral scripts** with various case folding rules:
436+ __Bicameral scripts__ with various case folding rules:
410437
411438``` bash
412439curl -fL https://downloads.wortschatz-leipzig.de/corpora/eng_wikipedia_2016_1M.tar.gz | tar -xzf - -O 'eng_wikipedia_2016_1M/eng_wikipedia_2016_1M-sentences.txt' | cut -f2 > leipzig1M_en.txt
@@ -427,7 +454,7 @@ curl -fL https://downloads.wortschatz-leipzig.de/corpora/ita_wikipedia_2021_1M.t
428455curl -fL https://downloads.wortschatz-leipzig.de/corpora/lit_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'lit_wikipedia_2021_300K/lit_wikipedia_2021_300K-sentences.txt' | cut -f2 > leipzig300K_lt.txt
428455```
429456
430- **Unicameral scripts** without case folding, but with other normalization/segmentation challenges:
457+ __Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:
431458
432459``` bash
433460curl -fL https://downloads.wortschatz-leipzig.de/corpora/ara_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ara_wikipedia_2021_1M/ara_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ar.txt