
Commit 39825f6

Docs: UTF-8 Search results

1 parent 9ed0776

File tree

2 files changed: +91 −26 lines


README.md

Lines changed: 51 additions & 24 deletions
@@ -4,10 +4,10 @@
 
 ![StringWars Thumbnail](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringWa.rs.jpg?raw=true)
 
-There are many **great** libraries for string processing!
+There are many __great__ libraries for string processing!
 Mostly, of course, written in Assembly, C, and C++, but some in Rust as well.
 
-Where Rust decimates C and C++, is the **simplicity** of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples across native crates and their Python bindings.
+Where Rust decimates C and C++ is in the __simplicity__ of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples comparisons across native crates and their Python bindings.
 So, to accelerate the development of the [`StringZilla`](https://github.com/ashvardanian/StringZilla) C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my & the community's most beloved Rust projects, like:
 
 - [`memchr`](https://github.com/BurntSushi/memchr) for substring search.
@@ -24,7 +24,7 @@ Notably, I also favor modern hardware with support for a wider range SIMD instru
 
 > [!IMPORTANT]
 > The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method.
-> Most of them were obtained on Intel Sapphire Rapids **(SPR)** and Granite Rapids **(GNR)** CPUs and Nvidia Hopper-based **H100** and Blackwell-based **RTX 6000** Pro GPUs, using Rust with `-C target-cpu=native` optimization flag.
+> Most of them were obtained on Intel Sapphire Rapids __(SPR)__ and Granite Rapids __(GNR)__ CPUs and Nvidia Hopper-based __H100__ and Blackwell-based __RTX 6000__ Pro GPUs, using Rust with the `-C target-cpu=native` optimization flag.
 > To replicate the results, please refer to the [Replicating the Results](#replicating-the-results) section below.
 
 ## Benchmarks at a Glance
@@ -50,7 +50,34 @@ xxhash.xxh3_64 █████▋ 0.04 ██████
 
 See [bench_hash.md](bench_hash.md) for details
 
-### Substring Search
+### Case-Insensitive UTF-8 Search
+
+Unicode-aware case-insensitive search with full case folding (ß↔SS, σ↔ς).
+Throughput searching across ~100MB multilingual corpora:
+
+```
+Rust:
+             English                     German
+stringzilla  ████████████████████ 12.79  ████████████████████ 10.67 GB/s
+icu          ▏ 0.08                      ▏ 0.08 GB/s
+
+             Russian                     Korean
+stringzilla  ████████████████████ 7.12   ████████████████████ 35.10 GB/s
+icu          ▏ 0.14                      ▏ 0.23 GB/s
+
+Python:
+             English                     German
+stringzilla  ████████████████████ 5.61   ████████████████████ 6.08 GB/s
+regex        ██▋ 0.77                    ███ 0.90 GB/s
+
+             Russian                     Korean
+stringzilla  ████████████████████ 5.70   ████████████████████ 20.05 GB/s
+regex        ████████ 2.30               ████▋ 4.59 GB/s
+```
+
+See [bench_unicode.md](bench_unicode.md) for details
+
+### Exact Substring Search
 
 Substring search is offloaded to C's `memmem` or `strstr` in most languages, but SIMD-optimized implementations can do better.
 Throughput on long lines:
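As background on the "full case folding" the case-insensitive search section refers to: it is not the same as lowercasing, because some characters fold to a *different number* of characters. A minimal illustration using Python's standard `str.casefold()` (not the benchmarked libraries):

```python
# Full case folding vs. plain lowercasing:
# German ß expands to "ss" under folding, so lower() misses the match.
text = "Die STRASSE"
needle = "Straße"

assert needle.lower() == "straße"            # lower() leaves ß untouched
assert needle.lower() not in text.lower()    # ...so the match is missed

assert needle.casefold() == "strasse"        # full folding: ß -> ss
assert needle.casefold() in text.casefold()  # ...and the match is found

# Greek final sigma folds to the medial form, same as capital sigma
assert "ς".casefold() == "σ" == "Σ".casefold()
```

This length-changing behavior (one `ß` becomes two characters) is part of what makes fast case-insensitive UTF-8 search hard.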
@@ -93,13 +120,13 @@ Different scripts stress UTF-8 differently: Korean has 3-byte Hangul with single
 Throughput on AMD Zen5 Turin:
 
 ```
-                 English                     Arabic
 Newline splitting:
+                 English                     Arabic
 stringzilla      ████████████████ 15.45      ████████████████████ 18.34 GB/s
 stdlib           ██ 1.90                     ██ 1.82 GB/s
 
-                 English                     Korean
 Whitespace splitting:
+                 English                     Korean
 stringzilla      ████████████████████ 0.82   ████████████████████ 1.88 GB/s
 stdlib           ██████████████████▊ 0.77    ██████████▍ 0.98 GB/s
 icu::WhiteSpace  ██▋ 0.11                    █▌ 0.15 GB/s
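Part of why Unicode-aware whitespace splitting is slower than byte-level splitting is that Unicode defines many more whitespace characters than ASCII. A small standard-library Python sketch of the difference (illustrative only, not the benchmark harness):

```python
# str.split() with no separator splits on *Unicode* whitespace,
# e.g. NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000),
# not just ASCII spaces, tabs, and newlines.
line = "foo\u00a0bar\u3000baz"

assert "\u00a0".isspace() and "\u3000".isspace()
assert line.split() == ["foo", "bar", "baz"]

# The same data as raw bytes splits only on ASCII whitespace,
# so the multi-byte Unicode spaces stay inside one token.
assert line.encode("utf-8").split() == [line.encode("utf-8")]
```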
@@ -108,8 +135,8 @@ icu::WhiteSpace ██▋ 0.11 █▌
 Case folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian) plus Chinese for reference:
 
 ```
-                 English 16x                 German 6x
 Case folding:
+                 English 16x                 German 6x
 stringzilla      ████████████████████ 7.53   ████████████████████ 2.59 GB/s
 stdlib           ██▌ 0.48                    ███▎ 0.43 GB/s
@@ -148,7 +175,7 @@ pyarrow.sort █████▌ 62.17 M cmp/s
 list.sort    ████▏ 47.06 M cmp/s
 ```
 
-GPU: `cudf` on H100 reaches **9,463 M cmp/s** on short words.
+GPU: `cudf` on H100 reaches __9,463 M cmp/s__ on short words.
 
 See [bench_sequence.md](bench_sequence.md) for details
@@ -178,8 +205,8 @@ It's computationally expensive with O(n\*m) complexity, but GPUs and multi-core
 Levenshtein distance on ~1,000 byte lines (MCUPS = Million Cell Updates Per Second):
 
 ```
-                        1 Core                      1 Socket
 Rust:
+                        1 Core                      1 Socket
 bio::levenshtein        █▏ 823
 rapidfuzz               ████████████████████ 14,316
 stringzilla (384x GNR)  ██████████████████▎ 13,084  ████████████████████ 3,084,270 MCUPS
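The MCUPS unit counts cells of the Levenshtein dynamic-programming matrix: an (n+1) × (m+1) grid where each cell is one update. For reference, the textbook two-row form of that O(n\*m) recurrence in Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic DP; each min() below is one 'cell update'."""
    prev = list(range(len(b) + 1))        # row 0: distance from the empty prefix
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # column 0: i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
# Two ~1,000-byte lines => roughly a million cell updates per pair.
```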
@@ -195,8 +222,8 @@ Converting variable-length strings into fixed-length sketches (like Min-Hashing)
 Throughput on ~1,000 byte lines:
 
 ```
-                        1 Core                      1 Socket
 Rust:
+                        1 Core                      1 Socket
 pc::MinHash             ████████████████████ 3.16
 stringzilla (384x GNR)  ███▏ 0.51                   ███████████████▍ 302.30 MB/s
 stringzilla (H100)      ████████████████████ 392.37 MB/s
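As background on the Min-Hashing workload: each string is reduced to a fixed-length sketch of per-seed minima over hashed n-grams, and the fraction of matching minima estimates the Jaccard similarity of the n-gram sets. A slow but self-contained Python sketch (illustrative only; the `n = 4`, `k = 32`, and blake2b-salt choices are mine, not the benchmarked libraries' parameters):

```python
import hashlib

def _hash(gram: bytes, seed: int) -> int:
    # Simulate k "independent" hash functions via distinct blake2b salts
    h = hashlib.blake2b(gram, digest_size=8, salt=seed.to_bytes(16, "little"))
    return int.from_bytes(h.digest(), "little")

def minhash(data: bytes, n: int = 4, k: int = 32) -> list:
    """Fixed-length sketch: the minimum hash of all byte n-grams, per seed."""
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    return [min(_hash(g, seed) for g in grams) for seed in range(k)]

def similarity(a: list, b: list) -> float:
    # Fraction of matching minima ~= Jaccard similarity of the n-gram sets
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = minhash(b"the quick brown fox jumps over the lazy dog")
s2 = minhash(b"the quick brown fox jumps over the lazy cat")
assert similarity(s1, s1) == 1.0
assert 0.0 <= similarity(s1, s2) <= 1.0
```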
@@ -325,11 +352,11 @@ The Cohere Wikipedia dataset provides pre-processed JSONL files for different la
 This may be the optimal dataset for relative comparison of UTF-8 decoding and matching engines in each individual environment.
 Not all Wikipedia languages are available, but the following have been selected specifically:
 
-- **Chinese (zh)**: 3-byte CJK characters, rare 1-byte punctuation
-- **Korean (ko)**: 3-byte Hangul syllables, frequent 1-byte punctuation
-- **Arabic (ar)**: 2-byte Arabic script, with regular 1-byte punctuation
-- **French (fr)**: Mixed 1-2 byte Latin with high diacritic density
-- **English (en)**: Mostly 1-byte ASCII baseline
+- __Chinese (zh)__: 3-byte CJK characters, rare 1-byte punctuation
+- __Korean (ko)__: 3-byte Hangul syllables, frequent 1-byte punctuation
+- __Arabic (ar)__: 2-byte Arabic script, with regular 1-byte punctuation
+- __French (fr)__: Mixed 1-2 byte Latin with high diacritic density
+- __English (en)__: Mostly 1-byte ASCII baseline
 
 To download and decompress one file from each language:
 
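The per-script byte widths listed above follow directly from UTF-8 encoding, and are easy to verify in Python:

```python
# UTF-8 byte widths per script, matching the dataset notes above
assert len("a".encode("utf-8")) == 1   # English: 1-byte ASCII
assert len("é".encode("utf-8")) == 2   # French: 2-byte Latin with diacritic
assert len("ع".encode("utf-8")) == 2   # Arabic: 2-byte script
assert len("한".encode("utf-8")) == 3  # Korean: 3-byte Hangul syllable
assert len("中".encode("utf-8")) == 3  # Chinese: 3-byte CJK ideograph
```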
@@ -353,13 +380,13 @@ Files are XZ-compressed plain text with documents separated by double-newlines.
 
 | Workload                    | Relevant Scripts                  | Best Test Languages                                  |
 | --------------------------- | --------------------------------- | ---------------------------------------------------- |
-| **Case Folding**            | Latin, Cyrillic, Greek, Armenian  | Turkish (I/i), German (ss->SS), Greek, Russian       |
-| **Normalization**           | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic                    |
-| **Whitespace Tokenization** | Most scripts except CJK/Thai      | English, Russian, Arabic vs. Chinese, Japanese, Thai |
-| **Grapheme Clusters**       | Indic, Thai, Khmer, Myanmar       | Thai, Tamil, Myanmar, Khmer                          |
-| **RTL Handling**            | Arabic, Hebrew                    | Arabic, Hebrew, Persian                              |
+| __Case Folding__            | Latin, Cyrillic, Greek, Armenian  | Turkish (I/i), German (ss->SS), Greek, Russian       |
+| __Normalization__           | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic                    |
+| __Whitespace Tokenization__ | Most scripts except CJK/Thai      | English, Russian, Arabic vs. Chinese, Japanese, Thai |
+| __Grapheme Clusters__       | Indic, Thai, Khmer, Myanmar       | Thai, Tamil, Myanmar, Khmer                          |
+| __RTL Handling__            | Arabic, Hebrew                    | Arabic, Hebrew, Persian                              |
 
-**Bicameral scripts** with various case folding rules:
+__Bicameral scripts__ with various case folding rules:
 
 ```bash
 curl -fL https://data.statmt.org/cc-100/en.txt.xz | xz -d > cc100_en.txt # 82 GB - English
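Why Turkish and German make good case-folding test languages, shown with Python's standard, locale-independent folding (illustrative; the benchmarked libraries implement their own folding):

```python
# German ß expands to "ss" under full case folding...
assert "ß".casefold() == "ss"

# ...and Turkish dotted İ (U+0130) folds to "i" plus a combining dot
# above (U+0307): folded text can be *longer* than the input, so byte
# offsets cannot be mapped back one-to-one.
assert "\u0130".casefold() == "i\u0307"
assert len("\u0130".casefold()) == 2

# Python's casefold() applies only the default (non-Turkish) mapping:
# plain I folds to i, never to dotless ı.
assert "I".casefold() == "i"
```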
@@ -379,7 +406,7 @@ curl -fL https://data.statmt.org/cc-100/pt.txt.xz | xz -d > cc100_pt.txt #
 curl -fL https://data.statmt.org/cc-100/it.txt.xz | xz -d > cc100_it.txt # 7.8 GB - Italian
 ```
 
-**Unicameral scripts** without case folding, but with other normalization/segmentation challenges:
+__Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:
 
 ```bash
 curl -fL https://data.statmt.org/cc-100/ar.txt.xz | xz -d > cc100_ar.txt # 5.4 GB - Arabic (RTL)
@@ -406,7 +433,7 @@ The [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/)
 Each tar.gz contains `*-sentences.txt` (tab-separated `id\tsentence`), `*-words.txt` (frequencies), and co-occurrence files.
 Standard sizes: 10K, 30K, 100K, 300K, 1M sentences. Check for newer years at the download page.
 
-**Bicameral scripts** with various case folding rules:
+__Bicameral scripts__ with various case folding rules:
 
 ```bash
 curl -fL https://downloads.wortschatz-leipzig.de/corpora/eng_wikipedia_2016_1M.tar.gz | tar -xzf - -O 'eng_wikipedia_2016_1M/eng_wikipedia_2016_1M-sentences.txt' | cut -f2 > leipzig1M_en.txt
@@ -427,7 +454,7 @@ curl -fL https://downloads.wortschatz-leipzig.de/corpora/ita_wikipedia_2021_1M.t
 curl -fL https://downloads.wortschatz-leipzig.de/corpora/lit_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'lit_wikipedia_2021_300K/lit_wikipedia_2021_300K-sentences.txt' | cut -f2 > leipzig300K_lt.txt
 ```
 
-**Unicameral scripts** without case folding, but with other normalization/segmentation challenges:
+__Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:
 
 ```bash
 curl -fL https://downloads.wortschatz-leipzig.de/corpora/ara_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ara_wikipedia_2021_1M/ara_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ar.txt

bench_unicode.md

Lines changed: 40 additions & 2 deletions
@@ -61,7 +61,6 @@ On AMD Zen5 Turin CPUs on different datasets, StringZilla provides the following
 | Farsi 🇮🇷    | 235 MB/s | 858 MB/s  | 4x  | 475 MB/s | 1.42 GB/s | 3x |
 | Georgian 🇬🇪 | 294 MB/s | 192 MB/s  | 1x  | 689 MB/s | 488 MB/s  | 1x |
 | Hebrew 🇮🇱   | 233 MB/s | 1.01 GB/s | 4x  | 473 MB/s | 1.86 GB/s | 4x |
-| Hindi 🇮🇳    | 293 MB/s | 6.32 GB/s | 22x | 682 MB/s | 3.14 GB/s | 5x |
 | Italian 🇮🇹  | 439 MB/s | 2.29 GB/s | 5x  | 268 MB/s | 1.93 GB/s | 7x |
 | Japanese 🇯🇵 | 330 MB/s | 3.51 GB/s | 11x | 726 MB/s | 2.00 GB/s | 3x |
 | Korean 🇰🇷   | 314 MB/s | 861 MB/s  | 3x  | 623 MB/s | 2.80 GB/s | 4x |
@@ -79,14 +78,53 @@ To rerun the benchmarks for all languages:
 RUSTFLAGS="-C target-cpu=native" cargo build --release --bench bench_unicode --features bench_unicode
 bin=$(find target/release/deps -name 'bench_unicode-*' -executable -type f | head -1)
 
-for f in leipzig1M_*.txt leipzig300K_*.txt; do
+for f in leipzig*.txt; do
   [ -f "$f" ] || continue
   echo "=== $f ==="
   STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold" "$bin"
   STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold/" uv run bench_unicode.py
 done
 ```
 
+## Case-Insensitive Substring Search
+
+| Language       | Standard 🦀 | StringZilla 🦀 |      | Standard 🐍 | StringZilla 🐍 |     |
+| :------------- | ----------: | -------------: | ---: | ----------: | -------------: | --: |
+| Arabic 🇸🇦      | 200 MB/s    | 38.55 GB/s     | 193x | 3.01 GB/s   | 14.78 GB/s     | 5x  |
+| Armenian 🇦🇲    | 190 MB/s    | 980 MB/s       | 5x   | 2.07 GB/s   | 860 MB/s       | 0x  |
+| Bengali 🇧🇩     | 300 MB/s    | 28.20 GB/s     | 94x  | 4.51 GB/s   | 21.19 GB/s     | 5x  |
+| Chinese 🇨🇳     | 240 MB/s    | 25.65 GB/s     | 107x | 5.40 GB/s   | 13.94 GB/s     | 3x  |
+| Czech 🇨🇿       | 90 MB/s     | 7.41 GB/s      | 82x  | 1.38 GB/s   | 6.36 GB/s      | 5x  |
+| Dutch 🇳🇱       | 90 MB/s     | 12.61 GB/s     | 140x | 860 MB/s    | 7.99 GB/s      | 9x  |
+| English 🇬🇧     | 80 MB/s     | 12.79 GB/s     | 160x | 770 MB/s    | 5.61 GB/s      | 7x  |
+| Farsi 🇮🇷       | 190 MB/s    | 26.22 GB/s     | 138x | 2.36 GB/s   | 10.70 GB/s     | 5x  |
+| French 🇫🇷      | 90 MB/s     | 10.77 GB/s     | 120x | 1.10 GB/s   | 6.83 GB/s      | 6x  |
+| Georgian 🇬🇪    | 190 MB/s    | 1.03 GB/s      | 5x   | 3.20 GB/s   | 620 MB/s       | 0x  |
+| German 🇩🇪      | 80 MB/s     | 10.67 GB/s     | 133x | 900 MB/s    | 6.08 GB/s      | 7x  |
+| Greek 🇬🇷       | 130 MB/s    | 2.57 GB/s      | 20x  | 1.38 GB/s   | 2.48 GB/s      | 2x  |
+| Hebrew 🇮🇱      | 190 MB/s    | 34.54 GB/s     | 182x | 2.92 GB/s   | 15.72 GB/s     | 5x  |
+| Italian 🇮🇹     | 80 MB/s     | 12.99 GB/s     | 162x | 970 MB/s    | 8.87 GB/s      | 9x  |
+| Japanese 🇯🇵    | 220 MB/s    | 21.71 GB/s     | 99x  | 4.88 GB/s   | 13.17 GB/s     | 3x  |
+| Korean 🇰🇷      | 230 MB/s    | 35.10 GB/s     | 153x | 4.59 GB/s   | 20.05 GB/s     | 4x  |
+| Polish 🇵🇱      | 90 MB/s     | 10.50 GB/s     | 117x | 1.29 GB/s   | 8.02 GB/s      | 6x  |
+| Portuguese 🇧🇷  | 90 MB/s     | 10.72 GB/s     | 119x | 1.10 GB/s   | 8.12 GB/s      | 7x  |
+| Russian 🇷🇺     | 140 MB/s    | 7.12 GB/s      | 51x  | 2.30 GB/s   | 5.70 GB/s      | 2x  |
+| Spanish 🇪🇸     | 90 MB/s     | 11.62 GB/s     | 129x | 1.02 GB/s   | 6.33 GB/s      | 6x  |
+| Tamil 🇮🇳       | 270 MB/s    | 29.53 GB/s     | 109x | 5.81 GB/s   | 23.11 GB/s     | 4x  |
+| Turkish 🇹🇷     | 90 MB/s     | 8.18 GB/s      | 91x  | 1.49 GB/s   | 5.25 GB/s      | 4x  |
+| Ukrainian 🇺🇦   | 140 MB/s    | 8.88 GB/s      | 63x  | 2.26 GB/s   | 5.35 GB/s      | 2x  |
+| Vietnamese 🇻🇳  | 110 MB/s    | 4.25 GB/s      | 39x  | 1.07 GB/s   | 1.12 GB/s      | 1x  |
+
+To rerun the benchmarks for all languages:
+
+```bash
+for f in leipzig*.txt; do
+  [ -f "$f" ] || continue
+  echo "=== $f ==="
+  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=words STRINGWARS_FILTER="case-insensitive-find" STRINGWARS_UNIQUE=1 "$bin"
+done
+```
+
 ---
 
 See [README.md](README.md) for dataset information and replication instructions.
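For readers comparing against the "Standard 🐍" column above: a naive Unicode-correct baseline (my own illustration, not the harness's code) can be built on Python's `str.casefold()`. It answers containment correctly, but since folding changes lengths (ß → ss), indexes into the folded haystack do not map one-to-one back to the original string:

```python
def contains_ci(haystack: str, needle: str) -> bool:
    """Case-insensitive containment via full Unicode case folding."""
    return needle.casefold() in haystack.casefold()

assert contains_ci("Die STRASSE ist lang", "straße")  # ß matches SS
assert contains_ci("ΣΊΣΥΦΟΣ", "σίσυφος")              # final ς matches Σ
assert not contains_ci("hello", "world")
```

Folding the entire haystack per query is also O(haystack) extra work and memory, which is part of why the SIMD-based approaches in the table are so much faster.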
