
Commit 39825f6

Docs: UTF-8 Search results

1 parent 9ed0776

File tree

2 files changed: +91 −26 lines


README.md

Lines changed: 51 additions & 24 deletions
@@ -4,10 +4,10 @@
 
 ![StringWars Thumbnail](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringWa.rs.jpg?raw=true)
 
-There are many **great** libraries for string processing!
+There are many __great__ libraries for string processing!
 Mostly, of course, written in Assembly, C, and C++, but some in Rust as well.
 
-Where Rust decimates C and C++, is the **simplicity** of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples across native crates and their Python bindings.
+Where Rust decimates C and C++ is in the __simplicity__ of dependency management, making it great for benchmarking "Systems Software" and lining up apples-to-apples comparisons across native crates and their Python bindings.
 So, to accelerate the development of the [`StringZilla`](https://github.com/ashvardanian/StringZilla) C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my & the community's most beloved Rust projects, like:
 
 - [`memchr`](https://github.com/BurntSushi/memchr) for substring search.
@@ -24,7 +24,7 @@ Notably, I also favor modern hardware with support for a wider range SIMD instru
 
 > [!IMPORTANT]
 > The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method.
-> Most of them were obtained on Intel Sapphire Rapids **(SPR)** and Granite Rapids **(GNR)** CPUs and Nvidia Hopper-based **H100** and Blackwell-based **RTX 6000** Pro GPUs, using Rust with `-C target-cpu=native` optimization flag.
+> Most of them were obtained on Intel Sapphire Rapids __(SPR)__ and Granite Rapids __(GNR)__ CPUs and Nvidia Hopper-based __H100__ and Blackwell-based __RTX 6000__ Pro GPUs, using Rust with the `-C target-cpu=native` optimization flag.
 > To replicate the results, please refer to the [Replicating the Results](#replicating-the-results) section below.
 
 ## Benchmarks at a Glance
@@ -50,7 +50,34 @@ xxhash.xxh3_64 █████▋ 0.04 ██████
 
 See [bench_hash.md](bench_hash.md) for details
 
-### Substring Search
+### Case-Insensitive UTF-8 Search
+
+Unicode-aware case-insensitive search with full case folding (ß↔SS, σ↔ς).
+Throughput searching across ~100MB multilingual corpora:
+
+```
+Rust:
+             English                     German
+stringzilla  ████████████████████ 12.79  ████████████████████ 10.67 GB/s
+icu          ▏ 0.08                      ▏ 0.08 GB/s
+
+             Russian                     Korean
+stringzilla  ████████████████████ 7.12   ████████████████████ 35.10 GB/s
+icu          ▏ 0.14                      ▏ 0.23 GB/s
+
+Python:
+             English                     German
+stringzilla  ████████████████████ 5.61   ████████████████████ 6.08 GB/s
+regex        ██▋ 0.77                    ███ 0.90 GB/s
+
+             Russian                     Korean
+stringzilla  ████████████████████ 5.70   ████████████████████ 20.05 GB/s
+regex        ████████ 2.30               ████▋ 4.59 GB/s
+```
+
+See [bench_unicode.md](bench_unicode.md) for details
+
+### Exact Substring Search
 
 Substring search is offloaded to C's `memmem` or `strstr` in most languages, but SIMD-optimized implementations can do better.
 Throughput on long lines:
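As background on the "full case folding" the case-insensitive search section refers to: it is not the same as lowercasing, because some characters fold to a *different number* of characters. A minimal illustration using Python's standard `str.casefold()` (not the benchmarked libraries):

```python
# Full case folding vs. plain lowercasing:
# German ß expands to "ss" under folding, so lower() misses the match.
text = "Die STRASSE"
needle = "Straße"

assert needle.lower() == "straße"            # lower() leaves ß untouched
assert needle.lower() not in text.lower()    # ...so the match is missed

assert needle.casefold() == "strasse"        # full folding: ß -> ss
assert needle.casefold() in text.casefold()  # ...and the match is found

# Greek final sigma folds to the medial form, same as capital sigma
assert "ς".casefold() == "σ" == "Σ".casefold()
```

This length-changing behavior (one `ß` becomes two characters) is part of what makes fast case-insensitive UTF-8 search hard.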
@@ -93,13 +120,13 @@ Different scripts stress UTF-8 differently: Korean has 3-byte Hangul with single
 Throughput on AMD Zen5 Turin:
 
 ```
-                 English                     Arabic
 Newline splitting:
+                 English                     Arabic
 stringzilla      ████████████████ 15.45      ████████████████████ 18.34 GB/s
 stdlib           ██ 1.90                     ██ 1.82 GB/s
 
-                 English                     Korean
 Whitespace splitting:
+                 English                     Korean
 stringzilla      ████████████████████ 0.82   ████████████████████ 1.88 GB/s
 stdlib           ██████████████████▊ 0.77    ██████████▍ 0.98 GB/s
 icu::WhiteSpace  ██▋ 0.11                    █▌ 0.15 GB/s
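Part of why Unicode-aware whitespace splitting is slower than byte-level splitting is that Unicode defines many more whitespace characters than ASCII. A small standard-library Python sketch of the difference (illustrative only, not the benchmark harness):

```python
# str.split() with no separator splits on *Unicode* whitespace,
# e.g. NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000),
# not just ASCII spaces, tabs, and newlines.
line = "foo\u00a0bar\u3000baz"

assert "\u00a0".isspace() and "\u3000".isspace()
assert line.split() == ["foo", "bar", "baz"]

# The same data as raw bytes splits only on ASCII whitespace,
# so the multi-byte Unicode spaces stay inside one token.
assert line.encode("utf-8").split() == [line.encode("utf-8")]
```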
@@ -108,8 +135,8 @@ icu::WhiteSpace ██▋ 0.11 █▌
 Case folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian) plus Chinese for reference:
 
 ```
-                 English 16x                 German 6x
 Case folding:
+                 English 16x                 German 6x
 stringzilla      ████████████████████ 7.53   ████████████████████ 2.59 GB/s
 stdlib           ██▌ 0.48                    ███▎ 0.43 GB/s
@@ -148,7 +175,7 @@ pyarrow.sort █████▌ 62.17 M cmp/s
 list.sort    ████▏ 47.06 M cmp/s
 ```
 
-GPU: `cudf` on H100 reaches **9,463 M cmp/s** on short words.
+GPU: `cudf` on H100 reaches __9,463 M cmp/s__ on short words.
 
 See [bench_sequence.md](bench_sequence.md) for details
@@ -178,8 +205,8 @@ It's computationally expensive with O(n\*m) complexity, but GPUs and multi-core
 Levenshtein distance on ~1,000 byte lines (MCUPS = Million Cell Updates Per Second):
 
 ```
-                        1 Core                      1 Socket
 Rust:
+                        1 Core                      1 Socket
 bio::levenshtein        █▏ 823
 rapidfuzz               ████████████████████ 14,316
 stringzilla (384x GNR)  ██████████████████▎ 13,084  ████████████████████ 3,084,270 MCUPS
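The MCUPS unit counts cells of the Levenshtein dynamic-programming matrix: an (n+1) × (m+1) grid where each cell is one update. For reference, the textbook two-row form of that O(n\*m) recurrence in Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic DP; each min() below is one 'cell update'."""
    prev = list(range(len(b) + 1))        # row 0: distance from the empty prefix
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # column 0: i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
# Two ~1,000-byte lines => roughly a million cell updates per pair.
```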
@@ -195,8 +222,8 @@ Converting variable-length strings into fixed-length sketches (like Min-Hashing)
 Throughput on ~1,000 byte lines:
 
 ```
-                        1 Core                      1 Socket
 Rust:
+                        1 Core                      1 Socket
 pc::MinHash             ████████████████████ 3.16
 stringzilla (384x GNR)  ███▏ 0.51                   ███████████████▍ 302.30 MB/s
 stringzilla (H100)      ████████████████████ 392.37 MB/s
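As background on the Min-Hashing workload: each string is reduced to a fixed-length sketch of per-seed minima over hashed n-grams, and the fraction of matching minima estimates the Jaccard similarity of the n-gram sets. A slow but self-contained Python sketch (illustrative only; the `n = 4`, `k = 32`, and blake2b-salt choices are mine, not the benchmarked libraries' parameters):

```python
import hashlib

def _hash(gram: bytes, seed: int) -> int:
    # Simulate k "independent" hash functions via distinct blake2b salts
    h = hashlib.blake2b(gram, digest_size=8, salt=seed.to_bytes(16, "little"))
    return int.from_bytes(h.digest(), "little")

def minhash(data: bytes, n: int = 4, k: int = 32) -> list:
    """Fixed-length sketch: the minimum hash of all byte n-grams, per seed."""
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    return [min(_hash(g, seed) for g in grams) for seed in range(k)]

def similarity(a: list, b: list) -> float:
    # Fraction of matching minima ~= Jaccard similarity of the n-gram sets
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = minhash(b"the quick brown fox jumps over the lazy dog")
s2 = minhash(b"the quick brown fox jumps over the lazy cat")
assert similarity(s1, s1) == 1.0
assert 0.0 <= similarity(s1, s2) <= 1.0
```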
@@ -325,11 +352,11 @@ The Cohere Wikipedia dataset provides pre-processed JSONL files for different la
 This may be the optimal dataset for relative comparison of UTF-8 decoding and matching engines in each individual environment.
 Not all Wikipedia languages are available, but the following have been selected specifically:
 
-- **Chinese (zh)**: 3-byte CJK characters, rare 1-byte punctuation
-- **Korean (ko)**: 3-byte Hangul syllables, frequent 1-byte punctuation
-- **Arabic (ar)**: 2-byte Arabic script, with regular 1-byte punctuation
-- **French (fr)**: Mixed 1-2 byte Latin with high diacritic density
-- **English (en)**: Mostly 1-byte ASCII baseline
+- __Chinese (zh)__: 3-byte CJK characters, rare 1-byte punctuation
+- __Korean (ko)__: 3-byte Hangul syllables, frequent 1-byte punctuation
+- __Arabic (ar)__: 2-byte Arabic script, with regular 1-byte punctuation
+- __French (fr)__: Mixed 1-2 byte Latin with high diacritic density
+- __English (en)__: Mostly 1-byte ASCII baseline
 
 To download and decompress one file from each language:
 
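The per-script byte widths listed above follow directly from UTF-8 encoding, and are easy to verify in Python:

```python
# UTF-8 byte widths per script, matching the dataset notes above
assert len("a".encode("utf-8")) == 1   # English: 1-byte ASCII
assert len("é".encode("utf-8")) == 2   # French: 2-byte Latin with diacritic
assert len("ع".encode("utf-8")) == 2   # Arabic: 2-byte script
assert len("한".encode("utf-8")) == 3  # Korean: 3-byte Hangul syllable
assert len("中".encode("utf-8")) == 3  # Chinese: 3-byte CJK ideograph
```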
@@ -353,13 +380,13 @@ Files are XZ-compressed plain text with documents separated by double-newlines.
 
 | Workload                    | Relevant Scripts                  | Best Test Languages                                  |
 | --------------------------- | --------------------------------- | ---------------------------------------------------- |
-| **Case Folding**            | Latin, Cyrillic, Greek, Armenian  | Turkish (I/i), German (ss->SS), Greek, Russian       |
-| **Normalization**           | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic                    |
-| **Whitespace Tokenization** | Most scripts except CJK/Thai      | English, Russian, Arabic vs. Chinese, Japanese, Thai |
-| **Grapheme Clusters**       | Indic, Thai, Khmer, Myanmar       | Thai, Tamil, Myanmar, Khmer                          |
-| **RTL Handling**            | Arabic, Hebrew                    | Arabic, Hebrew, Persian                              |
+| __Case Folding__            | Latin, Cyrillic, Greek, Armenian  | Turkish (I/i), German (ss->SS), Greek, Russian       |
+| __Normalization__           | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic                    |
+| __Whitespace Tokenization__ | Most scripts except CJK/Thai      | English, Russian, Arabic vs. Chinese, Japanese, Thai |
+| __Grapheme Clusters__       | Indic, Thai, Khmer, Myanmar       | Thai, Tamil, Myanmar, Khmer                          |
+| __RTL Handling__            | Arabic, Hebrew                    | Arabic, Hebrew, Persian                              |
 
-**Bicameral scripts** with various case folding rules:
+__Bicameral scripts__ with various case folding rules:
 
 ```bash
 curl -fL https://data.statmt.org/cc-100/en.txt.xz | xz -d > cc100_en.txt # 82 GB - English
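Why Turkish and German make good case-folding test languages, shown with Python's standard, locale-independent folding (illustrative; the benchmarked libraries implement their own folding):

```python
# German ß expands to "ss" under full case folding...
assert "ß".casefold() == "ss"

# ...and Turkish dotted İ (U+0130) folds to "i" plus a combining dot
# above (U+0307): folded text can be *longer* than the input, so byte
# offsets cannot be mapped back one-to-one.
assert "\u0130".casefold() == "i\u0307"
assert len("\u0130".casefold()) == 2

# Python's casefold() applies only the default (non-Turkish) mapping:
# plain I folds to i, never to dotless ı.
assert "I".casefold() == "i"
```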
@@ -379,7 +406,7 @@ curl -fL https://data.statmt.org/cc-100/pt.txt.xz | xz -d > cc100_pt.txt #
 curl -fL https://data.statmt.org/cc-100/it.txt.xz | xz -d > cc100_it.txt # 7.8 GB - Italian
 ```
 
-**Unicameral scripts** without case folding, but with other normalization/segmentation challenges:
+__Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:
 
 ```bash
 curl -fL https://data.statmt.org/cc-100/ar.txt.xz | xz -d > cc100_ar.txt # 5.4 GB - Arabic (RTL)
@@ -406,7 +433,7 @@ The [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/)
 Each tar.gz contains `*-sentences.txt` (tab-separated `id\tsentence`), `*-words.txt` (frequencies), and co-occurrence files.
 Standard sizes: 10K, 30K, 100K, 300K, 1M sentences. Check for newer years at the download page.
 
-**Bicameral scripts** with various case folding rules:
+__Bicameral scripts__ with various case folding rules:
 
 ```bash
 curl -fL https://downloads.wortschatz-leipzig.de/corpora/eng_wikipedia_2016_1M.tar.gz | tar -xzf - -O 'eng_wikipedia_2016_1M/eng_wikipedia_2016_1M-sentences.txt' | cut -f2 > leipzig1M_en.txt
@@ -427,7 +454,7 @@ curl -fL https://downloads.wortschatz-leipzig.de/corpora/ita_wikipedia_2021_1M.t
 curl -fL https://downloads.wortschatz-leipzig.de/corpora/lit_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'lit_wikipedia_2021_300K/lit_wikipedia_2021_300K-sentences.txt' | cut -f2 > leipzig300K_lt.txt
 ```
 
-**Unicameral scripts** without case folding, but with other normalization/segmentation challenges:
+__Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:
 
 ```bash
 curl -fL https://downloads.wortschatz-leipzig.de/corpora/ara_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ara_wikipedia_2021_1M/ara_wikipedia_2021_1M-sentences.txt' | cut -f2 > leipzig1M_ar.txt

bench_unicode.md

Lines changed: 40 additions & 2 deletions
@@ -61,7 +61,6 @@ On AMD Zen5 Turin CPUs on different datasets, StringZilla provides the following
 | Farsi 🇮🇷    | 235 MB/s | 858 MB/s  | 4x  | 475 MB/s | 1.42 GB/s | 3x |
 | Georgian 🇬🇪 | 294 MB/s | 192 MB/s  | 1x  | 689 MB/s | 488 MB/s  | 1x |
 | Hebrew 🇮🇱   | 233 MB/s | 1.01 GB/s | 4x  | 473 MB/s | 1.86 GB/s | 4x |
-| Hindi 🇮🇳    | 293 MB/s | 6.32 GB/s | 22x | 682 MB/s | 3.14 GB/s | 5x |
 | Italian 🇮🇹  | 439 MB/s | 2.29 GB/s | 5x  | 268 MB/s | 1.93 GB/s | 7x |
 | Japanese 🇯🇵 | 330 MB/s | 3.51 GB/s | 11x | 726 MB/s | 2.00 GB/s | 3x |
 | Korean 🇰🇷   | 314 MB/s | 861 MB/s  | 3x  | 623 MB/s | 2.80 GB/s | 4x |
@@ -79,14 +78,53 @@ To rerun the benchmarks for all languages:
 RUSTFLAGS="-C target-cpu=native" cargo build --release --bench bench_unicode --features bench_unicode
 bin=$(find target/release/deps -name 'bench_unicode-*' -executable -type f | head -1)
 
-for f in leipzig1M_*.txt leipzig300K_*.txt; do
+for f in leipzig*.txt; do
   [ -f "$f" ] || continue
   echo "=== $f ==="
   STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold" "$bin"
   STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold/" uv run bench_unicode.py
 done
 ```
 
+## Case-Insensitive Substring Search
+
+| Language       | Standard 🦀 | StringZilla 🦀 |      | Standard 🐍 | StringZilla 🐍 |     |
+| :------------- | ----------: | -------------: | ---: | ----------: | -------------: | --: |
+| Arabic 🇸🇦      | 200 MB/s    | 38.55 GB/s     | 193x | 3.01 GB/s   | 14.78 GB/s     | 5x  |
+| Armenian 🇦🇲    | 190 MB/s    | 980 MB/s       | 5x   | 2.07 GB/s   | 860 MB/s       | 0x  |
+| Bengali 🇧🇩     | 300 MB/s    | 28.20 GB/s     | 94x  | 4.51 GB/s   | 21.19 GB/s     | 5x  |
+| Chinese 🇨🇳     | 240 MB/s    | 25.65 GB/s     | 107x | 5.40 GB/s   | 13.94 GB/s     | 3x  |
+| Czech 🇨🇿       | 90 MB/s     | 7.41 GB/s      | 82x  | 1.38 GB/s   | 6.36 GB/s      | 5x  |
+| Dutch 🇳🇱       | 90 MB/s     | 12.61 GB/s     | 140x | 860 MB/s    | 7.99 GB/s      | 9x  |
+| English 🇬🇧     | 80 MB/s     | 12.79 GB/s     | 160x | 770 MB/s    | 5.61 GB/s      | 7x  |
+| Farsi 🇮🇷       | 190 MB/s    | 26.22 GB/s     | 138x | 2.36 GB/s   | 10.70 GB/s     | 5x  |
+| French 🇫🇷      | 90 MB/s     | 10.77 GB/s     | 120x | 1.10 GB/s   | 6.83 GB/s      | 6x  |
+| Georgian 🇬🇪    | 190 MB/s    | 1.03 GB/s      | 5x   | 3.20 GB/s   | 620 MB/s       | 0x  |
+| German 🇩🇪      | 80 MB/s     | 10.67 GB/s     | 133x | 900 MB/s    | 6.08 GB/s      | 7x  |
+| Greek 🇬🇷       | 130 MB/s    | 2.57 GB/s      | 20x  | 1.38 GB/s   | 2.48 GB/s      | 2x  |
+| Hebrew 🇮🇱      | 190 MB/s    | 34.54 GB/s     | 182x | 2.92 GB/s   | 15.72 GB/s     | 5x  |
+| Italian 🇮🇹     | 80 MB/s     | 12.99 GB/s     | 162x | 970 MB/s    | 8.87 GB/s      | 9x  |
+| Japanese 🇯🇵    | 220 MB/s    | 21.71 GB/s     | 99x  | 4.88 GB/s   | 13.17 GB/s     | 3x  |
+| Korean 🇰🇷      | 230 MB/s    | 35.10 GB/s     | 153x | 4.59 GB/s   | 20.05 GB/s     | 4x  |
+| Polish 🇵🇱      | 90 MB/s     | 10.50 GB/s     | 117x | 1.29 GB/s   | 8.02 GB/s      | 6x  |
+| Portuguese 🇧🇷  | 90 MB/s     | 10.72 GB/s     | 119x | 1.10 GB/s   | 8.12 GB/s      | 7x  |
+| Russian 🇷🇺     | 140 MB/s    | 7.12 GB/s      | 51x  | 2.30 GB/s   | 5.70 GB/s      | 2x  |
+| Spanish 🇪🇸     | 90 MB/s     | 11.62 GB/s     | 129x | 1.02 GB/s   | 6.33 GB/s      | 6x  |
+| Tamil 🇮🇳       | 270 MB/s    | 29.53 GB/s     | 109x | 5.81 GB/s   | 23.11 GB/s     | 4x  |
+| Turkish 🇹🇷     | 90 MB/s     | 8.18 GB/s      | 91x  | 1.49 GB/s   | 5.25 GB/s      | 4x  |
+| Ukrainian 🇺🇦   | 140 MB/s    | 8.88 GB/s      | 63x  | 2.26 GB/s   | 5.35 GB/s      | 2x  |
+| Vietnamese 🇻🇳  | 110 MB/s    | 4.25 GB/s      | 39x  | 1.07 GB/s   | 1.12 GB/s      | 1x  |
+
+To rerun the benchmarks for all languages:
+
+```bash
+for f in leipzig*.txt; do
+  [ -f "$f" ] || continue
+  echo "=== $f ==="
+  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=words STRINGWARS_FILTER="case-insensitive-find" STRINGWARS_UNIQUE=1 "$bin"
+done
+```
+
 ---
 
 See [README.md](README.md) for dataset information and replication instructions.
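For readers comparing against the "Standard 🐍" column above: a naive Unicode-correct baseline (my own illustration, not the harness's code) can be built on Python's `str.casefold()`. It answers containment correctly, but since folding changes lengths (ß → ss), indexes into the folded haystack do not map one-to-one back to the original string:

```python
def contains_ci(haystack: str, needle: str) -> bool:
    """Case-insensitive containment via full Unicode case folding."""
    return needle.casefold() in haystack.casefold()

assert contains_ci("Die STRASSE ist lang", "straße")  # ß matches SS
assert contains_ci("ΣΊΣΥΦΟΣ", "σίσυφος")              # final ς matches Σ
assert not contains_ci("hello", "world")
```

Folding the entire haystack per query is also O(haystack) extra work and memory, which is part of why the SIMD-based approaches in the table are so much faster.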
