Commit d9690bf

Chore: Explore case-folding & norms in Unicode 17
1 parent 3445102 commit d9690bf

5 files changed

+1342
-256
lines changed

include/stringzilla/utf8_unpack.h

Lines changed: 5 additions & 8 deletions
@@ -98,7 +98,8 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
  *
  * This function applies full Unicode Case Folding as defined in the Unicode Standard (UAX #21 and
  * CaseFolding.txt), covering all bicameral scripts, all offset-based one-to-one folds, all table-based
- * one-to-one folds, and all normative one-to-many expansions.
+ * one-to-one folds, and all normative one-to-many expansions. It doesn't however perform any normalization,
+ * like NFKC or NFC, so combining marks are treated as-is.
  *
  * The following character mappings are supported:
  *
@@ -142,14 +143,10 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
  *
  * - ICU abandoned Boyer-Moore for Unicode, reverting to linear search for correctness
  * - ClickHouse uses Volnitsky with fallback to naive search for problematic characters
- * - ripgrep uses simple case folding only (no expansion handling)
+ * - RipGrep uses simple case folding only (no expansion handling) leveraging the Rust RegEx engine
  *
- * Potential algorithmic improvements for future versions:
- *
- * - Streaming comparison with small expansion buffer instead of pre-materializing folded needle
- * - Fingerprint-based filtering using rolling hash over folded codepoints
- * - Conservative skip distances that account for maximum expansion ratio (3:1)
- * - First-codepoint filtering to quickly reject non-matching positions
+ * StringZilla implements several algorithms. Most importantly it first locates the longest expansion-free
+ * slice of the needle to locate against.
  *
  * @see https://unicode-org.github.io/icu/userguide/collation/string-search.html
  * ICU String Search - discusses why Boyer-Moore was abandoned for Unicode
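
Note (not part of the commit): the added comment says the search first locates the longest expansion-free slice of the needle. A minimal sketch of that idea follows, under stated assumptions. The names `slice_t`, `decode_utf8`, `is_expanding`, and `longest_expansion_free_slice` are hypothetical and are not StringZilla APIs. The `is_expanding` predicate covers only a small illustrative subset of the one-to-many (status `F`) entries in CaseFolding.txt. A real implementation would need the full table and may also have to treat the characters that expansions produce (such as `s` for `ß`) specially.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte range inside the needle; hypothetical helper type. */
typedef struct {
    size_t offset;
    size_t length;
} slice_t;

/* Decode one codepoint, reading at most `avail` bytes. Performs no validation
 * beyond bounds checking; a malformed or truncated lead byte is consumed as a
 * single byte. Returns the number of bytes consumed. */
static size_t decode_utf8(unsigned char const *s, size_t avail, uint32_t *cp) {
    static unsigned char const lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    size_t len = s[0] < 0x80 ? 1 : (s[0] & 0xE0) == 0xC0 ? 2 : (s[0] & 0xF0) == 0xE0 ? 3
                 : (s[0] & 0xF8) == 0xF0 ? 4 : 1;
    if (len > avail) len = 1;
    *cp = (uint32_t)(s[0] & lead_mask[len]);
    for (size_t k = 1; k < len; ++k) *cp = (*cp << 6) | (uint32_t)(s[k] & 0x3F);
    return len;
}

/* ASSUMPTION: illustrative subset of codepoints whose full case fold is longer
 * than one codepoint. The complete list lives in CaseFolding.txt. */
static int is_expanding(uint32_t cp) {
    return cp == 0x00DF                      /* ß -> "ss" */
           || cp == 0x0130                   /* İ -> "i" + U+0307 */
           || cp == 0x0149                   /* ŉ -> U+02BC + "n" */
           || (cp >= 0xFB00 && cp <= 0xFB06) /* Latin ligatures: ff, fi, fl, ... */;
}

/* Return the longest run of bytes that contains no expanding codepoint.
 * On ties, the earliest run wins. */
static slice_t longest_expansion_free_slice(char const *needle, size_t n) {
    slice_t best = {0, 0};
    size_t run_start = 0, i = 0;
    while (i < n) {
        uint32_t cp;
        size_t len = decode_utf8((unsigned char const *)needle + i, n - i, &cp);
        if (is_expanding(cp)) {
            if (i - run_start > best.length) {
                best.offset = run_start;
                best.length = i - run_start;
            }
            run_start = i + len; /* the next run starts after the expanding codepoint */
        }
        i += len;
    }
    if (n - run_start > best.length) {
        best.offset = run_start;
        best.length = n - run_start;
    }
    return best;
}
```

For the needle `Straßenbahn` this picks the six bytes of `enbahn` after the `ß`. A search could then scan the haystack for that slice with a fast one-to-one case-insensitive matcher and verify the expansion-bearing remainder only around candidate positions. Whether StringZilla verifies candidates this way is not stated in the diff.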
