@@ -98,7 +98,8 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
9898 *
9999 * This function applies full Unicode Case Folding as defined in the Unicode Standard (UAX #21 and
100100 * CaseFolding.txt), covering all bicameral scripts, all offset-based one-to-one folds, all table-based
101- * one-to-one folds, and all normative one-to-many expansions.
101+ * one-to-one folds, and all normative one-to-many expansions. It does not, however, perform any normalization,
102+ * such as NFC or NFKC, so combining marks are left as-is.
102103 *
103104 * The following character mappings are supported:
104105 *
@@ -142,14 +143,10 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold( //
142143 *
143144 * - ICU abandoned Boyer-Moore for Unicode, reverting to linear search for correctness
144145 * - ClickHouse uses Volnitsky with fallback to naive search for problematic characters
145- * - ripgrep uses simple case folding only (no expansion handling)
146+ * - ripgrep uses simple case folding only (no expansion handling), leveraging the Rust regex engine
146147 *
147- * Potential algorithmic improvements for future versions:
148- *
149- * - Streaming comparison with small expansion buffer instead of pre-materializing folded needle
150- * - Fingerprint-based filtering using rolling hash over folded codepoints
151- * - Conservative skip distances that account for maximum expansion ratio (3:1)
152- * - First-codepoint filtering to quickly reject non-matching positions
148+ * StringZilla implements several algorithms. Most importantly, it first finds the longest expansion-free
149+ * slice of the needle and searches for that slice.
153150 *
154151 * @see https://unicode-org.github.io/icu/userguide/collation/string-search.html
155152 * ICU String Search - discusses why Boyer-Moore was abandoned for Unicode
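
The "longest expansion-free slice" idea from the added comment can be illustrated with a minimal sketch. This is not StringZilla's implementation: the names `folds_to_many`, `decode_utf8`, and `longest_expansion_free_slice` are hypothetical, the expansion table lists only a few full-folding entries from CaseFolding.txt, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate: nonzero if the full case fold of `cp` is longer than
 * one code point. Only a few status-F entries from CaseFolding.txt are listed;
 * a real table is considerably larger. */
static int folds_to_many(uint32_t cp) {
    return cp == 0x00DF                    /* U+00DF folds to "ss" */
        || cp == 0x0130                    /* U+0130 folds to "i" + U+0307 */
        || cp == 0x0149                    /* U+0149 folds to U+02BC + "n" */
        || cp == 0x1E9E                    /* U+1E9E folds to "ss" */
        || (cp >= 0xFB00 && cp <= 0xFB06); /* Latin ligatures U+FB00..U+FB06 */
}

/* Decodes one code point and returns its byte length.
 * Assumption: `s` points at the start of a valid UTF-8 sequence. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6) |
              (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Returns the byte length of the longest run of code points whose folds do not
 * expand, and stores the run's starting byte offset in `*offset`. */
static size_t longest_expansion_free_slice(const char *needle, size_t length, size_t *offset) {
    const unsigned char *s = (const unsigned char *)needle;
    size_t best_start = 0, best_length = 0, run_start = 0, i = 0;
    while (i < length) {
        uint32_t cp;
        size_t n = decode_utf8(s + i, &cp);
        if (folds_to_many(cp)) {
            /* The current run ends just before the expanding code point. */
            if (i - run_start > best_length) {
                best_start = run_start;
                best_length = i - run_start;
            }
            run_start = i + n;
        }
        i += n;
    }
    /* Account for a run that extends to the end of the needle. */
    if (length - run_start > best_length) {
        best_start = run_start;
        best_length = length - run_start;
    }
    *offset = best_start;
    return best_length;
}
```

For the 13-byte needle `"Stra\xC3\x9F" "e-Fluss"` ("Straße-Fluss"), the sketch reports the 7-byte slice "e-Fluss" at byte offset 6. Every code point in such a slice folds one-to-one, so a search can look for that slice first and verify the needle's remaining bytes around each candidate match.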