Commit ae307ab

Merge: Case-Folding UTF-8 (ẞ → ss)
2 parents 3445102 + fa7422c commit ae307ab

16 files changed, +6192 -1364 lines changed

CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -435,6 +435,7 @@ if (STRINGZILLA_BUILD_BENCHMARK)
 define_launcher(stringzilla_bench_find_cpp20 scripts/bench_find.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_sequence_cpp20 scripts/bench_sequence.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_token_cpp20 scripts/bench_token.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
+define_launcher(stringzilla_bench_unicode_cpp20 scripts/bench_unicode.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_container_cpp20 scripts/bench_container.cpp 20 "${STRINGZILLA_TARGET_ARCH}")
 define_launcher(stringzilla_bench_memory_cpp20 scripts/bench_memory.cpp 20 "${STRINGZILLA_TARGET_ARCH}")

README.md

Lines changed: 13 additions & 13 deletions
@@ -27,13 +27,13 @@ It __accelerates exact and fuzzy string matching, hashing, edit distance computa
 - 🐂 __[C](#basic-usage-with-c-99-and-newer):__ Upgrade LibC's `<string.h>` to `<stringzilla/stringzilla.h>` in C 99
 - 🐉 __[C++](#basic-usage-with-c-11-and-newer):__ Upgrade STL's `<string>` to `<stringzilla/stringzilla.hpp>` in C++ 11
 - 🧮 __[CUDA](#cuda):__ Process in-bulk with `<stringzillas/stringzillas.cuh>` in CUDA C++ 17
-- 🐍 __[Python](#quick-start-python-🐍):__ Upgrade your `str` to faster `Str`
-- 🦀 __[Rust](#quick-start-rust-🦀):__ Use the `StringZilla` traits crate
-- 🦫 __[Go](#quick-start-golang-🦫):__ Use the `StringZilla` cGo module
-- 🍎 __[Swift](#quick-start-swift-🍏):__ Use the `String+StringZilla` extension
-- 🟨 __[JavaScript](#quick-start-javascript-🟨):__ Use the `StringZilla` library
+- 🐍 __[Python](#quick-start-python):__ Upgrade your `str` to faster `Str`
+- 🦀 __[Rust](#quick-start-rust):__ Use the `StringZilla` traits crate
+- 🦫 __[Go](#quick-start-golang):__ Use the `StringZilla` cGo module
+- 🍎 __[Swift](#quick-start-swift):__ Use the `String+StringZilla` extension
+- 🟨 __[JavaScript](#quick-start-javascript):__ Use the `StringZilla` library
 - 🐚 __[Shell][faq-shell]__: Accelerate common CLI tools with `sz_` prefix
-- 📚 Researcher? Jump to [Algorithms & Design Decisions](#algorithms--design-decisions-📚)
+- 📚 Researcher? Jump to [Algorithms & Design Decisions](#algorithms--design-decisions)
 - 💡 Thinking to contribute? Look for ["good first issues"][first-issues]
 - 🤝 And check the [guide](https://github.com/ashvardanian/StringZilla/blob/main/CONTRIBUTING.md) to set up the environment
 - Want more bindings or features? Let [me](https://github.com/ashvardanian) know!
@@ -343,7 +343,7 @@ Consider contributing if you need a feature that's not yet implemented.
 > ⚪ are considered.
 > ❌ are not intended.

-## Quick Start: Python 🐍
+## Quick Start: Python

 Python bindings are available on PyPI for Python 3.8+, and can be installed with `pip`.

@@ -751,7 +751,7 @@ arr = pa.Array.from_buffers(
 That means you can convert `Str` to `pyarrow.Buffer` and `Strs` to `pyarrow.Array` without extra copies.
 For more details on the tape-like layouts, refer to the [StringTape](https://github.com/ashvardanian/StringTape) repository.

-## Quick Start: C/C++ 🛠️
+## Quick Start: C/C++

 The C library is header-only, so you can just copy the `stringzilla.h` header into your project.
 Same applies to C++, where you would copy the `stringzilla.hpp` header.
@@ -1527,7 +1527,7 @@ __`STRINGZILLA_BUILD_SHARED`, `STRINGZILLA_BUILD_TEST`, `STRINGZILLA_BUILD_BENCH
 > It's synonymous to GCC's `-march` flag and is used to enable/disable the appropriate instruction sets.
 > You can also disable the shared library build, if you don't need it.

-## Quick Start: Rust 🦀
+## Quick Start: Rust

 StringZilla is available as a Rust crate, with documentation available on [docs.rs/stringzilla](https://docs.rs/stringzilla).
 You can immediately check the installed version and the used hardware capabilities with following commands:
@@ -1760,7 +1760,7 @@ assert!(hashes.iter().any(|&h| h != u32::MAX)); // Verify computation occurred
 assert!(counts.iter().any(|&c| c != u32::MAX));
 ```

-## Quick Start: JavaScript 🟨
+## Quick Start: JavaScript

 Install the Node.js package and use zero-copy `Buffer` APIs.

@@ -1833,7 +1833,7 @@ const digestBuffer = hasher.digest(); // returns Buffer (32 bytes)
 const digestHex = hasher.hexdigest(); // returns string (64 hex chars)
 ```

-## Quick Start: Swift 🍏
+## Quick Start: Swift

 StringZilla can be added as a dependency in the Swift Package Manager.
 In your `Package.swift` file, add the following:
@@ -1891,7 +1891,7 @@ let digestBytes = hasher.digest() // [UInt8] (32 bytes)
 let digestHex = hasher.hexdigest() // String (64 hex chars)
 ```

-## Quick Start: GoLang 🦫
+## Quick Start: GoLang

 Add the Go binding as a module dependency:

@@ -2003,7 +2003,7 @@ size := hasher.Size() // 32
 blockSize := hasher.BlockSize() // 64
 ```

-## Algorithms & Design Decisions 📚
+## Algorithms & Design Decisions

 StringZilla aims to optimize some of the slowest string operations.
 Some popular operations, however, like equality comparisons and relative order checking, almost always complete on some of the very first bytes in either string.

c/stringzilla.c

Lines changed: 26 additions & 6 deletions
@@ -60,10 +60,13 @@ typedef struct sz_implementations_t {

 sz_utf8_count_t utf8_count;
 sz_utf8_find_nth_t utf8_find_nth;
-sz_utf8_unpack_chunk_t utf8_unpack_chunk;
 sz_utf8_find_boundary_t utf8_find_newline;
 sz_utf8_find_boundary_t utf8_find_whitespace;
+sz_utf8_unpack_chunk_t utf8_unpack_chunk;
+
 sz_utf8_case_fold_t utf8_case_fold;
+sz_utf8_case_insensitive_find_t utf8_case_insensitive_find;
+sz_utf8_case_insensitive_order_t utf8_case_insensitive_order;

 sz_sequence_argsort_t sequence_argsort;
 sz_sequence_intersect_t sequence_intersect;
@@ -108,10 +111,13 @@ static void sz_dispatch_table_update_implementation_(sz_capability_t caps) {

 impl->utf8_count = sz_utf8_count_serial;
 impl->utf8_find_nth = sz_utf8_find_nth_serial;
-impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_serial;
 impl->utf8_find_newline = sz_utf8_find_newline_serial;
 impl->utf8_find_whitespace = sz_utf8_find_whitespace_serial;
+impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_serial;
+
 impl->utf8_case_fold = sz_utf8_case_fold_serial;
+impl->utf8_case_insensitive_find = sz_utf8_case_insensitive_find_serial;
+impl->utf8_case_insensitive_order = sz_utf8_case_insensitive_order_serial;

 impl->sequence_argsort = sz_sequence_argsort_serial;
 impl->sequence_intersect = sz_sequence_intersect_serial;
@@ -164,10 +170,8 @@ static void sz_dispatch_table_update_implementation_(sz_capability_t caps) {

 impl->utf8_count = sz_utf8_count_haswell;
 impl->utf8_find_nth = sz_utf8_find_nth_haswell;
-impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_haswell;
 impl->utf8_find_newline = sz_utf8_find_newline_haswell;
 impl->utf8_find_whitespace = sz_utf8_find_whitespace_haswell;
-impl->utf8_case_fold = sz_utf8_case_fold_haswell;
 }
 #endif

@@ -204,10 +208,12 @@ static void sz_dispatch_table_update_implementation_(sz_capability_t caps) {

 impl->utf8_count = sz_utf8_count_ice;
 impl->utf8_find_nth = sz_utf8_find_nth_ice;
-impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_ice;
 impl->utf8_find_newline = sz_utf8_find_newline_ice;
 impl->utf8_find_whitespace = sz_utf8_find_whitespace_ice;
+impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_ice;
+
 impl->utf8_case_fold = sz_utf8_case_fold_ice;
+impl->utf8_case_insensitive_find = sz_utf8_case_insensitive_find_ice;

 impl->lookup = sz_lookup_ice;

@@ -246,10 +252,12 @@ static void sz_dispatch_table_update_implementation_(sz_capability_t caps) {

 impl->utf8_count = sz_utf8_count_neon;
 impl->utf8_find_nth = sz_utf8_find_nth_neon;
-impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_neon;
 impl->utf8_find_newline = sz_utf8_find_newline_neon;
 impl->utf8_find_whitespace = sz_utf8_find_whitespace_neon;
+impl->utf8_unpack_chunk = sz_utf8_unpack_chunk_neon;
+
 impl->utf8_case_fold = sz_utf8_case_fold_neon;
+impl->utf8_case_insensitive_find = sz_utf8_case_insensitive_find_neon;
 }
 #endif

@@ -507,6 +515,18 @@ SZ_DYNAMIC sz_size_t sz_utf8_case_fold(sz_cptr_t source, sz_size_t source_length
 return sz_dispatch_table.utf8_case_fold(source, source_length, destination);
 }

+SZ_DYNAMIC sz_cptr_t sz_utf8_case_insensitive_find( //
+sz_cptr_t haystack, sz_size_t haystack_length, //
+sz_cptr_t needle, sz_size_t needle_length, sz_size_t *matched_length) {
+return sz_dispatch_table.utf8_case_insensitive_find(haystack, haystack_length, needle, needle_length,
+matched_length);
+}
+
+SZ_DYNAMIC sz_ordering_t sz_utf8_case_insensitive_order( //
+sz_cptr_t a, sz_size_t a_length, sz_cptr_t b, sz_size_t b_length) {
+return sz_dispatch_table.utf8_case_insensitive_order(a, a_length, b, b_length);
+}
+
 // Provide overrides for the libc mem* functions
 #if SZ_OVERRIDE_LIBC && !defined(__CYGWIN__)
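
The hunks above follow a runtime-dispatch pattern: every entry of the table first receives the `_serial` implementation, and capability-specific branches then overwrite selected entries. A minimal, self-contained sketch of that pattern is shown below. All `toy_` names and the capability bit values are hypothetical and are not StringZilla's API. One consequence worth noting: an entry that a branch no longer assigns (as `utf8_case_fold` in the Haswell branch after this commit) would presumably keep its serial default.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function-pointer type, mirroring the `sz_*_t` typedef style. */
typedef size_t (*toy_utf8_count_t)(char const *, size_t);

/* Hypothetical capability bits; the real `sz_capability_t` values differ. */
enum { toy_cap_serial_k = 1u << 0, toy_cap_simd_k = 1u << 1 };

typedef struct toy_implementations_t {
    toy_utf8_count_t utf8_count;
} toy_implementations_t;

static toy_implementations_t toy_dispatch_table;

/* Portable baseline: counts bytes that are not UTF-8 continuation bytes. */
static size_t toy_utf8_count_serial(char const *text, size_t length) {
    size_t count = 0;
    for (size_t i = 0; i < length; ++i) count += ((unsigned char)text[i] & 0xC0) != 0x80;
    return count;
}

/* Stand-in for a vectorized kernel; here it simply reuses the serial logic. */
static size_t toy_utf8_count_simd(char const *text, size_t length) {
    return toy_utf8_count_serial(text, length);
}

/* Serial defaults go in first, then each supported capability overrides them. */
static void toy_dispatch_table_update(unsigned caps) {
    toy_implementations_t *impl = &toy_dispatch_table;
    impl->utf8_count = toy_utf8_count_serial;
    if (caps & toy_cap_simd_k) impl->utf8_count = toy_utf8_count_simd;
}

/* Public entry point: one indirect call through the table. */
static size_t toy_utf8_count(char const *text, size_t length) {
    assert(toy_dispatch_table.utf8_count != NULL);
    return toy_dispatch_table.utf8_count(text, length);
}
```

Callers would run `toy_dispatch_table_update` once at startup and afterwards only use the thin wrapper, which is the same shape as the `SZ_DYNAMIC` wrappers added in this file.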

include/stringzilla/stringzilla.h

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@
 #include "hash.h" // `sz_bytesum`, `sz_hash`, `sz_state_init`, `sz_state_stream`, `sz_state_fold`
 #include "find.h" // `sz_find`, `sz_find_byteset`, `sz_rfind`
 #include "utf8.h" // `sz_utf8_find_newline`, `sz_utf8_find_whitespace`, `sz_utf8_find_nth`, `sz_utf8_valid`
-#include "utf8_unpack.h" // `sz_utf8_case_insensitive_find`, `sz_utf8_unpack_chunk`
+#include "utf8_case.h" // `sz_utf8_case_insensitive_find`, `sz_utf8_unpack_chunk`
 #include "small_string.h" // `sz_string_t`, `sz_string_init`, `sz_string_free`
 #include "sort.h" // `sz_sequence_argsort`, `sz_pgrams_sort`
 #include "intersect.h" // `sz_sequence_intersect`

include/stringzilla/types.h

Lines changed: 6 additions & 0 deletions
@@ -788,6 +788,12 @@ typedef sz_cptr_t (*sz_utf8_unpack_chunk_t)(sz_cptr_t, sz_size_t, sz_rune_t *, s
 /** @brief Signature of `sz_utf8_case_fold`. */
 typedef sz_size_t (*sz_utf8_case_fold_t)(sz_cptr_t, sz_size_t, sz_ptr_t);

+/** @brief Signature of `sz_utf8_case_insensitive_find`. */
+typedef sz_cptr_t (*sz_utf8_case_insensitive_find_t)(sz_cptr_t, sz_size_t, sz_cptr_t, sz_size_t, sz_size_t *);
+
+/** @brief Signature of `sz_utf8_case_insensitive_order`. */
+typedef sz_ordering_t (*sz_utf8_case_insensitive_order_t)(sz_cptr_t, sz_size_t, sz_cptr_t, sz_size_t);
+
 /** @brief Signature of `sz_fill_random`. */
 typedef void (*sz_fill_random_t)(sz_ptr_t, sz_size_t, sz_u64_t);
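
The trailing `sz_size_t *` in the find signature is presumably an output for the length of the match in the haystack. That reading is an assumption drawn from the commit title (ẞ → ss): when folding changes byte length, the matched haystack span need not be as long as the needle. The toy sketch below illustrates the idea for ASCII letters plus `ß` and `ẞ` only. Every `toy_` name is hypothetical, and none of this is the library's algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: folds the character at `s` into at most 2 bytes in `out`,
 * reports how many source bytes it consumed, and returns the folded size.
 * Only ASCII letters, U+00DF and U+1E9E are folded; other bytes pass through. */
static size_t toy_fold_step(char const *s, size_t n, char out[2], size_t *consumed) {
    unsigned char const *u = (unsigned char const *)s;
    int is_small_eszett = n >= 2 && u[0] == 0xC3 && u[1] == 0x9F;                   /* U+00DF */
    int is_capital_eszett = n >= 3 && u[0] == 0xE1 && u[1] == 0xBA && u[2] == 0x9E; /* U+1E9E */
    if (is_small_eszett || is_capital_eszett) {
        out[0] = out[1] = 's';
        *consumed = is_small_eszett ? 2 : 3;
        return 2;
    }
    out[0] = (u[0] >= 'A' && u[0] <= 'Z') ? (char)(u[0] + ('a' - 'A')) : (char)u[0];
    *consumed = 1;
    return 1;
}

/* Toy search: the needle is assumed to fold to at most 64 bytes. A folded
 * character that only partially overlaps the needle counts as a mismatch. */
static char const *toy_case_insensitive_find(char const *haystack, size_t haystack_length,
                                             char const *needle, size_t needle_length,
                                             size_t *matched_length) {
    assert(haystack && needle && matched_length);
    char folded[64], piece[2];
    size_t folded_length = 0, consumed = 0;
    /* Fold the needle once up front. */
    for (size_t i = 0; i < needle_length; i += consumed) {
        size_t produced = toy_fold_step(needle + i, needle_length - i, piece, &consumed);
        if (folded_length + produced > sizeof(folded)) return NULL;
        for (size_t p = 0; p < produced; ++p) folded[folded_length++] = piece[p];
    }
    /* Try every start offset, folding the haystack on the fly. */
    for (size_t start = 0; start <= haystack_length; ++start) {
        size_t position = start, matched = 0;
        while (matched < folded_length && position < haystack_length) {
            size_t produced = toy_fold_step(haystack + position, haystack_length - position, piece, &consumed);
            if (matched + produced > folded_length) break;
            size_t p = 0;
            while (p < produced && piece[p] == folded[matched + p]) ++p;
            if (p != produced) break;
            matched += produced;
            position += consumed;
        }
        if (matched == folded_length) {
            *matched_length = position - start; /* haystack bytes, not needle bytes */
            return haystack + start;
        }
    }
    return NULL;
}
```

Searching for the 7-byte needle `strasse` in `Die STRAẞE` with this sketch reports a match spanning 8 haystack bytes, because `ẞ` takes 3 bytes in UTF-8 but folds to 2.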
