Commit f42aa85

Docs: StringCuZilla design choices
1 parent 044c7cc commit f42aa85

5 files changed: 53 additions and 10 deletions

include/stringcuzilla/features.hpp

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
 /**
  * @brief Hardware-accelerated feature extractions for string collections.
- * @file features.h
+ * @file features.hpp
  * @author Ash Vardanian
  *
  * The `sklearn.feature_extraction` module for @b TF-IDF, `CountVectorizer`, and `HashingVectorizer`
@@ -28,8 +28,8 @@
  * - output hashes into a high-dimensional bit-vector.
  *
  */
-#ifndef STRINGZILLA_FEATURES_H_
-#define STRINGZILLA_FEATURES_H_
+#ifndef STRINGZILLA_FEATURES_HPP_
+#define STRINGZILLA_FEATURES_HPP_
 
 #include "types.h"
 
@@ -142,4 +142,4 @@ SZ_PUBLIC sz_bool_t sz_detect_encoding(sz_cptr_t text, sz_size_t length) {
 #ifdef __cplusplus
 }
 #endif // __cplusplus
-#endif // STRINGZILLA_FEATURES_H_
+#endif // STRINGZILLA_FEATURES_HPP_
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+/**
+ * @brief Hardware-accelerated multi-pattern exact substring search.
+ * @file find_many.hpp
+ * @author Ash Vardanian
+ *
+ * One of the most broadly used algorithms in string processing is the multi-pattern Aho-Corasick
+ * algorithm, which constructs a trie from the patterns, transforms it into a finite state machine,
+ * and then uses it to search for all patterns in the text in a single pass.
+ *
+ * One of its biggest issues is the memory consumption, as the naive implementation requires each
+ * state to be proportional to the size of the alphabet, or 256 for byte-level processing. Such dense
+ * representations simplify transition lookup down to a single memory access, but that access can be
+ * expensive if the memory doesn't fit into the CPU caches for really large vocabulary sizes.
+ *
+ * Addressing this, we provide a sparse layout variant of the FSM that uses predicated SIMD instructions
+ * to rapidly probe the transitions and find the next state. This allows us to use a much smaller state,
+ * fitting in L1/L2 caches much more frequently.
+ */
+#ifndef STRINGZILLA_FIND_MANY_HPP_
+#define STRINGZILLA_FIND_MANY_HPP_
+
+#include "types.h"
+
+#include "compare.h" // `sz_compare`
+#include "memory.h"  // `sz_copy`
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#pragma region Core API
+
+#pragma endregion // Core API
+
+#ifdef __cplusplus
+}
+#endif // __cplusplus
+#endif // STRINGZILLA_FIND_MANY_HPP_
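
Editorial illustration, not part of the commit: the header above is still an empty skeleton, so the following is only a minimal sketch of the dense Aho-Corasick layout its comment describes (trie, then finite state machine, then a single pass over the text). The type `dense_aho_corasick` and its members are hypothetical names chosen for this example; they are assumptions and do not come from StringZilla or StringCuZilla.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical illustration type; not the library's API.
struct dense_aho_corasick {
    static constexpr std::uint32_t none = 0xFFFFFFFFu;

    struct state {
        std::array<std::uint32_t, 256> next; // dense row: one slot per byte value
        std::uint32_t failure = 0;           // state of the longest proper suffix present in the trie
        std::vector<std::size_t> matches;    // indices of patterns that end in this state
        state() { next.fill(none); }
    };

    std::vector<state> states;
    std::vector<std::size_t> lengths; // pattern lengths, used to recover match offsets

    explicit dense_aho_corasick(std::vector<std::string> const &patterns) : states(1) {
        // Step 1: insert every pattern into a trie rooted at state 0.
        for (std::size_t i = 0; i != patterns.size(); ++i) {
            std::uint32_t s = 0;
            for (unsigned char c : patterns[i]) {
                if (states[s].next[c] == none) {
                    states[s].next[c] = static_cast<std::uint32_t>(states.size());
                    states.emplace_back();
                }
                s = states[s].next[c];
            }
            states[s].matches.push_back(i);
            lengths.push_back(patterns[i].size());
        }
        // Step 2: breadth-first pass that sets failure links and fills every
        // missing transition, turning the trie into a complete state machine.
        std::queue<std::uint32_t> pending;
        for (std::uint32_t &target : states[0].next) {
            if (target == none) target = 0; // unknown byte at the root: stay at the root
            else pending.push(target);      // depth-1 states keep the default failure link 0
        }
        while (!pending.empty()) {
            std::uint32_t s = pending.front();
            pending.pop();
            std::uint32_t f = states[s].failure;
            // Patterns that end at the failure state also end here.
            states[s].matches.insert(states[s].matches.end(), //
                                     states[f].matches.begin(), states[f].matches.end());
            for (std::size_t c = 0; c != 256; ++c) {
                std::uint32_t t = states[s].next[c];
                if (t == none) { states[s].next[c] = states[f].next[c]; }
                else {
                    states[t].failure = states[f].next[c];
                    pending.push(t);
                }
            }
        }
    }

    // Step 3: a single pass over the text with one table lookup per byte.
    // Returns (start offset, pattern index) for every occurrence.
    std::vector<std::pair<std::size_t, std::size_t>> find_all(std::string_view text) const {
        std::vector<std::pair<std::size_t, std::size_t>> found;
        std::uint32_t s = 0;
        for (std::size_t i = 0; i != text.size(); ++i) {
            s = states[s].next[static_cast<unsigned char>(text[i])];
            for (std::size_t p : states[s].matches) found.emplace_back(i + 1 - lengths[p], p);
        }
        return found;
    }
};
```

In this sketch the `next` array alone takes 256 × 4 = 1024 bytes per state, which is the memory cost the comment refers to. A sparse variant would presumably store only the outgoing edges that exist and probe them on lookup; that layout is not shown here because the commit does not define it.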
Lines changed: 3 additions & 2 deletions
@@ -1,9 +1,10 @@
 /**
  * @brief CUDA-accelerated string similarity utilities.
- * @file similarities.cuh
+ * @file similarity.cuh
  * @author Ash Vardanian
  *
- * Includes core APIs:
+ * Unlike the OpenMP backend, which also has single-pair similarity scores, the CUDA backend focuses on
+ * batch-processing of large collections of strings, generally assigning a single warp to each string pair:
  *
  * - `sz::cuda::levenshtein_distances` & `sz::cuda::levenshtein_distances_utf8` for Levenshtein edit-distances.
  * - `sz::cuda::needleman_wunsch_score` for weighted Needleman-Wunsch global alignment.

include/stringzilla/stringzilla.h

Lines changed: 8 additions & 4 deletions
@@ -25,11 +25,15 @@
  * - `stringzilla.h` - umbrella header for the core C API.
  * - `stringzilla.hpp` - umbrella header for the core C++ API.
  *
- * It also provides many higher-level algorithms, mostly implemented in C++ with OpenMP and CUDA,
- * also exposed via the stable C 99 ABI, but requiring C++17 and CUDA 17 compilers to build the shared libraries:
+ * It also provides many higher-level parallel algorithms, mostly implemented in C++ with OpenMP and CUDA, also exposed
+ * via the stable C 99 ABI, but requiring C++17 and CUDA 17 compilers to build the shared @b StringCuZilla libraries:
  *
- * - `similarity.hpp` - similarity measures, like Levenshtein distance, Needleman-Wunsch, & Smith-Waterman alignment.
- * - `features.hpp` - feature extraction for TF-IDF and other Machine Learning algorithms.
+ * - `similarity.{hpp,cuh}` - similarity measures, like Levenshtein, Needleman-Wunsch, & Smith-Waterman scores.
+ * - `features.{hpp,cuh}` - feature extraction for TF-IDF and other Machine Learning algorithms.
+ * - `find_many.{hpp,cuh}` - Aho-Corasick multi-pattern search.
+ *
+ * The core implementations of those algorithms are mostly structured as callable structure templates, as opposed to
+ * template functions, to simplify specialized overloads and to reuse state between invocations.
  *
  * @section Compilation Settings
  *
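
Editorial illustration, not part of the commit: a minimal sketch of why a callable structure template can be more convenient than a function template. C++ allows partial specialization of class templates but not of function templates, and an object can keep scratch buffers alive between calls. Every name below (`levenshtein`, `bytes_t`, `ascii_nocase_t`) is hypothetical and assumed for this example only; none of it is the StringCuZilla API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct bytes_t {};        // hypothetical tag: compare raw bytes
struct ascii_nocase_t {}; // hypothetical tag: fold ASCII case before comparing

// Primary template: a callable object that owns its scratch row, so
// repeated invocations reuse the allocation instead of repeating it.
template <typename size_type_, typename mode_type_ = bytes_t>
struct levenshtein {
    std::vector<size_type_> row; // state reused between invocations

    size_type_ operator()(std::string_view a, std::string_view b) {
        row.resize(b.size() + 1);
        for (std::size_t j = 0; j <= b.size(); ++j) row[j] = static_cast<size_type_>(j);
        for (std::size_t i = 1; i <= a.size(); ++i) {
            size_type_ diagonal = row[0]; // value of cell (i - 1, j - 1)
            row[0] = static_cast<size_type_>(i);
            for (std::size_t j = 1; j <= b.size(); ++j) {
                size_type_ above = row[j]; // value of cell (i - 1, j)
                size_type_ substitution = static_cast<size_type_>(diagonal + (a[i - 1] != b[j - 1]));
                size_type_ deletion = static_cast<size_type_>(above + 1);
                size_type_ insertion = static_cast<size_type_>(row[j - 1] + 1);
                row[j] = std::min({substitution, deletion, insertion});
                diagonal = above;
            }
        }
        return row[b.size()];
    }
};

// Partial specialization: fixes the mode, leaves the size type open.
// A function template could not be specialized this way.
template <typename size_type_>
struct levenshtein<size_type_, ascii_nocase_t> {
    levenshtein<size_type_, bytes_t> exact; // delegate that does the actual scoring
    std::string folded_a, folded_b;         // scratch strings, also reused between calls

    size_type_ operator()(std::string_view a, std::string_view b) {
        fold(a, folded_a);
        fold(b, folded_b);
        return exact(folded_a, folded_b);
    }

    static void fold(std::string_view in, std::string &out) {
        out.assign(in.data(), in.size());
        for (char &c : out)
            if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
};
```

A caller would construct one scorer and invoke it many times, for example `levenshtein<std::size_t> score; score("kitten", "sitting");`, so the scratch buffers are typically allocated once and then reused when processing a batch.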
