Commit dd57536: "Docs: Formatting"
1 parent 9460fd4 commit dd57536

8 files changed, +64 -42 lines changed

.gitignore

Lines changed: 3 additions & 2 deletions

```diff
@@ -35,7 +35,8 @@ node_modules/
 leipzig1M.txt
 enwik9.txt
 xlsum.csv
-human_protein_1200row_800len.txt
+proteins.txt
 
 # StringZilla-specific log files
-/failed_sz_*
+/failed_sz_*
+/.cache/
```

CONTRIBUTING.md

Lines changed: 12 additions & 12 deletions

```diff
@@ -17,24 +17,24 @@ The project is split into the following parts:
 
 For minimal test coverage, check the following scripts:
 
-- `test.cpp` - tests C++ API (not underlying C) against STL.
-- `test.py` - tests Python API against native strings.
-- `test.js`.
+- `scripts/test.cpp` - tests C++ API (not underlying C) against STL.
+- `scripts/test.py` - tests Python API against native strings.
+- `scripts/test.js`.
 
 At the C++ level all benchmarks also validate the results against the STL baseline, serving as tests on real-world data.
 They have the broadest coverage of the library, and are the most important to keep up-to-date:
 
-- `bench_token.cpp` - token-level ops, like hashing, ordering, equality checks.
-- `bench_search.cpp` - bidirectional substring search, both exact and fuzzy.
-- `bench_similarity.cpp` - benchmark all edit distance backends.
-- `bench_sequence.cpp` - sorting, partitioning, merging.
-- `bench_container.cpp` - STL containers with different string keys.
+- `scripts/bench_token.cpp` - token-level ops, like hashing, ordering, equality checks.
+- `scripts/bench_search.cpp` - bidirectional substring search, both exact and fuzzy.
+- `scripts/bench_similarity.cpp` - benchmark all edit distance backends.
+- `scripts/bench_sequence.cpp` - sorting, partitioning, merging.
+- `scripts/bench_container.cpp` - STL containers with different string keys.
 
 The role of Python benchmarks is less to provide absolute number, but to compare against popular tools in the Python ecosystem.
 
-- `bench_search.(py|ipynb)` - compares against native Python `str`.
-- `bench_sequence.(py|ipynb)` - compares against `pandas`.
-- `bench_similarity.(ipynb)` - compares against `jellyfish`, `editdistance`, etc.
+- `scripts/bench_search.(py|ipynb)` - compares against native Python `str`.
+- `scripts/bench_sequence.(py|ipynb)` - compares against `pandas`.
+- `scripts/bench_similarity.(ipynb)` - compares against `jellyfish`, `editdistance`, etc.
 
 ## Benchmarking Datasets
 
@@ -55,7 +55,7 @@ unzip enwik9.zip && rm enwik9.zip && mv enwik9 enwik9.txt
 # XL Sum dataset for multilingual extractive summarization
 # 4.7 GB (1.7 GB compressed), 1'004'598 lines of UTF8, 268'435'456 tokens of mean length 8
 wget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz
-gzip -d xlsum.csv.gz
+gzip -d xlsum.csv.gz
 
 # Human chromosome generator dataset generated by:
 # https://github.com/rghilduta/human-chromosome-data-generator/blob/main/generate_chromosome_data.sh
```

include/stringzilla/features.h

Lines changed: 13 additions & 2 deletions

```diff
@@ -3,7 +3,7 @@
  * @file features.h
  * @author Ash Vardanian
  *
- * The `sklearn.feature_extraction` module for @b TF-IDF, "CountVectorizer", and "HashingVectorizer"
+ * The `sklearn.feature_extraction` module for @b TF-IDF, `CountVectorizer`, and `HashingVectorizer`
  * is one of the most commonly used in the industry due to its extreme flexibility. It can:
  *
  * - Tokenize by words, N-grams, or in-word N-grams.
@@ -14,7 +14,18 @@
  *
  * That level of flexibility is not feasible for a hardware-accelerated SIMD library, but we
  * can provide a set of APIs that can be used to build such a library on top of StringZilla.
- * That functionality will reuse our @b Trie data-structure for vocabulary building histograms.
+ * That functionality can reuse our @b Trie data-structure for vocabulary building histograms.
+ *
+ * In this file, we mostly focus on batch-level hashing operations, similar to the `intersect.h`
+ * module. There, we cross-reference two sets of strings, and here we only analyze one at a time.
+ *
+ * - The text comes in pre-tokenized form, as a stream, not even indexed-lookup is needed,
+ *   unlike the `sz_sequence_t` in `sz_intersect` APIs.
+ * - We scatter those tokens into the output in multiple forms:
+ *
+ *   - output hashes into a continuous buffer.
+ *   - output hashes into a hash-map with counts.
+ *   - output hashes into a high-dimensional bit-vector.
 *
 */
 #ifndef STRINGZILLA_FEATURES_H_
```
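The three output forms listed in the new `features.h` docs (contiguous hash buffer, counted hash-map, high-dimensional bit-vector) can be sketched in plain Python. This is an illustration of the idea only, not StringZilla's API: `token_hash` is a hypothetical stand-in built on `hashlib`, and the 1024-bit dimensionality is an arbitrary choice.

```python
import hashlib

def token_hash(token: str) -> int:
    # Hypothetical 64-bit token hash; stands in for the library's hasher.
    return int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "little")

def hashes_buffer(tokens):
    # Form 1: scatter hashes into a contiguous buffer, one slot per token.
    return [token_hash(t) for t in tokens]

def hashes_counts(tokens):
    # Form 2: a hash-map with occurrence counts per hashed token.
    counts = {}
    for t in tokens:
        h = token_hash(t)
        counts[h] = counts.get(h, 0) + 1
    return counts

def hashes_bitvector(tokens, dims=1024):
    # Form 3: set bit (hash mod dims) in a fixed-size bit-vector,
    # trading exact counts for constant memory.
    bits = bytearray(dims // 8)
    for t in tokens:
        h = token_hash(t) % dims
        bits[h // 8] |= 1 << (h % 8)
    return bytes(bits)
```

The bit-vector form is similar in spirit to hashed binary features in sklearn's `feature_extraction` module, except collisions simply merge bits instead of summing counts.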

include/stringzilla/similarity.h

Lines changed: 13 additions & 12 deletions

```diff
@@ -379,11 +379,11 @@ SZ_INTERNAL sz_status_t _sz_levenshtein_distance_skewed_diagonals_serial( //
  * Stores only 2 rows of the Levenshtein matrix, but uses 64-bit integers for the distance values,
  * and upcasts UTF8 variable-length codepoints to 64-bit integers for faster addressing.
  *
- * ! In the worst case for 2 strings of length 100, that contain just one 16-bit codepoint this will result in
- *   extra:
- *   + 2 rows * 100 slots * 8 bytes/slot = 1600 bytes of memory for the two rows of the Levenshtein matrix rows.
- *   + 100 codepoints * 2 strings * 4 bytes/codepoint = 800 bytes of memory for the UTF8 buffer.
- *   = 2400 bytes of memory or @b 12x memory amplification!
+ * ! In the worst case for 2 strings of length 100, that contain just one 16-bit codepoint this algorithm
+ * ! will require 2400 bytes of memory:
+ * ! + 2 rows * 100 slots * 8 bytes/slot = 1600 bytes of memory for the two rows of the Levenshtein matrix rows.
+ * ! + 100 codepoints * 2 strings * 4 bytes/codepoint = 800 bytes of memory for the UTF8 buffer.
+ * ! = 2400 bytes of memory or @b 12x memory amplification!
  */
 SZ_INTERNAL sz_status_t _sz_levenshtein_distance_wagner_fisher_serial( //
     sz_cptr_t longer, sz_size_t longer_length, //
@@ -764,15 +764,16 @@ SZ_PUBLIC sz_status_t sz_hamming_distance_utf8_serial( //
     apply_to = function)
 
 /**
- * @brief  Computes the edit distance between two very short byte-strings using the AVX-512VBMI extensions.
+ * @brief Computes the edit distance between two very short byte-strings using the AVX-512VBMI extensions.
+ * @sa `sz::levenshtein_distance_openmp`.
  *
  * Applies to string lengths up to 63, and evaluates at most (63 * 2 + 1 = 127) diagonals, or just as many loop
  * cycles. Supports an early exit, if the distance is bounded. Keeps all of the data and Levenshtein matrices skew
  * diagonal in just a couple of registers. Benefits from the @b `vpermb` instructions, that can rotate the bytes
  * across the entire ZMM register.
  *
- *? Bounds check, for inputs ranging from 33 to 64 bytes doesn't affect the performance at all.
- *? It's also worth exploring `_mm512_alignr_epi8` and `_mm512_maskz_compress_epi8` for the shift.
+ * ? Bounds check, for inputs ranging from 33 to 64 bytes doesn't affect the performance at all.
+ * ? It's also worth exploring `_mm512_alignr_epi8` and `_mm512_maskz_compress_epi8` for the shift.
  */
 SZ_INTERNAL sz_size_t _sz_levenshtein_distance_skewed_diagonals_upto63_ice( //
     sz_cptr_t shorter, sz_size_t shorter_length, //
@@ -809,7 +810,7 @@ SZ_INTERNAL sz_size_t _sz_levenshtein_distance_skewed_diagonals_upto63_ice( //
     bound_vec.zmm = _mm512_set1_epi8(bound <= 255 ? (sz_u8_t)bound : 255);
 
     // To simplify comparisons and traversals, we want to reverse the order of bytes in the shorter string.
-    shorter_vec.zmm = _mm512_setzero_si512(); //? To simplify debugging.
+    shorter_vec.zmm = _mm512_setzero_si512(); //? To simplify debugging, but can be noise
     for (sz_size_t i = 0; i != shorter_length; ++i) shorter_vec.u8s[63 - i] = shorter[i];
     shorter_rotated_vec.zmm = _mm512_permutexvar_epi8(rotate_right_vec.zmm, shorter_vec.zmm);
 
@@ -1034,7 +1035,7 @@ SZ_INTERNAL sz_status_t _sz_levenshtein_distance_skewed_diagonals_upto65k_ice( //
     // The length of the longest (main) diagonal would be `shorter_dim = (shorter_length + 1)`.
     sz_size_t const shorter_dim = shorter_length + 1;
     sz_size_t const longer_dim = longer_length + 1;
-    // Unlike the serial version, we also want to avoid reverse-order iteration over teh shorter string.
+    // Unlike the serial version, we also want to avoid reverse-order iteration over the shorter string.
     // So let's allocate a bit more memory and reverse-export our shorter string into that buffer.
     sz_size_t const buffer_length = sizeof(sz_u16_t) * longer_dim * 3 + shorter_length;
     sz_u16_t *const distances = (sz_u16_t *)alloc->allocate(buffer_length, alloc->handle);
@@ -1225,8 +1226,8 @@ SZ_PUBLIC sz_status_t sz_levenshtein_distance_ice( //
  * Unlike the `_sz_levenshtein_distance_skewed_diagonals_upto65k_avx512` method, this one uses signed integers to store
  * the accumulated score. Moreover, it's primary bottleneck is the latency of gathering the substitution costs
  * from the substitution matrix. If we use the diagonal order, we will be comparing a slice of the first string
- * with a slice of the second. If we stick to the conventional horizontal order, we will be comparing one character
- * against a slice, which is much easier to optimize. In that case we are sampling costs not from arbitrary parts of
+ * with a slice of the second. If we stick to the conventional horizontal order, we will be comparing one character
+ * against a slice, which is much easier to optimize. In that case we are sampling costs not from arbitrary parts of
  * a 256 x 256 matrix, but from a single row!
  */
 SZ_INTERNAL sz_status_t _sz_needleman_wunsch_score_wagner_fisher_upto17m_ice( //
```
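The two-row scheme that the corrected comment describes, keeping only the previous and current rows of the Levenshtein matrix, can be sketched in Python. This is a generic Wagner-Fischer illustration under unit costs, not the library's `_sz_levenshtein_distance_wagner_fisher_serial` implementation.

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    # Keep only two rows of the (len(a)+1) x (len(b)+1) Levenshtein matrix.
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, columns over the shorter
    previous = list(range(len(b) + 1))  # distances from the empty prefix
    for i, ca in enumerate(a, start=1):
        current = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            current[j] = min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match or substitution
            )
        previous = current
    return previous[-1]
```

With 64-bit slots, two rows for 100-codepoint strings cost roughly 2 * 101 * 8 ≈ 1600 bytes, matching the arithmetic in the fixed comment.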

pyproject.toml

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,4 +1,4 @@
-# This file configures wheels compilation for `cibuilwheel` for StringZilla CPython bindings.
+# This file configures wheels compilation for `cibuildwheel` for StringZilla CPython bindings.
 # On a good day it will produce:
 # - `manylinux` and `musllinux` wheels for Linux on x86_64, aarch64, i686, ppc64le, s390x;
 # - `macos` wheels for x86_64, arm64, and universal2;
@@ -28,7 +28,7 @@ build-verbosity = 0
 # - on Windows: AMD64, x86, ARM64
 # https://cibuildwheel.readthedocs.io/en/stable/options/#archs
 #
-# Important to note, not all those paltforms have recent images.
+# Important to note, not all those platforms have recent images.
 # The `manylinux_2_28` seems to be missing for `i686`.
 # The `i686` is 32-bit x86, and `x86_64` is 64-bit x86.
 archs = ["all"]
```

rust/lib.rs

Lines changed: 7 additions & 6 deletions

```diff
@@ -1,10 +1,11 @@
 #![cfg_attr(not(test), no_std)]
-
-/// The `sz` module provides a collection of string searching and manipulation functionality,
-/// designed for high efficiency and compatibility with `no_std` environments. This module offers
-/// various utilities for byte string manipulation, including search, reverse search, and
-/// edit-distance calculations, suitable for a wide range of applications from basic string
-/// processing to complex text analysis tasks.
+#[doc = r"
+The `sz` module provides a collection of string searching and manipulation functionality,
+designed for high efficiency and compatibility with `no_std` environments. This module offers
+various utilities for byte string manipulation, including search, reverse search, and
+edit-distance calculations, suitable for a wide range of applications from basic string
+processing to complex text analysis tasks.
+"]
 
 pub mod sz {
```
scripts/bench_search.cpp

Lines changed: 6 additions & 1 deletion

```diff
@@ -11,7 +11,12 @@
  *
  * For substring search, the number of operations per second are reported as the number of character-level comparisons
  * happening in the worst case in the naive algorithm, meaning O(N*M) for N characters in the haystack and M in the
- * needle.
+ * needle. In byteset search, the number of operations per second is computed the same way and the following character
+ * sets are tested against each scanned token:
+ *
+ * - "\n\r\v\f": 4 tabs
+ * - "</>&'\"=[]": 9 html
+ * - "0123456789": 10 digits
 *
 * Instead of CLI arguments, for compatibility with @b StringWa.rs, the following environment variables are used:
 * - `STRINGWARS_DATASET` : Path to the dataset file.
```
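The throughput metric the comment above describes, counting worst-case naive comparisons rather than bytes scanned, can be expressed as a tiny helper. This is a sketch with hypothetical names, reading O(N*M) literally as N*M comparisons.

```python
def worst_case_comparisons(haystack_length: int, needle_length: int) -> int:
    # The benchmark reports throughput as if every haystack character were
    # compared against every needle character: the naive O(N*M) bound.
    return haystack_length * needle_length

def comparisons_per_second(haystack_length: int, needle_length: int, seconds: float) -> float:
    # Normalized metric: worst-case comparisons divided by wall-clock time.
    return worst_case_comparisons(haystack_length, needle_length) / seconds
```

This makes faster-than-naive algorithms report proportionally higher "operations per second" on the same inputs, which is the point of the normalization.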

scripts/test.hpp

Lines changed: 8 additions & 5 deletions

```diff
@@ -14,13 +14,13 @@ namespace scripts {
 
 inline std::string read_file(std::string path) {
     std::ifstream stream(path);
-    if (!stream.is_open()) { throw std::runtime_error("Failed to open file: " + path); }
+    if (!stream.is_open()) throw std::runtime_error("Failed to open file: " + path);
     return std::string((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());
 }
 
 inline void write_file(std::string path, std::string content) {
     std::ofstream stream(path);
-    if (!stream.is_open()) { throw std::runtime_error("Failed to open file: " + path); }
+    if (!stream.is_open()) throw std::runtime_error("Failed to open file: " + path);
     stream << content;
     stream.close();
 }
@@ -74,13 +74,15 @@ inline std::string repeat(std::string const &patten, std::size_t count) {
 }
 
 /**
- * @brief A callback type for iterating over consecutive random-length slices of a string.
+ * @brief Randomly slices a string into consecutive parts and passes those to @p slice_callback.
+ * @warning Is @b single-threaded in nature, as it depends on the `global_random_generator`.
 */
 template <typename slice_callback_type_>
 inline void iterate_in_random_slices(std::string const &text, slice_callback_type_ &&slice_callback) {
     std::size_t remaining = text.size();
     while (remaining > 0) {
-        std::size_t slice_length = std::uniform_int_distribution<std::size_t>(1, remaining)(global_random_generator());
+        std::uniform_int_distribution<std::size_t> slice_length_distribution(1, remaining);
+        std::size_t slice_length = slice_length_distribution(global_random_generator());
         slice_callback({text.data() + text.size() - remaining, slice_length});
         remaining -= slice_length;
     }
@@ -121,7 +123,8 @@ using error_costs_256x256_t = std::array<sz_error_cost_t, 256 * 256>;
 inline error_costs_256x256_t unary_substitution_costs() {
     error_costs_256x256_t result;
     for (std::size_t i = 0; i != 256; ++i)
-        for (std::size_t j = 0; j != 256; ++j) result[i * 256 + j] = (i == j ? 0 : -1);
+        for (std::size_t j = 0; j != 256; ++j) //
+            result[i * 256 + j] = i == j ? 0 : -1;
     return result;
 }
```
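The `iterate_in_random_slices` helper that this hunk documents can be mirrored in Python to show its contract: consecutive random-length slices that cover the text exactly once, driven by one shared generator (hence the new `@warning` about it being single-threaded). A sketch, not the C++ code; the fixed seed is an illustrative stand-in for `global_random_generator`.

```python
import random

def iterate_in_random_slices(text: str, slice_callback):
    # Repeatedly take a random-length chunk off the front of the remaining
    # text, so the slices are consecutive and partition the string exactly.
    rng = random.Random(42)  # stand-in for the single shared generator
    remaining = len(text)
    while remaining > 0:
        slice_length = rng.randint(1, remaining)
        start = len(text) - remaining
        slice_callback(text[start:start + slice_length])
        remaining -= slice_length
```

Because every slice length is drawn from the same generator, two concurrent callers would interleave draws and produce different partitions, which is exactly why the C++ doc-comment warns about single-threaded use.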
