Commit dd57536: "Docs: Formatting"
1 parent 9460fd4 commit dd57536

8 files changed, +64 -42 lines changed

.gitignore

Lines changed: 3 additions & 2 deletions

```diff
@@ -35,7 +35,8 @@ node_modules/
 leipzig1M.txt
 enwik9.txt
 xlsum.csv
-human_protein_1200row_800len.txt
+proteins.txt
 
 # StringZilla-specific log files
-/failed_sz_*
+/failed_sz_*
+/.cache/
```

CONTRIBUTING.md

Lines changed: 12 additions & 12 deletions

```diff
@@ -17,24 +17,24 @@ The project is split into the following parts:
 
 For minimal test coverage, check the following scripts:
 
-- `test.cpp` - tests C++ API (not underlying C) against STL.
-- `test.py` - tests Python API against native strings.
-- `test.js`.
+- `scripts/test.cpp` - tests C++ API (not underlying C) against STL.
+- `scripts/test.py` - tests Python API against native strings.
+- `scripts/test.js`.
 
 At the C++ level all benchmarks also validate the results against the STL baseline, serving as tests on real-world data.
 They have the broadest coverage of the library, and are the most important to keep up-to-date:
 
-- `bench_token.cpp` - token-level ops, like hashing, ordering, equality checks.
-- `bench_search.cpp` - bidirectional substring search, both exact and fuzzy.
-- `bench_similarity.cpp` - benchmark all edit distance backends.
-- `bench_sequence.cpp` - sorting, partitioning, merging.
-- `bench_container.cpp` - STL containers with different string keys.
+- `scripts/bench_token.cpp` - token-level ops, like hashing, ordering, equality checks.
+- `scripts/bench_search.cpp` - bidirectional substring search, both exact and fuzzy.
+- `scripts/bench_similarity.cpp` - benchmark all edit distance backends.
+- `scripts/bench_sequence.cpp` - sorting, partitioning, merging.
+- `scripts/bench_container.cpp` - STL containers with different string keys.
 
 The role of Python benchmarks is less to provide absolute number, but to compare against popular tools in the Python ecosystem.
 
-- `bench_search.(py|ipynb)` - compares against native Python `str`.
-- `bench_sequence.(py|ipynb)` - compares against `pandas`.
-- `bench_similarity.(ipynb)` - compares against `jellyfish`, `editdistance`, etc.
+- `scripts/bench_search.(py|ipynb)` - compares against native Python `str`.
+- `scripts/bench_sequence.(py|ipynb)` - compares against `pandas`.
+- `scripts/bench_similarity.(ipynb)` - compares against `jellyfish`, `editdistance`, etc.
 
 ## Benchmarking Datasets
 
@@ -55,7 +55,7 @@ unzip enwik9.zip && rm enwik9.zip && mv enwik9 enwik9.txt
 # XL Sum dataset for multilingual extractive summarization
 # 4.7 GB (1.7 GB compressed), 1'004'598 lines of UTF8, 268'435'456 tokens of mean length 8
 wget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz
-gzip -d xlsum.csv.gz
+gzip -d xlsum.csv.gz
 
 # Human chromosome generator dataset generated by:
 # https://github.com/rghilduta/human-chromosome-data-generator/blob/main/generate_chromosome_data.sh
```

include/stringzilla/features.h

Lines changed: 13 additions & 2 deletions

```diff
@@ -3,7 +3,7 @@
  * @file features.h
  * @author Ash Vardanian
  *
- * The `sklearn.feature_extraction` module for @b TF-IDF, "CountVectorizer", and "HashingVectorizer"
+ * The `sklearn.feature_extraction` module for @b TF-IDF, `CountVectorizer`, and `HashingVectorizer`
  * is one of the most commonly used in the industry due to its extreme flexibility. It can:
  *
  * - Tokenize by words, N-grams, or in-word N-grams.
@@ -14,7 +14,18 @@
  *
  * That level of flexibility is not feasible for a hardware-accelerated SIMD library, but we
  * can provide a set of APIs that can be used to build such a library on top of StringZilla.
- * That functionality will reuse our @b Trie data-structure for vocabulary building histograms.
+ * That functionality can reuse our @b Trie data-structure for vocabulary building histograms.
+ *
+ * In this file, we mostly focus on batch-level hashing operations, similar to the `intersect.h`
+ * module. There, we cross-reference two sets of strings, and here we only analyze one at a time.
+ *
+ * - The text comes in pre-tokenized form, as a stream, not even indexed-lookup is needed,
+ *   unlike the `sz_sequence_t` in `sz_intersect` APIs.
+ * - We scatter those tokens into the output in multiple forms:
+ *
+ *   - output hashes into a continuous buffer.
+ *   - output hashes into a hash-map with counts.
+ *   - output hashes into a high-dimensional bit-vector.
 *
 */
 #ifndef STRINGZILLA_FEATURES_H_
```
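The three output forms listed in the new `features.h` docs (contiguous hash buffer, counted hash-map, high-dimensional bit-vector) can be sketched in plain Python. This is an illustration of the idea only, not StringZilla's API: `token_hash` is a hypothetical stand-in built on `hashlib`, and the 1024-bit dimensionality is an arbitrary choice.

```python
import hashlib

def token_hash(token: str) -> int:
    # Hypothetical 64-bit token hash; stands in for the library's hasher.
    return int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "little")

def hashes_buffer(tokens):
    # Form 1: scatter hashes into a contiguous buffer, one slot per token.
    return [token_hash(t) for t in tokens]

def hashes_counts(tokens):
    # Form 2: a hash-map with occurrence counts per hashed token.
    counts = {}
    for t in tokens:
        h = token_hash(t)
        counts[h] = counts.get(h, 0) + 1
    return counts

def hashes_bitvector(tokens, dims=1024):
    # Form 3: set bit (hash mod dims) in a fixed-size bit-vector,
    # trading exact counts for constant memory.
    bits = bytearray(dims // 8)
    for t in tokens:
        h = token_hash(t) % dims
        bits[h // 8] |= 1 << (h % 8)
    return bytes(bits)
```

The bit-vector form is similar in spirit to hashed binary features in sklearn's `feature_extraction` module, except collisions simply merge bits instead of summing counts.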

include/stringzilla/similarity.h

Lines changed: 13 additions & 12 deletions

```diff
@@ -379,11 +379,11 @@ SZ_INTERNAL sz_status_t _sz_levenshtein_distance_skewed_diagonals_serial( //
  * Stores only 2 rows of the Levenshtein matrix, but uses 64-bit integers for the distance values,
  * and upcasts UTF8 variable-length codepoints to 64-bit integers for faster addressing.
  *
- * ! In the worst case for 2 strings of length 100, that contain just one 16-bit codepoint this will result in
- *   extra:
- *   + 2 rows * 100 slots * 8 bytes/slot = 1600 bytes of memory for the two rows of the Levenshtein matrix rows.
- *   + 100 codepoints * 2 strings * 4 bytes/codepoint = 800 bytes of memory for the UTF8 buffer.
- *   = 2400 bytes of memory or @b 12x memory amplification!
+ * ! In the worst case for 2 strings of length 100, that contain just one 16-bit codepoint this algorithm
+ * ! will require 2400 bytes of memory:
+ * ! + 2 rows * 100 slots * 8 bytes/slot = 1600 bytes of memory for the two rows of the Levenshtein matrix rows.
+ * ! + 100 codepoints * 2 strings * 4 bytes/codepoint = 800 bytes of memory for the UTF8 buffer.
+ * ! = 2400 bytes of memory or @b 12x memory amplification!
  */
 SZ_INTERNAL sz_status_t _sz_levenshtein_distance_wagner_fisher_serial( //
     sz_cptr_t longer, sz_size_t longer_length, //
@@ -764,15 +764,16 @@ SZ_PUBLIC sz_status_t sz_hamming_distance_utf8_serial( //
     apply_to = function)
 
 /**
- * @brief  Computes the edit distance between two very short byte-strings using the AVX-512VBMI extensions.
+ * @brief Computes the edit distance between two very short byte-strings using the AVX-512VBMI extensions.
+ * @sa `sz::levenshtein_distance_openmp`.
  *
  * Applies to string lengths up to 63, and evaluates at most (63 * 2 + 1 = 127) diagonals, or just as many loop
  * cycles. Supports an early exit, if the distance is bounded. Keeps all of the data and Levenshtein matrices skew
  * diagonal in just a couple of registers. Benefits from the @b `vpermb` instructions, that can rotate the bytes
  * across the entire ZMM register.
  *
- *? Bounds check, for inputs ranging from 33 to 64 bytes doesn't affect the performance at all.
- *? It's also worth exploring `_mm512_alignr_epi8` and `_mm512_maskz_compress_epi8` for the shift.
+ * ? Bounds check, for inputs ranging from 33 to 64 bytes doesn't affect the performance at all.
+ * ? It's also worth exploring `_mm512_alignr_epi8` and `_mm512_maskz_compress_epi8` for the shift.
  */
 SZ_INTERNAL sz_size_t _sz_levenshtein_distance_skewed_diagonals_upto63_ice( //
     sz_cptr_t shorter, sz_size_t shorter_length, //
@@ -809,7 +810,7 @@ SZ_INTERNAL sz_size_t _sz_levenshtein_distance_skewed_diagonals_upto63_ice( //
     bound_vec.zmm = _mm512_set1_epi8(bound <= 255 ? (sz_u8_t)bound : 255);
 
     // To simplify comparisons and traversals, we want to reverse the order of bytes in the shorter string.
-    shorter_vec.zmm = _mm512_setzero_si512(); //? To simplify debugging.
+    shorter_vec.zmm = _mm512_setzero_si512(); //? To simplify debugging, but can be noise
     for (sz_size_t i = 0; i != shorter_length; ++i) shorter_vec.u8s[63 - i] = shorter[i];
     shorter_rotated_vec.zmm = _mm512_permutexvar_epi8(rotate_right_vec.zmm, shorter_vec.zmm);
 
@@ -1034,7 +1035,7 @@ SZ_INTERNAL sz_status_t _sz_levenshtein_distance_skewed_diagonals_upto65k_ice( //
     // The length of the longest (main) diagonal would be `shorter_dim = (shorter_length + 1)`.
     sz_size_t const shorter_dim = shorter_length + 1;
     sz_size_t const longer_dim = longer_length + 1;
-    // Unlike the serial version, we also want to avoid reverse-order iteration over teh shorter string.
+    // Unlike the serial version, we also want to avoid reverse-order iteration over the shorter string.
     // So let's allocate a bit more memory and reverse-export our shorter string into that buffer.
     sz_size_t const buffer_length = sizeof(sz_u16_t) * longer_dim * 3 + shorter_length;
     sz_u16_t *const distances = (sz_u16_t *)alloc->allocate(buffer_length, alloc->handle);
@@ -1225,8 +1226,8 @@ SZ_PUBLIC sz_status_t sz_levenshtein_distance_ice( //
  * Unlike the `_sz_levenshtein_distance_skewed_diagonals_upto65k_avx512` method, this one uses signed integers to store
  * the accumulated score. Moreover, it's primary bottleneck is the latency of gathering the substitution costs
  * from the substitution matrix. If we use the diagonal order, we will be comparing a slice of the first string
- * with a slice of the second. If we stick to the conventional horizontal order, we will be comparing one character
- * against a slice, which is much easier to optimize. In that case we are sampling costs not from arbitrary parts of
+ * with a slice of the second. If we stick to the conventional horizontal order, we will be comparing one character
+ * against a slice, which is much easier to optimize. In that case we are sampling costs not from arbitrary parts of
  * a 256 x 256 matrix, but from a single row!
  */
 SZ_INTERNAL sz_status_t _sz_needleman_wunsch_score_wagner_fisher_upto17m_ice( //
```
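The two-row scheme that the corrected comment describes, keeping only the previous and current rows of the Levenshtein matrix, can be sketched in Python. This is a generic Wagner-Fischer illustration under unit costs, not the library's `_sz_levenshtein_distance_wagner_fisher_serial` implementation.

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    # Keep only two rows of the (len(a)+1) x (len(b)+1) Levenshtein matrix.
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, columns over the shorter
    previous = list(range(len(b) + 1))  # distances from the empty prefix
    for i, ca in enumerate(a, start=1):
        current = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            current[j] = min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match or substitution
            )
        previous = current
    return previous[-1]
```

With 64-bit slots, two rows for 100-codepoint strings cost roughly 2 * 101 * 8 ≈ 1600 bytes, matching the arithmetic in the fixed comment.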

pyproject.toml

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,4 +1,4 @@
-# This file configures wheels compilation for `cibuilwheel` for StringZilla CPython bindings.
+# This file configures wheels compilation for `cibuildwheel` for StringZilla CPython bindings.
 # On a good day it will produce:
 # - `manylinux` and `musllinux` wheels for Linux on x86_64, aarch64, i686, ppc64le, s390x;
 # - `macos` wheels for x86_64, arm64, and universal2;
@@ -28,7 +28,7 @@ build-verbosity = 0
 # - on Windows: AMD64, x86, ARM64
 # https://cibuildwheel.readthedocs.io/en/stable/options/#archs
 #
-# Important to note, not all those paltforms have recent images.
+# Important to note, not all those platforms have recent images.
 # The `manylinux_2_28` seems to be missing for `i686`.
 # The `i686` is 32-bit x86, and `x86_64` is 64-bit x86.
 archs = ["all"]
```

rust/lib.rs

Lines changed: 7 additions & 6 deletions

```diff
@@ -1,10 +1,11 @@
 #![cfg_attr(not(test), no_std)]
-
-/// The `sz` module provides a collection of string searching and manipulation functionality,
-/// designed for high efficiency and compatibility with `no_std` environments. This module offers
-/// various utilities for byte string manipulation, including search, reverse search, and
-/// edit-distance calculations, suitable for a wide range of applications from basic string
-/// processing to complex text analysis tasks.
+#[doc = r"
+The `sz` module provides a collection of string searching and manipulation functionality,
+designed for high efficiency and compatibility with `no_std` environments. This module offers
+various utilities for byte string manipulation, including search, reverse search, and
+edit-distance calculations, suitable for a wide range of applications from basic string
+processing to complex text analysis tasks.
+"]
 
 pub mod sz {
```
scripts/bench_search.cpp

Lines changed: 6 additions & 1 deletion

```diff
@@ -11,7 +11,12 @@
  *
  * For substring search, the number of operations per second are reported as the number of character-level comparisons
  * happening in the worst case in the naive algorithm, meaning O(N*M) for N characters in the haystack and M in the
- * needle.
+ * needle. In byteset search, the number of operations per second is computed the same way and the following character
+ * sets are tested against each scanned token:
+ *
+ * - "\n\r\v\f": 4 tabs
+ * - "</>&'\"=[]": 9 html
+ * - "0123456789": 10 digits
 *
 * Instead of CLI arguments, for compatibility with @b StringWa.rs, the following environment variables are used:
 * - `STRINGWARS_DATASET` : Path to the dataset file.
```
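The throughput metric the comment above describes, counting worst-case naive comparisons rather than bytes scanned, can be expressed as a tiny helper. This is a sketch with hypothetical names, reading O(N*M) literally as N*M comparisons.

```python
def worst_case_comparisons(haystack_length: int, needle_length: int) -> int:
    # The benchmark reports throughput as if every haystack character were
    # compared against every needle character: the naive O(N*M) bound.
    return haystack_length * needle_length

def comparisons_per_second(haystack_length: int, needle_length: int, seconds: float) -> float:
    # Normalized metric: worst-case comparisons divided by wall-clock time.
    return worst_case_comparisons(haystack_length, needle_length) / seconds
```

This makes faster-than-naive algorithms report proportionally higher "operations per second" on the same inputs, which is the point of the normalization.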

scripts/test.hpp

Lines changed: 8 additions & 5 deletions

```diff
@@ -14,13 +14,13 @@ namespace scripts {
 
 inline std::string read_file(std::string path) {
     std::ifstream stream(path);
-    if (!stream.is_open()) { throw std::runtime_error("Failed to open file: " + path); }
+    if (!stream.is_open()) throw std::runtime_error("Failed to open file: " + path);
     return std::string((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());
 }
 
 inline void write_file(std::string path, std::string content) {
     std::ofstream stream(path);
-    if (!stream.is_open()) { throw std::runtime_error("Failed to open file: " + path); }
+    if (!stream.is_open()) throw std::runtime_error("Failed to open file: " + path);
     stream << content;
     stream.close();
 }
@@ -74,13 +74,15 @@ inline std::string repeat(std::string const &patten, std::size_t count) {
 }
 
 /**
- * @brief A callback type for iterating over consecutive random-length slices of a string.
+ * @brief Randomly slices a string into consecutive parts and passes those to @p slice_callback.
+ * @warning Is @b single-threaded in nature, as it depends on the `global_random_generator`.
 */
 template <typename slice_callback_type_>
 inline void iterate_in_random_slices(std::string const &text, slice_callback_type_ &&slice_callback) {
     std::size_t remaining = text.size();
     while (remaining > 0) {
-        std::size_t slice_length = std::uniform_int_distribution<std::size_t>(1, remaining)(global_random_generator());
+        std::uniform_int_distribution<std::size_t> slice_length_distribution(1, remaining);
+        std::size_t slice_length = slice_length_distribution(global_random_generator());
         slice_callback({text.data() + text.size() - remaining, slice_length});
         remaining -= slice_length;
     }
@@ -121,7 +123,8 @@ using error_costs_256x256_t = std::array<sz_error_cost_t, 256 * 256>;
 inline error_costs_256x256_t unary_substitution_costs() {
     error_costs_256x256_t result;
     for (std::size_t i = 0; i != 256; ++i)
-        for (std::size_t j = 0; j != 256; ++j) result[i * 256 + j] = (i == j ? 0 : -1);
+        for (std::size_t j = 0; j != 256; ++j) //
+            result[i * 256 + j] = i == j ? 0 : -1;
     return result;
 }
```
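The `iterate_in_random_slices` helper that this hunk documents can be mirrored in Python to show its contract: consecutive random-length slices that cover the text exactly once, driven by one shared generator (hence the new `@warning` about it being single-threaded). A sketch, not the C++ code; the fixed seed is an illustrative stand-in for `global_random_generator`.

```python
import random

def iterate_in_random_slices(text: str, slice_callback):
    # Repeatedly take a random-length chunk off the front of the remaining
    # text, so the slices are consecutive and partition the string exactly.
    rng = random.Random(42)  # stand-in for the single shared generator
    remaining = len(text)
    while remaining > 0:
        slice_length = rng.randint(1, remaining)
        start = len(text) - remaining
        slice_callback(text[start:start + slice_length])
        remaining -= slice_length
```

Because every slice length is drawn from the same generator, two concurrent callers would interleave draws and produce different partitions, which is exactly why the C++ doc-comment warns about single-threaded use.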
