Possible bottleneck reviews #5
Description
I fed the existing repository to Opus 4.6 to see if it could identify any obvious bottlenecks for speed. Here's what it came up with, which may or may not be aligned with reality. If any of this seems like an easy or obvious benefit, please evaluate and update. If some of it doesn't make sense or isn't a clear win for time spent vs. speed, no worries - it can wait.
In this evaluation, I told it that rules would be evenly distributed across the existing rule types. That may be false given eventual usage patterns; it remains to be seen.
You may already be building with the flags it notes below.
Performance optimization opportunities for high-throughput matching
Context
We need to evaluate what software-level optimizations are available to improve stringsimile throughput on a standard Intel CPU. Hardware acceleration (e.g., Google Coral TPU) was evaluated and rejected as architecturally unsuitable — the Edge TPU is a systolic array designed for 8-bit quantized neural network inference (multiply-accumulate operations), which is fundamentally mismatched with the branching, sequential, variable-length nature of string similarity algorithms like Levenshtein, Jaro-Winkler, Soundex, and IDN Confusables.
The following analysis focuses on software-level improvements within the existing Rust codebase.
Assumptions
- Hardware: 32-core / 64-thread Intel processor
- Throughput target: 80,000 input names per second
- Rule configuration: ~2,000 rule_sets, evenly distributed across available rule types
- Deduplication: Handled upstream in the pipeline; Kafka input will not contain significant duplicates
- split_target usage: Typical domains produce 3–5 labels (2–4 after TLD removal), multiplying per-message comparisons accordingly
- Match rate: Very low — the vast majority of comparisons are non-matches
Workload analysis
At 80K names/sec × 2,000 rule_sets × ~3 domain parts × multiple rules per set, the system must sustain on the order of hundreds of millions of individual rule evaluations per second. With 64 threads, that leaves roughly 80–200 nanoseconds per rule evaluation as the per-comparison budget. At this scale, overhead from allocations, serialization, and redundant computation becomes significant relative to the algorithmic cost of the comparisons themselves.
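The per-comparison budget quoted above follows directly from the stated assumptions. A quick sanity check (using 80K names/sec, 2,000 rule_sets, 3 labels, 64 threads, and a lower-bound multiplier of one rule per set):

```rust
// Back-of-envelope budget check using the assumptions stated above.
fn ns_per_evaluation(names_per_sec: u64, rule_sets: u64, labels: u64, threads: u64) -> f64 {
    let evals_per_sec = names_per_sec * rule_sets * labels; // total rule evaluations/sec
    let per_thread = evals_per_sec as f64 / threads as f64; // sustained rate per thread
    1e9 / per_thread // nanoseconds available per evaluation
}

fn main() {
    // 80K × 2,000 × 3 = 480M evaluations/sec; spread over 64 threads,
    // each evaluation gets roughly 133 ns — before counting multiple
    // rules per set, which shrinks the budget further.
    println!("{:.0} ns per evaluation", ns_per_evaluation(80_000, 2_000, 3, 64));
}
```

With several rules per set the budget drops toward the low end of the 80–200 ns range, which is why per-evaluation allocation overhead dominates the analysis below.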
Existing strengths
The codebase already uses triple_accel for SIMD-accelerated Levenshtein, Damerau-Levenshtein, and Hamming distance — and specifically uses the _exp (exponential search) variants, which are the optimal choice when most comparisons are expected to exceed the distance threshold. The buffer_unordered parallelism model across messages with tokio should provide adequate core utilization given the message arrival rate vs. per-message processing time.
Proposed optimizations
1. Skip JSON serialization for non-matches (estimated 15–25% overall speedup)
Problem: In lib.rs, try_into_generic_result() calls serde_json::to_value() on every rule evaluation result — both matches and non-matches. This creates heap-allocated Map<String, Value> objects with string keys and boxed values for metadata that, in the report_all: false case, is immediately discarded for non-matches.
With a match rate well under 1%, over 99% of these serialization calls produce garbage that is never read.
Affected code: lib.rs:63–74 (try_into_generic_result), called from rule.rs:107–124 (match_rule_generic).
Proposed fix: When report_all is false, introduce a fast path in match_rule_generic that returns a lightweight non-match result (just the boolean) without serializing metadata. Only perform serde_json::to_value() for actual matches. This could be done by passing a report_all flag into the rule evaluation, or by having the rule return a richer enum that defers serialization.
Impact: This is the highest-impact single change because it eliminates multiple heap allocations per evaluation in the hottest loop, and it affects every rule type equally.
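One possible shape for the deferred-serialization enum mentioned above — the type and method names here are illustrative, not the repo's actual API, and a closure stands in for serde_json::to_value so the sketch stays dependency-free:

```rust
// Hypothetical shape: the rule returns a cheap enum, and expensive metadata
// serialization is deferred until we know the caller actually wants it.
enum EvalOutcome<M> {
    NoMatch,  // fast path: no heap allocation, no serialization
    Match(M), // metadata kept as a plain struct until a match is confirmed
}

impl<M> EvalOutcome<M> {
    // `serialize` stands in for serde_json::to_value; it only runs for
    // matches, or for everything when report_all is set.
    fn into_report(self, report_all: bool, serialize: impl Fn(&M) -> String) -> Option<String> {
        match self {
            EvalOutcome::Match(meta) => Some(serialize(&meta)),
            EvalOutcome::NoMatch if report_all => Some(String::from("{}")),
            EvalOutcome::NoMatch => None,
        }
    }
}

fn main() {
    let to_json = |d: &u32| format!("{{\"distance\":{}}}", d);
    // Non-matches under report_all: false skip serialization entirely.
    assert_eq!(EvalOutcome::<u32>::NoMatch.into_report(false, &to_json), None);
    assert_eq!(
        EvalOutcome::Match(2u32).into_report(false, &to_json),
        Some("{\"distance\":2}".to_string())
    );
    println!("ok");
}
```

The key property is that the non-match arm constructs nothing on the heap, so the >99% non-match case pays only for the enum discriminant.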
2. Pre-compute phonetic codes for target strings (estimated 8–15% overall speedup)
Problem: For Soundex, Metaphone, NYSIIS, and Match Rating rules, the target string (string_match) is constant between rule reloads, but its phonetic encoding is recomputed on every comparison.
Specific examples:
- `metaphone.rs:96`: `metaphone.encode(target_str)` — called 80K × (number of metaphone rule_sets) times/sec, always with the same target
- `nysiis.rs:46`: `self.nysiis.encode(target_str)` — same issue
- `soundex.rs:67–70`: `self.soundex_type.build_soundex()` — additionally allocates a new `Box<dyn SoundexCommons>` on every single call
Proposed fix: At rule construction time (in RuleSetConfig::into_rule_set or in the rule struct constructors), pre-compute and store the phonetic encoding of string_match. At comparison time, only encode the input string and compare against the stored code. For Soundex, also store the Soundex/RefinedSoundex instance in the rule struct rather than constructing it per-call.
Note: This changes the MatcherRule trait slightly — phonetic rules would need access to the pre-computed target code rather than the raw target string, or the pre-computed code could be stored alongside the rule in the RuleSet.
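The precompute pattern itself is small; a sketch under stated assumptions (the struct name is illustrative, and a toy closure stands in for the real Soundex/Metaphone/NYSIIS encoders):

```rust
// Sketch of precomputing the target's phonetic code at construction time.
// `PhoneticRule` is a hypothetical name; `encode` stands in for the real
// phonetic encoders used by the crate.
struct PhoneticRule {
    target_code: String, // encoded once per rule reload, not per comparison
}

impl PhoneticRule {
    fn new(string_match: &str, encode: impl Fn(&str) -> String) -> Self {
        PhoneticRule { target_code: encode(string_match) }
    }

    // Per comparison: only the input is encoded; the target code is reused.
    fn matches(&self, input: &str, encode: impl Fn(&str) -> String) -> bool {
        encode(input) == self.target_code
    }
}

// Toy stand-in encoder for demonstration: uppercase first four characters.
fn toy_encode(s: &str) -> String {
    s.to_uppercase().chars().take(4).collect()
}

fn main() {
    let rule = PhoneticRule::new("wikipedia", toy_encode);
    assert!(rule.matches("WikiPedia", toy_encode)); // same toy code "WIKI"
    assert!(!rule.matches("example", toy_encode));
    println!("ok");
}
```

This halves the encoding work per comparison and, for Soundex, also removes the per-call `Box<dyn SoundexCommons>` allocation noted above.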
3. Build with RUSTFLAGS="-C target-cpu=native" (estimated 5–15% speedup on edit distance rules)
Problem: triple_accel auto-detects SIMD at runtime, but the Rust compiler itself may not emit AVX2 instructions without explicit target CPU flags. This can result in SSE4.1 (128-bit) instead of AVX2 (256-bit) for the vectorized edit distance inner loops — a potential 2× throughput difference for those operations.
Proposed fix: Add -C target-cpu=native to RUSTFLAGS in the production build pipeline. If builds must be portable across different CPU generations, consider -C target-cpu=x86-64-v3 (which includes AVX2) as a minimum baseline for production hardware.
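One way to wire this in without editing CI scripts is a checked-in `.cargo/config.toml` (a sketch; adjust to the actual build pipeline and hardware):

```toml
# .cargo/config.toml — applies to all cargo builds in this repo.
# x86-64-v3 includes AVX2 and is a portable production baseline;
# use target-cpu=native only when the build host matches production hardware.
[build]
rustflags = ["-C", "target-cpu=x86-64-v3"]
```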
Verification: Build triple_accel with its debug feature flag temporarily — it prints the vector type being used, confirming whether AVX2 is active.
Impact: Directly affects Levenshtein, Damerau-Levenshtein, and Hamming rules (~30% of rule_sets if evenly distributed). Also enables auto-vectorization of other loops throughout the binary.
4. Length-based pre-filtering for edit distance rules (estimated 3–8% overall speedup)
Problem: For Levenshtein and Damerau-Levenshtein with a maximum_distance threshold, if |len(input) - len(target)| > maximum_distance, the edit distance is guaranteed to exceed the threshold. This check is a single integer subtraction — effectively free — but it avoids invoking the full algorithm.
Affected code: levenshtein.rs:37 and damerau_levenshtein.rs:37, where levenshtein_exp / rdamerau_exp are called unconditionally.
Proposed fix: Add a length-difference check before calling the triple_accel functions:
```rust
fn match_rule(&self, input_str: &str, target_str: &str) -> MatcherResult<...> {
    let len_diff = input_str.len().abs_diff(target_str.len()) as u32;
    if len_diff > self.maximum_distance {
        // The length difference is a lower bound on edit distance,
        // so this non-match is certain without running the algorithm.
        return MatcherResult::new_no_match(LevenshteinMetadata { distance: len_diff });
    }
    let res = levenshtein_exp(input_str.as_bytes(), target_str.as_bytes());
    // ... existing logic
}
```

Note: The `_exp` variants already do some early termination internally via exponential search, so the marginal gain is smaller than it would be with a naive implementation. The benefit depends on the length distribution of domain labels vs. target strings — a short label like www (3 chars) against a long target like wikipedia (9 chars) would be rejected immediately.
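The filter's correctness rests on the length difference being a lower bound on edit distance, so it can never reject a true match. A stdlib-only check, using a naive DP Levenshtein as a stand-in for triple_accel:

```rust
// Naive Levenshtein (stand-in for triple_accel::levenshtein_exp), used to
// check that |len(a) - len(b)| never exceeds the edit distance.
fn levenshtein(a: &str, b: &str) -> u32 {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i as u32 + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    *prev.last().unwrap()
}

fn main() {
    let (input, target) = ("www", "wikipedia");
    let len_diff = input.len().abs_diff(target.len()) as u32; // 6
    let dist = levenshtein(input, target);
    assert!(len_diff <= dist); // the pre-filter never discards a true match
    println!("len_diff={} distance={}", len_diff, dist);
}
```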
5. Eliminate string allocations for split domain parts (estimated 2–4% overall speedup)
Problem: In `ruleset.rs:94`, `parts.extend(name.split(".").map(|s| s.to_string()))` allocates a new `String` for every domain label. At 80K names/sec × ~4 labels, that's ~320K small heap allocations/sec, all unnecessary since the parts are only used as `&str` references.
Proposed fix: Use &str slices borrowing from the input string instead of owned Strings. This requires some lifetime annotations on generate_matches and the parts vector, but is straightforward:
```rust
let parts: Vec<&str> = if self.split_target {
    let mut p: Vec<&str> = name.split('.').collect();
    if self.ignore_tld && !p.is_empty() {
        p.pop();
    }
    p
} else {
    vec![name]
};
```

6. Use a performance-oriented global allocator (estimated 1–3% overall speedup)
Problem: The default system allocator (glibc malloc) can exhibit lock contention under heavy multi-threaded allocation patterns. With 64 threads all performing small allocations (metadata structs, string parts, JSON values), allocator pressure is non-trivial.
Proposed fix: Add tikv-jemallocator or mimalloc as the global allocator — this is a ~3-line change:
```rust
// in main.rs
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```

This is complementary to reducing allocations (items 1, 2, 5) — it makes the remaining allocations cheaper.
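The matching dependency declaration would look something like this (the version number is illustrative — pin to whatever is current):

```toml
# Cargo.toml
[dependencies]
tikv-jemallocator = "0.6"
```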
Summary
| # | Optimization | Estimated speedup | Effort |
|---|---|---|---|
| 1 | Skip JSON serialization for non-matches | 15–25% | Medium |
| 2 | Pre-compute phonetic target codes | 8–15% | Low–Medium |
| 3 | Build with `-C target-cpu=native` | 5–15% | Trivial |
| 4 | Length pre-filtering for edit distance | 3–8% | Trivial |
| 5 | Use `&str` slices for split parts | 2–4% | Low |
| 6 | Switch to jemalloc/mimalloc allocator | 1–3% | Trivial |
Combined realistic estimate: 25–45% overall throughput improvement. These overlap partially (removing serialization overhead changes the profile for the remaining optimizations), so they are not strictly additive.
Items 3, 4, and 6 are low-risk, low-effort changes that could be landed quickly. Items 1 and 2 are higher impact but require more careful design work around the trait interfaces.