Possible bottleneck reviews #5
Description
I fed the existing repository to Opus 4.6 to see if it could identify any obvious bottlenecks for speed. Here's what it came up with, which may or may not be aligned with reality. If any of this seems like an easy or obvious benefit, please evaluate and update. If some of it doesn't make sense or isn't a clear win for time spent vs. speed, no worries - it can wait.
In this evaluation, I told it that rules would be evenly distributed across the existing rule types. That may be false given eventual usage patterns; it remains to be seen.
You may already be building with the flags it notes below.
Performance optimization opportunities for high-throughput matching
Context
We need to evaluate what software-level optimizations are available to improve stringsimile throughput on a standard Intel CPU. Hardware acceleration (e.g., Google Coral TPU) was evaluated and rejected as architecturally unsuitable — the Edge TPU is a systolic array designed for 8-bit quantized neural network inference (multiply-accumulate operations), which is fundamentally mismatched with the branching, sequential, variable-length nature of string similarity algorithms like Levenshtein, Jaro-Winkler, Soundex, and IDN Confusables.
The following analysis focuses on software-level improvements within the existing Rust codebase.
Assumptions
- Hardware: 32-core / 64-thread Intel processor
- Throughput target: 80,000 input names per second
- Rule configuration: ~2,000 rule_sets, evenly distributed across available rule types
- Deduplication: Handled upstream in the pipeline; Kafka input will not contain significant duplicates
- split_target usage: Typical domains produce 3–5 labels (2–4 after TLD removal), multiplying per-message comparisons accordingly
- Match rate: Very low — the vast majority of comparisons are non-matches
Workload analysis
At 80K names/sec × 2,000 rule_sets × ~3 domain parts × multiple rules per set, the system must sustain on the order of hundreds of millions of individual rule evaluations per second. With 64 threads, that leaves roughly 80–200 nanoseconds per rule evaluation as the per-comparison budget. At this scale, overhead from allocations, serialization, and redundant computation becomes significant relative to the algorithmic cost of the comparisons themselves.
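The per-comparison budget quoted above follows directly from the stated assumptions. A quick sanity check (using 80K names/sec, 2,000 rule_sets, 3 labels, 64 threads, and a lower-bound multiplier of one rule per set):

```rust
// Back-of-envelope budget check using the assumptions stated above.
fn ns_per_evaluation(names_per_sec: u64, rule_sets: u64, labels: u64, threads: u64) -> f64 {
    let evals_per_sec = names_per_sec * rule_sets * labels; // total rule evaluations/sec
    let per_thread = evals_per_sec as f64 / threads as f64; // sustained rate per thread
    1e9 / per_thread // nanoseconds available per evaluation
}

fn main() {
    // 80K × 2,000 × 3 = 480M evaluations/sec; spread over 64 threads,
    // each evaluation gets roughly 133 ns — before counting multiple
    // rules per set, which shrinks the budget further.
    println!("{:.0} ns per evaluation", ns_per_evaluation(80_000, 2_000, 3, 64));
}
```

With several rules per set the budget drops toward the low end of the 80–200 ns range, which is why per-evaluation allocation overhead dominates the analysis below.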
Existing strengths
The codebase already uses triple_accel for SIMD-accelerated Levenshtein, Damerau-Levenshtein, and Hamming distance — and specifically uses the _exp (exponential search) variants, which are the optimal choice when most comparisons are expected to exceed the distance threshold. The buffer_unordered parallelism model across messages with tokio should provide adequate core utilization given the message arrival rate vs. per-message processing time.
Proposed optimizations
1. Skip JSON serialization for non-matches (estimated 15–25% overall speedup)
Problem: In lib.rs, try_into_generic_result() calls serde_json::to_value() on every rule evaluation result — both matches and non-matches. This creates heap-allocated Map<String, Value> objects with string keys and boxed values for metadata that, in the report_all: false case, is immediately discarded for non-matches.
With a match rate well under 1%, over 99% of these serialization calls produce garbage that is never read.
Affected code: lib.rs:63–74 (try_into_generic_result), called from rule.rs:107–124 (match_rule_generic).
Proposed fix: When report_all is false, introduce a fast path in match_rule_generic that returns a lightweight non-match result (just the boolean) without serializing metadata. Only perform serde_json::to_value() for actual matches. This could be done by passing a report_all flag into the rule evaluation, or by having the rule return a richer enum that defers serialization.
Impact: This is the highest-impact single change because it eliminates multiple heap allocations per evaluation in the hottest loop, and it affects every rule type equally.
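One possible shape for the deferred-serialization enum mentioned above — the type and method names here are illustrative, not the repo's actual API, and a closure stands in for serde_json::to_value so the sketch stays dependency-free:

```rust
// Hypothetical shape: the rule returns a cheap enum, and expensive metadata
// serialization is deferred until we know the caller actually wants it.
enum EvalOutcome<M> {
    NoMatch,  // fast path: no heap allocation, no serialization
    Match(M), // metadata kept as a plain struct until a match is confirmed
}

impl<M> EvalOutcome<M> {
    // `serialize` stands in for serde_json::to_value; it only runs for
    // matches, or for everything when report_all is set.
    fn into_report(self, report_all: bool, serialize: impl Fn(&M) -> String) -> Option<String> {
        match self {
            EvalOutcome::Match(meta) => Some(serialize(&meta)),
            EvalOutcome::NoMatch if report_all => Some(String::from("{}")),
            EvalOutcome::NoMatch => None,
        }
    }
}

fn main() {
    let to_json = |d: &u32| format!("{{\"distance\":{}}}", d);
    // Non-matches under report_all: false skip serialization entirely.
    assert_eq!(EvalOutcome::<u32>::NoMatch.into_report(false, &to_json), None);
    assert_eq!(
        EvalOutcome::Match(2u32).into_report(false, &to_json),
        Some("{\"distance\":2}".to_string())
    );
    println!("ok");
}
```

The key property is that the non-match arm constructs nothing on the heap, so the >99% non-match case pays only for the enum discriminant.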
2. Pre-compute phonetic codes for target strings (estimated 8–15% overall speedup)
Problem: For Soundex, Metaphone, NYSIIS, and Match Rating rules, the target string (string_match) is constant between rule reloads, but its phonetic encoding is recomputed on every comparison.
Specific examples:
- `metaphone.rs:96`: `metaphone.encode(target_str)` — called 80K × (number of metaphone rule_sets) times/sec, always with the same target
- `nysiis.rs:46`: `self.nysiis.encode(target_str)` — same issue
- `soundex.rs:67–70`: `self.soundex_type.build_soundex()` — additionally allocates a new `Box<dyn SoundexCommons>` on every single call
Proposed fix: At rule construction time (in RuleSetConfig::into_rule_set or in the rule struct constructors), pre-compute and store the phonetic encoding of string_match. At comparison time, only encode the input string and compare against the stored code. For Soundex, also store the Soundex/RefinedSoundex instance in the rule struct rather than constructing it per-call.
Note: This changes the MatcherRule trait slightly — phonetic rules would need access to the pre-computed target code rather than the raw target string, or the pre-computed code could be stored alongside the rule in the RuleSet.
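The precompute pattern itself is small; a sketch under stated assumptions (the struct name is illustrative, and a toy closure stands in for the real Soundex/Metaphone/NYSIIS encoders):

```rust
// Sketch of precomputing the target's phonetic code at construction time.
// `PhoneticRule` is a hypothetical name; `encode` stands in for the real
// phonetic encoders used by the crate.
struct PhoneticRule {
    target_code: String, // encoded once per rule reload, not per comparison
}

impl PhoneticRule {
    fn new(string_match: &str, encode: impl Fn(&str) -> String) -> Self {
        PhoneticRule { target_code: encode(string_match) }
    }

    // Per comparison: only the input is encoded; the target code is reused.
    fn matches(&self, input: &str, encode: impl Fn(&str) -> String) -> bool {
        encode(input) == self.target_code
    }
}

// Toy stand-in encoder for demonstration: uppercase first four characters.
fn toy_encode(s: &str) -> String {
    s.to_uppercase().chars().take(4).collect()
}

fn main() {
    let rule = PhoneticRule::new("wikipedia", toy_encode);
    assert!(rule.matches("WikiPedia", toy_encode)); // same toy code "WIKI"
    assert!(!rule.matches("example", toy_encode));
    println!("ok");
}
```

This halves the encoding work per comparison and, for Soundex, also removes the per-call `Box<dyn SoundexCommons>` allocation noted above.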
3. Build with RUSTFLAGS="-C target-cpu=native" (estimated 5–15% speedup on edit distance rules)
Problem: triple_accel auto-detects SIMD at runtime, but the Rust compiler itself may not emit AVX2 instructions without explicit target CPU flags. This can result in SSE4.1 (128-bit) instead of AVX2 (256-bit) for the vectorized edit distance inner loops — a potential 2× throughput difference for those operations.
Proposed fix: Add -C target-cpu=native to RUSTFLAGS in the production build pipeline. If builds must be portable across different CPU generations, consider -C target-cpu=x86-64-v3 (which includes AVX2) as a minimum baseline for production hardware.
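One way to wire this in without editing CI scripts is a checked-in `.cargo/config.toml` (a sketch; adjust to the actual build pipeline and hardware):

```toml
# .cargo/config.toml — applies to all cargo builds in this repo.
# x86-64-v3 includes AVX2 and is a portable production baseline;
# use target-cpu=native only when the build host matches production hardware.
[build]
rustflags = ["-C", "target-cpu=x86-64-v3"]
```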
Verification: Build triple_accel with its debug feature flag temporarily — it prints the vector type being used, confirming whether AVX2 is active.
Impact: Directly affects Levenshtein, Damerau-Levenshtein, and Hamming rules (~30% of rule_sets if evenly distributed). Also enables auto-vectorization of other loops throughout the binary.
4. Length-based pre-filtering for edit distance rules (estimated 3–8% overall speedup)
Problem: For Levenshtein and Damerau-Levenshtein with a maximum_distance threshold, if |len(input) - len(target)| > maximum_distance, the edit distance is guaranteed to exceed the threshold. This check is a single integer subtraction — effectively free — but it avoids invoking the full algorithm.
Affected code: levenshtein.rs:37 and damerau_levenshtein.rs:37, where levenshtein_exp / rdamerau_exp are called unconditionally.
Proposed fix: Add a length-difference check before calling the triple_accel functions:
```rust
fn match_rule(&self, input_str: &str, target_str: &str) -> MatcherResult<...> {
    let len_diff = input_str.len().abs_diff(target_str.len()) as u32;
    if len_diff > self.maximum_distance {
        // The length difference is a lower bound on edit distance,
        // so this non-match is certain without running the algorithm.
        return MatcherResult::new_no_match(LevenshteinMetadata { distance: len_diff });
    }
    let res = levenshtein_exp(input_str.as_bytes(), target_str.as_bytes());
    // ... existing logic
}
```

Note: The `_exp` variants already do some early termination internally via exponential search, so the marginal gain is smaller than it would be with a naive implementation. The benefit depends on the length distribution of domain labels vs. target strings — a short label like www (3 chars) against a long target like wikipedia (9 chars) would be rejected immediately.
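The filter's correctness rests on the length difference being a lower bound on edit distance, so it can never reject a true match. A stdlib-only check, using a naive DP Levenshtein as a stand-in for triple_accel:

```rust
// Naive Levenshtein (stand-in for triple_accel::levenshtein_exp), used to
// check that |len(a) - len(b)| never exceeds the edit distance.
fn levenshtein(a: &str, b: &str) -> u32 {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i as u32 + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    *prev.last().unwrap()
}

fn main() {
    let (input, target) = ("www", "wikipedia");
    let len_diff = input.len().abs_diff(target.len()) as u32; // 6
    let dist = levenshtein(input, target);
    assert!(len_diff <= dist); // the pre-filter never discards a true match
    println!("len_diff={} distance={}", len_diff, dist);
}
```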
5. Eliminate string allocations for split domain parts (estimated 2–4% overall speedup)
Problem: In `ruleset.rs:94`, `parts.extend(name.split(".").map(|s| s.to_string()))` allocates a new `String` for every domain label. At 80K names/sec × ~4 labels, that's ~320K small heap allocations/sec, all unnecessary since the parts are only used as `&str` references.
Proposed fix: Use &str slices borrowing from the input string instead of owned Strings. This requires some lifetime annotations on generate_matches and the parts vector, but is straightforward:
```rust
let parts: Vec<&str> = if self.split_target {
    let mut p: Vec<&str> = name.split('.').collect();
    if self.ignore_tld && !p.is_empty() {
        p.pop();
    }
    p
} else {
    vec![name]
};
```

6. Use a performance-oriented global allocator (estimated 1–3% overall speedup)
Problem: The default system allocator (glibc malloc) can exhibit lock contention under heavy multi-threaded allocation patterns. With 64 threads all performing small allocations (metadata structs, string parts, JSON values), allocator pressure is non-trivial.
Proposed fix: Add tikv-jemallocator or mimalloc as the global allocator — this is a ~3-line change:
```rust
// in main.rs
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```

This is complementary to reducing allocations (items 1, 2, 5) — it makes the remaining allocations cheaper.
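The matching dependency declaration would look something like this (the version number is illustrative — pin to whatever is current):

```toml
# Cargo.toml
[dependencies]
tikv-jemallocator = "0.6"
```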
Summary
| # | Optimization | Estimated speedup | Effort |
|---|---|---|---|
| 1 | Skip JSON serialization for non-matches | 15–25% | Medium |
| 2 | Pre-compute phonetic target codes | 8–15% | Low–Medium |
| 3 | Build with `-C target-cpu=native` | 5–15% | Trivial |
| 4 | Length pre-filtering for edit distance | 3–8% | Trivial |
| 5 | Use `&str` slices for split parts | 2–4% | Low |
| 6 | Switch to jemalloc/mimalloc allocator | 1–3% | Trivial |
Combined realistic estimate: 25–45% overall throughput improvement. These overlap partially (removing serialization overhead changes the profile for the remaining optimizations), so they are not strictly additive.
Items 3, 4, and 6 are low-risk, low-effort changes that could be landed quickly. Items 1 and 2 are higher impact but require more careful design work around the trait interfaces.