Commit d190f85

perf: optimize strpos by eliminating double iteration for UTF-8
For non-ASCII strings, the original implementation used string.find() to get the byte index, then counted characters up to that byte index. This required two passes through the string.

This optimization uses char_indices() to find the substring while simultaneously tracking character positions, completing the search in a single pass.

Benchmark results (UTF-8 strings):
- str_len_8: 188.98 µs → 140.54 µs (25.4% faster)
- str_len_32: 615.69 µs → 294.15 µs (52.2% faster)
- str_len_128: 2.2707 ms → 1.2462 ms (45.1% faster)
- str_len_4096: 74.328 ms → 36.538 ms (50.9% faster)

ASCII performance unchanged (already optimized with fast path).
1 parent 7c50448 commit d190f85

1 file changed: +24 -8 lines changed


datafusion/functions/src/unicode/strpos.rs

Lines changed: 24 additions & 8 deletions
@@ -215,14 +215,30 @@ where
                     )
                 }
             } else {
-                // The `find` method returns the byte index of the substring.
-                // We count the number of chars up to that byte index.
-                T::Native::from_usize(
-                    string
-                        .find(substring)
-                        .map(|x| string[..x].chars().count() + 1)
-                        .unwrap_or(0),
-                )
+                // For non-ASCII, use a single-pass search that tracks both
+                // byte position and character position simultaneously
+                if substring.is_empty() {
+                    return T::Native::from_usize(1);
+                }
+
+                let substring_bytes = substring.as_bytes();
+                let string_bytes = string.as_bytes();
+
+                if substring_bytes.len() > string_bytes.len() {
+                    return T::Native::from_usize(0);
+                }
+
+                // Single pass: find substring while counting characters
+                let mut char_pos = 0;
+                for (byte_idx, _) in string.char_indices() {
+                    char_pos += 1;
+                    if byte_idx + substring_bytes.len() <= string_bytes.len()
+                        && &string_bytes[byte_idx..byte_idx + substring_bytes.len()] == substring_bytes {
+                        return T::Native::from_usize(char_pos);
+                    }
+                }
+
+                T::Native::from_usize(0)
             }
         }
         _ => None,
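
For readers who want to try the two strategies outside DataFusion, here is a minimal standalone sketch. The function names `strpos_two_pass` and `strpos_single_pass` are hypothetical, and plain `usize` stands in for the `T::Native::from_usize(...)` wrapper used in the real code. Both return a 1-based character position, or 0 when the substring is absent.

```rust
/// Two-pass approach, modeled on the removed lines: `find` yields a byte
/// offset, then the prefix is scanned again to count its characters.
fn strpos_two_pass(string: &str, substring: &str) -> usize {
    string
        .find(substring)
        .map(|byte_idx| string[..byte_idx].chars().count() + 1)
        .unwrap_or(0)
}

/// Single-pass approach, modeled on the added lines: walk the character
/// boundaries once and compare raw bytes at each boundary.
fn strpos_single_pass(string: &str, substring: &str) -> usize {
    // An empty needle matches at position 1, as in the diff.
    if substring.is_empty() {
        return 1;
    }

    let needle = substring.as_bytes();
    let haystack = string.as_bytes();

    // A needle longer than the haystack can never match.
    if needle.len() > haystack.len() {
        return 0;
    }

    // `enumerate` supplies the 0-based character index while
    // `char_indices` supplies the byte offset of each character.
    for (char_idx, (byte_idx, _)) in string.char_indices().enumerate() {
        // `starts_with` on a byte slice returns false when the remaining
        // bytes are shorter than the needle, so no bounds check is needed.
        if haystack[byte_idx..].starts_with(needle) {
            return char_idx + 1;
        }
    }

    0
}
```

For example, both functions return 7 for the string "héllo wörld" and the substring "wör", since "w" is the seventh character even though it sits at byte offset 7 only by coincidence of the earlier two-byte "é". Note that the single-pass version still performs a byte comparison at every character boundary, so its worst case depends on the needle length; the commit's benchmark figures apply to the DataFusion implementation, not to this sketch.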
