-
-
Notifications
You must be signed in to change notification settings - Fork 131
Description
Background
Hi, I had built a crate https://crates.io/crates/shufti-matcher, which could be used to boost the case where we need to match needles whose size is greater than 3 chars.
It would construct the shufti table at compile time using proc-macro so there is little overhead.
In its impl, it used classical low&high nimble and table lookup to extract the bitmask, which could be used to locate the first char in the haystack that matches the needle. And the performance is quiet stable.
simdjson and sonic-rs adopted the shufti table approach, but made the table manually. In fact, it's quite simple to build a char set for less than 9, since there are 8 bits in a u8. For more than 8 chars, it's still possible to fit multiple chars in a bucket.
the usage looks like:
use shufti_macro::ShuftiMatcher;
use shufti_matcher::ShuftiMatch;
#[derive(ShuftiMatcher)]
#[shufti(set = "\t\r\n")]
pub struct WsMatcher;
// https://github.com/servo/rust-url/blob/a66f4220895c3cc84ae623c218466710eb3a812f/url/src/host.rs#L131
#[derive(ShuftiMatcher)]
#[shufti(set = "\0\t\n\r #/:<>?@[\\]^|")]
pub struct MyMatcher;
fn main() {
let s = b"\nhello\nworld";
let pos = WsMatcher::find_first(s);
assert_eq!(pos, Some(0));
let pos = WsMatcher::find_first(&s[1..]);
assert_eq!(pos, Some(5));
let s1 = b"\nhello";
assert_eq!(MyMatcher::find_first(s1), Some(0));
let s2 = b"user@domain";
assert_eq!(MyMatcher::find_first(s2), Some(4));
let s3 = b"path\\to";
assert_eq!(MyMatcher::find_first(s3), Some(4));
let s4 = b"rust|cpp";
assert_eq!(MyMatcher::find_first(s4), Some(4));
let s5 = b"ABCdef123";
assert_eq!(MyMatcher::find_first(s5), None);
let s6 = b"null\0byte";
assert_eq!(MyMatcher::find_first(s6), Some(4));
let s7 = b" \t ";
assert_eq!(MyMatcher::find_first(s7), Some(0));
assert_eq!(MyMatcher::find_first(&s7[1..]), Some(0));
}Discussion
To make this crate more useful, I need to integrate memchr1, memchr2 and memchr3 for fast handling where the needles' size is less than 4. But then I thought, why not make memchr support this approach, which could be more beneficial since it's so widely used.
I'd like to here more thoughts about it.