Skip to content

[feat request] shufti table approach #206

@hsqStephenZhang

Description

@hsqStephenZhang

Background

Hi, I had built a crate https://crates.io/crates/shufti-matcher, which could be used to boost the case where we need to match needles whose size is greater than 3 chars.

It would construct the shufti table at compile time using proc-macro so there is little overhead.
In its impl, it used classical low&high nimble and table lookup to extract the bitmask, which could be used to locate the first char in the haystack that matches the needle. And the performance is quiet stable.

simdjson and sonic-rs adopted the shufti table approach, but made the table manually. In fact, it's quite simple to build a char set for less than 9, since there are 8 bits in a u8. For more than 8 chars, it's still possible to fit multiple chars in a bucket.

the usage looks like:

use shufti_macro::ShuftiMatcher;
use shufti_matcher::ShuftiMatch;

#[derive(ShuftiMatcher)]
#[shufti(set = "\t\r\n")]
pub struct WsMatcher;

// https://github.com/servo/rust-url/blob/a66f4220895c3cc84ae623c218466710eb3a812f/url/src/host.rs#L131
#[derive(ShuftiMatcher)]
#[shufti(set = "\0\t\n\r #/:<>?@[\\]^|")]
pub struct MyMatcher;

fn main() {
    let s = b"\nhello\nworld";
    let pos = WsMatcher::find_first(s);
    assert_eq!(pos, Some(0));
    let pos = WsMatcher::find_first(&s[1..]);
    assert_eq!(pos, Some(5));

    let s1 = b"\nhello";
    assert_eq!(MyMatcher::find_first(s1), Some(0));

    let s2 = b"user@domain";
    assert_eq!(MyMatcher::find_first(s2), Some(4));

    let s3 = b"path\\to";
    assert_eq!(MyMatcher::find_first(s3), Some(4));

    let s4 = b"rust|cpp";
    assert_eq!(MyMatcher::find_first(s4), Some(4));

    let s5 = b"ABCdef123";
    assert_eq!(MyMatcher::find_first(s5), None);

    let s6 = b"null\0byte";
    assert_eq!(MyMatcher::find_first(s6), Some(4));

    let s7 = b" \t ";
    assert_eq!(MyMatcher::find_first(s7), Some(0));
    assert_eq!(MyMatcher::find_first(&s7[1..]), Some(0));
}

Discussion

To make this crate more useful, I need to integrate memchr1, memchr2 and memchr3 for fast handling where the needles' size is less than 4. But then I thought, why not make memchr support this approach, which could be more beneficial since it's so widely used.

I'd like to here more thoughts about it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions