
Conversation

@taminomara
Contributor

@taminomara taminomara commented May 23, 2025

This helps with writing structured input adapters for fuzzing. When fuzzing a parser specifically (as opposed to fuzzing lexer and parser at the same time), we'd like to supply it with an array of valid lexemes. This export helps us build such an array as we don't have to manually list all tokens in a fuzzing entry point.

Note that I didn't implement this functionality for generated lexers because there's already a way to get all tokens via `mod_l::lexerdef().iter_rules()`.

Example of a fuzzing implementation after this PR:

// Imports assumed by this example: libfuzzer-sys plus lrlex/lrpar.
use libfuzzer_sys::{
    arbitrary::{Arbitrary, Unstructured},
    fuzz_target,
};
use lrlex::DefaultLexeme;
use lrpar::Lexeme;

#[derive(Debug)]
struct Token(u32, String);

impl<'a> Arbitrary<'a> for Token {
    fn arbitrary(u: &mut Unstructured<'a>) -> libfuzzer_sys::arbitrary::Result<Self> {
        // Pick a valid token id from the exported array and pair it with an
        // arbitrary string for the token's text.
        Ok(Token(*u.choose(token_map::TOKENS)?, u.arbitrary()?))
    }
}

fuzz_target!(|data: Vec<Token>| {
    let mut text = String::new();
    let lexemes: Vec<DefaultLexeme> = data
        .into_iter()
        .map(|tok| {
            // Each lexeme starts where the previous token's text ended.
            let lexeme = DefaultLexeme::new(tok.0, text.len(), tok.1.len());
            text.push_str(&tok.1);
            lexeme
        })
        .collect();

    // Run parser...
});
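The offset bookkeeping in the closure above can be illustrated with a minimal, dependency-free sketch: each fabricated lexeme starts where the previous token's text ended, so the concatenated `text` stays consistent with every `(start, len)` pair. `Tok` and `Span` below are hypothetical stand-ins for the example's `Token` and `DefaultLexeme`, not grmtools types.

```rust
struct Tok(u32, String);

#[derive(Debug, PartialEq)]
struct Span {
    tok_id: u32,
    start: usize,
    len: usize,
}

// Concatenate each token's text and record the span it occupies, mirroring
// the map-and-push pattern in the fuzz target above.
fn build_spans(toks: Vec<Tok>) -> (String, Vec<Span>) {
    let mut text = String::new();
    let spans = toks
        .into_iter()
        .map(|t| {
            let s = Span { tok_id: t.0, start: text.len(), len: t.1.len() };
            text.push_str(&t.1);
            s
        })
        .collect();
    (text, spans)
}
```

Because the push happens after the span is recorded, `start` for token *n* always equals the total length of tokens 0..*n*.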

@ltratt
Member

ltratt commented May 23, 2025

This is a part of the system I haven't thought about for a while. Is it possible to do the same thing with `mod_l::lexerdef().iter_rules().map(|x| x.name()).collect()` or similar? [Warning: untried!]

@taminomara
Contributor Author

Is it possible to do the same thing with `mod_l::lexerdef().iter_rules().map(|x| x.name()).collect()` or similar?

Yes, this seems to work.

@ltratt
Member

ltratt commented May 23, 2025

OK, then I think we don't need to generate the array?

@taminomara
Contributor Author

OK, then I think we don't need to generate the array?

That only works when the user has a generated lexer. If there's a custom lexer built with `ct_token_map`, there's no way to obtain the full array of tokens.
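For illustration only, here is a hypothetical, stdlib-only sketch of why an exported array helps: the fuzz target can draw valid token ids from one slice rather than hand-maintaining the list. `TOKENS` below merely stands in for whatever array the generated token map exports, and `choose_token` is a crude stand-in for `Unstructured::choose`.

```rust
// Hypothetical stand-in for the exported token array; in a real fuzz
// target this would come from the generated token map module.
const TOKENS: &[u32] = &[0, 1, 2, 3];

// Map a raw fuzzer byte onto an element of the slice, so every pick is a
// valid token id regardless of the input byte.
fn choose_token(raw: u8) -> u32 {
    TOKENS[raw as usize % TOKENS.len()]
}
```

Adding a new token to the grammar then requires no change to the fuzz target, since the slice is regenerated rather than hand-edited.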

@ltratt
Member

ltratt commented May 23, 2025

If there's a custom lexer with ct_token_map, then there's no way to get a full array of tokens.

I take your point.

@taminomara taminomara force-pushed the master branch 4 times, most recently from e8356d9 to 49ba5e3 on May 24, 2025 at 14:53
@ratmice
Collaborator

ratmice commented May 24, 2025

It took a bit of head scratching until I grokked it (building a token stream directly rather than an intermediate vector!), but once it clicked, it all seemed fine to me.

Seems fine to me now, unless Laurence has any further comments.

@ltratt
Member

ltratt commented May 24, 2025

@ratmice Thanks for the review!

@taminomara Thanks for the PR!

@ltratt ltratt added this pull request to the merge queue May 24, 2025
Merged via the queue into softdevteam:master with commit 7831d2d May 24, 2025
2 checks passed
