@h3n4l h3n4l commented Aug 29, 2025

Grammar-Aware Fuzzing Library

This PR introduces a fuzzing library that generates SQL inputs from ANTLR v4 grammar files for parser testing. Output follows the grammar's rules, subject to the depth-limit placeholders described under Current Limitations.

🚀 Features

Core Implementation

  • Grammar IR Storage: Parses ANTLR v4 .g4 files into intermediate representation
  • Random Generation: Configurable random generation with support for alternatives, quantifiers, and optional elements
  • Multi-Grammar Support: Can merge separate lexer and parser grammar files
  • Recursion Control: Depth-based termination to prevent infinite loops
  • Token Generation: Concrete token generation for lexer rules with character sets, ranges, and negated sets

Configuration Options

  • MaxDepth: Maximum recursion depth (default: 5)
  • OptionalProb: Probability of including optional elements (0.0-1.0)
  • MaxQuantifier/MinQuantifier: Upper and lower bounds on the number of repetitions generated for * and + quantifiers
  • Seed: Reproducible random generation
  • OutputFormat: Compact or verbose output with rule traversal

Testing Infrastructure

  • PostgreSQL grammar integration tests
  • Multiple test scenarios (simple, deep, minimal)
  • Benchmark tests for performance measurement
  • Verbose output for debugging rule traversal

🔧 Usage Example

cfg := &config.Config{
    GrammarFiles: []string{"postgresql/PostgreSQLLexer.g4", "postgresql/PostgreSQLParser.g4"},
    StartRule:    "selectstmt",
    Count:        10,
    MaxDepth:     5,
    OptionalProb: 0.7,
    Seed:         42,
}

gen := generator.New(cfg)
if err := gen.Generate(); err != nil { // Generates 10 SELECT statements
    log.Fatal(err)
}

🧪 Test Results

All tests pass successfully:
=== RUN   TestPostgreSQLSelectStmt
=== RUN   TestPostgreSQLExpressions
=== RUN   TestPostgreSQLVerboseOutput
PASS
ok      github.com/bytebase/parser/tools/fuzzing/tests    0.833s

⚠️ Current Limitations

1. Aggressive Depth Limiting

The current implementation terminates purely on nesting depth. Whenever the limit is reached it emits a placeholder such as <rule_MAX_DEPTH>, even when the chain of rules that led there is not actually recursive:

Example Output:
Query 1: ( <select_clause_MAX_DEPTH> <for_locking_clause_MAX_DEPTH> )

Issues:
- Reaches max depth through sequential rule expansion, not recursion
- Doesn't distinguish between recursive and non-recursive rule references
- May generate less realistic output at depth boundaries
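
The difference between the two termination strategies can be shown on a toy grammar. This is a sketch under simplifying assumptions: each rule has a single alternative, any symbol that is not a rule name is treated as a literal, and `ByDepth` and `ByRecursion` are hypothetical names rather than functions from this PR. `ByDepth` counts total nesting, while `ByRecursion` counts only how often the same rule is already active on the expansion stack.

```go
package main

import (
	"fmt"
	"strings"
)

// Grammar maps a rule to one sequence of symbols (single alternative for
// brevity). Symbols that are not keys of the map are treated as literals.
type Grammar map[string][]string

// ByDepth cuts off on total nesting depth, recursive or not.
func ByDepth(g Grammar, sym string, depth, limit int) string {
	body, isRule := g[sym]
	if !isRule {
		return sym
	}
	if depth > limit {
		return "<" + sym + "_MAX_DEPTH>"
	}
	parts := make([]string, 0, len(body))
	for _, s := range body {
		parts = append(parts, ByDepth(g, s, depth+1, limit))
	}
	return strings.Join(parts, " ")
}

// ByRecursion cuts off only when sym is already active more than limit times.
func ByRecursion(g Grammar, sym string, active map[string]int, limit int) string {
	body, isRule := g[sym]
	if !isRule {
		return sym
	}
	if active[sym] > limit {
		return "<" + sym + "_MAX_DEPTH>"
	}
	active[sym]++ // sym is on the expansion stack while its body is expanded
	parts := make([]string, 0, len(body))
	for _, s := range body {
		parts = append(parts, ByRecursion(g, s, active, limit))
	}
	active[sym]--
	return strings.Join(parts, " ")
}

func main() {
	g := Grammar{
		// A non-recursive chain that is simply four rules long.
		"stmt":          {"select_clause"},
		"select_clause": {"target_list"},
		"target_list":   {"column"},
		"column":        {"id"},
		// A genuinely recursive rule.
		"nested": {"(", "nested", ")"},
	}
	fmt.Println(ByDepth(g, "stmt", 0, 2))
	fmt.Println(ByRecursion(g, "stmt", map[string]int{}, 2))
	fmt.Println(ByRecursion(g, "nested", map[string]int{}, 1))
}
```

With a limit of 2, the depth-based version gives up on the non-recursive chain, whereas the recursion-aware version expands it fully and still bounds the recursive rule.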

2. Basic Terminal Selection

- No attempt to find non-recursive alternatives before using placeholders
- Could implement smarter terminal forcing for better output quality

3. Limited Character Set Support

- Basic support for [a-z], ~[...], 'a'..'z' patterns
- Could expand lexer pattern support

🔄 Future Enhancements

1. Smart Recursion Detection: Distinguish actual recursion from sequential expansion
2. Depth-Biased Selection: Prefer non-recursive alternatives at higher depths
3. Terminal Forcing: Try non-recursive alternatives before placeholders
4. More Grammar Support: Extend beyond PostgreSQL to other SQL dialects
5. Advanced Lexer Patterns: Expand character class and regex support

🎯 Integration

- Uses existing ANTLR v4 parser at tools/grammar/
- Compatible with all parser implementations in the repository
- Provides programmatic access without requiring CLI tools
- Ready for CI/CD integration and automated testing

This foundational implementation provides a solid base for grammar-aware fuzzing while clearly identifying areas for future improvement.

@h3n4l h3n4l changed the title from "feat: fuzzing v1" to "feat: implement ANTLR grammar-aware fuzzing library for parser testing" Aug 29, 2025
@h3n4l h3n4l merged commit 80e1ea5 into main Aug 29, 2025
5 checks passed
@h3n4l h3n4l deleted the fuzzing branch August 29, 2025 06:39