| 1 | +# Grammar-Aware Fuzzing Tool Design |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +A fuzzing tool that generates syntactically valid SQL inputs from ANTLR v4 grammar files. The generated queries exercise the parser broadly and stress-test its performance and correctness. |
| 6 | + |
| 7 | +## Goals |
| 8 | + |
| 9 | +- **Valid Input Generation**: Generate syntactically correct SQL queries based on grammar rules |
| 10 | +- **Performance Testing**: Create complex queries to test parser performance limits |
| 11 | +- **Coverage Maximization**: Exercise all grammar rules and edge cases |
| 12 | +- **Automated Testing**: Integrate with CI for continuous parser validation |
| 13 | + |
| 14 | +## Architecture |
| 15 | + |
| 16 | +``` |
| 17 | +tools/fuzzing/ |
| 18 | +├── generator/ # Core generation logic |
| 19 | +│ ├── grammar_analyzer.go # Parse ANTLR grammar files |
| 20 | +│ ├── rule_expander.go # Expand grammar rules to concrete syntax |
| 21 | +│ └── query_builder.go # Build SQL queries from rule expansions |
| 22 | +├── strategies/ # Different generation strategies |
| 23 | +│ ├── depth_first.go # Generate deeply nested structures |
| 24 | +│ ├── breadth_first.go # Generate wide, complex queries |
| 25 | +│ └── weighted.go # Probability-based rule selection |
| 26 | +├── corpus/ # Generated test cases and seeds |
| 27 | +│ ├── seeds/ # Hand-crafted seed inputs |
| 28 | +│ └── generated/ # Auto-generated test cases |
| 29 | +└── cmd/ # CLI tools |
| 30 | + └── fuzzer/ # Main fuzzer executable |
| 31 | +``` |
| 32 | + |
| 33 | +## Core Components |
| 34 | + |
| 35 | +### 1. Grammar Analyzer |
| 36 | + |
| 37 | +Leverages the existing `tools/grammar/` ANTLR v4 parser to: |
| 38 | +- Parse target grammar files (e.g., `postgresql.g4`, `cql.g4`) |
| 39 | +- Extract production rules and their alternatives |
| 40 | +- Build dependency graph between rules |
| 41 | +- Identify terminal vs non-terminal symbols |
| 42 | + |
| 43 | +```go |
| 44 | +type GrammarAnalyzer struct { |
| 45 | + parser *grammar.ANTLRv4Parser |
| 46 | + rules map[string]*Rule |
| 47 | +} |
| 48 | + |
| 49 | +type Rule struct { |
| 50 | + Name string |
| 51 | + Alternatives []Alternative |
| 52 | + Type RuleType // LEXER, PARSER, FRAGMENT |
| 53 | +} |
| 54 | +``` |
| 55 | + |
| 56 | +### 2. Rule Expander |
| 57 | + |
| 58 | +Recursively expands grammar rules into concrete syntax trees: |
| 59 | +- Handles rule recursion with configurable depth limits |
| 60 | +- Supports probability-weighted alternative selection |
| 61 | +- Manages lexer rules and literal generation |
| 62 | +- Tracks generation context (for example, the current depth) to inform rule selection |
| 63 | + |
| 64 | +```go |
| 65 | +type RuleExpander struct { |
| 66 | + grammar *ParsedGrammar |
| 67 | + maxDepth int |
| 68 | + weights map[string]float64 |
| 69 | + random *rand.Rand |
| 70 | +} |
| 71 | +``` |
| 72 | + |
| 73 | +### 3. Query Builder |
| 74 | + |
| 75 | +Converts syntax trees to SQL query strings: |
| 76 | +- Handles whitespace and formatting |
| 77 | +- Manages identifier generation (table names, columns) |
| 78 | +- Ensures semantic consistency where possible |
| 79 | +- Outputs parseable query strings |
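The whitespace handling could start from something as small as the sketch below. It assumes the builder receives a flat token slice, and the two punctuation sets are illustrative guesses rather than rules taken from any grammar; `BuildQuery` is a hypothetical name.

```go
package main

import "strings"

// Illustrative punctuation sets (assumptions, not derived from a grammar):
// tokens that should not be preceded or followed by a space.
var noSpaceBefore = map[string]bool{",": true, ")": true, ";": true, ".": true}
var noSpaceAfter = map[string]bool{"(": true, ".": true}

// BuildQuery joins tokens with single spaces, except around the
// punctuation listed above.
func BuildQuery(tokens []string) string {
	var sb strings.Builder
	for i, tok := range tokens {
		if i > 0 && !noSpaceBefore[tok] && !noSpaceAfter[tokens[i-1]] {
			sb.WriteByte(' ')
		}
		sb.WriteString(tok)
	}
	return sb.String()
}
```

Emitting a space between every pair of tokens would also be parseable; trimming spaces around punctuation mainly keeps the corpus readable when a failing query has to be inspected by hand.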
| 80 | + |
| 81 | +## Generation Strategies |
| 82 | + |
| 83 | +### Depth-First Strategy |
| 84 | +- Generates deeply nested subqueries and expressions |
| 85 | +- Tests parser stack limits and recursion handling |
| 86 | +- Focuses on structural complexity |
| 87 | + |
| 88 | +### Breadth-First Strategy |
| 89 | +- Creates wide queries with many clauses, joins, and columns |
| 90 | +- Tests parser memory usage and performance |
| 91 | +- Focuses on query size and breadth |
| 92 | + |
| 93 | +### Weighted Strategy |
| 94 | +- Uses probability weights for rule selection |
| 95 | +- Biases toward commonly used constructs |
| 96 | +- Configurable via weight files per dialect |
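Probability-weighted selection can be sketched as a cumulative-weight scan. To keep the example deterministic, the hypothetical `PickWeighted` below takes the uniform sample `u` as a parameter; in the fuzzer it would presumably come from the expander's `*rand.Rand` via `Float64()`. Treating non-positive weights as "never select" is an assumption of this sketch.

```go
package main

// PickWeighted returns index i with probability weights[i] / sum(weights),
// given a uniform sample u in [0, 1). Non-positive weights are never
// selected. It returns -1 when no weight is positive.
func PickWeighted(u float64, weights []float64) int {
	total := 0.0
	for _, w := range weights {
		if w > 0 {
			total += w
		}
	}
	if total == 0 {
		return -1
	}
	target := u * total
	last := -1
	for i, w := range weights {
		if w <= 0 {
			continue
		}
		last = i
		if target < w {
			return i
		}
		target -= w
	}
	return last // guards against floating-point rounding at the upper edge
}
```

Because the sample is scaled by the total, the weights do not have to sum to exactly 1.0, which makes hand-edited weight files more forgiving.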
| 97 | + |
| 98 | +## Integration Points |
| 99 | + |
| 100 | +### With Existing Grammar Parser |
| 101 | +```go |
| 102 | +// Reuse tools/grammar/ for parsing target grammars |
| 103 | +analyzer := NewGrammarAnalyzer() |
| 104 | +targetGrammar, err := analyzer.ParseGrammarFile("postgresql/PostgreSQLLexer.g4") |
| 105 | +``` |
| 106 | + |
| 107 | +### With Parser Testing |
| 108 | +```go |
| 109 | +// Generate test cases for specific parser |
| 110 | +fuzzer := NewFuzzer(postgresqlGrammar) |
| 111 | +queries := fuzzer.GenerateQueries(1000) |
| 112 | + |
| 113 | +for _, query := range queries { |
| 114 | + // Test against postgresql parser |
| 115 | + result := postgresqlParser.Parse(query) |
| 116 | + // Collect metrics, detect crashes |
| 117 | +} |
| 118 | +``` |
| 119 | + |
| 120 | +## Configuration |
| 121 | + |
| 122 | +### Fuzzer Config |
| 123 | +```yaml |
| 124 | +target_grammar: "postgresql" |
| 125 | +strategies: |
| 126 | + - name: "depth_first" |
| 127 | + weight: 0.3 |
| 128 | + max_depth: 15 |
| 129 | + - name: "breadth_first" |
| 130 | + weight: 0.4 |
| 131 | + max_width: 50 |
| 132 | + - name: "weighted" |
| 133 | + weight: 0.3 |
| 134 | + weights_file: "postgresql_weights.yaml" |
| 135 | + |
| 136 | +generation: |
| 137 | + count: 10000 |
| 138 | + max_query_length: 100000 |
| 139 | + seed: 42 |
| 140 | + |
| 141 | +output: |
| 142 | + format: "sql" |
| 143 | + directory: "corpus/generated" |
| 144 | +``` |
| 145 | + |
| 146 | +### Grammar Weights |
| 147 | +```yaml |
| 148 | +# postgresql_weights.yaml |
| 149 | +rules: |
| 150 | + selectStmt: 0.4 |
| 151 | + insertStmt: 0.2 |
| 152 | + updateStmt: 0.2 |
| 153 | + deleteStmt: 0.1 |
| 154 | + createStmt: 0.1 |
| 155 | + |
| 156 | + # Bias toward complex expressions |
| 157 | + expr: |
| 158 | + binaryOp: 0.4 |
| 159 | + functionCall: 0.3 |
| 160 | + subquery: 0.2 |
| 161 | + literal: 0.1 |
| 162 | +``` |
| 163 | + |
| 164 | +## CLI Interface |
| 165 | + |
| 166 | +```bash |
| 167 | +# Generate queries for PostgreSQL |
| 168 | +./fuzzer generate --grammar postgresql --count 1000 --strategy weighted |
| 169 | + |
| 170 | +# Run continuous fuzzing with performance metrics |
| 171 | +./fuzzer fuzz --grammar cql --duration 1h --metrics |
| 172 | + |
| 173 | +# Validate existing corpus against parser |
| 174 | +./fuzzer validate --grammar postgresql --corpus corpus/postgresql/ |
| 175 | +``` |
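The subcommands above could be parsed as sketched below. The design names `cobra` as one option for the CLI framework; this sketch swaps in the standard library `flag` package only to stay dependency-free. The `Options` struct, the `ParseArgs` name, and all default values are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Options is a hypothetical container for the parsed command line.
type Options struct {
	Command  string
	Grammar  string
	Count    int
	Strategy string
	Duration time.Duration
	Metrics  bool
	Corpus   string
}

// ParseArgs parses "<subcommand> [flags]" (os.Args[1:] in a real main).
// Default values below are illustrative only.
func ParseArgs(args []string) (Options, error) {
	if len(args) == 0 {
		return Options{}, fmt.Errorf("missing subcommand")
	}
	opts := Options{Command: args[0]}
	fs := flag.NewFlagSet(args[0], flag.ContinueOnError)
	fs.StringVar(&opts.Grammar, "grammar", "", "target grammar name")
	switch args[0] {
	case "generate":
		fs.IntVar(&opts.Count, "count", 1000, "number of queries to generate")
		fs.StringVar(&opts.Strategy, "strategy", "weighted", "generation strategy")
	case "fuzz":
		fs.DurationVar(&opts.Duration, "duration", time.Hour, "how long to fuzz")
		fs.BoolVar(&opts.Metrics, "metrics", false, "collect performance metrics")
	case "validate":
		fs.StringVar(&opts.Corpus, "corpus", "", "corpus directory to validate")
	default:
		return Options{}, fmt.Errorf("unknown subcommand %q", args[0])
	}
	if err := fs.Parse(args[1:]); err != nil {
		return Options{}, err
	}
	if opts.Grammar == "" {
		return Options{}, fmt.Errorf("--grammar is required")
	}
	return opts, nil
}
```

The `flag` package accepts both `-grammar` and `--grammar`, so the invocations shown above parse unchanged.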
| 176 | + |
| 177 | +## Performance Metrics |
| 178 | + |
| 179 | +### Generation Metrics |
| 180 | +- Queries generated per second |
| 181 | +- Grammar rule coverage percentage |
| 182 | +- Distribution of query complexity (depth, width) |
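Grammar rule coverage could be tracked with a small counter that the expander updates each time it expands a rule. This is a sketch under the assumption that coverage means "fraction of known rule names expanded at least once"; `CoverageTracker` and its methods are hypothetical names.

```go
package main

// CoverageTracker records which known grammar rules have been expanded
// at least once during generation.
type CoverageTracker struct {
	known map[string]bool
	hit   map[string]bool
}

// NewCoverageTracker registers the full set of rule names up front.
func NewCoverageTracker(ruleNames []string) *CoverageTracker {
	c := &CoverageTracker{known: map[string]bool{}, hit: map[string]bool{}}
	for _, name := range ruleNames {
		c.known[name] = true
	}
	return c
}

// Record marks a rule as exercised. Names outside the known set are
// ignored so they cannot inflate the percentage.
func (c *CoverageTracker) Record(rule string) {
	if c.known[rule] {
		c.hit[rule] = true
	}
}

// Percent returns the share of known rules hit so far, from 0 to 100.
// It returns 0 when no rules are registered.
func (c *CoverageTracker) Percent() float64 {
	if len(c.known) == 0 {
		return 0
	}
	return 100 * float64(len(c.hit)) / float64(len(c.known))
}
```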
| 183 | + |
| 184 | +### Parser Testing Metrics |
| 185 | +- Parse success rate |
| 186 | +- Average parse time per query |
| 187 | +- Memory usage during parsing |
| 188 | +- Parser crash/error detection |
| 189 | + |
| 190 | +## Implementation Phases |
| 191 | + |
| 192 | +### Phase 1: Foundation (Weeks 1-2) |
| 193 | +- Basic grammar analyzer using existing ANTLR parser |
| 194 | +- Simple rule expander with depth-first strategy |
| 195 | +- Command-line interface for manual testing |
| 196 | + |
| 197 | +### Phase 2: Core Features (Weeks 3-4) |
| 198 | +- Multiple generation strategies |
| 199 | +- Configuration system |
| 200 | +- Basic corpus management |
| 201 | +- Integration with existing parser tests |
| 202 | + |
| 203 | +### Phase 3: Advanced Features (Weeks 5-6) |
| 204 | +- Weighted generation with probability tuning |
| 205 | +- Performance metrics collection |
| 206 | +- CI integration for continuous fuzzing |
| 207 | +- Corpus minimization and deduplication |
| 208 | + |
| 209 | +### Phase 4: Optimization (Weeks 7-8) |
| 210 | +- Generation performance optimization |
| 211 | +- Advanced semantic awareness |
| 212 | +- Custom mutation strategies |
| 213 | +- Comprehensive documentation |
| 214 | + |
| 215 | +## Future Enhancements |
| 216 | + |
| 217 | +- **Semantic Awareness**: Generate queries with valid schema references |
| 218 | +- **Mutation-Based Fuzzing**: Mutate existing queries to explore edge cases |
| 219 | +- **Differential Testing**: Compare parser outputs across database dialects |
| 220 | +- **Performance Regression Detection**: Track parser performance over time |
| 221 | +- **Grammar Evolution**: Adapt fuzzing as grammars evolve |
| 222 | + |
| 223 | +## Dependencies |
| 224 | + |
| 225 | +- Existing `tools/grammar/` ANTLR v4 parser |
| 226 | +- Go standard library (`math/rand`, `fmt`, `strings`) |
| 227 | +- YAML configuration parsing |
| 228 | +- CLI framework (e.g., `cobra`) |
| 229 | + |
| 230 | +This design provides a solid foundation for grammar-aware fuzzing while leveraging our existing ANTLR infrastructure. |