Bash parser #6753

Open
kmccarp wants to merge 6 commits into main from bash-parser-improvements


@kmccarp kmccarp commented Feb 17, 2026

Summary

A new bash parser for OpenRewrite, built from scratch using ANTLR4. The parser produces a lossless syntax tree (LST) that preserves all whitespace, comments, and formatting — enabling round-trip parsing where printed output is byte-identical to the original source.

What's included

  • ANTLR4 grammar (BashLexer.g4, BashParser.g4) covering bash syntax: functions, loops, conditionals, case statements, arrays, arithmetic, pipelines, redirections, here-documents, process substitution, command substitution, variable expansion, quoting, and more
  • BashParserVisitor that converts ANTLR parse trees into OpenRewrite's LST model
  • BashPrinter for lossless printing back to source
  • BashVisitor / BashIsoVisitor for recipe authors to traverse and transform bash scripts
  • 270 unit tests across 19 test classes

Testing strategy

Unit tests (270 tests, 19 classes): Each test class covers a specific language construct (arithmetic, arrays, case statements, command substitution, conditionals, for loops, functions, if statements, pipelines, process substitution, quoting, redirections, subshells, variable expansion, while loops, etc.). Every test verifies lossless round-trip fidelity — the parsed-then-printed output must be byte-identical to the input.
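
To make the round-trip property concrete, here is a minimal, self-contained sketch. It is a toy, not the parser in this PR: `RoundTripSketch`, `Word`, and `Script` are hypothetical names, and the "grammar" is just whitespace-separated words. The idea it illustrates is the one the tests rely on: every node keeps the whitespace that preceded it, so printing is plain concatenation and the output can be compared to the input character by character.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of lossless round-tripping; not the parser from this PR. */
final class RoundTripSketch {

    /** A word plus the whitespace that preceded it in the source. */
    static final class Word {
        final String prefix;
        final String text;

        Word(String prefix, String text) {
            this.prefix = prefix;
            this.text = text;
        }
    }

    /** Parsed form: the words, plus whitespace left over at end of input. */
    static final class Script {
        final List<Word> words;
        final String eof;

        Script(List<Word> words, String eof) {
            this.words = words;
            this.eof = eof;
        }
    }

    /** Splits the source into words without discarding any character. */
    static Script parse(String source) {
        List<Word> words = new ArrayList<>();
        int i = 0;
        while (true) {
            int start = i;
            while (i < source.length() && Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            String prefix = source.substring(start, i);
            if (i == source.length()) {
                // Trailing whitespace belongs to the script, not to any word.
                return new Script(words, prefix);
            }
            int wordStart = i;
            while (i < source.length() && !Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            words.add(new Word(prefix, source.substring(wordStart, i)));
        }
    }

    /** Prints each word after its stored prefix, then the trailing whitespace. */
    static String print(Script script) {
        StringBuilder out = new StringBuilder();
        for (Word w : script.words) {
            out.append(w.prefix).append(w.text);
        }
        return out.append(script.eof).toString();
    }

    /** The property every unit test asserts: parse, print, compare. */
    static boolean roundTrips(String source) {
        return print(parse(source)).equals(source);
    }
}
```

Calling `RoundTripSketch.roundTrips(source)` on any string returns true in this toy, because no character is ever dropped. The real parser has to uphold the same invariant across the full bash grammar.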

Corpus validation (3,000+ scripts from 10 open source projects): The parser was validated against a diverse corpus of real-world bash scripts. Every script parses and round-trips successfully. The corpus repos:

Repository                  Scripts
torvalds/linux              1,079
NixOS/nixpkgs               1,054
kubernetes/kubernetes       307
void-linux/void-packages    296
acmesh-official/acme.sh     245
ohmyzsh/ohmyzsh             16
pi-hole/pi-hole             17
bats-core/bats-core         7
pyenv/pyenv                 2
asdf-vm/asdf                1

Test plan

  • All 270 unit tests pass
  • Full corpus passes (3,000+ scripts, 100% success rate)

Expanded grammar and visitor to handle additional bash constructs
encountered in a 3145-script corpus (nixpkgs, kubernetes, pi-hole,
void-packages):

- Brace groups, select statements, process substitution with
  whitespace before closing paren
- Suppress synthetic ANTLR error-recovery token text (e.g.
  <missing 'fi'>) that was polluting round-trip output
- Arithmetic expressions with bitwise ops, special vars, triple-paren
  edge cases
- Deeply nested command/process substitutions, backtick nesting
- Here-strings, complex redirections, associative arrays
- Various quoting edge cases (dollar-single-quote, regex escapes)

Added 59 regression tests across 15 existing test classes covering
all newly supported constructs. 270 tests total, all passing.

@github-project-automation github-project-automation bot moved this to In Progress in OpenRewrite Feb 17, 2026
@kmccarp kmccarp changed the title from "Improve bash parser for real-world script compatibility" to "New bash parser with lossless round-trip fidelity" Feb 17, 2026
@kmccarp kmccarp changed the title from "New bash parser with lossless round-trip fidelity" to "Bash parser" Feb 17, 2026

Address review feedback on CommandList and Pipeline operator modeling:

- Replace List<Literal> operators in CommandList with typed enum
  (Operator.AND, Operator.OR) paired with Space via OperatorEntry
- Replace List<Literal> pipeOperators in Pipeline with typed enum
  (PipeOp.PIPE, PipeOp.PIPE_AND) paired with Space via PipeEntry
- Add Bash.Background wrapper type for & (postfix statement modifier),
  following the same pattern as Bash.Redirected
- Restructure grammar: move &&/|| into andOr rule, keep ;/& in listSep
- CommandList now exclusively represents &&/|| chains
- ; absorbed into whitespace prefix (not explicitly modeled)
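
As a rough illustration of the operator modeling described above, here is a self-contained sketch. Only the names `CommandList`, `OperatorEntry`, `Operator.AND`, and `Operator.OR` come from the commit message. Everything else is assumed for the example: the field names, the `symbol` field, the `print` helper, and the use of plain `String` in place of OpenRewrite's `Space` and command node types. The actual classes in the PR may look quite different.

```java
import java.util.List;

/** Hypothetical sketch of typed operators paired with whitespace; not the PR's code. */
final class CommandListSketch {

    /** Typed replacement for operators previously stored as free-form literals. */
    enum Operator {
        AND("&&"),
        OR("||");

        final String symbol;

        Operator(String symbol) {
            this.symbol = symbol;
        }
    }

    /** An operator plus the whitespace before it (String stands in for Space). */
    static final class OperatorEntry {
        final String prefix;
        final Operator operator;

        OperatorEntry(String prefix, Operator operator) {
            this.prefix = prefix;
            this.operator = operator;
        }
    }

    /** A chain of commands joined only by && and ||. */
    static final class CommandList {
        final List<String> commands;         // each command carries its own leading whitespace
        final List<OperatorEntry> operators; // exactly one operator between adjacent commands

        CommandList(List<String> commands, List<OperatorEntry> operators) {
            if (commands.isEmpty() || operators.size() != commands.size() - 1) {
                throw new IllegalArgumentException("need exactly one operator between adjacent commands");
            }
            this.commands = commands;
            this.operators = operators;
        }
    }

    /** Prints the chain back to source, emitting each operator after its stored prefix. */
    static String print(CommandList list) {
        StringBuilder out = new StringBuilder(list.commands.get(0));
        for (int i = 0; i < list.operators.size(); i++) {
            OperatorEntry entry = list.operators.get(i);
            out.append(entry.prefix)
               .append(entry.operator.symbol)
               .append(list.commands.get(i + 1));
        }
        return out.toString();
    }
}
```

With an enum, a recipe can branch on `Operator.AND` versus `Operator.OR` without comparing strings, and keeping the whitespace in a separate field lets the printed output stay identical to the source. The same shape would presumably apply to `PipeEntry` with `PipeOp.PIPE` and `PipeOp.PIPE_AND`.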