Commit ba0039f

Add postlexer to support multiline binary operators and ternary expressions
1 parent a80e8e2 commit ba0039f

12 files changed: +370 −18 lines

CLAUDE.md

Lines changed: 19 additions & 4 deletions

@@ -3,16 +3,20 @@
 ## Pipeline
 
 ```
-Forward: HCL2 Text → Lark Parse Tree → LarkElement Tree → Python Dict/JSON
+Forward: HCL2 Text → [PostLexer] → Lark Parse Tree → LarkElement Tree → Python Dict/JSON
 Reverse: Python Dict/JSON → LarkElement Tree → Lark Tree → HCL2 Text
+Direct: HCL2 Text → [PostLexer] → Lark Parse Tree → LarkElement Tree → Lark Tree → HCL2 Text
 ```
 
+The **Direct** pipeline (`parse_to_tree` → `transform` → `to_lark` → `reconstruct`) skips serialization to dict, so all IR nodes (including `NewLineOrCommentRule` nodes for whitespace/comments) directly influence the reconstructed output. Any information discarded before the IR is lost in this pipeline.
+
 ## Module Map
 
 | Module | Role |
 |---|---|
 | `hcl2/hcl2.lark` | Lark grammar definition |
 | `hcl2/api.py` | Public API (`load/loads/dump/dumps` + intermediate stages) |
+| `hcl2/postlexer.py` | Token stream transforms between lexer and parser |
 | `hcl2/parser.py` | Lark parser factory with caching |
 | `hcl2/transformer.py` | Lark parse tree → LarkElement tree |
 | `hcl2/deserializer.py` | Python dict → LarkElement tree |
@@ -73,6 +77,16 @@ jsontohcl2 --indent 4 --no-align file.json
 
 Add new options as `parser.add_argument()` calls in the relevant entry point module.
 
+## PostLexer (`postlexer.py`)
+
+Lark's `postlex` parameter accepts a single object with a `process(stream)` method that transforms the token stream between the lexer and LALR parser. The `PostLexer` class is designed for extensibility: each transformation is a private method that accepts and yields tokens, and `process()` chains them together.
+
+Current passes:
+
+- `_merge_newlines_into_operators`
+
+To add a new pass: create a private method with the same `(self, stream) -> generator` signature, and add a `yield from` call in `process()`.
+
 ## Hard Rules
 
 These are project-specific constraints that must not be violated:
@@ -88,6 +102,7 @@ These are project-specific constraints that must not be violated:
 ## Adding a New Language Construct
 
 1. Add grammar rules to `hcl2.lark`
+1. If the new construct creates LALR ambiguities with `NL_OR_COMMENT`, add a postlexer pass in `postlexer.py`
 1. Create rule class(es) in the appropriate `rules/` file
 1. Add transformer method(s) in `transformer.py`
 1. Implement `serialize()` in the rule class
@@ -103,9 +118,9 @@ python -m unittest discover -s test -p "test_*.py" -v
 
 **Unit tests** (`test/unit/`): instantiate rule objects directly (no parsing).
 
-- `test/unit/rules/` — one file per rules module
-- `test/unit/cli/` — one file per CLI module
-- `test/unit/test_api.py`, `test_builder.py`, `test_deserializer.py`, `test_formatter.py`, `test_reconstructor.py`, `test_utils.py`
+- `rules/` — one file per rules module
+- `cli/` — one file per CLI module
+- `test_*.py` — tests for corresponding files from `hcl2/` directory
 
 Use concrete stubs when testing ABCs (e.g., `StubExpression(ExpressionRule)`).

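The pass-chaining convention described in the PostLexer section above can be sketched as follows. This is a hypothetical skeleton, not the project's code: the second pass and its name are invented to show where a new one would plug in, and the bodies are placeholders.

```python
from typing import Iterator


class SketchPostLexer:
    """Hypothetical skeleton of the pass-chaining convention.

    Only _merge_newlines_into_operators exists in the commit; the second
    pass here is invented to show where a new one would be added.
    """

    def process(self, stream: Iterator[str]) -> Iterator[str]:
        # Feed each pass's output into the next; adding a pass is one line here.
        stream = self._merge_newlines_into_operators(stream)
        stream = self._example_future_pass(stream)
        yield from stream

    def _merge_newlines_into_operators(self, stream: Iterator[str]) -> Iterator[str]:
        yield from stream  # placeholder; the real logic lives in hcl2/postlexer.py

    def _example_future_pass(self, stream: Iterator[str]) -> Iterator[str]:
        yield from stream  # placeholder for a hypothetical new pass
```

Because every pass shares the `(self, stream) -> generator` shape, passes compose without knowing about each other.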
hcl2/parser.py

Lines changed: 3 additions & 0 deletions

@@ -4,6 +4,8 @@
 
 from lark import Lark
 
+from hcl2.postlexer import PostLexer
+
 
 PARSER_FILE = Path(__file__).absolute().resolve().parent / ".lark_cache.bin"
 
@@ -17,4 +19,5 @@ def parser() -> Lark:
         cache=str(PARSER_FILE),  # Disable/Delete file to effect changes to the grammar
         rel_to=__file__,
         propagate_positions=True,
+        postlex=PostLexer(),
     )

hcl2/postlexer.py

Lines changed: 80 additions & 0 deletions

@@ -0,0 +1,80 @@
+"""Postlexer that transforms the token stream between the Lark lexer and parser.
+
+Each transformation is implemented as a private method that accepts and yields
+tokens. The public ``process`` method chains them together, making it easy to
+add new passes without touching existing logic.
+"""
+
+from collections.abc import Iterator
+from typing import FrozenSet, Optional, Tuple
+
+from lark import Token
+
+# Type alias for a token stream consumed and produced by each pass.
+TokenStream = Iterator[Token]
+
+# Operator token types that may legally follow a line-continuation newline.
+# MINUS is excluded — it is also the unary negation operator, and merging a
+# newline into MINUS would incorrectly consume statement-separating newlines
+# before negative literals (e.g. "a = 1\nb = -2").
+OPERATOR_TYPES: FrozenSet[str] = frozenset(
+    {
+        "DOUBLE_EQ",
+        "NEQ",
+        "LT",
+        "GT",
+        "LEQ",
+        "GEQ",
+        "ASTERISK",
+        "SLASH",
+        "PERCENT",
+        "DOUBLE_AMP",
+        "DOUBLE_PIPE",
+        "PLUS",
+        "QMARK",
+    }
+)
+
+
+class PostLexer:
+    """Transform the token stream before it reaches the LALR parser."""
+
+    def process(self, stream: TokenStream) -> TokenStream:
+        """Chain all postlexer passes over the token stream."""
+        yield from self._merge_newlines_into_operators(stream)
+
+    def _merge_newlines_into_operators(self, stream: TokenStream) -> TokenStream:
+        """Merge NL_OR_COMMENT tokens into immediately following operator tokens.
+
+        LALR parsers cannot distinguish a statement-ending newline from a
+        line-continuation newline before a binary operator. This pass resolves
+        the ambiguity by merging NL_OR_COMMENT into the operator token's value
+        when the next token is a binary operator or QMARK. The transformer
+        later extracts the newline prefix and creates a NewLineOrCommentRule
+        node, preserving round-trip fidelity.
+        """
+        pending_nl: Optional[Token] = None
+        for token in stream:
+            if token.type == "NL_OR_COMMENT":
+                if pending_nl is not None:
+                    yield pending_nl
+                pending_nl = token
+            else:
+                if pending_nl is not None:
+                    if token.type in OPERATOR_TYPES:
+                        token = token.update(value=str(pending_nl) + str(token))
+                    else:
+                        yield pending_nl
+                    pending_nl = None
+                yield token
+        if pending_nl is not None:
+            yield pending_nl
+
+    @property
+    def always_accept(self) -> Tuple[()]:
+        """Terminal names the parser must accept even when not expected by LALR.
+
+        Lark requires this property on postlexer objects. An empty tuple
+        means no extra terminals are injected.
+        """
+        return ()

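The merging pass above can be exercised in isolation. The sketch below re-implements its core loop over a minimal stand-in token type (a dataclass, not `lark.Token`, so it runs without lark installed) and an abbreviated operator set; it is an illustration of the technique, not the project's class.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Tok:
    """Minimal stand-in for lark.Token."""
    type: str
    value: str


# Abbreviated set; the real OPERATOR_TYPES lists every binary operator plus QMARK.
OPERATOR_TYPES = {"PLUS", "QMARK"}


def merge_newlines(stream: Iterator[Tok]) -> Iterator[Tok]:
    """Re-implementation of _merge_newlines_into_operators' core loop."""
    pending: Optional[Tok] = None
    for tok in stream:
        if tok.type == "NL_OR_COMMENT":
            if pending is not None:
                yield pending          # two newline tokens in a row: flush the first
            pending = tok
        else:
            if pending is not None:
                if tok.type in OPERATOR_TYPES:
                    # Line-continuation: fold the newline into the operator's value.
                    tok = Tok(tok.type, pending.value + tok.value)
                else:
                    yield pending      # statement-separating newline: keep it
                pending = None
            yield tok
    if pending is not None:
        yield pending


# "a = 1\n+ 2": the newline precedes a binary operator, so it merges into PLUS.
toks = [Tok("ID", "a"), Tok("EQ", "="), Tok("INT", "1"),
        Tok("NL_OR_COMMENT", "\n"), Tok("PLUS", "+"), Tok("INT", "2")]
merged: List[Tok] = list(merge_newlines(toks))
```

The NL_OR_COMMENT token disappears from the stream, and the PLUS token's value becomes `"\n+"`, which the transformer later splits back apart.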
hcl2/reconstructor.py

Lines changed: 5 additions & 0 deletions

@@ -183,6 +183,11 @@ def _reconstruct_tree(
         # Check spacing BEFORE processing children, while _last_rule_name
         # still reflects the previous sibling (not a child of this tree).
         needs_space = self._should_add_space_before(tree, parent_rule_name)
+        if needs_space:
+            # A space will be inserted before this tree's output, so tell
+            # children that the last character was a space to prevent the
+            # first child from adding a duplicate leading space.
+            self._last_was_space = True
 
         if rule_name == UnaryOpRule.lark_name():
             for i, child in enumerate(tree.children):

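The purpose of setting the flag before recursing can be shown with a toy emitter (hypothetical, not the project's API): a boolean tracking whether the last emitted character was a space stops a parent and its first child from both inserting one.

```python
def join_with_spacing(tokens, wants_space):
    """Toy emitter: wants_space[i] marks tokens that request a leading space."""
    out = ""
    last_was_space = False
    for tok, wants in zip(tokens, wants_space):
        if wants and not last_was_space:
            out += " "
        out += tok
        # Record whether output ended on a space, as the reconstructor's
        # _last_was_space flag does, so the next token can skip its own.
        last_was_space = tok.endswith(" ")
    return out
```

Without the `last_was_space` check, a token ending in a space (like `"? "`) followed by a child that also requests a space would produce a double space in the output.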
hcl2/rules/expressions.py

Lines changed: 8 additions & 6 deletions

@@ -122,6 +122,7 @@ class ConditionalRule(ExpressionRule):
 
     _children_layout: Tuple[
         ExpressionRule,
+        Optional[NewLineOrCommentRule],
        QMARK,
         Optional[NewLineOrCommentRule],
         ExpressionRule,
@@ -137,7 +138,7 @@ def lark_name() -> str:
         return "conditional"
 
     def __init__(self, children, meta: Optional[Meta] = None):
-        self._insert_optionals(children, [2, 4, 6])
+        self._insert_optionals(children, [1, 3, 5, 7])
         super().__init__(children, meta)
 
     @property
@@ -148,12 +149,12 @@ def condition(self) -> ExpressionRule:
     @property
     def if_true(self) -> ExpressionRule:
         """Return the true-branch expression."""
-        return self._children[3]
+        return self._children[4]
 
     @property
     def if_false(self) -> ExpressionRule:
         """Return the false-branch expression."""
-        return self._children[7]
+        return self._children[8]
 
     def serialize(
         self, options=SerializationOptions(), context=SerializationContext()
@@ -179,6 +180,7 @@ class BinaryTermRule(ExpressionRule):
     """Rule for the operator+operand portion of a binary operation."""
 
     _children_layout: Tuple[
+        Optional[NewLineOrCommentRule],
         BinaryOperatorRule,
         Optional[NewLineOrCommentRule],
         ExprTermRule,
@@ -190,18 +192,18 @@ def lark_name() -> str:
         return "binary_term"
 
     def __init__(self, children, meta: Optional[Meta] = None):
-        self._insert_optionals(children, [1])
+        self._insert_optionals(children, [0, 2])
         super().__init__(children, meta)
 
     @property
     def binary_operator(self) -> BinaryOperatorRule:
         """Return the binary operator."""
-        return self._children[0]
+        return self._children[1]
 
     @property
     def expr_term(self) -> ExprTermRule:
         """Return the right-hand operand."""
-        return self._children[2]
+        return self._children[3]
 
     def serialize(
         self, options=SerializationOptions(), context=SerializationContext()

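The index shifts in this diff (`if_true` moving from `_children[3]` to `_children[4]`, and so on) follow from the optional-padding convention: every optional slot holds either a rule or `None`, so required children keep fixed positions. A hypothetical re-implementation of that padding, applied to the new conditional layout, shows why:

```python
from typing import Callable, List


def insert_optionals(children: List, optional_indices: List[int],
                     is_optional: Callable[[object], bool]) -> None:
    """Pad `children` with None at each optional slot that is not occupied.

    Hypothetical re-implementation for illustration; the real
    _insert_optionals lives on the project's rule base class.
    """
    for idx in optional_indices:
        if idx >= len(children) or not is_optional(children[idx]):
            children.insert(idx, None)


class NL:
    """Stands in for NewLineOrCommentRule."""


# New conditional layout:
# [cond, NL?, QMARK, NL?, if_true, NL?, COLON, NL?, if_false]
children = ["cond", "?", "true_expr", ":", "false_expr"]
insert_optionals(children, [1, 3, 5, 7], lambda c: isinstance(c, NL))
```

After padding, the true branch always sits at index 4 and the false branch at index 8, matching the updated property accessors, whether or not any newline nodes were present.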
hcl2/transformer.py

Lines changed: 39 additions & 3 deletions

@@ -81,9 +81,15 @@ def __init__(self, discard_new_line_or_comments: bool = False):
 
     def __default_token__(self, token: Token) -> StringToken:
         # TODO make this return StaticStringToken where applicable
-        if token.value in StaticStringToken.classes_by_value:
-            return StaticStringToken.classes_by_value[token.value]()
-        return StringToken[token.type](token.value)  # type: ignore[misc]
+        value = token.value
+        # The EQ terminal /[ \t]*=(?!=|>)/ captures leading whitespace.
+        # Strip it so parsed and deserialized tokens produce the same "=" value,
+        # preventing the reconstructor from emitting double spaces.
+        if token.type == "EQ":
+            value = value.lstrip(" \t")
+        if value in StaticStringToken.classes_by_value:
+            return StaticStringToken.classes_by_value[value]()
+        return StringToken[token.type](value)  # type: ignore[misc]
 
     # pylint: disable=C0103
     def FLOAT_LITERAL(self, token: Token) -> FloatLiteral:
@@ -164,8 +170,32 @@ def heredoc_template_trim(self, meta: Meta, args) -> HeredocTrimTemplateRule:
     def expr_term(self, meta: Meta, args) -> ExprTermRule:
         return ExprTermRule(args, meta)
 
+    def _extract_nl_prefix(self, token):
+        """Strip leading newlines from a token value.
+
+        If the token contains a newline prefix (from the postlexer merging a
+        line-continuation newline into the operator token), strip it and
+        return a NewLineOrCommentRule. Otherwise return None.
+        """
+        value = str(token.value)
+        stripped = value.lstrip("\n \t")
+        if len(stripped) == len(value):
+            return None
+        nl_text = value[: len(value) - len(stripped)]
+        token.set_value(stripped)
+        if self.discard_new_line_or_comments:
+            return None
+        return NewLineOrCommentRule.from_string(nl_text)
+
     @v_args(meta=True)
     def conditional(self, meta: Meta, args) -> ConditionalRule:
+        # args: [condition, QMARK, NL?, if_true, NL?, COLON, NL?, if_false]
+        # QMARK is at index 1 — check for NL prefix from the postlexer
+        qmark_token = args[1]
+        nl_rule = self._extract_nl_prefix(qmark_token)
+        if nl_rule is not None:
+            args = list(args)
+            args.insert(1, nl_rule)
         return ConditionalRule(args, meta)
 
     @v_args(meta=True)
@@ -174,6 +204,12 @@ def binary_operator(self, meta: Meta, args) -> BinaryOperatorRule:
 
     @v_args(meta=True)
     def binary_term(self, meta: Meta, args) -> BinaryTermRule:
+        # args: [BinaryOperatorRule, NL?, ExprTermRule]
+        # The operator's token may contain a NL prefix from the postlexer
+        op_rule = args[0]
+        nl_rule = self._extract_nl_prefix(op_rule.token)
+        if nl_rule is not None:
+            args = [nl_rule] + list(args)
         return BinaryTermRule(args, meta)
 
     @v_args(meta=True)

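The string handling inside `_extract_nl_prefix` can be sketched as a pure function. This is an illustration only: the real method also mutates the token in place via `set_value` and returns `None` when comments are being discarded.

```python
from typing import Optional, Tuple


def split_nl_prefix(value: str) -> Tuple[Optional[str], str]:
    """Split a merged operator value into (newline_prefix, operator_text).

    Pure-function sketch of _extract_nl_prefix's string logic: strip
    leading newlines/whitespace, and recover the prefix as whatever
    lstrip removed.
    """
    stripped = value.lstrip("\n \t")
    if len(stripped) == len(value):
        return None, value          # no prefix: the postlexer did not merge anything
    return value[: len(value) - len(stripped)], stripped
```

Applied to a PLUS token whose value became `"\n  +"` after the postlexer pass, this recovers the `"\n  "` prefix (which becomes a `NewLineOrCommentRule`) and the bare `"+"` operator.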
test/integration/hcl2_original/smoke.tf

Lines changed: 22 additions & 0 deletions

@@ -43,6 +43,28 @@ block label1 label2 {
   }
 }
 
+block multiline_ternary {
+  foo = (
+    bar
+    ? baz(foo)
+    : foo == "bar"
+      ? "baz"
+      : foo
+  )
+}
+
+block multiline_binary_ops {
+  expr = {
+    for k, v in local.map_a : k => v
+    if lookup(local.map_b[v.id
+    ], "enabled", false)
+    || (
+      contains(local.map_c, v.id)
+      && contains(local.map_d, v.id)
+    )
+  }
+}
+
 block {
   route53_forwarding_rule_shares = {
     for forwarding_rule_key in keys(var.route53_resolver_forwarding_rule_shares) :

test/integration/hcl2_reconstructed/smoke.tf

Lines changed: 14 additions & 0 deletions

@@ -39,6 +39,20 @@ block label1 label2 {
 }
 
 
+block multiline_ternary {
+  foo = (bar ? baz(foo) : foo == "bar" ? "baz" : foo)
+}
+
+
+block multiline_binary_ops {
+  expr = {
+    for k, v in local.map_a :
+    k => v
+    if lookup(local.map_b[v.id], "enabled", false) || (contains(local.map_c, v.id) && contains(local.map_d, v.id))
+  }
+}
+
+
 block {
   route53_forwarding_rule_shares = {
     for forwarding_rule_key in keys(var.route53_resolver_forwarding_rule_shares) :

test/integration/json_reserialized/smoke.json

Lines changed: 12 additions & 0 deletions

@@ -48,6 +48,18 @@
       }
     }
   },
+  {
+    "multiline_ternary": {
+      "foo": "${(bar ? baz(foo) : foo == \"bar\" ? \"baz\" : foo)}",
+      "__is_block__": true
+    }
+  },
+  {
+    "multiline_binary_ops": {
+      "expr": "${{for k, v in local.map_a : k => v if lookup(local.map_b[v.id], \"enabled\", false) || (contains(local.map_c, v.id) && contains(local.map_d, v.id))}}",
+      "__is_block__": true
+    }
+  },
   {
     "route53_forwarding_rule_shares": "${{for forwarding_rule_key in keys(var.route53_resolver_forwarding_rule_shares) : \"${forwarding_rule_key}\" => {aws_account_ids = [for account_name in var.route53_resolver_forwarding_rule_shares[forwarding_rule_key].aws_account_names : module.remote_state_subaccounts.map[account_name].outputs[\"aws_account_id\"]]}... if substr(bucket_name, 0, 1) == \"l\"}}",
     "__is_block__": true

test/integration/json_serialized/smoke.json

Lines changed: 12 additions & 0 deletions

@@ -48,6 +48,18 @@
       }
     }
   },
+  {
+    "multiline_ternary": {
+      "foo": "${(bar ? baz(foo) : foo == \"bar\" ? \"baz\" : foo)}",
+      "__is_block__": true
+    }
+  },
+  {
+    "multiline_binary_ops": {
+      "expr": "${{for k, v in local.map_a : k => v if lookup(local.map_b[v.id], \"enabled\", false) || (contains(local.map_c, v.id) && contains(local.map_d, v.id))}}",
+      "__is_block__": true
+    }
+  },
   {
     "route53_forwarding_rule_shares": "${{for forwarding_rule_key in keys(var.route53_resolver_forwarding_rule_shares) : \"${forwarding_rule_key}\" => {aws_account_ids = [for account_name in var.route53_resolver_forwarding_rule_shares[forwarding_rule_key].aws_account_names : module.remote_state_subaccounts.map[account_name].outputs[\"aws_account_id\"]]}... if substr(bucket_name, 0, 1) == \"l\"}}",
     "__is_block__": true
