Commit 48e7589

geoffromer and jonmeow authored
Document pattern-matching implementation (#5846)
Co-authored-by: Jon Ross-Perkins <[email protected]>
1 parent cae8aa3 commit 48e7589

4 files changed: +299 -6 lines changed

docs/design/pattern_matching.md

Lines changed: 5 additions & 0 deletions
@@ -52,6 +52,11 @@ _binding patterns_. When a pattern is executed by giving it a value called the
5252
_scrutinee_, it determines whether the scrutinee matches the pattern, and if so,
5353
determines the values of the bindings.
5454

55+
A _full pattern_ is a complete input to a pattern matching operation; that is, a
56+
pattern that is not a subpattern of another pattern. If it's preceded by a
57+
deduced parameter list or followed by a return type expression, those are part
58+
of the full pattern as well.
59+
5560
## Pattern Syntax and Semantics
5661

5762
Expressions are patterns, as described below. A pattern that is not an

docs/design/variadics.md

Lines changed: 3 additions & 6 deletions
@@ -480,13 +480,10 @@ expansion itself is relatively straightforward:
480480
481481
### Typechecking patterns
482482

483-
A _full pattern_ consists of an optional deduced parameter list, a pattern, and
484-
an optional return type expression.
485-
486483
A pack expansion pattern has _fixed arity_ if it contains at least one usage of
487-
an each-name that is not a parameter of the enclosing full pattern. Otherwise it
488-
has _deduced arity_. A tuple pattern can have at most one segment with deduced
489-
arity. For example:
484+
an each-name that is not a parameter of the enclosing
485+
[full pattern](pattern_matching.md). Otherwise it has _deduced arity_. A tuple
486+
pattern can have at most one segment with deduced arity. For example:
490487

491488
```carbon
492489
class C(... each T:! type) {

toolchain/docs/check/README.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ production of SemIR. It also does any validation that requires context.
5353
Some particular topics have their own documentation:
5454

5555
- [Associated constants](associated_constant.md)
56+
- [Pattern matching](pattern_matching.md)
5657

5758
## Postorder processing
5859

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
1+
# Pattern matching
2+
3+
<!--
4+
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
5+
Exceptions. See /LICENSE for license information.
6+
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
7+
-->
8+
9+
<!-- toc -->
10+
11+
## Table of contents
12+
13+
- [Overview](#overview)
14+
- [Pattern instructions](#pattern-instructions)
15+
- [Instruction ordering](#instruction-ordering)
16+
- [Parser-driven pattern block pushing](#parser-driven-pattern-block-pushing)
17+
- [Function parameters](#function-parameters)
18+
- [`Call` parameters and arguments](#call-parameters-and-arguments)
19+
- [Caller and callee matching](#caller-and-callee-matching)
20+
- [The return slot](#the-return-slot)
21+
22+
<!-- tocstop -->
23+
24+
## Overview
25+
26+
This document focuses on the implementation of pattern matching. See
27+
[the design document](/docs/design/pattern_matching.md) for more on the design and fundamental
28+
concepts.
29+
30+
The SemIR for a pattern-matching operation is emitted in three steps:
31+
32+
1. **Pattern:** Traverse the parse tree of the pattern to emit SemIR that
33+
abstractly describes the pattern.
34+
2. **Scrutinee:** Traverse the parse tree of the scrutinee expression to emit
35+
SemIR that evaluates it.
36+
3. **Match:** Traverse the pattern SemIR from step 1 (sometimes in conjunction
37+
with the scrutinee SemIR) to emit SemIR that actually performs pattern
38+
matching.
39+
40+
## Pattern instructions
41+
42+
The SemIR emitted in the pattern step primarily consists of _pattern
43+
instructions_, which are instructions that describe the pattern itself. For
44+
example, given the pattern `(x: i32, y: i32)`, the pattern step might emit the
45+
following SemIR:
46+
47+
```
%x.patt: %pattern_type.7ce = binding_pattern x [concrete]
%y.patt: %pattern_type.7ce = binding_pattern y [concrete]
%.loc4_21: %pattern_type.511 = tuple_pattern (%x.patt, %y.patt) [concrete]
```
52+
53+
Pattern instructions do not represent executable code, and are generally ignored
54+
during lowering. Instead, they descriptively represent the pattern itself as a
55+
kind of constant value, and their primary consumer is the match step. The type
56+
of a pattern instruction is a _pattern type_, which is represented by a
57+
`PatternType` instruction. For example, the `constants` block might define the
58+
types in the above SemIR like so:
59+
60+
```
%i32: type = class_type @Int, @Int(%int_32) [concrete]
%pattern_type.7ce: type = pattern_type %i32 [concrete]
%tuple.type: type = tuple_type (%i32, %i32) [concrete]
%pattern_type.511: type = pattern_type %tuple.type [concrete]
```
66+
67+
We can read this as saying that the type of `%x.patt` and `%y.patt` is "pattern
68+
that matches an `i32` scrutinee", and the type of `%.loc4_21` is "pattern that
69+
matches a `(i32, i32)` scrutinee".
70+
71+
Pattern instructions are only emitted during the pattern step, but that step can
72+
emit non-pattern instructions as well. For example, in a pattern like
73+
`(x: i32, a + b)`, `i32` and `a + b` are ordinary expressions, and so their
74+
SemIR must be emitted during the initial traversal of the parse tree, as with
75+
any other expression.
76+
77+
All the pattern instructions for a given full-pattern are grouped together in a
78+
distinct block that contains only pattern instructions. Consequently,
79+
`Check::Context` maintains `pattern_block_stack` as a separate `InstBlockStack`
80+
for pattern blocks, and provides separate methods like `AddPatternInst` for
81+
adding instructions to it.
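
As a rough illustration of keeping pattern instructions in their own block, here is a minimal C++ sketch. It is not the toolchain's code: `ToyBlock`, `ToyContext`, and `EmitToyPattern` are hypothetical, instructions are modeled as plain strings, and the real `InstBlockStack` and `AddPatternInst` have different interfaces.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an instruction block (assumption: strings as instructions).
using ToyBlock = std::vector<std::string>;

// Hypothetical context with two independent block stacks.
struct ToyContext {
  std::vector<ToyBlock> inst_block_stack;     // Ordinary instructions.
  std::vector<ToyBlock> pattern_block_stack;  // Pattern instructions only.

  void AddInst(const std::string& inst) {
    assert(!inst_block_stack.empty());
    inst_block_stack.back().push_back(inst);
  }
  void AddPatternInst(const std::string& inst) {
    assert(!pattern_block_stack.empty());
    pattern_block_stack.back().push_back(inst);
  }
};

// Pattern step for `(x: i32, y: i32)`: type expressions and pattern
// instructions are produced by one traversal but land in different blocks.
ToyContext EmitToyPattern() {
  ToyContext context;
  context.inst_block_stack.emplace_back();
  context.pattern_block_stack.emplace_back();
  context.AddInst("type expression i32");
  context.AddPatternInst("binding_pattern x");
  context.AddInst("type expression i32");
  context.AddPatternInst("binding_pattern y");
  context.AddPatternInst("tuple_pattern");
  return context;
}
```

In this sketch the pattern block ends up holding only the three pattern instructions, while the two type expressions stay in the ordinary block.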
82+
83+
## Instruction ordering
84+
85+
The SemIR produced in the first two steps is (like most SemIR) generally in
86+
post-order, reflecting the order of the parse tree. However, the match step
87+
traversal is performed pre-order, starting with the root instruction of the
88+
pattern and traversing into its dependencies.
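
To make the ordering difference concrete, here is a minimal C++ sketch. It is not taken from the toolchain: `ToyPattern`, `EmitPostOrder`, `MatchPreOrder`, and `MakeTuplePattern` are hypothetical names, and the only assumption is that a pattern forms a tree whose root depends on its subpatterns.

```cpp
#include <string>
#include <vector>

// Hypothetical toy pattern node; not a real toolchain type.
struct ToyPattern {
  std::string name;
  std::vector<ToyPattern> children;
};

// Pattern step: subpatterns are emitted before the node that uses them.
void EmitPostOrder(const ToyPattern& p, std::vector<std::string>& out) {
  for (const ToyPattern& child : p.children) {
    EmitPostOrder(child, out);
  }
  out.push_back(p.name);
}

// Match step: start at the root and walk into its dependencies.
void MatchPreOrder(const ToyPattern& p, std::vector<std::string>& out) {
  out.push_back(p.name);
  for (const ToyPattern& child : p.children) {
    MatchPreOrder(child, out);
  }
}

// Builds a toy tree for the pattern `(x: i32, y: i32)`.
ToyPattern MakeTuplePattern() {
  return ToyPattern{"tuple_pattern",
                    {ToyPattern{"x.patt", {}}, ToyPattern{"y.patt", {}}}};
}
```

For the toy tuple pattern, the post-order walk visits `x.patt`, `y.patt`, then `tuple_pattern`, whereas the pre-order walk visits `tuple_pattern` first.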
89+
90+
In some cases it is necessary for the pattern step to allocate instructions that
91+
won't actually be emitted until the match step, because they are responsible for
92+
performing pattern matching. When that happens, they are allocated but not added
93+
to a block, and their IDs are stored in the `Check::Context` so that they can be
94+
spliced into the current block at the appropriate point in the match step.
95+
96+
Currently this happens in two cases, which are handled using two maps in
97+
`Check::Context` from pattern instruction IDs to the corresponding match
98+
instruction IDs:
99+
100+
- A name binding can be used within the same pattern that declares it:
101+
```carbon
match (x) {
  case (n: i32, n) => { ... }
}
```
105+
For this to work, the name `n` needs to be added to the scope as soon as we
106+
handle its declaration, and it needs to resolve to the `BindName`
107+
instruction that binds a value to that name. This means that the `BindName`
108+
instruction needs to be allocated during the pattern step, even though it is
109+
part of matching, not part of the pattern. `Context::bind_name_map` stores
110+
these `BindName`s, keyed by the corresponding `BindingPattern` instruction.
111+
- A `var` pattern allocates storage during matching, which is represented by a
112+
`VarStorage` instruction. This instruction must be allocated during the
113+
pattern step, so that it can be used as the output parameter of scrutinee
114+
expression evaluation during the scrutinee step. `Context::var_storage_map`
115+
stores these `VarStorage` instructions, keyed by the corresponding
116+
`VarPattern` instruction.
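
The allocate-now, splice-later idea behind these maps can be sketched in C++ as follows. This is a toy model under stated assumptions: `InstId`, `ToyContext`, `Allocate`, `HandleBinding`, and `MatchBinding` are hypothetical, and the `bind_name_map` member only mirrors the name of the real map, not its type.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using InstId = int;  // Toy instruction ID (assumption).

struct ToyContext {
  std::vector<std::string> insts;          // Every allocated instruction.
  std::vector<InstId> pattern_block;       // Emitted in the pattern step.
  std::vector<InstId> match_block;         // Emitted in the match step.
  std::map<InstId, InstId> bind_name_map;  // Binding pattern -> bind name.

  // Allocates an instruction without adding it to any block.
  InstId Allocate(const std::string& inst) {
    insts.push_back(inst);
    return static_cast<InstId>(insts.size()) - 1;
  }
};

// Pattern step: emit the binding pattern, and allocate (but do not emit) the
// instruction that will bind the name, so name lookup can already refer to it.
InstId HandleBinding(ToyContext& context, const std::string& name) {
  InstId pattern_id = context.Allocate("binding_pattern " + name);
  context.pattern_block.push_back(pattern_id);
  context.bind_name_map[pattern_id] = context.Allocate("bind_name " + name);
  return pattern_id;
}

// Match step: splice the pre-allocated instruction into the current block.
void MatchBinding(ToyContext& context, InstId pattern_id) {
  auto it = context.bind_name_map.find(pattern_id);
  assert(it != context.bind_name_map.end());
  context.match_block.push_back(it->second);
}
```

The same shape would apply to a map from `VarPattern` to `VarStorage`; only the point at which the stored instruction is consumed differs.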
117+
118+
As noted earlier, the pattern step can also emit non-pattern instructions to
119+
evaluate expressions that are embedded in the pattern, such as the type
120+
expressions of binding patterns, and expressions that are used as patterns
121+
themselves (although those have not been implemented yet). The parse tree
122+
doesn't mark these situations in advance: any given subpattern might turn out to
123+
be one that emits non-pattern instructions. To handle these situations, we
124+
speculatively push an instruction block onto the (non-pattern) stack whenever we
125+
are about to begin handling a subpattern, and then pop it at the end of the
126+
subpattern, with different treatment depending on whether the subpattern turned
127+
out to be a subexpression. This is handled by `BeginSubpattern`,
128+
`EndSubpatternAsExpr`, and `EndSubpatternAsNonExpr`.
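
The speculative push and pop can be sketched in C++ as below. These are toy reimplementations that only borrow the function names; the signatures, the string-based `ToyBlock`, and the specific treatment in each `End...` function are assumptions of this sketch, not a description of the real code.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using ToyBlock = std::vector<std::string>;  // Strings stand in for instructions.

struct ToyContext {
  std::vector<ToyBlock> inst_block_stack;  // The non-pattern stack.
};

// Speculatively push a block before handling any subpattern.
void BeginSubpattern(ToyContext& context) {
  context.inst_block_stack.emplace_back();
}

// The subpattern turned out to be an expression: hand its instructions back
// to the caller. (Assumed treatment for this sketch.)
ToyBlock EndSubpatternAsExpr(ToyContext& context) {
  assert(!context.inst_block_stack.empty());
  ToyBlock block = std::move(context.inst_block_stack.back());
  context.inst_block_stack.pop_back();
  return block;
}

// The subpattern was not an expression: discard the speculative block.
// (Assumed treatment: this sketch requires the block to be empty.)
void EndSubpatternAsNonExpr(ToyContext& context) {
  assert(!context.inst_block_stack.empty());
  assert(context.inst_block_stack.back().empty());
  context.inst_block_stack.pop_back();
}
```

Every `BeginSubpattern` is paired with exactly one of the two `End...` calls, so the stack depth is unchanged once a subpattern has been handled.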
129+
130+
One further complication here is that the type expression can contain control
131+
flow (such as an `if` expression). Consequently, we can't represent the type
132+
expression SemIR as a single block; instead, we represent the SemIR for a given
133+
type expression as a
134+
[single-entry, single-exit (SE/SE) region](https://en.wikipedia.org/wiki/Single-entry_single-exit),
135+
potentially consisting of multiple blocks.
136+
137+
> **Note:** The original motivation for rigorously excluding non-pattern
138+
> instructions from the pattern block may no longer apply. In particular, it may
139+
> make sense to put non-pattern instructions in the pattern block when they
140+
> represent an expression that is part of the pattern. If so, substantial parts
141+
> of this design might change. See
142+
> [issue #5351](https://github.com/carbon-language/carbon-lang/issues/5351).
143+
144+
## Parser-driven pattern block pushing
145+
146+
Alongside the instruction-block handling above, we have to manage the _pattern_ block stack as
147+
well. We attempt to do this precisely rather than speculatively, by leveraging
148+
the parser to mark the nodes immediately before full-patterns, and
149+
pushing the pattern block stack when we handle those nodes. We then rely on
150+
signals from both the parser and the node stack to determine when to pop from
151+
the pattern block stack.
152+
153+
In the case of `let` and `var` decls, this is fairly straightforward: the
154+
beginning is marked by the `LetIntroducer` or `VarIntroducer` node, and the end
155+
is marked by the `LetInitializer` or `VarInitializer`, or by the `VarDecl` in
156+
the case of a `var` decl with no initializer. Similarly, the beginning of an
157+
`impl forall` parameter list is marked by the `Forall` node, and the end is
158+
marked by the `ImplDecl` or `ImplDefinitionStart`.
159+
160+
The case of a parameterized name (such as `Bar(y: i32)`) is more challenging.
161+
The node immediately before the start of the full-pattern is an identifier, but
162+
an identifier doesn't necessarily mark the start of a full-pattern. We've solved
163+
that by having the parser mark identifier nodes that are followed by
164+
full-patterns (using lookahead). Rather than use additional storage for what is
165+
logically a single bit of data, we effectively smuggle that bit into the kind
166+
enum by having separate node kinds `IdentifierNameBeforeParams` and
167+
`IdentifierNameNotBeforeParams`.
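
The idea of encoding the lookahead result in the node kind can be sketched in C++ like this. `ToyNodeKind`, `ClassifyIdentifier`, and `ShouldPushPatternBlock` are hypothetical, the token list is a toy, and treating `(` as the only marker of a parameter list is a simplification made for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy node kinds; the extra bit lives in the enum, not in separate storage.
enum class ToyNodeKind {
  IdentifierNameBeforeParams,
  IdentifierNameNotBeforeParams,
};

// Parser side: one token of lookahead picks the node kind.
ToyNodeKind ClassifyIdentifier(const std::vector<std::string>& tokens,
                               std::size_t identifier_index) {
  std::size_t next = identifier_index + 1;
  bool before_params = next < tokens.size() && tokens[next] == "(";
  return before_params ? ToyNodeKind::IdentifierNameBeforeParams
                       : ToyNodeKind::IdentifierNameNotBeforeParams;
}

// Check side: the node kind alone says whether to push a pattern block.
bool ShouldPushPatternBlock(ToyNodeKind kind) {
  return kind == ToyNodeKind::IdentifierNameBeforeParams;
}
```

With this split, the check phase never needs to look ahead itself; it only dispatches on the node kind.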
168+
169+
If the parameterized name is a name qualifier (such as the first part of
170+
`Foo(X:! i32).Bar(y: i32)`), the node immediately after it will be the qualifier
171+
node. As of this writing, we bifurcate qualifier nodes into
172+
`NameQualifierWithParams` and `NameQualifierWithoutParams`, much like we do with
173+
identifier names, but we don't actually use that information, and instead use
174+
the presence of parameters on the node stack to determine whether to pop the
175+
pattern block stack.
176+
177+
> **Open question:** should we re-combine the two qualifier node kinds?
178+
179+
If the parameterized name is not part of a name qualifier, the node immediately
180+
after it will be a `*Decl` or `*DefinitionStart` node of the appropriate kind
181+
(for example `FunctionDecl` or `FunctionDefinitionStart` if the introducer was
182+
`fn`). Note that this means the pattern block is still on the stack while
183+
handling the return type of a function. This is intentional, because we model
184+
the return type as declaring an output parameter (see below), which makes it
185+
functionally part of the parameter pattern.
186+
187+
## Function parameters
188+
189+
### `Call` parameters and arguments
190+
191+
SemIR models a function call as a `Call` instruction, which has an instruction
192+
block consisting of one instruction per argument. Correspondingly, the SemIR
193+
representation of a function has a block consisting of one instruction per
194+
parameter. We refer to these as _`Call` arguments_ and _`Call` parameters_,
195+
because they don't necessarily correspond to the colloquial meaning of
196+
"arguments" and "parameters" (which are sometimes referred to as _syntactic_
197+
arguments and parameters).
198+
199+
For example, consider this function:
200+
201+
```carbon
fn F(T:! type, U:! type) -> Core.String;
```
204+
205+
The `Call` instruction is a runtime-phase operation, so it notionally runs after
206+
compile-time parameters have already been bound to values. As a result, a `Call`
207+
instruction calling `F` does not pass values for either `T` or `U`. On the other
208+
hand, it does pass a reference to the storage that `F` should construct the
209+
return value in. So although we would colloquially say that `F` takes two
210+
parameters of type `type`, it has a single `Call` parameter of type
211+
`Core.String`.
212+
213+
If Carbon supports general patterns in function parameter lists, that introduces
214+
additional ways that `Call` parameters can diverge from the colloquial meaning.
215+
For example:
216+
217+
```carbon
fn G(x: i32, var (y: i32, z: i32));
fn H(x: i32, (y: i32, var z: i32));
```
221+
222+
A `var` pattern converts the scrutinee to a durable reference expression, and
223+
then performs further pattern matching on the object it refers to. As a result,
224+
`G` has two `Call` parameters: a value corresponding to `x`, and a reference to
225+
an object of type `(i32, i32)`, corresponding to both `y` and `z`. On the other
226+
hand, `H` has three `Call` parameters: values corresponding to `x` and `y`, and a
227+
reference corresponding to `z`.
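
The counting rule described for `F`, `G`, and `H` can be sketched in C++ as follows. This is a toy model, not SemIR: `ToyPattern`, `CollectCallParams`, and `CallParams` are hypothetical, and the rules encoded here are only the ones stated above (compile-time bindings contribute nothing, a `var` pattern contributes one reference for its whole subtree, and a declared return type adds an output parameter).

```cpp
#include <string>
#include <vector>

// Hypothetical toy model of a parameter pattern; not a SemIR type.
struct ToyPattern {
  enum class Kind { RuntimeBinding, CompileTimeBinding, Var, Tuple };
  Kind kind;
  std::vector<ToyPattern> children;
};

// Appends a description of each `Call` parameter implied by `pattern`.
void CollectCallParams(const ToyPattern& pattern,
                       std::vector<std::string>& params) {
  switch (pattern.kind) {
    case ToyPattern::Kind::RuntimeBinding:
      params.push_back("value");
      break;
    case ToyPattern::Kind::CompileTimeBinding:
      break;  // Bound before the call runs, so no `Call` parameter.
    case ToyPattern::Kind::Var:
      params.push_back("reference");  // One object covers the whole subtree.
      break;
    case ToyPattern::Kind::Tuple:
      for (const ToyPattern& child : pattern.children) {
        CollectCallParams(child, params);
      }
      break;
  }
}

// Lists the `Call` parameters for a parameter list and optional return type.
std::vector<std::string> CallParams(const std::vector<ToyPattern>& param_list,
                                    bool has_return_type) {
  std::vector<std::string> params;
  for (const ToyPattern& param : param_list) {
    CollectCallParams(param, params);
  }
  if (has_return_type) {
    params.push_back("output");
  }
  return params;
}
```

Modeling `F` as two compile-time bindings with a return type gives one `Call` parameter, `G` gives two, and `H` gives three, matching the counts in the text.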
228+
229+
### Caller and callee matching
230+
231+
The `Call` parameters define the API boundary between the caller and callee at
232+
the SemIR level. As a result, responsibility for matching the arguments against
233+
the parameter list is split between the caller and the callee. Continuing the
234+
example from above, given the call `G(0, (x, y))`, the caller is responsible for
235+
converting `0` to `i32`, and for initializing a new `(i32, i32)` object from
236+
`(x, y)`, but the callee is responsible for binding the name `x` to its first
237+
`Call` parameter, and for destructuring its second `Call` parameter and binding
238+
the names `y` and `z` to its elements.
239+
240+
In SemIR we represent this situation with special `ParamPattern` instructions,
241+
which mark the boundary: there is exactly one `ParamPattern` instruction for
242+
each `Call` parameter, which matches the entire corresponding `Call` argument.
243+
The subpatterns of the `ParamPattern`s are matched on the callee side, and
244+
everything above them is matched on the caller side. There are multiple kinds of
245+
`ParamPattern` instruction, which correspond to different ways of passing a
246+
parameter (such as by reference or by value).
247+
248+
When performing callee-side pattern matching, we do not have an actual scrutinee
249+
expression. Instead, for each `ParamPattern` instruction we generate a
250+
corresponding `Param` instruction, which reads from the corresponding entry in
251+
the `Call` argument list, and we use that as the scrutinee of the
252+
`ParamPattern`. Every `ParamPattern` kind has a corresponding `Param` kind.
253+
254+
### The return slot
255+
256+
If a function has a declared return type, the function takes an additional
257+
`Call` parameter, which points to the storage that should be initialized with
258+
the return value. This `Call` parameter is represented as an `OutParamPattern`
259+
instruction with a `ReturnSlotPattern` instruction as a subpattern. The
260+
`ReturnSlotPattern` also represents the return type declaration itself, such as
261+
in `FunctionFields`. The SemIR that matches these patterns consists of a
262+
`ReturnSlot` instruction, which binds the special name `NameId::ReturnSlot` to
263+
the `OutParam` instruction representing the storage passed by the caller.
264+
265+
This structure is analogous to the handling of an ordinary by-value parameter,
266+
which is represented in the `Call` parameters as a `ValueParamPattern`
267+
instruction with a `BindingPattern`, and in the pattern-matching SemIR as a
268+
`BindName` instruction that binds the parameter name to the `ValueParam`
269+
instruction representing the argument passed by the caller.
270+
271+
Note that if the return type does not have an in-place value representation
272+
(meaning that the return value should not be passed in memory), these
273+
instructions will all still be generated, but the SemIR for `return` statements
274+
will not access the `ReturnSlot`, and the `Call` argument list will not contain
275+
an argument corresponding to the `OutParamPattern` (and so it will be one
276+
element shorter than the `Call` parameter list). However, the
277+
`ReturnSlotPattern` is still used, in its other role as a representation of the
278+
return type declaration. This leads to a potentially confusing situation, where
279+
the term "return slot" sometimes refers to the `ReturnSlotPattern` (for example
280+
in `FunctionFields::return_slot_pattern`), which is present for any function
281+
with a declared return type, and sometimes refers to the actual storage provided
282+
by the caller (for example in `ReturnTypeInfo::has_return_slot`), which is
283+
present only if the return type has an in-place value representation.
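
The two meanings of "return slot" can be separated in a small C++ sketch. `ToySignature` and the four helper functions are hypothetical names introduced for illustration; they only restate the counting described above and do not correspond to `FunctionFields` or `ReturnTypeInfo` members.

```cpp
#include <cstddef>

// Hypothetical summary of a function signature for this sketch.
struct ToySignature {
  std::size_t runtime_param_count;  // `Call` parameters other than the output.
  bool has_declared_return_type;
  bool return_is_in_place;
};

// A return slot *pattern* exists for any declared return type.
bool HasReturnSlotPattern(const ToySignature& sig) {
  return sig.has_declared_return_type;
}

// Return slot *storage* is passed only when the return value is in place.
bool HasReturnSlotStorage(const ToySignature& sig) {
  return sig.has_declared_return_type && sig.return_is_in_place;
}

// Length of the `Call` parameter list.
std::size_t CallParamCount(const ToySignature& sig) {
  return sig.runtime_param_count + (HasReturnSlotPattern(sig) ? 1 : 0);
}

// Length of the `Call` argument list.
std::size_t CallArgCount(const ToySignature& sig) {
  return sig.runtime_param_count + (HasReturnSlotStorage(sig) ? 1 : 0);
}
```

For a function with a declared return type that is not in place, this sketch reports one more `Call` parameter than `Call` arguments, which is the mismatch the text describes.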
284+
285+
> **TODO:** When the return type isn't in-place, the `OutParamPattern` should
286+
> probably not be in the `Call` parameter list (for consistency with the `Call`
287+
> argument list), and possibly the `OutParamPattern`, `OutParam`, and
288+
> `ReturnSlot` instructions should not be emitted in the first place.
289+
> Furthermore, we should find a way to resolve the inconsistent "return slot"
290+
> terminology.
