| 1 | +# Pattern matching |
| 2 | + |
| 3 | +<!-- |
| 4 | +Part of the Carbon Language project, under the Apache License v2.0 with LLVM |
| 5 | +Exceptions. See /LICENSE for license information. |
| 6 | +SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception |
| 7 | +--> |
| 8 | + |
| 9 | +<!-- toc --> |
| 10 | + |
| 11 | +## Table of contents |
| 12 | + |
| 13 | +- [Overview](#overview) |
| 14 | +- [Pattern instructions](#pattern-instructions) |
| 15 | +- [Instruction ordering](#instruction-ordering) |
| 16 | +- [Parser-driven pattern block pushing](#parser-driven-pattern-block-pushing) |
| 17 | +- [Function parameters](#function-parameters) |
| 18 | + - [`Call` parameters and arguments](#call-parameters-and-arguments) |
| 19 | + - [Caller and callee matching](#caller-and-callee-matching) |
| 20 | + - [The return slot](#the-return-slot) |
| 21 | + |
| 22 | +<!-- tocstop --> |
| 23 | + |
| 24 | +## Overview |
| 25 | + |
| 26 | +This document focuses on the implementation of pattern matching. See |
| 27 | +[here](/docs/design/pattern_matching.md) for more on the design and fundamental |
| 28 | +concepts. |
| 29 | + |
| 30 | +The SemIR for a pattern-matching operation is emitted in three steps: |
| 31 | + |
| 32 | +1. **Pattern:** Traverse the parse tree of the pattern to emit SemIR that |
| 33 | + abstractly describes the pattern. |
| 34 | +2. **Scrutinee:** Traverse the parse tree of the scrutinee expression to emit |
| 35 | + SemIR that evaluates it. |
| 36 | +3. **Match:** Traverse the pattern SemIR from step 1 (sometimes in conjunction |
| 37 | + with the scrutinee SemIR) to emit SemIR that actually performs pattern |
| 38 | + matching. |
| 39 | + |
| 40 | +## Pattern instructions |
| 41 | + |
| 42 | +The SemIR emitted in the pattern step primarily consists of _pattern |
| 43 | +instructions_, which are instructions that describe the pattern itself. For |
| 44 | +example, given the pattern `(x: i32, y: i32)`, the pattern step might emit the |
| 45 | +following SemIR: |
| 46 | + |
| 47 | +``` |
| 48 | +%x.patt: %pattern_type.7ce = binding_pattern x [concrete] |
| 49 | +%y.patt: %pattern_type.7ce = binding_pattern y [concrete] |
| 50 | +%.loc4_21: %pattern_type.511 = tuple_pattern (%x.patt, %y.patt) [concrete] |
| 51 | +``` |
| 52 | + |
| 53 | +Pattern instructions do not represent executable code, and are generally ignored |
| 54 | +during lowering. Instead, they descriptively represent the pattern itself as a |
| 55 | +kind of constant value, and their primary consumer is the match step. The type |
| 56 | +of a pattern instruction is a _pattern type_, which is represented by a |
| 57 | +`PatternType` instruction. For example, the `constants` block might define the |
| 58 | +types in the above SemIR like so: |
| 59 | + |
| 60 | +``` |
| 61 | +%i32: type = class_type @Int, @Int(%int_32) [concrete] |
| 62 | +%pattern_type.7ce: type = pattern_type %i32 [concrete] |
| 63 | +%tuple.type: type = tuple_type (%i32, %i32) [concrete] |
| 64 | +%pattern_type.511: type = pattern_type %tuple.type [concrete] |
| 65 | +``` |
| 66 | + |
| 67 | +We can read this as saying that the type of `%x.patt` and `%y.patt` is "pattern |
| 68 | +that matches an `i32` scrutinee", and the type of `%.loc4_21` is "pattern that |
| 69 | +matches a `(i32, i32)` scrutinee". |
| 70 | + |
| 71 | +Pattern instructions are only emitted during the pattern step, but that step can |
| 72 | +emit non-pattern instructions as well. For example, in a pattern like |
| 73 | +`(x: i32, a + b)`, `i32` and `a + b` are ordinary expressions, and so their |
| 74 | +SemIR must be emitted during the initial traversal of the parse tree, as with |
| 75 | +any other expression. |
| 76 | + |
| 77 | +All the pattern instructions for a given full-pattern are grouped together in a |
| 78 | +distinct block that contains only pattern instructions. Consequently, |
| 79 | +`Check::Context` maintains `pattern_block_stack` as a separate `InstBlockStack` |
| 80 | +for pattern blocks, and provides separate methods like `AddPatternInst` for |
| 81 | +adding instructions to it. |
| 82 | + |
| 83 | +## Instruction ordering |
| 84 | + |
| 85 | +The SemIR produced in the first two steps is (like most SemIR) generally in |
| 86 | +post-order, reflecting the order of the parse tree. However, the match step |
| 87 | +traversal is performed pre-order, starting with the root instruction of the |
| 88 | +pattern and traversing into its dependencies. |
| 89 | + |
| 90 | +In some cases it is necessary for the pattern step to allocate instructions that |
| 91 | +won't actually be emitted until the match step, because they are responsible for |
| 92 | +performing pattern matching. When that happens, they are allocated but not added |
| 93 | +to a block, and their IDs are stored in the `Check::Context` so that they can be |
| 94 | +spliced into the current block at the appropriate point in the match step. |
| 95 | + |
| 96 | +Currently this happens in two cases, which are handled using two maps in |
| 97 | +`Check::Context` from pattern instruction IDs to the corresponding match |
| 98 | +instruction IDs: |
| 99 | + |
| 100 | +- A name binding can be used within the same pattern that declares it: |
| 101 | + ```carbon |
| 102 | + match (x) { |
| 103 | + case (n: i32, n) => ... |
| 104 | + ``` |
| 105 | + For this to work, the name `n` needs to be added to the scope as soon as we |
| 106 | + handle its declaration, and it needs to resolve to the `BindName` |
| 107 | + instruction that binds a value to that name. This means that the `BindName` |
| 108 | + instruction needs to be allocated during the pattern step, even though it is |
| 109 | + part of matching, not part of the pattern. `Context::bind_name_map` stores |
| 110 | + these `BindName`s, keyed by the corresponding `BindingPattern` instruction. |
| 111 | +- A `var` pattern allocates storage during matching, which is represented by a |
| 112 | + `VarStorage` instruction. This instruction must be allocated during the |
| 113 | + pattern step, so that it can be used as the output parameter of scrutinee |
| 114 | + expression evaluation during the scrutinee step. `Context::var_storage_map` |
| 115 | + stores these `VarStorage` instructions, keyed by the corresponding |
| 116 | + `VarPattern` instruction. |
| 117 | + |
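The bookkeeping described in the two cases above can be sketched with a toy model. Everything in it (`ToyContext`, `Allocate`, `HandleBindingPattern`, `MatchBindingPattern`, and the string-based instructions) is hypothetical; it only illustrates the idea of allocating an instruction during the pattern step and splicing it into a block during the match step, and is not the toolchain's actual `Check::Context` API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy model only: hypothetical types, much simpler than real SemIR.
using InstId = int;

struct ToyContext {
  std::vector<std::string> insts;     // Every allocated instruction.
  std::vector<InstId> pattern_block;  // Pattern instructions only.
  std::vector<InstId> match_block;    // Instructions emitted by matching.
  // Maps a binding pattern to the bind-name instruction allocated for it.
  std::unordered_map<InstId, InstId> bind_name_map;

  // Allocates an instruction without adding it to any block.
  auto Allocate(std::string text) -> InstId {
    insts.push_back(std::move(text));
    return static_cast<InstId>(insts.size()) - 1;
  }
};

// Pattern step: emit the binding pattern and allocate (but do not emit) the
// instruction that will later bind the name.
auto HandleBindingPattern(ToyContext& ctx, const std::string& name) -> InstId {
  InstId pattern = ctx.Allocate("binding_pattern " + name);
  ctx.pattern_block.push_back(pattern);
  ctx.bind_name_map[pattern] = ctx.Allocate("bind_name " + name);
  return pattern;
}

// Match step: splice the previously allocated instruction into the block.
auto MatchBindingPattern(ToyContext& ctx, InstId pattern) -> InstId {
  InstId bind_name = ctx.bind_name_map.at(pattern);
  ctx.match_block.push_back(bind_name);
  return bind_name;
}
```

In this model, a later use of `n` inside the same pattern could already refer to the ID stored in `bind_name_map`, even though that instruction only joins `match_block` once `MatchBindingPattern` runs.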
| 118 | +As noted earlier, the pattern step can also emit non-pattern instructions to |
| 119 | +evaluate expressions that are embedded in the pattern, such as the type |
| 120 | +expressions of binding patterns, and expressions that are used as patterns |
| 121 | +themselves (although those have not been implemented yet). The parse tree |
| 122 | +doesn't mark these situations in advance: any given subpattern might turn out to |
| 123 | +be one that emits non-pattern instructions. To handle these situations, we |
| 124 | +speculatively push an instruction block onto the (non-pattern) stack whenever we |
| 125 | +are about to begin handling a subpattern, and then pop it at the end of the |
| 126 | +subpattern, with different treatment depending on whether the subpattern turned |
| 127 | +out to be a subexpression. This is handled by `BeginSubpattern`, |
| 128 | +`EndSubpatternAsExpr`, and `EndSubpatternAsNonExpr`. |
| 129 | + |
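A minimal sketch of this speculative push and pop, under stated assumptions: the method names mirror the functions named above, but the bodies, the `ToyBlockStack` type, and the string-based instructions are hypothetical. In particular, the text only says the two endings are treated differently; dropping the block in the non-expression case is an assumption of this sketch, and using a single block per subpattern is a simplification.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy model only: a block is a list of instruction strings.
using Block = std::vector<std::string>;

struct ToyBlockStack {
  std::vector<Block> blocks;

  // Speculatively pushes a fresh block before handling a subpattern.
  auto BeginSubpattern() -> void { blocks.emplace_back(); }

  // Emits a non-pattern instruction into the innermost block.
  auto AddInst(std::string inst) -> void {
    assert(!blocks.empty());
    blocks.back().push_back(std::move(inst));
  }

  // The subpattern was an expression: pop the block and keep its contents.
  auto EndSubpatternAsExpr() -> Block {
    assert(!blocks.empty());
    Block result = std::move(blocks.back());
    blocks.pop_back();
    return result;
  }

  // The subpattern was not an expression. This sketch assumes nothing was
  // emitted in that case, so the speculative block is simply dropped.
  auto EndSubpatternAsNonExpr() -> void {
    assert(!blocks.empty() && blocks.back().empty());
    blocks.pop_back();
  }
};
```

The point of the design is that the caller does not need to know in advance which ending applies: it always pushes, and decides how to pop once the subpattern has been handled.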
| 130 | +One further complication here is that the type expression can contain control |
| 131 | +flow (such as an `if` expression). Consequently, we can't represent the type |
| 132 | +expression SemIR as a single block; instead, we represent the SemIR for a given |
| 133 | +type expression as a |
| 134 | +[single-entry, single-exit (SE/SE) region](https://en.wikipedia.org/wiki/Single-entry_single-exit), |
| 135 | +potentially consisting of multiple blocks. |
| 136 | + |
| 137 | +> **Note:** The original motivation for rigorously excluding non-pattern |
| 138 | +> instructions from the pattern block may no longer apply. In particular, it may |
| 139 | +> make sense to put non-pattern instructions in the pattern block when they |
| 140 | +> represent an expression that is part of the pattern. If so, substantial parts |
| 141 | +> of this design might change. See |
| 142 | +> [issue #5351](https://github.com/carbon-language/carbon-lang/issues/5351). |
| 143 | + |
| 144 | +## Parser-driven pattern block pushing |
| 145 | + |
| 146 | +At the same time as all of that, we have to manage the _pattern_ block stack as |
| 147 | +well. We attempt to do this precisely rather than speculatively, by having the |
| 148 | +parser mark the nodes immediately before full-patterns, and pushing onto the |
| 149 | +pattern block stack when we handle those nodes. We then rely on signals from |
| 150 | +both the parser and the node stack to determine when to pop from the pattern |
| 151 | +block stack. |
| 152 | + |
| 153 | +In the case of `let` and `var` decls, this is fairly straightforward: the |
| 154 | +beginning is marked by the `LetIntroducer` or `VarIntroducer` node, and the end |
| 155 | +is marked by the `LetInitializer` or `VarInitializer`, or by the `VarDecl` in |
| 156 | +the case of a `var` decl with no initializer. Similarly, the beginning of an |
| 157 | +`impl forall` parameter list is marked by the `Forall` node, and the end is |
| 158 | +marked by the `ImplDecl` or `ImplDefinitionStart`. |
| 159 | + |
| 160 | +The case of a parameterized name (such as `Bar(y: i32)`) is more challenging. |
| 161 | +The node immediately before the start of the full-pattern is an identifier, but |
| 162 | +an identifier doesn't necessarily mark the start of a full-pattern. We've solved |
| 163 | +that by having the parser mark identifier nodes that are followed by |
| 164 | +full-patterns (using lookahead). Rather than use additional storage for what is |
| 165 | +logically a single bit of data, we effectively smuggle that bit into the kind |
| 166 | +enum by having separate node kinds `IdentifierNameBeforeParams` and |
| 167 | +`IdentifierNameNotBeforeParams`. |
| 168 | + |
| 169 | +If the parameterized name is a name qualifier (such as the first part of |
| 170 | +`Foo(X:! i32).Bar(y: i32)`), the node immediately after it will be the qualifier |
| 171 | +node. As of this writing, we bifurcate qualifier nodes into |
| 172 | +`NameQualifierWithParams` and `NameQualifierWithoutParams`, much like we do with |
| 173 | +identifier names, but we don't actually use that information, and instead use |
| 174 | +the presence of parameters on the node stack to determine whether to pop the |
| 175 | +pattern block stack. |
| 176 | + |
| 177 | +> **Open question:** Should we re-combine the two qualifier node kinds? |
| 178 | + |
| 179 | +If the parameterized name is not part of a name qualifier, the node immediately |
| 180 | +after it will be a `*Decl` or `*DefinitionStart` node of the appropriate kind |
| 181 | +(for example `FunctionDecl` or `FunctionDefinitionStart` if the introducer was |
| 182 | +`fn`). Note that this means the pattern block is still on the stack while |
| 183 | +handling the return type of a function. This is intentional, because we model |
| 184 | +the return type as declaring an output parameter (see below), which makes it |
| 185 | +functionally part of the parameter pattern. |
| 186 | + |
| 187 | +## Function parameters |
| 188 | + |
| 189 | +### `Call` parameters and arguments |
| 190 | + |
| 191 | +SemIR models a function call as a `Call` instruction, which has an instruction |
| 192 | +block consisting of one instruction per argument. Correspondingly, the SemIR |
| 193 | +representation of a function has a block consisting of one instruction per |
| 194 | +parameter. We refer to these as _`Call` arguments_ and _`Call` parameters_, |
| 195 | +because they don't necessarily correspond to the colloquial meaning of |
| 196 | +"arguments" and "parameters" (which are sometimes referred to as _syntactic_ |
| 197 | +arguments and parameters). |
| 198 | + |
| 199 | +For example, consider this function: |
| 200 | + |
| 201 | +```carbon |
| 202 | +fn F(T:! type, U:! type) -> Core.String; |
| 203 | +``` |
| 204 | + |
| 205 | +The `Call` instruction is a runtime-phase operation, so it notionally runs after |
| 206 | +compile-time parameters have already been bound to values. As a result, a `Call` |
| 207 | +instruction calling `F` does not pass values for either `T` or `U`. On the other |
| 208 | +hand, it does pass a reference to the storage that `F` should construct the |
| 209 | +return value in. So although we would colloquially say that `F` takes two |
| 210 | +parameters of type `type`, it has a single `Call` parameter of type |
| 211 | +`Core.String`. |
| 212 | + |
| 213 | +If Carbon supports general patterns in function parameter lists, that introduces |
| 214 | +additional ways that `Call` parameters can diverge from the colloquial meaning. |
| 215 | +For example: |
| 216 | + |
| 217 | +```carbon |
| 218 | +fn G(x: i32, var (y: i32, z: i32)); |
| 219 | +fn H(x: i32, (y: i32, var z: i32)); |
| 220 | +``` |
| 221 | + |
| 222 | +A `var` pattern converts the scrutinee to a durable reference expression, and |
| 223 | +then performs further pattern matching on the object it refers to. As a result, |
| 224 | +`G` has two `Call` parameters: a value corresponding to `x`, and a reference to |
| 225 | +an object of type `(i32, i32)`, corresponding to both `y` and `z`. On the other |
| 226 | +hand, `H` has three `Call` parameters: values corresponding to `x` and `y`, and a |
| 227 | +reference corresponding to `z`. |
| 228 | + |
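The counting rule implied by the examples `F`, `G`, and `H` can be sketched as a small recursive function. The `Pattern` tree, its `Kind` values, and the helper functions below are hypothetical illustrations, not the toolchain's SemIR representation: a compile-time binding contributes no `Call` parameter, a runtime binding or a `var` pattern contributes one, a tuple pattern contributes the sum of its elements, and a declared return type adds one more.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical toy pattern tree; real pattern instructions carry much more.
struct Pattern {
  enum class Kind { RuntimeBinding, CompileTimeBinding, Tuple, Var };
  Kind kind;
  std::vector<Pattern> children;  // Tuple elements, or the `var` subpattern.
};

auto Bind() -> Pattern { return {Pattern::Kind::RuntimeBinding, {}}; }
auto CompileTimeBind() -> Pattern {
  return {Pattern::Kind::CompileTimeBinding, {}};
}
auto Tuple(std::vector<Pattern> elems) -> Pattern {
  return {Pattern::Kind::Tuple, std::move(elems)};
}
auto Var(Pattern sub) -> Pattern {
  return {Pattern::Kind::Var, {std::move(sub)}};
}

// Counts the `Call` parameters contributed by a pattern.
auto CountCallParams(const Pattern& pattern) -> int {
  switch (pattern.kind) {
    case Pattern::Kind::CompileTimeBinding:
      return 0;  // Bound before the call notionally runs.
    case Pattern::Kind::RuntimeBinding:
    case Pattern::Kind::Var:
      return 1;  // A `var` is passed as one reference, whatever it contains.
    case Pattern::Kind::Tuple: {
      int total = 0;
      for (const Pattern& child : pattern.children) {
        total += CountCallParams(child);
      }
      return total;
    }
  }
  return 0;
}

// Adds the return storage parameter for a declared return type.
auto CountFunctionCallParams(const Pattern& params, bool has_return_type)
    -> int {
  return CountCallParams(params) + (has_return_type ? 1 : 0);
}
```

Modeling the parameter list itself as a tuple pattern, `F` is `Tuple({CompileTimeBind(), CompileTimeBind()})` with a return type, `G` is `Tuple({Bind(), Var(Tuple({Bind(), Bind()}))})`, and `H` is `Tuple({Bind(), Tuple({Bind(), Var(Bind())})})`, which gives the counts of one, two, and three described above.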
| 229 | +### Caller and callee matching |
| 230 | + |
| 231 | +The `Call` parameters define the API boundary between the caller and callee at |
| 232 | +the SemIR level. As a result, responsibility for matching the arguments against |
| 233 | +the parameter list is split between the caller and the callee. Continuing the |
| 234 | +example from above, given the call `G(0, (x, y))`, the caller is responsible for |
| 235 | +converting `0` to `i32`, and for initializing a new `(i32, i32)` object from |
| 236 | +`(x, y)`, but the callee is responsible for binding the name `x` to its first |
| 237 | +`Call` parameter, and for destructuring its second `Call` parameter and binding |
| 238 | +the names `y` and `z` to its elements. |
| 239 | + |
| 240 | +In SemIR we represent this situation with special `ParamPattern` instructions, |
| 241 | +which mark the boundary: there is exactly one `ParamPattern` instruction for |
| 242 | +each `Call` parameter, which matches the entire corresponding `Call` argument. |
| 243 | +The subpatterns of the `ParamPattern`s are matched on the callee side, and |
| 244 | +everything above them is matched on the caller side. There are multiple kinds of |
| 245 | +`ParamPattern` instruction, which correspond to different ways of passing a |
| 246 | +parameter (such as by reference or by value). |
| 247 | + |
| 248 | +When performing callee-side pattern matching, we do not have an actual scrutinee |
| 249 | +expression. Instead, for each `ParamPattern` instruction we generate a |
| 250 | +corresponding `Param` instruction, which reads from the corresponding entry in |
| 251 | +the `Call` argument list, and we use that as the scrutinee of the |
| 252 | +`ParamPattern`. Every `ParamPattern` kind has a corresponding `Param` kind. |
| 253 | + |
| 254 | +### The return slot |
| 255 | + |
| 256 | +If a function has a declared return type, the function takes an additional |
| 257 | +`Call` parameter, which points to the storage that should be initialized with |
| 258 | +the return value. This `Call` parameter is represented as an `OutParamPattern` |
| 259 | +instruction with a `ReturnSlotPattern` instruction as a subpattern. The |
| 260 | +`ReturnSlotPattern` also represents the return type declaration itself, such as |
| 261 | +in `FunctionFields`. The SemIR that matches these patterns consists of a |
| 262 | +`ReturnSlot` instruction, which binds the special name `NameId::ReturnSlot` to |
| 263 | +the `OutParam` instruction representing the storage passed by the caller. |
| 264 | + |
| 265 | +This structure is analogous to the handling of an ordinary by-value parameter, |
| 266 | +which is represented in the `Call` parameters as a `ValueParamPattern` |
| 267 | +instruction with a `BindingPattern`, and in the pattern-matching SemIR as a |
| 268 | +`BindName` instruction that binds the parameter name to the `ValueParam` |
| 269 | +instruction representing the argument passed by the caller. |
| 270 | + |
| 271 | +Note that if the return type does not have an in-place value representation |
| 272 | +(meaning that the return value should not be passed in memory), these |
| 273 | +instructions will all still be generated, but the SemIR for `return` statements |
| 274 | +will not access the `ReturnSlot`, and the `Call` argument list will not contain |
| 275 | +an argument corresponding to the `OutParamPattern` (and so it will be one |
| 276 | +element shorter than the `Call` parameter list). However, the |
| 277 | +`ReturnSlotPattern` is still used, in its other role as a representation of the |
| 278 | +return type declaration. This leads to a potentially confusing situation, where |
| 279 | +the term "return slot" sometimes refers to the `ReturnSlotPattern` (for example |
| 280 | +in `FunctionFields::return_slot_pattern`), which is present for any function |
| 281 | +with a declared return type, and sometimes refers to the actual storage provided |
| 282 | +by the caller (for example in `ReturnTypeInfo::has_return_slot`), which is |
| 283 | +present only if the return type has an in-place value representation. |
| 284 | + |
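The two meanings of "return slot" described above can be made concrete with a toy model. The `ToySignature` struct and the four functions are hypothetical names introduced for illustration; they only encode the behavior this section describes, in which the `Call` parameter list always includes the return slot for a declared return type, while the `Call` argument list includes it only when the return value is passed in memory.

```cpp
#include <cassert>

// Toy model only: hypothetical summary of a function signature.
struct ToySignature {
  int runtime_params;       // `Call` parameters other than the return slot.
  bool has_return_type;     // Whether a return type is declared.
  bool return_is_in_place;  // Whether the return value is passed in memory.
};

// The pattern is present for any function with a declared return type.
auto HasReturnSlotPattern(const ToySignature& sig) -> bool {
  return sig.has_return_type;
}

// The storage is present only when the caller actually provides it.
auto HasReturnSlotStorage(const ToySignature& sig) -> bool {
  return sig.has_return_type && sig.return_is_in_place;
}

// Length of the `Call` parameter list.
auto CallParamCount(const ToySignature& sig) -> int {
  return sig.runtime_params + (HasReturnSlotPattern(sig) ? 1 : 0);
}

// Length of the `Call` argument list.
auto CallArgCount(const ToySignature& sig) -> int {
  return sig.runtime_params + (HasReturnSlotStorage(sig) ? 1 : 0);
}
```

For a signature with two runtime parameters and a return type that is not in-place, this model yields three `Call` parameters but only two `Call` arguments, which is the one-element mismatch noted above.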
| 285 | +> **TODO:** When the return type isn't in-place, the `OutParamPattern` should |
| 286 | +> probably not be in the `Call` parameter list (for consistency with the `Call` |
| 287 | +> argument list), and possibly the `OutParamPattern`, `OutParam`, and |
| 288 | +> `ReturnSlot` instructions should not be emitted in the first place. |
| 289 | +> Furthermore, we should find a way to resolve the inconsistent "return slot" |
| 290 | +> terminology. |