
Parsing

Kalyan Sriram edited this page Mar 13, 2023 · 1 revision

Parsing and the AST

The second stage of the pipeline, parsing, is responsible for taking the token stream generated by the lexer and building an abstract syntax tree (AST). This is a tree data structure where each node represents a construct in the language - such as a function, declaration, assignment, or addition. Like most languages, femto has a very recursive structure - function blocks contain statements, which contain expressions, which contain subexpressions, and so on. The parser constructs this tree using a technique called recursive descent parsing. This is a fancy name for recognizing groups of tokens that make up a known language construct, creating a tree node, and recursively trying to match sub-nodes based on context.

Tree Structure

The data structures for working with the AST are defined in Ast.zig. This is documented in the code, so it is only briefly described here. The tree node structure looks like this:

const Node = struct {
    main_token: u32,
    data: Data,

    pub const Data = union(enum) {
        binary_expr: struct {
            left: u32,
            right: u32,
        },
        block: struct {
            stmts_start: u32,
            stmts_end: u32,
        },
        // ...
    };
};

Each node stores the index of up to one token, called the main_token. This usually represents the starting token for a node (like .k_fn or .k_let), but varies by node type. Since the structure of a language construct is fixed, other tokens can be found by indexing relative to this main_token. For instance, in a constant declaration let foo = 123;, the identifier foo can be found at main_token + 1, where main_token is the index of .k_let.
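As a sketch of this relative indexing (the token tags and layout below are illustrative, not the actual Ast.zig definitions):

```zig
const std = @import("std");

// Hypothetical token tags, mirroring the lexer's naming scheme.
const Tag = enum { k_let, ident, equal, int_literal, semi };

pub fn main() void {
    // Token stream for `let foo = 123;`.
    const tags = [_]Tag{ .k_let, .ident, .equal, .int_literal, .semi };

    // A `let` declaration node stores main_token = index of .k_let;
    // the identifier always sits one token after the keyword, so we
    // locate it by relative indexing instead of storing its index.
    const main_token: u32 = 0;
    const ident_token = main_token + 1;
    std.debug.assert(tags[ident_token] == .ident);
    std.debug.print("identifier at token index {}\n", .{ident_token});
}
```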

Each node also contains a tagged union with up to two u32 indices. Remember that since we cap the file size at UINT32_MAX, we also have an upper bound on the number of AST elements (we will discuss how the elements are stored in a minute). Each of these indices points to one of two things - a child node, or an index in the extra data array.

Tree Storage

The Ast looks something like this (simplified):

const Ast = struct {
    source: [:0]const u8,
    tokens: []Token,
    nodes: []Node,
    extra: []u32,
    errors: []Error,
};

Instead of allocating each node separately and storing pointers, each node is appended to a single ArrayList. This reduces the number of allocations and halves the size of each reference, from usize pointers (u64 on 64-bit systems) to u32 indices. Additionally, we get better cache locality, since all the nodes sit next to each other in one contiguous block of memory.
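A minimal sketch of this index-based storage (using Zig 0.11+ syntax; the simplified Node and addNode helper are made up for the example, not the real parser API):

```zig
const std = @import("std");

// Simplified node: a binary expression referring to children by index.
const Node = struct {
    main_token: u32,
    lhs: u32,
    rhs: u32,
};

// Appending returns a u32 index instead of a pointer, so child
// references stay valid even when the backing array reallocates.
fn addNode(nodes: *std.ArrayList(Node), node: Node) !u32 {
    const index: u32 = @intCast(nodes.items.len);
    try nodes.append(node);
    return index;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    var nodes = std.ArrayList(Node).init(gpa.allocator());
    defer nodes.deinit();

    const lhs = try addNode(&nodes, .{ .main_token = 0, .lhs = 0, .rhs = 0 });
    const rhs = try addNode(&nodes, .{ .main_token = 2, .lhs = 0, .rhs = 0 });
    // The parent refers to its children by u32 index, not by pointer.
    const parent = try addNode(&nodes, .{ .main_token = 1, .lhs = lhs, .rhs = rhs });
    std.debug.print("parent node stored at index {}\n", .{parent});
}
```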

On top of this, the tree uses Zig's MultiArrayList data structure. This is a compile-time struct-of-arrays construct, which splits the Node struct's fields into separate arrays, something like this:

const MultiArrayList = struct {
    main_tokens: []u32,
    data: []Data,
};

This further improves cache locality, since the code sometimes works with the main_token and data fields separately. While this particular Node consists only of u32 values, SOA types in general also reduce memory wasted on struct member padding.
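In practice this is handled by std.MultiArrayList directly; a small standalone sketch (with a simplified Node, not the real one) looks like:

```zig
const std = @import("std");

const Node = struct {
    main_token: u32,
    data: u32,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var nodes = std.MultiArrayList(Node){};
    defer nodes.deinit(alloc);

    try nodes.append(alloc, .{ .main_token = 0, .data = 7 });
    try nodes.append(alloc, .{ .main_token = 3, .data = 9 });

    // Each field lives in its own contiguous array, so iterating over
    // just the main tokens never touches the data payloads.
    const tokens: []u32 = nodes.items(.main_token);
    for (tokens) |t| std.debug.print("main_token = {}\n", .{t});
}
```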

Parsing

Parser Overview

The parser lives in parse.zig. The Parser data structure contains mutable, dynamically resizing ArrayList versions of each of the fields listed above in the Ast. It also owns the list of tokens (tags and start indices only). It turns out that we don't need to store each token's end index, since many tokens are never converted to substrings (e.g. keywords) or have trivially known sizes (operators have fixed length). When we do need a token's end, we can re-lex it on demand.
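The re-lex-on-demand idea can be sketched as follows - the tokenEnd helper below is hypothetical (the real parser would dispatch on the token's tag), handling only identifier-like tokens:

```zig
const std = @import("std");

fn isIdentChar(c: u8) bool {
    return std.ascii.isAlphanumeric(c) or c == '_';
}

// Hypothetical sketch: recover an identifier token's end offset by
// re-lexing forward from its stored start, instead of storing the end.
fn tokenEnd(source: []const u8, start: u32) u32 {
    var i: u32 = start;
    while (i < source.len and isIdentChar(source[i])) : (i += 1) {}
    return i;
}

pub fn main() void {
    const source = "let foo = 123;";
    const start: u32 = 4; // stored start index of the `foo` token
    const end = tokenEnd(source, start);
    std.debug.print("token text: {s}\n", .{source[start..end]});
}
```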

The parser also contains a scratch list, which is used for storing temporary u32 values. Any values that a function adds to the scratch list are removed before the function returns, usually via defer scratch.shrinkRetainingCapacity(scratch_top), where scratch_top is the scratch length recorded on entry to the function. Hence, functions that use the same shared scratch list can nest without worrying about garbage being inserted.
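A sketch of this discipline (collectStmts and its contents are invented for illustration - the real functions copy the scratch slice into the extra data array before returning):

```zig
const std = @import("std");

// Record the scratch length on entry and restore it on exit, so
// nested callers can safely share one scratch buffer.
fn collectStmts(scratch: *std.ArrayList(u32)) !u32 {
    const scratch_top = scratch.items.len;
    defer scratch.shrinkRetainingCapacity(scratch_top);

    try scratch.append(10);
    try scratch.append(20);
    // ... would copy scratch.items[scratch_top..] into extra data here ...
    return @intCast(scratch.items.len - scratch_top);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    var scratch = std.ArrayList(u32).init(gpa.allocator());
    defer scratch.deinit();

    const count = try collectStmts(&scratch);
    // The defer ran on return, so the scratch list is empty again.
    std.debug.print("collected {}, scratch len now {}\n", .{ count, scratch.items.len });
}
```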

The parser is designed to generate an AST for a single femto source file. Multiple parser instances will be instantiated to parse the entire source, and the resulting trees will be merged later in the process before HIR generation. The top-level node in the AST is the module.

Module Parsing

The module parser tries to parse the top level source file by repeatedly looking for valid global statements, such as declarations. Each of these in turn calls functions like parseDecl(), which themselves call further functions. This is what "recursive descent parsing" means in practice.
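A toy skeleton of this shape (the grammar, token tags, and function names below are simplified stand-ins, not femto's actual grammar) - each rule is a function that consumes tokens and calls the rules it nests:

```zig
const std = @import("std");

const Tag = enum { k_let, ident, equal, int_literal, semi, eof };

const P = struct {
    tags: []const Tag,
    pos: u32 = 0,

    fn eat(p: *P, tag: Tag) !void {
        if (p.tags[p.pos] != tag) return error.UnexpectedToken;
        p.pos += 1;
    }

    // module ::= decl*
    fn parseModule(p: *P) !u32 {
        var decls: u32 = 0;
        while (p.tags[p.pos] != .eof) : (decls += 1) try p.parseDecl();
        return decls;
    }

    // decl ::= "let" ident "=" expr ";"
    fn parseDecl(p: *P) !void {
        try p.eat(.k_let);
        try p.eat(.ident);
        try p.eat(.equal);
        try p.parseExpr();
        try p.eat(.semi);
    }

    // expr ::= int_literal (toy grammar)
    fn parseExpr(p: *P) !void {
        try p.eat(.int_literal);
    }
};

pub fn main() !void {
    // Token stream for `let foo = 123;`.
    var p = P{ .tags = &.{ .k_let, .ident, .equal, .int_literal, .semi, .eof } };
    std.debug.print("parsed {} decl(s)\n", .{try p.parseModule()});
}
```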

Binary Expressions

The only exception to the recursive descent architecture is the parsing of binary expressions. Since humans (and hence femto source) use infix notation for arithmetic with order of operations (a + b * c - d), we cannot simply recursively parse nested binary expressions. Instead, here we use another simple algorithm for operator precedence parsing, where each operator gets a certain amount of "priority" which is used to determine the recursive order of the nodes in the tree. The LLVM Kaleidoscope tutorial explains the logic that is used.
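The core loop can be sketched as a precedence-climbing evaluator (the token encoding here is made up for the example; the real parser builds tree nodes instead of computing values):

```zig
const std = @import("std");

const Tok = union(enum) { num: i64, op: u8, eof };

// Each operator's "priority"; unknown bytes get -1 so the loop stops.
fn prec(op: u8) i8 {
    return switch (op) {
        '+', '-' => 10,
        '*', '/' => 20,
        else => -1,
    };
}

const P = struct {
    toks: []const Tok,
    pos: usize = 0,

    fn parsePrimary(p: *P) i64 {
        const n = p.toks[p.pos].num;
        p.pos += 1;
        return n;
    }

    // Keep consuming operators with precedence >= min_prec; a
    // higher-precedence operator binds its right-hand side first.
    fn parseExpr(p: *P, min_prec: i8) i64 {
        var lhs = p.parsePrimary();
        while (p.toks[p.pos] == .op and prec(p.toks[p.pos].op) >= min_prec) {
            const op = p.toks[p.pos].op;
            p.pos += 1;
            const rhs = p.parseExpr(prec(op) + 1);
            lhs = switch (op) {
                '+' => lhs + rhs,
                '-' => lhs - rhs,
                else => lhs * rhs, // '*' ('/' omitted for brevity)
            };
        }
        return lhs;
    }
};

pub fn main() void {
    // 1 + 2 * 3 - 4: the '*' binds tighter, so this groups as (1 + (2*3)) - 4.
    var p = P{ .toks = &.{ .{ .num = 1 }, .{ .op = '+' }, .{ .num = 2 }, .{ .op = '*' }, .{ .num = 3 }, .{ .op = '-' }, .{ .num = 4 }, .eof } };
    std.debug.print("{}\n", .{p.parseExpr(0)}); // prints 3
}
```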

Level of Analysis

AST parsing is responsible for verifying all syntax. That is, source code that is successfully parsed should not contain any syntax errors. However, it may still contain other semantic errors - such as references to undeclared variables/identifiers, type mismatches, etc. Currently, the parser is a WIP and frequently crashes or terminates if invalid syntax is encountered; however, in the future, it will be improved to detect common errors and notify the user without panicking. This would allow more robust parsing for e.g. an LSP server and let the parser report the user's full list of syntax errors in one pass.
