Here you can find definitions for words that are commonly used in the compiler along with links to the codebase. Check https://www.roc-lang.org/tutorial if you want to know about general Roc terms. Feel free to ask for a term to be added or add one yourself!
Contributor note: definitions should be roughly ordered like in a tutorial, e.g. Parser should be explained before Canonicalization.
Command Line Interface. The entrypoint of the compiler that brings together all
functionality in the Roc toolset and makes it accessible to the user through the
terminal, e.g. roc build main.roc.
- new compiler: src/cli.main.zig
- old compiler: crates/cli/src/main.rs
A .roc file forms one module.
Types of modules:
- app (example): Applications are combined with a platform and compiled into an executable.
- module (example): Provide types and functions which can be imported into other modules.
- package (example): Organises modules to share functionality across applications and platforms.
- platform (example): Provides memory management and effects like writing to files, network communication,... to interface with the outside world. Detailed explanation.
- hosted (example): Lists all Roc types and functions provided by the platform.
Implementation:
- new compiler:
- old compiler:
(Intermediate Representation)
An abstract code format that sits between the high-level source code and the low-level machine code. It is generated after the source code is parsed and before target code is produced. IR makes it easier for the compiler to analyze and optimize programs.
Example for:
module []
foo : U64

Token IR:
KwModule(1:1-1:7),OpenSquare(1:8-1:9),CloseSquare(1:9-1:10),Newline(1:1-1:1),
Newline(1:1-1:1),
LowerIdent(3:1-3:4),OpColon(3:5-3:6),UpperIdent(3:7-3:10),Newline(1:1-1:1)
AST IR:
(file
(module (1:1-1:10))
(type_anno (3:1-4:4)
"foo"
(tag (3:7-3:10) "U64")))
A memory optimization technique where only one copy of each distinct value is stored in memory, regardless of how many times it appears in a program or IR. For example, a function named foo may be called many times in a Roc file, but we store foo once and use an index to refer to foo at the call sites.
Uses of interning:
- new compiler: TODO
- old compiler: small_string_interner.rs, mono_module.rs, format.rs, ...
- There are many more uses of interning, I recommend searching for "interner" (case-insensitive).
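The core mechanism can be sketched in a few lines of Python (a language-agnostic illustration; the compiler's actual interners are more specialized and arena-backed):

```python
class Interner:
    """Stores each distinct string once and hands out integer IDs."""

    def __init__(self):
        self._ids = {}      # string -> id
        self._strings = []  # id -> string

    def intern(self, s):
        # Return the existing ID if we've seen this string before.
        if s in self._ids:
            return self._ids[s]
        new_id = len(self._strings)
        self._ids[s] = new_id
        self._strings.append(s)
        return new_id

    def lookup(self, ident_id):
        return self._strings[ident_id]

interner = Interner()
a = interner.intern("foo")
b = interner.intern("bar")
c = interner.intern("foo")  # same string -> same ID as `a`
```

IRs can then store and compare small integer IDs instead of copies of the text.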
Any text in a Roc source file that has significant content but is not a Roc Str like "Hello". Used for variable names, record field names, type names, etc.
During tokenization all identifiers are put into a deduplicated collection and given an ID. That ID is used in IRs instead of the actual text to save memory.
Identifier in the compiler:
- new compiler:
- Ident
- Ident tokenization: check the functions chompIdentLower and chompIdentGeneral, and their uses.
- Ident parsing: search Ident
- old compiler:
- IdentStr
- module/ident.rs
- parsing: search "identifier" (case-insensitive)
A specific word that has a predefined meaning in the language, like crash, if, when, ...
Many keywords cannot be used as a variable name.
We have an overview of all Roc keywords.
Keywords in the compiler:
An operator is a symbol or keyword that performs a specific operation on one or more operands (values or variables) to produce a result.
Some examples: +, =, ==, >. A table of all operators in Roc.
+ is an example of a binary operator because it works with two operands, e.g. 1 + 1. Similarly, ! (e.g. !Bool.false) is a unary operator.
Operators in the compiler:
- New compiler: search Op in tokenize.zig
- Old compiler: search operator_help in expr.rs
The set of rules that define the correct structure and format of statements, expressions, and code blocks. It specifies how code should be written so that it can be interpreted and executed correctly. In other words, syntax determines how symbols, keywords, and punctuation must be arranged to form valid source code.
Syntax in the compiler:
- New compiler: determined by the tokenizer and parser.
- Old compiler: determined by the parser.
Syntax within a programming language that is designed to make things easier to read or express. It allows developers to write code in a more concise, readable, or convenient way without adding new functionality to the language itself.
Desugaring converts syntax sugar (like x + 1) into more fundamental operations (like Num.add(x, 1)).
A table of all operators in Roc and what they desugar to
Desugaring in the compiler:
- New compiler: there is no separate desugaring step; sugared forms are handled directly by the compiler and are not rewritten into a different form or IR representation.
- Old compiler: desugar.rs
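As an illustration of what a desugaring pass does, here is a hedged Python sketch that rewrites a toy nested-tuple AST; the node shapes and the BINOP_TO_CALL table are invented for this example and are not the compiler's real data types:

```python
# Map sugar operators to the builtin calls they stand for (toy subset).
BINOP_TO_CALL = {"+": "Num.add", "-": "Num.sub", "*": "Num.mul"}

def desugar(node):
    """Recursively rewrite binop nodes into call nodes."""
    if isinstance(node, tuple) and node[0] == "binop":
        _, op, lhs, rhs = node
        # `x + 1` becomes the function call `Num.add(x, 1)`.
        return ("call", BINOP_TO_CALL[op], desugar(lhs), desugar(rhs))
    return node

sugared = ("binop", "+", ("var", "x"), ("num", 1))
print(desugar(sugared))  # -> ('call', 'Num.add', ('var', 'x'), ('num', 1))
```

Later phases then only have to handle plain function calls, not every piece of surface syntax.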
Specifies the type of a variable. For example, the type signature of Str.concat is:
concat : Str, Str -> Str

Here it specifies that concat takes two strings as input and produces one as output.
In the compiler, the type signature specified in the source code has priority over the type found by type inference, although both need to match for your code to compile completely.
Type annotations are basically the same thing as type signatures and both terms are used interchangeably throughout the compiler.
Type signature in the code base:
- New compiler: Parser.zig (search signature)
- Old compiler: ast.rs (search TypeAnnotation)
A way to give a new name to an existing type to make code more readable or meaningful. It doesn't create a new type, just an alternative name for an existing one. For example:
Person : { first_name : Str, last_name : Str }
# Using Person:
register : Person => Result {} [RegistrationFailed]
register = |person|
...

Note: the term "alias" in the code base does not always refer to a type alias; it can also refer to an import alias using as, alias analysis, etc.
A placeholder for a type. It allows generic programming that can make your code work with a collection of types. For example:
reverse : List a -> List a

reverse can be done on any kind of list, regardless of the type of the element, so we indicate this with the type variable a.
Type variables can also be required to have certain abilities, for example:
Graph a := Dict a (List a) where a implements Eq

Type variables don't have to be a single letter, they just have to start with a lowercase letter.
Parsing of type vars:
- new compiler: search ty_var in Parser.zig
- old compiler: search parse_type_variable in type_annotation.rs
A function or type that is natively provided by Roc, for example Dict.empty, List.map, Result, ... .
You don't need to import any of these to use them.
Implementation of builtins:
- new compiler: src/builtins (work in progress)
- old compiler: crates/compiler/builtins. Note: some builtin functions are implemented in Zig, like Num.f64_to_bits, see num.zig.
Interesting fact: our builtins are integrated into the compiler; there is no typical separate standard library.
A compiler phase is a distinct stage in the process the compiler goes through to translate high-level source code into machine code that a computer can execute. Compilers don't just do this in one big step; they break it down into several phases, each handling a specific task. Some examples of phases: tokenization, parsing, code generation, ...
A single traversal through a program's IR that performs a specific transformation or analysis. insert_reset_reuse_operations is an example of a compiler pass, it analyzes and modifies the IR to detect when memory is no longer used and allows us to re-use that memory for other things.
Note the difference with a compiler phase; a phase is a larger logical unit of compilation that may include multiple passes.
The process of breaking down source code into smaller units called tokens. These tokens are the basic building blocks of a programming language, such as keywords, identifiers, operators, and symbols. The input code is scanned character by character and is grouped into meaningful sequences based on the language's syntax rules. This step makes parsing simpler.
Example source code:
module []
foo : U64

Corresponding tokens:
KwModule(1:1-1:7),OpenSquare(1:8-1:9),CloseSquare(1:9-1:10),Newline(1:1-1:1),
Newline(1:1-1:1),
LowerIdent(3:1-3:4),OpColon(3:5-3:6),UpperIdent(3:7-3:10),Newline(1:1-1:1)
New compiler:
Old compiler:
- We did not do a separate tokenization step; everything happened in the parser.
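The idea behind tokenization can be sketched in Python with a regex-based toy lexer; the token names mirror the example above, but the real tokenizer is hand-written in Zig and also tracks line:column regions:

```python
import re

# Token patterns, tried in order at each position (a simplified sketch;
# characters that match no pattern are silently skipped by finditer).
TOKEN_SPEC = [
    ("KwModule",    r"module\b"),
    ("UpperIdent",  r"[A-Z][A-Za-z0-9]*"),
    ("LowerIdent",  r"[a-z][A-Za-z0-9_]*"),
    ("OpColon",     r":"),
    ("OpenSquare",  r"\["),
    ("CloseSquare", r"\]"),
    ("Newline",     r"\n"),
    ("Skip",        r"[ \t]+"),  # whitespace produces no token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    tokens = []
    for match in MASTER.finditer(src):
        kind = match.lastgroup
        if kind != "Skip":
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("module []\n\nfoo : U64"))
```

Running it on the example source yields the same token kinds as shown above: KwModule, OpenSquare, CloseSquare, Newline, Newline, LowerIdent, OpColon, UpperIdent.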
(Abstract Syntax Tree)
An AST organizes and represents the source code as a tree-like structure. So for the code below:
module []
foo : U64

The AST is:
(file
(module (1:1-1:10))
(type_anno (3:1-4:4)
"foo"
(tag (3:7-3:10) "U64")))
It captures the meaning of the code, while ignoring purely syntactic details like parentheses, commas, semicolons, etc. Compared to raw source code, this structured format is much easier for the next compiler phase to analyze and manipulate programmatically.
The AST is created by the parser.
New compiler:
- See the Node struct in this file.
- You can see examples of ASTs in the .txt files in this folder.
Old compiler:
- See FullAst here
- Some tests
- Many snapshot tests
The step where the compiler checks if the source code follows the correct structure or “grammar” of the programming language. It takes the tokens produced by tokenization and organizes them to see if they make sense together, like checking the structure of sentences in a language. If the code is correct, the parser builds a tree-like structure (AST) that shows how the code is organized. If not, it reports errors.
Parser implementation:
- new compiler: src/parse
- old compiler: crates/compiler/parse (tokenization is not a separate step here)
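To make the token-to-AST step concrete, here is a minimal hand-rolled sketch in Python for a single type annotation like `foo : U64`; the `(kind, text)` token tuples and the nested-tuple AST shape are invented for illustration and are not the real parser's API:

```python
def parse_type_anno(tokens):
    """Parse tokens for `name : Type` into a small AST node,
    reporting an error if the structure is wrong."""
    kind, name = tokens[0]
    assert kind == "LowerIdent", f"expected an identifier, got {kind}"
    assert tokens[1][0] == "OpColon", "expected ':' after the identifier"
    kind, ty = tokens[2]
    assert kind == "UpperIdent", f"expected a type name, got {kind}"
    # Mirrors the glossary's AST shape: (type_anno "foo" (tag "U64"))
    return ("type_anno", name, ("tag", ty))

tokens = [("LowerIdent", "foo"), ("OpColon", ":"), ("UpperIdent", "U64")]
print(parse_type_anno(tokens))  # -> ('type_anno', 'foo', ('tag', 'U64'))
```

A real parser does this recursively for every construct in the language, and recovers from errors instead of stopping at the first one.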
A symbol points to a specific identifier with an ident_id and a module_id (see module).
Symbol implementation:
- new compiler: Not yet implemented.
- old compiler: symbol.rs
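The idea can be sketched as a pair of integers in Python (the IDs below are hypothetical, and the real representation is packed more compactly):

```python
from collections import namedtuple

# A Symbol pairs a module ID with an identifier ID, so that `foo` in one
# module and `foo` in another are distinct symbols even though they
# intern to the same ident text. Cheap to copy and compare.
Symbol = namedtuple("Symbol", ["module_id", "ident_id"])

foo_in_module_3 = Symbol(module_id=3, ident_id=12)
foo_in_module_7 = Symbol(module_id=7, ident_id=12)

print(foo_in_module_3 != foo_in_module_7)  # different symbols
```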
A function that remembers and can access variables from its outer (enclosing) scope, even after that scope has finished executing. This means the function "closes over" its environment, retaining access to the variables it was created with.
Example of a closure:
# A function that returns a closure
makeCounter : I64 -> (I64 -> I64)
makeCounter = |start|
# This inner function is a closure - it "closes over" the start variable
|increment|
start + increment
# Test the closure with expect
expect
counter = makeCounter(10)
result = counter(5)
result == 15

Closure implementation:
- new compiler: Not yet implemented
- old compiler: ClosureData in expr.rs. Closures are used all over the place, just search "Closure" (match case).
The process of automatically determining the types of expressions without explicit type annotations from the programmer. The compiler analyzes how values are used in code to deduce their types. For example:
foo = |bar|
Num.to_str(bar)

foo has no type annotation, so we don't immediately know the type of bar. Later in the function, Num.to_str is called on bar and we know the type of to_str is Num * -> Str, so that means bar must be a Num *! We have now inferred the type of bar 🎉
Type inference implementation:
- new compiler: Not yet implemented
- old compiler: type inference is spread over multiple crates: solve, late-solve, unify, constrain, ...
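The heart of this process, unification, can be sketched in a few lines of Python. This is a toy representation (a type is either a flex-var string like "t_bar" or a tuple like ("Fn", arg, ret)), not the real solver's data model:

```python
def resolve(t, subs):
    # Follow substitutions until we hit a concrete type or an unbound var.
    while isinstance(t, str) and t in subs:
        t = subs[t]
    return t

def unify(t1, t2, subs):
    """Make two types equal, recording what each flex var must be."""
    t1, t2 = resolve(t1, subs), resolve(t2, subs)
    if t1 == t2:
        return
    if isinstance(t1, str) and t1.islower():  # t1 is an unbound flex var
        subs[t1] = t2
        return
    if isinstance(t2, str) and t2.islower():
        subs[t2] = t1
        return
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):
            unify(a, b, subs)
        return
    raise TypeError(f"type mismatch: {t1} vs {t2}")

# Infer bar's type in `foo = |bar| Num.to_str(bar)`: the call site forces
# bar's type to match to_str's argument type, Num *.
subs = {}
to_str_type = ("Fn", ("Num", "*"), "Str")
call_type = ("Fn", "t_bar", "t_ret")  # fresh flex vars for this call
unify(to_str_type, call_type, subs)
print(resolve("t_bar", subs))  # -> ('Num', '*')
```

The real solver adds much more (ranks, aliases, occurs checks, error reporting), but the shape of the algorithm is the same.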
A requirement or restriction that limits what types can be used in a particular context. Type constraints express relationships between types that must be satisfied for a program to be well-typed. Type constraints can take several forms:
- Equality constraint: requires types to be the same. In the expression a + b, both a and b need to have the same type.
- Try target constraint: requires the expression you use ? on to be a Result: Str.from_utf8(byte_list)?
- Effect suffix constraint: your function should have a ! suffix if it calls an effectful function.
- See pub enum Constraint in crates/compiler/can/src/constraint.rs for an overview of all constraints.
Type constraint implementation:
- new compiler: Not yet implemented
- old compiler:
- Definition: can/src/constraint.rs
- Type solving using constraints: crates/compiler/solve/src/solve.rs
TODO
TODO
A type system where type equivalence is based on the shape (the set of members and their types) of the values, not the names by which the types were declared. Example:
Person : { first_name : Str, last_name : Str }
User : { first_name : Str, last_name : Str }
register_person! : Person => Result {} [InvalidName]
register_person! = |person|
...
user : User
user = { first_name: "Bob", last_name: "Foo"}
# This works even though `register_person!` expects a `Person`, the types have the same structure so they are compatible!
register_person!(user)

A type system where type equivalence is based on explicit names or declarations, not just structure. Two types are considered the same only if they have the same name or originate from the same type declaration.
Roc is inspired by Elm. In Elm, variables are defined using let:
double arg =
let
two = 2
in
arg * two

This makes the scope of two very clear.
In Roc, you can just define variables and use them without let ... in, we add let behind the scenes.
See Stmt::Let in crates/compiler/mono/src/ir.rs (old compiler).
Say we have the following code:
my_record =
id = |x| x
{ a: id(1), b: id("foo") }

If we inferred the type of id to be Int -> Int, that would lead to a type error at the next call site, id("foo").
So, we generalize the type of id to a -> a. This allows id to be called with any type.
In general, the rank tracks the number of let-bindings a variable is "under". Top-level definitions have rank 1. A let inside a top-level definition gets rank 2, and so on.
An example:
foo = 3
plus_five = |arg|
x = 5
arg + x

Here the rank of foo is 1 because it is at the top level, and the rank of x is 2 because it is under or inside plus_five.
Imported variables get rank 2.
Rank 0 is special, it is used for variables that are generalized.
Keeping track of ranks makes type inference faster. You can see how ranks are used here (old compiler).
An identifier is rigid if it has a fixed, concrete name that's written in Roc code. On the other hand, a flexible identifier is one created by the compiler during type inference.
a is rigid below, I defined it. The compiler should use a in the error message if there is something wrong with the function's type.
take_first : List a, U64 -> List a
take_first = |list, output_length|
List.sublist(list, { start: 0, len: output_length })

If I make a type error in the definition:
take_first : U64
take_first = |list, output_length|
List.sublist(list, { start: 0, len: output_length })

The compiler's error message will say:
The body is an anonymous function of type:
List elem, Int Unsigned64 -> List elem
But the type annotation on take_first says it should be:
U64
elem is a flexible type variable, the compiler chose that name.
Related definitions in the compiler:
- old compiler: search "pub enum Content" in types/src/subs.rs
- new compiler: search "pub const Content" in canonicalize/CIR.zig
Represents types without indirection; it's the concrete form that types take after resolving variables and aliases.
Definitions in the compiler:
- old compiler: search "pub enum FlatType" in types/src/subs.rs
- new compiler: search "pub const FlatType" in types/types.zig
(can)
After parsing a Roc program, the obtained IR is transformed into a canonical form called CanIR.
Canonicalization performs analysis to catch user errors, and sets up the state necessary to figure out the types in a program. Among other things, canonicalization:
- Uniquely identifies names (think variable and function names). Along the way, canonicalization builds a graph of all variables' references, and catches unused definitions, undefined definitions, and shadowed definitions.
- Resolves type signatures, including aliases, into a form suitable for type solving.
- Determines the order definitions are used in, if they are defined out-of-order.
- Eliminates syntax sugar (for example, turning + into the function call add).
Canonicalization occurs on a single module (file) in isolation, so the work can be easily parallelized and cached. If the source code for a module has not changed, the CanIR can simply be loaded from disk and used immediately.
Implementation of Canonicalization:
- new compiler: canonicalize/mod.zig, canonicalize folder
- old compiler: can folder
TODO
(mono, specialization)
Monomorphization, also known as type specialization, is the process of creating a distinct copy
of each instance of a generic function or value based on all specific usages in a program.
For example, a function with the type Num a -> Num a may only be called in the program with a
U64 and an I64. Specialization will then create two functions, with the types U64 -> U64 and
I64 -> I64.
This trades off some compile time for much better runtime performance, since we don't need to
look up which implementation to call at runtime (AKA dynamic dispatch).
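The bookkeeping can be sketched in Python; the function name and the string form of the specializations are invented for illustration, not the compiler's real output:

```python
def specialize(generic_name, call_site_types):
    """One copy of a generic function per distinct concrete type
    observed at its call sites; duplicates share a specialization."""
    specializations = {}
    for ty in call_site_types:
        specializations.setdefault(ty, f"{generic_name}_{ty} : {ty} -> {ty}")
    return specializations

# A hypothetical `double : Num a -> Num a`, called with U64 twice and I64 once:
copies = specialize("double", ["U64", "I64", "U64"])
print(copies)
# -> {'U64': 'double_U64 : U64 -> U64', 'I64': 'double_I64 : I64 -> I64'}
```

Three call sites yield only two copies, because the two U64 calls share one specialization.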
Related Files:
- old compiler:
TODO
(refcount)
A memory management technique where each thing in memory has an associated counter that tracks how many references are pointing to that thing.
How it works:
- Every time a new reference to something is created, its reference counter is incremented.
- Every time a reference is deleted or goes out of scope, the counter is decremented.
- When the reference count reaches zero, it means no references are pointing to the thing, so the memory occupied by it can be safely freed.
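The three rules above can be sketched in Python (Python of course has its own refcounting; this just makes the inc/dec/free events explicit):

```python
class RcBox:
    """A sketch of a reference-counted heap value."""

    def __init__(self, value):
        self.value = value
        self.refcount = 1   # the creator holds the first reference
        self.freed = False

    def inc(self):
        self.refcount += 1  # a new reference was created

    def dec(self):
        self.refcount -= 1  # a reference was dropped or went out of scope
        if self.refcount == 0:
            self.freed = True  # nothing points at it: safe to free

box = RcBox([1, 2, 3])
box.inc()  # a second reference appears -> count 2
box.dec()  # one reference dropped      -> count 1, still alive
box.dec()  # last reference dropped     -> count 0, memory is freed
```

In the compiled program these inc/dec operations are inserted automatically by the compiler, not written by the Roc programmer.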
Roc uses automatic reference counting because it avoids the significant pauses that can happen with traditional garbage collection. These pauses can be annoying in games for example, because it can result in a noticeable drop in framerate.
Another approach, manual memory management, can produce the fastest programs, but it is also more tedious to write code that way.
Reference counting implementation:
- Old compiler: Mono folder (search ref)
TODO
TODO
TODO
(code generation)
The phase where the compiler translates intermediate representation (IR) of a program into target code, usually assembly code. This assembly code is not yet ready for execution, it needs to be linked first.
For the old compiler we have three code gen backends, see #backend for more explanation.
Code Gen implementation:
- New compiler: Not done yet
- Old compiler:
- crates/compiler/gen_llvm generates LLVM IR
- crates/compiler/gen_dev generates assembly quickly, for fast build times
- crates/compiler/gen_wasm generates WebAssembly (binary) for typical wasm use cases.
- code gen tests
TODO
TODO
TODO
Wrapping a value or function in a generic, opaque representation (box) that can easily be passed to the platform. A boxed value is allocated on the heap. You can box something in Roc with the builtin Box.box and unbox it with Box.unbox.
Example handling of boxes in basic-cli.
See also std::boxed::Box in Rust.
TODO
TODO
TODO
TODO
TODO
Left & Right Hand Side: for example in 1 + 2, 1 is on the left hand side and 2 is on the right hand side.
TODO
TODO
(join)
Let's start from this example:
make_cat_or_dog_sound : Bool -> Str
make_cat_or_dog_sound = |is_dog|
sound =
if is_dog then
"Woof"
else
"Miauw"
Str.concat sound "!"

The function from the example above checks the boolean parameter:
if is_dog is true, the "Woof" sound branch is taken, otherwise the "Miauw" branch. Afterwards the branches join
back up and use string concatenation to add an exclamation mark. A naive way to normalise
this function would be to add an assignment and Str.concat ... to both branches. Like below:
make_cat_or_dog_sound = |is_dog|
if is_dog then
sound = "Woof"
Str.concat sound "!"
else
sound = "Miauw"
Str.concat sound "!"

However, if we were to use this approach with multiple nested expressions, it could result in a very large code size.
We could put the if-then-else in a function, and call this function with the is_dog parameter to
obtain its sound. But this again comes with downsides:
- Function calls require additional stack and register manipulation, making them more expensive than simple branching.
- Moving code into functions can make optimization more difficult.
Joinpoints to the rescue:
make_cat_or_dog_sound = |is_dog|
joinpoint sound_jp = |sound|
Str.concat sound "!"
if is_dog then
jump sound_jp "Woof"
else
jump sound_jp "Miauw"

sound_jp is a joinpoint: a labeled code location inside a function that can be "jumped to" from multiple places in the control flow.
Joinpoints provide the best of both worlds:
- We avoid the code duplication from the naive approach.
- No need to modify the stack or registers.
- They simplify control flow by bringing branches back together.
- They are optimization friendly.
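Python has no joinpoints, so the closest readable stand-in is a local function; note that unlike a real joinpoint, which compiles to a plain branch target inside the function, calling a function does carry the overhead described above. This sketch only shows the control-flow shape:

```python
def make_cat_or_dog_sound(is_dog):
    def sound_jp(sound):            # joinpoint sound_jp = |sound|
        return sound + "!"          #     Str.concat sound "!"
    if is_dog:
        return sound_jp("Woof")     # jump sound_jp "Woof"
    else:
        return sound_jp("Miauw")    # jump sound_jp "Miauw"

print(make_cat_or_dog_sound(True))   # -> Woof!
print(make_cat_or_dog_sound(False))  # -> Miauw!
```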
This explanation was adapted from the thesis of J. Teeuwissen: reference counting with reuse in Roc.
Joinpoints in the compiler:
- Old compiler:
- Search for Join in enum Stmt in crates/compiler/mono/src/ir.rs
- cd crates && rg JoinPoint
- cd crates && rg "joinpoint" --glob '!**/generated' --glob '*.rs'
- cd crates && rg "join_point" --glob '*.rs'
- New compiler: not yet!
TODO
TODO
(ADT)
A custom type that can combine these simpler types:
- Sum Types: a value can be one of several options. For example, with Result it can be either Ok(something) or Err(some_error).
- Product Types: multiple pieces of data together. For example, a Person record might contain both a name and an age.
An example of a combination:
NonEmptyList : [Head { x : U64, y : U64 }, Tail (List { x : U64, y : U64 })]

A NonEmptyList value has to be of the type Head or Tail (Sum), and { x : U64, y : U64 } combines two U64 (Product).
Note: a sum or product type by itself is also an Algebraic Data Type.