From f9f633e6cc53ccce55b9917dab9eb2f76c298a59 Mon Sep 17 00:00:00 2001 From: tajmone Date: Sat, 13 Mar 2021 15:50:12 +0100 Subject: [PATCH 1/2] EditorConfig Add `.editorconfig` and `.gitattributes` settings to enforce code styles consistency across different editors and IDEs. --- .editorconfig | 39 +++++++++++++++++++++++++++++++++++++++ .gitattributes | 6 ++++++ 2 files changed, 45 insertions(+) create mode 100644 .editorconfig create mode 100644 .gitattributes diff --git a/.editorconfig b/.editorconfig new file mode 100644 index 0000000..dbb1c68 --- /dev/null +++ b/.editorconfig @@ -0,0 +1,39 @@ +# https://editorconfig.org + +root = true + + +## Repository Configurations +############################ +[.{git*,editorconfig,*.yml}] +end_of_line = lf +indent_style = space +indent_size = unset +charset = utf-8 +trim_trailing_whitespace = true +insert_final_newline = true + +[.gitmodules] +indent_style = tab + + +## Markdown GFM +############### +[*.md] +indent_style = space +indent_size = unset +end_of_line = unset +charset = utf-8 +trim_trailing_whitespace = true +insert_final_newline = true + + +## C Source Files +################# +[*.{c,h}] +indent_style = space +indent_size = 4 +end_of_line = unset +charset = utf-8 +trim_trailing_whitespace = true +insert_final_newline = true diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..415a282 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,6 @@ + +## Repository Configuration +########################### +.editorconfig text eol=lf +.gitattributes text eol=lf +.gitignore text eol=lf From 53518b79987adf80fd8b7cf8bb67c9f5ccd6df10 Mon Sep 17 00:00:00 2001 From: tajmone Date: Sat, 13 Mar 2021 17:22:48 +0100 Subject: [PATCH 2/2] Polish English Tutorial Polish the English text of every tutorial chapter, and Improve markdown source docs: * Add Level 1 title to docs ("Preface", "1. Skeleton", etc.). * Replaces smiles with markdown emojis. 
* Fix titles casing using Chicago Manual of Style capitalization rules: https://capitalizemytitle.com/style/chicago# * Convert inline-link to reference-style links (DRY!). * Add a few links to external references. --- tutorial/en/0-Preface.md | 166 +++++++++-------- tutorial/en/1-Skeleton.md | 129 +++++++------ tutorial/en/2-Virtual-Machine.md | 294 ++++++++++++++++-------------- tutorial/en/3-Lexer.md | 236 +++++++++++++----------- tutorial/en/4-Top-down-Parsing.md | 119 +++++++----- tutorial/en/5-Variables.md | 80 ++++---- tutorial/en/6-Functions.md | 79 ++++---- tutorial/en/7-Statements.md | 54 +++--- tutorial/en/8-Expressions.md | 201 +++++++++++--------- 9 files changed, 764 insertions(+), 594 deletions(-) diff --git a/tutorial/en/0-Preface.md b/tutorial/en/0-Preface.md index cdd10ed..e07b121 100644 --- a/tutorial/en/0-Preface.md +++ b/tutorial/en/0-Preface.md @@ -1,104 +1,126 @@ -This series of articles is a tutorial for building a C compiler from scratch. +# Preface -I lied a little in the above sentence: it is actually an _interpreter_ instead -of _compiler_. I lied because what the hell is a "C interpreter"? You will -however, understand compilers better by building an interpreter. +This is multi-part tutorial on how to build a C compiler from scratch. -Yeah, I wish you can get a basic understanding of how a compiler is -constructed, and realize it is not that hard to build one. Good Luck! +Well, I lied a little in the previous sentence: it's actually an _interpreter_, +not a _compiler_. I had to lie, because what on earth is a "C interpreter"? +You will however gain a better understanding of compilers by building an +interpreter. -Finally, this series is written in Chinese in the first place, feel free to -correct me if you are confused by my English. 
And I would like it very much if -you could teach me some "native" English :) +Yeah, I want to provide you with a basic understanding of how a compiler is +constructed, and realize that it's not that hard to build one, after all. +Good Luck! -We won't write any code in this chapter, feel free to skip it if you are -desperate to see some code... +This tutorial was originally written in Chinese, so feel free to correct me if +you're confused by my English. Also, I would really appreciate it if you could +teach me some "native" English. :smile: -## Why you should care about compiler theory? +We won't be writing any code in this chapter; so if you're eager to see some code, feel free to skip it. -Because it is **COOL**! -And it is very useful. Programs are built to do something for us, when they -are used to translate some forms of data into another form, we can call them -a compiler. Thus by learning some compiler theory we are trying to master a very -powerful technique of solving problems. Isn't that cool enough to you? +## Why Should I Care about Compiler Theory? + +Because it's **COOL**! + +And it's also very useful. Programs are designed to do something for us; when +they are used to translate some form of data into another form, we can call +them compilers. Thus, by learning some compiler theory, we are trying to +master a very powerful problem solving technique. Doesn't this sound cool +enough to you? + +People used to say that understanding how a compiler works would help you to +write better code. Some would argue that modern compilers are so good at +optimizing that you shouldn't care any more. Well, that's true, most people +don't need to learn compiler theory to improve code performance — and by "most +people" I mean _you_! -People used to say understanding how a compiler works would help you to write -better code. Some would argue that modern compilers are so good at -optimization that you should not care any more. 
Well, that's true, most people -don't need to learn compiler theory only to improve the efficency of the code. -And by most people, I mean you! ## We Don't Like Theory Either -I have always been in awe of compiler theory because that's what makes -programing easy. Anyway can you imaging building a web browser in only -assembly language? So when I got a chance to learn compiler theory in college, -I was so excited! And then... I quit, not understanding what that it. +I've always been in awe of compiler theory because that's what makes programing +easy. Anyway, can you imagine building a web browser entirely in assembly +language? So when I got a chance to learn compiler theory in college, I was so +excited! And then ... I quit! And left without understanding what it's all +about. -Normally a course of compiler will cover: +Normally compiler course covers the following topics: -1. How to represent syntax (such as BNF, etc.) -2. Lexer, with somewhat NFA(Nondeterministic Finite Automata), - DFA(Deterministic Finite Automata). -3. Parser, such as recursive descent, LL(k), LALR, etc. +1. How to represent syntaxes (i.e. BNF, etc.) +2. Lexers, using NFA (Nondeterministic Finite Automata) and + DFA (Deterministic Finite Automata). +3. Parsers, such as recursive descent, LL(k), LALR, etc. 4. Intermediate Languages. 5. Code generation. 6. Code optimization. -Perhaps more than 90% students will not care anything beyond the parser, and -what's more, we still don't know how to build a compiler! Even after all the -effort learning the theories. Well the main reason is that what "Compiler -Thoery" trys to teach is "How to build a parser generator", namely a tool that -consumes syntax gramer and generates a compiler for you. lex/yacc or -flex/bison or things like that. +Perhaps more than 90% of the students won't really care about any of that, +except for the parser, and what's more, we'd still won't know how to actually +build a compiler! 
even after all the effort of learning the theory. Well, the +main reason is that what "Compiler Theory" tries to teach is "how to build a +parser generator" — i.e. a tool that consumes a syntax grammar and generates a +compiler for you, like lex/yacc or flex/bison, or similar tools. + +These theories try to teach us how to solve the general challenges of +generating compilers automatically. Once you've mastered them, you're able to +deal with all kinds of grammars. They are indeed useful in the industry. +Nevertheless, they are too powerful and too complicated for students and most +programmers. If you try to read lex/yacc's source code you'll understand what +I mean. -These theories try to teach us how to solve the general problems of generating -compilers automatically. That means once you've mastered them, you are able to -deal with all kinds of grammars. They are indeed useful in industry. -Nevertheless they are too powerful and too complicated for students and most -programmers. You will understand that if you try to read lex/yacc's source -code. +The good news is that building a compiler can be much simpler than you ever +imagined. I won't lie, it's not easy, but definitely not hard. -Good news is building a compiler can be much simpler than you ever imagined. -I won't lie, not easy, but definitely not hard. -## Birth of this project +## How This Project Began -One day I came across the project [c4](https://github.com/rswier/c4) on -Github. It is a small C interpreter which is claimed to be implemented by only -4 functions. The most amazing part is that it is bootstrapping (that interpret -itself). Also it is done with about 500 lines! +One day I came across the project [c4] on Github, a small C interpreter +claiming to be implemented with only 4 functions. The most amazing part is +that it's [bootstrapping] (i.e. it can interpret itself). Furthermore, it's +being done in around 500 lines of code! 
-Meanwhile I've read a lot of tutorials about compiler, they are either too -simple(such as implementing a simple calculator) or using automation -tools(such as flex/bison). c4 is however implemented all from scratch. The -sad thing is that it try to be minimal, that makes the code quite a mess, hard -to understand. So I started a new project to: +Meanwhile, I've read many tutorials on compilers design, and found them to be +either too simple (such as implementing a simple calculator) or using +automation tools (such as flex/bison). [C4], however, is implemented entirely +from scratch. The sad thing is that it aims to be "an exercise in minimalism," +which makes the code quite messy and hard to understand. So I started a new +project, in order to: -1. Implement a working C compiler(interpreter actually) -2. Write a tutorial of how it is built. +1. Implement a working C compiler (an interpreter, actually). +2. Write a step-by-step tutorial on how it was built. -It took me 1 week to re-write it, resulting 1400 lines including comments. The -project is hosted on Github: [Write a C Interpreter](https://github.com/lotabout/write-a-C-interpreter). +It took me one week to re-write it, resulting in 1400 lines of code (including +comments). The project is hosted on Github: [Write a C Interpreter]. -Thanks rswier for bringing us a wonderful project! +Thanks [@rswier] for sharing with us [c4], it's such a wonderful project! -## Before you go -Implementing a compiler could be boring and it is hard to debug. So I hope you -can spare enough time studying, as well as type the code. I am sure that you -will feel a great sense of accomplishment just like I do. +## Before You Begin + +Implementing a compiler can be boring and hard to debug. So I hope you can +spare enough time studying, and typing code. I'm sure that you will feel a +great sense of accomplishment, just like I do. + ## Good Resources -1. 
[Let’s Build a Compiler](http://compilers.iecc.com/crenshaw/): a very good - tutorial of building a compiler for fresh starters. -2. [Lemon Parser Generator](http://www.hwaci.com/sw/lemon/): the parser - generator that is used in SQLite. Good to read if you want to understand - compiler theory with code. +1. _[Let’s Build a Compiler]_: a very good tutorial of building a compiler, + written for beginners. +2. [Lemon Parser Generator]: the parser generator used by SQLite. + Good to read if you want to understand compiler theory with code. + +In the end, I am just a person with a general level of expertise, so there +will inevitably be some mistakes in my articles and code (and also in my +English). Feel free to correct me! + +I hope you'll enjoy it. -In the end, I am human with a general level, there will be inevitably wrong -with the articles and codes(also my English). Feel free to correct me! + -Hope you enjoy it. +[@rswier]: https://github.com/rswier "Visit @rswier's GitHub profile" +[bootstrapping]: https://en.wikipedia.org/wiki/Bootstrapping_(compilers) "Wikipedia » Bootstrapping (compilers)" +[c4]: https://github.com/rswier/c4 "Visit the c4 repository on GitHub" +[Lemon Parser Generator]: http://www.hwaci.com/sw/lemon/ "Visit Lemon homepage" +[Let’s Build a Compiler]: http://compilers.iecc.com/crenshaw/ "15-part tutorial series, by Jack Crenshaw" +[Write a C Interpreter]: https://github.com/lotabout/write-a-C-interpreter "Visit the 'Write a C Interpreter' repository on GitHub" diff --git a/tutorial/en/1-Skeleton.md b/tutorial/en/1-Skeleton.md index bf59e3a..48d8322 100644 --- a/tutorial/en/1-Skeleton.md +++ b/tutorial/en/1-Skeleton.md @@ -1,66 +1,69 @@ -In this chapter we will have an overview of the compiler's structure. +# 1. Skeleton -Before we start, I'd like to restress that it is **interperter** that we want -to build. That means we can run a C source file just like a script. 
It is
-chosen mainly for two reasons:
+In this chapter we'll present an overview of the compiler's structure.

-1. Interpreter differs from Compiler only in code generation phase, thus we'll
-   still learn all the core techniques of building a compiler(such as lexical
-   analyzing and parsing).
-2. We will build our own virtual machine and assembly instructions, that would
-   help us to understand how computers work.
+Before we start, let me stress again that we will be building an **interpreter**.
+This means we'll be able to run a C source file as if it was a script. The main
+reasons behind this choice are twofold:

-## Three Phases
+1. An interpreter differs from a compiler only in the code generation phase,
+   thus we'll still learn all the core techniques of building a compiler
+   (such as lexical analyzing and parsing).
+2. We will build our own virtual machine and [assembly instruction set];
+   this will help us understand how computers work.

-Given a source file, normally the compiler will cast three phases of
-processing:
-1. Lexical Analysis: converts source strings into internal token stream.
-2. Parsing: consumes token stream and constructs syntax tree.
-3. Code Generation: walk through the syntax tree and generate code for target
-   platform.
+## The Three Phases of Compiling

-Compiler Construction had been so mature that part 1 & 2 can be done by
-automation tools. For example, flex can be used for lexical analysis, bison
-for parsing. They are powerful but do thousands of things behind the scene. In
-order to fully understand how to build a compiler, we are going to build them
-all from scratch.
+Given a source file, the compiler usually carries out three processing phases:

-Thus we will build our interpreter in the following steps:
+1. **Lexical Analysis**:
+   converts source strings into an internal stream of tokens.
+2. **Parsing**: consumes the token stream and constructs a syntax tree.
+3.
**Code Generation**: + walks through the syntax tree and generates code for target platform. -1. Build our own virtual machine and instruction set. This is the target - platform that will be using in our code generation phase. -2. Build our own lexer for C compiler. -3. Write a recusion descent parser on our own. +Compiler Construction is so mature that phases one and two can be done by +automation tools. For example, flex can be used for lexical analysis, bison for +parsing. These are powerful tools, which do thousands of things behind the +scene. In order to fully understand how to build a compiler, we're going to +handcraft all three phases, from scratch. -## Skeleton of our compiler +Therefore, we'll build our interpreter in the following steps: +1. Build our own virtual machine and instruction set. + This will be our target platform in the code generation phase. +2. Build our own lexer for C compilers. +3. Write a [recursive descent parser] on our own. -Modeling after c4, our compiler includes 4 main functions: -1. `next()` for lexical analysis; get the next token; will ignore spaces tabs - etc. -2. `program()` main entrance for parser. -3. `expression(level)`: parser expression; level will be explained in later - chapter. -4. `eval()`: the entrance for virtual machine; used to interpret target - instructions. +## The Skeleton of Our Compiler -Why would `expression` exist when we have `program` for parser? That's because -the parser for expressions is relatively independent and complex, so we put it -into a single module(function). +Modeled after [c4], our compiler includes four main functions: -The code is as following: +1. `next()` — + for lexical analysis; fetches the next token; ignores spaces, tabs, etc. +2. `program()` — parser main entry point. +3. `expression(level)` — + expressions parser; it will be explained in a later chapter. +4. `eval()` — + virtual machine entry point; used to interpret target instructions. 
+ +Why do we need `expression()` when we already have `program()` for the parser? +That's because the expressions parser is relatively independent and complex, +so we put it into a single module (function). + +The code is as follows: ```c #include #include #include #include -#define int long long // work with 64bit target +#define int long long // work with 64-bit target int token; // current token -char *src, *old_src; // pointer to source code string; +char *src, *old_src; // pointer to source code string int poolsize; // default size of text/data/stack int line; // line number @@ -119,34 +122,46 @@ int main(int argc, char **argv) } ``` -That's quite some code for the first chapter of the article. Nevertheless it -is actually simple enough. The code tries to reads in a source file, character -by character and print them out. +That's quite some code for the first chapter of the tutorial. Nevertheless it's +actually quite simple. The code tries to reads a source file, character by +character, and print them out. -Currently the lexer `next()` does nothing but returning the characters as they -are in the source file. The parser `program()` doesn't take care of its job -either, no syntax trees are generated, no target codes are generated. +Currently, the lexer function `next()` does nothing except returning the +characters as they are encountered in the source file. The parser's `program()` +doesn't take care of its job either — it doesn't generate any syntax trees, nor +target code. The important thing here is to understand the meaning of these functions and -how they are hooked together as they are the skeleton of our interpreter. -We'll fill them out step by step in later chapters. +how they are hooked together, since they constitute the skeleton of our +interpreter. We'll fill them out step by step, in the upcoming chapters. 
-## Code
+
+## Source Code

The code for this chapter can be downloaded from
-[Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-0), or
-clone by:
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-0),
+or cloned via:

```
git clone -b step-0 https://github.com/lotabout/write-a-C-interpreter
```

-Note that I might fix bugs later, and if there is any incosistance between the
-artical and the code branches, follow the article. I would only update code in
-the master branch.
+> **NOTE** — I might fix bugs later; if you notice any inconsistencies between
+the tutorial and the code branches, follow the tutorial. I will only update
+code in the master branch.
+

## Summary

-After some boring typing, we have the simplest compiler: a do-nothing
-compiler. In next chapter, we will implement the `eval` function, i.e. our own
+After some boring typing, we now have the simplest compiler: a do-nothing
+compiler. In the next chapter, we'll implement the `eval` function, i.e. our
virtual machine. See you then.
+
+
+
+[assembly instruction set]: https://en.wikipedia.org/wiki/Instruction_set_architecture "Wikipedia » Instruction set architecture"
+[c4]: https://github.com/rswier/c4 "Visit the c4 repository on GitHub"
+[recursive descent parser]: https://en.wikipedia.org/wiki/Recursive_descent_parser "Wikipedia » Recursive descent parser"
diff --git a/tutorial/en/2-Virtual-Machine.md b/tutorial/en/2-Virtual-Machine.md
index 87c2ef0..a50905a 100644
--- a/tutorial/en/2-Virtual-Machine.md
+++ b/tutorial/en/2-Virtual-Machine.md
@@ -1,49 +1,52 @@
-In this chapter, we are going to build a virtual machine and design our own
+# 2. Virtual Machine
+
+In this chapter we're going to build a virtual machine and design our own
instruction set that runs on the VM. This VM will be the target platform of
the interpreter's code generation phase.

-If you've heard of JVM and bytecode, that's what we are trying to build, but a
-way way simpler one.
+If you've heard of [JVM] and [bytecode], that's what we're trying to build, but +a way way simpler one. + -## How computer works internally +## How Computers Work Internally There are three components we need to care about: CPU, registers and memory. -Code(or assembly instruction) are stored in the memory as binary data; CPU -will retrieve the instruction one by one and execute them; the running states -of the machine is stored in registers. +Code (assembly instructions) are stored in memory as binary data; the CPU will +retrieve the instruction one by one and execute them; the running states of the +machine are stored in registers. + ### Memory -Memory can be used to store data. By data I mean code(or called assembly -instructions) or other data such as the message you want to print out. All of -them are stored as binary. +Memory can be used to store data. By data I mean code (assembly instructions) +or other data such as the message you want to print out. All of them are +stored as binary data. -Modern operating system introduced *Virtual Memory* which maps memory -addresses used by a program called *virtual address* into physical addresses -in computer memory. It can hide the physical details of memory from the -program. +Modern operating systems introduced *Virtual Memory* which maps memory +addresses used by a program (called *virtual address*) into physical addresses +in computer memory. The benefit of virtual memory is that it can hide the details of a physical -memory from the programs. For example, in 32bit machine, all the available -memory address is `2^32 = 4G` while the actaul physical memory may be only +memory from the programs. For example, in a 32-bit machine, all the available +memory address is `2^32 = 4G` while the actual physical memory may be only `256M`. The program will still think that it can have `4G` memory to use, the OS will map them to physical ones. -Of course, you don't need to understand the details about that. 
But what you -should understand that a program's usable memory is partioned into several +Of course, you don't need to understand the full details of all this. What you +should understand is that a program's usable memory is partitioned into several segments: -1. `text` segment: for storing code(instructions). -2. `data` segment: for storing initialized data. For example `int i = 10;` - will need to utilize this segment. +1. `text` segment: for storing code (instructions). +2. `data` segment: for storing initialized data. + For example `int i = 10;` will need to utilize this segment. 3. `bss` segment: for storing un-initialized data. For example `int i[1000];` - does't need to occupy `1000*4` bytes, because the actual values in the + doesn't need to occupy `1000*4` bytes, because the actual values in the array don't matter, thus we can store them in `bss` to save some space. -4. `stack` segment: used to handling the states of function calls, such as +4. `stack` segment: used for handling the states of function calls, such as calling frames and local variables of a function. -5. `heap` segment: use to allocate memory dynamically for program. +5. `heap` segment: use to dynamically allocate memory for program. -An example of the layout of these segments here: +Here's an example of the layout of these segments: ``` +------------------+ @@ -64,20 +67,20 @@ An example of the layout of these segments here: +------------------+ ``` -Our virtual machine tends to be as simple as possible, thus we don't care -about the `bss` and `heap`. Our interperter don't support the initialization -of data, thus we'll merge the `data` and `bss` segment. More over, we only use -`data` segment for storing string literals. +Our virtual machine tends to be as simple as possible, thus we don't care about +the `bss` and `heap`. Our interpreter won't support data initialization, thus +we'll merge the `data` and `bss` segments. 
Moreover, we'll only use the `data`
+segment for storing string literals.

We'll drop `heap` as well. This might sound insane, because theoretically the
-VM should maintain a `heap` for allocation memories. But hey, an interpreter
-itself is also a program which had its heap allocated by our computer. We can
-tell the program that we want to interpret to utilize the interpreter's heap
-by introducing an instruction `MSET`. I won't say it is cheating because it
-reduces the VM's complexity without reducing the knowledge we want to learn
-about compiler.
+VM should maintain a `heap` for allocating memory. But hey, an interpreter
+itself is also a program which has its heap allocated by our computer. We can
+tell the interpreted program to utilize the interpreter's heap by introducing
+the `MSET` instruction. I won't say it's cheating, because it reduces the VM's
+complexity without subtracting from the knowledge we want to gain on compilers
+design.

-Thus we adds the following codes in the global area:
+Thus we'll add the following lines to the code, in the global area:

```c
int *text, // text segment
@@ -87,12 +90,13 @@ char *data; // data segment
```

Note the `int` here. What we should write is actually `unsigned` because we'll
-store unsigned data(such as pointers/memory address) in the `text` segment.
-Note we want our interpreter to be bootstraping (interpret itself), thus we
-don't want to introduce `unsigned`. Finally the `data` is `char *` because
+store unsigned data (such as pointers/memory addresses) in the `text` segment.
+But since we want our interpreter to be bootstrapping (interpret itself), we
+don't want to introduce `unsigned`. Furthermore, the `data` is `char *` because
we'll use it to store string literals only.
-Finally, add the code in the main function to actually allocate the segments: +Finally, add to the `main()` function the code to actually allocate the +segments: ```c int main() { @@ -125,22 +129,22 @@ int main() { ### Registers Registers are used to store the running states of computers. There are several -of them in real computers while our VM uses only 4: - -1. `PC`: program counter, it stores an memory address in which stores the - **next** instruction to be run. -2. `SP`: stack pointer, which always points to the *top* of the stack. Notice - the stack grows from high address to low address so that when we push a new - element to the stack, `SP` decreases. -3. `BP`: base pointer, points to some elements on the stack. It is used in - function calls. -4. `AX`: a general register that we used to store the result of an - instruction. +of them in real computers, whereas our VM uses only four: + +1. `PC`: **program counter** — + stores the memory address of the *next* instruction to be run. +2. `SP`: **stack pointer** — always points to the *top* of the stack. + The stack grows from high address to low address, so whenever we push a new + element onto the stack, `SP` decreases. +3. `BP`: **base pointer** — + points to some elements on the stack. It's used in function calls. +4. `AX` — + a general purpose register used to store the result of an instruction. In order to fully understand why we need these registers, you need to -understand what states will a computer need to store during computation. They -are just a place to store value. You will get a better understanding after -finished this chapter. +understand what states a computer will need to store during computation. They +are just a place to store values. You will get a better understanding after +finishing this chapter. Well, add some code into the global area: @@ -149,8 +153,8 @@ int *pc, *bp, *sp, ax, cycle; // virtual machine registers ``` And add the initialization code in the `main` function. 
Note that `pc` should -points to the `main` function of the program to be interpreted. But we don't -have any code generation yet, thus skip for now. +points to the `main` function of the program being interpreted. But since we +don't have any code generation yet, we'll just skip it for now. ```c memset(stack, 0, poolsize); @@ -163,17 +167,19 @@ have any code generation yet, thus skip for now. program(); ``` -What's left is the CPU part, what we should actually do is implementing the -instruction sets. We'll save that for a new section. +What's left now is the CPU part. What we should actually do is implementing the +instruction set. We'll save that for a new section. + ## Instruction Set -Instruction set is a set of instruction that CPU can understand, it is the -language we need to master in order to talk to CPU. We are going to design a -language for our VM, it is based on x86 instruction set yet much simpler. +The [instruction set] is a set of instruction that the CPU can understand, +it's the language we need to master in order to talk to CPU. We are going to +design a language for our VM, it's based on x86 instruction set, but much +simpler. We'll start by adding an `enum` type listing all the instructions that our VM -would understand: +will understand: ```c // instructions @@ -182,40 +188,40 @@ enum { LEA ,IMM ,JMP ,CALL,JZ ,JNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PUSH, OPEN,READ,CLOS,PRTF,MALC,MSET,MCMP,EXIT }; ``` -These instruction are ordered intentionally as you will find out later that -instructions with arguments comes first while those without arguments comes +These instruction are intentionally ordered; as you will find out later, +instructions with arguments come first, while those without arguments comes after. The only benefit here is for printing debug info. However we will not rely on this order to introduce them. ### MOV -`MOV` is one of the most fundamental instructions you'll met. 
Its job is to -move data into registers or the memory, kind of like the assignment expression -in C. There are two arguments in `x86`'s `MOV` instruction: `MOV dest, -source`(Intel style), `source` can be a number, a register or a memory -address. +`MOV` is one of the most fundamental instructions you'll meet. Its job is to +move data into registers or the memory, similar to an assignment expression +in C. There are two arguments in x86's `MOV` instruction: +`MOV dest, source` (Intel style), where `source` can be a number, a register +or a memory address. -But we won't follow `x86`. On one hand our VM has only one general -register(`AX`), on the other hand it is difficult to determine the type of the -arguments(wheter it is number, register or adddress). Thus we tear `MOV` apart -into 5 pieces: +But we won't follow x86. On one hand our VM has only one general register +(`AX`), on the other hand it is difficult to determine the type of the +arguments (whether it's a number, register or address). Thus we tear `MOV` +apart into 5 pieces: 1. `IMM ` to put immediate `` into register `AX`. 2. `LC` to load a character into `AX` from a memory address which is stored in `AX` before execution. 3. `LI` just like `LC` but dealing with integer instead of character. -4. `SC` to store the character in `AX` into the memory whose address is stored - on the top of the stack. +4. `SC` to store the character in `AX` into the memory address which is stored + on top of the stack. 5. `SI` just like `SC` but dealing with integer instead of character. -What? I want one `MOV`, not 5 instruction just to replace it! Don't panic! +What? I want one `MOV`, not five instructions just to replace it! Don't panic! You should know that `MOV` is actually a set of instruction that depends on the `type` of its arguments, so you got `MOVB` for bytes and `MOVW` for words, etc. Now `LC/SC` and `LI/SI` don't seems that bad, uha? 
-Well the most important reason is that by turning `MOV` into 5 sub +Well the most important reason is that by turning `MOV` into five sub instructions, we reduce the complexity a lot! Only `IMM` will accept an -argument now yet no need to worry about its type. +argument now, and now there's no need to worry about its type. Let's implement it in the `eval` function: @@ -240,15 +246,16 @@ void eval() { You might wonder why we store the address in `AX` register for `LI/LC` while storing them on top of the stack segment for `SI/SC`. The reason is that the -result of an instruction is stored in `AX` by default. The memory address is also -calculate by an instruction, thus it is more convenient for `LI/LC` to fetch -it directly from `AX`. Also `PUSH` can only push the value of `AX` onto the -stack. So if we want to put an address onto the stack, we'll have to store it -in `AX` anyway, why not skip that? +result of an instruction is stored in `AX` by default. The memory address is +also calculated by an instruction, thus it is more convenient for `LI/LC` to +fetch it directly from `AX`. Also `PUSH` can only push the value of `AX` onto +the stack. So if we want to put an address onto the stack, we'll have to store +it in `AX` anyway, why not skip that? + ### PUSH -`PUSH` in `x86` can push an immediate value or a register's value onto the +`PUSH` in x86 can push an immediate value or a register's value onto the stack. Here in our VM, `PUSH` will push the value in `AX` onto the stack, only. @@ -258,7 +265,7 @@ else if (op == PUSH) {*--sp = ax;} // push t ### JMP -`JMP ` will unconditionally set the value `PC` register to ``. +`JMP ` will unconditionally set the value of `PC` register to ``. ```c else if (op == JMP) {pc = (int *)*pc;} // jump to the address @@ -267,10 +274,11 @@ else if (op == JMP) {pc = (int *)*pc;} // jump t Notice that `PC` points to the **NEXT** instruction to be executed. Thus `*pc` stores the argument of `JMP` instruction, i.e. the ``. 
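The five `MOV` replacements compose naturally. Below is a free-standing sketch, not the chapter's code: the opcode values are arbitrary, and it uses `intptr_t` instead of the tutorial's `int` so addresses also fit on 64-bit hosts. It hand-assembles `x = 42; return x;` out of `IMM`, `PUSH`, `SI` and `LI`:

```c
#include <stdint.h>

/* Hypothetical opcode values, just for this sketch. */
enum { IMM, LI, SI, PUSH, EXIT };

/* Minimal dispatch loop for the MOV replacements (plus EXIT).
   Simplified: EXIT here just returns the accumulator. */
intptr_t run(intptr_t *code) {
    intptr_t stack[16];
    intptr_t *sp = stack + 16;                    /* empty stack, grows down */
    intptr_t *pc = code;
    intptr_t ax = 0, op;

    while (1) {
        op = *pc++;
        if      (op == IMM)  ax = *pc++;              /* ax = <num>        */
        else if (op == LI)   ax = *(intptr_t *)ax;    /* ax = *ax          */
        else if (op == SI)   *(intptr_t *)*sp++ = ax; /* *<stack top> = ax */
        else if (op == PUSH) *--sp = ax;              /* push ax           */
        else if (op == EXIT) return ax;
    }
}

/* `x = 42; return x;` hand-assembled with the instructions above. */
intptr_t demo(void) {
    static intptr_t x;
    intptr_t code[] = {
        IMM, (intptr_t)&x, PUSH,   /* address of x onto the stack */
        IMM, 42, SI,               /* x = 42                      */
        IMM, (intptr_t)&x, LI,     /* ax = x                      */
        EXIT
    };
    x = 0;
    return run(code);
}
```

Note how `SI` takes its target address from the stack while `LI` takes it from `AX`, exactly as described above.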
+ ### JZ/JNZ -We'll need conditional jump so as to implement `if` statement. Only two -are needed here to jump when `AX` is `0` or not. +We'll need conditional jumps to implement `if` statements. +Only two are needed here, to jump when `AX` is `0` or not. ```c else if (op == JZ) {pc = ax ? pc + 1 : (int *)*pc;} // jump if ax is zero @@ -283,11 +291,11 @@ It will introduce the calling frame which is hard to understand, so we put it together to give you an overview. We'll add `CALL`, `ENT`, `ADJ` and `LEV` in order to support function calls. -A function is a block of code, it may be physically far form the instruction +A function is a block of code, it may be physically far from the instruction we are currently executing. So we'll need to `JMP` to the starting point of a function. Then why introduce a new instruction `CALL`? Because we'll need to do some bookkeeping: store the current execution position so that the program -can resume after function call returns. +can resume after the function call returns. So we'll need `CALL ` to call the function whose starting point is `` and `RET` to fetch the bookkeeping information to resume previous @@ -303,16 +311,17 @@ We've commented out `RET` because we'll replace it with `LEV` later. In practice the compiler should deal with more: how to pass the arguments to a function? How to return the data from the function? -Our convention here about returning value is to store it into `AX` no matter -you're returning a value or a memory address. Then how about argument? +Our convention here for returning a value is to store it into `AX`, no matter +whether you're returning a value or a memory address. +Then what about arguments? -Different language has different convension, here is the standard for C: +Different languages have [different conventions], here is the standard for C: 1. It is the caller's duty to push the arguments onto stack. -2. After the function call returns, caller need to pop out the arguments. -3. 
The arguments are pushed in the reversed order.
+2. After the function call returns, the caller needs to pop out the arguments.
+3. The arguments are pushed in reverse order.

-Note that we won't follow rule 3. Now let's check how C standard works(from
+Note that we won't follow rule 3. Now let's check how the C standard works (from
[Wikipedia](https://en.wikipedia.org/wiki/X86_calling_conventions)):

```c
@@ -358,18 +367,19 @@ reasons:

1. `push ebp`, while our `PUSH` doesn't accept arguments at all.
2. `move ebp, esp`, our `MOV` instruction cannot do this.
-3. `add esp, 12`, well, still cannot do this(as you'll find out later).
+3. `add esp, 12`, well, still cannot do this (as you'll find out later).
+
+Our instruction set is so simple that we cannot support function calls! But
+we won't surrender and change our design, because that would be too complex
+for us. So we add more instructions instead! Adding new instructions to real
+computers is an expensive operation, but not so with virtual machines.

-Our instruction set is too simply that we cannot not support function calls!
-But we will not surrender to change our design cause it will be too complex
-for us. So we add more instructions! It might cost a lot in real computers to
-add a new instruction, but not for virtual machine.

### ENT

`ENT ` is called when we are about to enter the function call to
"make a new calling frame". It will store the current `PC` value onto the
stack, and
-save some space(`` bytes) to store the local variables for function.
+save some space (`` bytes) to store the function's local variables.
``` ; make new call frame @@ -378,7 +388,7 @@ mov ebp, esp sub 1, esp ; save stack for variable: i ``` -Will be translated into: +will be translated into: ```c else if (op == ENT) {*--sp = (int)bp; bp = sp; sp = sp - *pc++;} // make new stack frame @@ -386,8 +396,8 @@ else if (op == ENT) {*--sp = (int)bp; bp = sp; sp = sp - *pc++;} // make n ### ADJ -`ADJ ` is to adjust the stack, to "remove arguments from frame". We need -this instruction mainly because our `ADD` don't have enough power. So, treat +`ADJ ` is to adjust the stack, to "remove arguments from frame." We need +this instruction mainly because our `ADD` doesn't have enough power. So, treat it as a special add instruction. ``` @@ -395,16 +405,17 @@ it as a special add instruction. add esp, 12 ``` -Is implemented as: +is implemented as: ```c else if (op == ADJ) {sp = sp + *pc++;} // add esp, ``` + ### LEV -In case you don't notice, our instruction set don't have `POP`. `POP` in our -compiler would only be used when function call returns. Which is like this: +In case you didn't notice, our instruction set doesn't have `POP`. `POP` in our +compiler would only be used when a function call returns. Which is like this: ``` ; restore old call frame @@ -425,10 +436,10 @@ else if (op == LEV) {sp = bp; bp = (int *)*sp++; pc = (int *)*sp++;} // restor The instructions introduced above try to solve the problem of creating/destructing calling frames, one thing left here is how to fetch the -arguments *inside* sub function. +arguments *inside* sub functions. But we'll check out what a calling frame looks like before learning how to -fetch arguments (Note that arguments are pushed in its calling order): +fetch arguments (note that arguments are pushed in their calling order): ``` sub_function(arg1, arg2, arg3); @@ -462,15 +473,16 @@ else if (op == LEA) {ax = (int)(bp + *pc++);} // load a Together with the instructions above, we are able to make function calls. 
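Putting the pieces together, here is a sketch of how a call to `add(a, b) { return a + b; }` could be hand-assembled with the instructions just introduced. This is illustrative pseudo-assembly, not output of our compiler; the `<addr of add>` placeholder and the argument offsets follow the frame diagram above (two arguments, no locals):

```
; caller side: add(3, 4)
IMM  3                  ; ax = 3
PUSH                    ; push first argument
IMM  4                  ; ax = 4
PUSH                    ; push second argument
CALL <addr of add>      ; push return address, jump into add
ADJ  2                  ; caller removes the two arguments

; callee side: add(a, b) { return a + b; }
ENT  0                  ; new frame, no local variables
LEA  3                  ; ax = address of first argument (bp + 3)
LI                      ; ax = first argument
PUSH                    ; first operand onto the stack
LEA  2                  ; ax = address of second argument (bp + 2)
LI                      ; ax = second argument
ADD                     ; ax = a + b (ADD is in the enum; see next section)
LEV                     ; restore caller's frame and return, result in ax
```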
-### Mathmetical Instructions

-Our VM will provide an instruction for each operators in C language. Each
-operator has two arguments: the first one is stored on the top of the stack
-while the second is stored in `AX`. The order matters especially in operators
-like `-`, `/`. After the calculation is done, the argument on the stack will
-be poped out and the result will be stored in `AX`. So you are not able to
-fetch the first argument from the stack after the calculation, please note
-that.
+### Mathematical Instructions
+
+Our VM will provide an instruction for each operator in the C language. Each
+operator has two arguments: the first one is stored on the top of the stack,
+while the second is stored in `AX`. Their order matters, especially with
+operators like `-` and `/`. After the calculation is done, the argument on the
+stack will be popped out and the result will be stored in `AX`. So you are not
+able to fetch the first argument from the stack after the calculation, please
+note that.

```c
else if (op == OR)  ax = *sp++ | ax;
@@ -493,17 +505,18 @@ else if (op == MOD) ax = *sp++ % ax;

### Built-in Instructions

-Besides core logic, a program will need input/output mechanism to be
+Besides core logic, a program will need input/output mechanisms to be
able to interact with. `printf` in C is one of the commonly used output
functions. `printf` is very complex to implement but unavoidable if our
-compiler wants to be bootstraping(interpret itself) yet it is meaningless for
+compiler wants to be bootstrapping (interpret itself), yet it is meaningless for
building a compiler.

Our plan is to create new instructions to build a bridge between the
interpreted program and the interpreter itself. So that we can utilize the
-libraries of the host system(your computer that runs the interpreter).
+libraries of the host system (your computer that runs the interpreter).
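To make the "bridge" idea concrete, here is a tiny hypothetical sketch (opcode names and the loop are ours, not the chapter's code) where one VM instruction is implemented by simply calling into the host's `printf`:

```c
#include <stdio.h>

/* Made-up opcodes: PRT prints ax through the host C library. */
enum { IMM, PRT, HALT };

/* Minimal loop: the PRT opcode is a direct bridge to the host's printf. */
int run_bridge(int *code) {
    int *pc = code, ax = 0, op;
    while (1) {
        op = *pc++;
        if      (op == IMM)  ax = *pc++;             /* load immediate     */
        else if (op == PRT)  printf("%d\n", ax);     /* host library call  */
        else if (op == HALT) return ax;              /* stop, return ax    */
    }
}

/* Print 7, then halt. */
int demo(void) {
    int code[] = { IMM, 7, PRT, HALT };
    return run_bridge(code);
}
```

The interpreted program never links against the host's C library itself; the interpreter does that on its behalf, which is exactly the trick used for `PRTF`, `MALC` and friends.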
-We'll need `exit`, `open`, `close`, `read`, `printf`, `malloc`, `memset` and `memcmp`:
+We'll need `exit`, `open`, `close`, `read`, `printf`, `malloc`, `memset` and
+`memcmp`:

```c
else if (op == EXIT) { printf("exit(%d)", *sp); return *sp;}
@@ -516,7 +529,7 @@ else if (op == MSET) { ax = (int)memset((char *)sp[2], sp[1], *sp);}
else if (op == MCMP) { ax = memcmp((char *)sp[2], (char *)sp[1], *sp);}
```

-At last, add some error handling:
+Lastly, add some error handling:

```c
else {
@@ -525,6 +538,7 @@ else {
}
```

+
## Test

Now we'll do some "assembly programing" to calculate `10 + 20`:
@@ -549,36 +563,46 @@ int main(int argc, char *argv[])
}
```

-Compile the interpreter with `gcc xc-tutor.c` and run it with `./a.out
-hello.c`, got the following result:
+Compile the interpreter with `gcc xc-tutor.c` and run it with
+`./a.out hello.c`; you should get the following result:

```
exit(30)
```

-Note that we specified `hello.c` but it is actually not used, we need it
-because the interpreter we build in last chapter needs it.
+Note that we specified `hello.c`, but it's not actually being used; we need it
+only because the interpreter we built in the previous chapter needs it.
+
+Well, it seems that our VM works well. :smile:

-Well, it seems that our VM works well :)

## Summary

-We learned how computer works internally and build our own instruction set
-modeled after `x86` assembly instructions. We are actually trying to learn
-assembly language and how it actually work by building our own version.
+We learned how a computer works internally, and built our own instruction set
+modeled after x86 Assembly instructions. We are in fact trying to learn
+Assembly language and how it actually works, by building our own version of it.
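The `10 + 20` test can also be condensed into a free-standing sketch of the VM. This is a simplified version of the chapter's code (names and structure are ours), keeping only the four opcodes the test needs:

```c
/* Opcodes reduced to the four the test needs. */
enum { IMM, PUSH, ADD, EXIT };

/* Stripped-down eval(): runs `code` and returns the value exit() receives. */
int eval(int *code) {
    int stack[32];
    int *sp = stack + 32;      /* stack grows downward         */
    int *pc = code;
    int ax = 0, op;

    while (1) {
        op = *pc++;
        if      (op == IMM)  ax = *pc++;       /* load immediate         */
        else if (op == PUSH) *--sp = ax;       /* save ax on the stack   */
        else if (op == ADD)  ax = *sp++ + ax;  /* pop one operand, add   */
        else if (op == EXIT) return *sp;       /* exit(<top of stack>)   */
    }
}

/* The chapter's test program: 10 + 20. */
int demo(void) {
    int code[] = { IMM, 10, PUSH, IMM, 20, ADD, PUSH, EXIT };
    return eval(code);
}
```

Running `demo()` walks the same instruction sequence as the hand-assembled test and yields `30`.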
The code for this chapter can be downloaded from -[Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-1), or -clone by: +[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-1), +or cloned by: ``` git clone -b step-1 https://github.com/lotabout/write-a-C-interpreter ``` Note that adding a new instruction would require designing lots of circuits -and cost a lot. But it is almost free to add new instructions in our virtual -machine. We are taking advantage of this spliting the functions of an -intruction into several to simplify the implementation. +and cost a lot. But it's almost free to add new instructions in our virtual +machine. We are taking advantage of this by splitting the functions of an +instruction into several, to simplify the implementation. If you are interested, build your own instruction sets! + + + +[bytecode]: https://en.wikipedia.org/wiki/Bytecode "Wikipedia » Bytecode" +[different conventions]: https://en.wikipedia.org/wiki/Calling_convention "Wikipedia » Calling convention" +[instruction set]: https://en.wikipedia.org/wiki/Instruction_set_architecture "Wikipedia » Instruction set architecture" +[Java virtual machine]: https://en.wikipedia.org/wiki/Java_virtual_machine "Wikipedia » Java virtual machine" diff --git a/tutorial/en/3-Lexer.md b/tutorial/en/3-Lexer.md index 73a4d39..00d8692 100644 --- a/tutorial/en/3-Lexer.md +++ b/tutorial/en/3-Lexer.md @@ -1,10 +1,12 @@ +# 3. The Lexer + > lexical analysis is the process of converting a sequence of characters (such > as in a computer program or web page) into a sequence of tokens (strings with > an identified "meaning"). -Normally we represent the token as a pair: `(token type, token value)`. For -example, if a program's source file contains string: "998", the lexer will -treat it as token `(Number, 998)` meaning it is a number with value of `998`. +Normally we represent the token as a pair: `(token type, token value)`. 
+For example, if a program's source file contains the string "998", the lexer
+will treat it as token `(Number, 998)`, meaning it's a number with value `998`.

## Lexer vs Compiler

Let's first look at the structure of a compiler:

+-------+                +--------+
```

-The Compiler can be treated as a transformer that transform C source code into
-assembly. In this sense, lexer and parser are transformers as well: Lexer
-takes C source code as input and output token stream; Parser will consume the
-token stream and generate assembly code.
+The compiler can be treated as a transformer that transforms C source code into
+Assembly. In this sense, both lexer and parser are transformers as well: the
+lexer takes C source code as input and outputs a token stream; the parser will
+consume the token stream and generate Assembly code.

-Then why do we need lexer and a parser? Well the Compiler's job is hard! So we
-recruit lexer to do part of the job and parser to do the rest so that each
-will need to deal with simple one only.
+Then why do we need a lexer and a parser? Well the compiler's job is hard! So
+we recruit the lexer to do part of the job, and the parser to do the rest of
+it, so that each will only need to deal with a single and simple job.

That's the value of a lexer: to simplify the parser by converting the stream
-of source code into token stream.
+of source code into a stream of tokens.
+

## Implementation Choice

Before we start I want to let you know that crafting a lexer is boring and
-error-prone. That's why geniuses out there had already created automation
-tools to do the job. `lex/flex` are example that allows us to describe the
-lexical rules using regular expressions and generate lexer for us.
+error-prone. That's why geniuses out there have already created automation
+tools to do the job. Tools like `lex/flex` allow us to describe the lexical
+rules using regular expressions and then generate a lexer for us.
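As a warm-up, the `(token type, token value)` idea from the start of the chapter can be sketched for decimal numbers. The names `Num`, `token`, `token_val` and `scan_number` mirror the tutorial's style, but this is our own toy code, not the chapter's lexer:

```c
#include <ctype.h>

/* Token types start above the ASCII range, so single characters
   can double as their own token codes. */
enum { Num = 128 };

int token;                 /* type of the current token  */
int token_val;             /* its value, e.g. 998        */

/* Scan one decimal number at `src`; returns a pointer past it. */
const char *scan_number(const char *src) {
    token_val = 0;
    while (isdigit((unsigned char)*src)) {
        token_val = token_val * 10 + (*src - '0');
        src++;
    }
    token = Num;
    return src;
}

/* Helper: does `s` scan to the pair (Num, v)? */
int scans_as(const char *s, int v) {
    scan_number(s);
    return token == Num && token_val == v;
}
```

Feeding it `"998"` produces exactly the pair `(Num, 998)` described above.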
+
+Also note that we won't follow the graph shown in the previous section, i.e.
+we won't be converting all the source code into a token stream at once. The
+reasons being:

-Also note that we won't follow the graph in the section above, i.e. not
-converting all source code into token stream at once. The reasons are:

+1. Converting source code into token stream is stateful.
+   How a string is interpreted depends on the context where it appears.
+2. Storing all the tokens is a waste, because only a few of them will be
+   accessed at any given time.

-1. Converting source code into token stream is stateful. How a string is
-   interpreted is related with the place where it appears.
-2. It is a waste to store all the tokens because only a few of them will be
-   accessed at the same time.

+Thus we'll implement the `next()` function, which returns one token per call.

-Thus we'll implement one function: `next()` that returns a token in one call.

-## Tokens Supported
+## Supported Tokens

Add the definition into the global area:

```c
@@ -59,24 +64,27 @@ enum {
};
```

-These are all the tokens our lexer can understand. Our lexer will interpret
+These are all the tokens our lexer can understand. Our lexer will interpret
the string `=` as token `Assign`; `==` as token `Eq`; `!=` as `Ne`; etc. So
we have the impression that a token will contain one or more characters.

-That is the reason why lexer can reduce the complexity, now the parser doesn't
-have to look at several character to identify a the meaning of a substring.
-The job had been done.
+This is the reason why the lexer can reduce complexity: now the parser doesn't
+have to look at several characters to identify the meaning of a substring;
+this job has already been taken care of.

-Of course, the tokens above is properly ordered reflecting their priority in
-the C programming language. `*(Mul)` operator for example has higher priority
-the `+(Add)` operator. We'll talk about it later. 
+Of course, the tokens above are properly ordered, reflecting their priority in
+the C programming language. The `*(Mul)` operator, for example, has higher
+priority than the `+(Add)` operator. We'll talk more about this later.

-At last, there are some characters we don't included here are themselves a
-token such as `]` or `~`. The reason that we done encode them like others are:
+Finally, there are some characters we didn't include here, although they are
+tokens themselves, such as `]` or `~`. The reasons why we're not encoding them
+like the others are:

-1. These tokens contains only single character, thus are easier to identify.
+1. These tokens consist of just a single character, therefore they're easier
+   to identify.
2. They are not involved into the priority battle.

+
## Skeleton of Lexer

```c
@@ -91,23 +99,25 @@ void next() {
  }
}
```

-While do we need `while` here knowing that `next` will only parse one token?
-This raises a quesion in compiler construction(remember that lexer is kind of
-compiler?): How to handle error?
+Why do we need a `while` here, knowing that `next()` will only parse one token?
+This raises a question in compiler construction (remember that a lexer is a
+kind of compiler?): How to handle errors?

-Normally we had two solutions:
+Normally we have two solutions:

-1. points out where the error happans and quit.
-2. points out where the error happans, skip it, and continue.
+1. Point out where the error occurred, and quit.
+2. Point out where the error occurred, skip it, and continue.

-That will explain the existance of `while`: to skip unknown characters in the
-source code. Meanwhile it is used to skip whitespaces which is not the actual
+This explains the presence of `while`: to skip unknown characters in the
+source code. Meanwhile, it's used to skip whitespace, which is not actually
part of a program. They are treated as separators only.

+
## Newline

-It is quite like space in that we are skipping it. 
The only difference is that
-we need to increase the line number once a newline character is met:
+A newline is similar to whitespace, because we're skipping it. The only
+difference is that we need to increase the line number whenever a newline
+character is encountered:

```c
// parse token here
@@ -120,8 +130,8 @@ if (token == '\n') {
}
```

## Macros

-Macros in C starts with character `#` such as `#include `. Our
-compiler don't support any macros, so we'll skip all of them:
+Macros in C start with the character `#`, e.g. `#include `.
+Our compiler doesn't support any macros, so we'll skip all of them:

```c
else if (token == '#') {
@@ -132,19 +142,20 @@ else if (token == '#') {
}
```

-## Identifers and Symbol Table

-Identifier is the name of a variable. But we don't actually care about the
-names in lexer, we cares about the identity. For example: `int a;`
-declares a variable, we have to know that the statement `a = 10` that comes
-after refers to the same variable that we declared before.
+## Identifiers and Symbol Table
+
+An identifier is the name of a variable. But we don't actually care about the
+names in the lexer, we care about the identity. For example: `int a;` declares
+a variable, and we have to know that the statement `a = 10` that comes after
+refers to the same variable that we declared before.

Based on this reason, we'll make a table to store all the names we've already
-met and call it Symbol Table. We'll look up the table when a new
-name/identifier is accountered. If the name exists in the symbol table, the
+met and call it Symbol Table. We'll look up the table whenever a new
+name/identifier is encountered. If the name exists in the symbol table, its
identity is returned.

-Then how to represent an identity?
+Then, how do we represent an identity?

```c
struct identifier {
@@ -162,25 +173,24 @@ struct identifier {

We'll need a little explanation here:

-1. `token`: is the token type of an identifier. Theoretically it should be
-   fixed to type `Id`. 
But it is not true because we will add keywords(e.g - `if`, `while`) as special kinds of identifier. -2. `hash`: the hash value of the name of the identifier, to speed up the - comparision of table lookup. -3. `name`: well, name of the identifier. -4. `class`: Whether the identifier is global, local or constants. +1. `token`: is the token type of an identifier. + Theoretically it should be fixed to type `Id`. But it is not true because we + will add keywords (e.g `if`, `while`) as special kinds of identifier. +2. `hash`: the hash value of the identifier's name, to speed up comparisons + during table lookup operations. +3. `name`: well, the identifier's name. +4. `class`: Whether the identifier is global, local or constant. 5. `type`: type of the identifier, `int`, `char` or pointer. 6. `value`: the value of the variable that the identifier points to. -7. `BXXX`: local variable can shadow global variable. It is used to store - global ones if that happens. +7. `BXXX`: local variables can [shadow] global variables. + It's used to store global ones, if that happens. -Traditional symbol table will contain only the unique identifer while our -symbol table stores other information that will only be accessed by parser +A traditional symbol table will contain only the unique identifier, whereas our +symbol table stores other information that will only be accessed by the parser, such as `type`. -Yet sadly, our compiler do not support `struct` while we are trying to be -bootstrapping. So we have to compromise in the actual structure of an -identifier: +Yet sadly, our compiler won't support `struct` while we are trying to bootstrap +it. So we have to compromise in the actual structure of an identifier: ``` Symbol table: @@ -190,8 +200,8 @@ Symbol table: |<--- one single identifier --->| ``` -That means we use a single `int` array to store all identifier information. -Each ID will use 9 cells. 
The code is as following:
+This means we use a single `int` array to store all identifier information.
+Each ID will use 9 cells. The code is as follows:

```c
int token_val;                // value of current token (mainly for number)
@@ -229,12 +239,13 @@ void next() {
}
```

-Note that the search in symbol table is linear search.
+Note that the search in the symbol table is a linear search.
+

## Number

We need to support decimal, hexadecimal and octal. The logic is quite
-straightforward except how to get the hexadecimal value. Maybe..
+straightforward, except perhaps how to get the hexadecimal value.

```c
token_val = token_val * 16 + (token & 0x0F) + (token >= 'A' ? 9 : 0);
```
@@ -286,14 +297,14 @@ void next() {

## String Literals

-If we find any string literal, we need to store it into the `data segment`
-that we introduced in a previous chapter and return the address. Another issue
-is we need to care about escaped characters such as `\n` to represent newline
-character. But we don't support escaped characters other than `\n` like `\t`
-or `\r`because we aim at bootstrapping only. Note that we still support
-syntax that `\x` to be character `x` itself.
+If we encounter a string literal, we need to store it into the `data segment`
+that we introduced in a previous chapter, and return its address. Another issue
+we need to care about are escaped characters, e.g. `\n` to represent a newline
+character. But we won't support escaped characters other than `\n` (like `\t`
+or `\r`) because we're aiming at bootstrapping only. Note that we still support
+the syntax where `\x` represents the `x` character itself.

-Our lexer will analyze single character (e.g. `'a'`) at the same time. Once
+Our lexer will analyze one character (e.g. `'a'`) at a time. Once
a character is found, we return it as a `Num`.

```c
@@ -333,8 +344,8 @@ void next() {

## Comments

-Only C++ style comments(e.g. `// comment`) is supported. C style (`/* ... */`)
-is not supported.
+Only C++ style comments (e.g. 
`// comment`) are supported. +C style comments (`/* ... */`) are not supported. ```c void next() { @@ -357,14 +368,14 @@ void next() { } ``` -Now we'll introduce the concept: `lookahead`. In the above code we see that +Now we'll introduce a new concept: `lookahead`. In the above code we see that for source code starting with character `/`, either 'comment' or `/(Div)` may be encountered. -Sometimes we cannot decide which token to generate by only looking at the current -character (such as the above example about divide and comment), thus we need to -check the next character (called `lookahead`) in order to determine. In our -example, if it is another slash `/`, then we've encountered a comment line, +Sometimes we cannot decide which token to generate by only looking at the +current character (such as the above example about divide and comment), thus we +need to check the next character (called `lookahead`) in order to determine. In +our example, if it is another slash `/`, then we've encountered a comment line, otherwise it is a divide operator. Like we've said that a lexer and a parser are inherently a kind of compiler, @@ -372,13 +383,14 @@ Like we've said that a lexer and a parser are inherently a kind of compiler, instead of "character". The `k` in `LL(k)` of compiler theory is the amount of tokens a parser needs to look ahead. -Also if we don't split the compiler into a lexer and a parser, the compiler will have -to look ahead a lot of character to decide what to do next. So we can say that -a lexer reduces the amount of lookahead a compiler needs to check. +Also, if we don't split the compiler into a lexer and a parser, the compiler +will have to look ahead a lot of characters to decide what to do next. So we +can say that a lexer reduces the amount of lookahead a compiler needs to check. 
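The slash decision just described can be sketched in isolation. The `Div` token code is made up for this illustration; the chapter's real `next()` does the same check inline:

```c
/* One-character lookahead: `src` points at a '/'.
   Peeking at the next character decides between a comment and division. */
enum { Div = 200 };

/* Returns 0 for a "//" comment, Div for the divide operator. */
int classify_slash(const char *src) {
    if (src[1] == '/') {
        return 0;      /* "//...": a comment line, to be skipped */
    }
    return Div;        /* a lone '/': the divide operator        */
}
```

One peeked character is all it takes here, which is exactly why splitting off a lexer keeps the parser's lookahead small.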
+

-## Others
+## Other Tokens

-Others are simpiler and straightforward, check the code:
+The remaining tokens are simpler and straightforward, check the code:

```c
void next() {
@@ -496,23 +508,23 @@ void next() {
}
```

-## Keywords and Builtin Functions
+## Keywords and Built-in Functions

Keywords such as `if`, `while` or `return` are special because they are known
by the compiler in advance. We cannot treat them like normal identifiers
-because the special meanings in it. There are two ways to deal with it:
+because they have special meanings. There are two ways to deal with it:

1. Let lexer parse them and return a token to identify them.
2. Treat them as normal identifier but store them into the symbol table in
   advance.

-We choose the second way: add corresponding identifers into symbol table in
-advance and set the needed properties(e.g. the `Token` type we mentioned). So
-that when keywords are encountered in the source code, they will be interpreted
+We choose the second path: add corresponding identifiers into the symbol table
+in advance and set the needed properties (e.g. the `Token` type we mentioned).
+So, when keywords are encountered in the source code, they will be interpreted
as identifiers, but since they already exist in the symbol table we can know
that they are different from normal identifiers.

-Builtin function are similar. They are only different in the internal
+Built-in functions are similar. They are only different in their internal
information. 
In the main function, add the following:

```c
@@ -551,28 +563,36 @@ int main(int argc, char **argv) {
}
```

-## Code
+
+## Source Code

You can check out the code on
-[Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-2), or
-clone with:
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-2),
+or clone with:

```
git clone -b step-2 https://github.com/lotabout/write-a-C-interpreter
```

-Executing the code will give 'Segmentation Falt' because it will try to
-execute the virtual machine that we build in previous chapter which will not
-work because it doesn't contain any runnable code.
+Executing the code will raise a 'Segmentation Fault' because it will try to
+execute the virtual machine that we built in the previous chapter, which
+won't work because it doesn't contain any runnable code.
+

## Summary

-1. Lexer is used to pre-process the source code, so as to reduce the
-   complexity of parser.
-2. Lexer is also a kind of compiler which consumes source code and output
-   token stream.
-3. `lookahead(k)` is used to fully determine the meaning of current
+1. The lexer is used to pre-process the source code, in order to reduce the
+   complexity of the parser.
+2. The lexer is also a kind of compiler which consumes source code and outputs
+   a token stream.
+3. `lookahead(k)` is used to fully determine the meaning of the current
   character/token.
-4. How to represent identifier and symbol table.
+4. How to represent identifiers and the symbol table.
+
+We will discuss the top-down recursive parser next. See you then. :smile:
+
+

-We will discuss about top-down recursive parser. 
See you then :)
+[shadow]: https://en.wikipedia.org/wiki/Variable_shadowing "Wikipedia » Variable shadowing"
diff --git a/tutorial/en/4-Top-down-Parsing.md b/tutorial/en/4-Top-down-Parsing.md
index 39e4fbe..5a81e56 100644
--- a/tutorial/en/4-Top-down-Parsing.md
+++ b/tutorial/en/4-Top-down-Parsing.md
@@ -1,30 +1,34 @@
-In this chapter we will build a simple calculator using the top-down parsing
-technique. This is the preparation before we start to implement the parser.
+# 4. Top-down Parsing

-I will introduce a small set of theories but will not gurantee to be absolutely
-correct, please consult your textbook if you have any confusion.
+In this chapter we will build a simple calculator using the [top-down parsing]
+technique. This is just preparation work, before we start to implement the
+parser.

-## Top-down parsing
+I will introduce a small set of theories, but can't guarantee them to be
+absolutely correct, please consult your textbook if you have any confusion.

-Traditionally, we have top-down parsing and bottom-up parsing. The top-down
-method will start with a non-terminator and recursively check the source code to
-replace the non-terminators with its alternatives until no non-terminator is
-left.

+
+## Top-down Parsers
+
+Traditionally, we have top-down parsing and [bottom-up parsing]. The top-down
+method will start with a non-terminator and recursively check the source code
+to replace every non-terminator with one of its alternatives, until no
+non-terminator is left.

You see I used the top-down method for explaining "top-down" because you'll
-have to know what a "non-terminator" is to understand the above paragraph. But I
-havn't told you what that is. We will explain in the next section. For now,
-consider "top-down" is trying to tear down a big object into small pieces.
+have to know what a "non-terminator" is to understand the previous paragraph.
+But I haven't told you what that is. We will explain it in the next section. 
+For now, consider "top-down" as trying to tear down a big object into small
+pieces.
+
+On the other hand, "bottom-up" parsing is trying to combine small objects into
+a bigger one. It is often used in automation tools that generate parsers.

-On the other hand "bottom-up" parsing is trying to combine small objects into
-a big one. It is often used in automation tools that generate parsers.

-## Terminator and Non-terminator
+## Terminators and Non-terminators

-They are terms used in
-[BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form) (Backus–Naur
-Form) which is a language used to describe grammars. A simple elementary
-arithmetic calulater in BNF will be:
+They are terms used in [BNF] (Backus–Naur Form), which is a language used to
+describe grammars. A simple elementary arithmetic calculator in BNF will be:

```
 ::=  +
@@ -39,22 +43,24 @@ arithmetic calulater in BNF will be:
     | Num
```

-The item enclosed by `<>` is called a `Non-terminator`. They got the name
-because we can replace them with the items on the right hand of `::=`.
-`|` means alternative that means you can replace `` with any one of
-` * `, ` / ` or ``. Those do not appear on
-the left side of `::=` is called `Terminator` such as `+`, `(`, `Num`, etc.
-They often corresponds to the tokens we got from the lexer.
+Items enclosed by `<>` are called *Non-terminators*. They got that name because
+we can replace them with the items on the right hand side of `::=`. `|` means
+alternative, i.e. you can replace `` with any one of
+` * `, ` / ` or ``.
+Those which do not appear on the left side of `::=` are called *Terminators*,
+such as `+`, `(`, `Num`, etc. They often correspond to the tokens we get from
+the lexer.

-## Top-down Example for Simple Calculator

+
+## Top-down Example of a Simple Calculator

The parse tree is the inner structure we get after the parser consumes all the
tokens and finishes all the parsing. 
Let's take `3 * (4 + 2)` as an example to
-show the connections between BNF grammer, parse tree and top-down parsing.
+show the connections between BNF grammar, parse tree and top-down parsing.
 
-Top-down parsing starts from a starting non-terminator which is `<expr>` in
-our example. You can specify it in practice, but also defaults to the first
-non-terminator we encountered.
+Top-down parsing begins from a starting non-terminator, which is `<expr>` in
+our example. You can specify it in real practice, but it also defaults to the
+first non-terminator encountered.
 
 ```
 1. <expr> => <term>
@@ -71,27 +77,30 @@ non-terminator we encountered.
 ```
 
 You can see that at each step we replace a non-terminator using one of its
-alternatives (top-down) Until all of the sub-items are replaced by
+alternatives (top-down) until all of the sub-items are replaced by
 terminators (bottom). Some non-terminators are used recursively, such as
 `<expr>`.
 
+
 ## Advantages of Top-down Parsing
 
-As you can see in the above example, the parsing step is similar to the BNF
+As you can see in the above example, the parsing steps are similar to the BNF
 grammar, which means it is easy to convert the grammar into actual code by
 converting a production rule (`<...> ::= ...`) into a function with the same
 name.
 
 One question arises here: how do you know which alternative to apply? Why do
 you choose `<term> ::= <term> * <factor>` over `<term> ::= <term> / <factor>`?
 
-That's right, we `lookahead`! We peek the next token and it is `*` so it is the
-first one to apply.
+That's right, with `lookahead`! We peek at the next token, which is `*`, so
+it's the first one to apply.
+
+However, top-down parsing requires that the grammar doesn't contain
+left-recursion.
 
-However, top-down parsing requires the grammar should not have left-recursion.
 
 ## Left-recursion
 
-Suppose we have a grammer like this:
+Suppose we have a grammar like this:
 
 ```
 <expr> ::= <expr> + Num
@@ -106,13 +115,13 @@ int expr() {
 }
 ```
 
-As you can see, function `expr` will never exit!
In the grammar,
-non-terminator `<expr>` is used recursively and appears immediately after
-`::=` which causes left-recursion.
+As you can see, function `expr` will never exit! In the grammar, the `<expr>`
+non-terminator is used recursively and appears immediately after `::=`, which
+causes left-recursion.
 
-Luckly, most left-recursive grammers (maybe all? I don't remember) can be
-properly transformed into non left-recursive equivalent ones. Our grammar for
-calculator can be converted into:
+Luckily, most left-recursive grammars (maybe all? I don't remember) can be
+properly transformed into non left-recursive equivalents. Our calculator
+grammar can be converted into:
 
 ```
 <expr> ::= <term> <expr_tail>
@@ -133,6 +142,7 @@ calculator can be converted into:
 ```
 
 You should check out your textbook for more information.
 
+
 ## Implementation
 
 The following code is directly converted from the grammar. Notice how
@@ -193,7 +203,9 @@ int expr() {
 }
 ```
 
-Implmenting a top-down parser is straightforward with the help of BNF grammar.
+Implementing a top-down parser is straightforward with the help of a BNF
+grammar.
+
 We now add the code for the lexer:
 
 ```c
@@ -253,15 +265,24 @@ int main(int argc, char *argv[])
 ```
 
 You can play with your own calculator now. Or try to add some more functions
-based on what we've learned in the previous chapter. Such as variable support
+based on what we've learned in the previous chapter, such as variable support,
 so that a user can define variables to store values.
 
+
 ## Summary
 
-We don't like theory, but it exists for good reason as you can see that BNF can
-help us to build the parser. So I want to convice you to learn some theories,
-it will help you to become a better programmer.
+We don't like theory, but it exists for a good reason: as you can see, BNF
+helped us to build the parser. So I want to convince you to learn some
+theory; it will help you to become a better programmer.
-Top-down parsing technique is often used in manually crafting of parsers, so
-you are able to handle most jobs if you master it! As you'll see in laster
+The top-down parsing technique is often used when crafting parsers manually,
+so you'll be able to handle most jobs if you master it, as you'll see in later
 chapters.
+
+
+
+[BNF]: https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form "Wikipedia: Backus–Naur form"
+[bottom-up parsing]: https://en.wikipedia.org/wiki/Bottom-up_parsing "Wikipedia » Bottom-up parsing"
+[top-down parsing]: https://en.wikipedia.org/wiki/Top-down_parsing "Wikipedia » Top-down parsing"
diff --git a/tutorial/en/5-Variables.md b/tutorial/en/5-Variables.md
index 26b162c..67ee23e 100644
--- a/tutorial/en/5-Variables.md
+++ b/tutorial/en/5-Variables.md
@@ -1,17 +1,19 @@
-In this chapter we are going to use EBNF to describe the grammer of our C
+# 5. Variables
+
+In this chapter we are going to use [EBNF] to describe the grammar of our C
 interpreter, and add support for variables.
 
-The parser is more complicated than the lexer, thus we will split it into 3 parts:
-variables, functions and expressions.
+The parser is more complicated than the lexer, thus we will split it into three
+parts: variables, functions and expressions.
+
 
 ## EBNF Grammar
 
-We've talked about BNF in the previous chapter,
-[EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form) is
-Extended-BNF. If you are familiar with regular expression, you should feel
-right at home. Personally I think it is more powerful and straightforward than
-BNF. Here is the EBNF grammar of our C interpreter, feel free to skip it if
-you feel it's too hard to understand.
+We've talked about [BNF] in the previous chapter; [EBNF] is Extended BNF.
+If you are familiar with regular expressions, you should feel right at home.
+Personally I think it is more powerful and straightforward than BNF.
Here is
+the EBNF grammar of our C interpreter; feel free to skip it if you feel it's
+too hard to understand.
 
 ```
 program ::= {global_declaration}+
@@ -40,12 +42,13 @@ while_statement ::= 'while' '(' expression ')' non_empty_statement
 
 We'll leave `expression` to later chapters. Our grammar won't support function
 *declaration*, which means that recursive calls between functions are not
-supported. And since we're bootstrapping, that means our code for
+supported. And since we're bootstrapping, that means our code for its
 implementation cannot use any cross-function recursion. (Sorry for the whole
 chapter of top-down recursive parsers.)
 
 In this chapter, we'll implement `enum_decl` and `variable_decl`.
 
+
 ## program()
 
 We've already defined the function `program`; turn it into:
 
 ```c
@@ -60,15 +63,16 @@ void program() {
 }
 ```
 
-I know that we havn't defined `global_declaration`, sometimes we need wishful
+I know that we haven't defined `global_declaration` yet; sometimes we need the
+wishful
 thinking that maybe someone (say Bob) will implement that for you. So you can
-focus on the big picture at first instead of drill down into all the details.
-That's the essence of top-down thinking.
+focus on the big picture at first, instead of having to drill down into all the
+details. That's the essence of top-down thinking.
+
 
 ## global_declaration()
 
 Now it is our duty (not Bob's) to implement `global_declaration`. It will try
-to parse variable definitions, type definitions (only enum is supported) and
+to parse variable definitions, type definitions (only `enum` is supported) and
 function definitions:
 
 ```c
@@ -158,20 +162,20 @@ void global_declaration() {
 }
 ```
 
-Well, that's more than two screen of code! I think that is a direct
+Well, that's more than two screens of code! I think that is a direct
 translation of the grammar.
But to help you understand, I'll explain some of them:
 
-**Lookahead Token**: The `if (token == xxx)` statement is used to peek the
+**Lookahead Token** — The `if (token == xxx)` statement is used to peek at the
 next token to decide which production rule to use. For example, if token
-`enum` is met, we know it is enumeration that we are trying to parse. But if a
+`enum` is met, we know it's an enumeration that we're trying to parse. But if a
 type is parsed, such as `int identifier`, we still cannot tell whether
 `identifier` is a variable or a function. Thus the parser should continue to
 look ahead for the next token: if `(` is met, we are now sure `identifier` is
-a function, otherwise it is a variable.
+a function, otherwise it's a variable.
 
-**Variable Type**: Our C interpreter supports pointers, that means pointers
-that points to other pointers are also supported such as `int **data;`. How do
+**Variable Type** — Our C interpreter supports pointers, which means pointers
+that point to other pointers are also supported, such as `int **data;`. How do
 we represent them in code? We've already defined the types that we support:
 
 ```c
@@ -180,13 +184,14 @@ enum { CHAR, INT, PTR };
 ```
 
 So we will use an `int` to store the type. It starts with a base type: `CHAR`
-or `INT`. When the type is a pointer that points to a base type such as `int
-*data;` we add `PTR` to it: `type = type + PTR;`. The same goes to the pointer
-of pointer, we add another `PTR` to the type, etc.
+or `INT`. When the type is a pointer that points to a base type, such as
+`int *data;`, we add `PTR` to it: `type = type + PTR;`. The same goes for a
+pointer to a pointer: we add another `PTR` to the type, etc.
+
 
 ## enum_declaration
 
-The main logic is trying to parse the `,` seperated variables. You need to pay
+The main logic is trying to parse the `,` separated variables. You need to pay
 attention to the representation of enumerations.
 
 We will store an enumeration as a global variable.
However its `type` is set to
@@ -228,7 +233,9 @@ void enum_declaration() {
 
 ## Misc
 
-Of course `function_declaration` will be introduced in next chapter. `match` appears a lot. It is helper function that consume the current token and fetch the next:
+Of course `function_declaration` will be introduced in the next chapter. `match`
+appears a lot. It's a helper function that consumes the current token and
+fetches the next:
 
 ```c
 void match(int tk) {
@@ -241,22 +248,33 @@ void match(int tk) {
 }
 ```
 
-## Code
-You can download the code of this chapter from [Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-3), or clone with:
+## Source Code
+
+You can download the code of this chapter from
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-3),
+or clone with:
 
 ```
 git clone -b step-3 https://github.com/lotabout/write-a-C-interpreter
 ```
 
-The code won't run because there are still some un-implemented functions. You
-can challange yourself to fill them out first.
+The code won't run because there are still some unimplemented functions. You
+can challenge yourself to fill them out first.
+
 
 ## Summary
 
 EBNF might be difficult to understand because of its syntax (maybe). But it
 should be easy to follow this chapter once you can read the syntax. What we do
 is to translate EBNF directly into C code. So the parsing is by no means
-exciting but you should pay attention to the representation of each concept.
+exciting, but you should pay attention to the representation of each concept.
+
+We'll talk about function definitions in the next chapter. See you then.
+
+
 
-We'll talk about the function definition in the next chapter, see you then.
+[BNF]: https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form "Wikipedia: Backus–Naur form" +[EBNF]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form "Wikipedia: Extended Backus–Naur form" diff --git a/tutorial/en/6-Functions.md b/tutorial/en/6-Functions.md index 006c8dc..76318f8 100644 --- a/tutorial/en/6-Functions.md +++ b/tutorial/en/6-Functions.md @@ -1,11 +1,14 @@ +# 6. Functions + We've already seen how variable definitions are parsed in our interpreter. Now it's time for function definitions (note it is definition, not declaration, thus our interpreter doesn't support recursion across functions). + ## EBNF Grammar -Let's start by refreshing our memory of the EBNF grammar introduced in last -chapter, we've already implement `program`, `global_declaration` and +Let's start by refreshing our memory of the EBNF grammar introduced in the +previous chapter, we've already implemented `program`, `global_declaration` and `enum_decl`. We'll deal with part of `variable_decl`, `function_decl`, `parameter_decl` and `body_decl`. The rest will be covered in the next chapter. @@ -28,7 +31,8 @@ if_statement ::= 'if' '(' expression ')' statement ['else' non_empty_statement] while_statement ::= 'while' '(' expression ')' non_empty_statement ``` -## Function definition + +## Function Definition Recall that we've already encountered functions when handling `global_declaration`: @@ -44,13 +48,14 @@ if (token == '(') { ``` The type for the current identifier (i.e. function name) had already been set -correctly. The above chunk of code set the type (i.e. `Fun`) and the +correctly. The above chunk of code sets the type (i.e. `Fun`) and the address in `text segment` for the function. Here comes `parameter_decl` and `body_decl`. + ## Parameters and Assembly Output -Before we get our hands dirty, we have to understand the assembly code that +Before we get our hands dirty, we have to understand the Assembly code that will be output for a function. 
Consider the following: ```c @@ -83,11 +88,13 @@ following (please refer to the VM of chapter 2): | .... | low address ``` -The key point here is no matter if it is a parameter (e.g. `param_a`) or local -variable (e.g. `local_1`), they are all stored on the **stack**. Thus they are -referred to by the pointer `new_bp` and relative offsets, while global variables -which are stored in `text segment` are refered to by direct address. So we -need to know the number of parameters and the offset of each. +The key point here is that it doesn't matter if it is a parameter (e.g. +`param_a`) or a local variable (e.g. `local_1`), they are all stored on the +**stack**. Thus they are referred to by the pointer `new_bp` and relative +offset, while global variables which are stored in `text segment` are referred +to by a direct address. So we need to know the number of parameters and the +offset of each. + ## Skeleton for Parsing Function @@ -122,13 +129,14 @@ because `variable_decl` and `function_decl` are parsed together (because of the same prefix in EBNF grammar) inside `global_declaration`. `variable_decl` ends with `;` while `function_decl` ends with `}`. If `}` is consumed, the `while` loop in `global_declaration` won't be able to know that a `function_decl` -parsing is end. Thus we leave it to `global_declaration` to consume it. +parsing has ended. Thus we leave it to `global_declaration` to consume it. What ② is trying to do is unwind local variable declarations for all local variables. As we know, local variables can have the same name as global ones, -once it happans, global ones will be shadowed. So we should recover the status +once it happens, global ones will be shadowed. So we should recover the status once we exit the function body. Informations about global variables are backed -up to fields `BXXX`, so we iterate over all identifiers to recover. +up to fields `BXXX`, so we iterate over all identifiers to recover them. 
+
 
 ## function_parameter()
 
@@ -192,21 +200,22 @@ void function_parameter() {
 }
 ```
 
-Part ① is the same to what we've seen in `global_declaration` which is used to
-parse the type for the parameter.
+Part ① is the same as what we've seen in `global_declaration`, which is used to
+parse the type of the parameter.
 
 Part ② is to back up the information for global variables which will be
-shadowed by local variables. The position of current parameter is stored in
+shadowed by local variables. The current parameter's position is stored in
 field `Value`.
 
-Part ③ is used to calculate the position of pointer `bp` which corresponds to
-`new_bp` that we talked about in the above section.
+Part ③ is used to calculate the position of pointer `bp`, which corresponds to
+the `new_bp` that we talked about in the section above.
+
 
 ## function_body()
 
-Different with modern C, our interpreter requires that all the definitinos of
-variables that are used in current function should be put at the beginning of
-current function. This rule is actually the same to ancient C compilers.
+Unlike modern C, our interpreter requires that all variable definitions used
+in the current function be placed at the beginning of the function. This rule
+is actually the same as in ancient C compilers.
 
 ```c
 void function_body() {
@@ -274,31 +283,35 @@ void function_body() {
 }
 ```
 
-You should be familiar with ①, it had been repeated several times.
+You should be familiar with ① by now; it has been repeated several times.
+
+Part ② is writing Assembly code into the text segment. In the VM chapter, we
+said we have to preserve space for local variables on the stack; well, this
+is it.
 
-Part ② is writing assembly code into text segment. In the VM chapter, we said
-we have to preserve spaces for local variables on stack, well, this is it.
-## Code
-You can download the code of this chapter from [Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-4), or clone with:
+## Source Code
+
+You can download the code of this chapter from
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-4),
+or clone with:
 
 ```
 git clone -b step-4 https://github.com/lotabout/write-a-C-interpreter
 ```
 
-The code still won't run because there are still some un-implemented
-functions. You can challange yourself to fill them out first.
+The code still won't run because there are still some unimplemented
+functions. You can challenge yourself to fill them out first.
 
 ## Summary
 
-The code of this chapter isn't long, most of the part are used to parse
-variables and much of them are duplicated. The parsing for parameter and local
-variables are almost the same, but the stored information are different.
+The code of this chapter isn't long; most of it is used to parse variables,
+and much of it is duplicated. Parsing parameters and local
+variables is almost the same, but their stored information is different.
 
 Of course, you may want to review the VM chapter (chapter 2) to get a better
 understanding of the expected output for functions, so as to understand why
-would we want to gather such information. This is what we called
+we would want to gather such information. This is what we call
 "domain knowledge".
 
-We'll deal with `if`, `while` next chapter, see you then.
+We'll deal with `if` and `while` in the next chapter. See you then.
diff --git a/tutorial/en/7-Statements.md b/tutorial/en/7-Statements.md
index 9b5eb42..2ede9ec 100644
--- a/tutorial/en/7-Statements.md
+++ b/tutorial/en/7-Statements.md
@@ -1,5 +1,7 @@
+# 7. Statements
+
 We have two concepts in C: statement and expression. Basically, statements won't
-have a value as its result while expressions do. That means you cannot assign
+have a value as their result while expressions do. That means you cannot assign
 a statement to a variable.
We have 6 statements in our interpreter:
 
@@ -9,15 +11,16 @@ We have 6 statements in our interpreter:
 3. `{ <statement> }`
 4. `return xxx;`
 5. `<empty statement>`;
-6. `expression;` (expression end with semicolon)
+6. `expression;` (expression ends with a semicolon)
 
 The parsing job is relatively easy compared to understanding the expected
-assembly output. Let's explain one by one.
+Assembly output. Let's explain them one by one.
+
 
 ## IF
 
-`if` statement is to jump to a different location according to the condition.
-Let's check its pseudo cdoe:
+An `if` statement serves to jump to a different location according to the
+condition. Let's check its pseudo code:
 
 ```
 if (...) <statement> [else <statement>]
@@ -31,12 +34,12 @@ if (...) <statement> [else <statement>]
 b:                       b:
 ```
 
-The flow of assembly code is:
+The flow of Assembly code is:
 
 1. execute `<cond>`.
 2. If the condition fails, jump to position `a`, i.e. the `else` statement.
-3. Because assembly is executed sequentially, if `<true_statement>` is executed,
-   we need to skip `<false_statement>`, thus a jump to `b` is needed.
+3. Because Assembly is executed sequentially, if `<true_statement>` is
+   executed, we need to skip `<false_statement>`, thus a jump to `b` is needed.
 
 Corresponding C code:
 
@@ -66,6 +69,7 @@ Corresponding C code:
 }
 ```
 
+
 ## While
 
 `while` is simpler than `if`:
 
 ```
@@ -79,7 +83,7 @@ a: a: b: b:
 ```
 
-Nothing worth mention. C code:
+Nothing worth mentioning. C code:
 
 ```c
 else if (token == While) {
@@ -102,10 +106,11 @@ Nothing worth mention. C code:
 }
 ```
 
+
 ## Return
 
-Once we meet `return`, it means the function is about to end, thus `LEV` is
-needed to indicate the exit.
+When we encounter `return`, it means the function is about to end, thus `LEV`
+is needed to indicate the exit.
 
 ```c
 else if (token == Return) {
@@ -123,10 +128,11 @@ needed to indicate the exit.
 }
 ```
 
-## Others
-Other statement acts as helpers for compiler to group the codes better. They
-won't generate assembly codes. As follows:
+## Other Statements
+
+Other statements act as helpers for the compiler to better organize the code in
+groups. They won't generate Assembly code.
As follows:
 
 ```c
 else if (token == '{') {
@@ -150,25 +156,29 @@ won't generate assembly codes. As follows:
 }
 ```
 
-## Code
-You can download the code of this chapter from [Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-5), or clone with:
+## Source Code
+
+You can download the code of this chapter from
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-5),
+or clone with:
 
 ```
 git clone -b step-5 https://github.com/lotabout/write-a-C-interpreter
 ```
 
-The code still won't run because there are still some un-implemented
-functions. You can challange yourself to fill them out first.
+The code still won't run because there are still some unimplemented
+functions. You can challenge yourself to fill them out first.
+
 
 ## Summary
 
 As you can see, implementing parsing for an interpreter is not hard at all.
 But it did seem complicated because we need to gather enough knowledge during
-parsing in order to generate target code (assembly in our case). You see, that
-is one big obstacle for beginner to start implementation. So instead of
-"programming knowledge", "domain knowledge" is also required to actually
+parsing in order to generate target code (Assembly in our case). You see,
+that's a major obstacle for beginners to start the implementation. So besides
+"programming knowledge", "domain knowledge" is also required to actually
 achieve something.
 
-Thus I suggest you to learn assembly if you haven't, it is not difficult but
+Thus I suggest you learn Assembly if you haven't; it is not difficult, but
 helpful to understand how computers work.
diff --git a/tutorial/en/8-Expressions.md b/tutorial/en/8-Expressions.md
index f1aac22..62afc66 100644
--- a/tutorial/en/8-Expressions.md
+++ b/tutorial/en/8-Expressions.md
@@ -1,46 +1,49 @@
-This chapter will be long, please make sure you have enough time for this.
-We'll dealing with the last puzzle of our interpreter: expressions.
+# 8. Expressions
Well, it is combination of the elements of a -programming language and generates a result, such as function invocation, +This chapter is going to be long, please make sure you have enough time for +this. We'll be dealing with the last piece of the puzzle of our interpreter: +expressions. + +What is an expression? Well, it's a combination of the elements of a +programming language and generates a result, such as a function invocation, variable assignment, calculation using various operators. We had to pay attention to two things: the precedence of operators and the target assembly code for operators. -## Precedence of operators + +## Operator Precedence The precedence of operators means we should compute some operators before -others even though the latter may show up first. For example: operator `*` -has higher precedence than operator `+`, so that in expression `2 + 3 * 4`, +others even though the latter may show up first. For example: the `*` operator +has higher precedence than the `+` operator, so that in expression `2 + 3 * 4`, the correct calculation result is `2 + (3 * 4)` instead of `(2 + 3) * 4` even though `+` comes before `*`. -C programming language had already defined the precedence for various -operators, you can refer to [Operator -Precedence](http://en.cppreference.com/w/c/language/operator_precedence). +The C programming language has already defined the precedence for various +operators, you can refer to [Operator Precedence]. -We'll use stack for handling precedence. One stack for arguments, the other +We'll use stacks to handle precedence. One stack for arguments, the other one for operators. I'll give an example directly: consider `2 + 3 - 4 * 5`, we'll get the result through the following steps: -1. push `2` onto the stack. -2. operator `+` is met, push it onto the stack, now we are expecting the other +1. Push `2` onto the stack. +2. Operator `+` is met, push it onto the stack, now we are expecting the other argument for `+`. 3. 
`3` is met, push it onto the stack. We are supposed to calculate `2+3` - immediately, but we are not sure whether `3` belongs to the operator with + immediately, but we are not sure whether `3` belongs to an operator with higher precedence, so leave it there. -4. operator `-` is met. `-` has the same precedence as `+`, so we are sure +4. Operator `-` is met. `-` has the same precedence as `+`, so we are sure that the value `3` on the stack belongs to `+`. Thus the pending - calculation `2+3` is evaluated. `3`, `+`, `2` are poped from the stack and + calculation `2+3` is evaluated. `3`, `+`, `2` are popped from the stack and the result `5` is pushed back. Don't forget to push `-` onto the stack. 5. `4` is met, but we are not sure if it 'belongs' to `-`, leave it there. 6. `*` is met and it has higher precedence than `-`, now we have two operators pending. 7. `5` is met, and still not sure whom `5` belongs to. Leave it there. -8. The expression end. Now we are sure that `5` belongs to the operator lower +8. The expression ends. Now we are sure that `5` belongs to the operator lower on the stack: `*`, pop them out and push the result `4 * 5 = 20` back. -9. Continue to pop out items push the result `5 - 20 = -15` back. +9. Continue to pop out items, push the result `5 - 20 = -15` back. 10. Now the operator stack is empty, pop out the result: `-15` ``` @@ -69,7 +72,7 @@ we'll get the result through the following steps: +------+ +------+ ``` -As described above, we had to make sure that the right side of argument +As described above, we had to make sure that the right side of an argument belongs to current operator 'x', thus we have to look right of the expression, find out and calculate the ones that have higher precedence. Then do the calculation for current operator 'x'. @@ -78,14 +81,16 @@ Finally, we need to consider precedence for only binary/ternary operators. 
Because precedence means different operators try to "snatch" arguments from each other, while unary operators are the strongest. -## Unary operators -Unary operators are strongest, so we serve them first. Of course, we'll also -parse the arguments(i.e. variables, number, string, etc) for operators. +## Unary Operators + +Unary operators are the strongest, so we serve them first. Of course, we'll +also parse the arguments (i.e. variables, number, string, etc) for operators. -We've already learned the parsing +We've already learned how the parsing works. -### Constant + +### Constants First comes numbers, we use `IMM` to load it into `AX`: @@ -101,7 +106,7 @@ if (token == Num) { ``` Next comes string literals, however C support this kind of string -concatination: +concatenation: ```c char *p; @@ -133,10 +138,11 @@ else if (token == '"') { } ``` -### sizeof -It is an unary operator, we'll have to know to type of its argument which we -are familiar with. +### Sizeof + +It is a unary operator, we'll have to know the type of its argument, which +we're already familiar with. ```c else if (token == Sizeof) { @@ -168,10 +174,12 @@ else if (token == Sizeof) { } ``` -Note that only `sizeof(int)`, `sizeof(char)` (which by the way, is always `1` - by definition) and `sizeof(pointer type ...)` -are supported, and the type of the result is `int`. +Note that only `sizeof(int)`, `sizeof(char)` (which by the way, is always `1`, +by definition) and `sizeof(pointer type ...)` are supported, and the type of +the result is `int`. + -### Variable and function invocation +### Variable and Function Invocation They all starts with an `Id` token, thus are handled together. @@ -261,32 +269,33 @@ else if (token == Id) { ``` ①: Notice we are using the normal order to push the arguments which -corresponds to the implementation of our virtual machine. However C standard -push the argument in reverse order. +correspond to the implementation of our virtual machine. 
However C standard
+pushes the arguments in reverse order.
 
-②: Note how we support `printf`, `read`, `malloc` and other built in
-functions in our virtual machine. These function calls have specific assembly
+②: Note how we support `printf`, `read`, `malloc` and other built-in
+functions in our virtual machine. These function calls have specific Assembly
 instructions, while normal functions are compiled into `CALL <addr>`.
 
-③: Remove the arguments on the stack, we modifies the stack pointer directly
+③: Remove the arguments from the stack; we modify the stack pointer directly
 because we don't care about the values.
 
 ④: Enum variables are treated as constant numbers.
 
 ⑤: Load the values of variables: use the `bp + offset` style for local
-variable(refer to chapter 7), use `IMM` to load the address of global
+variables (refer to chapter 7), use `IMM` to load the address of global
 variables.
 
-⑥: Finally load the value of variables using `LI/LC` according to their type.
+⑥: Finally, load the value of variables using `LI/LC` according to their type.
 
 You might ask how to deal with expressions like `a[10]` if we use `LI/LC` to
-load immediatly when an identifier is met? We might modify existing assembly
+load immediately when an identifier is met? We might modify existing Assembly
 instructions according to the operator after the identifier, as you'll see
 later.
 
+
 ### Casting
 
-Perhaps you've notice that we use `expr_type` to store the type of the return
+Perhaps you've noticed that we use `expr_type` to store the type of the return
 value of an expression. Type casting is for changing the type of the return
 value of an expression.
 
@@ -315,11 +324,12 @@ else if (token == '(') {
 }
 ```
 
+
 ### Dereference/Indirection
 
-`*a` in C is to get the object pointed by pointer `a`. It is essential to find
-out the type of pointer `a`. Luckily the type information will be stored in
-variable `expr_type` when an expression ends.
+In C, `*a` serves to get the object pointed to by pointer `a`.
It is essential to +find out the type of pointer `a`. Luckily the type information will be stored +in variable `expr_type` when an expression ends. ```c else if (token == Mul) { @@ -338,12 +348,13 @@ else if (token == Mul) { } ``` -### Address-Of -In section "Variable and function invocation", we said we will modify `LI/LC` +### Address-of + +In section "Variable and function invocation," we said we will modify `LI/LC` instructions dynamically, now it is the time. We said that we'll load the -address of a variable first and call `LI/LC` instruction to load the actual -content according to the type: +address of a variable first and call the `LI/LC` instructions to load the +actual content according to the type: ``` IMM @@ -369,9 +380,10 @@ else if (token == And) { } ``` + ### Logical NOT -We don't have logical not instruction in our virtual machine, thus we'll +We don't have a logical `not` instruction in our virtual machine, thus we'll compare the result to `0` which represents `False`: ```c @@ -390,10 +402,11 @@ else if (token == '!') { } ``` + ### Bitwise NOT -We don't have corresponding instruction in our virtual machine either. Thus we -use `XOR` to implement, e.g. `~a = a ^ 0xFFFF`. +We don't have the corresponding instruction in our virtual machine either. +Thus we use `XOR` to implement it, e.g. `~a = a ^ 0xFFFF`. ```c else if (token == '~') { @@ -411,7 +424,8 @@ else if (token == '~') { } ``` -### Unary plus and Unary minus + +### Unary Plus and Unary Minus Use `0 - x` to implement `-x`: @@ -442,6 +456,7 @@ else if (token == Add) { } ``` + ### Increment and Decrement The precedence for increment or decrement is related to the position of the @@ -474,19 +489,20 @@ else if (token == Inc || token == Dec) { } ``` -For `++p` we need to access `p` twice: one for load the value, one for storing +For `++p` we need to access `p` twice: once to load the value, once to store the incremented value, that's why we need to `PUSH` (①) it once. 
-② deal with cases when `p` is pointer.
+② deals with cases when `p` is a pointer.

+
## Binary Operators

-Now we need to deal with the precedence of operators. We will scan to the
-right of current operator, until one that has **the same or lower** precedence
-than the current operator is met.
+Now we need to deal with operators' precedence. We will scan to the right of
+the current operator, until one that has **the same or lower** precedence than
+the current operator is met.

Let's recall the tokens that we've defined, they are order by their
-precedences from low to high. That mean `Assign` is the lowest and `Brak`(`[`)
+precedences from low to high. That means `Assign` is the lowest and `Brak`(`[`)
is the highest.

```c
@@ -498,7 +514,7 @@ enum {
```

Thus the argument `level` in calling `expression(level)` is actually used to
-indicate the precedence of current operator, thus the skeleton for parsing
+indicate the current operator's precedence, so the skeleton for parsing
binary operators is:

```c
@@ -507,12 +523,13 @@ while (token >= level) {
}
```

-Now we know how to deal with precedence, let's check how operators are
-compiled into assembly instructions.
+Now that we know how to deal with precedence, let's check how operators are
+compiled into Assembly instructions.
+

### Assignment

-`Assign` has the lowest precedence. Consider expression `a = (expression)`,
+`Assign` has the lowest precedence. Consider the expression `a = (expression)`,
we've already generated instructions for `a` as:

```
@@ -550,9 +567,10 @@ if (token == Assign) {
}
```

+
### Ternary Conditional

-That is `? :` in C. It is a operator version for `if` statement. The target
+That is `? :` in C. It's the operator version of the `if` statement. The target
instructions are almost identical to `if`:

```c
@@ -576,9 +594,10 @@ else if (token == Cond) {
}
```

+
### Logical Operators

-Two of them: `||` and `&&`. Their corresponding assembly instructions are:
+Two of them: `||` and `&&`. 
Their corresponding Assembly instructions are:

```c
|| &&
@@ -589,7 +608,7 @@ Two of them: `||` and `&&`. Their corresponding assembly instructions are:
 b: b:
 ```

-Source code as following:
+The source code is as follows:

```c
else if (token == Lor) {
@@ -612,6 +631,7 @@ else if (token == Lan) {
}
```

+
### Mathematical Operators

Including `|`, `^`, `&`, `==`, `!=`, `<=`, `>=`, `<`, `>`, `<<`, `>>`, `+`,
@@ -640,16 +660,15 @@ else if (token == Xor) {
}
```

-Quite easy, hah? There are still something to mention about addition and
-substraction for pointers. A pointer plus/minus some number equals to the
-shiftment for a pointer according to its type. For example, `a + 1` will shift
-for 1 byte if `a` is `char *`, while 4 bytes(in 32 bit machine) if a is `int
-*`.
+Quite easy, huh? There is still something to mention about addition and
+subtraction for pointers. A pointer plus/minus some number equals shifting a
+pointer according to its type. For example, `a + 1` will shift by 1 byte if
+`a` is `char *`, and by 4 bytes (on a 32-bit machine) if `a` is `int *`.

-Also, substraction for two pointers will give the number of element between
+Also, the subtraction of two pointers will give the number of elements between
them, thus need special treatment.

-Take addition as example:
+Take addition as an example:

```
+
@@ -687,12 +706,13 @@ else if (token == Add) {
}
```

-You can try to implement substraction by your own or refer to the repository.
+You can try to implement subtraction on your own or refer to the repository.
+

### Increment and Decrement Again

Now we deal with the postfix version, e.g. `p++` or `p--`. Different from the
-prefix version, the postfix version need to store the value **before**
+prefix version, the postfix version needs to store the value **before**
increment/decrement on `AX` after the increment/decrement. Let's compare them:

```c
@@ -715,6 +735,7 @@ increment/decrement on `AX` after the increment/decrement. 
Let's compare them:
 *++text = (token == Inc) ? SUB : ADD; //
 ```

+
### Indexing

You may already know that `a[10]` equals to `*(a + 10)` in C language. The
@@ -745,11 +766,11 @@ else if (token == Brak) {
}
```

-## Code
+## Source Code

-We need to initialize the stack for our virtual machine besides all the
-expressions in the above sections so that `main` function is correctly called,
-and exit when `main` exit.
+We need to initialize the stack for our virtual machine, besides all the
+expressions in the above sections, so that the `main` function is correctly
+called, and exit when `main` exits.

```c
int *tmp;
@@ -762,12 +783,12 @@ sp = (int *)((int)stack + poolsize);
*--sp = (int)tmp;
```

-Last, due to the limitation of our interpreter, all the definitions of variables
+Lastly, due to the limitations of our interpreter, all variable definitions
should be put before all expressions, just like the old C compiler requires.

-You can download all the code on
-[Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-6) or
-clone it by:
+You can download all the code from
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-6),
+or clone it by:

```
git clone -b step-6 https://github.com/lotabout/write-a-C-interpreter
@@ -776,12 +797,12 @@ git clone -b step-6 https://github.com/lotabout/write-a-C-interpreter
```

Compile it by `gcc -o xc-tutor xc-tutor.c` and run it by `./xc-tutor hello.c`
to check the result.

-Our code is bootstraping, that means our interpreter can parse itself, so that
-you can run our C interpreter inside itself by `./xc-tutor xc-tutor.c
-hello.c`.
+Our code is bootstrapping, which means our interpreter can parse itself, so
+you can run our C interpreter inside itself via `./xc-tutor xc-tutor.c hello.c`.

You might need to compile with `gcc -m32 -o xc-tutor xc-tutor.c` if you use a
-64 bit machine.
+64-bit machine.
+

## Summary

@@ -789,6 +810,12 @@ This chapter is long mainly because there are quite a lot of operators there. 
You might pay attention to:

1. How to call `expression` recursively to handle the precedence of operators.
-2. What is the target assembly instructions for each operator.
+2. What the target Assembly instructions are for each operator.
+
+Congratulations! You've built a runnable C interpreter all by yourself!
+
+

-Congratulations! You've build a runnable C interpeter all by yourself!
+[Operator Precedence]: https://en.cppreference.com/w/c/language/operator_precedence "C Operator Precedence"