@@ -1,293 +1,7 @@
-=================================================
-Kaleidoscope: Tutorial Introduction and the Lexer
-=================================================
+:orphan:
 
-.. contents::
-   :local:
-
-Tutorial Introduction
+=====================
+Kaleidoscope Tutorial
 =====================
 
-Welcome to the "Implementing a language with LLVM" tutorial. This
-tutorial runs through the implementation of a simple language, showing
-how fun and easy it can be. This tutorial will get you up and started as
-well as help to build a framework you can extend to other languages. The
-code in this tutorial can also be used as a playground to hack on other
-LLVM specific things.
-
-The goal of this tutorial is to progressively unveil our language,
-describing how it is built up over time. This will let us cover a fairly
-broad range of language design and LLVM-specific usage issues, showing
-and explaining the code for it all along the way, without overwhelming
-you with tons of details up front.
-
-It is useful to point out ahead of time that this tutorial is really
-about teaching compiler techniques and LLVM specifically, *not* about
-teaching modern and sane software engineering principles. In practice,
-this means that we'll take a number of shortcuts to simplify the
-exposition. For example, the code uses global variables
-all over the place, doesn't use nice design patterns like
-`visitors <http://en.wikipedia.org/wiki/Visitor_pattern>`_, etc... but
-it is very simple. If you dig in and use the code as a basis for future
-projects, fixing these deficiencies shouldn't be hard.
-
-I've tried to put this tutorial together in a way that makes chapters
-easy to skip over if you are already familiar with or are uninterested
-in the various pieces. The structure of the tutorial is:
-
-- `Chapter #1 <#language>`_: Introduction to the Kaleidoscope
-  language, and the definition of its Lexer - This shows where we are
-  going and the basic functionality that we want it to do. In order to
-  make this tutorial maximally understandable and hackable, we choose
-  to implement everything in C++ instead of using lexer and parser
-  generators. LLVM obviously works just fine with such tools, feel free
-  to use one if you prefer.
-- `Chapter #2 <LangImpl02.html>`_: Implementing a Parser and AST -
-  With the lexer in place, we can talk about parsing techniques and
-  basic AST construction. This tutorial describes recursive descent
-  parsing and operator precedence parsing. Nothing in Chapters 1 or 2
-  is LLVM-specific, the code doesn't even link in LLVM at this point.
-  :)
-- `Chapter #3 <LangImpl03.html>`_: Code generation to LLVM IR - With
-  the AST ready, we can show off how easy generation of LLVM IR really
-  is.
-- `Chapter #4 <LangImpl04.html>`_: Adding JIT and Optimizer Support
-  - Because a lot of people are interested in using LLVM as a JIT,
-  we'll dive right into it and show you the 3 lines it takes to add JIT
-  support. LLVM is also useful in many other ways, but this is one
-  simple and "sexy" way to show off its power. :)
-- `Chapter #5 <LangImpl05.html>`_: Extending the Language: Control
-  Flow - With the language up and running, we show how to extend it
-  with control flow operations (if/then/else and a 'for' loop). This
-  gives us a chance to talk about simple SSA construction and control
-  flow.
-- `Chapter #6 <LangImpl06.html>`_: Extending the Language:
-  User-defined Operators - This is a silly but fun chapter that talks
-  about extending the language to let the user program define their own
-  arbitrary unary and binary operators (with assignable precedence!).
-  This lets us build a significant piece of the "language" as library
-  routines.
-- `Chapter #7 <LangImpl07.html>`_: Extending the Language: Mutable
-  Variables - This chapter talks about adding user-defined local
-  variables along with an assignment operator. The interesting part
-  about this is how easy and trivial it is to construct SSA form in
-  LLVM: no, LLVM does *not* require your front-end to construct SSA
-  form!
-- `Chapter #8 <LangImpl08.html>`_: Compiling to Object Files - This
-  chapter explains how to take LLVM IR and compile it down to object
-  files.
-- `Chapter #9 <LangImpl09.html>`_: Extending the Language: Debug
-  Information - Having built a decent little programming language with
-  control flow, functions and mutable variables, we consider what it
-  takes to add debug information to standalone executables. This debug
-  information will allow you to set breakpoints in Kaleidoscope
-  functions, print out argument variables, and call functions - all
-  from within the debugger!
-- `Chapter #10 <LangImpl10.html>`_: Conclusion and other useful LLVM
-  tidbits - This chapter wraps up the series by talking about
-  potential ways to extend the language, but also includes a bunch of
-  pointers to info about "special topics" like adding garbage
-  collection support, exceptions, debugging, support for "spaghetti
-  stacks", and a bunch of other tips and tricks.
-
-By the end of the tutorial, we'll have written a bit less than 1000 lines
-of non-comment, non-blank, lines of code. With this small amount of
-code, we'll have built up a very reasonable compiler for a non-trivial
-language including a hand-written lexer, parser, AST, as well as code
-generation support with a JIT compiler. While other systems may have
-interesting "hello world" tutorials, I think the breadth of this
-tutorial is a great testament to the strengths of LLVM and why you
-should consider it if you're interested in language or compiler design.
-
-A note about this tutorial: we expect you to extend the language and
-play with it on your own. Take the code and go crazy hacking away at it,
-compilers don't need to be scary creatures - it can be a lot of fun to
-play with languages!
-
-The Basic Language
-==================
-
-This tutorial will be illustrated with a toy language that we'll call
-"`Kaleidoscope <http://en.wikipedia.org/wiki/Kaleidoscope>`_" (derived
-from "meaning beautiful, form, and view"). Kaleidoscope is a procedural
-language that allows you to define functions, use conditionals, math,
-etc. Over the course of the tutorial, we'll extend Kaleidoscope to
-support the if/then/else construct, a for loop, user defined operators,
-JIT compilation with a simple command line interface, etc.
-
-Because we want to keep things simple, the only datatype in Kaleidoscope
-is a 64-bit floating point type (aka 'double' in C parlance). As such,
-all values are implicitly double precision and the language doesn't
-require type declarations. This gives the language a very nice and
-simple syntax. For example, the following simple example computes
-`Fibonacci numbers: <http://en.wikipedia.org/wiki/Fibonacci_number>`_
-
-::
-
-    # Compute the x'th fibonacci number.
-    def fib(x)
-      if x < 3 then
-        1
-      else
-        fib(x-1)+fib(x-2)
-
-    # This expression will compute the 40th number.
-    fib(40)
-
-We also allow Kaleidoscope to call into standard library functions (the
-LLVM JIT makes this completely trivial). This means that you can use the
-'extern' keyword to define a function before you use it (this is also
-useful for mutually recursive functions). For example:
-
-::
-
-    extern sin(arg);
-    extern cos(arg);
-    extern atan2(arg1 arg2);
-
-    atan2(sin(.4), cos(42))
-
-A more interesting example is included in Chapter 6 where we write a
-little Kaleidoscope application that `displays a Mandelbrot
-Set <LangImpl06.html#kicking-the-tires>`_ at various levels of magnification.
-
-Lets dive into the implementation of this language!
-
-The Lexer
-=========
-
-When it comes to implementing a language, the first thing needed is the
-ability to process a text file and recognize what it says. The
-traditional way to do this is to use a
-"`lexer <http://en.wikipedia.org/wiki/Lexical_analysis>`_" (aka
-'scanner') to break the input up into "tokens". Each token returned by
-the lexer includes a token code and potentially some metadata (e.g. the
-numeric value of a number). First, we define the possibilities:
-
-.. code-block:: c++
-
-    // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
-    // of these for known things.
-    enum Token {
-      tok_eof = -1,
-
-      // commands
-      tok_def = -2,
-      tok_extern = -3,
-
-      // primary
-      tok_identifier = -4,
-      tok_number = -5,
-    };
-
-    static std::string IdentifierStr; // Filled in if tok_identifier
-    static double NumVal;             // Filled in if tok_number
-
-Each token returned by our lexer will either be one of the Token enum
-values or it will be an 'unknown' character like '+', which is returned
-as its ASCII value. If the current token is an identifier, the
-``IdentifierStr`` global variable holds the name of the identifier. If
-the current token is a numeric literal (like 1.0), ``NumVal`` holds its
-value. Note that we use global variables for simplicity, this is not the
-best choice for a real language implementation :).
-
-The actual implementation of the lexer is a single function named
-``gettok``. The ``gettok`` function is called to return the next token
-from standard input. Its definition starts as:
-
-.. code-block:: c++
-
-    /// gettok - Return the next token from standard input.
-    static int gettok() {
-      static int LastChar = ' ';
-
-      // Skip any whitespace.
-      while (isspace(LastChar))
-        LastChar = getchar();
-
-``gettok`` works by calling the C ``getchar()`` function to read
-characters one at a time from standard input. It eats them as it
-recognizes them and stores the last character read, but not processed,
-in LastChar. The first thing that it has to do is ignore whitespace
-between tokens. This is accomplished with the loop above.
-
-The next thing ``gettok`` needs to do is recognize identifiers and
-specific keywords like "def". Kaleidoscope does this with this simple
-loop:
-
-.. code-block:: c++
-
-      if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
-        IdentifierStr = LastChar;
-        while (isalnum((LastChar = getchar())))
-          IdentifierStr += LastChar;
-
-        if (IdentifierStr == "def")
-          return tok_def;
-        if (IdentifierStr == "extern")
-          return tok_extern;
-        return tok_identifier;
-      }
-
-Note that this code sets the '``IdentifierStr``' global whenever it
-lexes an identifier. Also, since language keywords are matched by the
-same loop, we handle them here inline. Numeric values are similar:
-
-.. code-block:: c++
-
-      if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
-        std::string NumStr;
-        do {
-          NumStr += LastChar;
-          LastChar = getchar();
-        } while (isdigit(LastChar) || LastChar == '.');
-
-        NumVal = strtod(NumStr.c_str(), 0);
-        return tok_number;
-      }
-
-This is all pretty straight-forward code for processing input. When
-reading a numeric value from input, we use the C ``strtod`` function to
-convert it to a numeric value that we store in ``NumVal``. Note that
-this isn't doing sufficient error checking: it will incorrectly read
-"1.23.45.67" and handle it as if you typed in "1.23". Feel free to
-extend it :). Next we handle comments:
-
-.. code-block:: c++
-
-      if (LastChar == '#') {
-        // Comment until end of line.
-        do
-          LastChar = getchar();
-        while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
-
-        if (LastChar != EOF)
-          return gettok();
-      }
-
-We handle comments by skipping to the end of the line and then return
-the next token. Finally, if the input doesn't match one of the above
-cases, it is either an operator character like '+' or the end of the
-file. These are handled with this code:
-
-.. code-block:: c++
-
-      // Check for end of file. Don't eat the EOF.
-      if (LastChar == EOF)
-        return tok_eof;
-
-      // Otherwise, just return the character as its ascii value.
-      int ThisChar = LastChar;
-      LastChar = getchar();
-      return ThisChar;
-    }
-
-With this, we have the complete lexer for the basic Kaleidoscope
-language (the `full code listing <LangImpl02.html#full-code-listing>`_ for the Lexer
-is available in the `next chapter <LangImpl02.html>`_ of the tutorial).
-Next we'll `build a simple parser that uses this to build an Abstract
-Syntax Tree <LangImpl02.html>`_. When we have that, we'll include a
-driver so that you can use the lexer and parser together.
-
-`Next: Implementing a Parser and AST <LangImpl02.html>`_
-
+The Kaleidoscope Tutorial has `moved to another location <MyFirstLanguageFrontend/index>`_ .
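
For readers who want to see how the lexer fragments in the removed text fit together, here is a minimal sketch that assembles them into one piece. It is not the tutorial's code. The ``StringLexer`` class and its ``nextChar`` helper are hypothetical names introduced here, and the sketch reads from an in-memory string rather than from standard input via ``getchar()``, so that it can be exercised without a terminal. The token codes and the order of the checks follow the removed text.

```cpp
#include <cctype>
#include <cstdio>   // EOF
#include <cstdlib>  // strtod
#include <string>
#include <utility>

// Token codes as listed in the removed text. Any other character is
// returned as its own (non-negative) character value.
enum Token {
  tok_eof = -1,
  tok_def = -2,
  tok_extern = -3,
  tok_identifier = -4,
  tok_number = -5,
};

// Hypothetical wrapper (an assumption, not part of the tutorial): keeps the
// lexer state in an object and reads from a string instead of stdin.
class StringLexer {
public:
  explicit StringLexer(std::string Text) : Input(std::move(Text)) {}

  std::string IdentifierStr; // Filled in if tok_identifier
  double NumVal = 0.0;       // Filled in if tok_number

  // Return the next token from the input string.
  int gettok() {
    // Skip any whitespace.
    while (isspace(LastChar))
      LastChar = nextChar();

    if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
      IdentifierStr = static_cast<char>(LastChar);
      while (isalnum(LastChar = nextChar()))
        IdentifierStr += static_cast<char>(LastChar);
      if (IdentifierStr == "def")
        return tok_def;
      if (IdentifierStr == "extern")
        return tok_extern;
      return tok_identifier;
    }

    if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
      std::string NumStr;
      do {
        NumStr += static_cast<char>(LastChar);
        LastChar = nextChar();
      } while (isdigit(LastChar) || LastChar == '.');
      // As in the removed text, no error checking: "1.23.45.67" becomes 1.23.
      NumVal = strtod(NumStr.c_str(), nullptr);
      return tok_number;
    }

    if (LastChar == '#') { // Comment until end of line.
      do
        LastChar = nextChar();
      while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
      if (LastChar != EOF)
        return gettok();
    }

    // Check for end of input. Don't consume the EOF.
    if (LastChar == EOF)
      return tok_eof;

    // Otherwise, return the character itself.
    int ThisChar = LastChar;
    LastChar = nextChar();
    return ThisChar;
  }

private:
  // Stand-in for getchar(): next character as unsigned char, or EOF.
  int nextChar() {
    if (Pos >= Input.size())
      return EOF;
    return static_cast<unsigned char>(Input[Pos++]);
  }

  std::string Input;
  std::size_t Pos = 0;
  int LastChar = ' ';
};
```

Constructing ``StringLexer L("def foo(x) x+1.5")`` and calling ``L.gettok()`` repeatedly yields one token code per call, with ``L.IdentifierStr`` or ``L.NumVal`` set after identifier and number tokens. Holding the state in an object rather than in globals is a design choice of this sketch only; the removed text deliberately uses globals for brevity.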