In this project we want to write a compiler, and we actually have no idea how to do it right now, so we are going to learn, code, and share our experience with you step by step.
The first thing I like to do is to study this documentation. I strongly recommend that you study the link above, but I'm going to write a summary too.
I'm a newbie, so I decided to first write a compiler for a very simple language and then write a bigger, better one. I'm going to put my exercises in the 101 folder. In this folder I want to write a compiler for a simple calculator which can only do plus, minus, times and division.
The very first thing a compiler needs to do is to tokenize the input. To do this it needs a TOKEN LIST that defines all of the possible token names.
Note: Everything in here is so implicit 😠. For example, to define the token list we should write a variable named `tokens` 😐, and to specify each token we make declarations with the special prefix `t_` to indicate that they define a token.
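For our little calculator the token list could look something like this (the exact token names are just my own choice):

```python
# Required by lex.py: the list of all token names
tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
)
```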
A token is specified by a regex, like `t_PLUS = r'\+'`. Note: the `r` at the beginning of the regex means raw string, and we use `\` because `+` itself has a meaning in regex (it means one or more); we want to say we mean the '+' character itself.
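So for the calculator the simple string rules could look roughly like this:

```python
# Regular expression rules for the operator tokens
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
```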
If some kind of action needs to be performed, a token rule can be specified as a function (there is a small sketch after the list below).
Pay attention:
- All tokens defined by functions are added in the same order as they appear in the lexer file.
- Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first). So suppose you define two regexes, for '=' and '==': if the rules are defined by strings everything is OK, but if they are defined by functions then you should be careful to put the '==' rule before the '=' rule.
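Here is a minimal sketch of such a function rule, a NUMBER token whose action converts the matched text into a real integer:

```python
def t_NUMBER(t):
    r'\d+'
    # the action: turn the matched string into an int
    t.value = int(t.value)
    return t
```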
We most probably have some reserved words in our language, like 'if', and also some names that the user can define, for example the word price in a line of code like 'int price = 100;'. To handle these words I have a dictionary of reserved words, where the key 'if' has the value 'IF', and I added another token named 'ID'. The 'ID' token matches all names that start with a letter or underscore and then have zero or more letters, underscores or digits in the rest. So what about the reserved words? In the ID token function we check whether the 'ID' is a reserved word, and if it is we change its type to the type of the reserved word.
Note: Do not write individual rules for reserved words. It's strange, but if you write a rule like `t_FOR = r'for'` then it will also be triggered by the 'for' at the start of a word like 'forget' 😐.
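A sketch of that approach (the reserved words and the token list here are just examples, not exactly what the calculator needs):

```python
# Map each reserved word to its token type
reserved = {
    'if':   'IF',
    'else': 'ELSE',
    'for':  'FOR',
}

tokens = ['ID', 'NUMBER'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # if the matched name is a reserved word, use its type instead of 'ID'
    t.type = reserved.get(t.value, 'ID')
    return t
```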
The user may like to write comments among his lines of code that should be discarded by the compiler; it's easily done by defining a token rule that returns no value.
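For example, assuming we pick '#' as the comment character (just a guess, your language may differ), the rule would be something like:

```python
def t_COMMENT(t):
    r'\#.*'
    pass  # no return value, so the comment is simply discarded
```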
lex.py knows nothing about what constitutes a line, and as a result it can't keep track of the line number. Also, if you don't define a rule for '\n' it will consider it an illegal character.
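So we define a rule for newlines ourselves and update the line counter by hand, roughly like this:

```python
def t_newline(t):
    r'\n+'
    # lex.py doesn't track lines, so we count them ourselves
    t.lexer.lineno += len(t.value)
```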
The `t_ignore` rule is used for characters that should be completely ignored, for example whitespace.
Note: the characters given in `t_ignore` are not ignored when such characters are part of other regular expression patterns. For example, if you had a rule to capture quoted text, that pattern can include the ignored characters.
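For plain spaces and tabs it is just:

```python
# Characters that are completely ignored between tokens
t_ignore = ' \t'
```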
A literal character is simply a single character that is returned "as is". There may be many literal characters in your language, and it would be boring and time consuming to define a token and then a regex rule for each and every one of them. We can specify them using the `literals` keyword. Literals are checked after all the regex rules. When a literal is returned, both its type and value attributes are set to the character itself. It's possible to write token functions for literals, but remember to set the token type appropriately.
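For instance, we could have declared the four calculator operators as literals instead of writing t_ rules for them (just a sketch, the code in the 101 folder may do it the other way):

```python
# Single characters returned "as is": type and value are both the character
literals = ['+', '-', '*', '/']
```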
The `t_error()` function is used to handle lexing errors. `t.value` contains the rest of the input string.
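A typical error handler just reports the offending character and skips it:

```python
def t_error(t):
    # t.value holds the rest of the input, so t.value[0] is the bad character
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
```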
I think I have done enough for the lexer part; now it's time to go to the parsing phase.
Syntax is usually specified in terms of a BNF grammar.
yacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a bottom-up technique that tries to recognize the right-hand side of various grammar rules. Whenever a valid right-hand side is found in the input, the appropriate action code is triggered and the grammar symbols are replaced by the grammar symbol on the left-hand side. Take a close look at the example below:
Each grammar rule is defined by a function whose docstring contains the appropriate context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument `p` that is a sequence containing the values of each grammar symbol in the corresponding rule.
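For example, an addition rule for the calculator could look something like this (the grammar symbol names are only an illustration, the real grammar in the 101 folder may differ):

```python
def p_expression_plus(p):
    'expression : expression PLUS term'
    #   p[0]         p[1]     p[2] p[3]
    p[0] = p[1] + p[3]
```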
The yacc.yacc() function looks at the module and attempts to construct all of the LR parsing tables for the grammar. The first time yacc.yacc() is invoked, you get the message "Generating LALR tables". Since table construction is expensive, the resulting parsing table is written to a file called parsetab.py. In addition, a debugging file called parser.out is created. On subsequent executions, yacc will reload the table from parsetab.py.
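Building and using the parser is then just something like this (a minimal sketch, assuming the lexer and the grammar rules live in the same module):

```python
import ply.yacc as yacc

# Build the parser; this is where the LALR tables are generated or reloaded
parser = yacc.yacc()

result = parser.parse("3 + 4 * 5")
print(result)
```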
If any errors are detected in your grammar specification, yacc.py will raise an exception. Some of the errors include:
- Duplicated function names
- Ambiguous grammar
- Badly specified grammar rules
- Infinite recursion
- Unused rules and tokens
- Undefined rules and tokens
If parsing performance is a concern, you should resist putting too much conditional processing into a single grammar rule. When you add checks to see which grammar rule is being handled, you are actually duplicating work that the parser has already performed.
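A rough sketch of what I mean, first the slow way and then the better way:

```python
# Slower: one rule, then re-checking which alternative actually matched
def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    else:
        p[0] = p[1] - p[3]

# Better: one function per alternative, no extra checks at parse time
def p_expression_plus(p):
    'expression : expression PLUS expression'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS expression'
    p[0] = p[1] - p[3]
```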
One of the things you may like to have is epsilon (the empty production), but in the lex part we didn't define any token for that purpose. Well, there is no need: you can simply write empty rules anywhere by specifying an empty right-hand side, but dabeaz suggests writing an "empty" rule and using that to denote an empty production; he says it's easier to read.
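The "empty" rule itself is trivial (the optitem/item names below are made up just to show how it's used):

```python
def p_empty(p):
    'empty :'
    pass

def p_optitem(p):
    '''optitem : item
               | empty'''
    p[0] = p[1]
```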
Look at the grammar defined in the "AmbiguousGrammarParser" in lexer.py; this grammar is ambiguous. For example, in "3 * 4 + 5" there is no way to tell how the operators are supposed to be grouped. In case of an ambiguous grammar, yacc.py will print messages about a "shift/reduce conflict" or "reduce/reduce conflict". A shift/reduce conflict is caused when the parser generator can't decide whether to reduce a rule or shift a symbol onto the parsing stack. By default all shift/reduce conflicts are resolved in favor of shifting. To resolve the ambiguity, yacc.py allows individual tokens to be assigned a precedence level and associativity.
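For the calculator's four operators the precedence table can be as simple as this (lowest precedence first), and it makes the shift/reduce conflicts go away:

```python
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)
```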
One problem with the precedence specifier technique is that it is sometimes necessary to change the precedence of an operator in certain contexts. For example, consider a unary-minus operator in "3 + 4 * -5". Mathematically, the unary minus is normally given a very high precedence. To deal with this, precedence rules can be given for so-called "fictitious tokens" like this:
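(This mirrors the example in the PLY documentation.)

```python
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),  # the fictitious token for unary minus
)
```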
Now, in the grammar file, we can write our unary minus rule like this:
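(Again mirroring the PLY documentation; the function name is just my choice.)

```python
def p_expr_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]
```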
In this case, `%prec UMINUS` overrides the default rule precedence. UMINUS is not an input token or a grammar rule. Instead, you should think of it as the name of a special marker in the precedence table.