Elahe Language 🙈

Introduction

In this project we want to write a compiler, and we actually have no idea how to do it right now, so we are going to learn, code, and share our experiences with you step by step.

The first thing I like to do is to study this documentation. I strongly recommend that you study the link above, but I'm going to write a summary too.

101

I'm a newbie, so I decided to first write a compiler for a very simple language and then write a bigger, better one. I'm going to put my exercises in the 101 folder. In this folder I want to write a compiler for a simple calculator which can only do plus, minus, times, and division.

Tokenizer

The very first thing a compiler needs to do is to tokenize the input. To do this it needs a TOKEN LIST that defines all of the possible token names.

Note: Everything in here is so implicit 😠. For example, to define the token list we have to write a list named tokens 😐, and to specify each token we make declarations with the special prefix t_ to indicate that they define tokens.
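
For the calculator, a minimal token list might look like this (a sketch; the exact names used in 101 may differ):

```python
# Names of all possible tokens; lex.py looks for a variable with exactly this name.
tokens = (
    'NUMBER',
    'PLUS', 'MINUS', 'TIMES', 'DIVIDE',
)
```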

Token specification

A token is specified by a regex, like t_PLUS = r'\+'

Note: the r at the beginning of the regex means raw string, and we use \ because + itself has a meaning in regex (it means one or more); we want to say we mean the '+' character itself.
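
As a sketch, assuming the token names above, the other operators look like this; only characters that have a regex meaning need the backslash:

```python
# Simple tokens specified as raw-string regexes.
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
```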

If some kind of action needs to be performed, a token rule can be specified as a function.
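
For example, a NUMBER rule written as a function can convert the matched text to an integer (a sketch, assuming a NUMBER token exists in the token list):

```python
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)  # the action: turn the matched string into an int
    return t
```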

Pay attention:

  1. All tokens defined by functions are added in the same order as they appear in the lexer file.
  2. Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first). So suppose you define two regexes for '=' and '==': if the rules are defined by strings everything is OK, but if they are defined by functions then you should be careful to put the '==' rule before the '=' rule, as sketched below.
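
A sketch of that pitfall, using hypothetical EQ and ASSIGN token names (they would also have to appear in the token list): because function rules are tried in file order, the longer '==' rule must come first.

```python
def t_EQ(t):
    r'=='
    return t

def t_ASSIGN(t):
    r'='
    return t
```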

Reserved words

We most probably have some reserved words in our language, like 'IF', and also some names that the user can define, for example the word price in a line of code like 'int price = 100;'. To handle these words I have a dictionary for reserved words (the key 'if' has the value 'IF') and added another token named 'ID'. The 'ID' token covers all names that start with a letter or underscore and then have zero or more letters, underscores, or digits in the rest. What about the reserved words? In the ID token function we check whether the 'ID' is a reserved word, and if it is we change its type to the type of the reserved word.
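
A sketch of this approach, with a hypothetical set of reserved words:

```python
reserved = {
    'if':   'IF',
    'else': 'ELSE',
    'for':  'FOR',
}

tokens = ['ID', 'NUMBER'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value, 'ID')  # fall back to 'ID' if it is not reserved
    return t
```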

Note: Do not write individual rules for reserved words. It's strange, but if you write a rule like t_FOR = r'for' then it will also be triggered for a word like 'forget' 😐.

Comment

The user may like to write comments among their lines of code that should be discarded by the compiler; this is easily done by defining a token rule that returns no value.
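
For example, assuming '#' starts a comment (an assumption, not necessarily the comment syntax used in this repo), the rule just matches it and returns nothing:

```python
def t_COMMENT(t):
    r'\#.*'
    pass  # no return value: the matched text is discarded
```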

Line number

lex.py knows nothing about what constitutes a line and as a result can't track the line number. If you don't define a rule for '\n', it will be considered an illegal character.
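
A typical newline rule that keeps the line counter up to date (a sketch):

```python
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)  # count every newline we consumed
```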

Ignored characters

The t_ignore rule is used for characters that should be completely ignored, for example whitespace.

Note: the characters given in t_ignore are not ignored when such characters are part of other regular expression patterns. For example, if you had a rule to capture quoted text, that pattern can include the ignored characters.
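
For the calculator it is probably enough to ignore spaces and tabs between tokens:

```python
t_ignore = ' \t'
```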

Literal characters

A literal character is simply a single character that is returned "as is". There may be many literal characters in your language, and it would be boring and time-consuming to define a token and then a regex rule for each and every one of them. We can specify them using the literals keyword. Literals are checked after all the regex rules. When a literal is returned, both its type and value attributes are set to the character itself. It's possible to write token functions for literals, but remember to set the token type appropriately.
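
A sketch of a literals declaration for the calculator (whether parentheses are supported is an assumption here):

```python
# Each character is returned as a token whose type and value are the character itself.
literals = ['+', '-', '*', '/', '(', ')']
```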

Error handling

The t_error() function is used to handle lexing errors. t.value contains the rest of the input string.
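
A minimal error handler might report the offending character and skip past it (a sketch):

```python
def t_error(t):
    print(f"Illegal character {t.value[0]!r} at line {t.lexer.lineno}")
    t.lexer.skip(1)  # skip one character and keep lexing
```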

Parser

I think I have done enough on the lexer part; now it's time to move on to the parsing phase.

Parsing basics

Syntax is usually specified in terms of a BNF grammar.

yacc uses a parsing technique known as LR parsing or shift-reduce parsing. LR parsing is a bottom-up technique that tries to recognize the right-hand side of various grammar rules. Whenever a valid right-hand side is found in the input, the appropriate action code is triggered and the grammar symbols are replaced by the grammar symbol on the left-hand side. Take a close look at the example below:
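
As a sketch, assume the layered calculator grammar below; parsing "3 + 4 * 5" then proceeds by repeatedly reducing right-hand sides to their left-hand-side symbols:

```
expression : expression PLUS term
           | term
term       : term TIMES factor
           | factor
factor     : NUMBER

3 + 4 * 5
factor + 4 * 5              # factor -> NUMBER
term + 4 * 5                # term -> factor
expression + 4 * 5          # expression -> term
expression + factor * 5     # factor -> NUMBER
expression + term * 5       # term -> factor
expression + term * factor  # factor -> NUMBER
expression + term           # term -> term TIMES factor
expression                  # expression -> expression PLUS term
```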

Parsing methods

Each grammar rule is defined by a function whose docstring contains the appropriate context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument p that is a sequence containing the values of each grammar symbol in the corresponding rule.
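
For example, the addition rule for the calculator could look like this (a sketch):

```python
def p_expression_plus(p):
    'expression : expression PLUS term'
    #  p[0]        p[1]      p[2] p[3]
    p[0] = p[1] + p[3]  # the rule's value is the sum of its two sub-expressions
```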

How does yacc work

The yacc.yacc() function looks at the module and attempts to construct all of the LR parsing tables for the grammar. The first time yacc.yacc() is invoked, you get the message "Generating LALR tables". Since table construction is expensive, the resulting parsing table is written to a file called parsetab.py. In addition, a debugging file called parser.out is created. On subsequent executions, yacc will reload the table from parsetab.py. If any errors are detected in your grammar specification, yacc.py will raise an exception. Some of the errors include the following (a minimal build-and-parse sketch follows the list):

  • Duplicated function names
  • Ambiguous grammar
  • Badly specified grammar rules
  • Infinite recursion
  • Unused rules and tokens
  • Undefined rules and tokens
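
As a rough sketch, assuming the token rules and grammar rules above live in the same module, building and using the parser looks like this:

```python
import ply.lex as lex
import ply.yacc as yacc

# Build the lexer and the LR parser from the rules defined in this module;
# the first run prints "Generating LALR tables" and writes parsetab.py.
lexer = lex.lex()
parser = yacc.yacc()

result = parser.parse('3 + 4 * 5', lexer=lexer)
print(result)
```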

Combining grammar rule functions

If parsing performance is a concern, you should resist putting too much conditional processing into a single grammar rule. When you add checks to see which grammar rule is being handled, you are actually duplicating the work that the parser has already performed.
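
A sketch of what this note warns about: a combined rule has to re-inspect p to figure out which production fired (assuming the '+'/'-' token values are the matched characters):

```python
def p_expression_binop(p):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    # This check re-discovers which production was used, work the parser already did.
    if p[2] == '+':
        p[0] = p[1] + p[3]
    else:
        p[0] = p[1] - p[3]
```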

Empty productions

One of the things you may like to have is epsilon, but in the lex part we didn't define any token for that purpose. Well, there is no need: you can simply write empty rules anywhere by specifying an empty right-hand side. However, dabeaz suggests writing an explicit "empty" rule and using it to denote an empty production; he says it's easier to read.
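
A sketch of that suggestion; the "empty" nonterminal can then appear on the right-hand side of any other rule:

```python
def p_empty(p):
    'empty :'
    pass

# e.g. a hypothetical rule could read: 'optional_sign : MINUS
#                                                     | empty'
```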

Dealing with ambiguity

Look at the grammar defined in the "AmbiguousGrammarParser" in lexer.py; this grammar is ambiguous. For example, in "3 * 4 + 5" there is no way to tell how the operators are supposed to be grouped. In the case of an ambiguous grammar, yacc.py will print messages about a "shift/reduce conflict" or "reduce/reduce conflict". A shift/reduce conflict is caused when the parser generator can't decide whether to reduce a rule or shift a symbol onto the parsing stack. By default all shift/reduce conflicts are resolved in favor of shifting. To resolve ambiguity, yacc.py allows individual tokens to be assigned a precedence level and associativity.
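
A sketch of a precedence table for the calculator tokens, from lowest to highest precedence; 'left' marks left-associative operators:

```python
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)
```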

%prec

One problem with the precedence specifier technique is that it is sometimes necessary to change the precedence of an operator in certain contexts. For example, consider a unary-minus operator in "3 + 4 * -5". Mathematically, the unary minus is normally given a very high precedence. To deal with this, precedence rules can be given for so-called "fictitious tokens" like this:
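
As a sketch, the precedence table gains a fictitious UMINUS entry at the highest level:

```python
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),  # fictitious token: highest precedence, right-associative
)
```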

Now, in the grammar file, we can write our unary minus rule like this:
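
As a sketch (the exact rule in this repo may differ slightly), the rule borrows the precedence of UMINUS with %prec:

```python
def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]
```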

In this case, %prec UMINUS overrides the default rule precedence. UMINUS is not an input token or a grammar rule. Instead, you should think of it as the name of a special marker in the precedence table.

References

About

Elahe Language - Compiler Fall 2020 - Prof. Mohammadreza Razzazi
