A professional, industrial-grade multi-pass compiler written in Rust. This project transforms a high-level C-like language into an optimized Three-Address Code (TAC) intermediate representation. It features a robust Pratt parser, a context-sensitive semantic analyzer, and a sophisticated 5-pass IR optimization pipeline.
The compiler is organized into a modular pipeline, ensuring high maintainability and clear separation of concerns:
- Lexical Analysis (Lexer): Scans source code into a stream of typed tokens with full UTF-8/Unicode support.
- Syntax Analysis (Parser): Implements a Pratt Parser to construct a high-fidelity Abstract Syntax Tree (AST).
- Semantic Analysis:
  - Scope Analyzer: Manages symbol tables, handles shadowing, and ensures declaration-before-use.
  - Type Checker: Enforces strict typing rules and validates operation compatibility.
- IR Generation: Translates the validated AST into flat, machine-agnostic Three-Address Code (TAC).
- IR Optimization: A persistent optimizer that runs 5 distinct passes (Constant Folding, Propagation, DCE, etc.) repeatedly until a fixed point is reached, i.e. no pass changes the code any further.
- Strongly Typed: No implicit type coercion (e.g., `int + float` is an error).
- Native Types: `int`, `float`, `double`, `char`, `bool`, `string`, `void`.
- Enumerations:
  - Declared globally with `enum Name { ... }`.
  - Automatically assigned sequential integer values.
  - Compatible with `int` for logic and switch statements.
- `const`: Marks variables as immutable post-initialization, enabling aggressive compile-time optimizations.
- `global`: Declares variables in the global segment, surviving local scope pruning.
Identifiers can include any Unicode character:
- Standard: `int count = 0;`
- International: `int ει = 10;`, `int перем = 5;`
- Creative: `float xππ = 3.14;`
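A minimal sketch of how a lexer might scan such identifiers. The rule used here is an assumption for illustration, not the project's actual rule: an identifier starts with `_`, an ASCII letter, or any non-ASCII, non-whitespace character, and may continue with those or ASCII digits. The function names are hypothetical.

```rust
/// Assumed rule: `_`, ASCII letters, or any non-ASCII, non-whitespace char
/// (which admits letters from other scripts as well as emoji).
fn is_ident_start(c: char) -> bool {
    c == '_' || c.is_ascii_alphabetic() || (!c.is_ascii() && !c.is_whitespace())
}

/// Continuation characters additionally allow ASCII digits.
fn is_ident_continue(c: char) -> bool {
    is_ident_start(c) || c.is_ascii_digit()
}

/// Scans one identifier from the start of `src`.
/// Returns the identifier and the remaining input, or `None` if `src`
/// does not begin with an identifier.
fn scan_identifier(src: &str) -> Option<(&str, &str)> {
    let mut chars = src.char_indices();
    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }
    // Byte offset of the first char that cannot continue the identifier.
    // Using char_indices keeps the slice boundaries on valid UTF-8 edges.
    let end = chars
        .find(|&(_, c)| !is_ident_continue(c))
        .map(|(i, _)| i)
        .unwrap_or(src.len());
    Some((&src[..end], &src[end..]))
}
```

Working with `char_indices` rather than raw bytes is what makes multi-byte identifiers safe to slice; for example, `scan_identifier("count = 0")` yields `("count", " = 0")`.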
The full formal grammar used by the parser:
Program ::= Includes* Declaration* MainDecl?
Includes ::= "include" ("<" Identifier ">" | StringLiteral)
Declaration ::= VarDecl | FunctionProto | FunctionDecl | EnumDecl
VarDecl ::= ("const" | "global")? Type Identifier ("=" Expression)? ";"
Type ::= "int" | "float" | "double" | "char" | "bool" | "string" | "void" | Identifier
FunctionProto ::= Type Identifier "(" ParamList? ")" ";"
FunctionDecl ::= Type Identifier "(" ParamList? ")" Block
MainDecl ::= "main" Block
EnumDecl ::= "enum" Identifier "{" Identifier ("," Identifier)* "}" ";"
ParamList ::= (Type Identifier) ("," Type Identifier)*
Statement ::= VarDecl | Block | IfStmt | WhileStmt | DoWhileStmt | ForStmt
| SwitchStmt | ReturnStmt | BreakStmt | PrintStmt | ExpressionStmt
Block ::= "{" Statement* "}"
IfStmt ::= "if" "(" Expression ")" Block ("else" Block)?
WhileStmt ::= "while" "(" Expression ")" Block
DoWhileStmt ::= "do" Block "while" "(" Expression ")" ";"
ForStmt ::= "for" "(" (VarDecl | ExpressionStmt | ";") Expression? ";" Expression? ")" Block
SwitchStmt ::= "switch" "(" Expression ")" "{" CaseBlock* DefaultBlock? "}"
CaseBlock ::= "case" Expression ":" Statement*
ReturnStmt ::= "return" Expression? ";"
PrintStmt ::= "print" "(" ExpressionList? ")" ";"
Expression ::= Assignment
Assignment ::= LogicalOr ( "=" Assignment )?
LogicalOr ::= LogicalAnd ( "||" LogicalAnd )*
LogicalAnd ::= Equality ( "&&" Equality )*
Equality ::= Comparison ( ("==" | "!=") Comparison )*
Comparison ::= Term ( ("<" | ">" | "<=" | ">=") Term )*
Term ::= Factor ( ("+" | "-") Factor )*
Factor ::= Unary ( ("*" | "/" | "%") Unary )*
Unary ::= ("-" | "!" | "++" | "--") Unary | Postfix
Postfix ::= Primary ("++" | "--" | Call)*
Primary ::= Literal | Identifier | "(" Expression ")"

The parser uses the Top-Down Operator Precedence (Pratt) algorithm. This lets the compiler handle 12 levels of precedence with a single loop driven by operator binding powers, instead of one recursive function per precedence level.
| Precedence | Operators | Associativity |
|---|---|---|
| Call | `()` | Left |
| Postfix | `++`, `--` | Left |
| Unary | `-`, `!`, `++`, `--` | Right |
| Factor | `*`, `/`, `%` | Left |
| Term | `+`, `-` | Left |
| Bitwise | `&`, `\|`, `^`, `<<`, `>>` | Left |
| Comparison | `<`, `>`, `<=`, `>=` | Left |
| Equality | `==`, `!=` | Left |
| Logical AND | `&&` | Left |
| Logical OR | `\|\|` | Left |
| Assignment | `=` | Right |
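The core of a Pratt parser can be sketched in a few dozen lines. The version below is an illustration only: it covers a subset of the operators, takes whitespace-separated tokens instead of the real lexer's output, and uses a hypothetical `Expr` type and binding-power numbers that are not taken from the project.

```rust
// Hypothetical AST for this sketch; the project's real nodes live in src/core/.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Var(String),
    Unary(String, Box<Expr>),
    Binary(Box<Expr>, String, Box<Expr>),
}

/// (left, right) binding powers. left < right gives left associativity,
/// left > right gives right associativity (only "=" here).
fn infix_binding_power(op: &str) -> Option<(u8, u8)> {
    Some(match op {
        "=" => (2, 1),
        "||" => (3, 4),
        "&&" => (5, 6),
        "==" | "!=" => (7, 8),
        "<" | ">" | "<=" | ">=" => (9, 10),
        "+" | "-" => (11, 12),
        "*" | "/" | "%" => (13, 14),
        _ => return None,
    })
}

const PREFIX_BP: u8 = 15; // unary operators bind tighter than any infix operator

struct Parser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    fn advance(&mut self) -> Option<&'a str> {
        let tok = self.peek();
        self.pos += 1;
        tok
    }

    fn parse_expr(&mut self, min_bp: u8) -> Result<Expr, String> {
        let tok = self.advance().ok_or("unexpected end of input")?;
        // Prefix position: literals, identifiers, groups and unary operators.
        let mut lhs = match tok {
            "(" => {
                let inner = self.parse_expr(0)?;
                if self.advance() != Some(")") {
                    return Err("expected ')'".to_string());
                }
                inner
            }
            "-" | "!" => Expr::Unary(tok.to_string(), Box::new(self.parse_expr(PREFIX_BP)?)),
            ")" => return Err("unexpected ')'".to_string()),
            _ if infix_binding_power(tok).is_some() => {
                return Err(format!("unexpected operator {}", tok))
            }
            _ => match tok.parse::<i64>() {
                Ok(n) => Expr::Num(n),
                Err(_) => Expr::Var(tok.to_string()),
            },
        };
        // Infix position: keep consuming operators that bind at least as
        // tightly as `min_bp`; weaker operators are left for the caller.
        while let Some(op) = self.peek() {
            let (l_bp, r_bp) = match infix_binding_power(op) {
                Some(bp) => bp,
                None => break,
            };
            if l_bp < min_bp {
                break;
            }
            self.advance();
            let rhs = self.parse_expr(r_bp)?;
            lhs = Expr::Binary(Box::new(lhs), op.to_string(), Box::new(rhs));
        }
        Ok(lhs)
    }
}

/// Parses a whitespace-separated expression and rejects trailing tokens.
fn parse(src: &str) -> Result<Expr, String> {
    let mut p = Parser { tokens: src.split_whitespace().collect(), pos: 0 };
    let expr = p.parse_expr(0)?;
    match p.peek() {
        None => Ok(expr),
        Some(t) => Err(format!("unexpected token {}", t)),
    }
}

/// Renders the tree as an S-expression, which makes grouping easy to inspect.
fn to_sexpr(e: &Expr) -> String {
    match e {
        Expr::Num(n) => n.to_string(),
        Expr::Var(v) => v.clone(),
        Expr::Unary(op, x) => format!("({} {})", op, to_sexpr(x)),
        Expr::Binary(l, op, r) => format!("({} {} {})", op, to_sexpr(l), to_sexpr(r)),
    }
}
```

With these binding powers, `parse("10 + 5 * 2")` groups the multiplication first, and `parse("a = b = 1")` nests to the right because `=` has a lower right binding power than left. Adding a precedence level means adding one row to `infix_binding_power` rather than a new parsing function.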
The Scope Analyzer implements a 2-pass approach:
- Declaration Collection: Harvests top-level symbols for forward-reference safety.
- Lexical Validation: Traverses the AST, maintaining a tree of `ScopeFrame` nodes. It handles shadowing (declarations in inner blocks override those in outer ones) and prevents duplicate declarations.
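The shadowing and duplicate-declaration rules can be illustrated with a simplified model. This sketch assumes a stack of frames rather than the tree of `ScopeFrame` nodes described above, and the fields, method names, and error type are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ScopeError {
    Duplicate(String),
    Undeclared(String),
}

/// One lexical scope: maps a declared name to its type name.
#[derive(Default)]
struct ScopeFrame {
    symbols: HashMap<String, String>,
}

/// Simplification: only the chain from the global frame to the current
/// block is kept, which is enough to resolve names during a traversal.
struct ScopeStack {
    frames: Vec<ScopeFrame>,
}

impl ScopeStack {
    fn new() -> Self {
        ScopeStack { frames: vec![ScopeFrame::default()] }
    }

    /// Called when the traversal enters a block.
    fn enter(&mut self) {
        self.frames.push(ScopeFrame::default());
    }

    /// Called when the traversal leaves a block; the global frame stays.
    fn exit(&mut self) {
        if self.frames.len() > 1 {
            self.frames.pop();
        }
    }

    /// Declares `name` in the innermost frame. Redeclaring within the same
    /// frame is an error; reusing a name from an outer frame shadows it.
    fn declare(&mut self, name: &str, ty: &str) -> Result<(), ScopeError> {
        let frame = self.frames.last_mut().expect("global frame always exists");
        if frame.symbols.contains_key(name) {
            return Err(ScopeError::Duplicate(name.to_string()));
        }
        frame.symbols.insert(name.to_string(), ty.to_string());
        Ok(())
    }

    /// Resolves `name` from the innermost frame outwards.
    fn lookup(&self, name: &str) -> Result<&str, ScopeError> {
        self.frames
            .iter()
            .rev()
            .find_map(|f| f.symbols.get(name))
            .map(|ty| ty.as_str())
            .ok_or_else(|| ScopeError::Undeclared(name.to_string()))
    }
}
```

Searching frames from innermost to outermost is what makes an inner `x` hide an outer `x`, while the duplicate check only ever looks at the current frame.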
The Type Checker enforces the safety of the program:
- Parameter Checking: Verifies function call argument types and counts.
- Condition Validation: `if`/`while` conditions must be `bool`.
- Switch Safety: Only `int`, `char`, or `enum` are allowed for switch expressions.
- Context Awareness: Prevents `break` statements outside of loops and `return` in `void` functions.
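A few of these rules can be sketched as small checking functions. Everything here is hypothetical (the `Type` enum, the function names, the error strings), and the operator rules are assumptions; in particular, the compatibility between enums and `int` is not modelled.

```rust
// Hypothetical, reduced set of types for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Type {
    Int,
    Float,
    Char,
    Bool,
    Enum,
}

/// Checks a binary operation without any implicit coercion:
/// both operands must already have the same type.
fn check_binary(op: &str, lhs: Type, rhs: Type) -> Result<Type, String> {
    if lhs != rhs {
        return Err(format!("type mismatch: {:?} {} {:?}", lhs, op, rhs));
    }
    match (op, lhs) {
        // Arithmetic keeps the operand type.
        ("+" | "-" | "*" | "/", Type::Int | Type::Float) => Ok(lhs),
        // Ordering and equality produce a bool.
        ("<" | ">" | "<=" | ">=", Type::Int | Type::Float | Type::Char) => Ok(Type::Bool),
        ("==" | "!=", _) => Ok(Type::Bool),
        // Logical operators require bool operands.
        ("&&" | "||", Type::Bool) => Ok(Type::Bool),
        _ => Err(format!("operator {} is not defined for {:?}", op, lhs)),
    }
}

/// `if`/`while` conditions must be `bool`.
fn check_condition(ty: Type) -> Result<(), String> {
    match ty {
        Type::Bool => Ok(()),
        other => Err(format!("condition must be bool, found {:?}", other)),
    }
}

/// Switch expressions are limited to `int`, `char`, or an enum.
fn check_switch_subject(ty: Type) -> Result<(), String> {
    match ty {
        Type::Int | Type::Char | Type::Enum => Ok(()),
        other => Err(format!("cannot switch on {:?}", other)),
    }
}
```

Returning the result type from `check_binary` lets a checker work bottom-up over an expression tree: each node's type comes from its children, and the first mismatch surfaces as an error.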
The compiler translates the high-level AST into optimized Three-Address Code.
- Register-like: Uses virtual temporaries `t0`, `t1`, `t2`...
- Control Flow: Uses `Label:`, `Goto Label`, and `ifFalse`/`ifTrue`.
- Metadata: Preserves `const` and `global` qualifiers for downstream backends.
Example Transformation:
// Source: x = 10 + 5 * 2;
t0 = 5 Multiply 2
t1 = 10 Plus t0
int x = t1
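A rough sketch of how a generator can produce that sequence: a post-order walk of the expression tree that emits one instruction per operator and returns the name holding the result. The `Expr`, `Op`, and `TacGen` types are hypothetical, only `Plus` and `Multiply` appear in the example above (the other operator names are assumed), and the type prefix on the final assignment is left out.

```rust
// Hypothetical operator names; `Minus` and `Divide` are assumptions.
#[derive(Debug, Clone, Copy)]
enum Op {
    Plus,
    Minus,
    Multiply,
    Divide,
}

// Hypothetical expression AST for this sketch.
enum Expr {
    Num(i64),
    Var(String),
    Binary(Box<Expr>, Op, Box<Expr>),
}

struct TacGen {
    next_temp: usize,
    code: Vec<String>,
}

impl TacGen {
    fn new() -> Self {
        TacGen { next_temp: 0, code: Vec::new() }
    }

    /// Allocates the next virtual temporary: t0, t1, t2, ...
    fn new_temp(&mut self) -> String {
        let name = format!("t{}", self.next_temp);
        self.next_temp += 1;
        name
    }

    /// Emits code for `e` and returns the operand that holds its value.
    /// Operands are generated before the operator that uses them, so the
    /// output is already in execution order.
    fn gen_expr(&mut self, e: &Expr) -> String {
        match e {
            Expr::Num(n) => n.to_string(),
            Expr::Var(v) => v.clone(),
            Expr::Binary(lhs, op, rhs) => {
                let a = self.gen_expr(lhs);
                let b = self.gen_expr(rhs);
                let t = self.new_temp();
                self.code.push(format!("{} = {} {:?} {}", t, a, op, b));
                t
            }
        }
    }

    /// Emits code for `target = value`.
    fn gen_assign(&mut self, target: &str, value: &Expr) {
        let result = self.gen_expr(value);
        self.code.push(format!("{} = {}", target, result));
    }
}
```

For the tree of `10 + 5 * 2`, the literal `10` emits nothing, the inner multiplication claims `t0`, and the addition then claims `t1`, which is why the temporaries are numbered in evaluation order.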
The compiler features a persistent optimization engine that reaches a "Fixed Point" (no further changes possible).
- Constant Folding: Solves `5 + 10` → `15` at compile time.
- Constant Propagation: Replaces variable uses with known constants (e.g., `const int a = 10; b = a + 5` → `b = 10 + 5`).
- Copy Propagation: Eliminates chains like `t1 = x; t2 = t1` → `t2 = x`.
- Dead Code Elimination (DCE): Multi-step algorithm that prunes assignments to unused variables and removes unreachable labels.
- Peephole Optimization:
  - Algebraic Simplification: `x + 0` → `x`, `x * 1` → `x`, `x * 0` → `0`.
  - Redundant Jumps: Deletes `goto L1` if `L1` is the next instruction.
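How the passes feed each other until a fixed point can be sketched on straight-line code. This is a simplified model under stated assumptions: the `Instr` and `Operand` types are hypothetical, only `+` and `*` are handled, there are no labels or jumps, and the caller supplies `live_out`, the names whose values are still needed afterwards.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Const(i64),
    Name(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Instr {
    Copy { dst: String, src: Operand },
    Bin { dst: String, lhs: Operand, op: char, rhs: Operand },
}

use Operand::{Const, Name};

/// Constant folding plus the algebraic identities x+0, x*1 and x*0.
fn fold(instr: &Instr) -> Instr {
    if let Instr::Bin { dst, lhs, op, rhs } = instr {
        let src = match (lhs, *op, rhs) {
            (Const(a), '+', Const(b)) => Const(a.wrapping_add(*b)),
            (Const(a), '*', Const(b)) => Const(a.wrapping_mul(*b)),
            (x, '+', Const(0)) | (Const(0), '+', x) => x.clone(),
            (x, '*', Const(1)) | (Const(1), '*', x) => x.clone(),
            (_, '*', Const(0)) | (Const(0), '*', _) => Const(0),
            _ => return instr.clone(),
        };
        return Instr::Copy { dst: dst.clone(), src };
    }
    instr.clone()
}

fn subst(o: &Operand, known: &HashMap<String, Operand>) -> Operand {
    match o {
        Name(n) => known.get(n).cloned().unwrap_or_else(|| o.clone()),
        Const(_) => o.clone(),
    }
}

/// Constant and copy propagation: rewrite uses with what is known about them.
fn propagate(code: &[Instr]) -> Vec<Instr> {
    let mut known: HashMap<String, Operand> = HashMap::new();
    let mut out = Vec::new();
    for instr in code {
        let (dst, new) = match instr {
            Instr::Copy { dst, src } => {
                (dst.clone(), Instr::Copy { dst: dst.clone(), src: subst(src, &known) })
            }
            Instr::Bin { dst, lhs, op, rhs } => (
                dst.clone(),
                Instr::Bin { dst: dst.clone(), lhs: subst(lhs, &known), op: *op, rhs: subst(rhs, &known) },
            ),
        };
        // `dst` is redefined: forget facts about it and facts that mention it.
        known.remove(&dst);
        known.retain(|_, v| *v != Name(dst.clone()));
        if let Instr::Copy { src, .. } = &new {
            if *src != Name(dst.clone()) {
                known.insert(dst, src.clone());
            }
        }
        out.push(new);
    }
    out
}

/// Dead code elimination via one backward liveness scan.
fn eliminate_dead(code: &[Instr], live_out: &[&str]) -> Vec<Instr> {
    let mut live: HashSet<String> = live_out.iter().map(|s| s.to_string()).collect();
    let mut kept = Vec::new();
    for instr in code.iter().rev() {
        let (dst, uses) = match instr {
            Instr::Copy { dst, src } => (dst, vec![src]),
            Instr::Bin { dst, lhs, rhs, .. } => (dst, vec![lhs, rhs]),
        };
        if !live.contains(dst) {
            continue; // the value is never read afterwards
        }
        live.remove(dst);
        for u in uses {
            if let Name(n) = u {
                live.insert(n.clone());
            }
        }
        kept.push(instr.clone());
    }
    kept.reverse();
    kept
}

/// Runs all passes repeatedly until a full round changes nothing.
fn optimize(mut code: Vec<Instr>, live_out: &[&str]) -> Vec<Instr> {
    loop {
        let folded: Vec<Instr> = code.iter().map(fold).collect();
        let next = eliminate_dead(&propagate(&folded), live_out);
        if next == code {
            return code;
        }
        code = next;
    }
}
```

The loop matters because passes enable one another: in `a = 10; t0 = a + 5; t1 = t0 * 1; x = t1`, propagation first turns `a + 5` into `10 + 5`, which folding can only reduce to `15` on the next round, after which the remaining temporaries become dead.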
Custom-Compiler/
├── src/
│   ├── core/          # Token definitions and AST Node models
│   ├── lexer/         # State-machine based Lexical Analyzer
│   ├── parser/        # Pratt Parser & AST Visualization tools
│   ├── semantics/     # Scope Analyzer and Type Checker
│   ├── ir_pipeline/   # TAC Generator and 5-pass Optimizer
│   └── main.rs        # CLI Entry Point
├── docs/              # Detailed technical documentation for each module
├── test_input.txt     # Main integration test file
└── Cargo.toml         # Rust dependencies
- Rust & Cargo (Stable)
cargo build --release

The compiler takes a source file as an argument and outputs the AST, Raw TAC, and Optimized TAC.

cargo run -- test_input.txt

- ✅ Robust Lexer: 35+ token types with full Unicode/Emoji identification.
- ✅ Pratt Parser: Efficient 12-level precedence handling.
- ✅ Semantic Suite: Detection of 15+ scope errors and 17+ type errors.
- ✅ IR Pipeline: Full TAC generation for all control flow constructs.
- ✅ Fixed-Point Optimizer: Significant reduction in code density and complexity through 5-pass analysis.
- Full IR Optimization Suite
- RISC-V Code Generation Backend
- Register Allocation (Graph Coloring)
- Structs and Pointers support
- Standard Library (I/O, String manipulation)