Skip to content

A professional, multi-pass compiler that brings the warmth of the Urdu language to modern programming. It builds optimized Three-Address Code, running natively on Windows and in-browser via WebAssembly

Notifications You must be signed in to change notification settings

BazilSuhail/YaarScript

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

153 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Custom Compiler πŸš€

A professional, industrial-grade multi-pass compiler written in Rust. This project transforms a high-level C-like language into an optimized Three-Address Code (TAC) intermediate representation. It features a robust Pratt parser, a context-sensitive semantic analyzer, and a sophisticated 5-pass IR optimization pipeline.


�️ Compiler Architecture

The compiler is organized into a modular pipeline, ensuring high maintainability and clear separation of concerns:

  1. Lexical Analysis (Lexer): Scans source code into a stream of typed tokens with full UTF-8/Unicode support.
  2. Syntax Analysis (Parser): Implements a Pratt Parser to construct a high-fidelity Abstract Syntax Tree (AST).
  3. Semantic Analysis:
    • Scope Analyzer: Manages symbol tables, handle shadowing, and ensures declaration-before-use.
    • Type Checker: Enforces strict typing rules and validates operation compatibility.
  4. IR Generation: Translates the validated AST into flat, machine-agnostic Three-Address Code (TAC).
  5. IR Optimization: A persistent optimizer that runs 5 distinct passes (Constant Folding, Propagation, DCE, etc.) until code reachability and efficiency are maximized.

✨ Language Features & Specifications

πŸ”¬ Advanced Type System

  • Strongly Typed: No implicit type coercion (e.g., int + float is an error).
  • Native Types: int, float, double, char, bool, string, void.
  • Enumerations:
    • Declared globally with enum Name { ... }.
    • Automatically assigned sequential integer values.
    • Compatible with int for logic and switch statements.

πŸ” Variable Modifiers

  • const: Marks variables as immutable post-initialization, enabling aggressive compile-time optimizations.
  • global: Declares variables in the global segment, surviving local scope pruning.

🌐 Modern UTF-8 Support

Identifiers can include any Unicode character:

  • Standard: int count = 0;
  • International: int ε˜ι‡ = 10;, int ΠΏΠ΅Ρ€Π΅ΠΌ = 5;
  • Creative: float xπŸ˜’πŸ˜’ = 3.14;

πŸ“ Language Grammar (EBNF)

The full formal grammar used by the parser:

Program Structure

Program          ::= Includes* Declaration* MainDecl?
Includes         ::= "include" ("<" Identifier ">" | StringLiteral)
Declaration      ::= VarDecl | FunctionProto | FunctionDecl | EnumDecl

Declarations & Types

VarDecl          ::= ("const" | "global")? Type Identifier ("=" Expression)? ";"
Type             ::= "int" | "float" | "double" | "char" | "bool" | "string" | "void" | Identifier
FunctionProto    ::= Type Identifier "(" ParamList? ")" ";"
FunctionDecl     ::= Type Identifier "(" ParamList? ")" Block
MainDecl         ::= "main" Block
EnumDecl         ::= "enum" Identifier "{" Identifier ("," Identifier)* "}" ";"
ParamList        ::= (Type Identifier) ("," Type Identifier)*

Statements & Control Flow

Statement        ::= VarDecl | Block | IfStmt | WhileStmt | DoWhileStmt | ForStmt 
                   | SwitchStmt | ReturnStmt | BreakStmt | PrintStmt | ExpressionStmt
Block            ::= "{" Statement* "}"
IfStmt           ::= "if" "(" Expression ")" Block ("else" Block)?
WhileStmt        ::= "while" "(" Expression ")" Block
DoWhileStmt      ::= "do" Block "while" "(" Expression ")" ";"
ForStmt          ::= "for" "(" (VarDecl | ExpressionStmt | ";") Expression? ";" Expression? ")" Block
SwitchStmt       ::= "switch" "(" Expression ")" "{" CaseBlock* DefaultBlock? "}"
CaseBlock        ::= "case" Expression ":" Statement*
ReturnStmt       ::= "return" Expression? ";"
PrintStmt        ::= "print" "(" ExpressionList? ")" ";"

Expressions (Precedence-based)

Expression       ::= Assignment
Assignment       ::= LogicalOr ( "=" Assignment )?
LogicalOr        ::= LogicalAnd ( "||" LogicalAnd )*
Equality         ::= Comparison ( ("==" | "!=") Comparison )*
Comparison       ::= Term ( ("<" | ">" | "<=" | ">=") Term )*
Term             ::= Factor ( ("+" | "-") Factor )*
Factor           ::= Unary ( ("*" | "/" | "%") Unary )*
Unary            ::= ("-" | "!" | "++" | "--") Unary | Postfix
Postfix          ::= Primary ("++" | "--" | Call)*
Primary          ::= Literal | Identifier | "(" Expression ")"

πŸ“– Syntax Analysis: Pratt Parsing

The parser uses the Top-Down Operator Precedence (Pratt) algorithm. This allows the compiler to handle 12 levels of precedence with clean, non-recursive calls for infix operations.

Precedence Operators Association
Call () Left
Postfix ++, -- Left
Unary -, !, ++, -- Right
Factor *, /, % Left
Term +, - Left
Bitwise &, ` , ^, <<, >>`
Comparison <, >, <=, >= Left
Equality ==, != Left
Logical AND && Left
Logical OR `
Assignment = Right

🧠 Semantic Intelligence

Scope Analyzer (ScopeAnalyzer)

Implements a 2-pass approach:

  1. Declaration Collection: Harvests top-level symbols for forward-reference safety.
  2. Lexical Validation: Traverses the AST, maintaining a tree of ScopeFrame nodes. It handles Shadowing (inner blocks overriden outer ones) and prevents duplicate declarations.

Type Checker (TypeChecker)

Enforces the safety of the program:

  • Parameter Checking: Verifies function call argument types and counts.
  • Condition Validation: if/while conditions must be bool.
  • Switch Safety: Only int, char, or enum are allowed for switch expressions.
  • Context Awareness: Prevents break statements outside of loops and return in void functions.

πŸš€ Intermediate Representation (TAC)

The compiler translates the high-level AST into optimized Three-Address Code.

TAC Instruction Set

  • Register-like: Uses virtual temporaries t0, t1, t2...
  • Control Flow: Uses Label:, Goto Label, and ifFalse/ifTrue.
  • Metadata: Preserves const and global qualifiers for downstream backends.

Example Transformation:

// Source: x = 10 + 5 * 2;
t0 = 5 Multiply 2
t1 = 10 Plus t0
int x = t1

✨ IR Optimization passes

The compiler features a persistent optimization engine that reaches a "Fixed Point" (no further changes possible).

  1. Constant Folding: Solves 5 + 10 β†’ 15 at compile time.
  2. Constant Propagation: Replaces variable uses with known constants (e.g., const int a = 10; b = a + 5 β†’ b = 10 + 5).
  3. Copy Propagation: Eliminates chains like t1 = x; t2 = t1 β†’ t2 = x.
  4. Dead Code Elimination (DCE): Multi-step algorithm that prunes assignments to unused variables and removes unreachable labels.
  5. Peephole Optimization:
    • Algebraic Simplification: x + 0 β†’ x, x * 1 β†’ x, x * 0 β†’ 0.
    • Redundant Jumps: Deletes goto L1 if L1 is the next instruction.

πŸ—οΈ Project Structure

Custom-Compiler/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ core/           # Token definitions and AST Node models
β”‚   β”œβ”€β”€ lexer/          # State-machine based Lexical Analyzer
β”‚   β”œβ”€β”€ parser/         # Pratt Parser & AST Visualization tools
β”‚   β”œβ”€β”€ semantics/      # Scope Analyzer and Type Checker
β”‚   β”œβ”€β”€ ir_pipeline/    # TAC Generator and 5-pass Optimizer
β”‚   └── main.rs         # CLI Entry Point
β”œβ”€β”€ docs/               # Detailed technical documentation for each module
β”œβ”€β”€ test_input.txt      # Main integration test file
└── Cargo.toml          # Rust dependencies

πŸ“¦ Installation & Developer Guide

Prerequisites

Building the Project

cargo build --release

Running Tests

The compiler takes a source file as an argument and outputs the AST, Raw TAC, and Optimized TAC.

cargo run -- test_input.txt

πŸŽ“ Project Achievements

  • βœ… Robust Lexer: 35+ token types with full Unicode/Emoji identification.
  • βœ… Pratt Parser: Efficient 12-level precedence handling.
  • βœ… Semantic Suite: Detection of 15+ scope errors and 17+ type errors.
  • βœ… IR Pipeline: Full TAC generation for all control flow constructs.
  • βœ… Fixed-Point Optimizer: Significant reduction in code density and complexity through 5-pass analysis.

🚧 Road Map

  • Full IR Optimization Suite
  • RISC-V Code Generation Backend
  • Register Allocation (Graph Coloring)
  • Structs and Pointers support
  • Standard Library (I/O, String manipulation)

About

A professional, multi-pass compiler that brings the warmth of the Urdu language to modern programming. It builds optimized Three-Address Code, running natively on Windows and in-browser via WebAssembly

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages