"Everything is controlled by electrical contacts and relays; I won't give you the details, you know all that. And besides, what's more, the piano really works." - Boris Vian, L'Écume des jours
Welcome to ft_lex, a robust reimplementation of the lex utility, written entirely in Zig. Just like Colin's pianocktail, this project transforms regex patterns into token streams, in full compliance with the POSIX.1-2024 standard.
- About The Project
- Architecture
- Core Features
- Quick Start
- Usage
- Example .l File
- The 42 Network Perspective
- Resources
- Contributing
- Project Status
- Acknowledgments
The lex utility is a foundational tool in compiler construction, responsible for lexical analysis. It converts input character streams into token sequences based on regular expression rules.
This from-scratch implementation handles the entire pipeline:
- Parsing `.l` files containing lexer rules
- Constructing finite automata from regular expressions
- Generating C source files (`lex.yy.c`) with state machines for lexical analysis
Written in Zig for performance, safety, and memory control. The generated C code is portable and efficient, relying only on a minimal set of standard library functions.
Multi-stage pipeline based on compiler theory:
```mermaid
graph TD
%% --- Catppuccin Mocha Palette Definition ---
classDef default fill:#313244,stroke:#45475a,stroke-width:2px,color:#a6adc8
classDef input fill:#89b4fa,stroke:#cba6f7,stroke-width:2px,color:#1e1e2e
classDef structure fill:#f9e2af,stroke:#f5c2e7,stroke-width:2px,color:#1e1e2e
classDef final_dfa fill:#cba6f7,stroke:#f5c2e7,stroke-width:2px,color:#1e1e2e
classDef algo fill:#94e2d5,stroke:#a6e3a1,stroke-width:2px,color:#1e1e2e
classDef codegen fill:#f5c2e7,stroke:#f38ba8,stroke-width:2px,color:#1e1e2e
classDef output fill:#a6e3a1,stroke:#94e2d5,stroke-width:2px,color:#1e1e2e
classDef lib fill:#f38ba8,stroke:#f5c2e7,stroke-width:2px,color:#1e1e2e
%% --- Diagram Structure ---
subgraph "Phase 1: Parsing & AST Generation"
A(fa:fa-file-alt .l File Input):::input -- .l file content --> B["<b>.l File Parser</b><br/>Separates definitions, rules, & user code sections"];
B -- Regex patterns & actions --> C["<b>Regex Parser</b><br/>Generates an Abstract Syntax Tree (AST) for each pattern"]:::structure;
end
subgraph "Phase 2: Finite Automata Construction"
C -- ASTs --> D["<b>NFA Builder</b><br/><i>Thompson's Construction Algorithm</i><br/>Converts each AST into an NFA"]:::algo;
D -- Individual NFAs --> E["<b>NFA-to-DFA Conversion</b><br/><i>Subset Construction Algorithm</i><br/>Converts combined NFA into one DFA"]:::algo;
E -- Generated DFA --> F["<b>DFA Minimization</b><br/><i>Hopcroft's or Moore's Algorithm</i><br/>Reduces DFA states to the minimum"]:::final_dfa;
end
subgraph "Phase 3: Code Generation & Linking"
F -- Minimized DFA Transition Table --> G["<b>Code Generator</b><br/>Translates the DFA's state table into C or Zig source code"]:::codegen;
G -- Source Code --> H(fa:fa-file-code lex.yy.c / .zig Output):::output;
L["<b>libl Runtime Library</b><br/>Provides standard functions and variables like yytext, yyleng, input(), etc."]:::lib
L --> H
end
```
- Extended Regular Expressions (ERE): Full POSIX ERE syntax support
- Standard Library (`libl`): Complete library with all required functions and external variables
- Scanner Control Functions: `input()`, `unput()`, `yywrap()`, `yymore()`, and `yyless()`
- Start Conditions: Inclusive (`%s`) and exclusive (`%x`) with the `BEGIN` macro
- Context Sensitivity: Trailing context (`/`) and line anchoring (`^`, `$`)
- Action Keywords: Including `REJECT` for subsequent rule matching
- Tokenizer & Parser: Hand-written parser for the `.l` format and regex grammar
- AST Construction: Rules converted to Abstract Syntax Trees
- Thompson's Construction: ASTs transformed to NFAs
- Subset Construction: NFAs unified into single DFA
- DFA Minimization: Moore's algorithm reduces states to minimum
- Modular Code Generation: Template-based system generates only required features
- Zig Scanner Generation: Generate scanner source in Zig
- Triple Array Trie Compression: Reduced memory footprint and improved performance
- Graph Visualization (`-g` flag): Generate `.dot` files for Graphviz
- Zig toolchain (version 0.11.0 or later) - Installation Guide
- A lot of caffeine ☕
- Clone the repository: `git clone https://github.com/pix3l-p33p3r/ft_lex.git && cd ft_lex`
- Build the `libl` static library: `zig build libl`
- Build the `ft_lex` executable: `zig build` (the executable will be at `./zig-out/bin/ft_lex`)
- Create a lexer definition file (e.g., `scanner.l`)
- Generate the C source: `./zig-out/bin/ft_lex scanner.l`
- Compile with `libl`: `cc -o scanner lex.yy.c src/libl/libl.a`
- Run it on input: `./scanner < input.txt`
Run `./zig-out/bin/ft_lex --help` for the full list of options. The `examples/` directory contains `.l` files demonstrating various features.
Certain features significantly impact DFA size and performance:
- `REJECT`, `yymore()`, and trailing context (`/`) substantially increase DFA states
- This is inherent to finite-automata handling and is consistent with other implementations such as flex
- Use them judiciously in performance-critical applications
Simple arithmetic lexer:
```lex
%{
#include <stdio.h>
%}
%%
[0-9]+  { printf("NUMBER(%s)\n", yytext); }
"+"     { printf("PLUS\n"); }
"-"     { printf("MINUS\n"); }
"*"     { printf("MULTIPLY\n"); }
"/"     { printf("DIVIDE\n"); }
[ \t\n] { /* ignore whitespace */ }
.       { printf("ERROR: Unexpected character %s\n", yytext); }
%%
int main() {
    yylex();
    return 0;
}

int yywrap() {
    return 1;
}
```

This project exemplifies why we learn low-level programming. You're implementing fundamental computer science concepts powering every compiler, interpreter, and parser.
What you'll learn:
- Finite automata theory
- Compiler construction techniques
- Memory management in systems programming
- State machine design
- Regex engine internals
Pro tips:
- Read POSIX spec carefully
- Test edge cases thoroughly
- Draw state diagrams
- Every bug teaches something
- Thompson, Ken. "Regular expression search algorithm"
- Aho, Sethi, Ullman. "Compilers: Principles, Techniques, and Tools" - The Dragon Book
- POSIX.1-2024 lex specification
- Zig Documentation
- Regular Expressions: From Theory to Practice - Russ Cox's series
- Fork the repo
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add some amazing feature'`)
- Push the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Complete. All mandatory 1337 school requirements are fulfilled; bonus objectives are implemented and tested.
- Boris Vian for poetic inspiration about pianocktails and finite state transducers
- Ken Thompson for regex compilation
- Zig team for making systems programming enjoyable
- 42 students who left wisdom in the halls
- POSIX.1-2024 and flex source as references
- Part of advanced UNIX curriculum at 1337 (42 Network)
"There's only one problem," said Colin. "The loud pedal for the whipped egg. I had to put in a special system of interlocking parts because when you play a tune that's too 'hot,' pieces of omelet fall into the cocktail and it's hard to swallow."
Don't let your regex patterns be too hot, or you might end up with omelet in your token stream!
*Happy lexing!*