A standalone, layout-aware lexer for the Dana language.
It recognizes tokens, inserts implicit layout tokens (e.g., AUTOEND), and reports errors precisely with line/column and a caret indicator.
- Accurate error messages with line/column and caret display.
- Layout handling: indentation → automatic block end markers.
- String, char, byte, and int literals with escape support.
- Configurable debug mode for tracing the lexing process.
- Stable exit codes (`0` on success, non-zero on error).
- `flex` (or `lex`)
- `make`
- A POSIX shell (Linux / macOS)
From this directory:
make # build lexer
make clean # remove generated .cpp/.o files
make distclean # remove all generated files, including the lexer binary

The build produces the binary: `./lexer`.
Usage: ./lexer [OPTIONS] [input_file]
Options:
-d, --debug Enable detailed debug output
-h, --help Show this help and exit
Examples:
./lexer program.dana
./lexer -d program.dana
cat program.dana | ./lexer -d
./lexer < program.dana

| Option | Description |
|---|---|
| `-d`, `--debug` | Enable detailed debug output from the lexer. |
| `-h`, `--help` | Show short usage help. |
Exit codes:
- `0`: success
- non-zero: lexing error (diagnostic printed to stderr)
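A script that drives the lexer can branch on these exit codes. The following is a minimal sketch, not part of the repository: `run_lexer` is a hypothetical helper name, and it assumes only what is stated above (input accepted on stdin, `0` on success, diagnostics on stderr).

```python
import subprocess
from typing import List, Tuple

def run_lexer(command: List[str], source: str) -> Tuple[bool, str, str]:
    """Feed `source` to `command` on stdin and return (succeeded, stdout, stderr).

    `succeeded` is True only when the process exits with code 0.
    """
    result = subprocess.run(command, input=source, capture_output=True, text=True)
    return result.returncode == 0, result.stdout, result.stderr

# Example call (assumes ./lexer has already been built in this directory):
#   ok, tokens, diagnostics = run_lexer(["./lexer", "-d"], source_text)
#   if not ok:
#       print(diagnostics)
```

Passing the command as a list avoids shell quoting issues and makes it easy to add `-d` when a trace is needed.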
def hello
   writeString: "Hello world!\n"
token = 1007, lexeme = def
token = 1024, lexeme = hello
token = 1024, lexeme = writeString
token = 58, lexeme = :
token = 1027, lexeme = "Hello world!\n"
token = 1, lexeme = <AUTOEND>
token = 0, lexeme = <EOF>
lexeme 'def' (len=3)
[layout] Opened 'def' block → pushed indent=0
token = 1007, lexeme = def
WS len=1
lexeme 'hello' (len=5)
token = 1024, lexeme = hello
newline
==================== Line: 2 ====================
WS len=3
[layout] Line 2: BOL whitespace len=3
[layout] Line 2: indent=4 → check layout stack
lexeme 'writeString' (len=11)
token = 1024, lexeme = writeString
lexeme ':' (len=1)
token = 58, lexeme = :
WS len=1
lexeme '"Hello world!\n"' (len=16)
token = 1027, lexeme = "Hello world!\n"
newline
==================== Line: 3 ====================
[layout] EOF: pop driver='def' (indent=0)
token = 1, lexeme = <AUTOEND>
token = 0, lexeme = <EOF>
- `WS len=N`: Whitespace bytes (spaces/tabs) consumed since the last token. At the beginning of a line (after the banner), it shows the indentation length.
- `newline`: A line break was consumed. The following banner marks the start of the new line.
- `[layout] …`: Indentation-based block handling: push, pop, or check the layout stack.
  - `Opened 'def' block`: Entered a new layout block at column 0.
  - `indent=4 → check layout stack`: Current line is indented 4 spaces; compared against the stack to decide block continuation or closure.
  - `EOF: pop driver='def'`: At end of file, remaining blocks are closed by emitting `AUTOEND`.
- `token = …, lexeme = …`: Token code and recognized lexeme.
- `<AUTOEND>`: Automatic layout token (similar to Python’s implicit DEDENT).
- `<EOF>`: End of input.
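The push/pop behaviour described by the `[layout]` lines can be illustrated with a small stack-based sketch. It is a simplification written for this README, not the lexer's actual implementation: the `BLOCK_OPENERS` set, the whitespace-only tokenization, and the rule that a line at or left of a block's opening indent closes that block are all assumptions.

```python
from typing import Iterable, List, Tuple

# Hypothetical set of block-opening keywords; the real lexer's list may differ.
BLOCK_OPENERS = {"def"}

def layout_tokens(lines: Iterable[str]) -> List[str]:
    """Return a lexeme stream with <AUTOEND> markers inserted (simplified)."""
    out: List[str] = []
    stack: List[Tuple[str, int]] = []  # (driver keyword, indent of that keyword)
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not affect layout in this sketch
        indent = len(line) - len(line.lstrip(" "))
        # Assumed rule: a line at or left of a block's opening indent closes it.
        while stack and indent <= stack[-1][1]:
            stack.pop()
            out.append("<AUTOEND>")
        words = stripped.split()  # naive split; real lexing handles strings etc.
        if words[0] in BLOCK_OPENERS:
            stack.append((words[0], indent))
        out.extend(words)
    # At EOF, close every block that is still open.
    while stack:
        stack.pop()
        out.append("<AUTOEND>")
    out.append("<EOF>")
    return out
```

For the two-line program `def hello` followed by an indented `writeString`, this sketch produces `def`, `hello`, `writeString`, `<AUTOEND>`, `<EOF>`, mirroring the shape of the trace above.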
On illegal characters or lexing errors, the lexer prints a diagnostic:
# The $ at the end of the string should trigger a lexer error
def main
begin
writeString: "Hello world!"$
endd
Lexer error at line 4, column 37
4: writeString: "Hello world!"$
^
Illegal character encountered: $
- The caret (`^`) points to the offending column.
- Process exits with a non-zero code (see Exit Codes).
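For readers who want to reproduce this diagnostic layout in their own tooling, here is a minimal sketch. `format_diagnostic` is a hypothetical helper, and it assumes 1-based columns and a caret that is offset by the width of the `N: ` line prefix; the lexer's real formatting code may differ.

```python
def format_diagnostic(line_no: int, column: int, source_line: str, message: str) -> str:
    """Build a four-line diagnostic; `column` is assumed to be 1-based."""
    prefix = f"{line_no}: "
    # Shift the caret right by the prefix width so it sits under the echoed line.
    caret = " " * (len(prefix) + column - 1) + "^"
    return "\n".join([
        f"Lexer error at line {line_no}, column {column}",
        prefix + source_line.rstrip("\n"),
        caret,
        message,
    ])

print(format_diagnostic(4, 5, "ab c$", "Illegal character encountered: $"))
```

The caret only lines up when the source line contains no tabs or multi-byte characters before the error column, since the padding is counted in single spaces.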
Use the repository’s Python test harness:
cd ../testing
python3 test_lexer.py

The script:
- Runs Dana programs through `lexer`.
- Compares printed lexemes against expected files in `testing/lexer/output`.
- Ignores numeric token codes (only lexemes are checked).
lexeme = def
lexeme = hello
lexeme = writeString
lexeme = :
lexeme = "Hello world!\n"
lexeme = <AUTOEND>
lexeme = <EOF>
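The comparison described above can be sketched as follows. This is not the code of `test_lexer.py`; the helper names are hypothetical, and the expected-file format is inferred from the sample shown here (one `lexeme = …` entry per line).

```python
import re
from typing import List

# Matches lines such as "token = 1007, lexeme = def" and captures the lexeme.
TOKEN_LINE = re.compile(r"^token = -?\d+, lexeme = (.*)$")

def extract_lexemes(lexer_output: str) -> List[str]:
    """Keep only the lexeme of each token line; other lines are skipped."""
    lexemes = []
    for line in lexer_output.splitlines():
        match = TOKEN_LINE.match(line.strip())
        if match:
            lexemes.append(f"lexeme = {match.group(1)}")
    return lexemes

def matches_expected(lexer_output: str, expected_text: str) -> bool:
    """Compare extracted lexemes against the non-empty lines of an expected file."""
    expected = [l.strip() for l in expected_text.splitlines() if l.strip()]
    return extract_lexemes(lexer_output) == expected
```

Dropping the numeric codes keeps the expected files valid even if token numbering changes between builds.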
- Install Flex if missing: `sudo apt-get install flex`
- Ensure files are UTF-8 encoded; lexer positions are computed per byte.
- Prefer LF (`\n`) line endings. Convert Windows endings if needed: `dos2unix program.dana`
- Piped input: supported on stdin. In debug mode, layout events are still shown line by line. To terminate input, press `Ctrl+D` twice: once to end stdin and once to flush output.
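Because positions are computed per byte, a reported column can be larger than the character position an editor shows when the line contains multi-byte UTF-8 characters. A small sketch of the difference, assuming 1-based columns (`byte_column` is a hypothetical helper, not part of the lexer):

```python
def byte_column(line: str, char_index: int) -> int:
    """1-based byte column of the character at 0-based `char_index` in a UTF-8 line."""
    return len(line[:char_index].encode("utf-8")) + 1

line = 'writeString: "café"$'
char_index = line.index("$")
print(char_index + 1)                 # column counted in characters: 20
print(byte_column(line, char_index))  # column counted in UTF-8 bytes: 21
```

Here `é` occupies two bytes, so the byte-based column of `$` is one higher than its character-based column.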