Dana Lexer

A standalone, layout-aware lexer for the Dana language. It recognizes tokens, inserts implicit layout tokens (e.g., AUTOEND), and reports errors precisely with line/column and a caret indicator.

Features

  • Accurate error messages with line/column and caret display.
  • Layout handling: indentation → automatic block end markers.
  • String, char, byte, and int literals with escape support.
  • Configurable debug mode for tracing the lexing process.
  • Stable exit codes (0 on success, non-zero on error).

Prerequisites

  • flex (or lex)
  • make
  • A POSIX shell (Linux / macOS)

Build

From this directory:

make            # build lexer
make clean      # remove generated .cpp/.o files
make distclean  # remove all generated files, including the lexer binary

The build produces the binary: ./lexer.

Usage

Usage: ./lexer [OPTIONS] [input_file]

Options:
  -d, --debug     Enable detailed debug output
  -h, --help      Show this help and exit

Examples:

./lexer program.dana
./lexer -d program.dana
cat program.dana | ./lexer -d
./lexer < program.dana

Options

Option         Description
-d, --debug    Enable detailed debug output from the lexer.
-h, --help     Show short usage help.

Exit Codes

  • 0 — success
  • non-zero — lexing error (diagnostic printed to stderr)
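
A wrapper script can rely on this exit-code contract. The following is a minimal Python sketch, not part of this repository: `run_lexer` is a hypothetical helper name, and the command is whatever your build produced (`./lexer` per the Build section).

```python
import subprocess


def run_lexer(cmd, source):
    """Feed `source` to the lexer on stdin and return (ok, stdout, stderr).

    `cmd` is the argument list, e.g. ["./lexer"] or ["./lexer", "-d"].
    `run_lexer` is a hypothetical helper used for illustration only.
    """
    proc = subprocess.run(cmd, input=source, capture_output=True, text=True)
    # Exit code 0 means success; any non-zero code means a lexing error,
    # in which case the diagnostic is expected on stderr.
    return proc.returncode == 0, proc.stdout, proc.stderr
```

For example, `ok, out, err = run_lexer(["./lexer"], open("program.dana").read())` leaves the token listing in `out` when `ok` is true and the diagnostic in `err` otherwise.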

Examples

Input

def hello
    writeString: "Hello world!\n"

Output (default)

token = 1007, lexeme = def
token = 1024, lexeme = hello
token = 1024, lexeme = writeString
token = 58,   lexeme = :
token = 1027, lexeme = "Hello world!\n"
token = 1,    lexeme = <AUTOEND>
token = 0,    lexeme = <EOF>

Output (--debug)

lexeme 'def' (len=3)
[layout] Opened 'def' block → pushed indent=0
token = 1007, lexeme = def
WS len=1
lexeme 'hello' (len=5)
token = 1024, lexeme = hello
newline
==================== Line: 2 ====================
WS len=4
[layout] Line 2: BOL whitespace len=4
[layout] Line 2: indent=4 → check layout stack
lexeme 'writeString' (len=11)
token = 1024, lexeme = writeString
lexeme ':' (len=1)
token = 58, lexeme = :
WS len=1
lexeme '"Hello world!\n"' (len=16)
token = 1027, lexeme = "Hello world!\n"
newline
==================== Line: 3 ====================
[layout] EOF: pop driver='def' (indent=0)
token = 1, lexeme = <AUTOEND>
token = 0, lexeme = <EOF>

Debug Output Explained

  • WS len=N: Whitespace bytes (spaces/tabs) consumed since last token. At beginning of line (after the banner), it shows indentation length.
  • newline: A line break was consumed. The following banner marks the start of the new line.
  • [layout] …: Indentation-based block handling — push, pop, or check layout stack.
  • Opened 'def' block: Entered a new layout block at column 0.
  • indent=4 → check layout stack: Current line is indented 4 spaces; compared against the stack to decide block continuation or closure.
  • EOF: pop driver='def': At end of file, remaining blocks are closed by emitting AUTOEND.
  • token = …, lexeme = …: Token code and recognized lexeme.
  • <AUTOEND>: Automatic layout token (similar to Python’s implicit DEDENT).
  • <EOF>: End of input.
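
The push, check, and pop steps in the trace can be sketched as a small indent stack. This is an illustration under simplifying assumptions, not the lexer's actual implementation: `layout_events` is a hypothetical name, indentation is counted in leading spaces only, `def` is treated as the only block opener, and a block is assumed to close when a later line is indented no deeper than the line that opened it.

```python
def layout_events(lines, openers=("def",)):
    """Return the <AUTOEND>/<EOF> events for `lines` under a simplified rule.

    Assumptions (not taken from the lexer source): a line whose first word
    is in `openers` pushes its indent, and a later line indented no deeper
    than the top of the stack closes that block.
    """
    stack = []   # indents of the blocks that are still open
    events = []
    for text in lines:
        if not text.strip():
            continue  # blank lines do not affect layout
        indent = len(text) - len(text.lstrip(" "))
        # Close every block that this line is not nested inside.
        while stack and indent <= stack[-1]:
            stack.pop()
            events.append("<AUTOEND>")
        if text.split()[0] in openers:
            stack.append(indent)
    # At end of input, close whatever is still open.
    events.extend("<AUTOEND>" for _ in stack)
    events.append("<EOF>")
    return events
```

For the hello program above, the `def` line pushes indent 0, the second line (indent 4) stays inside the block, and end of input pops the remaining entry, which matches the single `<AUTOEND>` followed by `<EOF>` in the trace.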

Error Reporting

On illegal characters or lexing errors, the lexer prints a diagnostic:

Input

# The $ at the end of the string should trigger a lexer error
def main
    begin
        writeString: "Hello world!"$
    end

Output

Lexer error at line 4, column 37
4:         writeString: "Hello world!"$
                                      ^
Illegal character encountered: $

  • The caret (^) points to the offending column.
  • Process exits with a non-zero code (see Exit Codes).
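
Tools that consume this diagnostic (editor integrations, test scripts) may want the position as numbers. The sketch below parses the header line with a regular expression. It assumes the four-line layout shown above (header, source line, caret, message), and `parse_diagnostic` is a hypothetical helper, not something shipped with the lexer.

```python
import re

# Matches the header line of the diagnostic format shown above.
_HEADER = re.compile(r"^Lexer error at line (\d+), column (\d+)", re.MULTILINE)


def parse_diagnostic(stderr_text):
    """Return {"line", "column", "message"}, or None if no header is found."""
    match = _HEADER.search(stderr_text)
    if match is None:
        return None
    # Assumption: the last non-empty line after the header is the message.
    rest = [ln for ln in stderr_text[match.end():].splitlines() if ln.strip()]
    return {
        "line": int(match.group(1)),
        "column": int(match.group(2)),
        "message": rest[-1] if rest else "",
    }
```

The numbers are returned exactly as the lexer printed them; the sketch makes no assumption about whether columns are counted from 0 or 1.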

Testing

Use the repository’s Python test harness:

cd ../testing
python3 test_lexer.py

The script:

  • Runs Dana programs through the lexer.
  • Compares printed lexemes against expected files in testing/lexer/output.
  • Ignores numeric token codes (only lexemes are checked).

Expected Output File Format (hello.output)

lexeme = def
lexeme = hello
lexeme = writeString
lexeme = :
lexeme = "Hello world!\n"
lexeme = <AUTOEND>
lexeme = <EOF>
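
To produce a file in this format from the lexer's default output, the token codes have to be dropped. The sketch below shows one way to do that; it is not necessarily how test_lexer.py does it, and `lexemes_only` is a hypothetical name.

```python
def lexemes_only(lexer_output):
    """Reduce `token = N, lexeme = X` lines to `lexeme = X` lines."""
    result = []
    for line in lexer_output.splitlines():
        # Split on the first "lexeme = " only, so a lexeme that itself
        # contains that text is kept intact.
        prefix, sep, lexeme = line.partition("lexeme = ")
        # Lines without a token prefix (e.g. debug trace lines) are skipped.
        if sep and prefix.lstrip().startswith("token = "):
            result.append("lexeme = " + lexeme)
    return result
```

Because only lines that start with `token = ` are kept, the same function also works on `--debug` output, where trace lines such as `WS len=1` are interleaved with the token lines.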

Tips & Troubleshooting

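  • Install Flex if missing (Debian/Ubuntu):

    sudo apt-get install flex

  • Ensure files are UTF-8 encoded; lexer positions are computed per byte, so a multi-byte character advances the column by more than one.

  • Prefer LF (\n) line endings. Convert Windows endings if needed:

    dos2unix program.dana

  • Piped input: supported on stdin. In debug mode, layout events are still shown line by line. When typing input directly into a terminal instead, press Ctrl+D to end it; two presses may be needed, since a press made mid-line only flushes the pending text and the next one signals end of input.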
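
Because positions are counted in bytes, a reported column can land to the right of where an editor that counts characters would place it. The sketch below converts one to the other. It assumes the reported column is a 1-based byte offset into a UTF-8 line, which may not match the lexer's exact convention, and `byte_col_to_char_col` is a hypothetical helper.

```python
def byte_col_to_char_col(line, byte_col):
    """Convert a 1-based byte column in `line` to a 1-based character column.

    Assumes `line` is encoded as UTF-8 in the source file.
    """
    raw = line.encode("utf-8")
    prefix = raw[: byte_col - 1]
    # errors="ignore" drops a trailing partial character in case byte_col
    # points into the middle of a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore")) + 1
```

For a line such as `x: "é"$`, the `$` is the 8th byte but the 7th character, so `byte_col_to_char_col('x: "é"$', 8)` returns 7.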