feat: implement mapping language lexer with full token set #43

Merged: alvinreal merged 1 commit into main from fix/issue-15 on Feb 21, 2026

@alvinreal (Owner)

Summary

Closes #15 — Mapping lexer.
Part of #3 (Mapping DSL epic).

Changes

Implements a complete tokenizer for the morph mapping DSL in src/mapping/lexer.rs.

Token types supported:

  • 18 keywords: rename, select, drop, set, default, cast, as, where, sort, each, when, not, and, or, flatten, nest, asc, desc
  • 13 operators: -> = == != > >= < <= + - * / %
  • 8 delimiters: { } ( ) [ ] , .
  • String literals with escape sequences: \" \\ \n \t \r \uXXXX, plus direct UTF-8
  • Number literals: integers, floats, scientific notation (1e10, 5e-3), negative numbers with context-aware unary minus vs subtraction disambiguation
  • Boolean (true/false) and null literals
  • Identifiers for function names and field references
  • Newlines as statement separators (consecutive newlines collapsed)
  • Line comments (# ...)
  • Span tracking (line:column) for error reporting
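As an illustration of the escape handling listed above, here is a minimal sketch of decoding a string-literal body. It is not the code from src/mapping/lexer.rs: the `unescape` function and the `EscapeError` type are hypothetical names invented for this example, and rejecting surrogate code points in `\uXXXX` is an assumption about the intended behaviour.

```rust
/// Hypothetical error type for this sketch; the PR does not show the real one.
#[derive(Debug, PartialEq)]
pub enum EscapeError {
    /// Backslash followed by a character outside the supported set.
    InvalidEscape(char),
    /// `\u` not followed by four hex digits, or not a valid scalar value.
    InvalidUnicode,
    /// Input ended immediately after a backslash.
    UnexpectedEnd,
}

/// Decode the body of a string literal (the text between the quotes).
pub fn unescape(body: &str) -> Result<String, EscapeError> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c); // direct UTF-8 passes through unchanged
            continue;
        }
        match chars.next().ok_or(EscapeError::UnexpectedEnd)? {
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            'n' => out.push('\n'),
            't' => out.push('\t'),
            'r' => out.push('\r'),
            'u' => {
                // Read exactly four hex digits.
                let mut code = 0u32;
                for _ in 0..4 {
                    let digit = chars
                        .next()
                        .and_then(|h| h.to_digit(16))
                        .ok_or(EscapeError::InvalidUnicode)?;
                    code = code * 16 + digit;
                }
                // Surrogates (D800..=DFFF) are not scalar values and are rejected here.
                out.push(char::from_u32(code).ok_or(EscapeError::InvalidUnicode)?);
            }
            other => return Err(EscapeError::InvalidEscape(other)),
        }
    }
    Ok(out)
}
```

Returning a typed error for each failure mode lets the caller attach a span to it, which matches the position-checked error cases described in the test list.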

Key design decisions:

  • -7 is a negative integer literal when preceded by an operator/delimiter/newline, but x -7 tokenizes as Ident(x), Minus, Int(7) (subtraction)
  • 42.field tokenizes as Int(42), Dot, Ident(field) — the dot is only a decimal point if followed by a digit
  • Keyword matching is exact: setter is an identifier, not set + ter
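The three decisions above can be sketched in a reduced lexer. This is a simplified illustration and not the merged implementation: the `Tok` enum, `lex`, `minus_starts_literal`, and `digit_at` are hypothetical names, only one keyword (`set`) and three punctuation characters are handled, and spans, strings, and error reporting are left out.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Set, // stand-in for the full keyword set
    Ident(String),
    Int(i64),
    Float(f64),
    Minus,
    Dot,
    Eq,
}

/// A `-` starts a negative literal unless the previous token is an operand.
fn minus_starts_literal(prev: Option<&Tok>) -> bool {
    !matches!(prev, Some(Tok::Ident(_) | Tok::Int(_) | Tok::Float(_)))
}

fn digit_at(chars: &[char], i: usize) -> bool {
    chars.get(i).map_or(false, |c| c.is_ascii_digit())
}

pub fn lex(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut toks: Vec<Tok> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let signed = c == '-' && minus_starts_literal(toks.last()) && digit_at(&chars, i + 1);
        if c.is_whitespace() {
            i += 1;
        } else if c.is_ascii_digit() || signed {
            let start = i;
            i += 1; // consume the sign or the first digit
            while digit_at(&chars, i) {
                i += 1;
            }
            // The dot is a decimal point only when a digit follows it.
            let is_float = chars.get(i) == Some(&'.') && digit_at(&chars, i + 1);
            if is_float {
                i += 1;
                while digit_at(&chars, i) {
                    i += 1;
                }
            }
            let text: String = chars[start..i].iter().collect();
            // `unwrap` keeps the sketch short; a real lexer would report overflow as an error.
            toks.push(if is_float {
                Tok::Float(text.parse().unwrap())
            } else {
                Tok::Int(text.parse().unwrap())
            });
        } else if c.is_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            // Keywords are compared against the whole word, never a prefix.
            toks.push(if word == "set" { Tok::Set } else { Tok::Ident(word) });
        } else {
            toks.push(match c {
                '-' => Tok::Minus,
                '.' => Tok::Dot,
                '=' => Tok::Eq,
                other => panic!("character {:?} is outside this sketch", other),
            });
            i += 1;
        }
    }
    toks
}
```

With these rules, `lex("x = -7")` yields `Ident(x), Eq, Int(-7)` while `lex("x -7")` yields `Ident(x), Minus, Int(7)`. The sign decision only needs the previously emitted token, so no backtracking is required.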

Tests (100+):

  • Every keyword, operator, and delimiter individually
  • String escapes (quote, backslash, newline, tab, CR, unicode BMP, UTF-8 emoji)
  • Number formats (int, zero, float, negative, scientific, large)
  • Paths (.a.b.c, .[0], .[*], .["key"], mixed)
  • Comments (ignored, with code, preserves newlines, end-of-line)
  • Whitespace handling (spaces, tabs, mixed)
  • Newline collapsing (multiple → one, no leading/trailing)
  • Full statements (rename, set, select, cast, sort, each block, function calls)
  • Subtraction vs negative number disambiguation
  • Error cases (unterminated strings, invalid characters, invalid escapes, invalid numbers) with position checking
  • Span correctness across multiple tokens and lines
  • Edge cases (empty input, whitespace-only, comment-only, integer-then-dot-field)
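The newline-collapsing and comment cases in the list above can be pictured with a small line-oriented sketch. It is an assumption-laden simplification and not the lexer from this PR: `Tok` and `tokenize_lines` are hypothetical names, words are split on whitespace only, and a `#` is always treated as a comment start (a real lexer would have to ignore `#` inside string literals).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Word(String),
    Newline,
}

/// Emit at most one `Newline` between statements, and none at the start or end.
pub fn tokenize_lines(src: &str) -> Vec<Tok> {
    let mut toks: Vec<Tok> = Vec::new();
    for line in src.lines() {
        // Drop a trailing `# ...` comment (assumes no '#' inside strings).
        let code = line.split('#').next().unwrap_or("");
        let before = toks.len();
        toks.extend(code.split_whitespace().map(|w| Tok::Word(w.to_string())));
        // Only lines that produced tokens get a separator, so blank and
        // comment-only lines collapse into the previous separator.
        if toks.len() > before {
            toks.push(Tok::Newline);
        }
    }
    // Remove the separator after the final statement.
    if toks.last() == Some(&Tok::Newline) {
        toks.pop();
    }
    toks
}
```

Because a separator is emitted only after a line that produced tokens, empty input, whitespace-only input, and comment-only input all yield an empty token list.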

All checks pass: cargo build, cargo test (361 unit + 45 integration = 406 tests), cargo fmt --check, cargo clippy -- -D warnings.

- Implement tokenizer for the morph mapping DSL with complete token coverage:
  18 keywords (rename, select, drop, set, default, cast, as, where, sort,
  each, when, not, and, or, flatten, nest, asc, desc)
- 13 operators (->  =  ==  !=  >  >=  <  <=  +  -  *  /  %)
- 8 delimiters ({ } ( ) [ ] , .)
- String literals with escape sequences (\" \\ \n \t \r \uXXXX)
  and direct UTF-8 support
- Number literals: integers, floats, scientific notation, negative numbers
  with context-aware unary minus vs subtraction operator disambiguation
- Boolean (true/false) and null literals
- Identifiers for function names and field references
- Newlines as statement separators (collapsed)
- Line comments (# ...)
- Span tracking (line:column) for error reporting
- Add 100+ tests covering: every keyword, operator, and delimiter individually,
  string escapes (quote, backslash, newline, tab, CR, unicode BMP, UTF-8),
  number formats (int, float, negative, scientific, large), paths (.a.b.c,
  .[0], .[*], .["key"]), comments, whitespace handling, newline collapsing,
  full statements (rename, set, select, cast, sort, each blocks, function
  calls), subtraction vs negative number disambiguation, error cases
  (unterminated strings, invalid characters, invalid escapes, invalid numbers),
  span correctness, and edge cases

Fixes #15
@alvinreal merged commit fb4fe38 into main on Feb 21, 2026
6 checks passed
@alvinreal deleted the fix/issue-15 branch on February 21, 2026 08:50
@github-actions bot mentioned this pull request on Mar 20, 2026

Development

Successfully merging this pull request may close these issues.

Mapping language: Lexer (tokenizer)