language python
framework none
build_cmd python3 -m py_compile main.py tokenizer/tokenizer.py lexers/lexer.py validate.py hard_validate.py
test_cmd python3 -m pytest tests/ -v
lint_cmd echo 'lint not configured'
fmt_cmd echo 'format not configured'
birth_date 2026-03-13

Write only code and tests that satisfy the features and scenarios of this behaviour-driven development document.

System: A YAML-configured lexer (tokenizer) that converts source code files into a stream of typed JSON tokens. Language support is defined by YAML lexer files. The tokenizer is invoked from the command line and outputs a JSON array.

Feature: Tokenise source code
    As a developer
    I want to tokenise source code using a YAML lexer config
    So that I can analyse or transform code programmatically

    Scenario: Tokenise a Python hello-world file
        Given a Python source file and the Python lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And the output contains a keyword token for "print"

    Scenario: Tokenise a JavaScript hello-world file
        Given a JavaScript source file and the JavaScript lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And the output contains a keyword token for "function"

    Scenario: Tokenise a TypeScript hello-world file
        Given a TypeScript source file and the TypeScript lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And the output contains a keyword token for "function"

    Scenario: Tokenise a Rust hello-world file
        Given a Rust source file and the Rust lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And the output contains a keyword token for "fn"

    Scenario: Tokenise a C++ hello-world file
        Given a C++ source file and the C++ lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And the output contains a keyword token for "int"

    Scenario: Tokenise a Fortran hello-world file
        Given a Fortran source file and the Fortran lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And the output contains a keyword token for "program"

    Scenario: Tokenise a Vyper hello-world file
        Given a Vyper source file and the Vyper lexer YAML
        When I run the tokenizer
        Then I receive a JSON array of tokens
        And every token has a non-empty "type" and "value"

Feature: JSON output format
    As a developer consuming PyLex output
    I want the tokenizer output to be valid, well-structured JSON
    So that I can parse and process it reliably

    Scenario: Output is a valid JSON array
        Given any supported source file and its lexer YAML
        When I run the tokenizer
        Then stdout is parseable as a JSON array
        And each element has a "value" field and a "type" field

    Scenario: Token types are non-empty strings
        Given any supported source file and its lexer YAML
        When I run the tokenizer
        Then every token has a non-empty string "type"
        And every token has a non-empty string "value"

    Scenario: Concatenating token values reconstructs the original input
        Given any supported source file and its lexer YAML
        When I run the tokenizer
        Then concatenating all token "value" fields produces the original source file content
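The concatenation scenario above is a simple lossless-output property, checkable in a few lines. The token shape ({"type", "value"}) follows the JSON output feature; the helper name and sample tokens are illustrative.

```python
def round_trips(source, tokens):
    """Joining all token "value" fields must reproduce the source exactly.
    This also implies no token may carry an empty "value"."""
    return "".join(t["value"] for t in tokens) == source

# Sample token stream such as the tokenizer might emit for "x = 1".
tokens = [
    {"type": "identifier", "value": "x"},
    {"type": "whitespace", "value": " "},
    {"type": "operator", "value": "="},
    {"type": "whitespace", "value": " "},
    {"type": "number", "value": "1"},
]
assert round_trips("x = 1", tokens)
```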

    Scenario: Unrecognised characters are reported on stderr not stdout
        Given a source file containing a character not matched by any token rule
        When I run the tokenizer
        Then stdout remains a valid JSON array
        And the warning about the unrecognised character appears on stderr

Feature: Token type identification
    As a developer
    I want source code tokens to be correctly classified by type
    So that I can distinguish keywords from identifiers, literals, and operators

    Scenario: Keywords are identified as keyword tokens
        Given a source file containing a language keyword
        When I run the tokenizer with the appropriate lexer
        Then the keyword appears as a token with type "keyword"

    Scenario: Identifiers are identified as identifier tokens
        Given a source file containing a user-defined name
        When I run the tokenizer with the appropriate lexer
        Then the name appears as a token with type "identifier"

    Scenario: Whitespace is preserved as whitespace tokens
        Given a source file containing spaces or newlines
        And the lexer YAML defines a whitespace token pattern
        When I run the tokenizer with the appropriate lexer
        Then whitespace appears as tokens with type "whitespace" in the output
        And whitespace is not silently discarded

    Scenario: String literals are identified as string literal tokens
        Given a source file containing a quoted string
        When I run the tokenizer with a lexer that defines a string_literal pattern
        Then the quoted string appears as a token with type "string_literal"

    Scenario: Operators are identified as operator tokens
        Given a source file containing operators such as "+" or "="
        When I run the tokenizer with a lexer that defines operator tokens
        Then each operator appears as a token with type "operator"

    Scenario: Keywords are not misidentified as identifiers
        Given a source file where a keyword appears adjacent to an identifier
        When I run the tokenizer with the appropriate lexer
        Then the keyword token has type "keyword" not "identifier"
        And the identifier token has type "identifier"
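One common way to satisfy the keyword-vs-identifier scenario (not necessarily PyLex's) is to match the longest identifier-shaped lexeme first and then reclassify it if the whole lexeme is a keyword, so that "iffy" is never split into keyword "if" plus identifier "fy". The keyword set and function name below are hypothetical.

```python
import re

KEYWORDS = {"if", "def", "return"}  # hypothetical subset
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def classify_word(source, pos):
    """Match a whole word at pos, then decide keyword vs identifier."""
    m = IDENT.match(source, pos)
    if not m:
        return None
    lexeme = m.group(0)
    kind = "keyword" if lexeme in KEYWORDS else "identifier"
    return {"type": kind, "value": lexeme}

assert classify_word("iffy = 1", 0) == {"type": "identifier", "value": "iffy"}
assert classify_word("if x:", 0) == {"type": "keyword", "value": "if"}
```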

Feature: Comprehensive language tokenisation
    As a developer
    I want each bundled lexer to handle common real-world code constructs
    So that the tokenizer is useful beyond hello-world examples

    Scenario: Tokenise a Python file with function definition and control flow
        Given a Python source file containing a function definition and an if statement
        When I run the tokenizer with the Python lexer
        Then tokens include keywords "def", "if", and "return"
        And identifiers for the function name and variable names are present

    Scenario: Tokenise a JavaScript file with variable declarations and arrow functions
        Given a JavaScript source file containing "const", "let", and an arrow function
        When I run the tokenizer with the JavaScript lexer
        Then tokens include keywords "const" and "let"
        And the arrow function syntax is tokenised without errors

    Scenario: Tokenise a Rust file with struct and impl definitions
        Given a Rust source file containing a struct and an impl block
        When I run the tokenizer with the Rust lexer
        Then tokens include keywords "struct" and "impl"
        And identifiers for the struct name are present

Feature: CLI error handling
    As a developer running PyLex from the command line
    I want clear error messages when I misuse the tool
    So that I can quickly understand and fix the problem

    Scenario: Missing command-line arguments prints usage and exits non-zero
        Given I run the tokenizer with no arguments
        Then the exit code is non-zero
        And a usage message is printed to stderr

    Scenario: Input file not found exits with a clear error message
        Given I run the tokenizer with a path to a file that does not exist
        Then the exit code is non-zero
        And an error message mentioning the missing file is printed to stderr

    Scenario: Invalid YAML lexer config exits with a clear error message
        Given I run the tokenizer with a lexer config file containing invalid YAML
        Then the exit code is non-zero
        And an error message describing the YAML parse failure is printed to stderr

Feature: Lexer schema validation
    As a lexer author
    I want invalid YAML lexer files to be rejected
    So that I can catch configuration mistakes early

    Scenario: Valid lexer YAML passes validation
        Given a correctly structured lexer YAML file
        When I run validate.py
        Then the exit code is 0
        And no errors are reported

    Scenario: Lexer YAML missing required field fails validation
        Given a lexer YAML file missing the "tokens" field
        When I run validate.py
        Then the exit code is non-zero
        And an error message describes the missing field

    Scenario: Lexer YAML with a token missing both value and pattern fails validation
        Given a lexer YAML file containing a token definition with neither "value" nor "pattern"
        When I run validate.py
        Then the exit code is non-zero
        And an error message describes the invalid token definition

    Scenario: All bundled lexer files pass validation
        Given all YAML files in the lexers/ directory
        When I run validate.py on each one
        Then every file passes validation with exit code 0
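The checks in this feature (and the duplicate-value check from the deduplication feature further down) reduce to a short validation pass over the parsed config. This sketch operates on an already-parsed dict; the exact error wording and function name are illustrative, not validate.py's actual interface.

```python
def validate_lexer(config):
    """Collect schema errors; an empty list means the config is valid."""
    if not isinstance(config, dict) or "tokens" not in config:
        return ['missing required field "tokens"']
    errors, seen_values = [], set()
    for i, tok in enumerate(config["tokens"]):
        # Every token needs a literal value or a regex pattern.
        if "value" not in tok and "pattern" not in tok:
            errors.append(f'token {i}: needs "value" or "pattern"')
        # Two tokens with the same literal value are silent redundancy.
        v = tok.get("value")
        if v is not None:
            if v in seen_values:
                errors.append(f'token {i}: duplicate value {v!r}')
            seen_values.add(v)
    return errors

assert validate_lexer({"tokens": [{"type": "keyword", "value": "fn"}]}) == []
assert validate_lexer({}) == ['missing required field "tokens"']
```

A real validate.py would parse the YAML file first, print these messages to stderr, and exit non-zero when the list is non-empty.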

Feature: Custom lexer configuration
    As a developer
    I want to write my own lexer YAML configuration
    So that I can tokenise languages or dialects not bundled with PyLex

    Scenario: A custom lexer tokenises a simple DSL
        Given a custom lexer YAML defining two token types and a simple input string
        When I run the tokenizer with that custom lexer
        Then the output contains tokens matching the custom definitions
        And no tokens from the built-in lexers appear in the output

Feature: Comment tokenisation
    As a developer
    I want comments in source code to be tokenised correctly
    So that I can preserve or strip them in downstream tooling

    Scenario: Single-line comments are tokenised as comment tokens
        Given a source file containing a single-line comment (e.g. "# hello" or "// hello")
        When I run the tokenizer with a lexer that defines a single-line comment pattern
        Then the comment text appears as a token with type "comment"
        And the concatenation of all token values reconstructs the original input

    Scenario: Multi-line comments are tokenised as comment tokens
        Given a source file containing a block comment (e.g. "/* ... */")
        When I run the tokenizer with a lexer that defines a multi-line comment pattern
        Then the entire block comment appears as a token with type "comment"
        And the concatenation of all token values reconstructs the original input

Feature: Import statement tokenisation
    As a developer
    I want import and use statements to be tokenised correctly
    So that I can analyse dependencies programmatically

    Scenario: Python import statement is tokenised correctly
        Given a Python source file containing "import os" and "from sys import path"
        When I run the tokenizer with the Python lexer
        Then "import" and "from" appear as keyword tokens
        And "os", "sys", and "path" appear as identifier tokens

    Scenario: JavaScript import statement is tokenised correctly
        Given a JavaScript source file containing an ES6 import statement
        When I run the tokenizer with the JavaScript lexer
        Then "import" and "from" appear as keyword tokens
        And the module name appears as a string literal token

    Scenario: Rust use statement is tokenised correctly
        Given a Rust source file containing a "use" declaration
        When I run the tokenizer with the Rust lexer
        Then "use" appears as a keyword token
        And the path components appear as identifier tokens

Feature: Multi-character operator tokenisation
    As a developer
    I want multi-character operators to be tokenised as single tokens
    So that "==" is not split into two "=" tokens

    Scenario: Equality operator is tokenised as a single token
        Given a source file containing "=="
        When I run the tokenizer with a lexer that defines a "==" operator
        Then "==" appears as a single operator token
        And it is not split into two separate "=" tokens

    Scenario: Arrow operator is tokenised as a single token
        Given a source file containing "=>" or "->"
        When I run the tokenizer with a lexer that defines arrow operators
        Then the arrow appears as a single operator token

    Scenario: Compound assignment operators are tokenised as single tokens
        Given a source file containing "+=", "-=", or "*="
        When I run the tokenizer with a lexer that defines compound assignment operators
        Then each compound operator appears as a single operator token
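The usual way to keep "==" from being split is longest-match-first ordering over the literal operator values, sketched below. The operator set and helper name are hypothetical; the technique is what the scenarios above require.

```python
OPERATORS = ["=", "==", "+", "+=", "->", "=>"]  # hypothetical set

def match_operator(source, pos):
    """Try literal operators longest first, so "==" wins over "="."""
    for op in sorted(OPERATORS, key=len, reverse=True):
        if source.startswith(op, pos):
            return {"type": "operator", "value": op}
    return None

assert match_operator("a == b", 2) == {"type": "operator", "value": "=="}
assert match_operator("x += 1", 2) == {"type": "operator", "value": "+="}
```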

Feature: Number literal tokenisation
    As a developer
    I want numeric literals to be tokenised as number tokens
    So that I can distinguish them from identifiers and keywords

    Scenario: Integer literals are tokenised as number tokens
        Given a source file containing integer literals such as "42" and "0"
        When I run the tokenizer with a lexer that defines a number pattern
        Then each integer appears as a token with type "number"

    Scenario: Float literals are tokenised as number tokens
        Given a source file containing float literals such as "3.14" and "0.5"
        When I run the tokenizer with a lexer that defines a number pattern
        Then each float appears as a token with type "number"

    Scenario: Hexadecimal literals are tokenised as number tokens
        Given a source file containing hex literals such as "0xFF" and "0x1A2B"
        When I run the tokenizer with a lexer that defines a hex number pattern
        Then each hex literal appears as a token with type "number"

Feature: Empty input handling
    As a developer
    I want the tokenizer to handle empty or whitespace-only files gracefully
    So that it does not crash on edge-case inputs

    Scenario: Empty file produces an empty token array
        Given an empty source file
        When I run the tokenizer
        Then the output is a valid JSON array
        And the array is empty

    Scenario: Whitespace-only file produces only whitespace tokens
        Given a source file containing only spaces and newlines
        When I run the tokenizer with a lexer that defines a whitespace pattern
        Then the output contains only whitespace tokens
        And the concatenation of token values equals the original input

Feature: Duplicate token deduplication
    As a lexer author
    I want duplicate token definitions to be detected
    So that I can avoid silent redundancy in my lexer configs

    Scenario: Duplicate value-based tokens are rejected during validation
        Given a lexer YAML file containing two tokens with the same value (e.g. "**" defined twice)
        When I run validate.py
        Then the exit code is non-zero
        And an error message identifies the duplicate token value

    Scenario: Bundled lexers contain no duplicate token values
        Given all YAML files in the lexers/ directory
        When I check each file for tokens with duplicate "value" fields
        Then no duplicates are found in any bundled lexer

Feature: Regex pattern precompilation
    As a developer processing large source files
    I want lexer regex patterns to be compiled once and reused
    So that tokenisation performance scales linearly with input size

    Scenario: Patterns are compiled once before tokenisation begins
        Given a lexer YAML containing regex pattern tokens
        When I tokenise a source file
        Then each regex pattern is compiled only once, not on every character position

    Scenario: Large file tokenisation completes within a reasonable time
        Given a Python source file of at least 1000 lines
        When I run the tokenizer with the Python lexer
        Then tokenisation completes within 5 seconds
        And the output is a valid JSON array
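A sketch of the precompilation this feature describes: compile each pattern once up front, then reuse the compiled objects in the scan loop. Note that `regex.match(source, pos)` anchors the match at `pos` without slicing, which also helps the string-handling feature below. The rule list is illustrative.

```python
import re

RULES = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("whitespace", r"\s+"),
]

def compile_rules(rules):
    """One-time cost: compile every pattern before scanning begins."""
    return [(name, re.compile(pattern)) for name, pattern in rules]

def tokenize(source, compiled):
    tokens, pos = [], 0
    while pos < len(source):
        for name, regex in compiled:
            m = regex.match(source, pos)  # anchored at pos, no re-compilation
            if m:
                tokens.append({"type": name, "value": m.group(0)})
                pos = m.end()
                break
        else:
            pos += 1
    return tokens

compiled = compile_rules(RULES)
assert tokenize("abc 123", compiled) == [
    {"type": "identifier", "value": "abc"},
    {"type": "whitespace", "value": " "},
    {"type": "number", "value": "123"},
]
```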

Feature: Efficient string slicing
    As a developer processing large source files
    I want the lexer to avoid O(n^2) string concatenation
    So that performance does not degrade on long tokens

    Scenario: String slice extraction uses direct slicing instead of character-by-character concatenation
        Given the lexer's get_string_slice function
        When it extracts a slice from an input string
        Then it uses index-based slicing rather than appending one character at a time
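The scenario contrasts two ways to extract source[start:end]. Building the result one character at a time copies the intermediate string on each append, which is quadratic in the slice length; direct slicing is a single O(k) copy. get_string_slice is the function named in the scenario, but its signature here is assumed.

```python
def get_string_slice_slow(s, start, end):
    """Anti-pattern the scenario forbids: per-character concatenation."""
    out = ""
    for i in range(start, end):
        out += s[i]  # each += may copy the whole accumulated string
    return out

def get_string_slice(s, start, end):
    """What the scenario requires: one index-based slice."""
    return s[start:end]

assert get_string_slice("hello world", 6, 11) == "world"
```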

Feature: CLI error messages
    As a developer running PyLex from the command line
    I want user-friendly error messages instead of Python tracebacks
    So that I can understand problems without reading implementation details

    Scenario: File not found produces a clean error message without a traceback
        Given I run the tokenizer with a path to a file that does not exist
        Then the exit code is non-zero
        And stderr contains a human-readable error message
        And stderr does not contain "Traceback (most recent call last)"

    Scenario: Invalid YAML produces a clean error message without a traceback
        Given I run the tokenizer with a lexer config file containing invalid YAML
        Then the exit code is non-zero
        And stderr contains a human-readable error message
        And stderr does not contain "Traceback (most recent call last)"

    Scenario: Unreadable file produces a clean error message without a traceback
        Given I run the tokenizer with a file that exists but is not readable
        Then the exit code is non-zero
        And stderr contains a human-readable error message
        And stderr does not contain "Traceback (most recent call last)"