| language | python |
|---|---|
| framework | none |
| build_cmd | python3 -m py_compile main.py tokenizer/tokenizer.py lexers/lexer.py validate.py hard_validate.py |
| test_cmd | python3 -m pytest tests/ -v |
| lint_cmd | echo 'lint not configured' |
| fmt_cmd | echo 'format not configured' |
| birth_date | 2026-03-13 |
You must write only code and tests that satisfy the features and scenarios of this behaviour-driven development (BDD) document.
System: A YAML-configured lexer (tokenizer) that converts source code files into a stream of typed JSON tokens. Language support is defined by YAML lexer files. The tokenizer is invoked from the command line and outputs a JSON array.
Feature: Tokenise source code
As a developer
I want to tokenise source code using a YAML lexer config
So that I can analyse or transform code programmatically
Scenario: Tokenise a Python hello-world file
Given a Python source file and the Python lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And the output contains a keyword token for "print"
Scenario: Tokenise a JavaScript hello-world file
Given a JavaScript source file and the JavaScript lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And the output contains a keyword token for "function"
Scenario: Tokenise a TypeScript hello-world file
Given a TypeScript source file and the TypeScript lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And the output contains a keyword token for "function"
Scenario: Tokenise a Rust hello-world file
Given a Rust source file and the Rust lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And the output contains a keyword token for "fn"
Scenario: Tokenise a C++ hello-world file
Given a C++ source file and the C++ lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And the output contains a keyword token for "int"
Scenario: Tokenise a Fortran hello-world file
Given a Fortran source file and the Fortran lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And the output contains a keyword token for "program"
Scenario: Tokenise a Vyper hello-world file
Given a Vyper source file and the Vyper lexer YAML
When I run the tokenizer
Then I receive a JSON array of tokens
And every token has a non-empty "type" and "value"
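The token stream these scenarios expect can be sketched with a minimal rule-table lexer. This is illustrative only: the rule names and ordering below are assumptions in the spirit of a lexer YAML, not the bundled `tokenizer/tokenizer.py` or any shipped lexer file.

```python
import json
import re

# Hypothetical rule table: (type, pattern) pairs tried in order at each position.
RULES = [
    ("keyword", r"\b(?:print|def|if|return)\b"),
    ("string_literal", r"'[^']*'|\"[^\"]*\""),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"[()=+]"),
    ("whitespace", r"\s+"),
]

def tokenize(source):
    """Return the whole input as a list of {"type", "value"} dicts."""
    compiled = [(name, re.compile(pattern)) for name, pattern in RULES]
    tokens, i = [], 0
    while i < len(source):
        for name, pattern in compiled:
            m = pattern.match(source, i)
            if m:
                tokens.append({"type": name, "value": m.group(0)})
                i = m.end()
                break
        else:
            i += 1  # the real tokenizer warns on stderr rather than silently skipping

    return tokens

print(json.dumps(tokenize("print('hello')")))
```

Run on the Python hello-world input, the sketch yields a `keyword` token for `print`, and concatenating all `value` fields reconstructs the input, matching the scenarios above.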
Feature: JSON output format
As a developer consuming PyLex output
I want the tokenizer output to be valid, well-structured JSON
So that I can parse and process it reliably
Scenario: Output is a valid JSON array
Given any supported source file and its lexer YAML
When I run the tokenizer
Then stdout is parseable as a JSON array
And each element has a "value" field and a "type" field
Scenario: Token types are non-empty strings
Given any supported source file and its lexer YAML
When I run the tokenizer
Then every token has a non-empty string "type"
And every token has a non-empty string "value"
Scenario: Concatenating token values reconstructs the original input
Given any supported source file and its lexer YAML
When I run the tokenizer
Then concatenating all token "value" fields produces the original source file content
Scenario: Unrecognised characters are reported on stderr not stdout
Given a source file containing a character not matched by any token rule
When I run the tokenizer
Then stdout remains a valid JSON array
And the warning about the unrecognised character appears on stderr
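The stdout/stderr separation in the last scenario can be sketched as follows. The warning wording is an assumption, not `main.py`'s actual message; the point is that tokens and diagnostics travel on different streams.

```python
import json
import re
import sys

# Hypothetical two-rule lexer; anything else triggers a warning.
RULES = [("identifier", re.compile(r"[A-Za-z_]\w*")),
         ("whitespace", re.compile(r"\s+"))]

def lex(source):
    """Return (tokens, warnings): tokens for stdout, diagnostics for stderr."""
    tokens, warnings, i = [], [], 0
    while i < len(source):
        for name, pattern in RULES:
            m = pattern.match(source, i)
            if m:
                tokens.append({"type": name, "value": m.group(0)})
                i = m.end()
                break
        else:
            warnings.append(f"warning: unrecognised character {source[i]!r} at offset {i}")
            i += 1  # skip it so stdout stays a clean JSON array
    return tokens, warnings

tokens, warnings = lex("foo @ bar")
print(json.dumps(tokens))            # stdout carries only the JSON array
for line in warnings:
    print(line, file=sys.stderr)     # warnings never pollute stdout
```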
Feature: Token type identification
As a developer
I want source code tokens to be correctly classified by type
So that I can distinguish keywords from identifiers, literals, and operators
Scenario: Keywords are identified as keyword tokens
Given a source file containing a language keyword
When I run the tokenizer with the appropriate lexer
Then the keyword appears as a token with type "keyword"
Scenario: Identifiers are identified as identifier tokens
Given a source file containing a user-defined name
When I run the tokenizer with the appropriate lexer
Then the name appears as a token with type "identifier"
Scenario: Whitespace is preserved as whitespace tokens
Given a source file containing spaces or newlines
And the lexer YAML defines a whitespace token pattern
When I run the tokenizer with the appropriate lexer
Then whitespace appears as tokens with type "whitespace" in the output
And whitespace is not silently discarded
Scenario: String literals are identified as string literal tokens
Given a source file containing a quoted string
When I run the tokenizer with a lexer that defines a string_literal pattern
Then the quoted string appears as a token with type "string_literal"
Scenario: Operators are identified as operator tokens
Given a source file containing operators such as "+" or "="
When I run the tokenizer with a lexer that defines operator tokens
Then each operator appears as a token with type "operator"
Scenario: Keywords are not misidentified as identifiers
Given a source file where a keyword appears adjacent to an identifier
When I run the tokenizer with the appropriate lexer
Then the keyword token has type "keyword" not "identifier"
And the identifier token has type "identifier"
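The keyword-versus-identifier rule can be sketched with word-boundary anchors: assuming the lexer tries keyword alternatives first but wraps them in `\b`, a word like `ifx` never half-matches the keyword `if`. The keyword set below is illustrative.

```python
import re

# Hypothetical keyword pattern; \b on both sides prevents partial matches.
KEYWORD = re.compile(r"\b(?:def|if|return)\b")

def token_type(word):
    """Classify a whole word: exact keyword match wins, otherwise identifier."""
    return "keyword" if KEYWORD.fullmatch(word) else "identifier"
```

Note that even `KEYWORD.match("ifx")` fails, because `\b` requires a non-word character (or end of input) after the keyword.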
Feature: Comprehensive language tokenisation
As a developer
I want each bundled lexer to handle common real-world code constructs
So that the tokenizer is useful beyond hello-world examples
Scenario: Tokenise a Python file with function definition and control flow
Given a Python source file containing a function definition and an if statement
When I run the tokenizer with the Python lexer
Then tokens include keywords "def", "if", and "return"
And identifiers for the function name and variable names are present
Scenario: Tokenise a JavaScript file with variable declarations and arrow functions
Given a JavaScript source file containing "const", "let", and an arrow function
When I run the tokenizer with the JavaScript lexer
Then tokens include keywords "const" and "let"
And the arrow function syntax is tokenised without errors
Scenario: Tokenise a Rust file with struct and impl definitions
Given a Rust source file containing a struct and an impl block
When I run the tokenizer with the Rust lexer
Then tokens include keywords "struct" and "impl"
And identifiers for the struct name are present
Feature: CLI error handling
As a developer running PyLex from the command line
I want clear error messages when I misuse the tool
So that I can quickly understand and fix the problem
Scenario: Missing command-line arguments prints usage and exits non-zero
When I run the tokenizer with no arguments
Then the exit code is non-zero
And a usage message is printed to stderr
Scenario: Input file not found exits with a clear error message
When I run the tokenizer with a path to a file that does not exist
Then the exit code is non-zero
And an error message mentioning the missing file is printed to stderr
Scenario: Invalid YAML lexer config exits with a clear error message
When I run the tokenizer with a lexer config file containing invalid YAML
Then the exit code is non-zero
And an error message describing the YAML parse failure is printed to stderr
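The CLI contract can be sketched as an entry point that returns an exit code rather than calling `sys.exit`, which keeps it testable. The usage text and the specific non-zero codes are assumptions; only "non-zero on misuse, message on stderr" comes from the scenarios.

```python
import os
import sys

# Hypothetical usage string; the real main.py wording may differ.
USAGE = "usage: tokenizer <source-file> <lexer-yaml>"

def run(argv):
    """Return an exit code; all diagnostics go to stderr only."""
    if len(argv) != 2:
        print(USAGE, file=sys.stderr)
        return 2
    source_path, _lexer_path = argv
    if not os.path.exists(source_path):
        print(f"error: no such file: {source_path}", file=sys.stderr)
        return 1
    return 0
```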
Feature: Lexer schema validation
As a lexer author
I want invalid YAML lexer files to be rejected
So that I can catch configuration mistakes early
Scenario: Valid lexer YAML passes validation
Given a correctly structured lexer YAML file
When I run validate.py
Then the exit code is 0
And no errors are reported
Scenario: Lexer YAML missing required field fails validation
Given a lexer YAML file missing the "tokens" field
When I run validate.py
Then the exit code is non-zero
And an error message describes the missing field
Scenario: Lexer YAML with a token missing both value and pattern fails validation
Given a lexer YAML file containing a token definition with neither "value" nor "pattern"
When I run validate.py
Then the exit code is non-zero
And an error message describes the invalid token definition
Scenario: All bundled lexer files pass validation
Given all YAML files in the lexers/ directory
When I run validate.py on each one
Then every file passes validation with exit code 0
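The checks this feature describes can be sketched as a function returning a list of error strings. The error wording is an assumption; the required field names (`tokens`, and `value`/`pattern` per token) come from the scenarios above.

```python
def validate_lexer(config):
    """Return a list of error strings; an empty list means the config is valid."""
    if not isinstance(config, dict):
        return ["lexer config must be a mapping"]
    if "tokens" not in config:
        return ['missing required field: "tokens"']
    errors = []
    for idx, token in enumerate(config["tokens"]):
        if "value" not in token and "pattern" not in token:
            errors.append(f'token {idx}: must define "value" or "pattern"')
    return errors
```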
Feature: Custom lexer configuration
As a developer
I want to write my own lexer YAML configuration
So that I can tokenise languages or dialects not bundled with PyLex
Scenario: A custom lexer tokenises a simple DSL
Given a custom lexer YAML defining two token types and a simple input string
When I run the tokenizer with that custom lexer
Then the output contains tokens matching the custom definitions
And no tokens from the built-in lexers appear in the output
Feature: Comment tokenisation
As a developer
I want comments in source code to be tokenised correctly
So that I can preserve or strip them in downstream tooling
Scenario: Single-line comments are tokenised as comment tokens
Given a source file containing a single-line comment (e.g. "# hello" or "// hello")
When I run the tokenizer with a lexer that defines a single-line comment pattern
Then the comment text appears as a token with type "comment"
And the concatenation of all token values reconstructs the original input
Scenario: Multi-line comments are tokenised as comment tokens
Given a source file containing a block comment (e.g. "/* ... */")
When I run the tokenizer with a lexer that defines a multi-line comment pattern
Then the entire block comment appears as a token with type "comment"
And the concatenation of all token values reconstructs the original input
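The comment patterns these scenarios imply can be sketched as two regexes. The block-comment pattern must be non-greedy (`.*?`), otherwise `/* a */ x /* b */` would collapse into one comment token; `re.DOTALL` lets it span newlines. The exact patterns are assumptions, not any bundled lexer's.

```python
import re

# Non-greedy block comment that may span lines.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# Single-line comment in either "#" or "//" style.
LINE_COMMENT = re.compile(r"#[^\n]*|//[^\n]*")
```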
Feature: Import statement tokenisation
As a developer
I want import and use statements to be tokenised correctly
So that I can analyse dependencies programmatically
Scenario: Python import statement is tokenised correctly
Given a Python source file containing "import os" and "from sys import path"
When I run the tokenizer with the Python lexer
Then "import" and "from" appear as keyword tokens
And "os", "sys", and "path" appear as identifier tokens
Scenario: JavaScript import statement is tokenised correctly
Given a JavaScript source file containing an ES6 import statement
When I run the tokenizer with the JavaScript lexer
Then "import" and "from" appear as keyword tokens
And the module name appears as a string literal token
Scenario: Rust use statement is tokenised correctly
Given a Rust source file containing a "use" declaration
When I run the tokenizer with the Rust lexer
Then "use" appears as a keyword token
And the path components appear as identifier tokens
Feature: Multi-character operator tokenisation
As a developer
I want multi-character operators to be tokenised as single tokens
So that "==" is not split into two "=" tokens
Scenario: Equality operator is tokenised as a single token
Given a source file containing "=="
When I run the tokenizer with a lexer that defines a "==" operator
Then "==" appears as a single operator token
And it is not split into two separate "=" tokens
Scenario: Arrow operator is tokenised as a single token
Given a source file containing "=>" or "->"
When I run the tokenizer with a lexer that defines arrow operators
Then the arrow appears as a single operator token
Scenario: Compound assignment operators are tokenised as single tokens
Given a source file containing "+=", "-=", or "*="
When I run the tokenizer with a lexer that defines compound assignment operators
Then each compound operator appears as a single operator token
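Longest-match selection for literal operators can be sketched by trying candidates longest-first, so `==` wins over `=` at the same position. The operator list is illustrative, not any bundled lexer's.

```python
# Sort once, longest first, so multi-character operators are tried before
# their single-character prefixes.
OPERATORS = sorted(["=", "==", "=>", "->", "+", "+=", "-", "-=", "*", "*="],
                   key=len, reverse=True)

def match_operator(source, i):
    """Return the longest operator starting at position i, or None."""
    for op in OPERATORS:
        if source.startswith(op, i):
            return op
    return None
```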
Feature: Number literal tokenisation
As a developer
I want numeric literals to be tokenised as number tokens
So that I can distinguish them from identifiers and keywords
Scenario: Integer literals are tokenised as number tokens
Given a source file containing integer literals such as "42" and "0"
When I run the tokenizer with a lexer that defines a number pattern
Then each integer appears as a token with type "number"
Scenario: Float literals are tokenised as number tokens
Given a source file containing float literals such as "3.14" and "0.5"
When I run the tokenizer with a lexer that defines a number pattern
Then each float appears as a token with type "number"
Scenario: Hexadecimal literals are tokenised as number tokens
Given a source file containing hex literals such as "0xFF" and "0x1A2B"
When I run the tokenizer with a lexer that defines a hex number pattern
Then each hex literal appears as a token with type "number"
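A number pattern covering all three scenarios can be sketched as one regex whose alternatives are ordered hex, float, then integer, so the longer forms are tried first and `0xFF` / `3.14` are not split at the `x` or `.`. The pattern is an assumption, not the bundled lexers' exact rule.

```python
import re

# Alternation order matters: hex before float before plain integer.
NUMBER = re.compile(r"0[xX][0-9A-Fa-f]+|\d+\.\d+|\d+")

def match_number(source, i=0):
    """Return the number literal starting at position i, or None."""
    m = NUMBER.match(source, i)
    return m.group(0) if m else None
```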
Feature: Empty input handling
As a developer
I want the tokenizer to handle empty or whitespace-only files gracefully
So that it does not crash on edge-case inputs
Scenario: Empty file produces an empty token array
Given an empty source file
When I run the tokenizer
Then the output is a valid JSON array
And the array is empty
Scenario: Whitespace-only file produces only whitespace tokens
Given a source file containing only spaces and newlines
When I run the tokenizer with a lexer that defines a whitespace pattern
Then the output contains only whitespace tokens
And the concatenation of token values equals the original input
Feature: Duplicate token deduplication
As a lexer author
I want duplicate token definitions to be detected
So that I can avoid silent redundancy in my lexer configs
Scenario: Duplicate value-based tokens are rejected during validation
Given a lexer YAML file containing two tokens with the same value (e.g. "**" defined twice)
When I run validate.py
Then the exit code is non-zero
And an error message identifies the duplicate token value
Scenario: Bundled lexers contain no duplicate token values
Given all YAML files in the lexers/ directory
When I check each file for tokens with duplicate "value" fields
Then no duplicates are found in any bundled lexer
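The duplicate check can be sketched with a counter over the `value` fields; pattern-only tokens are ignored. How `validate.py` formats the resulting error is an assumption.

```python
from collections import Counter

def find_duplicate_values(tokens):
    """Return the token "value" strings defined more than once, sorted."""
    counts = Counter(t["value"] for t in tokens if "value" in t)
    return sorted(value for value, n in counts.items() if n > 1)
```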
Feature: Regex pattern precompilation
As a developer processing large source files
I want lexer regex patterns to be compiled once and reused
So that tokenisation performance scales linearly with input size
Scenario: Patterns are compiled once before tokenisation begins
Given a lexer YAML containing regex pattern tokens
When I tokenise a source file
Then each regex pattern is compiled only once, not on every character position
Scenario: Large file tokenisation completes within a reasonable time
Given a Python source file of at least 1000 lines
When I run the tokenizer with the Python lexer
Then tokenisation completes within 5 seconds
And the output is a valid JSON array
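The compile-once requirement can be sketched as a two-phase scan: compile every pattern up front, then reuse the compiled objects in the hot loop instead of passing raw pattern strings to `re.match` at every input position. The rule format mirrors the sketch elsewhere in this document and is an assumption about the real lexer.

```python
import re

def compile_rules(rules):
    """Compile (type, pattern) pairs once, before tokenisation begins."""
    return [(name, re.compile(pattern)) for name, pattern in rules]

def scan(source, rules):
    compiled = compile_rules(rules)       # once, up front
    tokens, i = [], 0
    while i < len(source):
        for name, pattern in compiled:    # reuse compiled objects per position
            m = pattern.match(source, i)
            if m and m.end() > i:
                tokens.append({"type": name, "value": m.group(0)})
                i = m.end()
                break
        else:
            i += 1
    return tokens
```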
Feature: Efficient string slicing
As a developer processing large source files
I want the lexer to avoid O(n^2) string concatenation
So that performance does not degrade on long tokens
Scenario: String slice extraction uses direct slicing instead of character-by-character concatenation
Given the lexer's get_string_slice function
When it extracts a slice from an input string
Then it uses index-based slicing rather than appending one character at a time
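The two shapes this scenario contrasts can be sketched side by side: the naive version builds an intermediate string per character, while the direct slice is a single O(k) copy. The function name follows the spec's `get_string_slice`; its signature is an assumption.

```python
def get_string_slice(source, start, end):
    """Index-based slicing: one allocation for the whole slice."""
    return source[start:end]

def get_string_slice_naive(source, start, end):
    """Character-by-character concatenation: the O(n^2)-prone shape to avoid."""
    out = ""
    for i in range(start, end):
        out = out + source[i]
    return out
```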
Feature: CLI error messages
As a developer running PyLex from the command line
I want user-friendly error messages instead of Python tracebacks
So that I can understand problems without reading implementation details
Scenario: File not found produces a clean error message without a traceback
When I run the tokenizer with a path to a file that does not exist
Then the exit code is non-zero
And stderr contains a human-readable error message
And stderr does not contain "Traceback (most recent call last)"
Scenario: Invalid YAML produces a clean error message without a traceback
When I run the tokenizer with a lexer config file containing invalid YAML
Then the exit code is non-zero
And stderr contains a human-readable error message
And stderr does not contain "Traceback (most recent call last)"
Scenario: Unreadable file produces a clean error message without a traceback
When I run the tokenizer with a file that exists but is not readable
Then the exit code is non-zero
And stderr contains a human-readable error message
And stderr does not contain "Traceback (most recent call last)"
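The no-traceback contract can be sketched as a wrapper around the entry point that catches the expected failure classes and turns them into one-line stderr messages plus a non-zero exit code. Which exception classes count as "expected" is an assumption here (file errors and parse errors); a genuine bug would still traceback.

```python
import sys

def safe_main(work):
    """Run `work`; return 0 on success, 1 with a clean stderr message on known errors."""
    try:
        work()
        return 0
    except (OSError, ValueError) as exc:  # assumed: file errors, parse errors
        print(f"error: {exc}", file=sys.stderr)
        return 1
```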