A token stream transformer. Pyken takes a JSON token stream — from PyLex or any compatible tokenizer — and remaps tokens according to a YAML mapping file, enabling token-level transpilation between languages or dialects.
- Language-agnostic — works with any `[{"type": "...", "value": "..."}]` JSON token stream
- YAML-configured mappings — define remapping rules without touching code
- Two output modes — reconstructed source text or a new JSON token stream
- Pipeline-friendly — designed to chain with PyLex and other tools
- Strict mode — fail fast if any token has no mapping rule
- Bundled mappings — Python → JavaScript, JavaScript → TypeScript, Python → Pseudocode
```bash
git clone https://github.com/dweng0/Pyken.git
cd Pyken
pip install -r requirements.txt
```

Pyken reads a JSON token stream from stdin (or `--input`) and writes transformed output to stdout.
```bash
# Tokenise Python source, remap to JavaScript
python3 main.py hello.py lexers/python.yaml | python3 pyken.py mappings/python-to-javascript.yaml

# Tokenise Python source, remap to pseudocode
python3 main.py hello.py lexers/python.yaml | python3 pyken.py mappings/python-to-pseudocode.yaml
```

```bash
# Save tokens first
python3 main.py hello.py lexers/python.yaml > tokens.json

# Then remap
python3 pyken.py mappings/python-to-javascript.yaml --input tokens.json
```

```bash
python3 main.py foo.py lexers/python.yaml | python3 pyken.py mappings/python-to-javascript.yaml --tokens
```

This is useful for chaining multiple remapping steps:

```bash
python3 main.py foo.py lexers/python.yaml \
  | python3 pyken.py mappings/python-to-javascript.yaml --tokens \
  | python3 pyken.py mappings/javascript-to-typescript.yaml
```

By default, tokens with no matching rule pass through unchanged with a warning on stderr. Use `--strict` to treat unmapped tokens as errors:

```bash
python3 main.py foo.py lexers/python.yaml | python3 pyken.py mappings/my-mapping.yaml --strict
```

Input (Python):
```python
def greet(name):
    if name:
        print("Hello " + name)
    return True
```

After remapping with `python-to-javascript.yaml`:

```
function greet(name):
    if name:
        console.log("Hello " + name)
    return true
```

After remapping with `python-to-pseudocode.yaml`:

```
FUNCTION greet(name):
    IF name:
        OUTPUT("Hello " + name)
    RETURN YES
```
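Mechanically, the JavaScript output above is just a value substitution over the token stream. Below is a minimal Python sketch of that idea; the hand-written token list and `KEYWORD_MAP` are illustrative stand-ins, not Pyken's internals:

```python
# Toy illustration of token-level remapping: walk a [{"type", "value"}]
# stream, swap mapped keyword values, then join values to rebuild source.
KEYWORD_MAP = {"def": "function", "print": "console.log", "True": "true"}

tokens = [
    {"type": "keyword", "value": "def"},
    {"type": "whitespace", "value": " "},
    {"type": "identifier", "value": "greet"},
    {"type": "punctuation", "value": "("},
    {"type": "identifier", "value": "name"},
    {"type": "punctuation", "value": ")"},
    {"type": "punctuation", "value": ":"},
]

def remap(stream):
    # Replace mapped keyword values; everything else passes through.
    return [
        {**tok, "value": KEYWORD_MAP.get(tok["value"], tok["value"])}
        if tok["type"] == "keyword" else tok
        for tok in stream
    ]

source = "".join(tok["value"] for tok in remap(tokens))
print(source)  # function greet(name):
```

Note that only token values change; structure such as the trailing `:` survives untouched, which is exactly what the example output shows.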
| File | From | To | What it does |
|---|---|---|---|
| `python-to-javascript.yaml` | Python | JavaScript | Remaps keywords: `def`→`function`, `elif`→`else if`, `True`→`true`, `None`→`null`, `print`→`console.log`, etc. |
| `javascript-to-typescript.yaml` | JavaScript | TypeScript | Remaps `var`→`let`, passes everything else through |
| `python-to-pseudocode.yaml` | Python | Pseudocode | Replaces keywords with plain English: `def`→`FUNCTION`, `if`→`IF`, `return`→`RETURN`, etc. |
A mapping file is a YAML file with a list of rules. Each rule has a match block and an emit block.
```yaml
from: python
to: javascript

rules:
  # Match by type AND value (specific)
  - match:
      type: keyword
      value: "def"
    emit:
      value: "function"

  # Match by type only (general — catches anything not matched above)
  - match:
      type: keyword
    emit: pass

  # Pass whitespace and identifiers through unchanged
  - match:
      type: whitespace
    emit: pass
  - match:
      type: identifier
    emit: pass
```

Rule matching:

- Specific rules (`type` + `value`) are always tried before general rules (`type` only)
- `emit: pass` passes the token through unchanged
- Omitting `value` in `emit` keeps the original value
- Omitting `type` in `emit` keeps the original type
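Those emit semantics can be modeled in a few lines of Python. This is an illustrative sketch, not Pyken's implementation; `apply_emit` is a hypothetical helper name:

```python
def apply_emit(token, emit):
    # "pass" returns the token unchanged
    if emit == "pass":
        return dict(token)
    # A partial emit dict overrides only the fields it names;
    # omitted fields keep the original token's values.
    return {
        "type": emit.get("type", token["type"]),
        "value": emit.get("value", token["value"]),
    }

tok = {"type": "keyword", "value": "def"}
print(apply_emit(tok, "pass"))                  # unchanged
print(apply_emit(tok, {"value": "function"}))   # value replaced, type kept
print(apply_emit(tok, {"type": "identifier"}))  # type replaced, value kept
```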
```yaml
- match:
    type: identifier
  emit: pass
```

Use this to drop tokens that have no equivalent in the target language:

```yaml
# Remove Python's trailing colon from block statements
- match:
    type: punctuation
    value: ":"
  emit: discard
```

The token is silently removed from the output. No warning is printed.
```yaml
- match:
    type: keyword
    value: "def"
  emit:
    value: "function"   # replace value, keep type

- match:
    type: keyword
    value: "True"
  emit:
    type: boolean       # replace type, keep value

- match:
    type: keyword
    value: "None"
  emit:
    type: keyword
    value: "null"       # replace both
```

Use this when a single source token maps to multiple target tokens. The matched token is replaced by the full list:
```yaml
# Python INDENT becomes " {\n" in JavaScript
- match:
    type: indent
  emit:
    tokens:
      - type: punctuation
        value: " {"
      - type: newline
        value: "\n"

# Python DEDENT becomes a closing brace
- match:
    type: dedent
  emit:
    tokens:
      - type: punctuation
        value: "}"
      - type: newline
        value: "\n"
```

Add `preceded_by` to a rule to match a token only when a specific token immediately precedes it.
```yaml
# Remove ":" only when it closes a block header (preceded by ")")
# Leaves ":" in dict literals alone
- match:
    type: punctuation
    value: ":"
  preceded_by:
    type: punctuation
    value: ")"
  emit: discard

# All other ":" (e.g. in dicts) pass through unchanged
- match:
    type: punctuation
    value: ":"
  emit: pass
```

Add `followed_by` to match a token only when a specific token immediately follows it. Useful for disambiguating structural tokens such as `{` used as a block opener vs `{` used to open an object literal.
```yaml
# Rust/JS "{" that opens a block (followed by newline) → discard it,
# let INDENT/DEDENT handling generate the Python-style indentation
- match:
    type: punctuator
    value: "{"
  followed_by:
    type: newline
  emit: discard

# "{" not followed by newline (object literal) → pass through unchanged
- match:
    type: punctuator
    value: "{"
  emit: pass
```

`preceded_by` and `followed_by` can be combined in a single rule:
```yaml
- match:
    type: punctuation
    value: ","
  preceded_by:
    type: identifier
  followed_by:
    type: whitespace
  emit: pass
```

Use `match: sequence` to match a run of consecutive tokens as a single pattern. This is necessary for multi-token constructs that have a single-token equivalent in the target language — for example `not in` (three tokens) → a single operator, or `: i32` (a Rust type annotation) → discard both tokens at once.
The sequence must list every token in order. Use type and/or value per token; omit either to match any value/type for that position.
```yaml
# Python "not in" (three tokens) → single operator token
- match:
    sequence:
      - type: keyword
        value: "not"
      - type: whitespace
      - type: keyword
        value: "in"
  emit:
    type: operator
    value: "not in"

# Rust type annotation ": i32" → discard both tokens
- match:
    sequence:
      - type: punctuation
        value: ":"
      - type: identifier   # matches any type name
  emit: discard

# Rust "let mut" → discard both (Python just assigns directly)
- match:
    sequence:
      - type: keyword
        value: "let"
      - type: whitespace
      - type: keyword
        value: "mut"
  emit: discard
```

A sequence rule consumes all matched tokens and produces the emitted output once. All emit modes work with sequence rules: `pass` (emits the first token unchanged), `discard`, value/type replace, `tokens: [...]`, and injection.
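One way to picture sequence matching, assuming only the simple `{type, value}` contract (the `matches_at` helper and pattern list below are hypothetical, not Pyken's code):

```python
def matches_at(stream, i, sequence):
    # True when each pattern matches the token at the same offset.
    # A pattern may give "type", "value", or both; omitted keys match anything.
    if i + len(sequence) > len(stream):
        return False
    return all(
        all(tok.get(key) == val for key, val in pat.items())
        for tok, pat in zip(stream[i:i + len(sequence)], sequence)
    )

NOT_IN = [
    {"type": "keyword", "value": "not"},
    {"type": "whitespace"},
    {"type": "keyword", "value": "in"},
]

stream = [
    {"type": "keyword", "value": "not"},
    {"type": "whitespace", "value": " "},
    {"type": "keyword", "value": "in"},
]
print(matches_at(stream, 0, NOT_IN))  # True
```

On a match, a remapper would advance past all three tokens and produce the rule's emit output once.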
Use emit: before or emit: after to inject new tokens adjacent to the matched token without replacing it. This is how you add target-language constructs that have no source equivalent — for instance, adding TypeScript type annotations to parameters, or prepending a C++ return type before a function name.
pass_through: true keeps the matched token in the output. Omitting it, or setting it to false, replaces the matched token (same as the standard emit).
```yaml
# JavaScript → TypeScript: inject ": any" after each function parameter
# The parameter identifier is kept; the annotation is added after it
- match:
    type: identifier
  preceded_by:
    type: punctuator
    value: "("
  emit:
    pass_through: true
    after:
      - type: punctuator
        value: ":"
      - type: whitespace
        value: " "
      - type: identifier
        value: "any"

# Python → C++: inject return type "int " before a function name
- match:
    type: identifier
  preceded_by:
    type: keyword
    value: "def"
  emit:
    pass_through: true
    before:
      - type: identifier
        value: "int"
      - type: whitespace
        value: " "
```

`before` and `after` can be used together in one rule. The output order is always: `before` tokens → matched token (if `pass_through: true`) → `after` tokens.
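That ordering can be sketched in a few lines; `apply_injection` is a hypothetical helper for illustration, not part of Pyken:

```python
def apply_injection(token, emit):
    # Output order: before tokens, then the matched token (only if
    # pass_through is true), then after tokens.
    out = list(emit.get("before", []))
    if emit.get("pass_through", False):
        out.append(token)
    out.extend(emit.get("after", []))
    return out

# ": any" injected after a kept parameter identifier
emit = {
    "pass_through": True,
    "after": [
        {"type": "punctuator", "value": ":"},
        {"type": "whitespace", "value": " "},
        {"type": "identifier", "value": "any"},
    ],
}
result = apply_injection({"type": "identifier", "value": "x"}, emit)
print("".join(t["value"] for t in result))  # x: any
```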
When pass_through: true is set on a sequence rule, all matched tokens are kept in the output and the before/after tokens are injected around them. Without it, the sequence is consumed and replaced.
```yaml
# Python → Rust: inject "let " before every "identifier =" assignment
# Keeps the identifier and "=" unchanged, just adds "let " in front
- match:
    sequence:
      - type: identifier
      - type: whitespace
      - { type: operator, value: "=" }
  followed_by:
    not: { type: operator, value: "=" }
  emit:
    pass_through: true
    before:
      - type: keyword
        value: "let "
```

`followed_by` on a sequence rule checks the token immediately after the last element of the sequence.
Add not_followed_by or not_preceded_by to exclude a rule when a specific token is adjacent. Essential for distinguishing tokens that are identical in isolation but mean different things in context — for example = assignment vs the first character of ==.
```yaml
# Match "=" as assignment only — not when followed by another "="
- match:
    type: operator
    value: "="
  not_followed_by:
    type: operator
    value: "="
  emit:
    pass_through: true
    before:
      - type: keyword
        value: "let "

# Match standalone "->" return type arrow, not inside a string
- match:
    type: operator
    value: "->"
  not_preceded_by:
    type: string_literal
  emit:
    value: ":"
```

`not_preceded_by` and `not_followed_by` can be combined with each other and with `preceded_by` / `followed_by` in the same rule.
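All four context conditions can be modeled together by one predicate over the neighbouring tokens. An illustrative sketch (the `context_ok` helper is hypothetical, not Pyken's code):

```python
def context_ok(stream, i, rule):
    # Check a rule's adjacent-token conditions at position i.
    def matches(tok, pat):
        # None means "no neighbour" (start/end of stream): never matches.
        return tok is not None and all(tok.get(k) == v for k, v in pat.items())

    prev = stream[i - 1] if i > 0 else None
    nxt = stream[i + 1] if i + 1 < len(stream) else None
    if "preceded_by" in rule and not matches(prev, rule["preceded_by"]):
        return False
    if "followed_by" in rule and not matches(nxt, rule["followed_by"]):
        return False
    if "not_preceded_by" in rule and matches(prev, rule["not_preceded_by"]):
        return False
    if "not_followed_by" in rule and matches(nxt, rule["not_followed_by"]):
        return False
    return True

# "=" is assignment only when not followed by another "="
rule = {"not_followed_by": {"type": "operator", "value": "="}}
eq = {"type": "operator", "value": "="}
ident = {"type": "identifier", "value": "x"}
print(context_ok([ident, eq, ident], 1, rule))  # True  (x = y)
print(context_ok([ident, eq, eq], 1, rule))     # False (x == ...)
```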
By default emit: value: "something" replaces the token's value with a hardcoded string. Two additional forms let you derive the emitted value from the original token.
Use {{value}} anywhere in the emit value string to insert the original token's value. For sequence rules, use {{tokens[N].value}} to reference the Nth token in the matched sequence (zero-indexed).
```yaml
# Python → C: "import os" → "#include <os.h>"
# tokens[0]=import  tokens[1]=whitespace  tokens[2]=os
- match:
    sequence:
      - { type: keyword, value: "import" }
      - type: whitespace
      - type: identifier
  emit:
    type: preprocessor
    value: "#include <{{tokens[2].value}}.h>"

# Rust: qualify an identifier with its crate path
- match:
    type: identifier
  preceded_by: { type: keyword, value: "use" }
  emit:
    value: "crate::{{value}}"
```

Use `value_regex` with `pattern` and `replacement` to apply a regex substitution to the token's original value. If the pattern does not match, the value is passed through unchanged.
```yaml
# Python single-quoted strings → C double-quoted strings: 'hello' → "hello"
- match:
    type: string_literal
  emit:
    value_regex:
      pattern: "^'(.*)'$"
      replacement: '"\\1"'

# Python comments → C++ line comments: "# text" → "// text"
- match:
    type: comment
  emit:
    value_regex:
      pattern: "^#"
      replacement: "//"
```

Rules are tried in this order — the first match wins:
| Priority | Rule type | When it applies |
|---|---|---|
| 1 | Sequence rule | `match: sequence: [...]` — matches multiple tokens |
| 2 | Context-aware + specific | `type` + `value` + any of `preceded_by`, `followed_by`, `not_preceded_by`, `not_followed_by` |
| 3 | Context-aware + general | `type` only + any context condition |
| 4 | Specific | `type` + `value` |
| 5 | General | `type` only |
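First-match-wins over these five buckets could be modeled by assigning each rule a priority number and sorting, as in this sketch (the `priority` function and rule dicts are hypothetical, not Pyken's internals):

```python
CONTEXT_KEYS = ("preceded_by", "followed_by",
                "not_preceded_by", "not_followed_by")

def priority(rule):
    # Lower number = tried first, mirroring the table above.
    match = rule["match"]
    if "sequence" in match:
        return 1
    has_ctx = any(k in rule for k in CONTEXT_KEYS)
    has_value = "value" in match
    if has_ctx:
        return 2 if has_value else 3
    return 4 if has_value else 5

rules = [
    {"match": {"type": "keyword"}, "emit": "pass"},                   # general
    {"match": {"type": "keyword", "value": "def"},
     "emit": {"value": "function"}},                                  # specific
    {"match": {"sequence": [{"value": "not"}, {"value": "in"}]},
     "emit": {"value": "not in"}},                                    # sequence
]
ordered = sorted(rules, key=priority)
print([priority(r) for r in ordered])  # [1, 4, 5]
```

With rules sorted this way, a remapper just takes the first rule whose match (and context) conditions hold at the current position.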
```bash
python3 -m pytest tests/ -v
```

Pyken is intentionally minimal. The core pipeline is:

- Read a `[{"type": "...", "value": "..."}]` JSON array from stdin or a file
- For each position in the stream, find the best matching rule (sequence rules first, then context-aware, then specific, then general)
- Apply the `emit` transformation — replace, discard, expand to many tokens, or inject before/after
- Output either the reconstructed source text (join all values) or a new JSON token stream
Because Pyken only cares about the {type, value} contract, it works with PyLex or any other tokenizer that produces compatible output.
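For instance, a deliberately tiny regex-based tokenizer (hypothetical, not PyLex; the token names are illustrative) can produce a compatible stream:

```python
import json
import re

# Any tokenizer emitting [{"type": ..., "value": ...}] objects can sit
# in front of Pyken. Named groups double as token type names here.
TOKEN_RE = re.compile(
    r"(?P<keyword>\bdef\b|\bif\b|\breturn\b)"
    r"|(?P<identifier>[A-Za-z_]\w*)"
    r"|(?P<whitespace>\s+)"
    r"|(?P<punctuation>[():,])"
)

def tokenize(source):
    # m.lastgroup is the name of the alternative that matched.
    return [
        {"type": m.lastgroup, "value": m.group()}
        for m in TOKEN_RE.finditer(source)
    ]

stream = tokenize("def greet(name):")
print(json.dumps(stream))
```

Joining the `value` fields reconstructs the input exactly, which is the property Pyken's text output mode relies on.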
| Stage | Description | Status |
|---|---|---|
| Token-level remapping | Remap keyword values and types via YAML | Done |
| Bundled language mappings | Python→JS, JS→TS, Python→Pseudocode | Done |
| Pipeline chaining | `--tokens` output for multi-step transforms | Done |
| Discard tokens | `emit: discard` to drop tokens with no target equivalent | In progress |
| Multi-token emission | One source token expands to multiple target tokens | In progress |
| Context-aware matching | `preceded_by` / `followed_by` to disambiguate by context | In progress |
| Sequence matching | Match N consecutive tokens as a pattern, emit as one | In progress |
| Token injection | `emit: before` / `emit: after` to add tokens without removing the original | In progress |
| Negative context matching | `not_preceded_by` / `not_followed_by` to exclude rules by adjacent token | In progress |
| Value transforms | `{{value}}` interpolation and `value_regex` substitution in emit | In progress |
| Custom output language | Define a new language target from scratch | Planned |
Contributions are welcome — especially new mapping files for language pairs not yet covered.
Fork the repo, add your mapping under mappings/, and open a pull request.
MIT