Description
Currently, the tokenizer and parser are hard-coded, making it difficult to extend the language for Python §1, §2, §3, and §4. Our current parser/lexer architecture is fragmented across multiple files and concepts (tokenizer.ts, tokens.ts, Grammar.gram, and generate-ast.ts). This makes it hard to evolve the language or reason about what subset of Python our DSL supports.
First, the mapping from token strings to TokenType happens in the Tokenizer (src/tokenizer/tokenizer.ts):

```ts
const specialIdentifiers = new Map([
["and", TokenType.AND],
["or", TokenType.OR],
["while", TokenType.WHILE],
["for", TokenType.FOR],
["None", TokenType.NONE],
["is", TokenType.IS],
["not", TokenType.NOT],
["pass", TokenType.PASS],
["def", TokenType.DEF],
["lambda", TokenType.LAMBDA],
["from", TokenType.FROM],
["True", TokenType.TRUE],
["False", TokenType.FALSE],
["break", TokenType.BREAK],
["continue", TokenType.CONTINUE],
["return", TokenType.RETURN],
["assert", TokenType.ASSERT],
["import", TokenType.IMPORT],
["global", TokenType.GLOBAL],
["nonlocal", TokenType.NONLOCAL],
["if", TokenType.IF],
["elif", TokenType.ELIF],
["else", TokenType.ELSE],
["in", TokenType.IN],
]);
```

Restrictions for Python §1 then happen in several places.
Level 1: Token string -> TokenType mappings are restricted in the Tokenizer (src/tokenizer/tokenizer.ts):

```ts
this.forbiddenIdentifiers = new Map([
["async", TokenType.ASYNC],
["await", TokenType.AWAIT],
["yield", TokenType.YIELD],
["with", TokenType.WITH],
["del", TokenType.DEL],
["try", TokenType.TRY],
["except", TokenType.EXCEPT],
["finally", TokenType.FINALLY],
["raise", TokenType.RAISE],
]);
```

Level 2: TokenTypes declared but unused. These TokenTypes never appear in the tokenizer's scanToken() method:

```ts
//// Unused - found in normal Python
SEMI,
DOT,
LBRACE,
RBRACE,
TILDE,
CIRCUMFLEX,
LEFTSHIFT,
RIGHTSHIFT,
PLUSEQUAL,
MINEQUAL,
STAREQUAL,
...
```

Level 3: Grammar.gram - Omitted Productions
The grammar file doesn't include rules for:
- List comprehensions
- Dictionary literals {}
- Class definitions
- Augmented assignment += (see the sketch after this list)
- Slice operations (partially there)
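
To make one of these concrete: augmented assignment is conceptually cheap, since it can desugar to plain assignment at parse time, yet reintroducing it still means touching the tokenizer, the grammar, and the AST together. A minimal sketch of the desugaring, using hypothetical AST shapes rather than py-slang's actual node definitions:

```ts
// Illustrative AST shapes (assumptions, not the repo's actual types).
type Expr =
  | { kind: "name"; id: string }
  | { kind: "binary"; op: "+" | "-"; left: Expr; right: Expr };

type Stmt = { kind: "assign"; target: string; value: Expr };

// Desugar "x += e" into "x = x + e" at parse time.
function desugarAugAssign(target: string, op: "+" | "-", value: Expr): Stmt {
  return {
    kind: "assign",
    target,
    value: { kind: "binary", op, left: { kind: "name", id: target }, right: value },
  };
}

// "x += y" becomes "x = x + y"
console.log(JSON.stringify(desugarAugAssign("x", "+", { kind: "name", id: "y" })));
```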
Issue
There’s no central place defining which parts of Python are “in” or “out.” This redundancy creates drift between the grammar, tokenizer, and AST.
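
To see the drift concretely, here is a minimal, self-contained sketch (with illustrative names, not the repo's actual declarations) of the consistency check that currently has to be written and remembered by hand:

```ts
// Nothing ties the TokenType enum to the keyword map, so unused members
// accumulate silently unless a check like this is maintained manually.

enum TokenType { IF, ELSE, PLUSEQUAL } // PLUSEQUAL: declared, never scanned

const specialIdentifiers = new Map<string, TokenType>([
  ["if", TokenType.IF],
  ["else", TokenType.ELSE],
  // no entry (and no scanToken case) ever produces PLUSEQUAL
]);

const produced = new Set(specialIdentifiers.values());
for (const key of Object.keys(TokenType).filter((k) => isNaN(Number(k)))) {
  const value = TokenType[key as keyof typeof TokenType];
  if (!produced.has(value)) {
    console.warn(`TokenType.${key} is declared but never produced`);
  }
}
// -> TokenType.PLUSEQUAL is declared but never produced
```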
Grammar.gram is also not connected to the parser (or to anything, really)
It serves as documentation, not a live grammar. The actual parser is hand-written recursive descent, and the AST definitions (generate-ast.ts) use a separate DSL.
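
For context, generators in this style typically consume a small string DSL describing node shapes; the sketch below is a hypothetical defineAst in the Crafting Interpreters convention, not the actual format of generate-ast.ts:

```ts
// Hypothetical defineAst-style generator: each "Name : Type field, ..."
// line becomes a class declaration string.
function defineAst(base: string, nodes: string[]): string {
  return nodes
    .map((line) => {
      const [name, fieldSpec] = line.split(":").map((s) => s.trim());
      const params = fieldSpec.split(",").map((f) => {
        const [type, field] = f.trim().split(/\s+/);
        return `readonly ${field}: ${type}`;
      });
      return `export class ${name} extends ${base} { constructor(${params.join(", ")}) { super(); } }`;
    })
    .join("\n");
}

console.log(
  defineAst("Expr", [
    "Binary  : Expr left, Token operator, Expr right",
    "Literal : LiteralValue value",
  ])
);
```

The point is that node and token names here are raw strings: nothing connects them to the TokenType enum or to Grammar.gram, so they form a third place that must be synced by hand.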
There’s no automated way to regenerate tokenizer or parser code from the .gram file, so manual syncing is required.
Solution
Use a parser generator driven by a single grammar file (written in a suitable DSL). Benefits:
- Single Source of Truth: the grammar file defines everything (tokens, syntax, AST mapping).
- Faster Language Iteration: add/remove features by editing one .ne file.
- Maintainability: remove duplicated enum and keyword logic.
- Extensibility: Easier to reintroduce Python features (e.g., +=, comprehensions) as needed.
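
As a rough sketch of the direction (assuming nearley plus its companion lexer moo; the token rules and keyword lists below are illustrative, not a finished grammar), the special and forbidden identifier maps collapse into one table:

```ts
import * as moo from "moo";

// One table drives the lexer; there is no separate enum to keep in sync.
const KEYWORDS = ["and", "or", "not", "if", "elif", "else", "while", "for",
                  "def", "return", "pass", "True", "False", "None"];
const FORBIDDEN = ["async", "await", "yield", "with", "del", "try",
                   "except", "finally", "raise"];

const lexer = moo.compile({
  ws: /[ \t]+/,
  number: /[0-9]+/,
  name: {
    match: /[A-Za-z_][A-Za-z0-9_]*/,
    type: moo.keywords({ keyword: KEYWORDS, forbidden: FORBIDDEN }),
  },
  op: ["+", "-", "*", "/", "=", "(", ")", ":", ","],
  nl: { match: /\n/, lineBreaks: true },
});

lexer.reset("while x: pass\n");
for (const tok of lexer) {
  if (tok.type === "forbidden") {
    throw new SyntaxError(`'${tok.value}' is not in this Python subset`);
  }
  console.log(tok.type, tok.value);
}
```

A .ne grammar would then reference these token types directly, so adding or removing a language feature becomes a one-file edit instead of coordinated changes across the tokenizer, the TokenType enum, Grammar.gram, and generate-ast.ts.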