Description
Currently, the tokenizer and parser are hard-coded, making it difficult to extend the language for Python §1, §2, §3, and §4. Our current parser/lexer architecture is fragmented across multiple files and concepts (tokenizer.ts, tokens.ts, Grammar.gram, and generate-ast.ts). This makes it hard to evolve the language or reason about what subset of Python our DSL supports.
First, the mapping from token strings to TokenType happens in the Tokenizer (src/tokenizer/tokenizer.ts):

```ts
const specialIdentifiers = new Map([
["and", TokenType.AND],
["or", TokenType.OR],
["while", TokenType.WHILE],
["for", TokenType.FOR],
["None", TokenType.NONE],
["is", TokenType.IS],
["not", TokenType.NOT],
["pass", TokenType.PASS],
["def", TokenType.DEF],
["lambda", TokenType.LAMBDA],
["from", TokenType.FROM],
["True", TokenType.TRUE],
["False", TokenType.FALSE],
["break", TokenType.BREAK],
["continue", TokenType.CONTINUE],
["return", TokenType.RETURN],
["assert", TokenType.ASSERT],
["import", TokenType.IMPORT],
["global", TokenType.GLOBAL],
["nonlocal", TokenType.NONLOCAL],
["if", TokenType.IF],
["elif", TokenType.ELIF],
["else", TokenType.ELSE],
["in", TokenType.IN],
]);
```

Restrictions for Python §1 then happen in several places.
Level 1: Token string -> TokenType mappings are restricted in the Tokenizer (src/tokenizer/tokenizer.ts):

```ts
this.forbiddenIdentifiers = new Map([
["async", TokenType.ASYNC],
["await", TokenType.AWAIT],
["yield", TokenType.YIELD],
["with", TokenType.WITH],
["del", TokenType.DEL],
["try", TokenType.TRY],
["except", TokenType.EXCEPT],
["finally", TokenType.FINALLY],
["raise", TokenType.RAISE],
]);
```

Level 2: TokenTypes declared but unused. These TokenTypes never appear in the tokenizer's scanToken() method:

```ts
//// Unused - found in normal Python
SEMI,
DOT,
LBRACE,
RBRACE,
TILDE,
CIRCUMFLEX,
LEFTSHIFT,
RIGHTSHIFT,
PLUSEQUAL,
MINEQUAL,
STAREQUAL,
...
```

Level 3: Grammar.gram - Omitted Productions
The grammar file doesn't include rules for:
- List comprehensions
- Dictionary literals {}
- Class definitions
- Augmented assignment += (see the sketch after this list)
- Slice operations (partially there)
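
To make one of these concrete: augmented assignment is conceptually cheap, since it can desugar to plain assignment at parse time, yet reintroducing it still means touching the tokenizer, the grammar, and the AST together. A minimal sketch of the desugaring, using hypothetical AST shapes rather than py-slang's actual node definitions:

```ts
// Illustrative AST shapes (assumptions, not the repo's actual types).
type Expr =
  | { kind: "name"; id: string }
  | { kind: "binary"; op: "+" | "-"; left: Expr; right: Expr };

type Stmt = { kind: "assign"; target: string; value: Expr };

// Desugar "x += e" into "x = x + e" at parse time.
function desugarAugAssign(target: string, op: "+" | "-", value: Expr): Stmt {
  return {
    kind: "assign",
    target,
    value: { kind: "binary", op, left: { kind: "name", id: target }, right: value },
  };
}

// "x += y" becomes "x = x + y"
console.log(JSON.stringify(desugarAugAssign("x", "+", { kind: "name", id: "y" })));
```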
Issue
There’s no central place defining which parts of Python are “in” or “out.” This redundancy creates drift between the grammar, tokenizer, and AST.
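
To see the drift concretely, here is a minimal, self-contained sketch (with illustrative names, not the repo's actual declarations) of the consistency check that currently has to be written and remembered by hand:

```ts
// Nothing ties the TokenType enum to the keyword map, so unused members
// accumulate silently unless a check like this is maintained manually.

enum TokenType { IF, ELSE, PLUSEQUAL } // PLUSEQUAL: declared, never scanned

const specialIdentifiers = new Map<string, TokenType>([
  ["if", TokenType.IF],
  ["else", TokenType.ELSE],
  // no entry (and no scanToken case) ever produces PLUSEQUAL
]);

const produced = new Set(specialIdentifiers.values());
for (const key of Object.keys(TokenType).filter((k) => isNaN(Number(k)))) {
  const value = TokenType[key as keyof typeof TokenType];
  if (!produced.has(value)) {
    console.warn(`TokenType.${key} is declared but never produced`);
  }
}
// -> TokenType.PLUSEQUAL is declared but never produced
```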
Grammar.gram is also not connected to the parser (or to anything, really)
It serves as documentation, not a live grammar. The actual parser is hand-written recursive descent, and the AST definitions (generate-ast.ts) use a separate DSL.
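
For context, generators in this style typically consume a small string DSL describing node shapes; the sketch below is a hypothetical defineAst in the Crafting Interpreters convention, not the actual format of generate-ast.ts:

```ts
// Hypothetical defineAst-style generator: each "Name : Type field, ..."
// line becomes a class declaration string.
function defineAst(base: string, nodes: string[]): string {
  return nodes
    .map((line) => {
      const [name, fieldSpec] = line.split(":").map((s) => s.trim());
      const params = fieldSpec.split(",").map((f) => {
        const [type, field] = f.trim().split(/\s+/);
        return `readonly ${field}: ${type}`;
      });
      return `export class ${name} extends ${base} { constructor(${params.join(", ")}) { super(); } }`;
    })
    .join("\n");
}

console.log(
  defineAst("Expr", [
    "Binary  : Expr left, Token operator, Expr right",
    "Literal : LiteralValue value",
  ])
);
```

The point is that node and token names here are raw strings: nothing connects them to the TokenType enum or to Grammar.gram, so they form a third place that must be synced by hand.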
There’s no automated way to regenerate tokenizer or parser code from the .gram file, so manual syncing is required.
Solution
Use a parser generator driven by a single grammar file (written in a suitable DSL). Benefits:
- Single Source of Truth: the grammar file defines everything (tokens, syntax, AST mapping).
- Faster Language Iteration: add/remove features by editing one .ne file.
- Maintainability: remove duplicated enum and keyword logic.
- Extensibility: Easier to reintroduce Python features (e.g., +=, comprehensions) as needed.
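
As a rough sketch of the direction (assuming nearley plus its companion lexer moo; the token rules and keyword lists below are illustrative, not a finished grammar), the special and forbidden identifier maps collapse into one table:

```ts
import * as moo from "moo";

// One table drives the lexer; there is no separate enum to keep in sync.
const KEYWORDS = ["and", "or", "not", "if", "elif", "else", "while", "for",
                  "def", "return", "pass", "True", "False", "None"];
const FORBIDDEN = ["async", "await", "yield", "with", "del", "try",
                   "except", "finally", "raise"];

const lexer = moo.compile({
  ws: /[ \t]+/,
  number: /[0-9]+/,
  name: {
    match: /[A-Za-z_][A-Za-z0-9_]*/,
    type: moo.keywords({ keyword: KEYWORDS, forbidden: FORBIDDEN }),
  },
  op: ["+", "-", "*", "/", "=", "(", ")", ":", ","],
  nl: { match: /\n/, lineBreaks: true },
});

lexer.reset("while x: pass\n");
for (const tok of lexer) {
  if (tok.type === "forbidden") {
    throw new SyntaxError(`'${tok.value}' is not in this Python subset`);
  }
  console.log(tok.type, tok.value);
}
```

A .ne grammar would then reference these token types directly, so adding or removing a language feature becomes a one-file edit instead of coordinated changes across the tokenizer, the TokenType enum, Grammar.gram, and generate-ast.ts.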