TLAi+Bench

A dataset and benchmark suite for evaluating Large Language Models (LLMs) on TLA+ formal specification tasks, featuring logic puzzles and real-world scenarios.

📌 Context & Motivation

This repository addresses the growing need for systematic evaluation of AI-assisted formal methods development. It stems from two initiatives in the TLA+ community:

  • TLA+ Dataset Issue: A community proposal to create standardized benchmarks for LLM evaluation on TLA+ tasks
  • TLAi+ Challenge: The TLA+ Foundation's GenAI-accelerated challenge showcasing innovative work at the intersection of TLA+ and AI

🎯 Purpose

TLAi+Bench enables:

  • Consistent LLM Evaluation: Objective comparison of language model performance on formal specification tasks
  • Tool Development: Reference benchmarks for AI-assisted TLA+ development tools
  • Agentic Workflow Development: Foundation for building tool-supported autonomous agents that can iteratively develop, verify, and refine formal specifications
  • Research Advancement: A standardized dataset for formal methods and AI research
  • Educational Resource: Well-documented examples for learning TLA+ through practical problems

📁 Repository Structure

TLAi+Bench/
├── puzzles/                    # Problem descriptions in natural language
│   ├── CabbageGoatWolf.md     # Classic river crossing puzzle
│   ├── CatInABox.md           # Quantum mechanics inspired puzzle
│   ├── CoffeeCan.md           # Probabilistic algorithm puzzle
│   ├── DieHard.md             # Water jug measurement puzzle
│   ├── DiningPhilosophers.md  # Concurrency classic
│   ├── GameOfLife.md          # Cellular automaton
│   ├── Prisoners.md           # Logic and game theory puzzle
│   ├── RiverCrossingFlashlight.md # Optimization puzzle
│   └── TowerOfHanoi.md        # Recursive puzzle
├── gold/                      # Reference TLA+ specifications
│   ├── CatInBoxGold.tla
│   ├── CoffeeCanGold.tla
│   ├── DieHardGold.tla
│   ├── HanoiGold.tla
│   ├── PrisonersGold.tla
│   └── RiverCrossingFlashlightGold.tla
├── genaisrc/                  # AI generation scripts and utilities
│   └── translate.genai.mts    # Main translation script (NL → TLA+)

🧩 Puzzle Categories

The benchmark includes diverse problem types to test different aspects of formal specification:

  1. Logic Puzzles: Classic problems requiring constraint modeling (Die Hard, River Crossing)
  2. Concurrency: Multi-process coordination challenges (Dining Philosophers)
  3. Algorithms: Formal specification of computational processes (Coffee Can)
  4. Games & Strategy: Decision-making and game theory (Prisoners)
  5. Mathematical: Recursive and mathematical structures (Tower of Hanoi)
  6. Simulation: Dynamic system modeling (Game of Life)

🚀 Getting Started

Prerequisites

  • VSCode: Required to host the TLA+ MCP servers
    • TLA+ VSCode extension (tlaplus.vscode-ide)
    • A virtual X11 server such as Xvfb (to run VSCode in headless environments)
  • Node.js 24+: For GenAIScript runtime
  • TLA+ Tools (see the download sketch after this list):
    • tla2tools.jar (nightly build from tlapl.us)
    • CommunityModules-deps.jar (latest release)
  • GenAIScript: For AI-assisted specification generation
    • We use GenAIScript because it balances programmability with automation: plain Python scripts are too low-level, and IDE chats are not scriptable. It provides JavaScript-based prompt assembly, MCP tool integration, headless operation for automated benchmark runs, and broad support for LLM providers (OpenAI, Azure OpenAI, Anthropic, GitHub Models, local models via Ollama, and more).
    • The development team is highly responsive to issues (#1872, #1833, #1809, ...).
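
To set up the TLA+ tools listed above, a minimal download sketch is shown below (a Unix-like shell is assumed; the exact URLs are assumptions and may change, so verify them against the nightly builds at tlapl.us and the CommunityModules release page):

# Fetch the TLA+ tools (URLs are assumptions; check tlapl.us and the CommunityModules releases if they fail)
curl -L -o tla2tools.jar https://nightly.tlapl.us/dist/tla2tools.jar
curl -L -o CommunityModules-deps.jar https://github.com/tlaplus/CommunityModules/releases/latest/download/CommunityModules-deps.jar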

Running Benchmarks

To run the TLA+ benchmark suite:

# Run GenAIScript to generate and verify TLA+ specifications
DEBUG=* npx genaiscript@latest run genaisrc/translate.genai.mts

# Or run with a specific model configuration
DEBUG=* npx genaiscript@latest run --provider github --model github:openai/gpt-4 genaisrc/translate.genai.mts

# Or run with a specific provider; for Bedrock, create an inference profile first (see https://github.com/microsoft/genaiscript/issues/1918)
DEBUG=* npx --yes genaiscript@latest run translate --provider anthropic_bedrock --model anthropic_bedrock:arn:aws:bedrock:us-east-1:024871859028:inference-profile/us.anthropic.claude-sonnet-4-20250514-v1:0

If you have a GitHub Copilot Chat subscription, you can also run the translate script directly within VSCode, using the Copilot models without setting up a separate LLM provider. Note that these models are not available from pure command-line usage, and keep an eye on your usage budget.

The translate GenAIScript handles the complete workflow:

  • Reads puzzle descriptions from puzzles/
  • Generates TLA+ specifications using LLMs
  • Automatically runs syntax checking and model verification
  • Compares results against gold standard specifications

Evaluation Workflow

  1. Problem Selection: Choose a puzzle from the puzzles/ directory
  2. Specification Development: Create a TLA+ specification based on the natural language description
  3. Verification: Use TLC to model check your specification (see the sketch after this list)
  4. Comparison: Compare against the gold standard in gold/
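
For step 3, the following is a minimal sketch of checking a specification by hand with the TLA+ tools; the file names DieHard.tla and DieHard.cfg are illustrative and stand for whatever specification and model configuration you created in step 2:

# Syntax-check the spec with SANY, then model check it with TLC
java -cp tla2tools.jar tla2sany.SANY DieHard.tla
java -cp tla2tools.jar:CommunityModules-deps.jar tlc2.TLC -config DieHard.cfg DieHard.tla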

📊 Evaluation Criteria

Synthesized specifications are evaluated against gold standard specifications using:

  • Counterexample Analysis: Does the synthesized spec produce the expected counterexample when model checked?
  • Refinement Checking: Does the synthesized specification refine the gold standard specification? (See the sketch after this list.)
  • Behavioral Equivalence: Are the allowed behaviors consistent with the gold standard specification?
  • Property Satisfaction: Does the spec satisfy the same safety and liveness properties as the gold standard?
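
As an illustration of the refinement check, the sketch below wraps a hypothetical synthesized DieHard spec together with the gold spec DieHardGold in a small model-checking module and asks TLC to verify that every behavior of the synthesized spec is also allowed by the gold spec. The module, definition, and file names are assumptions, and the check only works if both specs declare compatible variables:

# Wrapper module that instantiates the gold spec (names are illustrative)
cat > MC_DieHard.tla <<'EOF'
---- MODULE MC_DieHard ----
EXTENDS DieHard                 \* the synthesized spec (assumed to define Spec)
Gold == INSTANCE DieHardGold    \* the gold spec from gold/ (assumed to share variable names)
GoldSpec == Gold!Spec
====
EOF

# TLC model configuration: use the synthesized spec as the behavior spec
# and check the gold spec as a temporal property (i.e., refinement)
cat > MC_DieHard.cfg <<'EOF'
SPECIFICATION Spec
PROPERTY GoldSpec
EOF

java -cp tla2tools.jar tlc2.TLC -config MC_DieHard.cfg MC_DieHard.tla

If TLC reports a violation of GoldSpec, the trace it prints is a behavior of the synthesized spec that the gold standard does not allow.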

🤝 Contributing

We welcome contributions to expand the benchmark:

  1. New Puzzles: Add problem descriptions and gold standard specifications
  2. Evaluation Tools: Improve automated scoring and comparison utilities
  3. Documentation: Enhance problem descriptions and solution explanations
  4. Validation: Help verify and improve existing specifications

📚 Related Work

🏆 Acknowledgments

This work is inspired by and contributes to:

  • The TLA+ Foundation's mission to advance formal methods
  • The TLAi+ Challenge winners who demonstrated innovative AI-assisted TLA+ development
  • The broader TLA+ community's collaborative efforts

📄 License

This project is released under the MIT License, following the TLA+ community's open-source principles.


For questions, suggestions, or collaboration opportunities, please open an issue in this repository or engage with the broader TLA+ community.