███████╗████████╗ █████╗  ██████╗██╗  ██╗██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
██╔════╝╚══██╔══╝██╔══██╗██╔════╝██║ ██╔╝██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
███████╗   ██║   ███████║██║     █████╔╝ ██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
╚════██║   ██║   ██╔══██║██║     ██╔═██╗ ██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
███████║   ██║   ██║  ██║╚██████╗██║  ██╗██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
╚══════╝   ╚═╝   ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝
Benchmark coding agents on library-specific tasks
An open-source, locally deployed tool for benchmarking coding agents (especially Cursor) on library-specific tasks. Test how well AI coding assistants understand and work with your documentation, APIs, and domain-specific patterns.
StackBench was created in response to the challenges faced by devtool builders in understanding how AI coding agents interact with their software libraries and APIs. Through conversations with dozens of developer-focused companies, we consistently heard that coding agents often use outdated versions, call deprecated functions, or simply get things wrong. Many maintainers didn’t actually know how well these agents were using their libraries at all.
Existing code generation benchmarks typically evaluate models, not agents, and focus on producing self-contained code snippets rather than assessing real usage of library APIs. Almost none focus on library-specific generation tasks—meaning they don’t test whether an agent can solve a task using the actual methods and patterns from your library, rather than writing everything from scratch.
StackBench fills this gap by providing a tool that benchmarks coding agents on real-world, library-specific tasks. It helps maintainers and developers discover failures, spot improvement opportunities, and get actionable insights into how their documentation and APIs are being used by modern AI coding assistants.
See the StackBench Documentation and the Getting Started Guide for detailed instructions.
- Python 3.10+
- Node.js 18+ (for Claude Code CLI)
- Git (for repository operations)
- OpenAI API Key - For DSPy-powered use case extraction
- Anthropic API Key - For Claude Code analysis
- Install uv (Python package manager):
curl -LsSf https://astral.sh/uv/install.sh | sh
- Install Claude Code CLI (required for analysis):
npm install -g @anthropic-ai/claude-code
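To confirm that both tools are available before continuing, a quick sanity check (assuming each CLI exposes the usual --version flag):
# Verify the installs are on your PATH
uv --version
claude --version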
# Clone and install StackBench
git clone https://github.com/your-org/stackbench
cd stackbench
uv sync
# Configure environment variables
cp .env.sample .env
# Edit .env and add your API keys:
# - OPENAI_API_KEY=your_openai_key_here
# - ANTHROPIC_API_KEY=your_anthropic_key_here
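Both keys can also be supplied through your shell environment instead of .env (the analyze command explicitly accepts ANTHROPIC_API_KEY from the environment; the same is assumed here for OPENAI_API_KEY):
# Alternative: export the keys for the current shell session
export OPENAI_API_KEY=your_openai_key_here
export ANTHROPIC_API_KEY=your_anthropic_key_here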
Streamlined IDE Workflow (Recommended):
# 1. Set up repository for IDE execution (clone + extract in one command)
stackbench setup https://github.com/user/awesome-lib -a cursor -l javascript
# 2. Execute use cases manually in Cursor IDE
# ⚠️ Wait for Cursor indexing to complete before implementing!
stackbench print-prompt <run-id> -u 1 --copy
# [Implement in Cursor IDE - repeat for all use cases]
# 3. Analyze results
stackbench analyze <run-id>
Streamlined CLI Workflow (Coming Soon):
# Full automation for CLI agents (not yet implemented)
stackbench run https://github.com/user/awesome-lib -a claude-code
Setup Options:
# Focus on specific folders
stackbench setup https://github.com/user/awesome-lib -i docs,examples -a cursor -l python
# Use specific branch and language
stackbench setup https://github.com/user/awesome-lib -b develop -a cursor -l typescript
# Language aliases supported: python/py, javascript/js, typescript/ts
stackbench setup https://github.com/user/react-lib -a cursor -l js
stackbench setup <repo-url>
Set up repository for IDE execution (clone + extract use cases).
# Complete IDE setup in one command with language specification
stackbench setup https://github.com/user/awesome-lib -a cursor -l python
# Focus on specific folders with JavaScript library
stackbench setup https://github.com/user/awesome-lib -i docs,examples -a cursor -l js
# Use specific branch with TypeScript
stackbench setup https://github.com/user/awesome-lib -b develop -a cursor -l typescript
This command:
- Creates a unique run ID and directory structure
- Clones the repository to ./data/<uuid>/repo/
- Extracts use cases using DSPy analysis
- Sets up agent configuration
- Shows generated use cases and next steps
- Ready for manual IDE execution
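Once setup finishes, the new run appears in the run list and its working files sit under the data directory (assuming the default DATA_DIR=./data):
# Find the new run ID, then inspect its working directory
stackbench list
ls data/<run-id>/    # repo/, data/, run_context.json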
stackbench run <repo-url> (Coming Soon)
Full automated benchmark pipeline for CLI agents.
# Automated execution (not yet implemented)
stackbench run https://github.com/user/awesome-lib -a claude-code -i docs,examples
stackbench clone <repo-url>
Clone a repository and set up a new benchmark run.
# Clone with agent specification and language
stackbench clone https://github.com/user/awesome-lib -a cursor -i docs,examples -l python
# Clone JavaScript library with specific branch
stackbench clone https://github.com/user/react-lib -a cursor -b develop -l js
stackbench list
List all benchmark runs with their status.
stackbench list
Shows a table with:
- Run ID: Full UUID for use with other commands
- Repository: Repository name
- Phase: Current phase (created → cloned → extracted → execution → analysis_individual → analysis_overall → completed)
- Agent: Configured agent type (cursor, claude-code, etc.)
- Created: Creation timestamp
- Use Cases: Number of extracted use cases (— if not extracted yet)
- Status: Progress indicators and next steps
stackbench status <run-id>
Show detailed status and progress for a specific run.
stackbench status 4a72004a-592b-49b7-9920-08cf54485f85
Displays:
- Current phase and timeline
- Individual use case execution/analysis status
- Error tracking
- Suggested next steps based on current state
stackbench extract <run-id>
Extract use cases from a cloned repository's documentation.
# Extract use cases from a run
stackbench extract 4a72004a-592b-49b7-9920-08cf54485f85
This command:
- Validates the run is in "cloned" phase
- Uses DSPy to analyze markdown documentation
- Generates library-specific use cases with:
- Functional requirements
- User stories
- System design guidance
- Target audience and complexity level
- Updates run phase to "extracted"
- Shows next steps based on agent type
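Since extraction requires a run in the "cloned" phase, it normally follows clone directly:
# Clone first, then extract use cases from the same run
stackbench clone https://github.com/user/awesome-lib -a cursor -l python
stackbench extract <run-id>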
stackbench print-prompt <run-id> --use-case <n>
Print formatted prompt for manual execution of a specific use case.
# Print prompt for use case 1
stackbench print-prompt 4a72004a-592b-49b7-9920-08cf54485f85 -u 1
# Print prompt and copy to clipboard automatically
stackbench print-prompt 4a72004a-592b-49b7-9920-08cf54485f85 -u 1 --copy
# Override agent type for different prompt format
stackbench print-prompt <run-id> -u 2 --agent cursor
This command:
- Validates the run has extracted use cases
- Loads the specific use case details
- Formats a comprehensive prompt for the agent type
- Displays prompt with clear start/end boundaries
- Optional clipboard copy with the --copy/-c flag
- Shows target directory and next steps
- Currently supports Cursor IDE agent
stackbench analyze <run-id>
Analyze use case implementations using Claude Code.
# Analyze all use cases in a run (default: 3 parallel workers)
stackbench analyze 4a72004a-592b-49b7-9920-08cf54485f85
# Analyze with custom number of parallel workers
stackbench analyze <run-id> --workers 5
# Analyze specific use case only
stackbench analyze <run-id> --use-case 2
# Force re-analysis even if already completed
stackbench analyze <run-id> --force
This command:
- Requires Claude Code CLI: Install with npm install -g @anthropic-ai/claude-code
- Requires ANTHROPIC_API_KEY: Set in your environment or .env file
- Parallel Processing: Runs 3 use cases concurrently by default (configurable with --workers)
- Resume Capability: Automatically resumes from where it left off if interrupted
- Tests code executability by running implementation files
- Analyzes library usage patterns (real vs mocked implementations)
- Evaluates documentation consultation from code comments
- Generates structured JSON results and quality assessments
- Updates run phase to "analysis_overall" or "completed"
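For example, to re-check a single implementation after fixing it (assuming --use-case and --force compose as described above):
# Re-analyze only use case 2, overwriting its previous analysis
stackbench analyze <run-id> --use-case 2 --force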
stackbench execute <run-id> (Coming Soon)
Execute use cases with specified CLI agent.
# Automated execution (not yet implemented)
stackbench execute <run-id> --agent claude-code
stackbench clean
Clean up old benchmark runs.
# Remove runs older than 30 days (default)
stackbench clean
# Remove runs older than specific number of days
stackbench clean --older-than 7
# Dry run - see what would be deleted
stackbench clean --dry-run
Streamlined IDE Workflow (Recommended):
# One command setup with language specification
stackbench setup https://github.com/user/lib -i docs -a cursor -l javascript
# Manual execution in IDE
stackbench print-prompt <run-id> -u 1 -c # Get formatted prompt + copy to clipboard
# Paste prompt and implement in Cursor IDE
stackbench print-prompt <run-id> -u 2 -c # Continue with remaining use cases
# Analysis
stackbench analyze <run-id> # Process results when all complete
Step-by-step Workflow:
stackbench clone https://github.com/user/lib -i docs -a cursor -l python
stackbench extract <run-id> # Generate use cases
stackbench print-prompt <run-id> -u 1 -c # Manual execution...
stackbench analyze <run-id> # Process results
Automated CLI Workflow (Future):
stackbench run https://github.com/user/lib -a claude-code
StackBench clones your target repository and creates an isolated benchmark environment:
./data/<uuid>/
├── repo/               # Cloned repository
├── data/               # Benchmark data
└── run_context.json    # Complete run state
IDE Agents (Manual execution)
- Cursor
- Human interaction through IDE
- Pipeline: clone → extract → manual execution → analyze
CLI Agents (Automated execution)
- claude-code
- Fully automated execution
- Pipeline: clone → extract → execute → analyze
Each run progresses through seven distinct phases:
- created → cloned → extracted → execution → analysis_individual → analysis_overall → completed
The pipeline adapts based on agent type:
- IDE agents: Manual execution with generated prompts (setup → manual work → analyze)
- CLI agents: Fully automated execution (run command - coming soon)
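To see which phase a run is currently in, the list and status commands cover both the overview and the per-run detail:
# Phase column across all runs
stackbench list
# Detailed phase, per-use-case progress, and suggested next steps
stackbench status <run-id>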
StackBench uses Pydantic for configuration management with environment variable support.
- Copy the sample environment file:
cp .env.sample .env
- Add your OpenAI API key (required for use case extraction):
OPENAI_API_KEY=your_openai_api_key_here
- Add your Anthropic API key (required for analysis):
ANTHROPIC_API_KEY=your_anthropic_api_key_here
- Customize other settings as needed:
# Core settings
DATA_DIR=./custom-data
NUM_USE_CASES=15
DEFAULT_AGENT=cursor
# DSPy settings
DSPY_MODEL=gpt-4o-mini
DSPY_MAX_TOKENS=10000
# Analysis settings
ANALYSIS_MAX_WORKERS=3
CLAUDE_MODEL=claude-sonnet-4
# Logging
LOG_LEVEL=DEBUG
See .env.sample for all available configuration options with detailed descriptions.
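Because configuration is Pydantic-based with environment variable support, individual settings can also be overridden per invocation; a sketch, assuming the variable names from .env.sample are read straight from the environment:
# Override settings for a single command without editing .env
LOG_LEVEL=DEBUG ANALYSIS_MAX_WORKERS=5 stackbench analyze <run-id>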
uv sync
uv pip install -e .
uv run pytest tests/ # Run all tests
uv run pytest tests/test_repository.py # Run specific tests
uv run pytest -k "test_clone" -v # Run filtered tests
- Use Pydantic for all data models and configuration
- Use Rich for CLI interfaces
- Use DSPy for AI-powered components
- Write comprehensive tests with fixtures and mocking
- Follow the established patterns for RunContext and RepositoryManager
stackbench/
├── src/stackbench/
│   ├── cli.py             # Rich-based CLI
│   ├── config.py          # Pydantic configuration
│   ├── core/
│   │   ├── run_context.py # RunContext, RunConfig, RunStatus
│   │   └── repository.py  # RepositoryManager
│   ├── agents/            # Agent implementations
│   ├── extractors/        # Use case extractors
│   └── utils/             # Utilities
├── tests/                 # Test files
└── data/                  # Benchmark runs (git ignored)
This project aims to validate several hypotheses:
- Library maintainers prefer local deployment over SaaS solutions
- Cursor integration makes it easy to demonstrate clear failures on library-specific tasks
- Open source community will contribute to expand benchmark coverage
- Local deployment removes privacy/security barriers for enterprise adoption
We welcome contributions!
- Agent implementations: Add evaluation for more coding agents
- Benchmark tasks: Add new types of tasks to expand what the benchmark evaluates (e.g. use of APIs via API docs)
- Metrics: Enhance quality assessment by adding or improving evaluation metrics
✅ Functional - Cursor IDE agent fully implemented with complete workflow support. Looking to add more agents - Claude Code CLI agent coming next.