# Metaculus Forecasting Bot

An advanced forecasting bot for Metaculus that leverages ensemble learning with multiple state-of-the-art LLMs and comprehensive research integration to predict future events.
## Features

- Model Ensembling: Uses GPT-5, o3, and Sonnet 4 for diverse prediction perspectives
- Research Integration: AskNews API with Perplexity fallback for real-time information gathering
- Advanced Aggregation: Multiple aggregation strategies including mean, median, and stacking approaches (see the sketch after this list)
- Robust Pipeline: Comprehensive question processing, research, reasoning, and prediction extraction
- Numeric/Continuous Question Enhancements: e.g. PCHIP interpolation (thanks, Panshul) and tail spreading
- Prompt Improvements: specialized prompts for each question type
- Benchmarking on MC and Numeric Questions: benchmarks cover multiple-choice and numeric questions, not just binary
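
As a rough illustration of the mean and median aggregation strategies (the function below is a hypothetical sketch, not the actual API of `aggregation_strategies.py`):

```python
# Hypothetical sketch of mean/median aggregation over ensemble outputs.
import statistics


def aggregate_binary(probabilities: list[float], strategy: str = "median") -> float:
    """Combine per-model probabilities for a binary question into one forecast."""
    if strategy == "mean":
        return statistics.fmean(probabilities)
    if strategy == "median":
        return statistics.median(probabilities)
    raise ValueError(f"Unknown aggregation strategy: {strategy}")


# e.g. three models' P(yes) estimates -> a single ensemble probability
print(aggregate_binary([0.62, 0.55, 0.71], strategy="mean"))  # ~0.627
```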
## Prerequisites

- Python 3.11+ with conda and poetry
- Required API keys (see Configuration section)
## Installation

1. Clone and navigate to the repository:

   ```bash
   git clone <repo-url>
   cd metaculus-bot
   ```

2. Set up the conda environment:

   ```bash
   conda create -n metaculus-bot python=3.11
   conda activate metaculus-bot
   ```

3. Install dependencies:

   ```bash
   make install
   # or: conda run -n metaculus-bot poetry install
   ```

4. Configure the environment:

   ```bash
   cp .env.template .env
   # Edit .env with your API keys (see Configuration section)
   ```

5. Run the bot:

   ```bash
   make run
   # or: conda run -n metaculus-bot poetry run python main.py
   ```

## Project Structure

- `main.py`: Primary bot implementation using the `forecasting-tools` framework
- `community_benchmark.py`: Benchmarking CLI and Streamlit UI for performance evaluation
- `main_with_no_framework.py`: Minimal-dependencies variant for lightweight usage
- `metaculus_bot/`: Core utilities and configurations
  - `llm_configs.py`: LLM ensemble configuration and model settings
  - `research_providers.py`: AskNews and search integration
  - `aggregation_strategies.py`: Multiple prediction aggregation methods
  - `prompts.py`: Specialized prompts for different question types
  - `numeric_*.py`: Numeric question processing and validation (see the PCHIP sketch below)
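
As context for the numeric modules: PCHIP (monotone cubic) interpolation can turn a handful of elicited percentiles into a smooth, non-decreasing CDF. A minimal sketch using `scipy.interpolate.PchipInterpolator`; the percentile values and variable names are illustrative, not the exact logic in `numeric_*.py`:

```python
# Illustrative only: build a monotone CDF from elicited percentiles with PCHIP.
import numpy as np
from scipy.interpolate import PchipInterpolator

percentiles = np.array([0.05, 0.25, 0.50, 0.75, 0.95])  # cumulative probabilities
values = np.array([10.0, 42.0, 70.0, 105.0, 160.0])     # model-elicited quantiles

# Interpolate value -> cumulative probability; PCHIP preserves monotonicity,
# so the resulting CDF never decreases between the knots.
cdf = PchipInterpolator(values, percentiles)
grid = np.linspace(values[0], values[-1], 201)
print(cdf(grid[:3]))  # smoothed CDF evaluated on a dense grid
```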
## Usage

### Running the Bot

```bash
# Run the bot on current Metaculus questions
make run

# Run with specific question filtering
python main.py --filter-type binary --max-questions 10
```

### Benchmarking

```bash
# Quick smoke test (1 question)
make benchmark_run_smoke_test_binary

# Small benchmark (12 mixed questions)
make benchmark_run_small

# Large benchmark (100 mixed questions)
make benchmark_run_large

# Display benchmark results
make benchmark_display
```

### Correlation Analysis

You can analyze correlations and recompute ensembles from prior runs without re-forecasting. Simple substring-based filters let you include or exclude models in the analysis.
Examples:

```bash
# Analyze the most recent benchmark file, excluding Grok and Gemini
PYTHONPATH=. ~/miniconda3/envs/metaculus-bot/bin/python analyze_correlations.py "$(ls -t benchmarks/benchmarks_*.jsonl | head -1)" \
  --exclude-models grok-4 gemini-2.5-pro

# Analyze a directory while excluding models
python analyze_correlations.py benchmarks/ --exclude-models grok-4 gemini-2.5-pro

# Include only a subset (mutually exclusive with --exclude-models)
python analyze_correlations.py benchmarks/ --include-models qwen3-235b o3

# Apply filters to the built-in post-run analysis
python community_benchmark.py --mode run --num-questions 30 --mixed \
  --exclude-models grok-4 gemini-2.5-pro
```

Notes:
- Matching is substring-only and case-insensitive (no regex, and no space/hyphen normalization). For example, `grok-4` matches `openrouter/x-ai/grok-4`, but `grok 4` will not.
- Filters apply before computing correlation matrices, model stats, and the ensemble search. The generated report includes a “Filters Applied” section.
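
The matching rule above is simple enough to state in code; a sketch of the semantics (not the scripts' actual implementation):

```python
# Case-insensitive substring matching, as described above: no regex,
# and no normalization of spaces vs. hyphens.
def matches_any(model_name: str, patterns: list[str]) -> bool:
    name = model_name.lower()
    return any(p.lower() in name for p in patterns)


print(matches_any("openrouter/x-ai/grok-4", ["grok-4"]))  # True
print(matches_any("openrouter/x-ai/grok-4", ["grok 4"]))  # False: "grok 4" is not a substring
```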
## Testing

```bash
# Run all tests
make test

# Run a specific test file
conda run -n metaculus-bot PYTHONPATH=. poetry run pytest tests/test_specific.py
```

## Configuration

Create a `.env` file based on `.env.template`:
```bash
# Metaculus API
METACULUS_TOKEN=your_metaculus_token

# Research APIs
ASKNEWS_CLIENT_ID=your_asknews_client_id
ASKNEWS_CLIENT_SECRET=your_asknews_secret
PERPLEXITY_API_KEY=your_perplexity_key
EXA_API_KEY=your_exa_key

# LLM APIs (via OpenRouter)
OPENROUTER_API_KEY=your_openrouter_key
```

### Model Configuration

Models are configured in `metaculus_bot/llm_configs.py`:
- Primary models: GPT-5, o3, and Sonnet 4 for forecasting (see the sketch below)
- Research: AskNews with Perplexity backup
- Provider: OpenRouter for consistent API access
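
A minimal sketch of what an ensemble definition in `llm_configs.py` might look like, assuming the `forecasting-tools` `GeneralLlm` interface; the model slugs and temperature here are illustrative assumptions, not the repository's actual settings:

```python
# Illustrative ensemble config; the real list lives in metaculus_bot/llm_configs.py.
from forecasting_tools import GeneralLlm

FORECASTER_LLMS = [
    GeneralLlm(model="openrouter/openai/gpt-5", temperature=0.3),
    GeneralLlm(model="openrouter/openai/o3", temperature=0.3),
    GeneralLlm(model="openrouter/anthropic/claude-sonnet-4", temperature=0.3),
]
```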
## Development

```bash
# Lint code
make lint

# Format code
make format

# Install pre-commit hooks
make precommit_install

# Run pre-commit on all files
make precommit_all
```

### Make Targets

- `make install` - Install dependencies via conda + poetry
- `make test` - Run the pytest suite
- `make run` - Run the forecasting bot
- `make lint` - Run Ruff linting
- `make format` - Format code with Ruff
- `make benchmark_*` - Various benchmarking options
### Testing Guidelines

- Focus on end-to-end integration tests for the forecasting pipeline
- Test core aggregation logic and API integrations
- All tests must pass before PRs
- Use `pytest` with async support for LLM testing (see the sketch below)
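
For the async-support point, a hypothetical test shape using `pytest-asyncio` (assumed to be the async plugin in use; the coroutine under test is a placeholder, not a real module in this repo):

```python
# Hypothetical async test; requires pytest-asyncio to be installed.
import pytest


@pytest.mark.asyncio
async def test_forecast_returns_valid_probability():
    # Placeholder for an async pipeline call; swap in a real coroutine from the bot.
    async def fake_forecast() -> float:
        return 0.5

    result = await fake_forecast()
    assert 0.0 <= result <= 1.0
```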
## Repository Layout

```text
metaculus-bot/
├── main.py                      # Primary bot implementation
├── community_benchmark.py       # Benchmarking system
├── main_with_no_framework.py    # Minimal variant
├── metaculus_bot/               # Core utilities
│   ├── llm_configs.py           # Model ensemble configuration
│   ├── research_providers.py    # Research integration
│   ├── aggregation_strategies.py  # Prediction aggregation
│   ├── prompts.py               # Question-specific prompts
│   └── numeric_*.py             # Numeric processing modules
├── tests/                       # Test suite
├── .github/workflows/           # CI automation
├── AGENTS.md                    # Detailed coding guidelines
└── Makefile                     # Development commands
```
## Framework Integration

This project heavily uses the `forecasting-tools` framework:

- `GeneralLlm` for model interfaces
- `MetaculusApi` for platform integration
- Question types: `BinaryQuestion`, `NumericQuestion`, `MultipleChoiceQuestion`
- Prediction types: `ReasonedPrediction`, `BinaryPrediction`, etc.
- Research: `AskNewsSearcher`, `SmartSearcher`
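
To give a feel for how these types compose, a hedged sketch; the constructor fields and `invoke` call reflect common `forecasting-tools` usage but should be checked against the framework docs rather than taken as verified signatures:

```python
# Hedged sketch: field names and call signatures are assumptions about
# the forecasting-tools API, not this repo's actual pipeline code.
from forecasting_tools import GeneralLlm, ReasonedPrediction


async def forecast_binary(question_text: str) -> ReasonedPrediction:
    llm = GeneralLlm(model="openrouter/openai/o3")
    reasoning = await llm.invoke(f"Forecast this question: {question_text}")
    # Extraction of the final probability from `reasoning` is elided here.
    return ReasonedPrediction(prediction_value=0.5, reasoning=reasoning)
```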
## Documentation

- `AGENTS.md`: Comprehensive coding guidelines and repository standards
- `starter_guide.md`: Original template setup instructions
- `forecasting_tools_readme.md`: Framework documentation
## Environment Notes

- Conda environment: `metaculus-bot`
- Python version: 3.11+
- Code formatting: Ruff with a 120-character line length
- Testing: Pytest with async support
- Development: WSL2 environment with zsh terminal