The Automated LLM Speedrunning Benchmark turns the NanoGPT Speedrun into an evaluation of frontier LLM agents' ability to reproduce scientific findings. In this benchmark, an LLM agent is tasked with reproducing the innovation behind each NanoGPT Speedrun record, given access to a hint describing the innovation. Hints come in three formats: pseudocode of the change (level 1), a text description (level 2), or a markdown paper describing the improvement (level 3). Currently, no frontier model is capable of reproducing the human-driven speedrun, even when given the pseudocode hints.
When run without any hints, this benchmark further serves as an open-ended evaluation of an LLM's ability to produce new innovations in the important domain of language modeling.
You can read more about this benchmark in our paper, "The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements". The diagram below provides a high-level overview of the structure of the benchmark tasks.
| Folder | Contents |
|---|---|
| `config` | Hydra configuration files for every module involved in running an experiment |
| `core` & `util` | Source code that implements the agent-scaffolding logic |
| `workspace_templates` | Starter workspaces for tasks (e.g., speedrun tasks live in `workspace_templates/nanogpt_speedrun`) |
| `data/nanogpt_speedrun_knowledge_in_levels` | Hint data used by the speedrun tasks |
| `conda_envs` | Conda environment requirement files, one per speedrun task set |
| `launchers` | Convenience launch scripts for different experiment settings |
| `analysis` | Jupyter notebooks that generate the analyses and plots presented in the paper |
```bash
git clone git@github.com:facebookresearch/llm-speedrunner.git
cd llm-speedrunner
```
Different sets of records (records 1 – 11, records 12 – 18, and records 19 – 21) require slightly different dependencies and therefore separate conda environments:
```bash
# For records 1 – 11
conda env create -f conda_envs/speedrunner-1-11/environment-1-11.yml
conda activate record-1-11
pip install -r pip_requirements-1-11.txt

# For records 12 – 18
conda env create -f conda_envs/speedrunner-12-18/environment-12-18.yml
conda activate record-12-18
pip install -r pip_requirements-12-18.txt
```
The conda environment for records 19 – 21 requires the nightly build of `torch`, which we install via conda-pack as follows:
```bash
## conda-unpack to rebuild the environment
mkdir -p ~/path/to/envs/environment-19-21
tar xzvf speedrunner-19-21.tar.gz -C ~/path/to/envs/environment-19-21
~/path/to/envs/environment-19-21/bin/conda-unpack

## Activate the environment
source ~/path/to/envs/environment-19-21/bin/activate

## Alternatively, link the environment to your existing conda installation, then use conda activate
conda config --append envs_dirs ~/path/to/envs
conda activate environment-19-21
```
Copy `config/secrets/default.template.yaml` to `config/secrets/default.yaml` and add your API keys for the relevant LLM providers to it. Note that `config/secrets/default.yaml` is already included in `.gitignore`.
Run AIDE with o3-mini on the first record for 5 search iterations:
```bash
python launch_scientist.py \
    model=o3_mini \
    science_runner=aide \
    task=nanogpt_speedrun/speedrun_record_1 \
    n_iterations=5
```
You can pass in external knowledge sources as a list of file paths or glob strings.
The following command runs AIDE with o3-mini on the first record, using the external knowledge for hint level 1 (the pseudocode of the changes needed to arrive at the next record). Note that the quotation marks around the `knowledge_src_paths` argument are important.
```bash
python launch_scientist.py \
    model=o3_mini \
    task=nanogpt_speedrun/speedrun_record_1 \
    knowledge_src_paths=["data/nanogpt_speedrun_knowledge_in_levels/record_1/level_1_*.txt"]
```
For the NanoGPT speedrunning tasks, the knowledge source for the level L hint for record R is located at `data/nanogpt_speedrun_knowledge_in_levels/record_{R}/level_{L}_*.txt`.
We provide Slurm launch scripts for the various experiments from the paper in `launchers/`.
After adding a model config for your model at `config/model/your_model.yaml`, you can launch an experiment with your model via `launch_scientist.py model=your_model`.
- Create a new folder under `workspace_templates/` with the same name as the task. This folder is the workspace template containing the set of starting files available to the agent. Whatever is in the task's workspace template is copied into the `v_0` workspace (the contents of the root search node) in the search scaffold.
- Create a task config at `config/task/your_task.yaml` (with `# @package _global_` at the top). Refer to the existing configs for examples, e.g. `task/collatz.yaml` defines a task whose solution code runs on CPUs, whereas the tasks under `task/nanogpt_speedrun` require GPUs and are launched via `torchrun`. The exact launch configuration for the solution code can be configured via the `slurm_config_args` parameter (see `task/nanogpt_speedrun/default_config.yaml` for an example).
You can then launch an experiment for your task via `launch_scientist.py task=your_task`.
Coding agents are any subclass of `core.agent.Agent` that implements the `code` method. See the base `Coder` and `Aider` agents for examples. To integrate a custom coding agent, do the following:
- Implement your custom coding agent at `coders/your_coder.py`.
- Add a default configuration for this agent at `config/coder/your_coder.yaml`.
You can then launch an experiment for any task by specifying your coder via `launch_scientist.py task=nanogpt_speedrun/speedrun_record_1 coder=your_coder`.
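For illustration, here is a minimal sketch of what such a coder might look like. The base class and the `code` method name come from the description above, but the argument names and the use of `self.act` are assumptions about the interface, so treat this as a starting point rather than a drop-in implementation.

```python
# coders/your_coder.py -- hypothetical sketch; argument names and the call to
# self.act are assumptions about the Agent interface, not the actual API.
from core.agent import Agent


class YourCoder(Agent):
    """Minimal coding agent that asks the underlying LLM for an edited file."""

    def code(self, instruction, workspace_dir):
        # Ask the LLM to implement the requested change against the workspace.
        prompt = (
            f"Apply the following change to the code in {workspace_dir}:\n"
            f"{instruction}\n"
            "Return the full contents of the modified file."
        )
        return self.act(prompt)
```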
The scientist agent scaffold implements an experimentation loop consisting of a few stages:
- Ideation: Generating new ideas for hypotheses to test and implementation changes to try. (We disable this stage in our reproducibility experiments by setting the ideator to `dummy`, as the ideas are provided directly by the hints.)
- Experiment implementation: Coding the experiments that test the ideas produced in the ideation stage.
- Experiment execution: Running the code that implements the experiments.
- Results analysis: Extracting insights from the output of the executed experiments.
The core agent scaffold logic resides in the [`BoNScienceRunner`](core/runners/bon_science_runner.py), while ideation and implementation are handled by instances of the `Ideator` and `Coder` classes respectively, which all subclass `Agent` (see the Agent section below). The modules `core.ideators` and `core.coders` serve as central registries for `Ideator` and `Coder` subclasses, making it easy to define combinations of ideator and coder strategies, which can all be set with a single line in the top-level Hydra config. Moreover, the `BoNScienceRunner` also receives a simple instance of `Agent` (under the `assistant` property), which is used for handling one-off LLM queries.
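As a rough mental model only (the actual control flow lives in `core/runners/bon_science_runner.py` and is more involved), one iteration of the loop can be pictured as:

```python
# Conceptual sketch of a single experimentation iteration; names such as
# executor and analyzer are illustrative placeholders, not classes in this repo.
def run_iteration(ideator, coder, executor, analyzer, workspace):
    idea = ideator.act("Propose the next change to try.")   # Ideation (dummy when hints supply the idea)
    coder.code(idea, workspace)                              # Experiment implementation
    results = executor.run(workspace)                        # Experiment execution
    return analyzer.act(f"Summarize these results:\n{results}")  # Results analysis
```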
Each system-prompted LLM instance is abstracted as an agent with an `act` method, which takes a prompt (e.g. an instruction) and, optionally, a validator function and a value for `max_retries`. The validator function returns a string, which can be an arbitrary post-processed version of the LLM's response to the prompt, and it should return `None` to mark the response as invalid. The `act` method will then retry querying the LLM with the prompt up to `max_retries` times.
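As a concrete illustration, a validator might check that the response parses as JSON and return `None` otherwise. The keyword names in the commented usage below are assumptions; the paragraph above only states that `act` accepts a prompt, an optional validator, and `max_retries`.

```python
import json

def extract_json(response: str):
    """Return the response re-serialized as a JSON string, or None if it is not valid JSON."""
    try:
        return json.dumps(json.loads(response))
    except json.JSONDecodeError:
        return None  # marks the response as invalid, which triggers a retry

# Hypothetical usage with an already-constructed agent:
# answer = agent.act(
#     "Reply with a JSON object containing a single key 'steps'.",
#     validator=extract_json,
#     max_retries=3,
# )
```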
Each scientist run is encapsulated in its own subdirectory inside the `workspaces` directory (auto-generated on first run).
Each experiment implemented by the scientist during a run corresponds to a version of an initial workspace template, a directory of files and potentially nested subdirectories. The workspace template provides the initial project contents that serve as the starting point for the scientist to begin its experimentation.
When the workspace is first created, `v_0` (version 0) is initialized by copying the specified workspace template. Each experiment iteration corresponds to branching (i.e. copying) the previous version into a new version (e.g. `v_0 -> v_1`) and making changes in the new version's directory. In this way, workspaces track the full history over arbitrary trees of codebases that may be created during the course of a scientist run.
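Conceptually, branching a version amounts to copying a directory, something like the sketch below (the run directory path is illustrative; the scaffold handles this automatically):

```python
import shutil
from pathlib import Path

# Illustrative only: branch a new workspace version from the previous one,
# mirroring what the scaffold does when it creates v_1 from v_0.
run_dir = Path("workspaces/example_run")   # hypothetical run directory
shutil.copytree(run_dir / "v_0", run_dir / "v_1")
# The agent then edits files under v_1 while v_0 remains untouched.
```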
This codebase uses the CC BY-NC 4.0 license.