Game Reinforcement Learning (GRL) for post‑training large language models
GRL is an open-source framework that post-trains LLMs with multi-turn reinforcement learning on games, yielding general capability gains across diverse benchmarks.
[2025/08/27] We release GRL to reproduce the paper's results and to demonstrate general cross-benchmark gains from post-training LLMs with reinforcement learning.
```bash
# clone the repo
git clone --recurse-submodules https://github.com/lmgame-org/GRL.git
cd GRL

# create a conda environment
conda create --name grl python=3.10
conda activate grl

# install all dependencies
source scripts/install_submodules.sh

# avoid compiling flash-attn from source
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.8.0.post2 --no-build-isolation
pip install -e .

# export environment variables
export WANDB_API_KEY=your_wandb_api_key
export WANDB_ENTITY=your_wandb_entity
export HF_TOKEN=your_huggingface_token
```
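As an optional sanity check after installation, you can confirm that the pinned torch build sees a CUDA device and that the prebuilt flash-attn wheel imports cleanly. A minimal sketch using only the packages installed above:

```python
# Optional post-install sanity check: verify the pinned torch build detects
# a CUDA device and the prebuilt flash-attn wheel imports without compiling.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```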
If you want to reproduce paper results and validate BIRD SQL performance or WebShop full dataset performance:

```bash
source scripts/install_dataset.sh --all
```
For quick experimentation, train on 6×6 (1-box) Sokoban and evaluate transferability to Tetris, Blocksworld, and GSM8K:

```bash
source quick_train_qwen_halfb.sh
```
Note: RL training runs may fluctuate relative to the reported numbers, but the overall trends and gains remain consistent.
Sokoban Agent Training:

```bash
source examples/sokoban_ppo/qwen_7b.sh
```

Tetris Agent Training:

```bash
source examples/tetris_ppo/qwen_7b.sh
```
Note: BirdAgent may wait on SQLite file readiness or locks; heavy SQL queries can stall rollouts and prolong validation.
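If validation seems stuck, one way to rule out a held lock before launching is to probe the database directly. A minimal sketch using only the Python standard library; the database path is a placeholder for your local BIRD data, not a GRL setting:

```python
import sqlite3
import time

DB_PATH = "path/to/bird/database.sqlite"  # placeholder, adjust to your setup

# Open read-only (fails if the file is absent) with a busy timeout, and retry
# a few times so a held lock fails fast here instead of stalling rollouts.
for attempt in range(3):
    try:
        conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True, timeout=5.0)
        conn.execute("PRAGMA quick_check;")
        conn.close()
        print("database ready")
        break
    except sqlite3.OperationalError as err:  # e.g. "database is locked"
        print(f"attempt {attempt + 1}: {err}; retrying...")
        time.sleep(2)
```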
The framework is pre‑configured for different GPU setups:
| GPU Type | GPUs | Agent Groups | Group Size | Total Agents | Default Model | Task |
|---|---|---|---|---|---|---|
| A100 | 1 | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| L40 | 1 | 4 | 8 | 32 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| H200 | 4 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Tetris |
Note: The framework automatically scales based on available hardware. Adjust parameters in the training scripts for best performance on your setup.
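The sizing invariant behind the table is simply total agents = agent groups × group size. A minimal sketch with the presets transcribed from the table above; this is illustrative, not a GRL API:

```python
# Illustrative rollout-sizing presets transcribed from the table above.
PRESETS = {
    ("A100", 1): {"agent_groups": 8, "group_size": 16},
    ("L40", 1): {"agent_groups": 4, "group_size": 8},
    ("A100", 8): {"agent_groups": 8, "group_size": 16},
    ("H200", 4): {"agent_groups": 8, "group_size": 16},
}

def total_agents(cfg: dict) -> int:
    """Total agents is the product of group count and group size."""
    return cfg["agent_groups"] * cfg["group_size"]

assert total_agents(PRESETS[("L40", 1)]) == 32    # matches the L40 row
assert total_agents(PRESETS[("A100", 1)]) == 128  # matches the A100 rows
```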
- Sokoban: Puzzle-solving game requiring spatial reasoning
- Tetris: Decision-making and planning
- GSM8K: Mathematical reasoning tasks
- BlocksWorld: Logical planning and manipulation
- WebShop: E‑commerce navigation and decision‑making
- BIRD: SQL query generation and database reasoning
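All of these environments are driven by the same multi-turn loop: the model reads a text observation, emits an action, receives a reward, and the collected trajectories feed PPO updates. A minimal sketch of that loop; the `env` and `llm` interfaces here are illustrative stand-ins, not the actual GRL classes:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    observation: str
    action: str
    reward: float

def rollout(env, llm, max_turns: int = 20) -> list[Turn]:
    """Play one episode: observe, act, and collect rewards turn by turn.

    `env` and `llm` are illustrative stand-ins; in GRL, trajectories
    gathered this way are scored and passed to the PPO trainer.
    """
    trajectory = []
    obs = env.reset()
    for _ in range(max_turns):
        action = llm.generate(obs)            # e.g. "Up", a SQL query, a click
        obs, reward, done = env.step(action)  # environment applies the action
        trajectory.append(Turn(obs, action, reward))
        if done:
            break
    return trajectory
```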
- Tutorial - Getting started and usage walkthrough
- System Design Overview - Architecture and design principles
- Development Guide - Contributing and development workflow
Our work is powered by VERL, an open-source RLHF library, and draws insights from RAGEN.
If you find this repository helpful, please kindly cite:
```bibtex
@article{hu2025lmgame,
  title={lmgame-Bench: How Good are LLMs at Playing Games?},
  author={Hu, Lanxiang and Huo, Mingjia and Zhang, Yuxuan and Yu, Haoyang and Xing, Eric P and Stoica, Ion and Rosing, Tajana and Jin, Haojian and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.15146},
  year={2025}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.