
GRL: Game Reinforcement Learning for Post‑training LLMs

Game Reinforcement Learning (GRL) for post‑training large language models



GRL (Game Reinforcement Learning) is an open‑source framework that post‑trains LLMs via multi‑turn reinforcement learning on games, yielding general gains across diverse benchmarks.

Release

[2025/08/27] We release GRL to reproduce the paper’s results and to demonstrate general gains across benchmarks by post‑training LLMs via reinforcement learning.

Installation

# clone the repo
git clone --recurse-submodules https://github.com/lmgame-org/GRL.git
cd GRL

# create a conda environment
conda create --name grl python=3.10
conda activate grl

# install all dependencies
source scripts/install_submodules.sh
# pin torch to the CUDA 12.8 build that matches the flash-attn wheel
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
# install flash-attn against the already-installed torch (avoids compiling from source)
pip install flash-attn==2.8.0.post2 --no-build-isolation
pip install -e .

# export environment variables
export WANDB_API_KEY=your_wandb_api_key
export WANDB_ENTITY=your_wandb_entity
export HF_TOKEN=your_huggingface_token
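
After installing, a quick sanity check can confirm the pinned packages import and CUDA is visible (a minimal sketch; it only verifies the environment, not GRL itself):

# verify the pinned torch build and the flash-attn install
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"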

Optional: Install Datasets

To reproduce the paper's results and validate performance on BIRD SQL or the full WebShop dataset:

source scripts/install_dataset.sh --all
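
The full BIRD and WebShop datasets are sizable, so checking free disk space before downloading is a generic precaution (exact sizes depend on the datasets):

# check free space on the current filesystem before the download
df -h .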

Quick Run

For quick experimentation, this script trains on 6×6 (1‑box) Sokoban and evaluates transferability to Tetris, Blocksworld, and GSM8K.

source quick_train_qwen_halfb.sh
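
Even the quick run can take a while, so a generic shell pattern for launching it in the background and following the log may be useful (assuming the script behaves the same when sourced in a subshell):

# optional: run in the background and follow the log
nohup bash -c 'source quick_train_qwen_halfb.sh' > quick_train.log 2>&1 &
tail -f quick_train.log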

Training Examples

General gains of LLM ability from game RL training (paper‑reported results): see Table 4, model performance on diverse tasks.

Figure: examples of expected validation success rate curves observed during training.

Note: RL training results may fluctuate relative to reported results, but the overall trend and gains remain consistent.

Sokoban Agent Training:

source examples/sokoban_ppo/qwen_7b.sh

Tetris Agent Training:

source examples/tetris_ppo/qwen_7b.sh

Note: BirdAgent may wait on SQLite file readiness or locks; heavy SQL can stall rollouts and prolong validation.
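
If rollouts appear stuck on a locked database, a quick shell probe can show whether a SQLite file is readable or held by a lock. This is a generic diagnostic using the standard sqlite3 CLI, not GRL's own mechanism, and the database path below is hypothetical; the busy timeout makes the probe wait briefly instead of failing immediately:

# probe a BIRD SQLite file (path is hypothetical); waits up to 5s on a lock
sqlite3 path/to/bird.sqlite 'PRAGMA busy_timeout=5000; SELECT count(*) FROM sqlite_master;'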

Hardware Configuration

The framework is pre‑configured for different GPU setups:

GPU Type  GPUs  Agent Groups  Group Size  Total Agents  Default Model               Task
A100      1     8             16          128           Qwen/Qwen2.5-0.5B-Instruct  Sokoban
L40       1     4             8           32            Qwen/Qwen2.5-0.5B-Instruct  Sokoban
A100      8     8             16          128           Qwen/Qwen2.5-7B-Instruct    Sokoban
H200      4     8             16          128           Qwen/Qwen2.5-7B-Instruct    Sokoban
A100      8     8             16          128           Qwen/Qwen2.5-7B-Instruct    Tetris

Note: The framework automatically scales based on available hardware. Adjust parameters in the training scripts for best performance on your setup.
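
The total agent count is simply agent groups multiplied by group size, which is a quick way to sanity-check the table's numbers:

# total agents = agent groups × group size
echo $((8 * 16))   # 128: the A100 and H200 configurations
echo $((4 * 8))    # 32: the single-L40 configuration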

Supported Games and Agents

  • Sokoban: Puzzle-solving game requiring spatial reasoning
  • Tetris: Decision‑making and planning
  • GSM8K: Mathematical reasoning tasks
  • BlocksWorld: Logical planning and manipulation
  • WebShop: E‑commerce navigation and decision‑making
  • BIRD: SQL query generation and database reasoning

Documentation

Acknowledgments

Our work is powered by VERL, an open‑source RLHF library, and draws insights from Ragen.

Citation

If you find this repository helpful, please cite:

@article{hu2025lmgame,
  title={lmgame-Bench: How Good are LLMs at Playing Games?},
  author={Hu, Lanxiang and Huo, Mingjia and Zhang, Yuxuan and Yu, Haoyang and Xing, Eric P and Stoica, Ion and Rosing, Tajana and Jin, Haojian and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.15146},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
