Enhancing Group Relative Policy Optimization (GRPO) with Variational Disagreement
This repository contains the official implementation and supplementary materials for the paper:
Entropic Resistance
Mawaba Pascal Dao, 2025
Entropic Resistance integrates Bayesian Active Learning by Disagreement (BALD) into the Group Relative Policy Optimization (GRPO) framework to improve exploration efficiency and robustness during reinforcement fine-tuning of language models. By quantifying epistemic uncertainty through entropy-based measures, the method actively rewards exploration of less certain, information-rich token choices.
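To make the entropy-based measure concrete, the sketch below computes a BALD score from a stack of per-token probability distributions produced by stochastic forward passes. The function name and tensor shapes are illustrative only and are not the repo's actual API:

```python
import torch

def bald_score(probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Illustrative BALD (mutual information) score.

    probs: (n_samples, batch, vocab) probabilities from stochastic
           (MC-dropout) forward passes of the same policy.
    Returns: (batch,) scores computed as H[E[p]] - E[H[p]].
    """
    mean_p = probs.mean(dim=0)                                      # predictive distribution
    entropy_of_mean = -(mean_p * (mean_p + eps).log()).sum(-1)      # total uncertainty
    mean_entropy = -(probs * (probs + eps).log()).sum(-1).mean(0)   # expected (aleatoric) entropy
    return entropy_of_mean - mean_entropy                           # epistemic part (BALD)
```

High scores indicate tokens where the stochastic passes disagree, i.e. where the model's uncertainty is epistemic rather than inherent to the data.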
This repo depends on a custom fork of Hugging Face's TRL, adapted to enable Monte Carlo dropout on GRPO group members. Applying dropout during generation turns the group into an approximate ensemble, which is used to compute the Bayesian Active Learning by Disagreement (BALD) bonus.
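The fork itself is not reproduced here, but the core mechanism of Monte Carlo dropout is simply keeping dropout layers stochastic at generation time while the rest of the model stays in eval mode. A minimal sketch (generic PyTorch, not tied to the TRL fork's internals):

```python
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    """Keep dropout stochastic at generation time while leaving
    everything else (e.g. normalization layers) in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # dropout keeps sampling new masks

# Each of the G group completions is then generated under a different
# dropout mask, so the GRPO group doubles as an approximate ensemble.
```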
scripts/
- Training and evaluation scripts.
- Implementation of GRPO with variational disagreement-based epistemic rewards.
- Support for different epistemic modes (`none`, `per_token`, `end_of_sequence`); see the sketch after this list.
- Configurable hyperparameters for epistemic bonus influence and ensemble size.
- Efficient parallel GPU computation using Monte Carlo Dropout.
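The hypothetical sketch below illustrates how the two non-trivial epistemic modes could fold the BALD bonus into the scalar rewards; the function name, argument shapes, and the mixing coefficient `beta` are assumptions for illustration, not the scripts' actual interface:

```python
import torch

def add_epistemic_bonus(rewards: torch.Tensor, bald: torch.Tensor,
                        mode: str = "per_token", beta: float = 0.1) -> torch.Tensor:
    """rewards: (batch,) task rewards; bald: (batch, seq_len) per-token BALD scores."""
    if mode == "none":
        return rewards
    if mode == "per_token":
        # spread the bonus across the whole completion
        return rewards + beta * bald.mean(dim=-1)
    if mode == "end_of_sequence":
        # reward disagreement only at the final token
        return rewards + beta * bald[:, -1]
    raise ValueError(f"unknown epistemic mode: {mode}")
```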
Clone this repository:
git clone https://github.com/PascalPolygon/grpo-vdr.git
cd grpo-vdr
Install dependencies:
pip install -r requirements.txt
Train the GRPO model with Entropic Token Surge using the provided scripts:
bash scripts/grpo_train_multinode.sh
Adjust experiment configurations by editing `configs/grpo_config.yaml`.
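The exact keys depend on the config file shipped with the repo; the snippet below is only an illustrative example of the kinds of hyperparameters described above (epistemic mode, bonus weight, number of stochastic passes), not the file's actual schema:

```yaml
# configs/grpo_config.yaml -- illustrative keys and values only
epistemic_mode: per_token        # none | per_token | end_of_sequence
epistemic_bonus_weight: 0.1      # scale of the BALD bonus
num_mc_dropout_samples: 8        # stochastic forward passes per prompt
num_generations: 8               # GRPO group size
```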
If you find our paper and code useful, please cite:
@article{mdaoentropic,
  title={Entropic Resistance},
  author={Dao, Mawaba Pascal},
  year={2025},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
}
For questions or suggestions, open an issue on GitHub or reach out via email:
- Author: Mawaba Pascal Dao (pdao2015)
- Email: [email protected]