Enhancing Group Relative Policy Optimization (GRPO) with Variational Disagreement
This repository contains the official implementation and supplementary materials for the paper:
Entropic Resistance
Mawaba Pascal Dao, 2025
Entropic Resistance integrates Bayesian Active Learning by Disagreement (BALD) into the Group Relative Policy Optimization (GRPO) framework to improve exploration efficiency and robustness during reinforcement fine-tuning of language models. By quantifying epistemic uncertainty through entropy-based measures, the method actively rewards exploration of less certain, information-rich token choices.
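To make the entropy-based measure concrete, the sketch below computes a BALD score from a stack of per-token probability distributions produced by stochastic forward passes. The function name and tensor shapes are illustrative only and are not the repo's actual API:

```python
import torch

def bald_score(probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Illustrative BALD (mutual information) score.

    probs: (n_samples, batch, vocab) probabilities from stochastic
           (MC-dropout) forward passes of the same policy.
    Returns: (batch,) scores computed as H[E[p]] - E[H[p]].
    """
    mean_p = probs.mean(dim=0)                                      # predictive distribution
    entropy_of_mean = -(mean_p * (mean_p + eps).log()).sum(-1)      # total uncertainty
    mean_entropy = -(probs * (probs + eps).log()).sum(-1).mean(0)   # expected (aleatoric) entropy
    return entropy_of_mean - mean_entropy                           # epistemic part (BALD)
```

High scores indicate tokens where the stochastic passes disagree, i.e. where the model's uncertainty is epistemic rather than inherent to the data.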
This repo depends on a custom fork of Hugging Face's TRL, adapted to enable Monte Carlo dropout on GRPO group members. Applying dropout during generation turns the group into an approximate ensemble, which is used to compute the Bayesian Active Learning by Disagreement (BALD) bonus.
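The fork itself is not reproduced here, but the core mechanism of Monte Carlo dropout is simply keeping dropout layers stochastic at generation time while the rest of the model stays in eval mode. A minimal sketch (generic PyTorch, not tied to the TRL fork's internals):

```python
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    """Keep dropout stochastic at generation time while leaving
    everything else (e.g. normalization layers) in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # dropout keeps sampling new masks

# Each of the G group completions is then generated under a different
# dropout mask, so the GRPO group doubles as an approximate ensemble.
```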
scripts/
- Training and evaluation scripts.
- Implementation of GRPO with variational disagreement-based epistemic rewards.
- Support for different epistemic modes (`none`, `per_token`, `end_of_sequence`); see the sketch after this list.
- Configurable hyperparameters for epistemic bonus influence and ensemble size.
- Efficient parallel GPU computation using Monte Carlo Dropout.
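The hypothetical sketch below illustrates how the two non-trivial epistemic modes could fold the BALD bonus into the scalar rewards; the function name, argument shapes, and the mixing coefficient `beta` are assumptions for illustration, not the scripts' actual interface:

```python
import torch

def add_epistemic_bonus(rewards: torch.Tensor, bald: torch.Tensor,
                        mode: str = "per_token", beta: float = 0.1) -> torch.Tensor:
    """rewards: (batch,) task rewards; bald: (batch, seq_len) per-token BALD scores."""
    if mode == "none":
        return rewards
    if mode == "per_token":
        # spread the bonus across the whole completion
        return rewards + beta * bald.mean(dim=-1)
    if mode == "end_of_sequence":
        # reward disagreement only at the final token
        return rewards + beta * bald[:, -1]
    raise ValueError(f"unknown epistemic mode: {mode}")
```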
Clone this repository:
git clone https://github.com/PascalPolygon/grpo-vdr.git
cd grpo-vdr
Install dependencies:
pip install -r requirements.txt
Train the GRPO model with Entropic Token Surge using the provided scripts:
bash scripts/grpo_train_multinode.sh
Adjust experiment configurations by editing `configs/grpo_config.yaml`.
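The exact keys depend on the config file shipped with the repo; the snippet below is only an illustrative example of the kinds of hyperparameters described above (epistemic mode, bonus weight, number of stochastic passes), not the file's actual schema:

```yaml
# configs/grpo_config.yaml -- illustrative keys and values only
epistemic_mode: per_token        # none | per_token | end_of_sequence
epistemic_bonus_weight: 0.1      # scale of the BALD bonus
num_mc_dropout_samples: 8        # stochastic forward passes per prompt
num_generations: 8               # GRPO group size
```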
If you find our paper and code useful, please cite:
@article{mdaoentropic,
  title={Entropic Resistance},
  author={Dao, Mawaba Pascal},
  year={2025},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
}
For questions or suggestions, open an issue on GitHub or reach out via email:
- Author: Mawaba Pascal Dao (pdao2015)
- Email: [email protected]