📄 [Paper](https://arxiv.org/abs/2412.17256)
B-STAR (Balanced Self-Taught Reasoner) is a framework designed to improve the self-improvement process of reasoning models by dynamically balancing exploration and exploitation throughout training. This approach is particularly effective in enhancing performance in tasks requiring complex reasoning, such as mathematical problem-solving, coding, and commonsense reasoning.
Self-improvement in reasoning models involves iterative training in which a model generates its own training data from its outputs. However, existing methods often stagnate after a few iterations due to an imbalance between two critical factors:
- Exploration: The model's ability to generate diverse and high-quality responses.
- Exploitation: The effectiveness of external rewards in distinguishing and leveraging high-quality responses.
 
B-STAR introduces an adaptive mechanism to monitor and balance these factors dynamically, ensuring consistent performance improvements over multiple training iterations. Its key components:
- Dynamic Configuration Adjustments: Automatically tunes exploration and exploitation configurations (e.g., sampling temperature, reward thresholds) to optimize the self-improvement process; a sketch of this loop follows the list.
- Balance Score Metric: Quantifies the interplay between exploration and exploitation and guides the dynamic adjustments.
- Generalization Across Tasks: Demonstrates effectiveness on mathematical reasoning, coding challenges, and commonsense reasoning tasks.
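
To make the dynamic adjustment concrete, here is a minimal bash sketch of what a per-iteration configuration search could look like. It is illustrative only: `compute_balance_score.py`, its flags, and the temperature grid are hypothetical stand-ins rather than the released implementation (the actual pipeline lives in `train_bstar.sh`).

```bash
# Hypothetical sketch: each self-improvement iteration picks the sampling
# temperature whose generated responses maximize the balance score.
for iter in 1 2 3; do
  best_temp=""
  best_score=-1
  for temp in 0.7 0.9 1.1; do
    # Assumed helper that samples responses at this temperature and prints
    # the resulting balance score; NOT part of the released code.
    score=$(python compute_balance_score.py --iteration "$iter" --temperature "$temp")
    if (( $(echo "$score > $best_score" | bc -l) )); then
      best_score=$score
      best_temp=$temp
    fi
  done
  echo "iteration $iter: sampling with temperature $best_temp (balance score $best_score)"
  # ...generate self-training data at $best_temp and update the model here...
done
```
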
 
B-STAR achieves state-of-the-art performance across various benchmarks:
- Significant improvements compared to previous self-improvement methods.
- Sustained performance growth across multiple iterations, outperforming existing methods that stagnate after a few iterations.

Our code builds upon easy-to-hard and gpt-accelerate. Please refer to gpt-accelerate for environment setup and model weight conversion instructions.
We first need to prepare the model checkpoint in the gpt-fast format.
```bash
export DATA_DIR=/path/to/your/data/directory
export MODEL_REPO=mistralai/Mistral-7B-v0.1
python scripts/download.py \
    --repo_id $MODEL_REPO \
    --local_dir $DATA_DIR/checkpoints
python scripts/convert_hf_checkpoint.py \
    --checkpoint_dir $DATA_DIR/checkpoints/$MODEL_REPO \
    --target_precision bf16
```
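
If the conversion succeeds, the converted weights are written as `model.pth` alongside the downloaded checkpoint (this is the file the training commands below reference); a quick sanity check:

```bash
# Verify the gpt-fast-format checkpoint exists before training.
ls -lh $DATA_DIR/checkpoints/$MODEL_REPO/model.pth
```
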
Next, run supervised fine-tuning (SFT):

```bash
export DATA_DIR=/path/to/your/data/directory
export MODEL_REPO=$DATA_DIR/checkpoints/Mistral-7B-v0.1
export OMP_NUM_THREADS=8
SFT_TRAIN_DATA=https://huggingface.co/datasets/AndrewZeng/math-trn-format/blob/main/math_format.json
# Please download this dataset to local folder
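# For example (illustrative commands; /resolve/ is Hugging Face's direct-download
# counterpart of the /blob/ web URL above):
#   wget https://huggingface.co/datasets/AndrewZeng/math-trn-format/resolve/main/math_format.json -O $DATA_DIR/math_format.json
#   SFT_TRAIN_DATA=$DATA_DIR/math_format.json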
SFT_MODEL_SAVE_NAME=math_format_11k_mistral
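# Our reading of the batch flags (not authoritative): per-device batch size 16
# with micro-batch size 4 implies 4 gradient-accumulation steps per update.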
torchrun --standalone --nproc_per_node=8 \
    train_sft.py \
    --do_train \
    --checkpoint_path $MODEL_REPO/model.pth \
    --source_max_len 768 \
    --target_max_len 768 \
    --total_max_len 1024 \
    --per_device_train_batch_size 16 \
    --micro_train_batch_size 4 \
    --learning_rate 5e-6 \
    --lr_eta_min 2e-7 \
    --num_train_epochs 3 \
    --dataset "$SFT_TRAIN_DATA" \
    --dataset_format "metamath" \
    --add_eos_to_marked_target \
    --save_strategy "steps" \
    --save_steps 25 \
    --optim_dtype bf16 \
    --save_total_limit 40 \
    --tensor_parallel_size 1 \
    --save_dir $DATA_DIR/checkpoints/$SFT_MODEL_SAVE_NAME \
    --resume_from_checkpoint
```

We constructed the PRM training data using the math-shepherd approach and trained the reward model with a pointwise objective.
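
Before launching reward-model training, it can help to confirm the data file is where the command below expects it. A minimal peek, assuming `train_prm_math_shepherd_mistral.json` sits in the working directory:

```bash
# Inspect the first few hundred bytes to verify location and rough structure.
head -c 400 train_prm_math_shepherd_mistral.json; echo
```
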
Then train the reward model:

```bash
export DATA_DIR=/path/to/your/data/directory
export MODEL_REPO=$DATA_DIR/checkpoints/Mistral-7B-v0.1
export OMP_NUM_THREADS=4
RM_DATA=train_prm_math_shepherd_mistral.json
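# Math-shepherd-style pointwise PRM data, constructed as described above; use an
# absolute path if the JSON is not in the working directory.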
RM_MODEL_SAVE_NAME=prm_model_mistral_sample_complete
torchrun --standalone --nproc_per_node=8 \
    train_rm_pointwise.py \
    --do_train \
    --checkpoint_path $MODEL_REPO/model.pth \
    --source_max_len 768 \
    --target_max_len 768 \
    --total_max_len 1024 \
    --per_device_train_batch_size 32 \
    --micro_train_batch_size 32 \
    --learning_rate 2e-6 \
    --lr_eta_min 2e-7 \
    --num_train_epochs 2 \
    --dataset "$RM_DATA" \
    --dataset_format "prm-v4" \
    --save_strategy epoch \
    --save_total_limit 5 \
    --train_on_every_token \
    --tensor_parallel_size 1 \
    --save_only_model True \
    --optim_dtype bf16 \
    --save_dir $DATA_DIR/checkpoints/$RM_MODEL_SAVE_NAME \
    --resume_from_checkpoint
```

This is our initial release code; we are working hard to clean it up and make it clearer and more readable. To run B-STaR training:

```bash
cd train_code
bash train_bstar.sh
```

Coming Soon!
If you find B-STaR useful, please cite our paper:
```bibtex
@article{zeng2024bstar,
  title={B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners},
  author={Weihao Zeng and Yuzhen Huang and Lulu Zhao and Yijun Wang and Zifei Shan and Junxian He},
  journal={arXiv preprint arXiv:2412.17256},
  year={2024},
  url={https://arxiv.org/abs/2412.17256}
}
```

