📄 [Paper](https://arxiv.org/abs/2412.17256)
B-STAR (Balanced Self-Taught Reasoner) is a framework designed to improve the self-improvement process of reasoning models by dynamically balancing exploration and exploitation throughout training. This approach is particularly effective in enhancing performance in tasks requiring complex reasoning, such as mathematical problem-solving, coding, and commonsense reasoning.
Self-improvement in reasoning models involves iterative training in which a model generates its own training data from its outputs. However, existing methods often stagnate after a few iterations due to an imbalance between two critical factors:
- Exploration: The model's ability to generate diverse and high-quality responses.
- Exploitation: The effectiveness of external rewards in distinguishing and leveraging high-quality responses.
 
B-STAR introduces an adaptive mechanism to monitor and balance these factors dynamically, ensuring consistent performance improvements over multiple training iterations. Its key components:
- Dynamic Configuration Adjustments: Automatically tunes exploration and exploitation configurations (e.g., sampling temperature, reward thresholds) to optimize the self-improvement process; a sketch of this loop follows the list.
- Balance Score Metric: Quantifies the interplay between exploration and exploitation and guides the dynamic adjustments.
- Generalization Across Tasks: Demonstrates effectiveness on mathematical reasoning, coding challenges, and commonsense reasoning tasks.
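
To make the dynamic adjustment concrete, here is a minimal bash sketch of what a per-iteration configuration search could look like. It is illustrative only: `compute_balance_score.py`, its flags, and the temperature grid are hypothetical stand-ins rather than the released implementation (the actual pipeline lives in `train_bstar.sh`).

```bash
# Hypothetical sketch: each self-improvement iteration picks the sampling
# temperature whose generated responses maximize the balance score.
for iter in 1 2 3; do
  best_temp=""
  best_score=-1
  for temp in 0.7 0.9 1.1; do
    # Assumed helper that samples responses at this temperature and prints
    # the resulting balance score; NOT part of the released code.
    score=$(python compute_balance_score.py --iteration "$iter" --temperature "$temp")
    if (( $(echo "$score > $best_score" | bc -l) )); then
      best_score=$score
      best_temp=$temp
    fi
  done
  echo "iteration $iter: sampling with temperature $best_temp (balance score $best_score)"
  # ...generate self-training data at $best_temp and update the model here...
done
```
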
 
B-STAR achieves state-of-the-art performance across various benchmarks:
- Significant improvements compared to previous self-improvement methods.
- Sustained performance growth across multiple iterations, outperforming existing methods that stagnate after a few iterations.

Our code builds upon easy-to-hard and gpt-accelerate. Please refer to gpt-accelerate for environment setup and model weight conversion instructions.
We first need to prepare the model checkpoint in the gpt-fast format.
```bash
export DATA_DIR=/path/to/your/data/directory
export MODEL_REPO=mistralai/Mistral-7B-v0.1
python scripts/download.py \
    --repo_id $MODEL_REPO \
    --local_dir $DATA_DIR/checkpoints
python scripts/convert_hf_checkpoint.py \
    --checkpoint_dir $DATA_DIR/checkpoints/$MODEL_REPO \
    --target_precision bf16
```
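
If the conversion succeeds, the converted weights are written as `model.pth` alongside the downloaded checkpoint (this is the file the training commands below reference); a quick sanity check:

```bash
# Verify the gpt-fast-format checkpoint exists before training.
ls -lh $DATA_DIR/checkpoints/$MODEL_REPO/model.pth
```
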
Next, run supervised fine-tuning (SFT):

```bash
export DATA_DIR=/path/to/your/data/directory
export MODEL_REPO=$DATA_DIR/checkpoints/Mistral-7B-v0.1
export OMP_NUM_THREADS=8
SFT_TRAIN_DATA=https://huggingface.co/datasets/AndrewZeng/math-trn-format/blob/main/math_format.json
# Please download this dataset to local folder
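# For example (illustrative commands; /resolve/ is Hugging Face's direct-download
# counterpart of the /blob/ web URL above):
#   wget https://huggingface.co/datasets/AndrewZeng/math-trn-format/resolve/main/math_format.json -O $DATA_DIR/math_format.json
#   SFT_TRAIN_DATA=$DATA_DIR/math_format.json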
SFT_MODEL_SAVE_NAME=math_format_11k_mistral
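# Our reading of the batch flags (not authoritative): per-device batch size 16
# with micro-batch size 4 implies 4 gradient-accumulation steps per update.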
torchrun --standalone --nproc_per_node=8 \
    train_sft.py \
    --do_train \
    --checkpoint_path $MODEL_REPO/model.pth \
    --source_max_len 768 \
    --target_max_len 768 \
    --total_max_len 1024 \
    --per_device_train_batch_size 16 \
    --micro_train_batch_size 4 \
    --learning_rate 5e-6 \
    --lr_eta_min 2e-7 \
    --num_train_epochs 3 \
    --dataset "$SFT_TRAIN_DATA" \
    --dataset_format "metamath" \
    --add_eos_to_marked_target \
    --save_strategy "steps" \
    --save_steps 25 \
    --optim_dtype bf16 \
    --save_total_limit 40 \
    --tensor_parallel_size 1 \
    --save_dir $DATA_DIR/checkpoints/$SFT_MODEL_SAVE_NAME \
    --resume_from_checkpoint
```

We constructed the PRM training data using the math-shepherd approach and trained the reward model with a pointwise objective.
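
Before launching reward-model training, it can help to confirm the data file is where the command below expects it. A minimal peek, assuming `train_prm_math_shepherd_mistral.json` sits in the working directory:

```bash
# Inspect the first few hundred bytes to verify location and rough structure.
head -c 400 train_prm_math_shepherd_mistral.json; echo
```
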
Then train the reward model:

```bash
export DATA_DIR=/path/to/your/data/directory
export MODEL_REPO=$DATA_DIR/checkpoints/Mistral-7B-v0.1
export OMP_NUM_THREADS=4
RM_DATA=train_prm_math_shepherd_mistral.json
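# Math-shepherd-style pointwise PRM data, constructed as described above; use an
# absolute path if the JSON is not in the working directory.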
RM_MODEL_SAVE_NAME=prm_model_mistral_sample_complete
torchrun --standalone --nproc_per_node=8 \
    train_rm_pointwise.py \
    --do_train \
    --checkpoint_path $MODEL_REPO/model.pth \
    --source_max_len 768 \
    --target_max_len 768 \
    --total_max_len 1024 \
    --per_device_train_batch_size 32 \
    --micro_train_batch_size 32 \
    --learning_rate 2e-6 \
    --lr_eta_min 2e-7 \
    --num_train_epochs 2 \
    --dataset "$RM_DATA" \
    --dataset_format "prm-v4" \
    --save_strategy epoch \
    --save_total_limit 5 \
    --train_on_every_token \
    --tensor_parallel_size 1 \
    --save_only_model True \
    --optim_dtype bf16 \
    --save_dir $DATA_DIR/checkpoints/$RM_MODEL_SAVE_NAME \
    --resume_from_checkpoint
```

This is our initial release code; we are working hard to clean it up and make it clearer and more readable. To run B-STaR training:

```bash
cd train_code
bash train_bstar.sh
```

Coming Soon!
If you find B-STaR useful, please cite our paper:
```bibtex
@article{zeng2024bstar,
  title={B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners},
  author={Weihao Zeng and Yuzhen Huang and Lulu Zhao and Yijun Wang and Zifei Shan and Junxian He},
  journal={arXiv preprint arXiv:2412.17256},
  year={2024},
  url={https://arxiv.org/abs/2412.17256}
}
```

