| Model | Checkpoint | Paper | GSM8k (%) | MATH (%) | License |
|---|---|---|---|---|---|
| LEMMA-LLAMA-3-8B | 🤗 HF Link | 📃 [LEMMA] | 79.2 | 38.3 | Llama 3 |
| LEMMA-LLAMA-3-70B | 🤗 HF Link | 📃 [LEMMA] | 91.5 | 51.8 | Llama 3 |
💡 Systematic analysis of error types: Categorizes common model-generated mathematical reasoning errors, revealing consistent error patterns across models and guiding targeted improvements.
💡 Error-type grounded error augmentation: Introduces diverse and meaningful errors by leveraging a teacher model to intentionally inject representative mistakes, with error types sampled from the analyzed distribution, enhancing the model’s ability to learn from failures.
💡 Two complementary self-correction mechanisms: Combines Fix & Continue (correcting mistakes within the original reasoning) and Fresh & Restart (restarting the reasoning process from scratch) to generate effective revision trajectories.
✅ LEMMA – A novel framework that fine-tunes LLMs on error-corrective trajectories, enabling autonomous error detection and correction during mathematical reasoning.
📊 Result – Up to 13.3% accuracy improvement for LLaMA3-8B with only 90K synthesized training examples.
The framework of LEMMA. LEMMA uses an error-type grounded mistake augmentation module and explores two error-correction strategies to construct error-corrective trajectories as the training corpus.
Experiments demonstrate that LEMMA significantly outperforms SOTA baselines. LEMMA-trained models also generalize well to out-of-distribution (OOD) benchmarks.
LEMMA mainly requires the following two packages:
- LLaMA-Factory for model training.
- math-evaluation-harness for evaluation. We use the adapted version from the Qwen2.5-Math repository.
# Install LLaMA-Factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
# Install the Qwen2.5-Math evaluation toolkit, which is adapted from math-evaluation-harness.
git clone https://github.com/QwenLM/Qwen2.5-Math
cd Qwen2.5-Math
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
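As a quick sanity check that the pinned versions were installed, the following minimal Python sketch can be run (it assumes nothing beyond the two version pins above):
# Verify that the pinned dependency versions from the install steps are importable.
import transformers
import vllm

print("transformers:", transformers.__version__)  # expected: 4.42.3
print("vllm:", vllm.__version__)                  # expected: 0.5.1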
Run the following code to load the dataset:
from datasets import load_dataset
dataset = load_dataset("panzs19/LEMMA", split="train")
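To get a quick look at what each training example contains, a minimal sketch using the standard datasets API (no field names are assumed; they are read from the dataset itself):
from datasets import load_dataset

dataset = load_dataset("panzs19/LEMMA", split="train")
print(len(dataset))           # number of training examples
print(dataset.column_names)   # fields of each example
print(dataset[0])             # inspect the first error-corrective trajectory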
Download the LEMMA dataset from Hugging Face and convert it to JSON format:
from datasets import load_dataset
import json

dataset = load_dataset("panzs19/LEMMA", split="train")
dataset_list = dataset.to_list()
with open('your_data_dir/dataset.json', 'w', encoding='utf-8') as f:
    json.dump(dataset_list, f, indent=4, ensure_ascii=False)
Specify the data path in scripts/train.sh and LLaMA-Factory/data/dataset_info.json.
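For reference, a minimal sketch of registering the converted file in LLaMA-Factory/data/dataset_info.json (the dataset name "lemma" and the column mapping are illustrative assumptions; align them with the fields in your exported JSON and with the dataset name used by scripts/train.sh):
import json

INFO_PATH = "LLaMA-Factory/data/dataset_info.json"

# Hypothetical entry: point LLaMA-Factory at the exported LEMMA JSON file.
# Adjust file_name and the column mapping to your actual paths and fields.
new_entry = {
    "lemma": {
        "file_name": "your_data_dir/dataset.json",
        "columns": {"prompt": "instruction", "response": "output"},
    }
}

with open(INFO_PATH, "r", encoding="utf-8") as f:
    info = json.load(f)
info.update(new_entry)
with open(INFO_PATH, "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2, ensure_ascii=False)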
bash scripts/train.sh
We use the evaluation toolkit from the Qwen2.5-Math repository. We provide a shell script to launch all evaluations after training the model.
# Specify the model path in scripts/eval.sh
bash scripts/eval.sh
The inference prompt is:
"### Instruction:\n{instruction}\n\n### Response: Let's think step by step."
For evaluation on mawps and deepmind_math, we use the data provided in the RefAug repository to ensure a fair comparison.
To collect your own LEMMA data, please refer to the following scripts:
# Error type and step analysis
bash scripts/error_type.sh
bash scripts/error_step.sh
# Error Augmentation
bash scripts/error_inject.sh
# Fresh & Restart Correction
bash scripts/error_connect.sh
# Fix & Continue Correction
bash scripts/error_correct.sh
# Smooth
bash scripts/smooth.sh
Thanks to the open-source code of LLaMA-Factory, math-evaluation-harness, and Qwen2.5-Math. Part of our code is based on these projects.
Please cite our paper if you use our model, code, or data.
@article{LEMMA,
  title={LEMMA: Learning from Errors for MatheMatical Advancement in LLMs},
  author={Zhuoshi Pan and Yu Li and Honglin Lin and Qizhi Pei and Zinan Tang and Wei Wu and Chenlin Ming and H. Vicky Zhao and Conghui He and Lijun Wu},
  journal={arXiv preprint arXiv:2503.17439},
  year={2025}
}