LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
📖 Paper • 🤗 Leaderboard
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR2Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR2Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. Our extensive evaluation on both conventional LLMs and LRMs reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR2Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
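For intuition, an exact match here requires the model's final solution to satisfy every constraint, i.e., to agree with the gold solution in full. Below is a minimal sketch of such a check for a grid-style CSP; the data layout shown is an illustrative assumption, not the benchmark's actual schema.

```python
# Minimal sketch of exact-match scoring for a grid-style CSP solution.
# The grid representation is an illustrative assumption, not the actual
# LR2Bench sample schema.

def exact_match(predicted_grid, gold_grid):
    """Return 1.0 only if every cell of the prediction equals the gold solution."""
    if len(predicted_grid) != len(gold_grid):
        return 0.0
    for pred_row, gold_row in zip(predicted_grid, gold_grid):
        if list(pred_row) != list(gold_row):
            return 0.0
    return 1.0

# Toy example: a single wrongly filled cell makes the whole sample score 0.
gold = [["1", "2"], ["2", "1"]]
pred = [["1", "2"], ["2", "2"]]
print(exact_match(pred, gold))  # 0.0
```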
You can edit the tasks and models for generation in `launch.sh`. This script includes both model generation and answer extraction.

```bash
bash launch.sh
```
Then run `merge.sh` to get the overall performance of your model in the `./submission` folder.

```bash
bash merge.sh
```
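The merged numbers are conceptually just per-task scores averaged into an overall result. Below is a rough sketch of that aggregation, assuming per-task JSON files under `./submission` that each expose an `exact_match` field; the actual file layout and field names in this repo may differ.

```python
# Hypothetical sketch of averaging per-task results into an overall score.
# The ./submission layout and the "exact_match" field are assumptions,
# not this repository's actual output format.
import glob
import json

scores = []
for path in sorted(glob.glob("./submission/*.json")):
    with open(path) as f:
        scores.append(json.load(f)["exact_match"])

if scores:
    print(f"Average Exact Match over {len(scores)} tasks: {sum(scores) / len(scores):.1%}")
```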
If you find this repo useful for your research, please consider citing the paper:
```bibtex
@article{chen2025lr,
  title={LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems},
  author={Chen, Jianghao and Wei, Zhenlin and Ren, Zhenjiang and Li, Ziyong and Zhang, Jiajun},
  journal={arXiv preprint arXiv:2502.17848},
  year={2025}
}
```