Welcome to the official repository for the paper "Solving Inequality Proofs with Large Language Models". With this toolkit you can evaluate your own models on our benchmark, plug in our improvement strategies to boost performance, reproduce the paper's experiments end-to-end, and train more effectively with our data.
Dive in and push the frontier of inequality and mathematics reasoning with us! 🚀
- 💥 News
- 📖 Introduction
- 📊 Dataset Examples
- 🏆 Leaderboard
- 📝 Evaluations on IneqMath
- 🎯 Strategies on IneqMath
- 🧪 Other experiments on IneqMath
- 🤖 Supported LLM Engines
- 📂 Data Curation
- 🤗 Dataset Overview
- 🧐 Fine-grained Informal Judges
- 📈 Evaluation Results
- 🔍 In-depth Study
- 📜 License
- 📚 Citation
- [2025.08.18] 🎯 We've just released our novel reformulation pipeline that transforms inequality proofs into informal yet verifiable subtasks. Plus, our brand-new training data enhancement scripts are here to supercharge your model training. Check out our Data Curation section and start building better data and models today! ✨
- [2025.08.17] 🚀 Supercharge your IneqMath runs with our new improvement strategies—Frequent Theorems as Hints, Frequent Training Problems & Solutions as Hints, and Few-shot Evaluation. Give them a try and let us know if there are other strategies you'd like to see!
- [2025.08.16] 🚀 Updated example scripts across all supported model families—explore them here!
- [2025.08.14] ✨ Released the dev set evaluation scripts and our Final Answer Judge—try them now! Dev set evaluation | Final Answer Judge.
- [2025.08.08] 💥 GPT-5 (medium, 30K) 🥇 Sets New SOTA on IneqMath with overall accuracy 47.0%! Read more on the OpenAI Platform.
- [2025.07.25] ✨ Released our dev set evaluation platform to help you fine-tune models more effectively! Get started: Dev set evaluation platform.
- [2025.07.19] 📄 Added support for vLLM. You can now use any vLLM-supported models and your local checkpoint models. Check out the example scripts for more details.
- [2025.07.18] 🎉 Our paper has been accepted to the 2nd AI for Math Workshop at ICML 2025—read the paper and view the poster!
- [2025.06.16] 💥 o3-pro (40K) 🥇 Sets New SOTA on IneqMath with overall accuracy 46.0%! Read more on the OpenAI Platform.
- [2025.06.11] Our work is featured by Pan Lu on Twitter!
- [2025.06.09] Our paper is now accessible at https://arxiv.org/abs/2506.07927.
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and applying theorems strategically. LLMs' progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction.
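For instance (an illustrative example, not drawn from the dataset), a bound estimation problem asks for the largest constant C such that a + b ≥ C·√(ab) holds for all positive reals a and b (here C = 2), while a relation prediction problem asks which relation (≤, ≥, =, and so on) always holds between (a + b)/2 and √(ab).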
Building on this, we propose IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and a training corpus enriched with stepwise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four stepwise judges designed to detect common reasoning flaws. This lets us assess LLMs' informal reasoning, where they generally perform well at producing final answers even though they consistently struggle to construct rigorous, step-by-step proofs.
A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under stepwise scrutiny; this is a drop of up to 65.5% from their accuracy when considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.
Below are training and testing examples from IneqMath. Each problem belongs to one of two automatically checkable subtasks: bound estimation or relation prediction. Each training problem includes stepwise solutions, with up to four solutions per problem, and 76.8% (962 problems) are annotated with relevant theorems. The test problems are each crafted and reviewed by IMO-level medalists to ensure both originality and difficulty.
Training examples of IneqMath:
Testing examples of IneqMath:
The leaderboard of chat and reasoning LLMs on the IneqMath benchmark (the test set) is shown below.
The interactive leaderboard for IneqMath is available here.
Rank | Model | Size | Type | Source | Date | Overall Acc ↓ | Answer Acc | Step Acc (NTC) | Step Acc (NLG) | Step Acc (NAE) | Step Acc (NCE) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 🥇 | GPT-5 (medium, 30K) | UNK | 🧠 | 🔒 | 2025-08-07 | 47.0% | 66.5% | 95.5% | 69.0% | 93.5% | 98.0% |
2 🥈 | o3-pro (medium, 40K) | UNK | 🧠 | 🔒 | 2025-06-10 | 46.0% | 68.5% | 95.5% | 73.5% | 86.0% | 94.5% |
3 🥉 | Gemini 2.5 Pro Preview (40K) | UNK | 🧠 | 🔒 | 2025-06-05 | 46.0% | 66.0% | 85.0% | 65.0% | 92.5% | 97.5% |
4 | o3-pro (medium, 10K) | UNK | 🧠 | 🔒 | 2025-06-10 | 45.5% | 68.0% | 98.5% | 67.0% | 87.0% | 97.5% |
5 | Gemini 2.5 Pro (30K) | UNK | 🧠 | 🔒 | 2025-03-25 | 43.5% | 68.0% | 87.5% | 63.0% | 91.0% | 98.0% |
6 | o3 (medium, 40K) | UNK | 🧠 | 🔒 | 2025-04-16 | 37.0% | 72.0% | 96.5% | 56.0% | 86.5% | 94.0% |
7 | GPT-5 mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-08-07 | 30.5% | 55.5% | 95.5% | 50.0% | 91.0% | 98.5% |
8 | GPT-5 (medium, 10K) | UNK | 🧠 | 🔒 | 2025-08-07 | 28.0% | 36.5% | 94.5% | 47.0% | 93.5% | 97.0% |
9 | Gemini 2.5 Flash Preview 05-20 (40K) | UNK | 🧠 | 🔒 | 2025-05-20 | 27.5% | 45.0% | 83.0% | 44.0% | 89.0% | 96.0% |
10 | gpt-oss-120b (10K) | 120B | 🧠 | 🌐 | 2025-08-05 | 23.5% | 66.5% | 95.5% | 40.0% | 90.5% | 93.5% |
11 | Gemini 2.5 Flash (40K) | UNK | 🧠 | 🔒 | 2025-04-17 | 23.5% | 44.5% | 81.0% | 36.5% | 93.5% | 97.5% |
12 | o3 (medium, 10K) | UNK | 🧠 | 🔒 | 2025-04-16 | 21.0% | 37.0% | 93.5% | 39.5% | 91.5% | 97.0% |
13 | DeepSeek-V3.1 (Thinking Mode) (30K) | 671B | 🧠 | 🌐 | 2025-08-21 | 15.5% | 73.0% | 52.0% | 31.0% | 64.5% | 94.5% |
14 | o4-mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-04-16 | 15.5% | 65.0% | 62.0% | 26.0% | 86.5% | 93.0% |
15 | Gemini 2.5 Flash Preview 05-20 (10K) | UNK | 🧠 | 🔒 | 2025-05-20 | 14.5% | 17.5% | 82.0% | 31.0% | 96.5% | 95.5% |
16 | DeepSeek-V3.1 (Thinking Mode) (10K) | 671B | 🧠 | 🌐 | 2025-08-21 | 12.0% | 32.5% | 76.5% | 27.0% | 88.0% | 97.5% |
17 | Gemini 2.5 Pro Preview (10K) | UNK | 🧠 | 🔒 | 2025-06-05 | 10.0% | 13.0% | 91.0% | 20.0% | 98.0% | 98.5% |
18 | DeepSeek-R1-0528 (40K) | 671B | 🧠 | 🌐 | 2025-05-28 | 9.5% | 73.0% | 49.0% | 52.0% | 20.0% | 92.5% |
19 | o3-mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-01-31 | 9.5% | 62.5% | 37.0% | 22.0% | 77.5% | 95.0% |
20 | Kimi K2 Instruct | 1000B | 📝 | 🌐 | 2025-07-12 | 9.0% | 62.5% | 33.5% | 25.5% | 52.5% | 93.0% |
21 | Grok 4 (40K) | UNK | 🧠 | 🔒 | 2025-07-09 | 8.0% | 76.0% | 57.0% | 14.0% | 93.0% | 96.5% |
22 | o1 (medium, 10K) | UNK | 🧠 | 🔒 | 2024-12-17 | 8.0% | 62.5% | 34.5% | 17.5% | 86.5% | 99.5% |
23 | o1 (medium, 40K) | UNK | 🧠 | 🔒 | 2024-12-17 | 7.5% | 68.0% | 28.5% | 19.0% | 83.5% | 95.5% |
24 | DeepSeek-V3-0324 | 671B | 📝 | 🌐 | 2025-03-25 | 7.0% | 54.5% | 17.5% | 15.0% | 63.0% | 94.5% |
25 | Grok 3 mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-02-19 | 6.0% | 71.5% | 24.0% | 19.5% | 53.5% | 91.0% |
26 | Qwen3-235B-A22B (10K) | 235B | 🧠 | 🌐 | 2025-04-28 | 6.0% | 41.0% | 35.0% | 36.0% | 31.0% | 92.5% |
27 | Gemini 2.5 Pro (10K) | UNK | 🧠 | 🔒 | 2025-03-25 | 6.0% | 7.0% | 88.5% | 19.0% | 100.0% | 99.5% |
28 | Claude Opus 4 (10K) | UNK | 🧠 | 🔒 | 2025-05-14 | 5.5% | 47.5% | 25.0% | 9.0% | 89.5% | 94.0% |
29 | Qwen3-4B | 4B | 📝 | 🌐 | 2025-04-27 | 5.5% | 44.0% | 21.0% | 21.0% | 32.0% | 85.5% |
30 | DeepSeek-R1 (10K) | 671B | 🧠 | 🌐 | 2025-01-19 | 5.0% | 49.5% | 57.0% | 17.5% | 81.0% | 95.0% |
31 | DeepSeek-R1 (Qwen-14B) (10K) | 14B | 🧠 | 🌐 | 2025-01-20 | 5.0% | 40.5% | 21.0% | 21.0% | 35.5% | 85.0% |
32 | DeepSeek-R1-0528 (10K) | 671B | 🧠 | 🌐 | 2025-05-28 | 4.5% | 12.0% | 34.0% | 32.5% | 29.0% | 90.5% |
33 | Gemini 2.5 Flash (10K) | UNK | 🧠 | 🔒 | 2025-04-17 | 4.5% | 5.5% | 88.0% | 13.5% | 100.0% | 100.0% |
34 | Grok 3 | UNK | 📝 | 🔒 | 2025-02-19 | 3.5% | 54.5% | 17.0% | 16.0% | 36.0% | 93.0% |
35 | DeepSeek-R1 (Llama-70B) (10K) | 70B | 🧠 | 🌐 | 2025-01-20 | 3.5% | 53.5% | 23.0% | 26.0% | 35.5% | 87.0% |
36 | Gemini 2.0 Flash | UNK | 📝 | 🔒 | 2025-02-05 | 3.0% | 49.0% | 15.5% | 13.5% | 55.5% | 94.5% |
37 | Claude Sonnet 4 (10K) | UNK | 🧠 | 🔒 | 2025-05-14 | 3.0% | 44.0% | 19.5% | 6.5% | 86.5% | 95.5% |
38 | GPT-4o | UNK | 📝 | 🔒 | 2024-08-06 | 3.0% | 37.5% | 32.0% | 3.5% | 92.5% | 94.0% |
39 | Qwen2.5-7B | 7B | 📝 | 🌐 | 2024-09-16 | 3.0% | 35.0% | 44.5% | 4.5% | 92.5% | 93.0% |
40 | Qwen2.5-72B | 72B | 📝 | 🌐 | 2024-09-16 | 2.5% | 42.0% | 54.5% | 5.0% | 91.0% | 95.0% |
41 | GPT-4.1 | UNK | 📝 | 🔒 | 2025-04-14 | 2.5% | 40.5% | 16.0% | 10.0% | 59.5% | 93.5% |
42 | Llama-4-Maverick | 128 x 17B | 📝 | 🌐 | 2025-04-05 | 2.5% | 40.5% | 42.5% | 4.0% | 89.0% | 95.0% |
43 | QwQ-32B (10K) | 32B | 🧠 | 🌐 | 2025-03-05 | 2.0% | 49.5% | 26.0% | 29.5% | 21.0% | 87.0% |
44 | QwQ-32B-preview (10K) | 32B | 🧠 | 🌐 | 2024-11-27 | 2.0% | 43.5% | 28.0% | 30.0% | 22.5% | 87.5% |
45 | Claude 3.7 Sonnet (10K) | UNK | 🧠 | 🔒 | 2025-02-19 | 2.0% | 42.0% | 49.0% | 4.0% | 93.5% | 93.0% |
46 | GPT-4o mini | UNK | 📝 | 🔒 | 2024-07-18 | 2.0% | 39.5% | 29.0% | 2.5% | 90.0% | 93.0% |
47 | Qwen2.5-Coder-32B | 32B | 📝 | 🌐 | 2024-11-10 | 1.5% | 40.5% | 36.0% | 3.0% | 90.5% | 88.5% |
48 | Llama-4-Scout | 16 x 17B | 📝 | 🌐 | 2025-04-05 | 1.5% | 33.5% | 30.5% | 3.5% | 93.0% | 92.5% |
49 | Gemini 2.0 Flash-Lite | UNK | 📝 | 🔒 | 2025-02-25 | 1.5% | 33.0% | 11.5% | 3.5% | 73.0% | 90.5% |
50 | Claude 3.7 Sonnet (8K) | UNK | 🧠 | 🔒 | 2025-02-19 | 1.0% | 41.5% | 49.0% | 2.5% | 93.5% | 92.0% |
51 | DeepSeek-R1 (Qwen-1.5B) (10K) | 1.5B | 🧠 | 🌐 | 2025-01-20 | 0.5% | 14.5% | 20.0% | 6.0% | 48.0% | 83.5% |
52 | Gemma-2-9B (6K) | 9B | 📝 | 🌐 | 2024-06-25 | 0.0% | 15.5% | 83.5% | 0.5% | 100.0% | 99.0% |
53 | Llama-3.1-8B | 8B | 📝 | 🌐 | 2024-07-18 | 0.0% | 14.5% | 90.5% | 0.0% | 99.0% | 92.0% |
54 | Llama-3.2-3B | 3B | 📝 | 🌐 | 2024-09-25 | 0.0% | 11.0% | 82.0% | 0.0% | 98.5% | 88.5% |
55 | Gemma-2B (6K) | 2B | 📝 | 🌐 | 2024-02-21 | 0.0% | 7.5% | 73.5% | 0.0% | 99.0% | 95.0% |
The values in parentheses next to a model's name indicate the reasoning effort and the max token limit, respectively; when no limit is listed, the max token limit defaults to 10K.
Icons Explanation:
- Type: 🧠 = Reasoning Model, 📝 = Chat Model, 🔧 = Tool-augmented Model
- Source: 🔒 = Proprietary Model, 🌐 = Open-source Model
Step Accuracy Abbreviations:
- NTC: No Toy Case - Step accuracy excluding the use of toy cases for general conclusions
- NLG: No Logical Gap - Step accuracy without logical reasoning gaps
- NAE: No Approximation Error - Step accuracy excluding approximation errors
- NCE: No Calculation Error - Step accuracy excluding all calculation errors
Set up conda environment:
conda create --name ineq python=3.10
conda activate ineq
If you fail to activate the environment, please try:
# Alternatively, use source activate
source activate ineq
Install dependencies and create a `.env` file:
pip install -r requirements.txt
touch .env
Set your API keys in the `.env` file. For example:
OPENAI_API_KEY=your-OpenAI-api-key-here
DEEPSEEK_API_KEY=your-DeepSeek-api-key-here
ANTHROPIC_API_KEY=your-Anthropic-api-key-here
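As a quick sanity check, you can confirm the keys are picked up from `.env` (a minimal sketch assuming the `python-dotenv` package; the provided scripts load these variables on their own):

```python
# Minimal sketch: verify the API keys in .env are readable.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the current directory

for key in ("OPENAI_API_KEY", "DEEPSEEK_API_KEY", "ANTHROPIC_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```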
This section evaluates models in a zero-shot setting, which is the standard configuration in most LLM benchmark studies and the setup used in our paper. For few-shot evaluation, please see the few-shot evaluation section for more details.
Change the directory to `scripts`:
cd scripts
Run `run_test_data_proprietary_all.sh`, `run_test_data_open_source_all.sh`, and `run_test_data_gemma.sh` to generate the outputs of all the models used in our paper's experiments on the test set.
./run_test_data_proprietary_all.sh
./run_test_data_open_source_all.sh
./run_test_data_gemma.sh
To run other supported models, see the Supported LLM Engines section for details and example scripts. For example, to run the OpenAI-family models, run the example script `run_test_data_openai.sh`:
./run_test_data_openai.sh
You can customize the max token limit by changing the `TOKENS` variable in the script files. For example, to set a 30,000 max token limit, set `TOKENS=30000` in the script before running.
If the dataset can't be loaded automatically, please download the JSON-formatted dataset manually by:
cd ../data/json
wget https://huggingface.co/datasets/AI4Math/IneqMath/resolve/main/json/test.json
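Once downloaded, you can inspect the file directly (a minimal sketch; it assumes the JSON file is a list of problem records, so check the printed keys rather than relying on specific field names):

```python
import json

# Load the manually downloaded test split
with open("test.json", "r", encoding="utf-8") as f:
    problems = json.load(f)

print(f"Loaded {len(problems)} test problems")
print(problems[0].keys())  # inspect the available fields of a single problem record
```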
Then, you can find the model's output in `results/models_results_test_data/MODEL-LABEL/scores.json`, where `MODEL-LABEL` is the label of the model, consisting of the model name and the max token budget. For example, the `MODEL-LABEL` for `gpt-4o-mini` with a 10K max token limit would be `gpt-4o-mini_tokens_10000`.
To evaluate your model's output and obtain the answer accuracy and step-wise accuracy reported in our paper, please submit your outputs to the leaderboard. See the leaderboard submission instructions for more details.
We also provide scripts to evaluate models on the dev set in zero-shot settings. In addition, we include our Final Answer Judge to compute answer accuracy, helping you tune your models more effectively.
To run the supported models, see the Supported LLM Engines section for details and dev example scripts. For example, to run the OpenAI-family models on the dev set, run the dev example script `run_dev_data_openai.sh` in the `scripts` directory:
./run_dev_data_openai.sh
You can customize the max token limit by changing the `TOKENS` variable in the script files. For example, to set a 30,000 max token limit, set `TOKENS=30000` in the script before running.
If the dataset can't be loaded automatically, please download the JSON-formatted dataset manually by:
cd ../data/json
wget https://huggingface.co/datasets/AI4Math/IneqMath/resolve/main/json/dev.json
You can then find the answer accuracy in `results/models_results_dev_data/MODEL-LABEL/scores.json` and the model's output in `results/models_results_dev_data/MODEL-LABEL/results.json`, where `MODEL-LABEL` is the label of the model, consisting of the model name and the max token budget. For example, the `MODEL-LABEL` for `gpt-4o-mini` with a 10K max token limit would be `gpt-4o-mini_tokens_10000`.
To further evaluate your model's output and obtain the step-wise accuracy reported in our paper, please submit your generated file `results/models_results_dev_data/MODEL-LABEL/results.json` to the Dev set evaluation platform. See the leaderboard submission instructions for more details.
To run the Final Answer Judge on its own for a generated results file, first change the directory to `models/utils`:
cd models/utils
Then, run:
python compute_score.py \
--direct_input \
--use_cache \
--results_file YOUR_RESULTS_FILE
Replace `YOUR_RESULTS_FILE` with the path to the `results.json` file for which you want to compute answer accuracy.
For example, to evaluate the results file `../../results/models_results_dev_data/gpt-4o-mini_tokens_10000/results.json`, run:
python compute_score.py \
--direct_input \
--use_cache \
--results_file ../../results/models_results_dev_data/gpt-4o-mini_tokens_10000/results.json
The score will be saved as `../../results/models_results_dev_data/gpt-4o-mini_tokens_10000/scores.json` in the same directory as `--results_file`.
🏆 The leaderboard for IneqMath is available here.
To submit results to the leaderboard and obtain the answer accuracy and step-wise accuracy reported in our paper, please follow these instructions:
- If you run the model with our scripts, you can upload the generated `results.json` of the model to the corresponding section (test/dev) of the leaderboard.
- If you run the model on your own, please check your data format before submission. The submitted data should be compiled in a single `JSON` file with at least the five keys listed below:
{
"data_id": [integer or string] The ID of the data of each split,
"problem": [string] The question text,
"type": [string] The type of question: 'relation' or 'bound',
"prompt": [string] The prompt used for the problem,
"response": [string] The response of the model
}
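Before submitting, you may want to verify that every record carries the required keys (a minimal sketch; it assumes your file is a JSON list of records, and the leaderboard performs its own validation as well):

```python
import json

REQUIRED_KEYS = {"data_id", "problem", "type", "prompt", "response"}

# Assumed: the submission file is a JSON list of records
with open("results.json", "r", encoding="utf-8") as f:
    records = json.load(f)

for i, record in enumerate(records):
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        print(f"Record {i} is missing keys: {sorted(missing)}")

print(f"Checked {len(records)} records")
```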
In this section, we provide some potential improvement strategies mentioned in our paper to help you improve the performance of your model on the IneqMath dataset.
As shown in our paper, providing the top-k most frequent theorems can significantly improve the performance of LLMs on the IneqMath dataset. Therefore, we provide scripts to evaluate LLMs with the top-k most frequent theorems as hints on both the test and dev sets.
Please change the directory to `strategies/frequent_theorem_as_hints` and run the following script to evaluate LLMs with the top-k most frequent theorems as hints on the test or dev set.
cd strategies/frequent_theorem_as_hints
./run_test_data_no_vllm.sh # or ./run_dev_data_no_vllm.sh # This is for non-vLLM models.
./run_test_data_vllm.sh # or ./run_dev_data_vllm.sh # This is for vLLM models.
If you want to run other models from the supported model families, you can substitute or add the model engine name in the `ENGINES` variable of the example scripts, and then run the script.
You can customize the max token limit and the number of most frequent theorems by changing the `TOKENS` and `THEOREM_NUM` variables in the script files. For example, to set a 30,000 max token limit and 3 most frequent theorems, set `TOKENS=30000` and `THEOREM_NUM=3` in the script before running.
Then, you can find the model's output in `results/frequent_theorems_as_hints_results_dev_data/MODEL-LABEL/results.json` or `results/frequent_theorems_as_hints_results_test_data/MODEL-LABEL/results.json`, where `MODEL-LABEL` is the label of the model, consisting of the model name, the max token budget, and the number of most frequent theorems. For example, the `MODEL-LABEL` for `gpt-4o-mini` with a 10K max token limit and 3 most frequent theorems would be `gpt-4o-mini_tokens_10000_theorem_num_3`.
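For intuition, the strategy essentially prepends the retrieved theorem statements to the problem before querying the model. The sketch below is illustrative only; the actual prompt template lives in the scripts in this directory, and the field names `theorem_name` and `theorem_statement` are assumptions:

```python
def build_theorem_hint_prompt(problem: str, theorems: list[dict]) -> str:
    """Illustrative only: prepend top-k frequent theorems as hints to a problem."""
    hint_lines = [
        f"Theorem {i + 1} ({t['theorem_name']}): {t['theorem_statement']}"  # assumed keys
        for i, t in enumerate(theorems)
    ]
    return (
        "You may find the following theorems helpful:\n"
        + "\n".join(hint_lines)
        + f"\n\nProblem: {problem}"
    )
```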
Results examples and evaluation instructions
We have provided the results generated by `gpt-4o-mini` with the 3 most frequent theorems as hints and a 10,000 max token limit here. We also visualize the query and response pairs in markdown format here for each problem. (For example, the visualization of problem 1 is here.)
To evaluate your model's output and obtain the answer accuracy and step-wise accuracy reported in our paper, please submit your outputs to the leaderboard. See the leaderboard submission instructions for more details.
As shown in our paper, providing the top-k most frequent training problems and their solutions as hints can also improve the performance of LLMs on the IneqMath dataset. Therefore, we provide scripts to evaluate LLMs with the top-k most frequent training problems and solutions as hints on both the test and dev sets.
Please change the directory to `strategies/frequent_solution_as_hints` and run the following script to evaluate LLMs with the top-k most frequent training problems and solutions as hints on the test or dev set.
cd strategies/frequent_solution_as_hints
./run_test_data_no_vllm.sh # or ./run_dev_data_no_vllm.sh # This is for non-vLLM models.
./run_test_data_vllm.sh # or ./run_dev_data_vllm.sh # This is for vLLM models.
If you want to run other models from the supported model families, you can substitute or add the model engine name in the `ENGINES` variable of the example scripts, and then run the script.
You can customize the max token limit and the number of most frequent training problems and solutions by changing the `TOKENS` and `SOLUTION_NUM` variables in the script files. For example, to set a 30,000 max token limit and 3 most frequent training problems and solutions, set `TOKENS=30000` and `SOLUTION_NUM=3` in the script before running.
Then, you can find the model's output in `results/frequent_solution_as_hints_results_dev_data/MODEL-LABEL/results.json` or `results/frequent_solution_as_hints_results_test_data/MODEL-LABEL/results.json`, where `MODEL-LABEL` is the label of the model, consisting of the model name, the max token budget, and the number of most frequent training problems and solutions. For example, the `MODEL-LABEL` for `gpt-4o-mini` with a 10K max token limit and 3 most frequent training problems and solutions would be `gpt-4o-mini_tokens_10000_solution_num_3`.
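If you script over several models or hint counts, the expected output path follows directly from the naming scheme above (the directory layout is as described; the helper itself is just an illustrative convenience):

```python
from pathlib import Path

def solution_hint_results_path(engine: str, tokens: int, solution_num: int,
                               split: str = "test") -> Path:
    """Build the expected results.json path for the solutions-as-hints strategy."""
    model_label = f"{engine}_tokens_{tokens}_solution_num_{solution_num}"
    return Path("results") / f"frequent_solution_as_hints_results_{split}_data" / model_label / "results.json"

print(solution_hint_results_path("gpt-4o-mini", 10000, 3))
# results/frequent_solution_as_hints_results_test_data/gpt-4o-mini_tokens_10000_solution_num_3/results.json
```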
Results examples and evaluation instructions
We have provided the results generated by `gpt-4o-mini` with the 3 most frequent training problems and solutions as hints and a 10,000 max token limit here. We also visualize the query and response pairs in markdown format here for each problem. (For example, the visualization of problem 1 is here.)
To evaluate your model's output and obtain the answer accuracy and step-wise accuracy reported in our paper, please submit your outputs to the leaderboard. See the leaderboard submission instructions for more details.
To further evaluate models on the IneqMath dataset, we provide scripts to evaluate LLMs in few-shot settings on both the test and dev sets.
Please change the directory to `strategies/few_shots` and run the following script to evaluate LLMs in few-shot settings on the test or dev set.
cd strategies/few_shots
./run_test_data_no_vllm.sh # or ./run_dev_data_no_vllm.sh # This is for non-vLLM models.
./run_test_data_vllm.sh # or ./run_dev_data_vllm.sh # This is for vLLM models.
If you want to run other models from the supported model families, you can substitute or add the model engine name in the `ENGINES` variable of the example scripts, and then run the script.
You can customize the few-shot examples in `data/json/few_shot_examples.json`.
You can also customize the max token limit and the number of few-shot examples by changing the `TOKENS` and `SHOT_NUM` variables in the script files. For example, to set a 30,000 max token limit and 3-shot examples, set `TOKENS=30000` and `SHOT_NUM=3` in the script before running.
Then, you can find the model's output in `results/few_shot_results_dev_data/MODEL-LABEL/results.json` or `results/few_shot_results_test_data/MODEL-LABEL/results.json`, where `MODEL-LABEL` is the label of the model, consisting of the model name, the max token budget, and the number of few-shot examples. For example, the `MODEL-LABEL` for `gpt-4o-mini` with a 10K max token limit and 3-shot examples would be `gpt-4o-mini_tokens_10000_shot_num_3`.
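Conceptually, few-shot evaluation prepends `SHOT_NUM` worked examples to each query. The sketch below only illustrates the idea; the real prompt template lives in the scripts, and the keys `problem` and `solution` are assumptions about the structure of `few_shot_examples.json`:

```python
import json

def build_few_shot_prompt(problem: str, shot_num: int = 3) -> str:
    """Illustrative only: prepend worked examples before the target problem."""
    with open("data/json/few_shot_examples.json", "r", encoding="utf-8") as f:
        examples = json.load(f)  # assumed: a list of dicts with 'problem' and 'solution'

    demos = [
        f"Example {i + 1}:\nProblem: {ex['problem']}\nSolution: {ex['solution']}"
        for i, ex in enumerate(examples[:shot_num])
    ]
    return "\n\n".join(demos) + f"\n\nNow solve the following problem:\n{problem}"
```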
Results examples and evaluation instructions
We have provided the results generated by `gpt-4o-mini` with 3-shot examples and a 10,000 max token limit here. We also visualize the query and response pairs in markdown format here for each problem. (For example, the visualization of problem 1 is here.)
To evaluate your model's output and obtain the answer accuracy and step-wise accuracy reported in our paper, please submit your outputs to the leaderboard. See the leaderboard submission instructions for more details.
We demonstrated the experiment of taking the annotated theorems as hints on the training set in our paper. To run the experiment, please change the directory to `experiments/training_theorem_as_hints` and run the following script:
cd experiments/training_theorem_as_hints
./run_train_sampled_data_10k.sh # for Grok 3 mini and o3-mini
./run_train_sampled_data_30k.sh # for Gemini 2.5 Pro and o4-mini
Then, you can find the model's answer accuracy in `results/training_theorems_as_hints_results_train_data/MODEL-LABEL/scores.json`, where `MODEL-LABEL` is the label of the model, consisting of the model name and the max token budget. For example, the `MODEL-LABEL` for `gpt-4o-mini` with a 10K max token limit would be `gpt-4o-mini_tokens_10000`.
We support a broad range of LLM engines, including OpenAI family, Anthropic family, Gemini family, DeepSeek family, Grok family, vLLM family, and more.
Model Family | Engines | Official Model List | Example Scripts |
---|---|---|---|
OpenAI | `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `o4-mini`, `o3`, `o3-mini`, `o3-pro`, `o1`, `o1-pro`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, etc. | OpenAI Models | Test / Dev |
Anthropic | `claude-opus-4-20250514`, `claude-sonnet-4-20250514`, `claude-3-7-sonnet-20250219`, `claude-3-5-sonnet-20240620`, `claude-3-5-sonnet-20241022`, `claude-3-5-haiku-20241022`, `claude-3-opus-20240229`, `claude-3-haiku-20240307` | Anthropic Models | Test / Dev |
Gemini | `gemini-2.5-pro`, `gemini-2.5-pro-preview-06-05`, `gemini-2.5-pro-preview-05-06`, `gemini-2.5-flash`, `gemini-2.5-flash-preview-05-20`, `gemini-2.5-flash-lite-preview-06-17`, etc. | Gemini Models | Test / Dev |
DeepSeek | `deepseek-chat`, `deepseek-reasoner` | DeepSeek Models | Test / Dev |
Grok | `grok-4-0709`, `grok-3`, `grok-3-mini`, etc. | Grok Models | Test / Dev |
TogetherAI | `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `Qwen/QwQ-32B`, `Qwen/Qwen2-VL-72B-Instruct`, etc. | TogetherAI Models | Test / Dev |
vLLM | Various vLLM-supported models, for example, `Qwen3-4B` and `QwQ-32B`. You can also use local checkpoint models for customization and local inference. | vLLM Models | Test / Dev |
Note: If you want to run other models from the supported model families, you can substitute or add the model engine name in the `ENGINES` variable of the example scripts, and then run the script. If you are using TogetherAI models, please ensure the model string has the prefix `together-`, for example, `together-meta-llama/Llama-4-Scout-17B-16E-Instruct` for the model `meta-llama/Llama-4-Scout-17B-16E-Instruct`. For other custom engines, you can edit the factory.py file and add its interface file to add support for your engine. Your pull request will be warmly welcomed!
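If you do add a custom engine, the interface file typically just needs to wrap your API or local model behind a single generate-style call. The sketch below is a rough illustration only; the actual base class and method names expected by factory.py are defined in the repository and will differ from the hypothetical ones used here:

```python
# Illustrative sketch of a custom engine wrapper (hypothetical interface).
class MyCustomEngine:
    def __init__(self, model_name: str, max_tokens: int = 10000):
        self.model_name = model_name
        self.max_tokens = max_tokens
        # e.g., create your API client or load a local checkpoint here

    def generate(self, prompt: str) -> str:
        """Return the model's full text response for a single prompt."""
        # Call your model or API here and return the decoded text.
        raise NotImplementedError
```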
In our paper, we propose a novel informal reformulation of inequality proving tasks: with this method, every inequality proving task can be converted into two informal yet verifiable subtasks, bound estimation and relation prediction.
We now release our automatic reformulation pipeline, which can efficiently generate additional training data for IneqMath from existing inequality proving datasets. Here is a step-by-step guide to running the pipeline:
Step 1: Prepare the inequality proving dataset. The dataset should be a `*.json` file containing a list of problems. Each problem should be a dictionary with the following keys:
{
"problem": [string] The inequality proving problem,
"solution": [string] The solution or proof of the problem
}
Please see the example file which is sampled from AI-MO/NuminaMath-1.5 for reference.
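If your source dataset uses different field names, a small conversion script can produce a conforming input file (a minimal sketch; `raw_dataset.json` and its `question`/`proof` keys are placeholders for whatever your source data actually uses):

```python
import json

# Hypothetical source file with its own field names
with open("raw_dataset.json", "r", encoding="utf-8") as f:
    raw = json.load(f)

# Map each source record onto the two required keys
converted = [{"problem": item["question"], "solution": item["proof"]} for item in raw]

with open("my_inequality_dataset.json", "w", encoding="utf-8") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```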
Step 2: Please change the directory to `data_curation/data_reformulation`:
cd data_curation/data_reformulation
Open the `data_reformulation.sh` file, change the `INPUT_FILE` variable to the path of the inequality proving dataset you prepared, and set the `LABEL` variable to your customized label for the dataset. For example, if you want to generate reformulated data from the example file data/numina_sampled_example.json, set the `INPUT_FILE` variable to `./data/numina_sampled_example.json` and the `LABEL` variable to `numina_sampled_example`.
Step 3: Run the following script to generate the reformulated data:
./task_reformulation.sh
Then, you can find the reformulated data in the `reformulated_data/LABEL/reformulated_data.json` file, where `LABEL` is the label you set in the `data_reformulation.sh` file.
We are excited to see more data generated from this pipeline as an extension of the IneqMath dataset! Please feel free to share your generated data with us!
Based on our experiments, enhancing the training data's solutions with detailed reasoning steps helps train models more effectively and efficiently. Therefore, we provide scripts to enhance the training data with detailed reasoning steps. Note that this is only one possible way of enhancement - we are excited to see more data enhancement approaches!
Step 1: Please download the IneqMath training data first:
cd data/json
wget https://huggingface.co/datasets/AI4Math/IneqMath/resolve/main/json/train.json
Step 2: Please change the directory to `data_curation/training_data_enhancement`:
cd ../../data_curation/training_data_enhancement
Step 3: Run the following script to enhance the training data with detailed reasoning steps:
./sft_data_generation.sh
Then, you can find the enhanced training data in the `enhanced_training_data/training_data/` directory. There are two files in the directory:
- `enhanced_data_training_data.json`: the enhanced training data in `JSON` format, which maintains the original structure of the training data and adds the enhanced solution as a new field `enhanced_solution`. Please see the example.
- `sft_data_training_data.jsonl`: the enhanced training data in `JSONL` format, ready for supervised fine-tuning on platforms such as OpenAI. Please see the example.
We provide an example of 200 enhanced training problems sampled from the IneqMath training data (`data/json/training_data_sampled_200.json`) here. We also visualize the enhanced solution and the original solution for each problem in this directory. (For example, the visualization of problem 1 is here.)
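To get a quick feel for what the enhancement adds, you can inspect a record from the enhanced JSON (a minimal sketch; only the `enhanced_solution` field is guaranteed by the description above, so print the keys to see which original fields are kept):

```python
import json

with open("enhanced_training_data/training_data/enhanced_data_training_data.json",
          "r", encoding="utf-8") as f:
    data = json.load(f)

sample = data[0]
print(sample.keys())                      # the original training-data fields plus 'enhanced_solution'
print(sample["enhanced_solution"][:300])  # preview of the newly added enhanced solution
```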
The IneqMath dataset comprises 200 test problems for benchmarking, 100 development problems with public ground truth, and 1,252 training problems split evenly between bound estimation and relation prediction tasks as shown in the table below. The dataset also features 83 named theorems across 29 categories, with their distribution illustrated in the figure below.
Statistic | Number | Bnd. | Rel. |
---|---|---|---|
Theorem categories | 29 | – | – |
Named theorems | 83 | – | – |
Training problems (for training) | 1252 | 626 | 626 |
- With theorem annotations | 962 | 482 | 480 |
- With solution annotations | 1252 | 626 | 626 |
- Avg. solutions per problem | 1.05 | 1.06 | 1.05 |
- Max solutions per problem | 4 | 4 | 4 |
Dev problems (for development) | 100 | 50 | 50 |
Test problems (for benchmarking) | 200 | 96 | 104 |
The table below compares datasets for inequalities and theorem proving. IneqMath provides expert-annotated training and test/dev sets, featuring high-quality named theorems and stepwise solutions for model development. Unlike prior datasets that use synthesis or autoformalization, IneqMath presents problems in informal language across both multiple-choice (MC) and open-ended (Open) formats, and employs LLM-as-judge for evaluation.
Traditional evaluation methods fall short in this setting: expert annotation is accurate but prohibitively labor-intensive, while automated techniques such as string matching or value equivalence fail to capture step-by-step correctness—an essential aspect of inequality problem solving. To evaluate the correctness of IneqMath solutions, we propose a fine-grained LLM-as-judge framework, consisting of a final-answer judge for verifying the predicted answer and four specialized step-wise judges targeting common reasoning flaws. A solution is deemed correct overall only if it passes all five judges. As shown in the following table and confusion matrix, these judges achieve strong alignment with human annotations (F1 = 0.93), providing a scalable yet reliable alternative to manual evaluation.
This table shows final-answer accuracy versus overall accuracy for leading LLMs across different categories on the IneqMath benchmark of Olympiad-level inequality problems. Overall accuracy, which measures both answer correctness and step soundness, is substantially lower than final-answer accuracy for all model types. This highlights a critical gap: while LLMs may find correct final answers to these inequality problems, their reasoning is often unsound. Each model used its optimal maximum token setting.
The following two figures show how final-answer accuracy (which evaluates only the correctness of the final predicted answer) and overall accuracy (which requires both a correct answer and valid intermediate reasoning steps) scale with model size for LLMs.
The figure below shows how final-answer accuracy (which evaluates only the correctness of the final predicted answer) scales with model size for LLMs. As model size increases, we observe a steady improvement in answer accuracy, reflecting an empirical scaling law that larger models are better at inferring correct bounds and inequality relationships.
However, the trend for answer accuracy does not hold well when considering overall accuracy—which requires both a correct answer and valid intermediate reasoning steps—as shown in the figure below. In this case, the scaling curve flattens, indicating that increased model size alone is insufficient to eliminate step-by-step reasoning errors.
As shown in the figure, providing one or two relevant theorems as hints decreases overall accuracy for weaker models (e.g., Grok 3 mini, o3-mini, o4-mini), likely due to misapplication or distraction by potentially irrelevant information. Conversely, stronger models like Gemini 2.5 Pro benefit from these hints, suggesting that advanced reasoning ability is crucial for using such guidance effectively. These results underscore the potential of theorem-guided reasoning but also highlight the critical need for more sophisticated theorem retrieval mechanisms (e.g., RAG) to reliably enhance LLM performance in inequality proving.
As the figure shows, self-critique consistently improves performance—e.g., Gemini 2.5 Pro's overall accuracy rises from 43% to 48%. This upward trend underscores self-critique as a promising, supervision-free method to enhance logical rigor and solution quality of LLMs in inequality reasoning.
The new contributions to our dataset are distributed under the CC BY-SA 4.0 license.
The copyright of the images and the questions belongs to the original authors. Alongside this license, the following conditions apply:
- Purpose: The test split was primarily designed for use as a test set.
- Commercial Use: The test split can be used commercially as a test set, but using it as a training set is prohibited. By accessing or using this dataset, you acknowledge and agree to abide by these terms in conjunction with the CC BY-SA 4.0 license.
If you use the IneqMath dataset in your work, please kindly cite the paper using this BibTeX:
@article{sheng2025solving,
title={Solving Inequality Proofs with Large Language Models},
author={Sheng, Jiayi and Lyu, Luna and Jin, Jikai and Xia, Tony and Gu, Alex and Zou, James and Lu, Pan},
journal={arXiv preprint arXiv:2506.07927},
year={2025}
}