[📃Paper] [🌐Project Page] [🤗Hugging Face] [🛠️Evaluation]
- [2025.9.16] We have released the v2 dataset (annotated mainly by GPT-4o) in ChrisDing1105/MMIF-23k. Feel free to use it!
- [2025.4.26] We have included both the SFT and DPO data in ChrisDing1105/MMIF-23k as part of version 1.0 of the dataset. Feel free to download it! We are also planning to release version 1.1 soon, scheduled for May! 🎉🎉🎉
- [2025.4.24] MM-IFEval has been merged into VLMEvalKit. You can now evaluate your model on MM-IFEval directly with it! For usage, see Evaluation using VLMEvalKit or the official VLMEvalKit repo! 🎉🎉🎉
- [2025.4.11] Our MM-IFEngine paper is released! Check it out at 📃arXiv: MM-IFEngine! Our dataset will be open-sourced soon! 🎉🎉🎉
- An MM-IFEngine pipeline for generating multimodal constraint-rich image-instruction pairs;
- A large-scale training dataset MM-IFInstruct-23k and preference optimization dataset MM-IFDPO-23k derived from MM-IFEngine;
- A challenging multimodal instruction following benchmark MM-IFEval with diverse constraints and comprehensive evaluation approaches;
- Empirical evidence showing significant performance gains on both our MM-IFEval and existing benchmarks when training MLLMs on MM-IFInstruct-23k via SFT and MM-IFDPO-23k via DPO.
Performance of existing MLLMs on MM-IFEval. We report the accuracy of easy and difficult problems and the average accuracy across all problems. The C-Level and P-Level refer to the compose-level and perception-level problems, respectively. The best performance in each section is highlighted in bold.
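As a minimal sketch of how these metrics fit together (the `level` and `passed` field names here are illustrative assumptions, not the benchmark's actual result schema):

```python
from collections import defaultdict

# Hypothetical per-problem judge results: "level" is "C-Level" (compose-level)
# or "P-Level" (perception-level); "passed" marks whether the response
# satisfied all constraints of the instruction.
results = [
    {"level": "C-Level", "passed": True},
    {"level": "C-Level", "passed": False},
    {"level": "P-Level", "passed": True},
    {"level": "P-Level", "passed": True},
]

def accuracy(items):
    """Fraction of problems judged as passing."""
    return sum(r["passed"] for r in items) / len(items)

# Group problems by level, then report per-level and overall accuracy.
by_level = defaultdict(list)
for r in results:
    by_level[r["level"]].append(r)

c_acc = accuracy(by_level["C-Level"])  # compose-level accuracy
p_acc = accuracy(by_level["P-Level"])  # perception-level accuracy
avg = accuracy(results)                # average across all problems

print(f"C-Level: {c_acc:.2%}  P-Level: {p_acc:.2%}  Avg: {avg:.2%}")
```

The average reported in the table is taken over all problems, not a mean of the two per-level scores, so the levels contribute in proportion to their problem counts.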
Option 1 (Recommended): Evaluation using VLMEvalKit
# Note: the default snapshot of the judge model (gpt-4o) in VLMEvalKit is currently gpt-4o-2024-05-13.
# When running with `python`, only one VLM instance is instantiated.
# API MODEL
python run.py --data MM-IFEval --model GPT4o_MINI --reuse --verbose --api-nproc 8
# HF MODEL
python run.py --data MM-IFEval --model Qwen2.5-VL-7B-Instruct --reuse --verbose --api-nproc 8
# When running with `torchrun`, one VLM instance is instantiated on each GPU, which speeds up inference.
# HF MODEL
torchrun --nproc-per-node=2 run.py --data MM-IFEval --model Qwen2.5-VL-7B-Instruct --reuse --verbose --api-nproc 8
# Set custom judge model and work-dir
torchrun --nproc-per-node=2 run.py --data MM-IFEval --model Qwen2-VL-7B-Instruct --judge gpt-4.1 --reuse --verbose --api-nproc 8 --work-dir ./outputs_gpt_4_1
See `requirements.txt`.
# Step 1: fill in the config below in eval_MM-IFEval/sh_scripts/multi_run_inf_and_score.sh
# <---- param settings ---->
PROJECT_DIR=
CONDA_ACTIVATE_PATH=
export HF_HOME=
model_bench_pairs=(
"Qwen2-VL-7B-Instruct C-Level 8 qwen_vl HF"
"Qwen2-VL-7B-Instruct P-Level 8 qwen_vl HF"
)
# <---- param settings ---->
# Step 2: run the script
zsh eval_MM-IFEval/sh_scripts/multi_run_inf_and_score.sh
@article{ding2025mm,
title={MM-IFEngine: Towards Multimodal Instruction Following},
author={Ding, Shengyuan and Wu, Shenxi and Zhao, Xiangyu and Zang, Yuhang and Duan, Haodong and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Lin, Dahua and Wang, Jiaqi},
journal={arXiv preprint arXiv:2504.07957},
year={2025}
}