Long Xing* · Qidong Huang* · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Jinsong Li · Shuangrui Ding · Weiming Zhang · Nenghai Yu · Jiaqi Wang · Feng Wu · Dahua Lin
📖Paper | 🤗Datasets | 🤗Daily Paper
🌈We introduce ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. With ScaleCap, we construct a dataset of 450k image-caption pairs for the open-source community. Our key observations highlight two inherent biases in LVLMs: multimodal bias, which results in imbalanced descriptive granularity, and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose two novel components: heuristic question answering and contrastive sentence rating. Extensive experiments demonstrate the effectiveness of ScaleCap.
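As a rough illustration of how these pieces fit together, here is a minimal, hypothetical Python sketch of the pipeline. The callables (`describe`, `propose_questions`, `answer`, `merge`, `score`) are placeholders standing in for LVLM/LLM calls, not the actual interfaces of this repository:

```python
from typing import Callable


def scalecap_caption(
    image,
    describe: Callable,           # LVLM: image -> draft caption
    propose_questions: Callable,  # LLM: caption -> list of follow-up questions
    answer: Callable,             # LVLM: (image, question) -> answer
    merge: Callable,              # LLM: (draft caption, answers) -> enriched caption
    score: Callable,              # LVLM: (sentence, image or None) -> likelihood score
) -> str:
    # 1) Draft: the LVLM writes an initial caption.
    caption = describe(image)

    # 2) Heuristic question answering: probe content the draft under-describes
    #    (counters multimodal bias), then fold the answers back into the caption.
    answers = [answer(image, q) for q in propose_questions(caption)]
    caption = merge(caption, answers)

    # 3) Contrastive sentence rating: keep only sentences that score higher with
    #    the image than without it (counters linguistic bias / hallucination).
    sentences = [s.strip() + "." for s in caption.split(".") if s.strip()]
    kept = [s for s in sentences if score(s, image) > score(s, None)]
    return " ".join(kept)
```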
- 🚀 [06/25/2025] We release the ScaleCap repository, training code, and dataset.
- 🔥 A plug-and-play pipeline improving caption quality: ScaleCap can be used simply by calling either open-source or closed-source model APIs, making it extremely convenient to use.
- 🔥 450k Image-Caption Dataset: With ScaleCap, we construct a dataset containing 450k image-caption pairs for use by the open-source community.
- 🔥 Extensive Experiments: We conduct extensive experiments on various tasks to demonstrate the effectiveness of ScaleCap.
- 🔥 Open Source: We fully open-source the training code, training data, and evaluation scripts on GitHub to facilitate further research.
- Support more VQA benchmarks and complete evaluation.
```bash
git clone https://github.com/Cooperx521/ScaleCap.git
conda create -n ScaleCap python=3.10
conda activate ScaleCap
bash setup.sh
```
To quickly get started with generating captions using ScaleCap, we provide an example script. Simply run the following command:
```bash
bash scripts/launch_example.sh
```
- In our setup, Qwen and its VL-series models are deployed with the vLLM framework.
- We have verified that this setup works on NVIDIA A100 GPUs.
- If your GPU has limited memory, we recommend doubling the number of devices specified by `CUDA_VISIBLE_DEVICES` to avoid out-of-memory issues. You may need to adjust the corresponding lines in the script to better suit your hardware configuration.
Our ScaleCap450k dataset is available on: 🔗 Hugging Face
This dataset contains 450,000 images along with their corresponding captions generated by ScaleCap.
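Assuming the release follows the standard Hugging Face datasets layout, loading it should look roughly like the sketch below; the repository ID is a placeholder, so substitute the one behind the link above:

```python
from datasets import load_dataset

# Placeholder repository ID; replace it with the actual ScaleCap450k dataset ID
# from the Hugging Face link above.
ds = load_dataset("your-org/ScaleCap450k", split="train")
print(len(ds))   # ~450k image-caption pairs
print(ds[0])     # one sample: an image (or image reference) and its caption
```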
To reproduce the pretraining experiments presented in our paper:
- Initialize Qwen2.5-VL. Follow the steps in the notebook `initiallize_vlm_3b.ipynb` to set up the Qwen2.5-VL model for training.
- Training. You can then use LLaMAFactory directly to run the training process.
We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:
- First, a model generates a caption for the image.
- Then, a language model answers questions based solely on the generated caption.
This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
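For intuition, the protocol can be sketched as follows; the callables are assumed stand-ins for the captioner, the text-only LLM, and the answer matcher, not this repository's actual evaluation code:

```python
from typing import Callable, Dict, List


def decoupled_vqa_accuracy(
    samples: List[Dict],      # each item: {"image": ..., "question": str, "answer": str}
    caption_model: Callable,  # image -> caption (the captioner being evaluated)
    llm_answer: Callable,     # (caption, question) -> predicted answer; no image access
    is_correct: Callable,     # (prediction, ground-truth answer) -> bool
) -> float:
    # Stage 1: caption the image. Stage 2: answer from the caption alone.
    # The resulting accuracy reflects how much visual information the caption carries.
    hits = 0
    for ex in samples:
        caption = caption_model(ex["image"])
        prediction = llm_answer(caption, ex["question"])
        hits += int(is_correct(prediction, ex["answer"]))
    return hits / max(len(samples), 1)
```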
The inference pipeline can be found in the `prism_benchmark` directory.
The scripts used to compute evaluation metrics are located in the `eval` directory.
Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). They should also abide by the policies of OpenAI: https://openai.com/policies/terms-of-use
- Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!