Long Xing* · Qidong Huang* · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Jinsong Li · Shuangrui Ding · Weiming Zhang · Nenghai Yu · Jiaqi Wang · Feng Wu · Dahua Lin
📖Paper | 🤗Datasets | 🤗Daily Paper
🌈We introduce ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. With ScaleCap, we construct a dataset of 450k image-caption pairs for the open-source community. Our key observations highlight two inherent biases in LVLMs: multimodal bias, which results in imbalanced descriptive granularity, and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose two novel components: heuristic question answering and contrastive sentence rating. Extensive experiments demonstrate the effectiveness of ScaleCap.
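As a rough illustration of how these pieces fit together, here is a minimal, hypothetical Python sketch of the pipeline. The callables (`describe`, `propose_questions`, `answer`, `merge`, `score`) are placeholders standing in for LVLM/LLM calls, not the actual interfaces of this repository:

```python
from typing import Callable


def scalecap_caption(
    image,
    describe: Callable,           # LVLM: image -> draft caption
    propose_questions: Callable,  # LLM: caption -> list of follow-up questions
    answer: Callable,             # LVLM: (image, question) -> answer
    merge: Callable,              # LLM: (draft caption, answers) -> enriched caption
    score: Callable,              # LVLM: (sentence, image or None) -> likelihood score
) -> str:
    # 1) Draft: the LVLM writes an initial caption.
    caption = describe(image)

    # 2) Heuristic question answering: probe content the draft under-describes
    #    (counters multimodal bias), then fold the answers back into the caption.
    answers = [answer(image, q) for q in propose_questions(caption)]
    caption = merge(caption, answers)

    # 3) Contrastive sentence rating: keep only sentences that score higher with
    #    the image than without it (counters linguistic bias / hallucination).
    sentences = [s.strip() + "." for s in caption.split(".") if s.strip()]
    kept = [s for s in sentences if score(s, image) > score(s, None)]
    return " ".join(kept)
```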
- 🚀 [06/25/2025] We release the ScaleCap repository, training code, and dataset.
- 🔥 A plug-and-play pipeline improving caption quality: ScaleCap can be used simply by calling either open-source or closed-source model APIs, making it extremely convenient to use.
- 🔥 450k Image-Caption Dataset: With ScaleCap, we construct a dataset containing 450k image-caption pairs for use by the open-source community.
- 🔥 Extensive Experiments: We conduct extensive experiments on various tasks to demonstrate the effectiveness of ScaleCap.
- 🔥 Open Source: We fully open-source the training code, training data, and evaluation scripts on GitHub to facilitate further research.
- Support more VQA benchmarks and complete evaluation.
```bash
git clone https://github.com/Cooperx521/ScaleCap.git
conda create -n ScaleCap python=3.10
conda activate ScaleCap
bash setup.sh
```
To quickly get started with generating captions using ScaleCap, we provide an example script. Simply run the following command:
```bash
bash scripts/launch_example.sh
```
- In our setup, Qwen and its VL-series models are deployed with the vLLM framework.
- We have verified that this setup works on NVIDIA A100 GPUs.
- If your GPU has limited memory, we recommend doubling the number of devices specified by `CUDA_VISIBLE_DEVICES` to avoid out-of-memory issues. You may need to adjust the corresponding lines in the script to better suit your hardware configuration.
Our ScaleCap450k dataset is available on: 🔗 Hugging Face
This dataset contains 450,000 images along with their corresponding captions generated by ScaleCap.
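Assuming the release follows the standard Hugging Face datasets layout, loading it should look roughly like the sketch below; the repository ID is a placeholder, so substitute the one behind the link above:

```python
from datasets import load_dataset

# Placeholder repository ID; replace it with the actual ScaleCap450k dataset ID
# from the Hugging Face link above.
ds = load_dataset("your-org/ScaleCap450k", split="train")
print(len(ds))   # ~450k image-caption pairs
print(ds[0])     # one sample: an image (or image reference) and its caption
```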
To reproduce the pretraining experiments presented in our paper:
- Initialize Qwen2.5-VL. Follow the steps in the notebook `initiallize_vlm_3b.ipynb` to set up the Qwen2.5-VL model for training.
- Training. You can then use LLaMAFactory directly to run the training process.
We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:
- First, a model generates a caption for the image.
- Then, a language model answers questions based solely on the generated caption.
This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
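For intuition, the protocol can be sketched as follows; the callables are assumed stand-ins for the captioner, the text-only LLM, and the answer matcher, not this repository's actual evaluation code:

```python
from typing import Callable, Dict, List


def decoupled_vqa_accuracy(
    samples: List[Dict],      # each item: {"image": ..., "question": str, "answer": str}
    caption_model: Callable,  # image -> caption (the captioner being evaluated)
    llm_answer: Callable,     # (caption, question) -> predicted answer; no image access
    is_correct: Callable,     # (prediction, ground-truth answer) -> bool
) -> float:
    # Stage 1: caption the image. Stage 2: answer from the caption alone.
    # The resulting accuracy reflects how much visual information the caption carries.
    hits = 0
    for ex in samples:
        caption = caption_model(ex["image"])
        prediction = llm_answer(caption, ex["question"])
        hits += int(is_correct(prediction, ex["answer"]))
    return hits / max(len(samples), 1)
```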
The inference pipeline can be found in the `prism_benchmark` directory.
The scripts used to compute evaluation metrics are located in the `eval` directory.
Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). They should also abide by the policies of OpenAI: https://openai.com/policies/terms-of-use
- Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!