ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Long Xing* · Qidong Huang* · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Jinsong Li · Shuangrui Ding · Weiming Zhang · Nenghai Yu · Jiaqi Wang · Feng Wu · Dahua Lin

📖Paper | 🤗Datasets | 🤗Daily Paper

🌈 We introduce ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. With ScaleCap, we construct a dataset of 450k image-caption pairs for the open-source community. Our key observations highlight two inherent biases in LVLMs: multimodal bias, which results in imbalanced descriptive granularity, and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose two novel components: heuristic question answering and contrastive sentence rating. Extensive experiments demonstrate the effectiveness of ScaleCap.


📢 News

  • 🚀 [06/25/2025] We release the ScaleCap repository, training code, and dataset.

💡 Highlights

  • 🔥 A plug-and-play pipeline that improves caption quality: ScaleCap can be used simply by calling either open-source or closed-source model APIs, making it extremely convenient to use.
  • 🔥 450k Image-Caption Dataset: With ScaleCap, we construct a dataset containing 450k image-caption pairs for use by the open-source community.
  • 🔥 Extensive Experiments: We conduct extensive experiments on various tasks to demonstrate the effectiveness of ScaleCap.
  • 🔥 Open Source: We fully open-source the training code, training data, and evaluation scripts on GitHub to facilitate further research.

👨‍💻 Todo

  • Support more VQA benchmarks and complete evaluation.

🛠️ Setup

git clone https://github.com/Cooperx521/ScaleCap.git
cd ScaleCap
conda create -n ScaleCap python=3.10
conda activate ScaleCap
bash setup.sh

⭐️ Quick Start

To quickly get started with generating captions using ScaleCap, we provide an example script. Simply run the following command:

bash scripts/launch_example.sh

Notes

  • In our setup, Qwen and its VL series are deployed using the vLLM framework.

  • We have verified that this setup works on NVIDIA A100 GPUs.

  • If your GPU has limited memory, we recommend doubling the number of devices specified by CUDA_VISIBLE_DEVICES to avoid out-of-memory issues. You may also need to modify the GPU-related lines in the script to better suit your hardware configuration; a hedged sketch of the relevant vLLM setting follows.
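
For reference, the memory-relevant knob on the vLLM side is tensor parallelism, which splits each model across several GPUs. Below is a minimal sketch, assuming a recent vLLM release with Qwen2.5-VL support; the model name and values are illustrative, not copied from launch_example.sh.

# Minimal sketch: deploying a Qwen VL model with vLLM's offline engine.
# Assumptions: a recent vLLM with Qwen2.5-VL support; the model name and the
# values below are illustrative, not taken from launch_example.sh.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    tensor_parallel_size=2,       # increase together with CUDA_VISIBLE_DEVICES
    gpu_memory_utilization=0.9,   # fraction of each GPU's memory vLLM may claim
)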

Pretraining

Datasets

Our ScaleCap450k dataset is available on: 🔗 Hugging Face

This dataset contains 450,000 images along with their corresponding captions generated by ScaleCap.
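
The following is a minimal sketch for loading it with the Hugging Face datasets library; the dataset ID is a placeholder, use the actual ID from the link above.

# Minimal sketch, assuming the `datasets` library is installed; the dataset ID
# below is a placeholder -- substitute the real ScaleCap450k ID from the link.
from datasets import load_dataset

ds = load_dataset("ORG/ScaleCap450k")   # placeholder dataset ID
print(ds)                               # inspect available splits and fields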

Reproducing Pretraining Experiments

To reproduce the pretraining experiments presented in our paper:

  1. Initialize Qwen2.5-VL. Follow the steps in the notebook initiallize_vlm_3b.ipynb to set up the Qwen2.5-VL model for training.

  2. Training. You can then use LLaMA-Factory directly to run the training process.

Comparing Caption Quality via VQA

We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:

  1. First, a model generates a caption for the image.
  2. Then, a language model answers questions based solely on the generated caption.

This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
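
A minimal sketch of this two-stage protocol, assuming both models are served through vLLM's OpenAI-compatible API; the ports, model names, image URL, and question below are placeholders rather than values taken from this repository (the actual pipeline lives in prism_benchmark).

# Minimal sketch of the decoupled VQA evaluation.
# Assumptions: two OpenAI-compatible servers (e.g. vLLM) on ports 8000/8001;
# model names, image URL, and question are placeholders.
from openai import OpenAI

vlm = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
llm = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

# Stage 1: the captioner sees only the image, never the question.
caption = vlm.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",   # placeholder captioner
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        {"type": "text", "text": "Describe this image in as much detail as possible."},
    ]}],
).choices[0].message.content

# Stage 2: a text-only LLM answers the VQA question from the caption alone.
answer = llm.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",     # placeholder text-only answerer
    messages=[{"role": "user", "content": (
        f"Caption: {caption}\n"
        "Question: What is the person in the image holding?\n"
        "Answer using only the information in the caption."
    )}],
).choices[0].message.content
print(answer)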

Inference Code

The inference pipeline can be found in the prism_benchmark directory.

Evaluation Code

The scripts used to compute evaluation metrics are located in the eval directory.

📄 License


Usage and License Notices: The data and code are intended and licensed for research use only. They are released under the Attribution-NonCommercial 4.0 International license, and usage should also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use

❤️ Acknowledgments

  • Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
  • VLMEvalKit: Thanks for the amazing open-source suite for evaluating various LMMs!
