This project demonstrates how to generate captions for images using the BLIP (Bootstrapped Language-Image Pretraining) model by Salesforce, powered by the 🤗 Hugging Face Transformers library.
It is designed to run in Google Colab and uses a dataset of images (such as a subset of Flickr8k) to generate natural language captions.
- ✅ Uses `Salesforce/blip-image-captioning-base` for image captioning
- ✅ Automatically loads and processes images from a ZIP file
- ✅ GPU-accelerated via Google Colab
- ✅ Shows sample outputs using `matplotlib`
- ✅ Clean and modular Python code
The dataset used is a 2,000-image subset of the Flickr8k dataset.
📥 Download here:
https://www.kaggle.com/datasets/sanjeetbeniwal/flicker8k-2k
Expected structure inside the ZIP file:
```
Flickr8k_2k.zip
└── Flicker8k_2kDataset/
    ├── image1.jpg
    ├── image2.jpg
    └── ...
```
Upload this ZIP file to your Colab environment before running the notebook.
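Once the ZIP is uploaded, extracting it takes only the standard-library `zipfile` module. A minimal sketch (the `extract_dataset` helper name and its return value are illustrative, not taken from the notebook):

```python
import zipfile
from pathlib import Path

def extract_dataset(zip_path, out_dir="."):
    """Extract the uploaded dataset ZIP and return the image paths it contained."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    # The images sit inside the Flicker8k_2kDataset/ folder, so search recursively.
    return sorted(Path(out_dir).rglob("*.jpg"))
```

In Colab, `zip_path` would simply be `"Flickr8k_2k.zip"` after uploading via the file browser or `files.upload()`.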
The following Python packages are required:
```bash
pip install torch torchvision torchaudio
pip install transformers
pip install matplotlib
```
All dependencies are automatically installed in the Colab notebook.
- Setup: Install required libraries and enable GPU runtime.
- Dataset Unzipping: Upload and extract the dataset in Colab.
- Model Loading: Load the BLIP processor and model onto the GPU.
- Captioning: Select and caption random images.
- Visualization: Display images with generated captions using `matplotlib`.
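The model-loading and captioning steps above could look roughly like the following sketch. The helper names (`pick_random_images`, `caption_images`) are illustrative; only the model ID `Salesforce/blip-image-captioning-base` comes from this README:

```python
import random
from pathlib import Path

def pick_random_images(image_dir, k=5, seed=None):
    """Return up to k random image paths from image_dir."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    return random.Random(seed).sample(paths, min(k, len(paths)))

def caption_images(image_paths):
    """Caption each image with BLIP (requires torch, Pillow, and transformers)."""
    import torch
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    ).to(device)

    captions = {}
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=30)
        captions[Path(path).name] = processor.decode(out[0], skip_special_tokens=True)
    return captions
```

On a Colab GPU runtime, `device` resolves to `"cuda"` automatically; on CPU the same code runs, just more slowly.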
Below is an example of the model generating a caption for an image from the dataset:
Image: screenshot_20250721_235853.jpg
Generated Caption: `a child sitting in a play area`
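The visualization step pairs each image with its caption in a `matplotlib` figure. One way it might be written (the `show_captioned` helper is illustrative, not the notebook's exact code):

```python
import matplotlib.pyplot as plt

def show_captioned(image, caption):
    """Display one image with its generated caption as the title."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.imshow(image)                 # accepts a PIL image or a NumPy array
    ax.set_title(caption, wrap=True) # wrap long captions instead of clipping
    ax.axis("off")                   # hide pixel-coordinate ticks
    return fig
```

In Colab, returning the figure from the last expression in a cell (or calling `plt.show()`) renders it inline.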
- Model: `Salesforce/blip-image-captioning-base`
- Library: Hugging Face Transformers
- Pretrained for general image-to-text tasks.
- Open the notebook in Google Colab.
- Upload your dataset ZIP file to Colab (`Flickr8k_2k.zip`).
- Set the runtime to GPU: Runtime → Change runtime type → GPU.
- Run all cells sequentially.
- View the images and their generated captions.
This project is for educational and research purposes. It uses publicly available pretrained models under their respective licenses.