🖼️ Image Captioning with BLIP (Vision-Language Model)

This project demonstrates how to generate captions for images using the BLIP (Bootstrapped Language-Image Pretraining) model by Salesforce, powered by the 🤗 Hugging Face Transformers library.

It is designed to run in Google Colab and uses a dataset of images (such as a subset of Flickr8k) to generate natural language captions.


📌 Features

  • ✅ Uses Salesforce/blip-image-captioning-base for image captioning
  • ✅ Automatically loads and processes images from a ZIP file
  • ✅ GPU-accelerated via Google Colab
  • ✅ Shows sample outputs using matplotlib
  • ✅ Clean and modular Python code

📁 Dataset

The dataset used is a 2,000-image subset of the Flickr8k dataset.

📥 Download here:
https://www.kaggle.com/datasets/sanjeetbeniwal/flicker8k-2k

Expected structure inside the ZIP file:


Flickr8k_2k.zip
└── Flicker8k_2kDataset/
    ├── image1.jpg
    ├── image2.jpg
    └── ...

Upload this ZIP file to your Colab environment before running the notebook.
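A minimal extraction sketch is shown below. It assumes the ZIP has been uploaded to the Colab working directory under the name Flickr8k_2k.zip and that the extracted folder is named Flicker8k_2kDataset, as described above; adjust the paths if yours differ.

```python
import zipfile
from pathlib import Path

# Assumed locations: the uploaded ZIP sits in the Colab working directory.
zip_path = Path("Flickr8k_2k.zip")
extract_dir = Path(".")

# Extract the dataset next to the notebook.
with zipfile.ZipFile(zip_path, "r") as zf:
    zf.extractall(extract_dir)

# Collect the image files from the extracted folder.
image_dir = extract_dir / "Flicker8k_2kDataset"
image_paths = sorted(image_dir.glob("*.jpg"))
print(f"Found {len(image_paths)} images")
```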


🛠️ Dependencies

The following Python packages are required:

pip install torch torchvision torchaudio
pip install transformers
pip install matplotlib

All dependencies are automatically installed in the Colab notebook.


🚀 How It Works

  1. Setup: Install required libraries and enable GPU runtime.
  2. Dataset Unzipping: Upload and extract the dataset in Colab.
  3. Model Loading: Load BLIP processor and model to GPU.
  4. Captioning: Select and caption random images.
  5. Visualization: Display images with generated captions using matplotlib.
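Steps 3 and 4 might look roughly like the sketch below. The exact code in the notebook may differ; the model identifier Salesforce/blip-image-captioning-base comes from the project description, `image_paths` is assumed to come from the extraction sketch above, and the helper name `caption_image` is purely illustrative.

```python
import random

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP processor and captioning model, moving the model to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_image(image_path):
    """Generate a caption for a single image (unconditional captioning)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Pick a few random images from the dataset and caption them.
sample_paths = random.sample(image_paths, k=5)
captions = {p.name: caption_image(p) for p in sample_paths}
```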

📸 Sample Output

Below is an example of the model generating a caption for an image from the dataset:

Image: screenshot_20250721_235853.jpg
Generated Caption: "a child sitting in a play area"

[Screenshot: generated caption sample]


💡 Model Info

Model: Salesforce/blip-image-captioning-base, a base-size BLIP (Bootstrapped Language-Image Pretraining) model released by Salesforce and loaded through the Hugging Face Transformers library.

▶️ Usage Instructions

  1. Open the notebook in Google Colab.

  2. Upload your dataset ZIP file to Colab (Flickr8k_2k.zip).

  3. Set runtime to GPU:

    • Runtime → Change runtime type → GPU
  4. Run all cells sequentially.

  5. View the images and their generated captions.
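The final visualization step (images displayed alongside their generated captions) can be sketched as follows, continuing from the captioning sketch above; `sample_paths` and `captions` are assumed to come from that sketch.

```python
import matplotlib.pyplot as plt
from PIL import Image

# Show each sampled image with its generated caption as the title.
# Assumes more than one sampled image, so `axes` is an array of subplots.
fig, axes = plt.subplots(1, len(sample_paths), figsize=(4 * len(sample_paths), 4))
for ax, path in zip(axes, sample_paths):
    ax.imshow(Image.open(path))
    ax.set_title(captions[path.name], fontsize=9)
    ax.axis("off")
plt.tight_layout()
plt.show()
```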


📄 License

This project is for educational and research purposes. It uses publicly available pretrained models under their respective licenses.
