🖼️ Image Captioning with BLIP (Vision-Language Model)

This project demonstrates how to generate captions for images using the BLIP (Bootstrapped Language-Image Pretraining) model by Salesforce, powered by the 🤗 Hugging Face Transformers library.

It is designed to run in Google Colab and uses a dataset of images (such as a subset of Flickr8k) to generate natural language captions.


📌 Features

  • ✅ Uses Salesforce/blip-image-captioning-base for image captioning
  • ✅ Automatically loads and processes images from a ZIP file
  • ✅ GPU-accelerated via Google Colab
  • ✅ Shows sample outputs using matplotlib
  • ✅ Clean and modular Python code

📁 Dataset

The dataset used is a 2,000-image subset of the Flickr8k dataset.

📥 Download here:
https://www.kaggle.com/datasets/sanjeetbeniwal/flicker8k-2k

Expected structure inside the ZIP file:


Flickr8k_2k.zip
└── Flicker8k_2kDataset/
    ├── image1.jpg
    ├── image2.jpg
    └── ...

Upload this ZIP file to your Colab environment before running the notebook.
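A minimal extraction sketch is shown below. It assumes the ZIP has been uploaded to the Colab working directory under the name Flickr8k_2k.zip and that the extracted folder is named Flicker8k_2kDataset, as described above; adjust the paths if yours differ.

```python
import zipfile
from pathlib import Path

# Assumed locations: the uploaded ZIP sits in the Colab working directory.
zip_path = Path("Flickr8k_2k.zip")
extract_dir = Path(".")

# Extract the dataset next to the notebook.
with zipfile.ZipFile(zip_path, "r") as zf:
    zf.extractall(extract_dir)

# Collect the image files from the extracted folder.
image_dir = extract_dir / "Flicker8k_2kDataset"
image_paths = sorted(image_dir.glob("*.jpg"))
print(f"Found {len(image_paths)} images")
```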


🛠️ Dependencies

The following Python packages are required:

pip install torch torchvision torchaudio
pip install transformers
pip install matplotlib

All dependencies are automatically installed in the Colab notebook.


🚀 How It Works

  1. Setup: Install required libraries and enable GPU runtime.
  2. Dataset Unzipping: Upload and extract the dataset in Colab.
  3. Model Loading: Load BLIP processor and model to GPU.
  4. Captioning: Select and caption random images.
  5. Visualization: Display images with generated captions using matplotlib.
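Steps 3 and 4 might look roughly like the sketch below. The exact code in the notebook may differ; the model identifier Salesforce/blip-image-captioning-base comes from the project description, `image_paths` is assumed to come from the extraction sketch above, and the helper name `caption_image` is purely illustrative.

```python
import random

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP processor and captioning model, moving the model to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_image(image_path):
    """Generate a caption for a single image (unconditional captioning)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Pick a few random images from the dataset and caption them.
sample_paths = random.sample(image_paths, k=5)
captions = {p.name: caption_image(p) for p in sample_paths}
```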

📸 Sample Output

Below is an example of the model generating a caption for an image from the dataset:

Image: screenshot_20250721_235853.jpg
Generated Caption: "a child sitting in a play area"

[Screenshot: generated caption sample]


💡 Model Info

Model: Salesforce/blip-image-captioning-base, a base-size BLIP (Bootstrapped Language-Image Pretraining) model released by Salesforce and loaded through the Hugging Face Transformers library.

▶️ Usage Instructions

  1. Open the notebook in Google Colab.

  2. Upload your dataset ZIP file to Colab (Flickr8k_2k.zip).

  3. Set runtime to GPU:

    • Runtime → Change runtime type → GPU
  4. Run all cells sequentially.

  5. View the images and their generated captions.
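The final visualization step (images displayed alongside their generated captions) can be sketched as follows, continuing from the captioning sketch above; `sample_paths` and `captions` are assumed to come from that sketch.

```python
import matplotlib.pyplot as plt
from PIL import Image

# Show each sampled image with its generated caption as the title.
# Assumes more than one sampled image, so `axes` is an array of subplots.
fig, axes = plt.subplots(1, len(sample_paths), figsize=(4 * len(sample_paths), 4))
for ax, path in zip(axes, sample_paths):
    ax.imshow(Image.open(path))
    ax.set_title(captions[path.name], fontsize=9)
    ax.axis("off")
plt.tight_layout()
plt.show()
```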


📄 License

This project is for educational and research purposes. It uses publicly available pretrained models under their respective licenses.
