blackteck/Multimodal-RAG

Multimodal-RAG

🔧 Features

📄 Upload PDF documents and automatically convert pages into images.

🔍 Ask text-based queries about the content of the document.

🖼️ Ask image-based queries (e.g., screenshots or diagrams).

🧠 Combine text + image to perform deep visual-question answering.

🗂️ Retrieve top-3 relevant document pages using ColSmolVLM-based semantic search.

✍️ Get a concise summary or caption generated by the SmolVLM model.

🎛️ Clean and interactive Gradio GUI.

🚀 Installation & Setup

This project is designed to run in Google Colab with GPU acceleration.

🧠 Models Used

ColSmolVLM (vidore/colsmolvlm-v0.1) → Used for image-based semantic retrieval from documents.

SmolVLM (Idefics3) (HuggingFaceTB/SmolVLM-Instruct) → Used for generating answers, captions, and summaries.
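For intuition about the retrieval role, ColSmolVLM belongs to the ColPali family of retrievers, which represent a query and a page as sets of vectors and compare them with a ColBERT-style late-interaction (MaxSim) score. The sketch below illustrates that scoring and a top-k selection on toy 2-D vectors in plain Python. It is not the repository's code: the function names `maxsim_score` and `top_k_pages` and the toy embeddings are hypothetical, and in practice the model's own library computes the scores.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: each query vector takes its best-matching
    page vector, and the best matches are summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_k_pages(query_vecs: Sequence[Vector],
                pages: Sequence[Sequence[Vector]],
                k: int = 3) -> List[int]:
    """Return the indices of the k highest-scoring pages, best first."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy embeddings (hypothetical): two query vectors and four "pages".
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.0, 0.2]],               # weak match on the second query vector only
    [[1.0, 0.0], [0.0, 1.0]],   # matches both query vectors exactly
    [[0.4, 0.4]],               # partial match on both
    [[1.0, 0.0], [1.0, 0.0]],   # matches the first query vector only
]

best = top_k_pages(query, pages, k=3)
```

With `k=3` this mirrors the top-3 page retrieval described in the feature list; real embeddings have many more vectors per page and far more dimensions.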

💡 How It Works

PDF Upload: The uploaded PDF is split into images using pdf2image.

Indexing: Each page image is indexed using ColSmolVLM.

Querying:

For text-only queries, the app retrieves the top-k matching page images and generates answers.

For image-only queries, the app captions the image and retrieves related pages.

For multimodal queries (text + image), the app first captions the image, retrieves the matching document pages, and then uses the text query to generate answers.

Answer Aggregation: Multiple answers are summarized into one concise response using SmolVLM.
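The query routing and aggregation steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the repository's implementation: `answer_query` and its four callable parameters (`caption_image`, `retrieve_pages`, `answer_on_page`, `summarize`) are hypothetical names standing in for the SmolVLM and ColSmolVLM calls, and the stubs at the bottom exist only so the example runs without any model. The source does not say which prompt is used for image-only queries, so the sketch assumes the caption doubles as the question.

```python
from typing import Any, Callable, List, Optional


def answer_query(
    text: Optional[str],
    image: Optional[Any],
    *,
    caption_image: Callable[[Any], str],
    retrieve_pages: Callable[[str, int], List[Any]],
    answer_on_page: Callable[[Any, str], str],
    summarize: Callable[[List[str]], str],
    k: int = 3,
) -> str:
    """Route a text, image, or text+image query through
    retrieve -> answer per page -> summarize."""
    has_text = bool(text and text.strip())
    if not has_text and image is None:
        raise ValueError("Provide a text query, an image query, or both.")

    # Image-only and multimodal queries search with a caption of the image;
    # text-only queries search with the text itself.
    search_query = caption_image(image) if image is not None else text

    pages = retrieve_pages(search_query, k)

    # When text is present it is the question asked of each retrieved page.
    # Assumption: for image-only queries the caption doubles as the question.
    question = text if has_text else search_query

    answers = [answer_on_page(page, question) for page in pages]
    return summarize(answers)


# Stand-in callables (hypothetical) so the sketch runs without any model.
searched: List[str] = []


def fake_caption(image: Any) -> str:
    return f"caption({image})"


def fake_retrieve(query: str, k: int) -> List[str]:
    searched.append(query)  # record what was used for retrieval
    return [f"page{i}" for i in range(k)]


def fake_answer(page: str, question: str) -> str:
    return f"{page}->{question}"


def fake_summarize(answers: List[str]) -> str:
    return "; ".join(answers)


stubs = dict(
    caption_image=fake_caption,
    retrieve_pages=fake_retrieve,
    answer_on_page=fake_answer,
    summarize=fake_summarize,
)

text_only = answer_query("What is X?", None, k=2, **stubs)
multimodal = answer_query("What is X?", "diagram.png", k=2, **stubs)
```

Passing the model calls in as parameters keeps the routing logic separate from the heavy GPU code, which makes the three query modes easy to check in isolation.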

๐Ÿ–ฅ๏ธ Gradio UI

Component Purpose
File Upload Upload PDF to process
Text Query Enter your natural language question
Image Query Upload an image related to the document
Submit Button Run the multimodal RAG process
Output Gallery Show top-3 relevant document pages
Answer Box Display generated response
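As a rough illustration of how the components in the table could be wired together, the sketch below builds a Gradio `Blocks` layout with one widget per row. It is an assumed layout, not the repository's actual UI code: `run_pipeline` and `build_demo` are hypothetical names, and the handler is a placeholder that only validates its inputs instead of calling any model. Gradio is imported inside `build_demo` so the handler can be exercised without Gradio installed.

```python
from typing import Any, List, Optional, Tuple


def run_pipeline(pdf_file: Optional[Any],
                 text_query: Optional[str],
                 image_query: Optional[Any]) -> Tuple[List[Any], str]:
    """Placeholder handler: returns (gallery images, answer text).
    A real handler would index the PDF, retrieve pages, and call the model."""
    if pdf_file is None:
        return [], "Please upload a PDF first."
    has_text = bool(text_query and text_query.strip())
    if not has_text and image_query is None:
        return [], "Please enter a text query, an image query, or both."
    # No model is wired in here; report which inputs were received.
    mode = "text + image" if has_text and image_query is not None else (
        "text" if has_text else "image")
    return [], f"Received a {mode} query (no model wired in)."


def build_demo(handler=run_pipeline):
    """Build the UI; one component per row of the table above."""
    import gradio as gr  # lazy import: only needed when the UI is built

    with gr.Blocks(title="Multimodal RAG") as demo:
        pdf_file = gr.File(label="File Upload", file_types=[".pdf"])
        text_query = gr.Textbox(label="Text Query")
        image_query = gr.Image(label="Image Query", type="pil")
        submit = gr.Button("Submit")
        gallery = gr.Gallery(label="Output Gallery")
        answer = gr.Textbox(label="Answer Box", interactive=False)

        # The handler returns one value per output component, in order.
        submit.click(
            fn=handler,
            inputs=[pdf_file, text_query, image_query],
            outputs=[gallery, answer],
        )
    return demo


# In a Colab cell, the app would be started with: build_demo().launch()
```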

About

Multimodal RAG using ColSmolVLM on a Colab free-tier GPU
