Multimodal-RAG

🔧 Features

📄 Upload PDF documents and automatically convert pages into images.

🔍 Ask text-based queries about the content of the document.

🖼️ Ask image-based queries (e.g., screenshots or diagrams).

🧠 Combine text and image inputs for visual question answering.

🗂️ Retrieve top-3 relevant document pages using ColSmolVLM-based semantic search.

✍️ Get a concise summary or caption generated by the SmolVLM model.

🎛️ Clean and interactive Gradio GUI.

🚀 Installation & Setup

This project is designed to run in Google Colab with GPU acceleration.

🧠 Models Used

ColSmolVLM (vidore/colsmolvlm-v0.1) → Used for image-based semantic retrieval from documents.

SmolVLM (Idefics3) (HuggingFaceTB/SmolVLM-Instruct) → Used for generating answers, captions, and summaries.
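
ColSmolVLM belongs to the ColPali family of retrievers, which are generally described as scoring a query against a page with ColBERT-style late interaction (MaxSim) over multi-vector embeddings. The sketch below illustrates that scoring idea in plain Python, assuming the embeddings have already been computed. The names `maxsim_score` and `top_k_pages` are hypothetical and are not taken from the notebook.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best match
    among the page vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_k_pages(query_vecs: Sequence[Vector],
                index: Sequence[Sequence[Vector]],
                k: int = 3) -> List[int]:
    """Return the indices of the k highest-scoring pages, best first."""
    scores = [maxsim_score(query_vecs, page_vecs) for page_vecs in index]
    return sorted(range(len(index)), key=lambda i: scores[i], reverse=True)[:k]
```

In practice the model's own processor and scoring utilities would do this on GPU tensors; the pure-Python version is only meant to show why each page is stored as many vectors rather than one.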

💡 How It Works

PDF Upload: The uploaded PDF is split into images using pdf2image.

Indexing: Each page image is indexed using ColSmolVLM.

Querying:

For text-only queries, it retrieves top-k matching page images and generates answers.

For image-only queries, it captions the image and retrieves related pages.

For multimodal (text + image), it first captions the image, retrieves matching document pages, and then uses the text query to generate answers.

Answer Aggregation: Multiple answers are summarized into one concise response using SmolVLM.
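
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `RagPipeline` and its four callables (`retrieve`, `caption`, `answer`, `summarize`) are hypothetical stand-ins for the ColSmolVLM and SmolVLM wrappers. It also assumes that the image caption is used as the search query whenever an image is supplied, and that an image-only query returns the caption as its response. Only `pdf2image.convert_from_path` is a real library call.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


def pdf_to_page_images(pdf_path: str, dpi: int = 150) -> List[Any]:
    """Split a PDF into one PIL image per page using pdf2image."""
    # Imported lazily so the rest of the sketch runs without pdf2image/poppler.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)


@dataclass
class RagPipeline:
    # All four callables are hypothetical stand-ins for the model wrappers.
    retrieve: Callable[[str, int], List[Any]]  # (search text, k) -> top-k pages
    caption: Callable[[Any], str]              # image -> caption
    answer: Callable[[str, Any], str]          # (question, page) -> answer
    summarize: Callable[[List[str]], str]      # per-page answers -> one response
    top_k: int = 3

    def run(self, text: Optional[str] = None, image: Any = None) -> Tuple[List[Any], str]:
        if not text and image is None:
            raise ValueError("Provide a text query, an image query, or both.")

        if image is None:
            # Text-only: the question itself drives retrieval.
            pages = self.retrieve(text, self.top_k)
        else:
            # Image-only or multimodal: caption the image, retrieve with the caption
            # (assumption: the caption serves as the search query).
            image_caption = self.caption(image)
            pages = self.retrieve(image_caption, self.top_k)
            if not text:
                # Assumption: with no question to answer, return the caption.
                return pages, image_caption

        # Answer the text question against each retrieved page, then aggregate.
        answers = [self.answer(text, page) for page in pages]
        return pages, self.summarize(answers)
```

Passing the models in as callables keeps the routing logic testable without a GPU; in Colab each callable would wrap the corresponding model call.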

🖥️ Gradio UI

Component	Purpose
File Upload	Upload PDF to process
Text Query	Enter your natural language question
Image Query	Upload an image related to the document
Submit Button	Run the multimodal RAG process
Output Gallery	Show top-3 relevant document pages
Answer Box	Display generated response
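
A layout matching the table above might look like the sketch below, using Gradio's Blocks API. The handler `run_rag` is a hypothetical placeholder that only validates its inputs; the real notebook would call the retrieval and generation models there. The component labels are taken from the table, while the function names are assumptions.

```python
def run_rag(pdf_file, text_query, image_query):
    """Hypothetical handler: returns (gallery images, answer text)."""
    if pdf_file is None:
        return [], "Please upload a PDF first."
    if not text_query and image_query is None:
        return [], "Please enter a text query, an image query, or both."
    # Placeholder: the real pipeline would index the PDF, retrieve the
    # top-3 pages, and generate a response here.
    return [], "Pipeline not wired up in this sketch."


def build_demo(handler=run_rag):
    """Assemble the UI; gradio is imported lazily so the handler is testable alone."""
    import gradio as gr

    with gr.Blocks(title="Multimodal-RAG") as demo:
        pdf_file = gr.File(label="File Upload", file_types=[".pdf"])
        text_query = gr.Textbox(label="Text Query")
        image_query = gr.Image(label="Image Query", type="pil")
        submit = gr.Button("Submit")
        gallery = gr.Gallery(label="Output Gallery", columns=3)
        answer = gr.Textbox(label="Answer Box")
        # One click runs the whole pipeline and fills both outputs.
        submit.click(
            handler,
            inputs=[pdf_file, text_query, image_query],
            outputs=[gallery, answer],
        )
    return demo
```

In a Colab cell, `build_demo().launch()` would start the interface.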
