Features

- Upload PDF documents and automatically convert pages into images.
- Ask text-based queries about the content of the document.
- Ask image-based queries (e.g., screenshots or diagrams).
- Combine text + image to perform deep visual-question answering.
- Retrieve top-3 relevant document pages using ColSmolVLM-based semantic search.
- Get a concise summary or caption generated by the SmolVLM model.
- Clean and interactive Gradio GUI.

Installation & Setup

This project is designed to run in Google Colab with GPU acceleration.
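
Before loading any models, it can help to confirm that the Colab runtime actually has a GPU attached. The check below is a minimal sketch and not part of the project code. It uses only the standard library and assumes an NVIDIA GPU whose driver tools expose `nvidia-smi`; the helper name `gpu_available` is hypothetical.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools found, e.g. a CPU-only runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected")
else:
    print("No GPU detected; select a GPU under Runtime > Change runtime type")
```

If the check reports no GPU in Colab, switching the runtime type to a GPU and restarting the session is usually enough.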

Models Used

- ColSmolVLM (vidore/colsmolvlm-v0.1): used for image-based semantic retrieval from documents.
- SmolVLM (Idefics3) (HuggingFaceTB/SmolVLM-Instruct): used for generating answers, captions, and summaries.
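
ColPali-family retrievers such as ColSmolVLM represent a query and each page as a set of vectors and rank pages with ColBERT-style late interaction (MaxSim). The sketch below illustrates that scoring rule and the top-3 selection on toy 2-D vectors. It is an illustration only: the function names are hypothetical, the vectors are made up, and in the real project the model and its retrieval library compute these embeddings and scores.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: each query vector takes its best match
    among the page vectors, and the best matches are summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_k_pages(query_vecs: List[Vector], pages: List[List[Vector]], k: int = 3) -> List[int]:
    """Return the indices of the k highest-scoring pages, best first."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data (hypothetical): one query with two vectors, four "pages".
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[-1.0, 0.0]],              # score -1.0
    [[0.5, 0.5]],               # score  1.0
    [[1.0, 0.0], [0.0, 1.0]],   # score  2.0
    [[0.0, 0.8]],               # score  0.8
]
print(top_k_pages(query, pages, k=3))  # → [2, 1, 3]
```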

How It Works

1. PDF Upload: The uploaded PDF is split into page images using pdf2image.
2. Indexing: Each page image is indexed using ColSmolVLM.
3. Querying:
   - For text-only queries, it retrieves the top-k matching page images and generates answers.
   - For image-only queries, it captions the image and retrieves related pages.
   - For multimodal (text + image) queries, it first captions the image, retrieves matching document pages, and then uses the text query to generate answers.
4. Answer Aggregation: Multiple answers are summarized into one concise response using SmolVLM.
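
The three query modes and the aggregation step can be sketched as a single routing function. This is a hedged outline, not the project's actual code: `answer_query` and the four callables it receives (`caption_image`, `retrieve`, `answer_on_page`, `summarize`) are hypothetical stand-ins for the SmolVLM and ColSmolVLM calls. For image-only queries the sketch assumes the caption doubles as the question, which the description above does not specify.

```python
from typing import Callable, List, Optional


def answer_query(
    text: Optional[str],
    image: Optional[object],
    caption_image: Callable[[object], str],
    retrieve: Callable[[str, int], List[object]],
    answer_on_page: Callable[[object, str], str],
    summarize: Callable[[List[str]], str],
    k: int = 3,
) -> str:
    """Route a text, image, or text + image query through retrieval and answering."""
    text = (text or "").strip()
    if not text and image is None:
        raise ValueError("Provide a text query, an image query, or both.")

    # When an image is present, its caption drives retrieval;
    # otherwise the text query itself is the search string.
    search_query = caption_image(image) if image is not None else text

    pages = retrieve(search_query, k)

    # The text query is the question whenever one was given.
    # Assumption: for image-only queries the caption is reused as the question.
    question = text if text else search_query

    # One answer per retrieved page, then a single aggregated response.
    answers = [answer_on_page(page, question) for page in pages]
    return summarize(answers)
```

Passing the model calls in as parameters keeps the routing logic testable without loading either model; in practice they would wrap the retriever index and the SmolVLM generation call.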

Gradio UI

Component | Purpose |
---|---|
File Upload | Upload PDF to process |
Text Query | Enter your natural language question |
Image Query | Upload an image related to the document |
Submit Button | Run the multimodal RAG process |
Output Gallery | Show top-3 relevant document pages |
Answer Box | Display generated response |
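
The components in the table could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual interface code: `make_handler`, `build_ui`, and the `pipeline` callable (taking the uploaded PDF, the optional text query, and the optional image, and returning page images plus an answer string) are hypothetical names. Gradio is imported inside `build_ui` so the input-validation logic can be exercised without it installed.

```python
from typing import Callable, List, Optional, Tuple

Pipeline = Callable[[object, Optional[str], Optional[object]], Tuple[List[object], str]]


def make_handler(pipeline: Pipeline):
    """Wrap a pipeline with basic input checks for the Submit button."""

    def on_submit(pdf, text_query, image_query):
        if not pdf:
            return [], "Please upload a PDF first."
        text_query = (text_query or "").strip()
        if not text_query and image_query is None:
            return [], "Enter a text query, upload an image, or both."
        # Returns (pages for the gallery, text for the answer box).
        return pipeline(pdf, text_query or None, image_query)

    return on_submit


def build_ui(pipeline: Pipeline):
    """Build a Blocks layout mirroring the component table above."""
    import gradio as gr  # lazy import: only needed when the UI is built

    with gr.Blocks(title="Multimodal RAG") as demo:
        pdf = gr.File(label="File Upload", file_types=[".pdf"])
        text = gr.Textbox(label="Text Query")
        image = gr.Image(label="Image Query", type="pil")
        submit = gr.Button("Submit")
        gallery = gr.Gallery(label="Top-3 relevant pages")
        answer = gr.Textbox(label="Answer")
        submit.click(
            make_handler(pipeline),
            inputs=[pdf, text, image],
            outputs=[gallery, answer],
        )
    return demo
```

In a Colab cell, something like `build_ui(my_pipeline).launch()` would then start the interface, where `my_pipeline` is whatever function runs retrieval and answer generation.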