A RAG based streamlit application that allows users to index multiple PDFs and query them independently.

anishka07/Intellidocs

IntelliDocs

Overview

IntelliDocs is a Retrieval-Augmented Generation (RAG) project for querying and extracting information from PDF documents. It extracts and chunks the text of each PDF, embeds the chunks with Sentence Transformers, and retrieves the chunks most semantically similar to a user's query, so relevant content can be found quickly in large volumes of text.

Project Objectives

  1. PDF Extraction: Extract text from PDF files while preserving formatting and structure.
  2. Chunking: Divide the extracted text into manageable chunks to make querying efficient.
  3. Embedding: Use Sentence Transformers to generate embeddings for the text chunks, enabling semantic similarity search.
  4. Querying: Provide a retrieval system that takes a user query and returns the chunks of text most semantically similar to it.
  5. Structuring: Structure the generated response with the help of an LLM.
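The chunking objective can be illustrated with a minimal sketch. It assumes word-based chunks with a fixed overlap. The function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical and are not taken from the IntelliDocs code, which may chunk by sentences or characters instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks that share `overlap` words.

    Hypothetical helper for illustration; not the IntelliDocs implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.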

Technologies Used

  • Programming Language: Python
  • Libraries:
    • fitz (the import name of PyMuPDF): For PDF text extraction.
    • sentence-transformers: For embedding text chunks.
    • Streamlit: For creating the user interface.
    • ChromaDB: For the vector database.

Step-by-Step Guide to Clone and Run IntelliDocs

Prerequisites

Ensure you have the following installed on your system:

  • Python (version 3.12)
  • uv (Python package and project manager)
  • Git

Step 1: Clone the Repository

Open your terminal or command prompt and run the following command:

git clone https://github.com/anishka07/intellidocs.git

Step 2: Create a runnable environment automatically with uv

Change into the cloned directory (cd intellidocs) and run the following command:

uv sync 

Step 3: Run IntelliDocs using gRPC or Streamlit

Start the IntelliDocs gRPC server, then use the client from a second terminal to process a PDF and query it by its PDF key:

uv run python server.py 
uv run python client.py process *your pdf's name*

uv run python client.py query *your pdf key* *your query*

To run IntelliDocs from its Streamlit UI:

uv run streamlit run ui.py

Streamlit Interface

User Interface (screenshot)

Indexing multiple PDFs as input (screenshot)

Query Response, both structured and relevant chunks (screenshot)

Usage

  1. Input PDF: Upload one or more PDFs using the Streamlit interface.
  2. Querying: Select the PDF you want to query by its unique generated PDF key, then enter your query.
  3. Results: The system returns the most relevant text chunks from the selected PDF, along with a response structured by the LLM.
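The flow above (index several PDFs, then query one by its key) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `make_pdf_key`, `embed`, and `PdfIndex` are hypothetical names, the key is assumed to be a hash of the file contents, and a simple bag-of-words vector stands in for Sentence Transformers embeddings and the vector database so that the example runs without extra dependencies.

```python
import hashlib
import math
from collections import Counter


def make_pdf_key(pdf_bytes: bytes) -> str:
    # Assumption: the key is derived from the file contents.
    # IntelliDocs may generate its keys differently.
    return hashlib.sha256(pdf_bytes).hexdigest()[:12]


def embed(text: str) -> Counter:
    # Stand-in for a Sentence Transformers embedding: word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class PdfIndex:
    """In-memory index holding one independent list of chunks per PDF key."""

    def __init__(self) -> None:
        self._store: dict[str, list[tuple[str, Counter]]] = {}

    def add(self, pdf_bytes: bytes, chunks: list[str]) -> str:
        key = make_pdf_key(pdf_bytes)
        self._store[key] = [(chunk, embed(chunk)) for chunk in chunks]
        return key

    def query(self, key: str, question: str, top_k: int = 3) -> list[str]:
        # Only chunks stored under `key` are searched, so each PDF
        # is queried independently of the others.
        q = embed(question)
        ranked = sorted(
            self._store[key],
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]
```

For example, after `key = index.add(pdf_bytes, chunks)`, calling `index.query(key, "what do dogs chase", top_k=1)` returns the single chunk from that PDF whose vector is closest to the question. Chunks belonging to other keys are never considered.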

TODOs

  1. Make the gRPC code more robust
  2. Make the code more dynamic
  3. Build a web application with FastAPI
  4. Dockerize the application
