Created during Days 80-83 of my self-study journey.
📁pdf-rag-from-scratch
 └── 📁dev
     └── dev_preprocess_pdf.ipynb -> preprocess the PDF using llama-index
     └── dev_rag.ipynb -> run RAG using the llama-index-preprocessed PDF
 └── lbg_relationship_tnc.pdf
 └── lbg_relationship_tnc_locked.pdf
 └── preprocess_pdf.py -> preprocess the PDF using langchain
 └── rag.py -> run a local RAG chat using the langchain-preprocessed PDF
 └── requirements.txt
1. git clone https://github.com/divakaivan/pdf-rag-from-scratch.git
2. pip install -r requirements.txt
3. python preprocess_pdf.py
-> The PDF must be saved in the same directory as the script. It reads and processes the PDF for you, and outputs a CSV with the embeddings (Note! suitable for up to ~100k embeddings)
4. python rag.py
-> Downloads gemma-2b-it, runs the RAG pipeline, and lets you have a chat with your PDF
5. (Optional) Run the dev versions (dev_preprocess_pdf.ipynb and dev_rag.ipynb), which use llama-index as the PDF reader, and compare the answer quality
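The retrieval step behind the chat in rag.py can be sketched as follows. This is a minimal illustration, not the repo's actual implementation: it assumes each PDF chunk's embedding has already been loaded from the CSV into a NumPy array, and ranks chunks by cosine similarity to the query embedding (the function name `top_k_chunks` and the toy vectors are made up for the example):

```python
import numpy as np

def top_k_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query, by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)                          # normalize query
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True) # normalize chunks
    scores = c @ q                                                     # cosine similarities
    return list(np.argsort(scores)[::-1][:k])                          # best-first indices

# Toy example: 3 chunk embeddings of dimension 4
chunks = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k_chunks(query, chunks, k=2))  # -> [0, 2]
```

The top-ranked chunks would then be pasted into the LLM prompt as context before generating an answer.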
The files in the dev folder are used for development and rely on llama-index. At the time of writing, llama-index requires an API key; it is currently free, but that may change in the future.
In preprocess_pdf.py and rag.py I use only local, pip-installable libraries.
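Before embedding, the preprocessing step typically splits the PDF text into overlapping chunks so that context is not lost at chunk boundaries. A minimal word-based sketch (the chunk size, overlap, and function name are illustrative assumptions, not the values used in preprocess_pdf.py):

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# A 120-word document yields 3 chunks of up to 50 words each
doc = " ".join(f"word{i}" for i in range(120))
print(len(chunk_text(doc)))  # -> 3
```

Each chunk would then be embedded and written out as one row of the CSV alongside its vector.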
Demo video: rag_video_demo.mp4
- embedding model: mixedbread-ai/mxbai-embed-large-v1
- LLM: google/gemma-2b-it