A modular information retrieval system built from scratch with traditional IR, neural reranking, vector search, and RAG-style generation.
The system exposes a CLI-based client–server interface.
This tool uses the Gemini API, so create a `.env` file containing `GEMINI_API=YOUR_API_KEY`.
## Features

- **Crawler** (see the crawler sketch after this list)
  - Starts from manual seeds or expands dynamically based on queries.
  - Extracts and stores page text into `docs.jsonl`.
- **Indexing** (see the index sketch below)
  - Builds an inverted index (`index.json`).
  - Tracks term frequency, document frequency, and document lengths.
- **Ranking** (see the BM25 sketch below)
  - Implements TF-IDF and BM25.
  - Supports phrase queries and fuzzy matching.
- **Vector Search**
  - Stores dense embeddings using FAISS.
  - Compares query embeddings to document embeddings for semantic search.
- **Hybrid Scoring** (see the merging sketch below)
  - Combines BM25 and FAISS scores with weighted merging.
  - Produces top-k candidates.
- **Reranking**
  - Uses a cross-encoder for fine-grained scoring of top candidates.
- **RAG-style Generation**
  - Passes the top retrieved documents, together with the query, to a generative model (e.g., Gemini).
  - Produces coherent, context-aware answers.
- **Text Preprocessing** (see the preprocessing sketch below)
  - Preprocesses query and document text.
  - Tokenization and stemming.
- **Client–Server Architecture** (see the minimal server sketch below)
  - Server: hosts the index, embeddings, and search logic.
  - Client: CLI that sends queries, receives results, and displays answers.
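A minimal sketch of the crawl-and-store step, assuming the `requests` and `beautifulsoup4` libraries; the real crawler in `packages/crawler` is configured via `crawler_config.json` and is more involved:

```python
import json

import requests                   # assumed HTTP client
from bs4 import BeautifulSoup     # assumed HTML parser

# Hypothetical seed list; the real crawler reads crawler_config.json.
SEEDS = ["https://example.com"]

with open("docs.jsonl", "a", encoding="utf-8") as out:
    for url in SEEDS:
        html = requests.get(url, timeout=10).text
        # Strip markup and keep only the visible page text.
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        out.write(json.dumps({"url": url, "text": text}) + "\n")
```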
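A toy version of the index build, showing the statistics tracked in `index.json`; the field names here are illustrative, not the project's actual schema:

```python
import json
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: per-term term frequencies (tf),
    document frequencies (df), and document lengths."""
    postings = defaultdict(dict)       # term -> {doc_id: tf}
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # the real code preprocesses first
        doc_lengths[doc_id] = len(tokens)
        for tok in tokens:
            postings[tok][doc_id] = postings[tok].get(doc_id, 0) + 1
    return {
        "postings": dict(postings),
        "df": {t: len(p) for t, p in postings.items()},
        "doc_lengths": doc_lengths,
    }

index = build_index({"d1": "the cat sat", "d2": "the cat sat on the mat"})
with open("index.json", "w") as f:
    json.dump(index, f)
```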
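Continuing that sketch, a standard BM25 scorer over the same index structure; `k1` and `b` are the textbook defaults, not values taken from this project:

```python
import math

def bm25_score(query_terms, doc_id, index, k1=1.5, b=0.75):
    """Score one document against a query with the standard BM25 formula."""
    N = len(index["doc_lengths"])                   # corpus size
    avgdl = sum(index["doc_lengths"].values()) / N  # average doc length
    dl = index["doc_lengths"][doc_id]
    score = 0.0
    for term in query_terms:
        df = index["df"].get(term, 0)
        if df == 0:
            continue                                # term not in corpus
        tf = index["postings"][term].get(doc_id, 0)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

print(bm25_score(["cat"], "d2", index))  # score "d2" for the query ["cat"]
```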
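A sketch of the weighted merge; the min-max normalization and the `alpha` weight are assumptions rather than the project's actual settings:

```python
def hybrid_merge(bm25_scores, faiss_scores, alpha=0.5, k=10):
    """Merge two {doc_id: score} dicts into a ranked top-k list."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0            # avoid division by zero
        return {d: (s - lo) / span for d, s in scores.items()}

    b, f = normalize(bm25_scores), normalize(faiss_scores)
    merged = {d: alpha * b.get(d, 0.0) + (1 - alpha) * f.get(d, 0.0)
              for d in set(b) | set(f)}
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)[:k]

print(hybrid_merge({"d1": 2.3, "d2": 0.9}, {"d2": 0.8, "d3": 0.5}))
```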
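A minimal preprocessing pass, assuming NLTK's Porter stemmer; the project's actual tokenizer and stemmer may differ:

```python
import re

from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens]

print(preprocess("Running crawlers and building indexes"))
# -> ['run', 'crawler', 'and', 'build', 'index']
```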
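And a minimal client–server round trip using only the standard library; the real `server.py` and `client.py` may use a different framework and wire protocol:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Parse the query from e.g. /?q=your+query
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        results = [{"doc_id": "d1", "score": 1.0}]  # placeholder ranking
        body = json.dumps({"query": query, "results": results}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), SearchHandler).serve_forever()
```

A client can then fetch `http://localhost:8000/?q=your+query` and print the JSON results.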
## How It Works

Documents and queries are converted into dense embeddings using a language model. This allows semantic search (matching meaning, not just words).
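A sketch of that flow, assuming `sentence-transformers` for the embedding model (the model name is illustrative) and a flat inner-product FAISS index over unit-normalized vectors:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Model name is illustrative; the project's embedding model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["FAISS enables fast vector search.",
        "BM25 ranks documents by term statistics."]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# With unit-normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype="float32"))

query_vec = model.encode(["how does semantic search work"],
                         normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
print(ids[0], scores[0])   # top-2 documents by semantic similarity
```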
A transformer-based model scores the relevance of query–document pairs. Unlike BM25 (bag-of-words), this uses deep contextual understanding of language.
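A sketch of the reranking step, assuming the `CrossEncoder` class from `sentence-transformers`; the model name is illustrative:

```python
from sentence_transformers import CrossEncoder

# Model name is illustrative; the project's reranker may differ.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does bm25 work"
candidates = ["BM25 ranks documents by term statistics.",
              "FAISS enables fast vector search."]

# Each (query, document) pair is encoded jointly, so the model sees
# full token-level interaction between the two texts.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
print(reranked[0])
```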
A generative model (e.g., Gemini) is given the query plus the retrieved context. It produces a natural-language answer, simulating an intelligent assistant.
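A sketch of the generation step using the `google-generativeai` client and the `GEMINI_API` key from `.env`; the model name and prompt template are assumptions:

```python
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()                                   # reads GEMINI_API from .env
genai.configure(api_key=os.environ["GEMINI_API"])

def answer(query, retrieved_docs):
    """Stuff the retrieved passages into the prompt and generate an answer."""
    context = "\n\n".join(retrieved_docs)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    # Model name is an assumption; swap in whichever Gemini model you use.
    model = genai.GenerativeModel("gemini-1.5-flash")
    return model.generate_content(prompt).text
```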
## Installation

```bash
git clone https://github.com/kaifkh20/ai-engine
cd ai-engine
pip install -r req.txt
```
Make sure Python 3.8.20 is installed; use pyenv to pin this version locally.
## Usage

```bash
python -m packages.crawler   # run the crawler; needed only on the first run
                             # or after updating crawler_config.json
python server.py
python client.py "your query here"
```
## Requirements

As stated in `req.txt`.
## Future Work

- Scale crawling for larger datasets.
- Web client.
- Extend the client with more commands.
- Benchmark against standard IR datasets.