A very basic search engine built from scratch in Python to learn the fundamentals of how systems like Google Search work, including crawling, indexing, and keyword-based ranking using TF-IDF.
I started this tiny project because I wondered how Safari always used Google Search as its default search engine, and how search engines work in general. To understand them better, I decided to build one myself.
- Crawl webpages (using `requests` + `BeautifulSoup`)
- Extract and clean up main text content
- Index documents using `TfidfVectorizer`
- Search and rank results by relevance (TF-IDF)
- Simple CLI (command-line interface) for querying
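
To make the TF-IDF ranking in the list above concrete, here is a minimal pure-Python sketch of the classic formulation (term frequency times the log of inverse document frequency). The toy documents and the `tf_idf` and `score` helpers are made up for illustration and are not taken from this repository. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): page name -> text.
docs = {
    "python": "python is a programming language",
    "snake": "a python is a large snake",
    "java": "java is a programming language",
}

# Naive whitespace tokenization is enough for this sketch.
tokenized = {name: text.split() for name, text in docs.items()}


def tf_idf(term, name):
    """Classic TF-IDF: (count / doc length) * log(N / document frequency)."""
    words = tokenized[name]
    tf = Counter(words)[term] / len(words)
    df = sum(1 for w in tokenized.values() if term in w)
    if df == 0:
        return 0.0  # the term appears nowhere, so it carries no signal
    return tf * math.log(len(tokenized) / df)


def score(query, name):
    """Score a page by summing TF-IDF over the query terms."""
    return sum(tf_idf(term, name) for term in query.split())


ranking = sorted(docs, key=lambda name: score("python language", name), reverse=True)
print(ranking)
```

A word such as "is" appears in every toy document, so its IDF is log(1) = 0 and it contributes nothing to the score. That is why TF-IDF favors terms that are distinctive for a page.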
`crawler.py`: This file handles downloading web pages (through HTTP GET requests) and extracting readable text from HTML. In a real search engine, crawling is the first step: Google, for example, uses crawlers (e.g. Googlebot) to visit billions of pages and gather their content.
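
The crawling step described above could look roughly like the sketch below. It is not the repository's actual `crawler.py`: the function names `extract_text` and `crawl`, the 10-second timeout, and the choice to drop `script` and `style` tags are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_text(html):
    """Strip markup and return the readable text of an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    # Scripts and styles are not readable content, so drop them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Collapse runs of whitespace into single spaces.
    return " ".join(soup.get_text(separator=" ").split())


def crawl(url, timeout=10):
    """Download one page with an HTTP GET; return its text, or None on failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return extract_text(response.text)


# Offline demo on an inline HTML snippet, so no network access is needed.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Search  engines</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

Returning `None` on a failed request lets the caller skip unreachable pages rather than crash partway through the crawl.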
`indexer.py`: This file builds the index, the core data structure that enables fast and relevant searches. We use TF-IDF (Term Frequency-Inverse Document Frequency) to rank results by how relevant each one is to the search terms. Search engines don't scan raw text at query time because that would be too slow. Instead, we pre-process documents into an index so we can quickly match queries to relevant content.
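
As a rough illustration of indexing ahead of time and ranking at query time, the sketch below fits a `TfidfVectorizer` once and then scores each query against the stored matrix. It is not the repository's actual `indexer.py`: the `example.com` URLs, the page texts, the `search` function, the English stop-word list, and the use of cosine similarity as the relevance score are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "crawled" pages (made up for illustration): URL -> extracted text.
pages = {
    "https://example.com/python": "Python is a popular programming language",
    "https://example.com/snakes": "The python is a large nonvenomous snake",
    "https://example.com/coffee": "Java is also a word for coffee",
}
urls = list(pages)

# Indexing step: done once, ahead of query time.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(list(pages.values()))


def search(query, top_k=3):
    """Return (url, score) pairs for pages that match the query, best first."""
    # Query step: reuse the fitted vocabulary and IDF weights.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    # Pages with a zero score share no terms with the query, so omit them.
    return [(url, float(s)) for url, s in ranked[:top_k] if s > 0]


for url, s in search("python language"):
    print(f"{s:.3f}  {url}")
```

Only `transform` runs per query, so the expensive work of tokenizing and weighting every document happens once, when the index is built.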
`main.py`: This is the script that controls the entire workflow of the search engine and provides the CLI for querying. It contains a small dictionary of URLs (this is our "web") and uses the crawler to fetch each page's content and add it to the index.
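
The query loop could be sketched as follows. This is not the repository's actual `main.py`: `run_cli`, its `read` and `write` parameters, the output format, and the stand-in `fake_search` function are hypothetical. Only the prompt text comes from this README.

```python
def run_cli(search, read=input, write=print):
    """Prompt for queries until the user types 'exit'."""
    while True:
        query = read("Enter your search query (or 'exit'): ").strip()
        if query.lower() == "exit":
            break
        if not query:
            continue  # ignore empty input and prompt again
        results = search(query)
        if not results:
            write("No matching pages found.")
            continue
        for rank, (url, score) in enumerate(results, start=1):
            write(f"{rank}. {url} (score: {score:.3f})")


# Stand-in search function so the sketch runs on its own; the real script
# would crawl its URL dictionary and query the TF-IDF index instead.
def fake_search(query):
    return [("https://example.com/python", 0.664)] if "python" in query else []


# Demo with scripted input instead of a live keyboard.
scripted = iter(["python language", "coffee", "exit"])
output = []
run_cli(fake_search, read=lambda prompt: next(scripted), write=output.append)
print("\n".join(output))
```

Calling `run_cli(fake_search)` with the default arguments reads from the keyboard and prints to the terminal. Passing `read` and `write` explicitly is a design choice that keeps the loop testable without user interaction.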
- Clone this repository:
    git clone https://github.com/JasonSu14/Mini_Search_Engine.git
    cd Mini_Search_Engine
- Create a virtual environment:
    python3 -m venv venv
- Activate the virtual environment:
    source venv/bin/activate
- Install required dependencies:
    pip install -r requirements.txt
- Deactivate when you're done:
    deactivate
- Run the code:
    python3 main.py
- Then type in a search query like:
    Enter your search query (or 'exit'): python language
- See the results: the program returns a list of the top matching pages and their relevance scores.
- Every time you want to work on the project:
    source venv/bin/activate
- When you're done:
    deactivate
- Jason Su - Code & Implementation