Mini Search Engine

A very basic search engine built from scratch in Python to learn the fundamentals of how systems like Google Search work, including crawling, indexing, and keyword-based ranking using TF-IDF.

I started this tiny project because I was wondering why Safari always uses Google Search as its default search engine and, more generally, how any search engine works. Building one myself seemed like the best way to understand it.

Features

Crawl webpages (using requests + BeautifulSoup)
Extract and clean up main text content
Index documents using TfidfVectorizer
Search and rank results by relevance (TF-IDF)
Simple CLI (command-line interface) for querying

What Does Each File Do?

crawler.py: This file downloads web pages (through HTTP GET requests) and extracts readable text from the HTML. In a real search engine, crawling is the first step: Google, for example, uses crawlers (e.g. Googlebot) to visit billions of pages and gather their content.
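
The download-and-extract step described above can be sketched roughly as follows. This is a minimal illustration and not the code in crawler.py: to keep it runnable without third-party packages, `urllib.request` stands in for requests and `html.parser` stands in for BeautifulSoup. The names `TextExtractor`, `extract_text`, and `fetch_page`, and the User-Agent string, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse every run of whitespace into a single space.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the readable text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


def fetch_page(url, timeout=10):
    """Download a page with an HTTP GET request and return its readable text."""
    # Hypothetical User-Agent; some sites reject requests that send none.
    request = Request(url, headers={"User-Agent": "mini-search-engine-sketch"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_text(html)
```

Keeping `extract_text` separate from `fetch_page` means the cleaning logic can be tried on a plain HTML string without touching the network.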

indexer.py: This file builds the index, the core data structure that enables fast and relevant searches. We use TF-IDF (Term Frequency-Inverse Document Frequency) to rank each document by how relevant it is to the search terms: a term counts for more when it appears often in a document and in few documents overall. We pre-process documents into an index because scanning raw text at query time is too slow; with an index, queries can be matched to relevant content quickly.
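
To make the idea of TF-IDF ranking concrete, here is a small pure-Python sketch. It is not the code in indexer.py, which relies on TfidfVectorizer; it only illustrates the kind of computation involved, using one common smoothed IDF variant and cosine similarity for scoring. The function names and the whitespace tokenizer are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def weigh(tokens, idf):
    """Turn a token list into a sparse {term: tf * idf} vector."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """docs maps a URL to its text. Returns (idf, vectors)."""
    tokenized = {url: tokenize(text) for url, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed IDF: terms found in fewer documents get a larger weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = {url: weigh(tokens, idf) for url, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors, top_k=5):
    """Return up to top_k (url, score) pairs, best match first."""
    query_vector = weigh(tokenize(query), idf)
    scored = [(url, cosine(query_vector, vec)) for url, vec in vectors.items()]
    matches = [(url, score) for url, score in scored if score > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:top_k]
```

The index is built once up front with `build_index`; each query then only needs to be weighted and compared against the stored vectors, which is the point of pre-processing.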

main.py: This script controls the entire workflow of the search engine and provides the CLI for querying. It contains a small dictionary of URLs (this is our "web"), uses the crawler to fetch the content of each URL, and adds that content to the index.
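
The query loop of such a CLI might look roughly like the sketch below. It is an illustration, not the contents of main.py: `run_cli` and its `search_fn`, `read`, and `write` parameters are hypothetical, and the output format is an assumption. Only the prompt text is taken from this README.

```python
def run_cli(search_fn, read=input, write=print):
    """Prompt for queries until the user types 'exit'.

    search_fn takes a query string and returns a list of (url, score)
    pairs, best match first. read and write default to input and print
    and can be swapped out, for example in tests.
    """
    while True:
        query = read("Enter your search query (or 'exit'): ").strip()
        if query.lower() == "exit":
            break
        if not query:
            continue  # ignore empty input and prompt again
        results = search_fn(query)
        if not results:
            write("No matching pages found.")
            continue
        for rank, (url, score) in enumerate(results, start=1):
            write(f"{rank}. {url} (score: {score:.3f})")
```

Passing the search function in as an argument keeps the loop independent of how the index is built, so the crawler and indexer can change without touching the CLI.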

First-time Setup Guide (Linux / macOS)

Clone this repository

```
git clone https://github.com/JasonSu14/Mini_Search_Engine.git
cd Mini_Search_Engine
```

Create a virtual environment
```
python3 -m venv venv
```
Activate the virtual environment
```
source venv/bin/activate
```
Install required dependencies
```
pip install -r requirements.txt
```
Deactivate when you're done
```
deactivate
```

How to Run?

Run the code
```
python3 main.py
```
Then type in a search query like:

Enter your search query (or 'exit'): python language
See the results

The program then prints a list of the top matching pages along with their relevance scores.

For Developers

Every time you want to work on the project: source venv/bin/activate
When you're done: deactivate

Contributors

Jason Su ([email protected]) - Code & Implementation
