
- Problem Statement
- Dataset
- Workflow
- Setup
- Testing
- Dockerization
- Deployment
- Application
- Model Training & Evaluation
- Challenges & Solutions
- Impact
- Folder Structure
- License
- In the used car market, buyers and sellers often struggle to determine a fair price for their vehicle.
- This project aims to provide accurate and transparent pricing for used cars by analyzing real-world data.
- It helps both buyers and sellers make data-driven decisions and ensures fair transactions.
- To train the model, I collected real-world used car listings data directly from the Cars24 website.
- Since Cars24 uses dynamically loaded content, a static scraper would not capture all the data.
- Instead, I implemented an automated Selenium + BeautifulSoup Python Script.
Input : URL of a Cars24 listing page to scrape.
- The script uses ChromeDriverManager to install and manage the Chrome driver without manual setup.
- Loads the given URL in a real browser session.
- Scrolls down the page in increments, with short random pauses (2-4 seconds) between scrolls.
- This ensures all dynamically loaded listings are fetched.
- Stops scrolling when the bottom of the page is reached or no new content loads.
- Once fully loaded, it retrieves the complete DOM (including dynamically injected elements).
- Returns a BeautifulSoup object containing the entire page's HTML for later parsing and data extraction.
Note
At this stage, no data is extracted; the output is just the complete HTML source.
It is then passed to a separate function that extracts features like price, model, year, transmission, etc.
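A minimal sketch of the scraping function, assuming Selenium 4 with webdriver-manager and the lxml parser (the actual implementation in the repository may differ in its details):

import time, random
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def scrape_car_listing(url):
    # Launch a real Chrome session with an automatically managed driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down with short random pauses so lazy-loaded cards appear
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(2, 4))
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # bottom reached, no new content loaded
            break
        last_height = new_height
    # Hand the fully rendered DOM to BeautifulSoup for later parsing
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()
    return soup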
Input : BeautifulSoup object (soup) containing the fully-rendered HTML of a Cars24 listing page.
- Looks for <span> elements with class sc-braxZu kjFjan.
- Extracts the text using .text into a list called model_name.
- The code only keeps those models that start with "2" (listing titles begin with the car's year, e.g. "2016 Maruti Wagon R 1.0") and stores them in clean_model_name.
Important
Inspect the HTML element : <span id class="sc-braxZu kjFjan">2016 Maruti Wagon R 1.0</span>
Tag : <span> → id (empty) → class : sc-braxZu kjFjan (two classes, separated by a space)
However, when you hover over it in the browser's DevTools, it shows : span.sc-braxZu.kjFjan
CSS selectors use a dot (.) to indicate a class; the dot is not part of the class name itself.
This can be confusing for someone with little HTML/CSS knowledge, so it is worth clarifying.
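For example, both of the following BeautifulSoup calls target the same elements (a small illustration, not taken verbatim from the project code):

# Passing the class attribute value exactly as it appears in the HTML
spans = soup.find_all('span', class_='sc-braxZu kjFjan')

# Equivalent CSS-selector form, where each class is prefixed with a dot
spans = soup.select('span.sc-braxZu.kjFjan')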
- Looks for <p> elements with class sc-braxZu kvfdZL (each holds one specification value).
- These are appended to a specs list.
['69.95k km',
'Petrol',
'Manual',
'1st owner',
'DL-1C',
'70.72k km',
'Diesel',
'Manual',
'2nd owner',
'UP-14',
'15.96k km',
'CNG',
'Manual',
'1st owner',
'UP-16',...]
- The flat specs list is split into consecutive groups of 5 (clean_specs.append(specs[i:i+5])).
- Each group corresponds to one listing's set of specification values.
[['69.95k km', 'Petrol', 'Manual', '1st owner', 'DL-1C'],
['70.72k km', 'Diesel', 'Manual', '2nd owner', 'UP-14'],
['15.96k km', 'CNG', 'Manual', '1st owner', 'UP-16'],...]
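The grouping step can be written as a short loop, consistent with the clean_specs.append(specs[i:i+5]) call mentioned above (a sketch, not the verbatim notebook cell):

# Split the flat specs list into one 5-item group per listing
clean_specs = []
for i in range(0, len(specs), 5):
    clean_specs.append(specs[i:i+5])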
- From each 5-item group, the script extracts :
  - clean_specs[0] → km_driven
  - clean_specs[1] → fuel_type
  - clean_specs[2] → transmission
  - clean_specs[3] → owner
  - clean_specs[4] → number_plate (exists but is not used)
- soup.find_all('p', 'sc-braxZu cyPhJl') collects price elements into a price list.
- The script then slices price = price[2:], removing the first two entries (non-listing elements on the page), so the remaining prices align with the listings.
['₹3.09 lakh',
'₹5.71 lakh',
'₹7.37 lakh',...]
- soup.find_all('a', 'styles_carCardWrapper__sXLIp') collects the anchor tag for each card and extracts its href.
['https://www.cars24.com/buy-used-honda-amaze-2018-cars-noida-11068642783/',
'https://www.cars24.com/buy-used-ford-ecosport-2020-cars-noida-11234948707/',
'https://www.cars24.com/buy-used-tata-altroz-2024-cars-noida-10563348767/',...]
- All lists are assembled into a pandas.DataFrame.
- The column names are model_name, km_driven, fuel_type, transmission, owner, price and link.
- Finally, the function returns this DataFrame for further cleaning, analysis and modelling.
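Putting it together, the assembly at the end of get_car_details might look roughly like this (a sketch that assumes the lists described above are already built and aligned; the variable name link for the list of hrefs is illustrative):

import pandas as pd

# Assemble the extracted lists into a single DataFrame
df = pd.DataFrame({
    'model_name': clean_model_name,
    'km_driven': [group[0] for group in clean_specs],
    'fuel_type': [group[1] for group in clean_specs],
    'transmission': [group[2] for group in clean_specs],
    'owner': [group[3] for group in clean_specs],
    'price': price,   # text values like '₹3.09 lakh', already sliced with price[2:]
    'link': link      # hrefs collected from the card anchors
})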
Input : List of URLs for individual car listings (the link column from the previous DataFrame).
- Loops over the list of individual car listing page URLs.
- Uses the requests library to retrieve each page's HTML content.
- Adds a User-Agent header to simulate a real browser and reduce blocking risk.
- Applies a random delay (4-8 seconds) between requests to avoid overloading the server.
- Converts the response into a BeautifulSoup object using the lxml parser for fast, reliable parsing.
- Searches for all <p> tags with the class sc-braxZu jjIUAi.
- Checks if the text exactly matches "Engine capacity".
- If the label is found, grabs the value from the next sibling element (e.g. 1197 cc) and marks the entry as successfully found.
- If no engine capacity value is found, stores None to maintain positional consistency.
- Outputs a list of engine capacities in the same order as the input URLs.
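A condensed sketch of this function, assuming the requests, bs4 and lxml packages (the class name and sibling structure reflect the page layout at the time of scraping and may need updating):

import time, random, requests
from bs4 import BeautifulSoup

def get_engine_capacity(links):
    headers = {'User-Agent': 'Mozilla/5.0'}       # simulate a real browser
    engine_capacity = []
    for url in links:
        time.sleep(random.uniform(4, 8))          # polite delay between requests
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        found = False
        for label in soup.find_all('p', 'sc-braxZu jjIUAi'):
            if label.text.strip() == 'Engine capacity':
                engine_capacity.append(label.find_next_sibling().text)  # e.g. '1197 cc'
                found = True
                break
        if not found:
            engine_capacity.append(None)          # keep positions aligned with input URLs
    return engine_capacity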
# Parsing HTML Content of Hyderabad City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-hyderabad/')
# Extracting Car Details of Hyderabad City
hyderabad = get_car_details(soup)
# Parsing HTML Content of Bangalore City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-bangalore/')
# Extracting Car Details of Bangalore City
bangalore = get_car_details(soup)
# Parsing HTML Content of Mumbai City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-mumbai/')
# Extracting Car Details of Mumbai City
mumbai = get_car_details(soup)
# Parsing HTML Content of Delhi City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-delhi-ncr/')
# Extracting Car Details of Delhi City
delhi = get_car_details(soup)
# Concatenating Car Details of Different Cities into Single DataFrame
df = pd.concat([hyderabad, bangalore, mumbai, delhi], ignore_index=True)
df.head()
# Extracting engine capacity of each car using its car listing link from Cars24 Website
engine_capacity = get_engine_capacity(df['link'])
# Adding "engine_capacity" column in the DataFrame
df['engine_capacity'] = engine_capacity
# Final DataFrame after Web Scraping
df.head()
The final dataset consists of 2,800+ unique car listings, with each record containing :
- model_name : Model name of the car (2014 Hyundai Grand i10, etc).
- fuel_type : Type of fuel the car uses (Petrol, Diesel, CNG, Electric).
- transmission : Type of transmission the car has (Automatic or Manual).
- owner : Number of previous owners (1st owner, 2nd owner, 3rd owner, etc).
- engine_capacity : Size of the engine (in cc).
- km_driven : Total distance traveled by the car (in km).
- price : Selling price of the car (target variable).
Tip
Scraping code in the repository depends on the current structure of the target website.
Websites often update their HTML, element IDs or class names which can break the scraping logic.
So before running the scraper, inspect the website to ensure the HTML structure matches the code.
Update any selectors or parsing logic if the website has changed.

Follow these steps carefully to setup and run the project on your local machine :
First, you need to clone the project from GitHub to your local system.
git clone https://github.com/TheMrityunjayPathak/AutoIQ.git
Docker allows you to package the application with all its dependencies.
docker build -t your_image_name .
Tip
Make sure Docker is installed and running on your machine before executing this command.
This project uses a .env file to store configuration settings like model paths, allowed origins, etc.
- Stores environment variables in plain text.
# .env
ENV=environment_name
MAE=mean_absolute_error
PIPE_PATH=pipeline_path
MODEL_FREQ_PATH=model_freq_path
ALLOWED_ORIGINS=list_of_URLs_that_are_allowed_to_access_the_API
Important
Never commit .env to GitHub / Docker.
Add .env to .gitignore and .dockerignore to keep it private.
- Loads and validates environment variables from .env.
- Uses Pydantic BaseSettings to read environment variables, validate types and provide easy access.
# api/config.py
import os
from pathlib import Path
from typing import List
from pydantic_settings import BaseSettings
# Required Environment Variables
class Settings(BaseSettings):
ENV: str = "dev"
MAE: int
PIPE_PATH: Path
MODEL_FREQ_PATH: Path
ALLOWED_ORIGINS: str # Comma-separated
# Convert ALLOWED_ORIGINS string into a list
@property
def cors_origins(self) -> List[str]:
return [origin.strip() for origin in self.ALLOWED_ORIGINS.split(",")]
# Load .env locally (development), but skips in Render (deployment)
class Config:
env_file = ".env" if not os.getenv("RENDER") else None
# Create an object of Settings class
settings = Settings()
- Uses settings from config.py in FastAPI.
- Imports the settings object so the API reads its configuration (model paths, CORS origins) dynamically from .env.
# api/main.py
import pickle
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from api.config import settings
app = FastAPI(title="AutoIQ by Motor.co")
app.add_middleware(
CORSMiddleware,
allow_origins=settings.cors_origins,
allow_credentials=True,
allow_methods=["GET", "POST"],
allow_headers=["*"],
)
with open(settings.PIPE_PATH, "rb") as f:
pipe = pickle.load(f)
with open(settings.MODEL_FREQ_PATH, "rb") as f:
model_freq = pickle.load(f)
Start the application using Docker. This will run the FastAPI server and handle all the dependencies automatically.
docker run --env-file .env -p 8000:8000 your_image_name \
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
Note
- api.main → Refers to the main.py file inside the api folder.
- app → The FastAPI instance defined in your code.
- --reload → Automatically reloads when code changes (development only).
Once the container is running, open your browser and navigate to :
http://localhost:8000/docs
or
http://127.0.0.1:8000/docs
This opens the Swagger UI for testing the API endpoints.
Access the live API here or Click on the Image below.
When you're done using the application, stop the running container (use docker ps to get its container ID).
docker stop container_id
Once the FastAPI server is running, you can test the API endpoints in Postman or any similar software.
- Launch the Postman application on your computer.
- Click on the "New" button, then select "HTTP" requests.

- Retrieve information from the server without modifying any data.
- Open Postman and create a new request.
- Set the HTTP method to "GET" from the dropdown menu.
- Enter the endpoint URL you want to query.
http://127.0.0.1:8000
- Click the "Send" button to submit the request.
- Status Code : It indicates that the request was successful and the server responded with the requested data.
200 OK
- Response Body (JSON) : This confirms that the API is running and returns the result of your API call.
{
"message":"Pipeline is live"
}

- Send data to a server to create/update a resource.
- Open Postman and create a new request.
- Set the HTTP method to "POST" from the dropdown menu.
- Enter the endpoint URL you want to query.
http://127.0.0.1:8000/predict
- Navigate to the "Headers" tab and add the following : Key → Content-Type, Value → application/json
- Go to the "Body" tab, Select "raw", then choose "JSON" from the format dropdown menu.
- Enter the request payload in JSON format.
{
"brand": "MG",
"model": "HECTOR",
"km_driven": 80000,
"engine_capacity": 1498,
"fuel_type": "Petrol",
"transmission": "Manual",
"year": 2022,
"owner": "1st owner"
}
- Click the "Send" button to submit the request.
- Status Code : It indicates that the server successfully processed the request and generated a prediction.
200 OK
- Response Body (JSON) : This confirms that the API is running and returns the result of your API call.
{
"output": "₹9,69,000 to ₹11,50,000"
}
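If you prefer the command line over Postman, the same checks can be made with the Python requests library (assuming the container is running locally on port 8000):

import requests

payload = {
    "brand": "MG",
    "model": "HECTOR",
    "km_driven": 80000,
    "engine_capacity": 1498,
    "fuel_type": "Petrol",
    "transmission": "Manual",
    "year": 2022,
    "owner": "1st owner"
}

# Health check (GET) followed by a prediction request (POST)
print(requests.get("http://127.0.0.1:8000").json())
response = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(response.status_code, response.json())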

Follow these steps carefully to containerize your project with Docker :
- Before starting, make sure Docker is installed on your system.
- Visit Docker → Click on Download Docker Desktop → Choose Windows / Mac / Linux

- Open Docker Desktop → Make sure Docker Engine is Running

- Create a Dockerfile and place it in the root folder of your Repository.
# Start with the official Python 3.11 image.
# -slim means this is a smaller Debian-based image with fewer preinstalled packages, which makes it lighter.
FROM python:3.11-slim
# Install required system packages for Python libraries.
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
python3-dev \
libopenblas-dev \
liblapack-dev \
gfortran \
&& rm -rf /var/lib/apt/lists/*
# Set the working directory to /app inside the container.
# All future commands (COPY, RUN, CMD) will be executed from here.
WORKDIR /app
# Copies your local requirements.txt into the container's /app folder.
COPY requirements.txt .
# Install all the dependencies from requirements.txt.
# --no-cache-dir prevents pip from keeping installation caches, making the image smaller.
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Copies all the remaining project files (FastAPI code, HTML, CSS, JS, etc.) into /app.
COPY . .
# Expose FastAPI port, so it can be accessed from outside the container.
EXPOSE 8000
# Default command to run the FastAPI app with Uvicorn in production mode.
# --host 0.0.0.0 allows external connections (necessary in Docker).
# --port 8000 specifies the port.
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
- This file tells Docker which files and folders to exclude from the image.
- This keeps the image small and prevents unnecessary files from being copied.
- A .dockerignore file is used to exclude all files and folders that are not required to run your application.
# Virtual Environment
.venv/
# Jupyter Notebooks
*.ipynb
# Jupyter Notebook Checkpoints
.ipynb_checkpoints/
# Python Cache
__pycache__/
*.pyc
*.pyo
*.pyd
# Environment File
.env
*.env
# Dataset (Parquet & CSV Files)
*.parquet
*.csv
# Python Package (utils)
utils/
# Local/Temporary Files
*.log
*.tmp
*.bak
# Version Control Files
.git/
.gitignore
# IDE/Editor Configs
.vscode/
.idea/
.DS_Store
# Python Package Build Artifacts
*.egg-info/
build/
dist/
- A Docker image is essentially a read-only template that contains everything needed to run an application.
- You can think of a Docker image as a blueprint or snapshot of an environment. It doesn't run anything.
docker build -t your_image_name .
- When you run a Docker image, it becomes a Docker container.
- It is a live instance of that image, running your application in an isolated environment.
docker run --env-file .env -p 8000:8000 your_image_name \
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
docker run --env-file .env -p 8000:8000 your_image_name
- After the container starts, you can access your API.
http://localhost:8000
or
http://127.0.0.1:8000
- Once your Docker image is ready, you can push it to Docker Hub.
- It allows anyone to pull and run it without building it themselves.
Access the Docker Hub here or Click on the Image below.
- Prompts you to enter your Docker Hub username and password.
- This authenticates your local Docker client with your Docker Hub account.
docker login
- Tagging prepares the image for upload to Docker Hub.
docker tag your_image_name your-dockerhub-username/your_image_name:latest
- Uploads your image to your Docker Hub Repository.
- Once pushed, your image is publicly available.
- Anyone can now pull and run the image without building it locally.
docker push your-dockerhub-username/your_image_name:latest
- Once pushed, anyone can pull your image from Docker Hub and run it.
- This ensures that the application behaves the same way across all systems.
docker pull your-dockerhub-username/your_image_name:latest
- After pulling the Docker image, you can run it to create Docker container from it.
docker run --env-file .env -p 8000:8000 your-dockerhub-username/your_image_name:latest
- Lists all the running containers with their container_id.
docker ps
- Stops the running container safely. The container_id can be obtained from the docker ps output.
docker stop container_id
Follow these steps carefully to deploy your FastAPI application on Render :

- Link your GitHub Repository / Existing Docker Image

- Add details about your API

- Add Environment Variables in Render Dashboard (same as .env)

- Deploy the Web Service
The frontend application files are in the project root :
- index.html → Defines the structure and layout of the web page.
- style.css → Handles the visual appearance of the web page.
- script.js → Communicates between the web page and the API.
You can open index.html directly in your browser or serve it via a local HTTP server (like VS Code Live Server).
Note
Remember to update the API URL in script.js when deploying on GitHub Pages to get real-time predictions.
Change from :
const fetchPromise = fetch("http://127.0.0.1:8000/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(data),
});
To :
const fetchPromise = fetch("https://your_api_name.onrender.com/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(data),
});
Access the live Website here or Click on the Image below.
Important
The API for this project is deployed using the free tier on Render.
As a result, it may go to sleep after periods of inactivity.
Please start the API first by visiting the API URL. Then, navigate to the website to make predictions.
If the API was inactive, the first prediction may take a few seconds while the server spins back up.
# Importing load_parquet function from read_data module
from read_data import load_parquet
cars = load_parquet('clean_data', 'clean_data_after_eda.parquet')
cars.head()
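The notebook's import cell is not shown in this README; the modelling cells below assume imports along these lines (a sketch, not the verbatim cell):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_validate, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from xgboost import XGBRegressor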
# Creating Features and Target Variable
X = cars.drop('price', axis=1)
y = cars['price']
# Splitting Data into Training and Testing Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Pipeline for Nominal Column
nominal_cols = ['fuel_type','transmission','brand']
nominal_trf = Pipeline(steps=[
('ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
# Pipeline for Ordinal Column
ordinal_cols = ['owner']
ordinal_categories = [['Others','3rd owner','2nd owner','1st owner']]
ordinal_trf = Pipeline(steps=[
('oe', OrdinalEncoder(categories=ordinal_categories))
])
# Pipeline for Numerical Column
numerical_cols = ['km_driven','year','engine_capacity']
numerical_trf = Pipeline(steps=[
('scaler', RobustScaler())
])
# Adding Everything into ColumnTransformer
ctf = ColumnTransformer(transformers=[
('nominal', nominal_trf, nominal_cols),
('ordinal', ordinal_trf, ordinal_cols),
('scaling', numerical_trf, numerical_cols)
], remainder='passthrough', n_jobs=-1)
# Models Dictionary
models = {
'LR' : LinearRegression(n_jobs=-1),
'KNN' : KNeighborsRegressor(n_jobs=-1),
'DT' : DecisionTreeRegressor(random_state=42),
'RF' : RandomForestRegressor(random_state=42, n_jobs=-1),
'GB' : GradientBoostingRegressor(random_state=42),
'XGB' : XGBRegressor(random_state=42, n_jobs=-1)
}
# Computing Average Error and R2-Score through Cross-Validation
results = {}
for name, model in models.items():
pipe = Pipeline(steps=[
('preprocessor', ctf),
('model', model)
])
k = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(estimator=pipe, X=X_train, y=y_train, cv=k, scoring={'mae':'neg_mean_absolute_error','r2':'r2'}, n_jobs=-1, return_train_score=False)
results[name] = {'avg_error': -cv_results['test_mae'].mean(),'avg_score': cv_results['test_r2'].mean()}
print()
print(f'Model : {name}')
print('-'*40)
    print(f"Average Error : {-cv_results['test_mae'].mean():.2f}")
    print(f"Standard Deviation of Error : {cv_results['test_mae'].std():.2f}")
    print(f"Average R2-Score : {cv_results['test_r2'].mean():.2f}")
    print(f"Standard Deviation of R2-Score : {cv_results['test_r2'].std():.2f}")
Model : LR
----------------------------------------
Average Error : 123190.02
Standard Deviation of Error : 6445.18
Average R2-Score : 0.77
Standard Deviation of R2-Score : 0.01
Model : KNN
----------------------------------------
Average Error : 115572.16
Standard Deviation of Error : 3883.19
Average R2-Score : 0.79
Standard Deviation of R2-Score : 0.00
Model : DT
----------------------------------------
Average Error : 118466.64
Standard Deviation of Error : 4490.62
Average R2-Score : 0.76
Standard Deviation of R2-Score : 0.03
Model : RF
----------------------------------------
Average Error : 90811.20
Standard Deviation of Error : 2335.09
Average R2-Score : 0.86
Standard Deviation of R2-Score : 0.01
Model : GB
----------------------------------------
Average Error : 98056.52
Standard Deviation of Error : 3001.29
Average R2-Score : 0.85
Standard Deviation of R2-Score : 0.01
Model : XGB
----------------------------------------
Average Error : 91595.94
Standard Deviation of Error : 2640.02
Average R2-Score : 0.86
Standard Deviation of R2-Score : 0.02
# Plotting Metric Comparison Graph
results_df = pd.DataFrame(results)
fig, ax = plt.subplots(ncols=1, nrows=2, figsize=(12,8))
sns.barplot(x=results_df.iloc[0,:].sort_values().index.to_list(), y=results_df.iloc[0,:].sort_values().values, ax=ax[0])
ax[0].set_title('Average Error Comparison (Lower is Better)')
ax[0].set_ylabel('Error')
for container in ax[0].containers:
ax[0].bar_label(container, fmt='%.0f')
sns.barplot(x=results_df.iloc[1,:].sort_values().index.to_list(), y=results_df.iloc[1,:].sort_values().values, ax=ax[1])
ax[1].set_title('Average R2-Score Comparison (Higher is Better)')
ax[1].set_ylabel('R2-Score')
for container in ax[1].containers:
ax[1].bar_label(container, fmt='%.2f')
plt.tight_layout()
plt.show()

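The stacking cell below reuses rf, xgb, gb and meta_model, which are defined earlier in the notebook but not shown here. A plausible setup, inferred from the alpha and l1_ratio hyperparameters tuned later (treat it as an assumption, not the exact notebook code):

from sklearn.linear_model import ElasticNet

# Base learners: the top performers from cross-validation
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
gb = GradientBoostingRegressor(random_state=42)
xgb = XGBRegressor(random_state=42, n_jobs=-1)

# Meta-learner; ElasticNet matches the alpha / l1_ratio parameters tuned below
meta_model = ElasticNet()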
# Assigning Base Model for StackingRegressor
base_model = [('rf', rf),('xgb', xgb),('gb', gb)]
# Structure of StackingRegressor
stack = StackingRegressor(
estimators=base_model,
final_estimator=meta_model,
passthrough=False,
cv=k, n_jobs=-1
)
# Final Pipeline with StackingRegressor
pipe = Pipeline(steps=[
('preprocessor', ctf),
('model', stack)
])
# Average Error and R2-Score through Cross-Validation
cv_results = cross_validate(estimator=pipe, X=X_train, y=y_train, cv=k, scoring={'mae':'neg_mean_absolute_error','r2':'r2'}, n_jobs=-1)
print(f"Average Error : {-cv_results['test_mae'].mean():.2f}")
print(f"Standard Deviatacion of Error : {cv_results['test_mae'].std():.2f}")
print(f"Average R2-Score : {cv_results['test_r2'].mean():.2f}")
print(f"Standard Deviation of R2-Score : {cv_results['test_r2'].std():.2f}")
Average Error : 87885.34
Standard Deviation of Error : 1279.54
Average R2-Score : 0.87
Standard Deviation of R2-Score : 0.01

Figures : R2-Score Curve | Error Curve
# Parameter Distribution
param_dist = {
'model__rf__n_estimators': [200, 300],
'model__rf__max_depth': [10, 20],
'model__rf__min_samples_leaf': [3, 5],
'model__rf__min_samples_split': [5, 7],
'model__xgb__n_estimators': [200, 300],
'model__xgb__learning_rate': [0.05, 0.1],
'model__xgb__max_depth': [2, 4],
'model__xgb__subsample': [0.5, 0.75],
'model__xgb__colsample_bytree': [0.5, 0.75],
'model__gb__n_estimators': [100, 200],
'model__gb__learning_rate': [0.05, 0.1],
'model__gb__max_depth': [2, 4],
'model__gb__subsample': [0.5, 0.75],
'model__final_estimator__alpha': [0.1, 10.0],
'model__final_estimator__l1_ratio': [0.0, 1.0]
}
# RandomizedSearch Object with Cross-Validation
rcv = RandomizedSearchCV(estimator=pipe, param_distributions=param_dist, cv=k, scoring='neg_mean_absolute_error', n_iter=30, n_jobs=-1, random_state=42)
# Fitting the RandomizedSearch Object
rcv.fit(X_train, y_train)
# Best Estimator
best_model = rcv.best_estimator_
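As a final check, the tuned pipeline can be evaluated on the held-out test set with standard scikit-learn metrics (a sketch; this step is not shown in the excerpt above):

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate the tuned stacked model on the untouched test set
y_pred = best_model.predict(X_test)
print(f"Test MAE : {mean_absolute_error(y_test, y_pred):.2f}")
print(f"Test R2-Score : {r2_score(y_test, y_pred):.2f}")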
Figures : Before Tuning vs After Tuning | R2-Score Curve (Before/After Tuning) | Error Curve (Before/After Tuning)
- I wanted to use real-world data instead of a pre-cleaned toy dataset, as it represents messy, real-life scenarios.
- However, Cars24 loads its content dynamically using JS, meaning a simple HTTP request won't be enough.
- I used Selenium to simulate a real browser, ensuring that the page was fully loaded before scraping.
- Once the content was fully rendered, I used BeautifulSoup to efficiently parse the HTML.
- This approach allowed me to reliably capture all the details.
- The raw scraped dataset was large, taking up unnecessary space.
- Loading it repeatedly during experimentation became inefficient.
- I optimized memory consumption by downcasting data types to reduce memory usage.
- I also stored the dataset in Parquet format, which compresses data without losing information.
- It allows for much faster read/write speeds as compared to CSV.
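As an illustration, the optimization can be as simple as the following (a sketch with assumed column names, not the exact notebook code):

# Downcast numeric columns and convert low-cardinality strings to 'category'
cars['km_driven'] = pd.to_numeric(cars['km_driven'], downcast='integer')
cars['price'] = pd.to_numeric(cars['price'], downcast='integer')
for col in ['fuel_type', 'transmission', 'owner', 'brand']:
    cars[col] = cars[col].astype('category')

# Parquet is compressed and preserves dtypes, unlike CSV
cars.to_parquet('clean_data/clean_data.parquet', index=False)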
- If preprocessing is applied to the entire dataset, test data can leak into the training process.
- This creates overly optimistic results and reduces the model's ability to generalize.
- I implemented a Scikit-learn Pipeline and ColumnTransformer so that preprocessing is fitted only on the training data.
- This keeps the test data completely unseen during preprocessing, preventing data leakage.
- Even after building the machine learning pipeline, it remained offline and could only be used locally.
- There was no way to provide inputs and get predictions over the web or from other applications.
- The model was dependent on the local system and could not serve predictions to external users or services.
- I deployed the machine learning model as an API using FastAPI.
- This allows users and applications to send inputs and receive predictions online in real time.
- I added a /predict endpoint for serving predictions and a /health endpoint to monitor API status.
- I also implemented rate limiting and input validation to prevent misuse and ensure stability under load.
- These measures made the model accessible, reliable and ready for production use.
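A simplified sketch of the request validation and health endpoint, extending the api/main.py excerpt shown earlier (field names mirror the example payload; how model_freq and MAE are used here is an assumption, and rate limiting, e.g. via a library such as slowapi, is omitted):

import pandas as pd
from pydantic import BaseModel

# Input schema mirroring the example payload; FastAPI rejects malformed requests automatically
class CarInput(BaseModel):
    brand: str
    model: str
    km_driven: int
    engine_capacity: int
    fuel_type: str
    transmission: str
    year: int
    owner: str

@app.get("/health")
def health():
    # Lightweight endpoint for uptime monitoring
    return {"status": "ok"}

@app.post("/predict")
def predict(car: CarInput):
    X = pd.DataFrame([car.model_dump()])
    # Assumption: model_freq maps model names to the frequency encoding used in training
    X['model'] = X['model'].map(model_freq).fillna(0)
    estimate = pipe.predict(X)[0]
    # Assumption: the MAE from .env is used to report a price range around the point estimate
    return {"output": f"₹{estimate - settings.MAE:,.0f} to ₹{estimate + settings.MAE:,.0f}"}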
- Even if the API works perfectly fine, non-technical users may still find it difficult to test and use.
- This limits accessibility.
- I created an HTML/CSS/JS frontend that sends requests to the API and displays predictions instantly.
- I also included an example payload in Swagger UI, so that users can test it without any extra effort.
- Installing dependencies and setting up the environment manually is time-consuming and error-prone.
- This is especially true when running on different machines with different operating systems.
- This also made sharing the project with others more difficult, as they would have to replicate the exact setup.
- I created a lightweight Dockerfile based on the python:3.11-slim image.
- It builds the FastAPI application, installs dependencies and copies only the required files into the final image.
- I used a .dockerignore file to exclude unnecessary files, keeping the image small and deployment fast.
- This allowed me to run the project consistently on any system with Docker installed.
- It eliminates the hassle of worrying about dependency mismatches or operating system specific issues.
- Same Docker image can be used to deploy on Render, Docker Hub or run locally with a single docker command.
- Built and deployed a complete machine learning pipeline as a FastAPI service.
- Enabled real-time used car price prediction from a dataset of 2,800+ cars.
- Reduced dataset memory usage by 90% through data type optimization.
- Converted dataset to Parquet format, significantly improving preprocessing speed.
- Used cross-validation to evaluate multiple regression models.
- Ensured only top performers made it to production.
- Delivered 30% lower MAE and 12% higher R2-Score compared to the baseline model.
- Achieved these gains by implementing stacking ensemble and hyperparameter tuning.
- Improved model stability by 70%, ensuring more consistent and reliable output in production.
.
├── api/ # FastAPI Code for making Predictions
│ ├── main.py
│ └── config.py
│
├── clean_data/ # Cleaned Dataset (Parquet Format)
│ ├── clean_data.parquet
│ └── ...
│
├── images/ # Images for Frontend Interface
│ ├── favicon.png
│ └── hero_image.png
│
├── models/ # Serialized Components for Prediction
│ ├── pipe.pkl
│ └── model_freq.pkl
│
├── notebooks/ # Jupyter Notebooks for Project Development
│ ├── data_cleaning.ipynb
│ └── ...
│
├── scrape_code/ # Web Scraping Notebook
│ └── scrape_code.ipynb
│
├── scrape_data/ # Scraped Dataset (CSV Format)
│ └── scrape_data.csv
│
├── utils/ # Reusable Python Functions (utils Package)
│ ├── __init__.py
│ ├── web_scraping.py
│ ├── helpers.py
│ └── ...
│
├── .dockerignore # All files and folders ignored by Docker while building Docker Image
├── .gitignore # All files and folders ignored by Git while pushing code to GitHub
├── Dockerfile # Instructions for building the Docker Image
├── index.html # Frontend HTML File
├── style.css # Frontend CSS File
├── script.js # Frontend JS File
├── requirements.txt # List of required libraries for the Project
├── LICENSE # License specifying permissions and usage rights
└── README.md # Detailed documentation of the Project
This project is licensed under the MIT License. You are free to use and modify the code as needed.