
- Problem Statement
- Dataset
- Workflow
- Setup
- Testing
- Dockerization
- Deployment
- Application
- Model Training & Evaluation
- Challenges & Solutions
- Impact
- Folder Structure
- License
- In the used car market, buyers and sellers often struggle to determine a fair price for their vehicle.
- This project aims to provide accurate and transparent pricing for used cars by analyzing real-world data.
- It helps both buyers and sellers make data-driven decisions and ensures fair transactions.
- To train the model, I collected real-world used car listings data directly from the Cars24 website.
- Since Cars24 uses dynamically loaded content, a static scraper would not capture all the data.
- Instead, I implemented an automated Selenium + BeautifulSoup Python Script.
Input : URL of a Cars24 listing page to scrape.
- The script uses ChromeDriverManager to install and manage the Chrome driver without manual setup.
- Loads the given URL in a real browser session.
- Scrolls down the page in increments, with short random pauses (2-4 seconds) between scrolls.
- This ensures all dynamically loaded listings are fetched.
- Stops scrolling when the bottom of the page is reached or no new content loads.
- Once fully loaded, it retrieves the complete DOM (including dynamically injected elements).
- Returns a BeautifulSoup object containing the entire page's HTML for later parsing and data extraction.
Note
At this stage, no data is extracted; the output is just the complete HTML source.
It is then passed to a separate function that extracts features like price, model, year, transmission, etc.
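A minimal sketch of the scraping function, assuming Selenium 4 with webdriver-manager and the lxml parser (the actual implementation in the repository may differ in its details):

import time, random
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def scrape_car_listing(url):
    # Launch a real Chrome session with an automatically managed driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down with short random pauses so lazy-loaded cards appear
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(2, 4))
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # bottom reached, no new content loaded
            break
        last_height = new_height
    # Hand the fully rendered DOM to BeautifulSoup for later parsing
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()
    return soup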
Input : BeautifulSoup object (soup) containing the fully-rendered HTML of a Cars24 listing page.
- Looks for <span> elements with class sc-braxZu kjFjan.
- Extracts the text using .text into a list called model_name.
- The code only keeps those models that start with "2" (listing titles begin with the car's year, e.g. "2016 Maruti Wagon R 1.0") and stores them in clean_model_name.
Important
Inspect the HTML element : <span id class="sc-braxZu kjFjan">2016 Maruti Wagon R 1.0</span>
Tag : <span> → id (empty) → class : sc-braxZu kjFjan (two classes, separated by a space)
However, when you hover over it in the browser's DevTools, it shows : span.sc-braxZu.kjFjan
CSS selectors use a dot (.) to indicate a class; the dot is not part of the class name itself.
This can be confusing for someone with little HTML/CSS knowledge, so it is worth clarifying.
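For example, both of the following BeautifulSoup calls target the same elements (a small illustration, not taken verbatim from the project code):

# Passing the class attribute value exactly as it appears in the HTML
spans = soup.find_all('span', class_='sc-braxZu kjFjan')

# Equivalent CSS-selector form, where each class is prefixed with a dot
spans = soup.select('span.sc-braxZu.kjFjan')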
- Looks for <p> elements with class sc-braxZu kvfdZL (each holds one specification value).
- These are appended to a specs list.
['69.95k km',
'Petrol',
'Manual',
'1st owner',
'DL-1C',
'70.72k km',
'Diesel',
'Manual',
'2nd owner',
'UP-14',
'15.96k km',
'CNG',
'Manual',
'1st owner',
'UP-16',...]
- The flat specs list is split into consecutive groups of 5 (clean_specs.append(specs[i:i+5])).
- Each group corresponds to one listing's set of specification values.
[['69.95k km', 'Petrol', 'Manual', '1st owner', 'DL-1C'],
['70.72k km', 'Diesel', 'Manual', '2nd owner', 'UP-14'],
['15.96k km', 'CNG', 'Manual', '1st owner', 'UP-16'],...]
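The grouping step can be written as a short loop, consistent with the clean_specs.append(specs[i:i+5]) call mentioned above (a sketch, not the verbatim notebook cell):

# Split the flat specs list into one 5-item group per listing
clean_specs = []
for i in range(0, len(specs), 5):
    clean_specs.append(specs[i:i+5])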
- From each 5-item group, the script extracts :
  - clean_specs[0] → km_driven
  - clean_specs[1] → fuel_type
  - clean_specs[2] → transmission
  - clean_specs[3] → owner
  - clean_specs[4] → number_plate (exists but is not used)
- soup.find_all('p', 'sc-braxZu cyPhJl') collects price elements into a price list.
- The script then slices price = price[2:], removing the first two entries (non-listing elements on the page), so the remaining prices align with the listings.
['₹3.09 lakh',
'₹5.71 lakh',
'₹7.37 lakh',...]
- soup.find_all('a', 'styles_carCardWrapper__sXLIp') collects the anchor tag for each card and extracts its href.
['https://www.cars24.com/buy-used-honda-amaze-2018-cars-noida-11068642783/',
'https://www.cars24.com/buy-used-ford-ecosport-2020-cars-noida-11234948707/',
'https://www.cars24.com/buy-used-tata-altroz-2024-cars-noida-10563348767/',...]
- All lists are assembled into a pandas.DataFrame.
- The column names are model_name, km_driven, fuel_type, transmission, owner, price and link.
- Finally, the function returns this DataFrame for further cleaning, analysis and modelling.
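Putting it together, the assembly at the end of get_car_details might look roughly like this (a sketch that assumes the lists described above are already built and aligned; the variable name link for the list of hrefs is illustrative):

import pandas as pd

# Assemble the extracted lists into a single DataFrame
df = pd.DataFrame({
    'model_name': clean_model_name,
    'km_driven': [group[0] for group in clean_specs],
    'fuel_type': [group[1] for group in clean_specs],
    'transmission': [group[2] for group in clean_specs],
    'owner': [group[3] for group in clean_specs],
    'price': price,   # text values like '₹3.09 lakh', already sliced with price[2:]
    'link': link      # hrefs collected from the card anchors
})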
Input : List of URLs for individual car listings (the link column from the previous DataFrame).
- Loops over the list of individual car listing page URLs.
- Uses the requests library to retrieve each page's HTML content.
- Adds a User-Agent header to simulate a real browser and reduce blocking risk.
- Applies a random delay (4-8 seconds) between requests to avoid overloading the server.
- Converts the response into a BeautifulSoup object using the lxml parser for fast, reliable parsing.
- Searches for all <p> tags with the class sc-braxZu jjIUAi.
- Checks if the text exactly matches "Engine capacity".
- If the label is found, grabs the value from the next sibling element (e.g. 1197 cc) and marks the entry as successfully found.
- If no engine capacity value is found, stores None to maintain positional consistency.
- Outputs a list of engine capacities in the same order as the input URLs.
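A condensed sketch of this function, assuming the requests, bs4 and lxml packages (the class name and sibling structure reflect the page layout at the time of scraping and may need updating):

import time, random, requests
from bs4 import BeautifulSoup

def get_engine_capacity(links):
    headers = {'User-Agent': 'Mozilla/5.0'}       # simulate a real browser
    engine_capacity = []
    for url in links:
        time.sleep(random.uniform(4, 8))          # polite delay between requests
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        found = False
        for label in soup.find_all('p', 'sc-braxZu jjIUAi'):
            if label.text.strip() == 'Engine capacity':
                engine_capacity.append(label.find_next_sibling().text)  # e.g. '1197 cc'
                found = True
                break
        if not found:
            engine_capacity.append(None)          # keep positions aligned with input URLs
    return engine_capacity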
# Parsing HTML Content of Hyderabad City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-hyderabad/')
# Extracting Car Details of Hyderabad City
hyderabad = get_car_details(soup)
# Parsing HTML Content of Bangalore City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-bangalore/')
# Extracting Car Details of Bangalore City
bangalore = get_car_details(soup)
# Parsing HTML Content of Mumbai City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-mumbai/')
# Extracting Car Details of Mumbai City
mumbai = get_car_details(soup)
# Parsing HTML Content of Delhi City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-delhi-ncr/')
# Extracting Car Details of Delhi City
delhi = get_car_details(soup)
# Concatenating Car Details of Different Cities into Single DataFrame
df = pd.concat([hyderabad, bangalore, mumbai, delhi], ignore_index=True)
df.head()
# Extracting engine capacity of each car using its car listing link from Cars24 Website
engine_capacity = get_engine_capacity(df['link'])
# Adding "engine_capacity" column in the DataFrame
df['engine_capacity'] = engine_capacity
# Final DataFrame after Web Scraping
df.head()
The final dataset consists of 2,800+ unique car listings, with each record containing :
- model_name : Model name of the car (2014 Hyundai Grand i10, etc).
- fuel_type : Type of fuel the car uses (Petrol, Diesel, CNG, Electric).
- transmission : Type of transmission the car has (Automatic or Manual).
- owner : Number of previous owners (1st owner, 2nd owner, 3rd owner, etc).
- engine_capacity : Size of the engine (in cc).
- km_driven : Total distance traveled by the car (in km).
- price : Selling price of the car (target variable).
Tip
Scraping code in the repository depends on the current structure of the target website.
Websites often update their HTML, element IDs or class names which can break the scraping logic.
So before running the scraper, inspect the website to ensure the HTML structure matches the code.
Update any selectors or parsing logic if the website has changed.

Follow these steps carefully to setup and run the project on your local machine :
First, you need to clone the project from GitHub to your local system.
git clone https://github.com/TheMrityunjayPathak/AutoIQ.git
Docker allows you to package the application with all its dependencies.
docker build -t your_image_name .
Tip
Make sure Docker is installed and running on your machine before executing this command.
This project uses a .env file to store configuration settings like model paths, allowed origins, etc.
- Stores environment variables in plain text.
# .env
ENV=environment_name
MAE=mean_absolute_error
PIPE_PATH=pipeline_path
MODEL_FREQ_PATH=model_freq_path
ALLOWED_ORIGINS=list_of_URLs_that_are_allowed_to_access_the_API
Important
Never commit .env to GitHub / Docker.
Add .env to .gitignore and .dockerignore to keep it private.
- Loads and validates environment variables from .env.
- Uses Pydantic BaseSettings to read environment variables, validate types and provide easy access.
# api/config.py
import os
from pathlib import Path
from typing import List
from pydantic_settings import BaseSettings
# Required Environment Variables
class Settings(BaseSettings):
ENV: str = "dev"
MAE: int
PIPE_PATH: Path
MODEL_FREQ_PATH: Path
ALLOWED_ORIGINS: str # Comma-separated
# Convert ALLOWED_ORIGINS string into a list
@property
def cors_origins(self) -> List[str]:
return [origin.strip() for origin in self.ALLOWED_ORIGINS.split(",")]
# Load .env locally (development), but skips in Render (deployment)
class Config:
env_file = ".env" if not os.getenv("RENDER") else None
# Create an object of Settings class
settings = Settings()
- Uses settings from config.py in FastAPI.
- Imports the settings object so the API reads its configuration (model paths, CORS origins) dynamically from .env.
# api/main.py
import pickle
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from api.config import settings
app = FastAPI(title="AutoIQ by Motor.co")
app.add_middleware(
CORSMiddleware,
allow_origins=settings.cors_origins,
allow_credentials=True,
allow_methods=["GET", "POST"],
allow_headers=["*"],
)
with open(settings.PIPE_PATH, "rb") as f:
pipe = pickle.load(f)
with open(settings.MODEL_FREQ_PATH, "rb") as f:
model_freq = pickle.load(f)
Start the application using Docker. This will run the FastAPI server and handle all the dependencies automatically.
docker run --env-file .env -p 8000:8000 your_image_name \
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
Note
- api.main → Refers to the main.py file inside the api folder.
- app → The FastAPI instance defined in your code.
- --reload → Automatically reloads when code changes (development only).
Once the container is running, open your browser and navigate to :
http://localhost:8000/docs
or
http://127.0.0.1:8000/docs
This opens the Swagger UI for testing the API endpoints.
Access the live API here or Click on the Image below.
When you're done using the application, stop the running container (use docker ps to get its container ID).
docker stop container_id
Once the FastAPI server is running, you can test the API endpoints in Postman or any similar software.
- Launch the Postman application on your computer.
- Click on the "New" button, then select "HTTP" requests.

- Retrieve information from the server without modifying any data.
- Open Postman and create a new request.
- Set the HTTP method to "GET" from the dropdown menu.
- Enter the endpoint URL you want to query.
http://127.0.0.1:8000
- Click the "Send" button to submit the request.
- Status Code : It indicates that the request was successful and the server responded with the requested data.
200 OK
- Response Body (JSON) : This confirms that the API is running and returns the result of your API call.
{
"message":"Pipeline is live"
}

- Send data to a server to create/update a resource.
- Open Postman and create a new request.
- Set the HTTP method to "POST" from the dropdown menu.
- Enter the endpoint URL you want to query.
http://127.0.0.1:8000/predict
- Navigate to the "Headers" tab and add the following : Key → Content-Type, Value → application/json
- Go to the "Body" tab, Select "raw", then choose "JSON" from the format dropdown menu.
- Enter the request payload in JSON format.
{
"brand": "MG",
"model": "HECTOR",
"km_driven": 80000,
"engine_capacity": 1498,
"fuel_type": "Petrol",
"transmission": "Manual",
"year": 2022,
"owner": "1st owner"
}
- Click the "Send" button to submit the request.
- Status Code : It indicates that the server successfully processed the request and generated a prediction.
200 OK
- Response Body (JSON) : This confirms that the API is running and returns the result of your API call.
{
"output": "₹9,69,000 to ₹11,50,000"
}
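If you prefer the command line over Postman, the same checks can be made with the Python requests library (assuming the container is running locally on port 8000):

import requests

payload = {
    "brand": "MG",
    "model": "HECTOR",
    "km_driven": 80000,
    "engine_capacity": 1498,
    "fuel_type": "Petrol",
    "transmission": "Manual",
    "year": 2022,
    "owner": "1st owner"
}

# Health check (GET) followed by a prediction request (POST)
print(requests.get("http://127.0.0.1:8000").json())
response = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(response.status_code, response.json())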

Follow these steps carefully to containerize your project with Docker :
- Before starting, make sure Docker is installed on your system.
- Visit Docker → Click on Download Docker Desktop → Choose Windows / Mac / Linux

- Open Docker Desktop → Make sure Docker Engine is Running

- Create a Dockerfile and place it in the root folder of your Repository.
# Start with the official Python 3.11 image.
# -slim means this is a smaller Debian-based image with fewer preinstalled packages, which makes it lighter.
FROM python:3.11-slim
# Install required system packages for Python libraries.
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
python3-dev \
libopenblas-dev \
liblapack-dev \
gfortran \
&& rm -rf /var/lib/apt/lists/*
# Set the working directory to /app inside the container.
# All future commands (COPY, RUN, CMD) will be executed from here.
WORKDIR /app
# Copies your local requirements.txt into the container's /app folder.
COPY requirements.txt .
# Install all the dependencies from requirements.txt.
# --no-cache-dir prevents pip from keeping installation caches, making the image smaller.
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Copies all the remaining project files (FastAPI code, HTML, CSS, JS, etc.) into /app.
COPY . .
# Expose FastAPI port, so it can be accessed from outside the container.
EXPOSE 8000
# Default command to run the FastAPI app with Uvicorn in production mode.
# --host 0.0.0.0 allows external connections (necessary in Docker).
# --port 8000 specifies the port.
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
- This file tells Docker which files and folders to exclude from the image.
- This keeps the image small and prevents unnecessary files from being copied.
- A .dockerignore file is used to exclude all files and folders that are not required to run your application.
# Virtual Environment
.venv/
# Jupyter Notebooks
*.ipynb
# Jupyter Notebook Checkpoints
.ipynb_checkpoints/
# Python Cache
__pycache__/
*.pyc
*.pyo
*.pyd
# Environment File
.env
*.env
# Dataset (Parquet & CSV Files)
*.parquet
*.csv
# Python Package (utils)
utils/
# Local/Temporary Files
*.log
*.tmp
*.bak
# Version Control Files
.git/
.gitignore
# IDE/Editor Configs
.vscode/
.idea/
.DS_Store
# Python Package Build Artifacts
*.egg-info/
build/
dist/
- A Docker image is essentially a read-only template that contains everything needed to run an application.
- You can think of a Docker image as a blueprint or snapshot of an environment. It doesn't run anything.
docker build -t your_image_name .
- When you run a Docker image, it becomes a Docker container.
- It is a live instance of that image, running your application in an isolated environment.
docker run --env-file .env -p 8000:8000 your_image_name \
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
docker run --env-file .env -p 8000:8000 your_image_name
- After the container starts, you can access your API.
http://localhost:8000
or
http://127.0.0.1:8000
- Once your Docker image is ready, you can push it to Docker Hub.
- It allows anyone to pull and run it without building it themselves.
Access the Docker Hub here or Click on the Image below.
- Prompts you to enter your Docker Hub username and password.
- This authenticates your local Docker client with your Docker Hub account.
docker login
- Tagging prepares the image for upload to Docker Hub.
docker tag your_image_name your-dockerhub-username/your_image_name:latest
- Uploads your image to your Docker Hub Repository.
- Once pushed, your image is publicly available.
- Anyone can now pull and run the image without building it locally.
docker push your-dockerhub-username/your_image_name:latest
- Once pushed, anyone can pull your image from Docker Hub and run it.
- This ensures that the application behaves the same way across all systems.
docker pull your-dockerhub-username/your_image_name:latest
- After pulling the Docker image, you can run it to create Docker container from it.
docker run --env-file .env -p 8000:8000 your-dockerhub-username/your_image_name:latest
- Lists all the running containers with their container_id.
docker ps
- Stops the running container safely. The container_id can be obtained from the docker ps output.
docker stop container_id
Follow these steps carefully to deploy your FastAPI application on Render :

- Link your GitHub Repository / Existing Docker Image

- Add details about your API

- Add Environment Variables in Render Dashboard (same as .env)

- Deploy the Web Service
The frontend application files are in the project root :
- index.html → Defines the structure and layout of the web page.
- style.css → Handles the visual appearance of the web page.
- script.js → Communicates between the web page and the API.
You can open index.html directly in your browser or serve it via a local HTTP server (like VS Code Live Server).
Note
Remember to update the API URL in script.js when deploying on GitHub Pages to get real-time predictions.
Change from :
const fetchPromise = fetch("http://127.0.0.1:8000/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(data),
});
To :
const fetchPromise = fetch("https://your_api_name.onrender.com/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(data),
});
Access the live Website here or Click on the Image below.
Important
The API for this project is deployed using the free tier on Render.
As a result, it may go to sleep after periods of inactivity.
Please start the API first by visiting the API URL. Then, navigate to the website to make predictions.
If the API was inactive, the first prediction may take a few seconds while the server spins back up.
# Importing load_parquet function from read_data module
from read_data import load_parquet
cars = load_parquet('clean_data', 'clean_data_after_eda.parquet')
cars.head()
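The notebook's import cell is not shown in this README; the modelling cells below assume imports along these lines (a sketch, not the verbatim cell):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_validate, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from xgboost import XGBRegressor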
# Creating Features and Target Variable
X = cars.drop('price', axis=1)
y = cars['price']
# Splitting Data into Training and Testing Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Pipeline for Nominal Column
nominal_cols = ['fuel_type','transmission','brand']
nominal_trf = Pipeline(steps=[
('ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
# Pipeline for Ordinal Column
ordinal_cols = ['owner']
ordinal_categories = [['Others','3rd owner','2nd owner','1st owner']]
ordinal_trf = Pipeline(steps=[
('oe', OrdinalEncoder(categories=ordinal_categories))
])
# Pipeline for Numerical Column
numerical_cols = ['km_driven','year','engine_capacity']
numerical_trf = Pipeline(steps=[
('scaler', RobustScaler())
])
# Adding Everything into ColumnTransformer
ctf = ColumnTransformer(transformers=[
('nominal', nominal_trf, nominal_cols),
('ordinal', ordinal_trf, ordinal_cols),
('scaling', numerical_trf, numerical_cols)
], remainder='passthrough', n_jobs=-1)
# Models Dictionary
models = {
'LR' : LinearRegression(n_jobs=-1),
'KNN' : KNeighborsRegressor(n_jobs=-1),
'DT' : DecisionTreeRegressor(random_state=42),
'RF' : RandomForestRegressor(random_state=42, n_jobs=-1),
'GB' : GradientBoostingRegressor(random_state=42),
'XGB' : XGBRegressor(random_state=42, n_jobs=-1)
}
# Computing Average Error and R2-Score through Cross-Validation
results = {}
for name, model in models.items():
pipe = Pipeline(steps=[
('preprocessor', ctf),
('model', model)
])
k = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(estimator=pipe, X=X_train, y=y_train, cv=k, scoring={'mae':'neg_mean_absolute_error','r2':'r2'}, n_jobs=-1, return_train_score=False)
results[name] = {'avg_error': -cv_results['test_mae'].mean(),'avg_score': cv_results['test_r2'].mean()}
print()
print(f'Model : {name}')
print('-'*40)
    print(f"Average Error : {-cv_results['test_mae'].mean():.2f}")
    print(f"Standard Deviation of Error : {cv_results['test_mae'].std():.2f}")
    print(f"Average R2-Score : {cv_results['test_r2'].mean():.2f}")
    print(f"Standard Deviation of R2-Score : {cv_results['test_r2'].std():.2f}")
Model : LR
----------------------------------------
Average Error : 123190.02
Standard Deviation of Error : 6445.18
Average R2-Score : 0.77
Standard Deviation of R2-Score : 0.01
Model : KNN
----------------------------------------
Average Error : 115572.16
Standard Deviation of Error : 3883.19
Average R2-Score : 0.79
Standard Deviation of R2-Score : 0.00
Model : DT
----------------------------------------
Average Error : 118466.64
Standard Deviation of Error : 4490.62
Average R2-Score : 0.76
Standard Deviation of R2-Score : 0.03
Model : RF
----------------------------------------
Average Error : 90811.20
Standard Deviation of Error : 2335.09
Average R2-Score : 0.86
Standard Deviation of R2-Score : 0.01
Model : GB
----------------------------------------
Average Error : 98056.52
Standard Deviation of Error : 3001.29
Average R2-Score : 0.85
Standard Deviation of R2-Score : 0.01
Model : XGB
----------------------------------------
Average Error : 91595.94
Standard Deviation of Error : 2640.02
Average R2-Score : 0.86
Standard Deviation of R2-Score : 0.02
# Plotting Metric Comparison Graph
results_df = pd.DataFrame(results)
fig, ax = plt.subplots(ncols=1, nrows=2, figsize=(12,8))
sns.barplot(x=results_df.iloc[0,:].sort_values().index.to_list(), y=results_df.iloc[0,:].sort_values().values, ax=ax[0])
ax[0].set_title('Average Error Comparison (Lower is Better)')
ax[0].set_ylabel('Error')
for container in ax[0].containers:
ax[0].bar_label(container, fmt='%.0f')
sns.barplot(x=results_df.iloc[1,:].sort_values().index.to_list(), y=results_df.iloc[1,:].sort_values().values, ax=ax[1])
ax[1].set_title('Average R2-Score Comparison (Higher is Better)')
ax[1].set_ylabel('R2-Score')
for container in ax[1].containers:
ax[1].bar_label(container, fmt='%.2f')
plt.tight_layout()
plt.show()

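The stacking cell below reuses rf, xgb, gb and meta_model, which are defined earlier in the notebook but not shown here. A plausible setup, inferred from the alpha and l1_ratio hyperparameters tuned later (treat it as an assumption, not the exact notebook code):

from sklearn.linear_model import ElasticNet

# Base learners: the top performers from cross-validation
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
gb = GradientBoostingRegressor(random_state=42)
xgb = XGBRegressor(random_state=42, n_jobs=-1)

# Meta-learner; ElasticNet matches the alpha / l1_ratio parameters tuned below
meta_model = ElasticNet()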
# Assigning Base Model for StackingRegressor
base_model = [('rf', rf),('xgb', xgb),('gb', gb)]
# Structure of StackingRegressor
stack = StackingRegressor(
estimators=base_model,
final_estimator=meta_model,
passthrough=False,
cv=k, n_jobs=-1
)
# Final Pipeline with StackingRegressor
pipe = Pipeline(steps=[
('preprocessor', ctf),
('model', stack)
])
# Average Error and R2-Score through Cross-Validation
cv_results = cross_validate(estimator=pipe, X=X_train, y=y_train, cv=k, scoring={'mae':'neg_mean_absolute_error','r2':'r2'}, n_jobs=-1)
print(f"Average Error : {-cv_results['test_mae'].mean():.2f}")
print(f"Standard Deviatacion of Error : {cv_results['test_mae'].std():.2f}")
print(f"Average R2-Score : {cv_results['test_r2'].mean():.2f}")
print(f"Standard Deviation of R2-Score : {cv_results['test_r2'].std():.2f}")
Average Error : 87885.34
Standard Deviation of Error : 1279.54
Average R2-Score : 0.87
Standard Deviation of R2-Score : 0.01

Figures : R2-Score Curve | Error Curve
# Parameter Distribution
param_dist = {
'model__rf__n_estimators': [200, 300],
'model__rf__max_depth': [10, 20],
'model__rf__min_samples_leaf': [3, 5],
'model__rf__min_samples_split': [5, 7],
'model__xgb__n_estimators': [200, 300],
'model__xgb__learning_rate': [0.05, 0.1],
'model__xgb__max_depth': [2, 4],
'model__xgb__subsample': [0.5, 0.75],
'model__xgb__colsample_bytree': [0.5, 0.75],
'model__gb__n_estimators': [100, 200],
'model__gb__learning_rate': [0.05, 0.1],
'model__gb__max_depth': [2, 4],
'model__gb__subsample': [0.5, 0.75],
'model__final_estimator__alpha': [0.1, 10.0],
'model__final_estimator__l1_ratio': [0.0, 1.0]
}
# RandomizedSearch Object with Cross-Validation
rcv = RandomizedSearchCV(estimator=pipe, param_distributions=param_dist, cv=k, scoring='neg_mean_absolute_error', n_iter=30, n_jobs=-1, random_state=42)
# Fitting the RandomizedSearch Object
rcv.fit(X_train, y_train)
# Best Estimator
best_model = rcv.best_estimator_
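As a final check, the tuned pipeline can be evaluated on the held-out test set with standard scikit-learn metrics (a sketch; this step is not shown in the excerpt above):

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate the tuned stacked model on the untouched test set
y_pred = best_model.predict(X_test)
print(f"Test MAE : {mean_absolute_error(y_test, y_pred):.2f}")
print(f"Test R2-Score : {r2_score(y_test, y_pred):.2f}")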
Figures : Before Tuning vs After Tuning | R2-Score Curve (Before/After Tuning) | Error Curve (Before/After Tuning)
- I wanted to use real-world data instead of a pre-cleaned toy dataset, as it represents messy, real-life scenarios.
- However, Cars24 loads its content dynamically using JS, meaning a simple HTTP request won't be enough.
- I used Selenium to simulate a real browser, ensuring that the page was fully loaded before scraping.
- Once the content was fully rendered, I used BeautifulSoup to efficiently parse the HTML.
- This approach allowed me to reliably capture all the details.
- The raw scraped dataset was large, taking up unnecessary space.
- Loading it repeatedly during experimentation became inefficient.
- I optimized memory consumption by downcasting data types to reduce memory usage.
- I also stored the dataset in Parquet format, which compresses data without losing information.
- It allows for much faster read/write speeds as compared to CSV.
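As an illustration, the optimization can be as simple as the following (a sketch with assumed column names, not the exact notebook code):

# Downcast numeric columns and convert low-cardinality strings to 'category'
cars['km_driven'] = pd.to_numeric(cars['km_driven'], downcast='integer')
cars['price'] = pd.to_numeric(cars['price'], downcast='integer')
for col in ['fuel_type', 'transmission', 'owner', 'brand']:
    cars[col] = cars[col].astype('category')

# Parquet is compressed and preserves dtypes, unlike CSV
cars.to_parquet('clean_data/clean_data.parquet', index=False)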
- If preprocessing is applied to the entire dataset, test data can leak into the training process.
- This creates overly optimistic results and reduces the model's ability to generalize.
- I implemented a Scikit-learn Pipeline and ColumnTransformer so that preprocessing is fitted only on the training data.
- This keeps the test data completely unseen during preprocessing, preventing data leakage.
- Even after building the machine learning pipeline, it remained offline and could only be used locally.
- There was no way to provide inputs and get predictions over the web or from other applications.
- The model was dependent on the local system and could not serve predictions to external users or services.
- I deployed the machine learning model as an API using FastAPI.
- This allows users and applications to send inputs and receive predictions online in real time.
- I added a /predict endpoint for serving predictions and a /health endpoint to monitor API status.
- I also implemented rate limiting and input validation to prevent misuse and ensure stability under load.
- These measures made the model accessible, reliable and ready for production use.
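A simplified sketch of the request validation and health endpoint, extending the api/main.py excerpt shown earlier (field names mirror the example payload; how model_freq and MAE are used here is an assumption, and rate limiting, e.g. via a library such as slowapi, is omitted):

import pandas as pd
from pydantic import BaseModel

# Input schema mirroring the example payload; FastAPI rejects malformed requests automatically
class CarInput(BaseModel):
    brand: str
    model: str
    km_driven: int
    engine_capacity: int
    fuel_type: str
    transmission: str
    year: int
    owner: str

@app.get("/health")
def health():
    # Lightweight endpoint for uptime monitoring
    return {"status": "ok"}

@app.post("/predict")
def predict(car: CarInput):
    X = pd.DataFrame([car.model_dump()])
    # Assumption: model_freq maps model names to the frequency encoding used in training
    X['model'] = X['model'].map(model_freq).fillna(0)
    estimate = pipe.predict(X)[0]
    # Assumption: the MAE from .env is used to report a price range around the point estimate
    return {"output": f"₹{estimate - settings.MAE:,.0f} to ₹{estimate + settings.MAE:,.0f}"}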
- Even if the API works perfectly fine, non-technical users may still find it difficult to test and use.
- This limits accessibility.
- I created an HTML/CSS/JS frontend that sends requests to the API and displays predictions instantly.
- I also included an example payload in Swagger UI, so that users can test it without any extra effort.
- Installing dependencies and setting up the environment manually is time-consuming and error-prone.
- This is especially true when running on different machines with different operating systems.
- This also made sharing the project with others more difficult, as they would have to replicate the exact setup.
- I created a lightweight Dockerfile based on the python:3.11-slim image.
- It builds the FastAPI application, installs dependencies and copies only the required files into the final image.
- I used a .dockerignore file to exclude unnecessary files, keeping the image small and deployment fast.
- This allowed me to run the project consistently on any system with Docker installed.
- It eliminates the hassle of worrying about dependency mismatches or operating system specific issues.
- Same Docker image can be used to deploy on Render, Docker Hub or run locally with a single docker command.
- Built and deployed a complete machine learning pipeline as a FastAPI service.
- Enabled real-time used car price prediction from a dataset of 2,800+ cars.
- Reduced dataset memory usage by 90% through data type optimization.
- Converted dataset to Parquet format, significantly improving preprocessing speed.
- Used cross-validation to evaluate multiple regression models.
- Ensured only top performers made it to production.
- Delivered 30% lower MAE and 12% higher R2-Score compared to the baseline model.
- Achieved these gains by implementing stacking ensemble and hyperparameter tuning.
- Improved model stability by 70%, ensuring more consistent and reliable output in production.
.
├── api/ # FastAPI Code for making Predictions
│ ├── main.py
│ └── config.py
│
├── clean_data/ # Cleaned Dataset (Parquet Format)
│ ├── clean_data.parquet
│ └── ...
│
├── images/ # Images for Frontend Interface
│ ├── favicon.png
│ └── hero_image.png
│
├── models/ # Serialized Components for Prediction
│ ├── pipe.pkl
│ └── model_freq.pkl
│
├── notebooks/ # Jupyter Notebooks for Project Development
│ ├── data_cleaning.ipynb
│ └── ...
│
├── scrape_code/ # Web Scraping Notebook
│ └── scrape_code.ipynb
│
├── scrape_data/ # Scraped Dataset (CSV Format)
│ └── scrape_data.csv
│
├── utils/ # Reusable Python Functions (utils Package)
│ ├── __init__.py
│ ├── web_scraping.py
│ ├── helpers.py
│ └── ...
│
├── .dockerignore # All files and folders ignored by Docker while building Docker Image
├── .gitignore # All files and folders ignored by Git while pushing code to GitHub
├── Dockerfile # Instructions for building the Docker Image
├── index.html # Frontend HTML File
├── style.css # Frontend CSS File
├── script.js # Frontend JS File
├── requirements.txt # List of required libraries for the Project
├── LICENSE # License specifying permissions and usage rights
└── README.md # Detailed documentation of the Project
This project is licensed under the MIT License. You are free to use and modify the code as needed.