🛒 End-to-end data mining of Amazon reviews (2023) using clustering and sentiment analysis. Features rigorous preprocessing, EDA, and ML workflows to extract e-commerce insights.

VassTs/amazon-ecommerce-data-mining

e‑Commerce Data Mining with Amazon Reviews 2023

Overview

This project leverages the up‑to‑date McAuley-Lab/Amazon-Reviews-2023 dataset to explore e‑commerce trends and demonstrate end‑to‑end data mining workflows. We select five distinct product categories and apply a range of techniques—from data visualization to clustering and sentiment analysis—to extract actionable insights.

Detailed description

Data Exploration

We first select five product categories from those available in the dataset. We then explore the data with word clouds, plots, and histograms built with Matplotlib and Seaborn, answering basic questions to build a solid understanding of the data. One such question: which products have a high number of reviews but low ratings, and what keywords or phrases are common in the reviews for these products?
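The example question above can be sketched in a few lines. This is a dependency-free illustration, not the notebook code: the field names `parent_asin`, `rating`, and `text` are assumed to match the dataset's review records, while the toy reviews, the thresholds, and the tiny stopword list are made up for the example.

```python
import re
from collections import Counter, defaultdict

# Toy review records (hypothetical data; field names are assumptions).
reviews = [
    {"parent_asin": "B001", "rating": 1.0, "text": "Battery died after a week"},
    {"parent_asin": "B001", "rating": 2.0, "text": "Battery drains fast, poor build"},
    {"parent_asin": "B001", "rating": 1.0, "text": "Poor battery life"},
    {"parent_asin": "B002", "rating": 5.0, "text": "Great sound, great price"},
    {"parent_asin": "B002", "rating": 4.0, "text": "Works well"},
    {"parent_asin": "B002", "rating": 5.0, "text": "Excellent"},
    {"parent_asin": "B003", "rating": 1.0, "text": "Broke immediately"},
]
STOPWORDS = {"a", "after", "the", "and"}  # illustrative only


def low_rated_popular(reviews, min_reviews, max_mean_rating):
    """Return product ids with many reviews but a low mean rating."""
    ratings = defaultdict(list)
    for r in reviews:
        ratings[r["parent_asin"]].append(r["rating"])
    return sorted(
        pid for pid, rs in ratings.items()
        if len(rs) >= min_reviews and sum(rs) / len(rs) <= max_mean_rating
    )


def top_keywords(reviews, product_ids, n=3):
    """Count the most frequent non-stopword tokens in the selected reviews."""
    counts = Counter()
    for r in reviews:
        if r["parent_asin"] in product_ids:
            words = re.findall(r"[a-z']+", r["text"].lower())
            counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


flagged = low_rated_popular(reviews, min_reviews=3, max_mean_rating=2.5)
keywords = top_keywords(reviews, set(flagged))
print(flagged, keywords)
```

On real data the thresholds for "high number of reviews" and "low rating" would need to be chosen per category, for example from the histograms mentioned above.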

Preprocessing & Clustering Analysis

We apply a comprehensive 15-step data-cleaning (preprocessing) pipeline and then vectorize the text with TF-IDF. Next, we cluster similar products within each category based on price, review descriptions, and ratings. The clustering method is k-means, with the elbow method used to determine the number of clusters (k) for each category. We visualize the clusters in both PCA and t-SNE space and use the silhouette score to evaluate cluster quality.
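To make the elbow method and the silhouette score concrete, here is a minimal pure-Python sketch on toy 2D points. It is not the project's implementation: the notebooks presumably rely on a library and on TF-IDF, price, and rating features, whereas the points, the farthest-point initialization, and all function names below are assumptions chosen to keep the example deterministic and self-contained.

```python
import math
from statistics import mean

# Toy feature vectors forming three tight groups (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.8)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption, not k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, inertia)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    inertia = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, inertia


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:  # singleton cluster or a single cluster
            scores.append(0.0)
            continue
        a = mean(own)
        b = min(mean(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b))
    return mean(scores)


results = {k: kmeans(points, k) for k in range(1, 5)}
inertias = {k: inertia for k, (_, inertia) in results.items()}
silhouettes = {k: silhouette(points, results[k][0]) for k in range(2, 5)}
for k in sorted(inertias):
    print(k, round(inertias[k], 3), round(silhouettes.get(k, float("nan")), 3))
```

The "elbow" is the value of k after which inertia stops dropping sharply. On this toy data the drop flattens at k = 3, which is also where the silhouette score is higher than at k = 2.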

Sentiment Classification

Next, we perform sentiment classification on the reviews, categorizing them as negative, neutral, or positive. We use two feature extraction methods (TF-IDF and Word2Vec) and three well-known classifiers (Naive Bayes, KNN, and Random Forest). We generate ground-truth labels by using a pretrained RoBERTa model to extract sentiment scores from the review text. Final labels are calculated using a weighted average: Final Sentiment Score = w₁ × Text Sentiment (RoBERTa) + w₂ × Normalized Rating, where w₁ and w₂ are weights we experiment with. We evaluate the classifiers using metrics such as F1-score, precision, and recall, and perform 10-fold cross-validation on the training data to estimate how well the models generalize. The results are summarized in tables within the Jupyter notebooks.
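The labeling formula can be illustrated with a short sketch. Several details are assumptions rather than facts from the notebooks: the RoBERTa text sentiment is taken to be already scaled to [0, 1], star ratings are normalized from the 1 to 5 range with (rating - 1) / 4, the weights sum to 1, and the 0.4 / 0.6 cut-offs for the three classes are hypothetical.

```python
def normalize_rating(rating, lo=1.0, hi=5.0):
    """Map a star rating from [lo, hi] to [0, 1] (assumed normalization)."""
    return (rating - lo) / (hi - lo)


def final_sentiment_score(text_sentiment, rating, w1, w2):
    """Final Sentiment Score = w1 * Text Sentiment + w2 * Normalized Rating."""
    return w1 * text_sentiment + w2 * normalize_rating(rating)


def to_label(score, neg_below=0.4, pos_above=0.6):
    """Turn a score in [0, 1] into a class label (thresholds are hypothetical)."""
    if score < neg_below:
        return "negative"
    if score > pos_above:
        return "positive"
    return "neutral"


# (text sentiment in [0, 1], star rating) pairs; made-up values.
examples = [(0.9, 5), (0.1, 1), (0.8, 2)]
scores = [final_sentiment_score(s, r, w1=0.5, w2=0.5) for s, r in examples]
labels = [to_label(s) for s in scores]
print(list(zip(scores, labels)))
```

The third example shows why the weights matter: enthusiastic text paired with a 2-star rating lands in the neutral band at equal weights, and shifting weight toward the text or the rating would move it to either side.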

License

This project is licensed under the MIT License. See LICENSE for details.
