This project uses the recent McAuley-Lab/Amazon-Reviews-2023
dataset to explore e-commerce trends and demonstrate an end-to-end data mining workflow. We select five distinct product categories and apply a range of techniques, from data visualization to clustering and sentiment analysis, to extract actionable insights.
We first select five product categories from those available in the dataset. We then explore the data by creating word clouds, plots, and histograms with Matplotlib and Seaborn to answer basic questions and build an initial understanding of the data. One such question is: which products have a high number of reviews but low ratings, and what are some common keywords or phrases in the reviews for these products? A minimal exploration sketch is shown below.
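The sketch below is illustrative only: it assumes the selected reviews are already in a pandas DataFrame named `reviews`, and the column names (`parent_asin`, `rating`, `text`) as well as the review-count and rating thresholds are assumptions for demonstration, not values fixed by the project.

```python
# Exploration sketch. `reviews` and its columns are assumed, not guaranteed
# by the dataset loader.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Distribution of star ratings across the selected categories.
reviews["rating"].hist(bins=5)
plt.xlabel("Star rating")
plt.ylabel("Number of reviews")
plt.show()

# Products with many reviews but a low average rating (thresholds are examples).
stats = reviews.groupby("parent_asin")["rating"].agg(["count", "mean"])
popular_but_low = stats[(stats["count"] >= 100) & (stats["mean"] <= 2.5)]

# Word cloud of the review text for those products.
text = " ".join(reviews[reviews["parent_asin"].isin(popular_but_low.index)]["text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```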
We then apply a 15-step data-cleaning (preprocessing) pipeline and vectorize the review text with TF-IDF. Next, we cluster similar products within each category based on price, review descriptions, and ratings by applying k-means, using the elbow method to determine the optimal number of clusters (k) per category. We visualize the clusters in both PCA and t-SNE space and use the silhouette score to evaluate cluster quality.
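A minimal sketch of this clustering step, under stated assumptions: the preprocessed reviews for one category live in a DataFrame `category_df` with hypothetical columns `clean_text`, `price`, and `rating`, and the chosen k (4 here) would in practice be read off the elbow plot.

```python
# Clustering sketch: TF-IDF on cleaned review text plus scaled numeric features,
# elbow method over a range of k, silhouette score for quality, PCA for a 2-D view.
# `category_df` and its column names are assumptions about the preprocessed data.
import matplotlib.pyplot as plt
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_text = tfidf.fit_transform(category_df["clean_text"])
X_num = StandardScaler().fit_transform(category_df[["price", "rating"]])
X = hstack([X_text, X_num]).toarray()

# Elbow method: plot inertia for k = 2..10 and look for the bend.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()

# Fit the chosen k and evaluate cluster quality with the silhouette score.
k = 4  # example value read off the elbow plot
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))

# Project to 2-D with PCA for visualization (t-SNE works analogously).
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.title(f"k-means clusters (k={k}) in PCA space")
plt.show()
```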
Next, we perform sentiment classification on the reviews, categorizing them as negative, neutral, or positive. We use two feature extraction methods (TF-IDF and Word2Vec) and three well-known classifiers (Naive Bayes, KNN, Random Forests). We generate ground-truth labels with a pretrained RoBERTa model that extracts sentiment scores from the review text, and compute the final label as a weighted average: Final Sentiment Score = w₁ × Text Sentiment (RoBERTa) + w₂ × Normalized Rating, where w₁ and w₂ are weights we experiment with.
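A sketch of this label-generation step, assuming the cardiffnlp Twitter RoBERTa sentiment checkpoint and illustrative values for w₁, w₂, and the decision thresholds; none of these specific choices are fixed by the project.

```python
# Label-generation sketch: a pretrained RoBERTa model scores the review text,
# and the final label blends that score with the normalized star rating.
# The checkpoint, weights, and thresholds below are illustrative assumptions.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def final_label(text, rating, w1=0.7, w2=0.3):
    out = sentiment(text[:1000])[0]  # crude truncation to fit the model's context
    text_score = {"negative": 0.0, "neutral": 0.5, "positive": 1.0}[out["label"].lower()]
    norm_rating = (rating - 1) / 4   # map 1-5 stars to [0, 1]
    score = w1 * text_score + w2 * norm_rating
    if score < 0.4:
        return "negative"
    if score < 0.6:
        return "neutral"
    return "positive"

print(final_label("Broke after two days, very disappointed.", rating=2))
```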
We evaluate the classifiers with metrics such as F1-score, precision, and recall, and perform 10-fold cross-validation on the training data to estimate model generalization. These results are summarized in tables within the Jupyter notebooks.
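The evaluation step could look like the sketch below; the feature matrix `X_train` and labels `y_train` are assumed outputs of the feature-extraction and labeling steps above, and the hyperparameters are illustrative.

```python
# Evaluation sketch: 10-fold cross-validation of the three classifiers,
# reporting macro-averaged F1, precision, and recall.
# `X_train` / `y_train` are assumed; hyperparameters are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

models = {
    # MultinomialNB expects non-negative features (e.g. TF-IDF);
    # GaussianNB would be the analogue for Word2Vec embeddings.
    "Naive Bayes": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=15),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
scoring = ["f1_macro", "precision_macro", "recall_macro"]

for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=10, scoring=scoring)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring}
    print(name, summary)
```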
This project is licensed under the MIT License. See LICENSE for details.