This project uses the recent McAuley-Lab/Amazon-Reviews-2023
dataset to explore e-commerce trends and demonstrate an end-to-end data mining workflow. We select five distinct product categories and apply a range of techniques, from data visualization to clustering and sentiment analysis, to extract actionable insights.
We first select five product categories from those available in the dataset. We then explore the data by creating word clouds, plots, and histograms with Matplotlib and Seaborn to answer basic questions and build an initial understanding of the data. One such question is: which products have a high number of reviews but low ratings, and what are some common keywords or phrases in the reviews for these products? A minimal exploration sketch is shown below.
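The sketch below is illustrative only: it assumes the selected reviews are already in a pandas DataFrame named `reviews`, and the column names (`parent_asin`, `rating`, `text`) as well as the review-count and rating thresholds are assumptions for demonstration, not values fixed by the project.

```python
# Exploration sketch. `reviews` and its columns are assumed, not guaranteed
# by the dataset loader.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Distribution of star ratings across the selected categories.
reviews["rating"].hist(bins=5)
plt.xlabel("Star rating")
plt.ylabel("Number of reviews")
plt.show()

# Products with many reviews but a low average rating (thresholds are examples).
stats = reviews.groupby("parent_asin")["rating"].agg(["count", "mean"])
popular_but_low = stats[(stats["count"] >= 100) & (stats["mean"] <= 2.5)]

# Word cloud of the review text for those products.
text = " ".join(reviews[reviews["parent_asin"].isin(popular_but_low.index)]["text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```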
We then apply a 15-step data-cleaning (preprocessing) pipeline and vectorize the review text with TF-IDF. Next, we cluster similar products within each category based on price, review descriptions, and ratings by applying k-means, using the elbow method to determine the optimal number of clusters (k) per category. We visualize the clusters in both PCA and t-SNE space and use the silhouette score to evaluate cluster quality.
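A minimal sketch of this clustering step, under stated assumptions: the preprocessed reviews for one category live in a DataFrame `category_df` with hypothetical columns `clean_text`, `price`, and `rating`, and the chosen k (4 here) would in practice be read off the elbow plot.

```python
# Clustering sketch: TF-IDF on cleaned review text plus scaled numeric features,
# elbow method over a range of k, silhouette score for quality, PCA for a 2-D view.
# `category_df` and its column names are assumptions about the preprocessed data.
import matplotlib.pyplot as plt
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_text = tfidf.fit_transform(category_df["clean_text"])
X_num = StandardScaler().fit_transform(category_df[["price", "rating"]])
X = hstack([X_text, X_num]).toarray()

# Elbow method: plot inertia for k = 2..10 and look for the bend.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()

# Fit the chosen k and evaluate cluster quality with the silhouette score.
k = 4  # example value read off the elbow plot
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))

# Project to 2-D with PCA for visualization (t-SNE works analogously).
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.title(f"k-means clusters (k={k}) in PCA space")
plt.show()
```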
Next, we perform sentiment classification on the reviews, categorizing them as negative, neutral, or positive. We use two feature extraction methods (TF-IDF and Word2Vec) and three well-known classifiers (Naive Bayes, KNN, Random Forests). We generate ground-truth labels with a pretrained RoBERTa model that extracts sentiment scores from the review text, and compute the final label as a weighted average: Final Sentiment Score = w₁ × Text Sentiment (RoBERTa) + w₂ × Normalized Rating, where w₁ and w₂ are weights we experiment with.
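A sketch of this label-generation step, assuming the cardiffnlp Twitter RoBERTa sentiment checkpoint and illustrative values for w₁, w₂, and the decision thresholds; none of these specific choices are fixed by the project.

```python
# Label-generation sketch: a pretrained RoBERTa model scores the review text,
# and the final label blends that score with the normalized star rating.
# The checkpoint, weights, and thresholds below are illustrative assumptions.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def final_label(text, rating, w1=0.7, w2=0.3):
    out = sentiment(text[:1000])[0]  # crude truncation to fit the model's context
    text_score = {"negative": 0.0, "neutral": 0.5, "positive": 1.0}[out["label"].lower()]
    norm_rating = (rating - 1) / 4   # map 1-5 stars to [0, 1]
    score = w1 * text_score + w2 * norm_rating
    if score < 0.4:
        return "negative"
    if score < 0.6:
        return "neutral"
    return "positive"

print(final_label("Broke after two days, very disappointed.", rating=2))
```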
We evaluate the classifiers with metrics such as F1-score, precision, and recall, and perform 10-fold cross-validation on the training data to estimate model generalization. These results are summarized in tables within the Jupyter notebooks.
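The evaluation step could look like the sketch below; the feature matrix `X_train` and labels `y_train` are assumed outputs of the feature-extraction and labeling steps above, and the hyperparameters are illustrative.

```python
# Evaluation sketch: 10-fold cross-validation of the three classifiers,
# reporting macro-averaged F1, precision, and recall.
# `X_train` / `y_train` are assumed; hyperparameters are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

models = {
    # MultinomialNB expects non-negative features (e.g. TF-IDF);
    # GaussianNB would be the analogue for Word2Vec embeddings.
    "Naive Bayes": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=15),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
scoring = ["f1_macro", "precision_macro", "recall_macro"]

for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=10, scoring=scoring)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring}
    print(name, summary)
```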
This project is licensed under the MIT License. See LICENSE for details.