Handwritten digit recognition is a classic problem in machine learning. While often tackled with supervised methods, this project takes an unsupervised approach, using clustering algorithms to group the images without access to their true labels. The goal is to analyze and benchmark the performance of different clustering models in discovering the underlying structure of the data.
The project uses the MNIST (Modified National Institute of Standards and Technology) database, a widely used collection of 70,000 grayscale images of handwritten digits (0-9). Each image is size-normalized and centered, with a resolution of 28×28 pixels.
The core of this project is the implementation and comparison of several clustering algorithms. To handle the high dimensionality of the image data (784 features per image), Principal Component Analysis (PCA) was used for dimensionality reduction.
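A minimal sketch of this preprocessing step is shown below, assuming scikit-learn and its `fetch_openml` loader for MNIST; the choice of 50 components is illustrative, not necessarily the setting used in the project.

```python
# Minimal sketch of the PCA step (assumes scikit-learn; 50 components
# is an illustrative choice, not the project's exact setting).
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# Load MNIST: 70,000 images flattened to 784-dimensional vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel intensities to [0, 1]

# Project onto the top principal components to reduce dimensionality.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (70000, 50)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```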
The clustering models analyzed are the following (a brief instantiation sketch appears after the list):
Gaussian Mixture Models (GMM): A probabilistic model that assumes data points are generated from a mixture of a finite number of Gaussian distributions.
Mean Shift: A non-parametric, density-based clustering algorithm that seeks to find the modes (peaks) of the data's probability density function.
Normalized Cut (N-Cut): A graph-based algorithm that treats the dataset as a similarity graph and partitions it by finding cuts that minimize the similarity between subgraphs relative to the total similarity within each subgraph.
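The sketch below shows how the three models might be instantiated side by side with scikit-learn. All parameter values are placeholders, and SpectralClustering is used as a stand-in for N-Cut, since its normalized-Laplacian embedding solves the standard relaxation of the normalized-cut objective.

```python
# Hypothetical side-by-side instantiation of the three models
# (placeholder parameters; SpectralClustering stands in for N-Cut).
from sklearn.mixture import GaussianMixture
from sklearn.cluster import MeanShift, SpectralClustering

# X_reduced comes from the PCA sketch above; a subset keeps
# Mean Shift and N-Cut tractable.
X_sub = X_reduced[:5000]

gmm_labels = GaussianMixture(n_components=10, covariance_type="full",
                             random_state=0).fit_predict(X_sub)
ms_labels = MeanShift(bandwidth=5.0).fit_predict(X_sub)  # bandwidth needs tuning
ncut_labels = SpectralClustering(n_clusters=10, affinity="nearest_neighbors",
                                 random_state=0).fit_predict(X_sub)
```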
The quality of the clustering results was evaluated using the Adjusted Rand Index (ARI), which measures the agreement between the predicted clusters and the ground-truth labels, corrected for chance.
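In scikit-learn terms the evaluation reduces to a single call; the variable names continue from the sketches above.

```python
# Scoring predicted clusters against the ground-truth digit labels.
# ARI is corrected for chance (~0 for random assignments, 1 for perfect
# agreement) and is invariant to permutations of the cluster IDs.
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(y[:5000], gmm_labels)
print(f"Adjusted Rand Index: {ari:.2f}")
```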
The experiments involved applying each clustering algorithm to the MNIST data after reducing its dimensionality with PCA. The number of components for PCA and the key parameters for each model were varied to observe their impact on performance and execution time.
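An illustrative version of such an experiment loop, sweeping the number of PCA components for GMM and timing each run, might look as follows; the component counts are assumptions, not the values reported in the project.

```python
# Illustrative experiment loop: vary the number of PCA components and
# record ARI and wall-clock time for one model (GMM shown here).
import time
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

for n_components in (10, 30, 50, 100):   # assumed sweep, for illustration
    X_red = PCA(n_components=n_components).fit_transform(X)
    start = time.perf_counter()
    labels = GaussianMixture(n_components=10, random_state=0).fit_predict(X_red)
    elapsed = time.perf_counter() - start
    ari = adjusted_rand_score(y, labels)
    print(f"PCA={n_components:3d}  ARI={ari:.2f}  time={elapsed:.1f}s")
```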
Normalized Cut (N-Cut): Achieved the highest clustering quality, with an Adjusted Rand Index of up to 0.90. However, its computational complexity (growing quadratically with the number of samples) made it feasible only on a smaller subset of the dataset.
Gaussian Mixture Models (GMM): Proved to be the most balanced model in terms of performance. It was the only algorithm capable of processing the entire dataset within a reasonable time, achieving a respectable Adjusted Rand Index of 0.52, and it offers a good trade-off between cluster quality and computational efficiency.
Mean Shift: Was found to be inadequate for this specific task. The model was highly sensitive to its bandwidth parameter, making it difficult to tune: it either produced far too many clusters (over-segmentation) or collapsed all the data into a single cluster, yielding unsatisfactory results either way (the bandwidth sweep sketched below illustrates this sensitivity).
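A small sweep over the quantile parameter of scikit-learn's `estimate_bandwidth` helper (a hypothetical reconstruction, not the project's exact procedure) makes this sensitivity easy to reproduce:

```python
# Mean Shift's bandwidth controls the kernel radius; sweeping the quantile
# used by sklearn's estimate_bandwidth shows how sharply the number of
# clusters reacts to it (subset size and quantiles are illustrative).
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X_small = X_reduced[:2000]  # small subset: Mean Shift scales poorly
for quantile in (0.05, 0.1, 0.2, 0.3):
    bw = estimate_bandwidth(X_small, quantile=quantile)
    labels = MeanShift(bandwidth=bw).fit_predict(X_small)
    print(f"quantile={quantile:.2f}  bandwidth={bw:.1f}  "
          f"clusters={len(np.unique(labels))}")
```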
Based on the analysis, Normalized Cut is the most effective algorithm for this task in terms of raw clustering accuracy, provided that computational resources are sufficient for its quadratic complexity. For larger datasets where efficiency is a concern, Gaussian Mixture Models represent a practical and effective alternative.
Ca' Foscari University of Venice, Master's Degree in Computer Science and Information Technology