This project was inspired by my master's thesis. It implements a reliability-weighted K-means clustering algorithm that incorporates data quality metrics through Coefficient of Variation (CV) values. Designed for use with American Community Survey (ACS-5) data, this approach gives higher weight to observations with lower measurement error. It is intended for use with GeoDataFrames and can be combined with my ACS-5 project by setting `include_cv_columns=True`. This automatically calculates the CV for each observation, which can then be used in the weighting process. The following features are included:
- Reliability-Weighted Clustering
  This algorithm allows you to incorporate data quality directly into the clustering process. Observations with lower Coefficient of Variation (CV) values, which indicate more reliable measurements, receive higher weights. This ensures that clusters are driven by high-quality data rather than noisy inputs.
- Customizable Feature Scaling
  Choose from `standard` (z-score), `minmax`, or no scaling to preprocess your data before clustering. Feature scaling is critical when variables are on different scales, and this option provides flexibility depending on your dataset's characteristics.
- Silhouette Analysis & WCSS for Optimal Cluster Selection
  Silhouette analysis is integrated to help you identify the most appropriate number of clusters. This feature generates silhouette scores and WCSS for a user-defined range of cluster counts and visualizes them, aiding in model selection.
- Map-Based Cluster Visualization
  Designed for use with spatial data, the class can generate map plots using `GeoDataFrames`. It overlays clusters on a basemap (via `contextily`), making it easier to interpret spatial patterns in the results.
- Descriptive Cluster Statistics
  After fitting the model, you can retrieve a summary table with cluster-wise statistics. This includes the mean value of each feature (in original scale), the WCSS per cluster (absolute value and percentage of total), the number of observations per cluster, and their percentage share. This is useful for interpreting and comparing clusters.
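To illustrate how reliability weights can enter the clustering step, here is a minimal NumPy sketch of weighted K-means (Lloyd's algorithm). It is not the project's implementation: the function name `weighted_kmeans` and the explicit `init_centroids` argument are assumptions made for this example, and the weights are assumed to be already computed.

```python
import numpy as np

def weighted_kmeans(X, weights, init_centroids, n_iter=100):
    """Lloyd's algorithm where each centroid is the weighted mean of its members.

    Hypothetical helper for illustration; names and signature are assumptions.
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Update step: weighted mean, so reliable observations pull harder.
        updated = centroids.copy()
        for k in range(len(centroids)):
            members = labels == k
            if weights[members].sum() > 0:  # keep old centroid if cluster is empty
                updated[k] = np.average(X[members], axis=0, weights=weights[members])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Toy data: the second observation is three times as reliable as the others.
X = np.array([[0.0], [2.0], [10.0], [14.0]])
weights = np.array([1.0, 3.0, 1.0, 1.0])
labels, centroids = weighted_kmeans(X, weights, init_centroids=X[[0, 2]])
print(labels)     # cluster index per observation
print(centroids)  # first centroid sits at 1.5 rather than the unweighted 1.0
```

In the toy data the first centroid lands at 1.5 instead of the unweighted mean of 1.0, because the more reliable observation carries more weight. scikit-learn's `KMeans.fit` also accepts a `sample_weight` argument, which gives a similar effect without hand-written loops.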
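The Coefficient of Variation (CV) is a standardized measure of estimate reliability:

$$\text{CV} = \frac{\text{MOE} / 1.645}{\text{Estimate}} \times 100\%$$

- MOE: Margin of Error (at 90% confidence level); dividing by 1.645 converts it to a standard error
- Threshold: CV > 30% considered unreliable (per convention)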
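As a rough sketch of the CV calculation, the snippet below derives the standard error from a 90% margin of error (MOE / 1.645, the usual ACS convention) and expresses it as a percentage of the estimate. The function name `coefficient_of_variation` and the choice to return NaN for zero estimates are assumptions for this example, not the project's API.

```python
import numpy as np

Z_90 = 1.645  # z-score for the 90% confidence level used by ACS margins of error

def coefficient_of_variation(estimate, moe):
    """Return the CV in percent; NaN where the estimate is zero (an assumption)."""
    estimate = np.asarray(estimate, dtype=float)
    moe = np.asarray(moe, dtype=float)
    se = moe / Z_90  # standard error recovered from the margin of error
    with np.errstate(divide="ignore", invalid="ignore"):
        cv = np.where(estimate != 0, se / np.abs(estimate) * 100.0, np.nan)
    return cv

cv = coefficient_of_variation(estimate=[1000, 200, 0], moe=[164.5, 164.5, 50])
unreliable = cv > 30.0  # conventional threshold; NaN compares as False
print(cv)
print(unreliable)
```

Here the first estimate has a CV of about 10% (reliable) and the second about 50% (above the 30% threshold).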
We assign higher weights to more reliable estimates (lower CV):

$$w_i = \frac{1}{\text{MeanCV}_i + \varepsilon}$$

Where:
- $\text{MeanCV}_i$: Mean coefficient of variation for observation $i$
- $\varepsilon$: Small constant to avoid division by zero
Normalize weights to [0, 1] with min-max scaling:

$$w_i^{\text{norm}} = \frac{w_i - \min_j w_j}{\max_j w_j - \min_j w_j}$$

- 1 = highest reliability (lowest CV)
- 0 = lowest reliability (highest CV)
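The weighting and normalization steps can be sketched as follows, assuming the raw weight is the inverse of the mean CV plus ε and that normalization is plain min-max scaling. The function name `reliability_weights`, the default `eps=1e-6`, and the fallback to equal weights when all observations are equally reliable are assumptions for this example.

```python
import numpy as np

def reliability_weights(cv_matrix, eps=1e-6):
    """Map per-feature CVs (rows = observations) to weights in [0, 1].

    Hypothetical helper; eps and the equal-weight fallback are assumptions.
    """
    cv_matrix = np.asarray(cv_matrix, dtype=float)
    mean_cv = np.nanmean(cv_matrix, axis=1)  # MeanCV_i, ignoring missing CVs
    raw = 1.0 / (mean_cv + eps)              # lower CV -> higher raw weight
    span = raw.max() - raw.min()
    if span == 0:
        return np.ones_like(raw)             # all observations equally reliable
    return (raw - raw.min()) / span          # min-max normalization to [0, 1]

# Three observations with two CV columns each (values in percent).
cvs = np.array([[5.0, 15.0], [20.0, 40.0], [50.0, 70.0]])
w = reliability_weights(cvs)
print(w)
```

Note that under min-max normalization the least reliable observation receives a weight of exactly 0 and therefore has no influence on the centroids.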
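The weighted within-cluster sum of squares (WCSS) measures how tightly grouped the points in a cluster are.
For each cluster $k$, compute the weighted squared distance to its centroid:

$$\text{WCSS}_k = \sum_{i \in C_k} w_i \, \lVert x_i - \mu_k \rVert^2$$

Where:
- $C_k$: Observations in cluster $k$
- $w_i$: Reliability weight for observation $i$
- $x_i$: Scaled feature vector of observation $i$
- $\mu_k$: Centroid of cluster $k$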
- High WCSS%: Cluster has higher internal variability
- Low WCSS%: Cluster is more homogeneous
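A small sketch of the per-cluster weighted WCSS and its share of the total is shown below. The function name `weighted_wcss` is an assumption for this example; labels, centroids, and weights are taken as given.

```python
import numpy as np

def weighted_wcss(X, labels, centroids, weights):
    """Return (WCSS per cluster, WCSS share in percent). Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Squared distance of every observation to its own cluster centroid.
    sq_dist = np.sum((X - centroids[labels]) ** 2, axis=1)
    # Sum the weighted distances within each cluster.
    wcss = np.bincount(labels, weights=weights * sq_dist, minlength=len(centroids))
    total = wcss.sum()
    pct = 100.0 * wcss / total if total > 0 else np.zeros_like(wcss)
    return wcss, pct

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [10.0, 2.0]])
weights = np.array([1.0, 1.0, 0.5, 0.5])
wcss, pct = weighted_wcss(X, labels, centroids, weights)
print(wcss)  # absolute WCSS per cluster
print(pct)   # percentage of total WCSS per cluster
```

In this toy case the second cluster accounts for two thirds of the total WCSS, so it is the less homogeneous of the two even though its observations carry lower weights.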
Feature scaling options:
- `standard`: Standardizes each feature to mean 0, standard deviation 1
- `minmax`: Scales each feature to [0, 1]
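The two scaling options can be sketched in NumPy as below. The function name `scale_features`, the use of `None` to mean "no scaling", and the handling of constant columns are assumptions for this example; the project's actual parameter spelling may differ.

```python
import numpy as np

def scale_features(X, method="standard"):
    """Scale columns of X; method is 'standard', 'minmax', or None (assumed)."""
    X = np.asarray(X, dtype=float)
    if method is None:
        return X
    if method == "standard":
        std = X.std(axis=0)      # population standard deviation (ddof=0)
        std[std == 0] = 1.0      # leave constant columns at zero after centering
        return (X - X.mean(axis=0)) / std
    if method == "minmax":
        span = X.max(axis=0) - X.min(axis=0)
        span[span == 0] = 1.0    # constant columns map to zero
        return (X - X.min(axis=0)) / span
    raise ValueError(f"unknown scaling method: {method!r}")

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z = scale_features(X, "standard")
m = scale_features(X, "minmax")
print(z)
print(m)
```

Both columns end up on the same scale after either transformation, so neither dominates the distance calculation.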