Weighted-Clustering with Data Quality Integration

This project was inspired by my Master's thesis. It implements a reliability-weighted K-means clustering algorithm that incorporates data quality metrics through Coefficient of Variation (CV) values. Designed for use with American Community Survey (ACS-5) data, this approach gives higher weight to observations with lower measurement error. It is intended for use with GeoDataFrames and can be combined with my ACS-5 project by setting include_cv_columns=True. This automatically calculates the CV for each observation, which can then be used in the weighting process. The following features are included:

Features

Reliability-Weighted Clustering
This algorithm allows you to incorporate data quality directly into the clustering process. Observations with lower Coefficient of Variation (CV) values, which indicate more reliable measurements, receive higher weights. This ensures that clusters are driven more by high-quality data than by noisy inputs.
Customizable Feature Scaling
Choose from standard (z-score), minmax, or no scaling to preprocess your data before clustering. Feature scaling is critical when variables are on different scales, and this option provides flexibility depending on your dataset's characteristics.
Silhouette Analysis & WCSS for Optimal Cluster Selection
The silhouette analysis method is integrated to help you identify the most appropriate number of clusters. This feature generates silhouette scores and WCSS for a user-defined range of cluster counts and visualizes them, aiding in model selection.
Map-Based Cluster Visualization
Designed for use with spatial data, the class can generate map plots using GeoDataFrames. It overlays clusters on a basemap (via contextily), making it easier to interpret spatial patterns in the results.
Descriptive Cluster Statistics
After fitting the model, you can retrieve a summary table with cluster-wise statistics. This includes the mean value of each feature (in original scale), the WCSS per Cluster (percentage of total and absolute value), the number of observations per cluster, and their percentage share — useful for interpreting and comparing clusters.
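To make the idea of reliability-weighted clustering concrete, here is a minimal NumPy sketch of Lloyd's algorithm in which each centroid is the weighted mean of its members. The function name `weighted_kmeans`, its parameters, and the toy data are assumptions for illustration only; this is not the project's actual class or API.

```python
import numpy as np


def _assign(X, centroids):
    # Index of the nearest centroid (squared Euclidean distance) per row of X.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def weighted_kmeans(X, weights, n_clusters, n_iter=100, seed=0):
    """Hypothetical helper: K-means with per-observation weights."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen, distinct observations.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = _assign(X, centroids)
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            w = weights[labels == k]
            # Keep the old centroid if the cluster is empty or has zero total weight.
            if w.sum() > 0:
                new_centroids[k] = (w[:, None] * X[labels == k]).sum(axis=0) / w.sum()
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return _assign(X, centroids), centroids


# Toy data: two well-separated groups with differing reliability weights.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
weights = np.array([1.0, 1.0, 0.2, 1.0, 0.5, 0.1])
labels, centroids = weighted_kmeans(X, weights, n_clusters=2)
```

Low-weight points still receive a cluster label, but they pull the centroid less than high-weight points do. In practice, scikit-learn's `KMeans.fit` accepts a `sample_weight` argument that serves the same purpose.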

Key Concepts

Coefficient of Variation (CV)

A standardized measure of estimate reliability:

$CV = \left( \frac{MOE/1.645}{Estimate} \right) \times 100$

MOE: Margin of Error (at 90% confidence level)
Threshold: estimates with CV > 30% are considered unreliable (a common convention)
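The CV formula above can be sketched in a few lines of NumPy. The function name `coefficient_of_variation` is hypothetical, and returning NaN for zero estimates is an assumption made here to avoid division by zero; the source does not say how such cases are handled.

```python
import numpy as np


def coefficient_of_variation(estimate, moe, z=1.645):
    """Hypothetical helper: CV in percent from estimates and 90% margins of error."""
    estimate = np.asarray(estimate, dtype=float)
    moe = np.asarray(moe, dtype=float)
    standard_error = moe / z  # 1.645 is the z-value for a 90% confidence level
    # Assumption: a zero estimate has no defined CV, so it is reported as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        cv = np.where(estimate != 0, standard_error / estimate * 100.0, np.nan)
    return cv


cv = coefficient_of_variation([1000.0, 200.0, 0.0], [164.5, 164.5, 50.0])
unreliable = cv > 30.0  # flag estimates above the 30% threshold
```

In this toy input the first estimate has a CV of about 10% and the second of about 50%, so only the second is flagged as unreliable.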

Reliability Weighting Principle

Weight Calculation Pipeline

We assign higher weights to more reliable estimates (lower CV):

$\text{Reliability}_i = \frac{1}{\text{MeanCV}_i + \epsilon} \quad (\epsilon = 10^{-6})$

Where:

MeanCVᵢ: Mean coefficient of variation for observation i

$\text{MeanCV}_i = \frac{1}{m} \sum_{j=1}^{m} \text{CV}_{ij}$

ε: Small constant to avoid division by zero

Min-Max Normalization

Normalize weights to [0, 1]:

$\text{Weight}_i = \frac{\text{Reliability}_i - \min(\text{Rel})}{\max(\text{Rel}) - \min(\text{Rel})}$

1 = highest reliability (lowest CV)
0 = lowest reliability (highest CV)
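The weight calculation pipeline above (mean CV per observation, inverse reliability, min-max normalization) can be sketched as follows. The function name `reliability_weights` is hypothetical, and the fallback to uniform weights when all reliabilities are equal is an assumption added to keep the normalization from dividing by zero.

```python
import numpy as np


def reliability_weights(cv_matrix, eps=1e-6):
    """Hypothetical helper: rows are observations, columns are per-feature CVs."""
    cv_matrix = np.asarray(cv_matrix, dtype=float)
    mean_cv = cv_matrix.mean(axis=1)       # MeanCV_i
    reliability = 1.0 / (mean_cv + eps)    # Reliability_i
    span = reliability.max() - reliability.min()
    if span == 0:
        # Assumption: identical reliabilities get uniform weights of 1.
        return np.ones_like(reliability)
    return (reliability - reliability.min()) / span  # Weight_i in [0, 1]


cv_matrix = np.array([[5.0, 15.0], [20.0, 20.0], [40.0, 60.0]])
weights = reliability_weights(cv_matrix)
```

Note that, by construction, the observation with the highest mean CV receives a weight of exactly 0 and therefore has no influence on weighted centroids or the weighted WCSS.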

Within-Cluster Sum of Squares (WCSS)

Measures how tightly grouped the points in a cluster are.

Cluster-Level WCSS Calculation

For each cluster k, compute the weighted squared distance to its centroid:

$\text{WCSS}_k = \sum_{i \in C_k} w_i \cdot \|x_i - \mu_k\|^2$

Where:

Cₖ: Observations in cluster k
wᵢ: Reliability weight for observation i
xᵢ: Scaled feature vector of observation i
μₖ: Centroid of cluster k

Total WCSS

$\text{Total WCSS} = \sum_{k=1}^{K} \text{WCSS}_k$

Cluster WCSS Contribution (%)

$\text{WCSS\%}_k = \left( \frac{\text{WCSS}_k}{\text{Total WCSS}} \right) \times 100$

High WCSS%: Cluster has higher internal variability
Low WCSS%: Cluster is more homogeneous
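The three WCSS quantities above can be computed together, as in this minimal sketch. The function name `weighted_wcss` and the toy inputs are assumptions for illustration, not the project's API.

```python
import numpy as np


def weighted_wcss(X, labels, centroids, weights):
    """Hypothetical helper: per-cluster WCSS, total WCSS, and percentage shares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Squared distance of each observation to its own cluster centroid.
    sq_dist = ((X - centroids[labels]) ** 2).sum(axis=1)
    # Sum w_i * ||x_i - mu_k||^2 within each cluster k.
    wcss_k = np.bincount(labels, weights=weights * sq_dist, minlength=len(centroids))
    total = wcss_k.sum()
    share = wcss_k / total * 100.0 if total > 0 else np.zeros_like(wcss_k)
    return wcss_k, total, share


X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [10.0, 2.0]])
weights = np.array([1.0, 1.0, 0.5, 0.5])
wcss_k, total, share = weighted_wcss(X, labels, centroids, weights)
```

Here the second cluster is more spread out, and even with lower weights it accounts for about two thirds of the total WCSS.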

Feature Preprocessing

Z-score Normalization

$z = \frac{x - \mu}{\sigma}$

Standardizes to mean 0, standard deviation 1

Min-Max Scaling

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

Scales features to [0, 1]
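Both scaling formulas can be sketched in one small function. The name `scale_features` is hypothetical; the option names follow the scaling choices listed under Features. Using the population standard deviation and leaving constant columns at 0 are assumptions of this sketch.

```python
import numpy as np


def scale_features(X, method="standard"):
    """Hypothetical helper: column-wise z-score, min-max, or no scaling."""
    X = np.asarray(X, dtype=float)
    if method is None or method == "none":
        return X.copy()
    if method == "standard":
        center = X.mean(axis=0)
        spread = X.std(axis=0)  # population standard deviation (ddof=0)
    elif method == "minmax":
        center = X.min(axis=0)
        spread = X.max(axis=0) - center
    else:
        raise ValueError(f"unknown scaling method: {method!r}")
    # Assumption: constant columns have zero spread and are mapped to 0.
    spread = np.where(spread == 0, 1.0, spread)
    return (X - center) / spread


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z = scale_features(X, "standard")
mm = scale_features(X, "minmax")
```

After z-score scaling each column has mean 0 and standard deviation 1; after min-max scaling each column runs from 0 to 1.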
