Weighted-Clustering with Data Quality Integration

This project was inspired by my Master's thesis. It implements a reliability-weighted K-means clustering algorithm that incorporates data quality metrics through Coefficient of Variation (CV) values. Designed for use with American Community Survey (ACS-5) data, the approach gives higher weight to observations with lower measurement error. It is meant to be used with GeoDataFrames and can be combined with my ACS-5 project by setting include_cv_columns=True, which automatically calculates the CV for each observation so it can then be used in the weighting process. The following features are included:

Features

  • Reliability-Weighted Clustering
    This algorithm lets you incorporate data quality directly into the clustering process. Observations with lower Coefficient of Variation (CV) values (i.e. more reliable measurements) receive higher weights, so clusters are driven by high-quality data rather than by noisy inputs (see the sketch after this list).

  • Customizable Feature Scaling
    Choose from standard (z-score), minmax, or no scaling to preprocess your data before clustering. Feature scaling is critical when variables are on different scales, and this option provides flexibility depending on your dataset's characteristics.

  • Silhouette Analysis & WCSS for Optimal Cluster Selection
    The silhouette analysis method is integrated to help you identify the most appropriate number of clusters. This feature generates silhouette scores and WCSS for a user-defined range of cluster counts and visualizes them, aiding in model selection.

  • Map-Based Cluster Visualization
    Designed for use with spatial data, the class can generate map plots using GeoDataFrames. It overlays clusters on a basemap (via contextily), making it easier to interpret spatial patterns in the results.

  • Descriptive Cluster Statistics
    After fitting the model, you can retrieve a summary table with cluster-wise statistics: the mean value of each feature (in the original scale), the WCSS per cluster (as a percentage of the total and as an absolute value), the number of observations per cluster, and their percentage share, which is useful for interpreting and comparing clusters.
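
The repository's own class is not reproduced here; as a minimal stand-in, the sketch below uses scikit-learn's KMeans, whose sample_weight parameter demonstrates the same reliability-weighting principle described above. The feature matrix and weights are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))      # scaled feature matrix (synthetic)
weights = rng.uniform(size=200)    # stand-in reliability weights in [0, 1]

# scikit-learn's KMeans accepts per-observation weights directly; the
# repository's class applies the same idea with CV-derived weights.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X, sample_weight=weights)
print(np.bincount(labels))         # number of observations per cluster
```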

Key Concepts

Coefficient of Variation (CV)

A standardized measure of estimate reliability:

$$CV_i = \frac{MOE_i / 1.645}{|\hat{x}_i|} \times 100$$

Where:

  • MOEᵢ: Margin of Error for estimate i (at the 90% confidence level; dividing by 1.645 converts it to a standard error)
  • x̂ᵢ: the estimate itself
  • Threshold: CV > 30% is conventionally considered unreliable
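
As a concrete illustration, the CV and the 30% reliability flag can be computed column-wise with pandas; the column names and values below are placeholders:

```python
import pandas as pd

# Placeholder data; real column names depend on the ACS variables requested.
df = pd.DataFrame({
    "estimate": [52000.0, 31000.0, 800.0],
    "moe": [4200.0, 2500.0, 700.0],   # margins of error at the 90% level
})

# CV = (MOE / 1.645) / |estimate| * 100
df["cv"] = (df["moe"] / 1.645) / df["estimate"].abs() * 100

# Conventional reliability flag: CV above 30% is considered unreliable
df["unreliable"] = df["cv"] > 30
print(df)
```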

Reliability Weighting Principle

Weight Calculation Pipeline

We assign higher weights to more reliable estimates (lower CV):

$$w_i = \frac{1}{\text{MeanCV}_i + \varepsilon}$$

Where:

  • MeanCVᵢ: Mean coefficient of variation for observation i
  • ε: Small constant to avoid division by zero

Min-Max Normalization

Normalize weights to [0, 1]:

$$\tilde{w}_i = \frac{w_i - \min_j w_j}{\max_j w_j - \min_j w_j}$$

  • 1 = highest reliability (lowest CV)
  • 0 = lowest reliability (highest CV)
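
A compact sketch of the full weighting pipeline under the definitions above. The value of ε is an assumption; the repository may use a different constant:

```python
import numpy as np

def reliability_weights(cv_matrix: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Turn per-observation CVs into [0, 1] reliability weights.

    cv_matrix has shape (n_observations, n_cv_columns); eps is an assumed
    small constant to avoid division by zero.
    """
    mean_cv = cv_matrix.mean(axis=1)              # MeanCV_i per observation
    raw = 1.0 / (mean_cv + eps)                   # inverse-CV raw weights
    # Min-max normalize: 1 = most reliable (lowest CV), 0 = least reliable
    return (raw - raw.min()) / (raw.max() - raw.min())

cv = np.array([[5.0, 8.0], [20.0, 25.0], [40.0, 55.0]])
print(reliability_weights(cv))  # -> highest weight for the lowest-CV row
```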

Within-Cluster Sum of Squares (WCSS)

Measures how tightly grouped the points in a cluster are.

Cluster-Level WCSS Calculation

For each cluster k, compute the weighted squared distance to its centroid:

$$WCSS_k = \sum_{i \in C_k} w_i \, \lVert x_i - \mu_k \rVert^2$$

Where:

  • Cₖ: Observations in cluster k
  • wᵢ: Reliability weight for observation i
  • xᵢ: Scaled feature vector of observation i
  • μₖ: Centroid of cluster k
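
A minimal NumPy sketch of this calculation. Whether the centroid is the weighted or unweighted mean of the cluster members depends on the implementation; the weighted mean is assumed here, consistent with weighted K-means updates:

```python
import numpy as np

def cluster_wcss(X, labels, weights):
    """Weighted WCSS per cluster: sum over i in C_k of w_i * ||x_i - mu_k||^2."""
    wcss = {}
    for k in np.unique(labels):
        mask = labels == k
        # Weighted centroid (assumption: centroids use the same weights)
        mu_k = np.average(X[mask], axis=0, weights=weights[mask])
        sq_dist = ((X[mask] - mu_k) ** 2).sum(axis=1)  # ||x_i - mu_k||^2
        wcss[k] = float((weights[mask] * sq_dist).sum())
    return wcss
```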

Total WCSS

$$WCSS_{\text{total}} = \sum_{k=1}^{K} WCSS_k$$

Cluster WCSS Contribution (%)

$$\text{WCSS}\%_k = \frac{WCSS_k}{WCSS_{\text{total}}} \times 100$$

  • High WCSS%: Cluster has higher internal variability
  • Low WCSS%: Cluster is more homogeneous
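
Continuing the sketch above, the total and the per-cluster percentage shares follow directly (inputs are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))               # scaled features (synthetic)
labels = rng.integers(0, 3, size=120)       # example cluster assignments
weights = rng.uniform(0.1, 1.0, size=120)   # example reliability weights

wcss = cluster_wcss(X, labels, weights)     # function from the sketch above
total = sum(wcss.values())                  # total WCSS across all clusters
shares = {k: 100 * v / total for k, v in wcss.items()}  # contribution in %
print(shares)
```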

Feature Preprocessing

Z-score Normalization

$$z = \frac{x - \mu}{\sigma}$$

  • Standardizes each feature to mean 0, standard deviation 1

Min-Max Scaling

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

  • Scales each feature to [0, 1]
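
Both options map directly onto scikit-learn's scalers. Below is a sketch of what the scaling switch might look like; whether the class actually delegates to scikit-learn is an assumption, not the confirmed code:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale_features(X: np.ndarray, method: str = "standard") -> np.ndarray:
    """Apply the chosen preprocessing; any other value means no scaling.

    Option names mirror the README ('standard', 'minmax', or none).
    """
    if method == "standard":
        return StandardScaler().fit_transform(X)  # z-score: mean 0, std 1
    if method == "minmax":
        return MinMaxScaler().fit_transform(X)    # rescale each feature to [0, 1]
    return X                                      # no scaling

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
print(scale_features(X, "minmax"))
```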
