
Machine Learning Overview Series (8) - Kmeans

K-means: a simple clustering algorithm based on distance. From a probabilistic perspective it can be interpreted as a GMM (Gaussian Mixture Model) with unit-variance components.

1. Introduction

K-means can be quickly implemented from scratch:

Step 1: Any data preprocessing (standardization, etc.);
Step 2: Initialization: select k, then randomly assign (hard assignment) a cluster label (from 1 to k) to each sample;
Step 3: While the cluster labels have not stabilized (converged): calculate the center of each cluster, then re-assign cluster labels;
Step 4: Output the results (cluster labels for each sample and the cluster centers).

Python code:

import numpy as np

def standardize_data(X):
    # center each feature at 0 and scale it to unit variance
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

def calculate_distance(x, center):
    # squared Euclidean distance between a 1-d sample x and a center
    diff = x - center
    return diff.dot(diff.T)

def calculate_centers(X, cluster):
    # center of each cluster = mean of the samples assigned to it
    num_cluster = len(set(cluster))
    num_features = X.shape[1]
    centers = np.zeros((num_cluster, num_features))
    for i in range(num_cluster):
        centers[i] = np.mean(X[cluster == i, :], axis=0)
    return centers

def assign_cluster_label(x, centers):
    # assign a sample to the nearest center
    num_cluster = centers.shape[0]
    distance = np.zeros(num_cluster)
    for i in range(num_cluster):
        distance[i] = calculate_distance(x, centers[i, :])
    return np.argmin(distance)

def kmeans(X, k=3, random_state=2020):
    np.random.seed(random_state)
    X_scaled = standardize_data(X)
    num_obs, num_features = X_scaled.shape
    centers = np.zeros((k, num_features))
    # each column of assignment_matrix records the labels of one iteration
    assignment_matrix = np.zeros((num_obs, 1))
    prev_clusters = assignment_matrix[:, 0]
    # initialization: random hard assignment of labels 0..k-1
    curr_clusters = np.random.choice(np.arange(k), size=num_obs, replace=True)
    while not np.array_equal(prev_clusters, curr_clusters):
        assignment_matrix = np.hstack((assignment_matrix, curr_clusters[:, np.newaxis]))
        prev_clusters = curr_clusters.copy()
        # update step: recompute the center of each cluster
        centers = calculate_centers(X_scaled, curr_clusters)
        # assignment step: re-assign each sample to its nearest center
        for i in range(num_obs):
            curr_clusters[i] = assign_cluster_label(X_scaled[i, :], centers)
        if np.array_equal(prev_clusters, curr_clusters):
            break
    # shift labels to 1..k for the output
    return curr_clusters + 1, centers, assignment_matrix
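
A quick sanity check of the implementation on synthetic data (the data below is made up for illustration and assumes the code above has been run):

import numpy as np

# three well-separated Gaussian blobs in 2-d (made-up example data)
X = np.vstack([
    np.random.normal(loc=0.0, scale=1.0, size=(50, 2)),
    np.random.normal(loc=5.0, scale=1.0, size=(50, 2)),
    np.random.normal(loc=-5.0, scale=1.0, size=(50, 2)),
])
labels, centers, history = kmeans(X, k=3)
print(labels[:10])            # cluster labels from 1 to k
print(centers)                # k x num_features centers on the standardized scale
print(history.shape[1] - 1)   # number of iterations until convergence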

The Kmeans clustering algorithm always converges to a local minimum of

\[
W(C) = \frac{1}{2}\sum^{K}_{k=1}\sum_{C(i) = k}\sum_{C(j) = k}d(x_{i}, x_{j}),
\]

Intuitively, it reduces the pairwise distances between samples within each cluster, but the actual result depends on the initialization of the algorithm.
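
For reference, \(W(C)\) can be evaluated directly from a labeling; a minimal sketch reusing the calculate_distance helper above (the function name is my own):

def within_cluster_scatter(X, cluster):
    # W(C) = 1/2 * sum over clusters of all pairwise (squared Euclidean) distances
    # between samples in the same cluster
    total = 0.0
    for k in set(cluster):
        members = X[cluster == k, :]
        for i in range(members.shape[0]):
            for j in range(members.shape[0]):
                total += calculate_distance(members[i, :], members[j, :])
    return 0.5 * total

Evaluating it on successive columns of the assignment_matrix returned by kmeans shows how the within-cluster scatter evolves across iterations.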

Kmeans clustering can also be interpreted as a finite mixture model with unit-variance normal distributions. Mixture models can be described as two-stage sampling procedures: first draw a component (cluster), then draw a sample from that component's distribution.

Something to notice:

  • Euclidean distance can be replaced with other measures, for example a correlation-based dissimilarity (see the sketch after this list)
  • An alternative: K-medoids, where the centers are actual samples (useful e.g. when features are binary 0/1 and the mean is not a meaningful center)
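
For instance, swapping the squared Euclidean distance for a correlation-based dissimilarity only requires changing calculate_distance; a hypothetical sketch (not from the original post):

import numpy as np

def correlation_distance(x, center):
    # 1 - Pearson correlation: small when the two profiles move together
    return 1.0 - np.corrcoef(x, center)[0, 1]

Note that with a non-Euclidean dissimilarity the cluster mean is no longer guaranteed to be the best representative, which is exactly the situation where K-medoids (centers restricted to actual samples) is the natural alternative.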

2. Selecting K

Selecting K is also known as model order selection in mixture models.

2.1 Penalization Method

The log-likelihood of the model is

\[
L(x;\theta) = \log\prod^{n}_{i=1}\left(\sum^{K}_{k=1}c_{k}p(x_{i}|\theta_{k})\right),
\]

and it looks like we just need to select the \(K\) that maximizes the likelihood, but this is not true: the optimum is achieved when each data point forms its own cluster, which is overfitting.

Therefore, adding a penalty term to the likelihood could work: select the \(K\) that maximizes the penalized log-likelihood \(L(x;\theta) - \phi(K)\). Two typical choices:

\[
\begin{aligned}
\phi_{\text{AIC}}(K) &= K \\
\phi_{\text{BIC}}(K) &= \frac{1}{2}K\log n
\end{aligned}
\]

These penalties correspond to AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), respectively.
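
Since Kmeans corresponds to a Gaussian mixture, one hedged way to apply this in practice is to fit mixture models for a range of \(K\) and compare penalized likelihoods; the sketch below uses scikit-learn's GaussianMixture, which exposes AIC and BIC directly, and is not code from the original post:

import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_range=range(1, 11), random_state=2020):
    # fit a Gaussian mixture for each candidate K and keep the one with the smallest BIC
    scores = {}
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        scores[k] = gmm.bic(X)   # use gmm.aic(X) for the AIC variant
    best_k = min(scores, key=scores.get)
    return best_k, scores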

2.2 Stability Method

Compute a stability score: randomly split the data into two parts of equal size and fit one model on each part; then, for each data point, compare the cluster label it gets from model one with the label predicted by model two. Intuitively, this measures how many points are assigned to a different cluster under the two models. The \(K\) with the minimum stability score is the better choice.
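
A minimal sketch of one common variant of this idea (my own assumptions, using scikit-learn's KMeans; disagreement is counted over pairs of points so that label permutations between the two models do not matter):

import numpy as np
from sklearn.cluster import KMeans

def stability_score(X, k, random_state=2020):
    rng = np.random.RandomState(random_state)
    idx = rng.permutation(X.shape[0])
    half = X.shape[0] // 2
    X1, X2 = X[idx[:half]], X[idx[half:]]
    # fit one model on each half
    model1 = KMeans(n_clusters=k, random_state=random_state).fit(X1)
    model2 = KMeans(n_clusters=k, random_state=random_state).fit(X2)
    # label the second half with both models
    labels1 = model1.predict(X2)
    labels2 = model2.labels_
    # fraction of point pairs grouped together by one model but separated by the other
    n = X2.shape[0]
    disagree = 0
    for i in range(n):
        for j in range(i + 1, n):
            disagree += int((labels1[i] == labels1[j]) != (labels2[i] == labels2[j]))
    return disagree / (n * (n - 1) / 2)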

2.3 Gap Statistic Method

Let \(D_{r}\) be the sum of all pairwise distances for all data points in cluster \(r\):

\[
D_{r} = \sum_{i, j\in C_{r}}d_{ij}.
\]

Define the pooled within-cluster dispersion \(W_{K}\):

\[
W_{K} = \sum^{K}_{r=1}\frac{1}{2n_{r}}D_{r}.
\]

The idea is to standardize the curve of \(\log W_{K}\) by comparing it with its expectation under an appropriate null reference distribution of the data. The optimal \(K\) is the one for which \(\log W_{K}\) falls the farthest below the reference curve; therefore the Gap statistic is defined as

\[
Gap_{n}(K) = E_{n}\left[\log W_{K}\right] - \log W_{K}.
\]
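
With squared Euclidean distance, \(W_{K}\) reduces to the total within-cluster sum of squares (KMeans' inertia_), so a hedged sketch of the Gap statistic with a uniform reference distribution over the bounding box of the data (my own simplification) looks like this:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, random_state=2020):
    rng = np.random.RandomState(random_state)
    # log W_K on the observed data
    log_wk = np.log(KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_)
    # log W_K on reference data drawn uniformly from the bounding box of X
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_log_wk = np.zeros(n_refs)
    for b in range(n_refs):
        X_ref = rng.uniform(mins, maxs, size=X.shape)
        ref_log_wk[b] = np.log(KMeans(n_clusters=k, random_state=random_state).fit(X_ref).inertia_)
    return ref_log_wk.mean() - log_wk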

2.4 Elbow Method

Compute the within-cluster sum of squares (WSS), i.e. the sum of squared distances of data points to their cluster centers, for a range of \(K\). Choose the \(K\) at which the decrease in WSS first starts to diminish (the "elbow" of the curve). From my perspective this criterion is somewhat ambiguous.
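
A minimal sketch of the WSS curve (scikit-learn's inertia_ attribute is exactly the WSS; the function name is my own):

from sklearn.cluster import KMeans

def wss_curve(X, k_max=10, random_state=2020):
    # WSS for K = 1..k_max; plot it and look for the "elbow" where the decrease flattens
    return [KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_
            for k in range(1, k_max + 1)]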

2.5 Silhouette Method

The Silhouette value measures how similar a point is to its own cluster compared to other clusters. It is calculated for each data point from the mean within-cluster distance \(a\) and the mean nearest-cluster distance \(b\): \((b - a) / \max(a, b)\). The \(K\) with the highest mean Silhouette value is considered the better choice.
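
scikit-learn provides the mean Silhouette value directly; a short sketch (the function name is my own):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_curve(X, k_max=10, random_state=2020):
    # mean Silhouette value for each K >= 2 (it is undefined for a single cluster)
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores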

2.6 Summary

The penalization method comes from the mixture model perspective; the other methods are based on distance measures, since distance is the key concept in clustering algorithms.

3. References