Clustering is the most well-known unsupervised learning technique. The goal of clustering is to discover groups in observations. The groups are called clusters.
Data points in the same cluster are similar to each other, while points in different clusters are relatively dissimilar.
There are many clustering algorithms. In this notebook, we will focus on two of them:
One that requires the number of clusters (\(k\)) to be specified: K-means.
And another that does NOT require the number of clusters to be specified: DBSCAN.
To compare the performance of the clustering algorithms, we will apply both to the same six datasets in the code below; these datasets capture a wide variety of patterns and structures.
Note that the data sets are not labeled. Also note that unsupervised learning algorithms do not work only with 2-dimensional data but with data of any dimensionality. Here we use 2-dimensional data only to be able to visualize the results.
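If the cell that loads these datasets is not shown, a minimal sketch that loads the three datasets actually used in this section from the course repository might look like the following (the remaining datasets would follow the same naming pattern):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/data/clusters/"

# Only the three dataset names referenced later in this section are loaded here
names = ['blobs', 'noisy_circles', 'noisy_moons']
datasets = {name: pd.read_csv(url + name + ".csv", index_col=0) for name in names}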
K-means
The k-means algorithm is a simple and popular clustering algorithm. It is an iterative algorithm that partitions the data points into a pre-specified \(k\) number of clusters.
The algorithm works as follows:
Start: Select \(k\) random points as the initial centroids.
Update Cluster Assignments: Assign each data point to the cluster with the nearest centroid.
Update Cluster Centers: Update the centroids of the clusters by taking the average of the data points in each cluster.
Repeat steps 2 and 3 until the centroids do not change.
The animation below visualizes the algorithm:
The algorithm is guaranteed to converge to a result. However, the result may not be the optimal one.
Because of the random initialization, the algorithm may converge to different results on different runs. Algorithms or processes that contain an element of randomness, yet stay within some bounds of predictability, are called stochastic algorithms or processes.
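To see this stochastic behavior in a small sketch (using a synthetic dataset generated with make_blobs, not one of the six datasets below), we can fit KMeans with a single random initialization per run (n_init=1) and different seeds; the final inertia (the sum of squared distances to the nearest centroid) may then differ from run to run:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data for illustration only (not one of the course datasets)
X_toy, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=7)

# One random initialization per run makes the randomness visible:
# different seeds can end in different local optima, i.e. different final inertia
for seed in range(4):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X_toy)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")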
Let’s try the k-means algorithm on the blobs dataset first. Note that the raw data has just two features (x1 and x2) but no labels.
X = datasets['blobs']
X.head()
         x1        x2
0 -5.730354 -7.583286
1  1.942992  1.918875
2  6.829682  1.164871
3 -2.901306  7.550771
4  5.841093  1.565094
The code below plots the raw data as a scatter plot.
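A minimal version of that plotting cell might look like this (the exact styling in the notebook may differ):

sns.scatterplot(data=X, x='x1', y='x2')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Blobs dataset (raw, unlabeled)');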
Note that there are clearly three clusters in the data where the points within each cluster are closer to each other compared to points across clusters.
We will use the KMeans class from the sklearn.cluster module.
The constructor of the KMeans class takes the number of clusters \(k\) as input.
The KMeans class has a fit() method that takes the data as input and runs the k-means algorithm on it.
After we fit the model to the data, we can use the .labels_ attribute to get the cluster label assigned to each data point by the algorithm.
Below we add a third column, cluster, to the data, holding the cluster label assigned to each data point by the k-means algorithm.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

X['cluster'] = kmeans.labels_
X.head()
         x1        x2  cluster
0 -5.730354 -7.583286        1
1  1.942992  1.918875        2
2  6.829682  1.164871        2
3 -2.901306  7.550771        0
4  5.841093  1.565094        2
The code below plots the data again, but this time with the cluster labels.
plt.scatter(X['x1'], X['x2'], c=X['cluster'], cmap='viridis')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('KMeans Clustering of Blobs dataset');
Note that the k-means algorithm has perfectly discovered the three blobs in the data.
Implementation
import pandas as pd
import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

def initialize_centroids(df, k):
    """Randomly initialize centroids by selecting k rows from the DataFrame."""
    return df.sample(n=k, random_state=0).reset_index(drop=True)

def assign_clusters(df, centroids):
    """Assign each data point to the nearest centroid."""
    distances = pd.DataFrame(
        {i: ((df - centroids.iloc[i]) ** 2).sum(axis=1) ** 0.5
         for i in centroids.index}
    )
    return distances.idxmin(axis=1)

def update_centroids(df, clusters, k):
    """Update centroids as the mean of all points in each cluster."""
    new_centroids = df.groupby(clusters).mean()
    return new_centroids.reindex(range(k)).reset_index(drop=True)

def kmeans(df, k, max_iters=1000, tolerance=1e-4):
    """K-Means clustering algorithm."""
    # Step 1: Initialize centroids
    centroids = initialize_centroids(df, k)
    for iteration in range(max_iters):
        # Step 2: Assign clusters
        clusters = assign_clusters(df, centroids)
        # Step 3: Update centroids
        new_centroids = update_centroids(df, clusters, k)
        # Step 4: Check for convergence
        if (centroids - new_centroids).abs().sum().sum() < tolerance:
            break
        centroids = new_centroids
    return clusters, centroids

X = datasets['blobs']
clusters, centroids = kmeans(X[['x1', 'x2']], k=3)
X['cluster'] = clusters

plt.scatter(X['x1'], X['x2'], c=X['cluster'], cmap='viridis')
plt.scatter(centroids['x1'], centroids['x2'], c='red', marker='x', s=100);
Limitations of K-means
K-means is a simple and popular clustering algorithm. However, it has some limitations:
It requires the number of clusters \(k\) to be specified. If the number of clusters is not known in advance, then we need to try different values of \(k\) and select the one that gives the best results.
It is sensitive to the initial random selection of centroids. The algorithm may converge to different results on different runs.
Since k-means relies on averages, it is sensitive to outliers. Outliers can significantly affect the location of the centroids and hence the clusters (see the sketch after this list).
Most importantly, k-means does not work well with clusters of different sizes and densities. It assumes that the clusters are spherical and of similar size.
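As an added illustration of the sensitivity to outliers, the sketch below appends a single extreme point to the blobs data and re-runs k-means; depending on the initialization, the outlier may pull one centroid towards it or even end up as a "cluster" of its own:

# Illustrative sketch: add a single extreme outlier to the blobs data and re-run k-means
X_out = datasets['blobs'][['x1', 'x2']].copy()
X_out.loc[len(X_out)] = [100.0, 100.0]   # hypothetical extreme point

kmeans_out = KMeans(n_clusters=3, random_state=0).fit(X_out)

print(pd.Series(kmeans_out.labels_).value_counts())   # cluster sizes
print(kmeans_out.cluster_centers_)                    # note how the centroids shift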
To illustrate the last of these limitations, let’s try the k-means algorithm on a dataset that does not satisfy the assumptions of the algorithm.
# Before clustering
X = datasets['noisy_circles']
print(X.head())

# Applying KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
X['cluster'] = kmeans.labels_

# After clustering
plt.scatter(X['x1'], X['x2'], c=X['cluster'], cmap='viridis')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('KMeans Clustering of Noisy Circles dataset');
Note how the k-means algorithm fails to discover the two clusters in the data. This is because the clusters are (a) not spherical (they are concentric rings) and (b) not of similar size.
Such failures of a clustering algorithm can only be detected by either visualizing the results or computing internal cluster validation metrics such as the silhouette score.
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based clustering algorithm. It is a popular clustering algorithm because it does not require the number of clusters to be specified. It can discover clusters of arbitrary shapes. It can also identify outliers in the data.
The algorithm has two parameters:
\(\epsilon\): the maximum distance between two points for them to be considered part of the same neighborhood.
\(m\): minimum number of points required to form a dense region.
The algorithm operates as follows:
An unvisited point in the dataset is selected as the starting point.
Using a distance threshold \(\epsilon\), the algorithm identifies all neighboring points.
If the neighborhood contains at least \(m\) points, a new cluster is initiated, including the starting point and all neighbors. The algorithm then iteratively explores the neighbors of each point in the cluster, adding their neighborhoods to the cluster if they also meet the \(m\)-point criterion.
If the neighborhood contains fewer than \(m\) points, the starting point is marked as noise.
The process continues until all points have been visited.
Note that, unlike k-means, DBSCAN does not rely on a random initialization: given the same data set and the same parameters, it produces the same clusters (the only ambiguity is the assignment of border points that lie within \(\epsilon\) of more than one cluster).
X = datasets['noisy_circles']
print(X.head())

sns.scatterplot(data=X, x='x1', y='x2');
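Applying sklearn's DBSCAN class to this dataset might look like the sketch below; the eps value here is a guess and may need tuning to the scale of the data:

from sklearn.cluster import DBSCAN

dbs = DBSCAN(eps=0.2, min_samples=5)
dbs.fit(X[['x1', 'x2']])
X['cluster'] = dbs.labels_   # -1 marks points labeled as noise

plt.scatter(X['x1'], X['x2'], c=X['cluster'], cmap='viridis')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('DBSCAN Clustering of Noisy Circles dataset');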
Just because DBSCAN does better than k-means on the circles dataset, it does not mean that DBSCAN is always better than k-means. Each clustering algorithm has its own strengths and weaknesses.
Implementation
import pandas as pd

def euclidean_distance(row1, row2):
    """Compute Euclidean distance between two points."""
    return ((row1 - row2) ** 2).sum() ** 0.5

def get_neighbors(df, point_idx, eps):
    """Find neighbors within `eps` distance of a point."""
    point = df.iloc[point_idx]
    neighbors = df.apply(lambda row: euclidean_distance(point, row), axis=1)
    return list(neighbors[neighbors <= eps].index)

def expand_cluster(df, labels, point_idx, neighbors, cluster_id, eps, min_samples):
    """Expand a cluster starting from a core point."""
    # Assign the point to the cluster
    labels[point_idx] = cluster_id
    i = 0
    while i < len(neighbors):
        neighbor_idx = neighbors[i]
        # If the point is noise, make it a part of the cluster
        if labels[neighbor_idx] == -1:
            labels[neighbor_idx] = cluster_id
        # If the point is unvisited, mark it as visited and add its neighbors to the list
        elif labels[neighbor_idx] == 0:
            labels[neighbor_idx] = cluster_id
            new_neighbors = get_neighbors(df, neighbor_idx, eps)
            # If the point is a core point, merge its neighbors
            if len(new_neighbors) >= min_samples:
                neighbors = neighbors + [n for n in new_neighbors if n not in neighbors]
        i += 1

def dbscan(df, eps, min_samples):
    """DBSCAN clustering algorithm."""
    # 0 means unvisited; -1 means noise; other numbers indicate the cluster ID
    labels = [0] * len(df)
    cluster_id = 0
    for point_idx in range(len(df)):
        # Skip if the point is already visited
        if labels[point_idx] != 0:
            continue
        neighbors = get_neighbors(df, point_idx, eps)
        # Mark as noise if not enough neighbors
        if len(neighbors) < min_samples:
            labels[point_idx] = -1
        else:
            # Start a new cluster
            cluster_id += 1
            expand_cluster(df, labels, point_idx, neighbors, cluster_id, eps, min_samples)
    return labels

X = datasets['noisy_moons']
labels = dbscan(X[['x1', 'x2']], eps=0.2, min_samples=5)
X['cluster'] = labels

plt.scatter(X['x1'], X['x2'], c=X['cluster'], cmap='viridis');
Note that the two algorithms work better on different datasets.
Furthermore, the parameters of the two algorithms (\(k\) for k-means, and \(\epsilon\) and \(m\) for DBSCAN) need to be tuned to get the best results for a given dataset.
Challenges
Limitations of Clustering
Note that not all data sets are suitable for clustering. Some data sets do not have a well-defined cluster structure.
For example, below we try the k-means algorithm on the sentiments dataset. We know that the data set has three classes: positive, negative, and neutral. However, the k-means algorithm fails to discover the three classes. This is because the data set does not have a well-defined cluster structure.
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/data/chat_dataset.csv')
data.head()
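A rough sketch of how k-means might be applied to this text data is shown below. The column names message and sentiment are assumptions (adjust them to the actual columns of the CSV), and the text must first be converted into numeric features, here with a TF-IDF vectorizer, before k-means can be applied:

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed column names; adjust to the actual CSV schema
texts = data['message']
vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(texts)

kmeans_text = KMeans(n_clusters=3, random_state=0)
kmeans_text.fit(features)

# Cross-tabulate discovered clusters against the known sentiment labels
print(pd.crosstab(data['sentiment'], kmeans_text.labels_))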
Take caution when comparing the discovered clusters with any available labels for a dataset.
In clustering, the label values are arbitrary. For example, if we discover three clusters, we can label them as 0, 1, and 2, or as 1, 2, and 3, or as 100, 200, and 300.
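For this reason, comparing discovered clusters against ground-truth labels should use a permutation-invariant measure. One common option (not used elsewhere in this notebook) is the adjusted Rand index from sklearn, sketched below:

from sklearn.metrics import adjusted_rand_score

# The same partition written with different label values still scores 1.0
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [100, 100, 300, 300, 200, 200]
print(adjusted_rand_score(labels_a, labels_b))   # 1.0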
The Silhouette Score is calculated using the mean intra-cluster distance (\(a\)) and the mean nearest-cluster distance (\(b\)) for each sample. The Silhouette Coefficient for a sample is
\[
s = \frac{b - a}{\max(a, b)}
\]
where \(b\) is the mean distance between the sample and the points of the nearest cluster that the sample is not a part of.
Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.
The sklearn.metrics.silhouette_score function returns the mean Silhouette Coefficient over all samples. To obtain the value for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. A score near 1 indicates that the sample is far away from neighboring clusters; a score near 0 indicates that it lies close to the boundary between two neighboring clusters (overlapping clusters); and a negative score indicates that the sample may have been assigned to the wrong cluster, since a different cluster is more similar.
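As a quick sketch reusing the blobs clustering from earlier in this section, both functions can be called as follows:

from sklearn.metrics import silhouette_score, silhouette_samples

X = datasets['blobs'][['x1', 'x2']]
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

print(silhouette_score(X, kmeans.labels_))        # mean coefficient over all samples
print(silhouette_samples(X, kmeans.labels_)[:5])  # per-sample coefficients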
Silhouette Score to identify \(k\), \(\epsilon\) and min_samples
from sklearn.metrics import silhouette_score

sscores = {'noisy_circles': [], 'blobs': []}
ks = [2, 3, 4, 5, 10, 15]

url = "https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/data/clusters/"

for name in ['noisy_circles', 'blobs']:
    X = pd.read_csv(url + name + ".csv", index_col=0)
    for k in ks:
        kmeans = KMeans(n_clusters=k, random_state=0)
        kmeans.fit(X)
        score = silhouette_score(X, kmeans.labels_)
        sscores[name].append(score)

ax = sns.lineplot(x=ks, y=sscores['noisy_circles'], marker='s')
ax = sns.lineplot(x=ks, y=sscores['blobs'], marker='s')
ax.set(xlabel='k', ylabel='silhouette_score')
plt.grid()
plt.legend(['noisy_circles', 'blobs'])
plt.title('Because K-Means does not work for `noisy circles` data, \nSilhouette Score never reaches close to 1 for any `k`');