Cluster Analysis: Concepts and Types

Posted: 14/05/2025

Cluster analysis groups data objects into clusters based only on information found in the data itself. The goal is to produce clusters such that the members of a single cluster are similar to one another, while objects belonging to different clusters are dissimilar.

The main motivations for performing cluster analysis are:

Types of Clusterings

A clustering is a collection of clusters.

Types of Clusters

Cluster Validation

Once we have used an algorithm to produce a clustering of the data, we need some way to evaluate its "goodness". Some aspects of cluster validation are:

Note that all of these except the last require no external information, so they can be performed unsupervised; the last can be performed both supervised and unsupervised.

The evaluation measures (or indices) used to judge different aspects of cluster validity can be classified into three types:

In this section, we mainly focus on internal measures.

Internal measures

Cohesion and separation

Many internal measures are based on the notions of cohesion and separation; these measures are also called proximity graph-based. Cluster cohesion is the sum of the weights of all the links within a cluster, while separation is the sum of the weights of the links between nodes inside the cluster and nodes outside of it.

Cohesion can be calculated as:

\[ \text{SSE} = \text{WSS} = \sum_{i=1}^{K} \sum_{x \in C_i} (x - m_i)^2 \]

while separation can be calculated as:

\[ \text{BSS} = \sum_{i=1}^{K} |C_i| \, (m - m_i)^2 \]

where $K$ is the number of clusters, $C_i$ is the $i$-th cluster with centroid $m_i$ and size $|C_i|$, and $m$ is the overall mean of the data. Note that $\text{WSS} + \text{BSS}$ equals the total sum of squares of the data, a constant, so minimizing cohesion and maximizing separation are two sides of the same objective.
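As a concrete illustration, both quantities can be computed with a few lines of NumPy. The dataset and labels below are a hypothetical toy example; `m` is the overall mean and `m_k` the centroid of each cluster:

```python
import numpy as np

# Toy dataset: two well-separated 1-D clusters (hypothetical example data).
X = np.array([1.0, 2.0, 3.0, 9.0, 10.0, 11.0]).reshape(-1, 1)
labels = np.array([0, 0, 0, 1, 1, 1])

m = X.mean(axis=0)                      # overall mean of the data
wss = 0.0                               # cohesion: within-cluster sum of squares
bss = 0.0                               # separation: between-cluster sum of squares
for k in np.unique(labels):
    C = X[labels == k]
    m_k = C.mean(axis=0)                # centroid of cluster C_k
    wss += ((C - m_k) ** 2).sum()
    bss += len(C) * ((m - m_k) ** 2).sum()

tss = ((X - m) ** 2).sum()              # total sum of squares
print(wss, bss, tss)                    # WSS + BSS equals TSS
```

Running this confirms the identity $\text{WSS} + \text{BSS} = \text{TSS}$ on the toy data.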

The validity of the clustering is calculated as:

\[ \text{Validity} = \sum_{i=1}^{K} w_i \, \text{Validity}(C_i) \]

where the Validity() function can measure cohesion, separation, or a combination of the two. The weights vary depending on the measure used: sometimes they are all equal to 1 or to K, while other measures set them in different, more complex ways.

Similarity Matrix

This method lets us assess the quality of a clustering graphically. If we order the rows and columns of a similarity matrix by cluster label and plot it, well-separated clusters will in theory produce a block-diagonal matrix. Even when they do not, the patterns displayed can still reveal relationships between clusters.
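A minimal NumPy sketch of the reordering step (plotting omitted), using hypothetical toy data: after sorting rows and columns by cluster label, the diagonal blocks hold high intra-cluster similarities and the off-diagonal blocks hold low inter-cluster ones.

```python
import numpy as np

# Toy dataset: two interleaved 1-D clusters (hypothetical example data).
X = np.array([1.0, 10.0, 1.5, 10.5, 2.0, 11.0])
labels = np.array([0, 1, 0, 1, 0, 1])

# Pairwise distances, converted to similarities in [0, 1].
D = np.abs(X[:, None] - X[None, :])
S = 1.0 - D / D.max()

# Reorder rows and columns so points in the same cluster are adjacent.
order = np.argsort(labels, kind="stable")
S_sorted = S[np.ix_(order, order)]

# With well-separated clusters the sorted matrix is nearly block-diagonal:
# high similarity inside the diagonal blocks, low similarity outside.
within = S_sorted[:3, :3].mean()
between = S_sorted[:3, 3:].mean()
print(within, between)
```

Plotting `S_sorted` as a heatmap (e.g. with `matplotlib.pyplot.imshow`) makes the block structure visible at a glance.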

Correlation

If we have the distance/similarity matrix of a dataset, and the cluster labels produced by our cluster analysis, we can evaluate the goodness of the clustering by looking at the correlation between the matrix and an ideal version of it produced by looking at the cluster labels.

More specifically, an ideal cluster is one whose points all have a similarity of 1 to every other point in the same cluster, and a similarity of 0 to all the points in other clusters. The ideal cluster similarity matrix is constructed with one row and one column per data point, setting an entry to 1 if the corresponding pair of points belongs to the same cluster and to 0 otherwise. If we sort the rows and columns so that all objects belonging to the same cluster are adjacent, this ideal matrix has a block-diagonal structure: the similarity is non-zero in the blocks whose entries represent intra-cluster similarity, and 0 elsewhere.

If the two matrices have high correlation, it means that points belonging to the same cluster are close, while low correlation indicates the opposite.

Since both matrices are symmetric, the correlation is calculated only among the

\[ \frac{n(n-1)}{2} \]

entries above (or below) the main diagonal, where $n$ is the number of data points.
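This measure can be sketched in a few lines of NumPy; the dataset and labels are a hypothetical toy example:

```python
import numpy as np

# Toy dataset and cluster labels (hypothetical example data).
X = np.array([1.0, 1.5, 2.0, 10.0, 10.5, 11.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Similarity matrix from pairwise distances.
D = np.abs(X[:, None] - X[None, :])
S = 1.0 - D / D.max()

# Ideal similarity matrix: 1 if the two points share a cluster, else 0.
ideal = (labels[:, None] == labels[None, :]).astype(float)

# Correlate only the n(n-1)/2 entries above the diagonal (both matrices
# are symmetric, and the diagonal is trivially 1).
iu = np.triu_indices(len(X), k=1)
corr = np.corrcoef(S[iu], ideal[iu])[0, 1]
print(corr)
```

For this well-clustered toy data the correlation comes out close to 1.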

Silhouette Coefficient

The popular method of silhouette coefficients combines both cohesion and separation. It can be used for individual points, clusters, or clusterings.

The steps to calculate the silhouette coefficient for a point are the following (using distances, but similarities can be used as well):

  1. For the i-th object, calculate its average distance to all other objects in the same cluster; call this value $a_i$.
  2. For the i-th object and for any cluster not containing it, calculate the average distance between the point and all the objects in a given cluster. Find the minimum of these averages with respect to all clusters; call this value $b_i$.
  3. For the i-th object, the silhouette coefficient is \[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} . \]

The value of the coefficient therefore varies between -1 and 1. A negative value is undesirable, since it means $a_i$ is greater than $b_i$: on average, the point is farther from the points of its own cluster than from those of another cluster. We ideally want the coefficient to be positive, and $a_i$ to be as close to 0 as possible.

The silhouette coefficient of a whole cluster can be calculated as the average of the coefficients of all the points in the cluster. A measure of the goodness of a clustering can then be obtained by calculating the average coefficient over all the points in the dataset.
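The steps above can be sketched from scratch in NumPy (libraries such as scikit-learn also provide this measure); the dataset and labels below are a hypothetical toy example:

```python
import numpy as np

# Toy dataset: two 1-D clusters (hypothetical example data).
X = np.array([1.0, 2.0, 3.0, 9.0, 10.0, 11.0])
labels = np.array([0, 0, 0, 1, 1, 1])
D = np.abs(X[:, None] - X[None, :])   # pairwise distance matrix

def silhouette(i):
    same = labels == labels[i]
    # a_i: average distance to the other points in the same cluster
    a = D[i, same & (np.arange(len(X)) != i)].mean()
    # b_i: smallest average distance to the points of any other cluster
    b = min(D[i, labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

scores = [silhouette(i) for i in range(len(X))]
print(np.mean(scores))   # average silhouette over the whole clustering
```

Averaging over a single cluster's points instead of the whole dataset gives that cluster's coefficient.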

SSE

The SSE can also be used to estimate the "goodness" of clusterings of more complex datasets with irregularly shaped clusters, and to compare two clusters or two clusterings by comparing their average SSE.

The SSE can also be used to estimate the number of clusters: if we run the algorithm for increasing values of K and plot the SSE against K, we will likely see a distinct knee (or elbow) in the curve, corresponding to the natural number of clusters to divide the data into. The same can be done with the silhouette coefficient, by looking at peaks instead of knees.
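This can be sketched with a minimal from-scratch k-means (farthest-point initialization plus Lloyd iterations) on synthetic data with three obvious groups; this is an illustrative toy, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data with three well-separated groups (hypothetical data).
X = np.concatenate([rng.normal(0, 0.3, 30),
                    rng.normal(5, 0.3, 30),
                    rng.normal(10, 0.3, 30)]).reshape(-1, 1)

def kmeans_sse(X, k, iters=50):
    """Basic Lloyd's k-means; returns the final SSE for a given k."""
    # Farthest-point initialization: spread the initial centers out.
    centers = [X[0]]
    while len(centers) < k:
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(2).min(1)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    # Lloyd iterations: assign points, then recompute centroids.
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(2)
        assign = d2.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(2)
    return d2.min(1).sum()

sse = {k: kmeans_sse(X, k) for k in range(1, 6)}
print(sse)  # the SSE should drop sharply up to k = 3, then flatten: the knee
```

Plotting `sse` against `k` makes the knee at the true number of groups visible.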
