Cluster Analysis: Concepts and Types
Posted: 14/05/2025
Cluster analysis groups data objects into clusters based on the information found in the data itself. The goal is to produce clusters such that all of the members of a single cluster are similar to each other, while objects belonging to different clusters are dissimilar.
The main reasons behind clustering analysis are:
- Understanding: identifying classes and groups of objects plays an important role in how people analyze the world. Humans are naturally efficient at finding clusters and classifying new data on the basis of the clusters found. Cluster analysis is also often referred to as unsupervised classification for this reason (as opposed to classic classification, which is supervised).
- Summarization: if an algorithm is applied to (accurately defined) clusters instead of the whole dataset, the efficiency of the execution may increase. This is especially important for algorithms with a complexity of O(n²), such as regression or component analysis.
Types of Clusterings
A clustering is a collection of clusters.
- Hierarchical vs Partitional: A partitional clustering is a division of the dataset into non-overlapping clusters. Hierarchical clustering finds a set of nested clusters that form a hierarchy organized as a tree.
- Exclusive vs Overlapping vs Fuzzy: An exclusive clustering assigns each object to only one cluster. Overlapping clustering may assign one object to multiple clusters. In fuzzy clustering, every object belongs to every cluster with a weight between 0 and 1 that measures how much that object belongs to a given cluster. The sum of all weights for an object must be 1.
- Complete vs Partial: A complete clustering assigns each object to a cluster, while a partial clustering does not; it may exclude noise and outliers.
- Heterogeneous vs Homogeneous: In homogeneous clustering, all clusters have the same size, shape, or density. In heterogeneous clustering, clusters may differ.
Types of Clusters
- Well-separated: A cluster is a set of objects in which each object is closer (or more similar) to every other object in its cluster than to any object in a different cluster. These clusters can have any shape.
- Prototype-based and Center-based: A cluster is defined by a prototype (often a centroid or medoid). Center-based clusters are those where each object is closer to the center of its cluster than to any other center.
- Graph-based: If data can be represented as a graph, clusters can be defined as connected components. A subtype is contiguity-based clusters, where points are connected based on distance criteria.
- Density-based: A cluster is a dense region of objects surrounded by low-density areas. This is useful for finding irregularly shaped clusters.
- Shared-property (or Conceptual clusters): A cluster is defined as a set of objects that share some common property. This is a more general definition that encompasses all of the previous ones.
- Objective function: Clusters are found by minimizing or maximizing an objective function.
Cluster Validation
Once we have used an algorithm to produce a clustering of the data, we need some way to evaluate its "goodness". Some aspects of cluster validation are:
- Determining the clustering tendency of a dataset, that is, distinguishing whether there is actual structure in the data
- Comparing the results of algorithms to externally known ones
- Evaluating how well the results fit the data even in the absence of external data
- Comparing two or more results to determine which is better
- Determining the ideal number of clusters
The evaluation measures (or indices) used to judge different aspects of cluster validity can be classified into three types:
- Unsupervised/internal indices: measure the goodness of the clustering without reference to external information. An example is the SSE. This group is also often divided into two subgroups, one containing measures of cluster cohesion and one containing measures of cluster separation: the first measures how closely related the objects in a cluster are to each other, the second measures how well separated a cluster is from the others.
- Supervised/external indices: measure the goodness of the clustering by checking how much it matches some external structure. An example is entropy, which measures how well cluster labels match externally provided class labels (see the sketch after this list).
- Relative: compare different clusterings or clusters, so they produce a relative evaluation. They are not an actual separate group of measures, but a different way of using them. Two clusterings can be evaluated with respect to one another by using SSE or entropy, for example.
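As an illustration of a supervised index, here is a minimal Python sketch of the entropy measure mentioned above, assuming NumPy is available (the function name `clustering_entropy` is ours, chosen for illustration): for each cluster we compute the entropy of its class-label distribution, then average over clusters weighted by cluster size.

```python
import numpy as np
from collections import Counter

def clustering_entropy(cluster_labels, class_labels):
    # Weighted average, over clusters, of the entropy of the class
    # distribution inside each cluster; lower values mean the cluster
    # labels match the external class labels better.
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    n = len(cluster_labels)
    total = 0.0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        counts = np.array(list(Counter(members).values()), dtype=float)
        p = counts / counts.sum()
        total += (len(members) / n) * -np.sum(p * np.log2(p))
    return total

# A clustering that matches the classes perfectly has entropy 0:
print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 0.0
```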
In this section, we mainly focus on internal measures.
Internal Measures
Cohesion and Separation
Many internal measures are based on the notions of cohesion and separation; these measures are also called proximity graph-based. Cluster cohesion is the sum of the weights of all the links within a cluster, while separation is the sum of the weights of the links between nodes in the cluster and nodes outside of it. Cohesion can be calculated as:
\[ \text{SSE} = \text{WSS} = \sum_{i=1}^{K} \sum_{x \in C_i} (x - m_i)^2 \]
while separation can be calculated as:
\[ \text{BSS} = \sum_{i=1}^{K} |C_i| \, (m - m_i)^2 \]
where $K$ is the number of clusters, $m_i$ is the centroid of cluster $C_i$, and $m$ is the overall mean of the data. The validity of the clustering is calculated as:
\[ \text{Validity} = \sum_{i=1}^{K} w_i \, \text{Validity}(C_i) \]
where the Validity() function can be cohesion, separation, or a combination of the two. The weights $w_i$ vary depending on the measure used; sometimes they are all equal to 1 or $K$, while other measures set them in different, more complex ways.
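To make the formulas concrete, here is a minimal NumPy sketch that computes both cohesion (WSS) and separation (BSS) for a given labeling; the function name `wss_bss` is our own, chosen for illustration.

```python
import numpy as np

def wss_bss(X, labels):
    # X: (n, d) data matrix; labels: cluster index per row.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.mean(axis=0)                 # overall mean of the data
    wss, bss = 0.0, 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        m_i = cluster.mean(axis=0)     # centroid of cluster C_i
        wss += np.sum((cluster - m_i) ** 2)
        bss += len(cluster) * np.sum((m - m_i) ** 2)
    return wss, bss

# WSS + BSS always equals the total sum of squares around the overall
# mean, so minimizing cohesion is equivalent to maximizing separation.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(wss_bss(X, [0, 0, 1, 1]))  # (1.0, 50.0)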
Similarity Matrix
This is a method that allows us to graphically assess the quality of a clustering. If we sort a similarity matrix by cluster labels and plot it, then, in theory, well-separated clusters will produce a block-diagonal matrix. If not, the patterns displayed can still reveal relationships between clusters.
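As a sketch of how this can be done in practice, assuming NumPy and matplotlib are available (the helper name is ours):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_similarity(S, labels):
    # Reorder the similarity matrix so points of the same cluster are
    # adjacent; well-separated clusters show up as blocks on the diagonal.
    order = np.argsort(labels)
    S_sorted = S[np.ix_(order, order)]   # reorder both rows and columns
    plt.imshow(S_sorted, cmap="viridis")
    plt.colorbar(label="similarity")
    plt.show()
```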
Correlation
If we have the distance/similarity matrix of a dataset and the cluster labels produced by our cluster analysis, we can evaluate the goodness of the clustering by looking at the correlation between the matrix and an ideal version of it built from the cluster labels.
More specifically, an ideal cluster is one whose points all have a similarity of 1 to all the other points in the same cluster, and a similarity of 0 to all the points in other clusters. If we sort the rows and columns of the similarity matrix so that all objects belonging to the same cluster are adjacent, the ideal cluster similarity matrix has a block-diagonal structure: the similarity is non-zero in the blocks whose entries represent intra-cluster similarity, and 0 elsewhere. The ideal cluster similarity matrix is thus constructed by creating a matrix with one row and one column per data point, setting an entry to 1 if the corresponding pair of points belongs to the same cluster, and to 0 otherwise.
A high correlation between the two matrices means that points belonging to the same cluster are close to each other, while a low correlation indicates the opposite. The correlation is calculated only among
\[ \frac{n(n-1)}{2} \]
entries, since both matrices are symmetric.
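A minimal sketch of this computation, assuming a precomputed similarity matrix `S` (the helper name `matrix_correlation` is our own):

```python
import numpy as np

def matrix_correlation(S, labels):
    # Correlation between a similarity matrix and the ideal one,
    # computed over the n(n-1)/2 strictly upper-triangular entries.
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 within a cluster, 0 elsewhere
    iu = np.triu_indices_from(S, k=1)
    return np.corrcoef(S[iu], ideal[iu])[0, 1]
```

Note that if a distance matrix is used instead of a similarity matrix, the sign flips: a good clustering then shows a strongly negative correlation.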
Silhouette Coefficient
The popular method of silhouette coefficients combines both cohesion and separation. It can be used for individual points, clusters, or clusterings.
The steps to calculate the silhouette coefficient for a point are the following (using distances, but similarities can be used as well):
- For the i-th object, calculate its average distance to all other objects in the same cluster; call this value $a_i$.
- For the i-th object and for each cluster not containing it, calculate the object's average distance to all the objects in that cluster. Take the minimum of these averages over all such clusters; call this value $b_i$.
- For the i-th object, the silhouette coefficient is \[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} . \]
Thus, the value of the coefficient varies between -1 and 1. A negative value is not desirable, since it would mean $a_i$ is greater than $b_i$, so the point tends to be further away from points of its own cluster than external ones. We ideally want the coefficient to be positive, and $a_i$ to be as close to 0 as possible.
The silhouette coefficient of a whole cluster can be calculated as the average of the coefficients of all the points in the cluster. A measure of the goodness of a clustering can then be obtained by calculating the average coefficient over all the points in the dataset.
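The following sketch implements the per-point computation directly with NumPy (the helper name is ours; it assumes Euclidean distances and at least two points per cluster). In practice, libraries such as scikit-learn provide this measure ready-made, e.g. sklearn.metrics.silhouette_score and silhouette_samples.

```python
import numpy as np

def silhouette_scores(X, labels):
    # Per-point silhouette coefficients s_i = (b_i - a_i) / max(a_i, b_i).
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # pairwise distances
    s = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                          # exclude the point itself from a_i
        a_i = D[i, same].mean()                  # avg distance within own cluster
        b_i = min(D[i, labels == c].mean()       # min over other clusters of the
                  for c in np.unique(labels)     # avg distance to that cluster
                  if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(silhouette_scores(X, [0, 0, 1, 1]).mean())  # near 1: compact, well-separated clusters
```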
SSE
SSE can be used to estimate the "goodness" of more complex datasets, with irregularly shaped clusters. It can also be used to compare two clusters or clusterings by comparing their average SSE.
The SSE can also be used to estimate the number of clusters: if we plot the SSE versus the number of clusters obtained by executing an algorithm multiple times, we will likely see a distinct knee in the curve that corresponds to the ideal number of clusters to divide the data into. This can also be done with the silhouette coefficient, by looking at peaks instead of knees.
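As a sketch of this procedure, assuming scikit-learn and matplotlib are available (the toy data is purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Run k-means for several values of k and plot SSE (inertia_) against k;
# a distinct knee in the curve suggests a natural number of clusters.
X = np.random.default_rng(0).normal(size=(300, 2))  # toy data, replace with your own
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()
```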