Cluster Analysis: Concepts and Types
Posted: 14/05/2025
Cluster analysis groups data objects into clusters based only on information found in the data itself. The goal is to produce clusters whose members are similar to each other, while objects belonging to different clusters are dissimilar.
The two main motivations for cluster analysis are:
- Understanding: identifying classes and groups of objects plays an important role in how people analyze the world. Humans are naturally good at finding clusters and at classifying new data on the basis of the clusters found. Because the groups are derived from the data alone rather than from pre-labeled examples, cluster analysis is often referred to as unsupervised classification (as opposed to classic classification, which is supervised).
- Summarization: if an algorithm is applied to a representative of each (accurately defined) cluster instead of to the whole dataset, it has far fewer objects to process and may run much faster. This is especially important for algorithms whose time or space complexity is O(n²) or worse in the number of objects n, such as some regression or principal component analysis (PCA) techniques.
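To make the summarization argument concrete, the sketch below counts the pairwise comparisons a quadratic procedure would perform on a full dataset versus on one representative per cluster. The function name `pairwise_comparisons` and both sizes are illustrative assumptions, not figures from a benchmark.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_objects = 10_000  # hypothetical dataset size
n_clusters = 100    # hypothetical number of cluster representatives

full = pairwise_comparisons(n_objects)      # work on every object
reduced = pairwise_comparisons(n_clusters)  # work on representatives only

print(f"full dataset:    {full:,} comparisons")
print(f"representatives: {reduced:,} comparisons")
print(f"reduction:       {full / reduced:,.0f}x")
```

The saving only pays off if the clusters represent the data well, which is why the quality of the clustering matters here.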
Types of Clusterings
A clustering is a collection of clusters. Clusterings can be distinguished along several dimensions:
- Hierarchical vs Partitional: A partitional clustering is a division of the dataset into non-overlapping clusters. Hierarchical clustering finds a set of nested clusters that form a hierarchy organized as a tree.
- Exclusive vs Overlapping vs Fuzzy: An exclusive clustering assigns each object to exactly one cluster. An overlapping clustering may assign one object to several clusters at once. In fuzzy clustering, every object belongs to every cluster with a weight between 0 (does not belong at all) and 1 (belongs fully) that measures how strongly the object belongs to that cluster. The weights of each object are usually required to sum to 1.
- Complete vs Partial: A complete clustering assigns each object to a cluster, while a partial clustering does not; it may exclude noise and outliers.
- Heterogeneous vs Homogeneous: In a homogeneous clustering, all clusters have similar size, shape, and density. In a heterogeneous clustering, clusters may differ widely in any of these.
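The difference between exclusive and fuzzy assignment can be sketched in a few lines. The example below is a minimal illustration that assumes two fixed cluster centers and uses simple normalized inverse-distance weights; this weighting is an assumption chosen for clarity, not the membership formula of any particular fuzzy clustering algorithm, and the function names are hypothetical.

```python
import math


def distances(point, centers):
    """Euclidean distance from one point to each center."""
    return [math.dist(point, c) for c in centers]


def exclusive_assignment(point, centers):
    """Exclusive clustering: the point gets exactly one cluster index."""
    d = distances(point, centers)
    return d.index(min(d))


def fuzzy_weights(point, centers, eps=1e-12):
    """Fuzzy clustering: one weight per cluster, normalized to sum to 1.

    Weights are inverse distances (an illustrative assumption); eps avoids
    division by zero when the point coincides with a center.
    """
    inverse = [1.0 / (d + eps) for d in distances(point, centers)]
    total = sum(inverse)
    return [w / total for w in inverse]


centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
point = (1.0, 0.0)

label = exclusive_assignment(point, centers)
weights = fuzzy_weights(point, centers)

print("exclusive label:", label)
print("fuzzy weights:  ", [round(w, 3) for w in weights])
```

Here the point lies three times closer to the first center, so the exclusive assignment picks that cluster while the fuzzy weights split roughly 0.75 to 0.25.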
Types of Clusters
- Well-separated: A cluster is a set of objects in which each object is closer to every other object in its cluster than to any object in a different cluster. These clusters can have any shape.
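- Prototype-based and Center-based: A cluster is defined by a prototype, often a centroid (the mean of the cluster's objects) or a medoid (the most representative actual object). Each object is closer to the prototype of its own cluster than to the prototype of any other cluster. When the prototype is a center, these are called center-based clusters.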
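- Graph-based: If the data can be represented as a graph, with objects as nodes and links as edges, clusters can be defined as connected components: groups of objects that are connected to one another but not to objects outside the group. A subtype is contiguity-based clusters, where two objects are connected only if they lie within a specified distance of each other.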
- Density-based: A cluster is a dense region of objects surrounded by low-density areas. This is useful for finding irregularly shaped clusters.
- Shared-property (or Conceptual): A cluster is defined as a set of objects that share some property. This is the most general definition and encompasses all of the previous ones.
- Objective function: Clusters are chosen to minimize or maximize an objective function that scores the quality of a clustering, for example the sum of squared distances from each object to its cluster center.
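The center-based and objective-function definitions can be combined in a short sketch. It assumes 2-D points, Euclidean distance, and the sum of squared errors (SSE) as the objective, which is one common choice rather than the only one; the function names and the data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(points, centers):
    """Center-based rule: each point goes to the cluster with the closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def sse(points, centers, labels):
    """Objective function: sum of squared distances to the assigned centers."""
    return sum(squared_distance(p, centers[k]) for p, k in zip(points, labels))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center per visible group
poor_centers = [(0.0, 0.0), (1.0, 0.0)]   # both centers near the left group

for name, centers in [("good", good_centers), ("poor", poor_centers)]:
    labels = assign_to_nearest(points, centers)
    print(name, labels, sse(points, centers, labels))
```

A lower SSE indicates that objects sit closer to their assigned centers, so under this objective the first set of centers is the better clustering of the two.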