DBSCAN
Posted: 05/07/2025
DBSCAN
Density-based clustering separates regions with high density from regions with low density. A popular density-based clustering algorithm is DBSCAN.
Density is estimated for a particular point in the dataset by counting the number of neighbors within a specified radius (Eps) of that point, including the point itself. Fixed a value of MinPts, each point is then classified as one of the following:
- Core point: a point that is in the interior of a density-based cluster. It has MinPts or more points within Eps distance.
- Border point: a point that is not a core point, but is inside the neighborhood of one.
- Noise point: a point that is neither core nor border, so it has few points close by and is far away enough from core points of all clusters.
Algorithm 1: DBSCAN algorithm
1. Label points as core/border/noise.
2. Eliminate noise points.
3. Connect with an edge core points within a distance ≤ Eps.
4. Make each connected group into a cluster.
5. Assign each border point to the cluster its associated core point belongs to.
There’s the issue of determining the value of Eps and MinPts . The basic approach is to look at the distance between a point and its kth nearest neighbor, called k-dist. For points that belong to a cluster, k-dist will be small if k is smaller than the cluster size, since the point will be in a dense area. For points that are outside of clusters, instead, the k-dist will be pretty large. If we compute the k-dist of all points in the dataset for some k, sort them in increasing order, and plot the result, we expect to see a sharp turn (a so called ”knee”) in the value of k-dist. The value at which this turn happens can be chosen as Eps, and the value of k as MinPts ts. This way, points for which k-dist is less than Eps will be core points, while all others will be either border or noise points.
The value of k itself should be neither too small, since small numbers of points close together may be considered a cluster, nor too high, because smaller clusters may be labeled as noise.
Since DBSCAN defines cluster on the basis of density, it’s resistant to noise, and can handle irregularly shaped clusters. On the other hand, it does not perform well if the data presents wildly varying densities, or in case of high dimensional data.
The time complexity of the algorithm is O(mt), where m is the number of points, and t is whatever time is needed to find the points in the Eps neighborhood of a generic point. In the worst case, the complexity is O(m2), although usage of certain data structures for efficient retrieval of points within a given distance of a specified point for low-dimensional spaces can lower the average case complexity to O(mlog(m)).
The space complexity is O(m), since it only needs to store a little amount of data per data point (cluster label and classification of the point as core, border, or noise).
OPTICS
OPTICS (Ordering Points To Identify the Clustering Structure) is a variant of DBSCAN that addresses its main disadvantages. It produces an ordering of the dataset with respect to its density based clustering structure. It requires Eps and MinPts as parameters. Two distances are defined for each point p:
- The core distance, which is the minimum value of radius required to classify the point as a core one. If p is not a core one, this distance is undefined.
- The reachability distance, between p and another point q in the dataset, defined as the maximum between the core distance of the point p and the distance between p and q. It’s not defined if q is not a core point.
The algorithm produces a list of points, each annotated with their smallest reachability distance. To visualize the result, a reachability plot can be used, which is a special kind of dendrogram. Points belonging to a cluster will have low reachability distance to their neighbor, and they will form a valley in the plot: the deeper the valley, the denser the cluster.
Clusters can be then extracted by selecting a threshold on the y-axis or a range on the x-axis after visual inspection. Alternatively, specific algorithms can be used to detect valleys depending on steepness, knee detection, or local minima.
Both core distance and reachability distance may be undefined in the case of no sufficiently dense clusters with respect to the chosen Eps value. If Eps is large enough this won’t happen, but in an extreme case Eps may be so high that any Eps neighborhood query returns the whole dataset. Eps itself is not necessary to OPTICS, but it’s still important to use it to cut off the density of clusters that are not interesting and potentially speed up the algorithm.
Algorithm 2: OPTICS algorithm
1. Initialize the reachability distance of all points as undefined.
2. for each unprocessed point p do:
3. Get the neighborhood N of p.
4. Mark p as processed and add to the output list.
5. if p is a core point then:
6. Initialize a priority queue Q to get the closest point to p in terms of reachability.
7. Call update(N,p,Q).
8. for each point q in Q do:
9. Get the neighborhood N′ of q.
10. Mark q as processed and add to the output list.
11. if q is a core point then:
12. Call update(N’,q,Q).
13. end if
14. end for
15. end if
16. end for
17. Output the ordered list of points.
Algorithm 3: update().
1. Calculate the core distance for p.
2. for each q in N do:
3. if q is not processed then:
new_rd = reachability distance between p and q.
4. end if
5. if q is not in Q then:
6. Insert the pair < q, new_rd > in Q.
7. else
8. if new_rd < q rd then:
9. Move q up in the queue Q by updating q_rd.
10. end if
11. end if
12.end for