DBSCAN

Posted: 05/07/2025

DBSCAN

Density-based clustering separates regions with high density from regions with low density. A popular density-based clustering algorithm is DBSCAN.

Density is estimated for a particular point in the dataset by counting the number of neighbors within a specified radius (Eps) of that point, including the point itself. Fixed a value of MinPts, each point is then classified as one of the following:

Algorithm 1: DBSCAN algorithm


    1. Label points as core/border/noise.
    2. Eliminate noise points.
    3. Connect with an edge core points within a distance ≤ Eps.
    4. Make each connected group into a cluster.
    5. Assign each border point to the cluster its associated core point belongs to.
    

There’s the issue of determining the value of Eps and MinPts . The basic approach is to look at the distance between a point and its kth nearest neighbor, called k-dist. For points that belong to a cluster, k-dist will be small if k is smaller than the cluster size, since the point will be in a dense area. For points that are outside of clusters, instead, the k-dist will be pretty large. If we compute the k-dist of all points in the dataset for some k, sort them in increasing order, and plot the result, we expect to see a sharp turn (a so called ”knee”) in the value of k-dist. The value at which this turn happens can be chosen as Eps, and the value of k as MinPts ts. This way, points for which k-dist is less than Eps will be core points, while all others will be either border or noise points.

The value of k itself should be neither too small, since small numbers of points close together may be considered a cluster, nor too high, because smaller clusters may be labeled as noise.

Since DBSCAN defines cluster on the basis of density, it’s resistant to noise, and can handle irregularly shaped clusters. On the other hand, it does not perform well if the data presents wildly varying densities, or in case of high dimensional data.

The time complexity of the algorithm is O(mt), where m is the number of points, and t is whatever time is needed to find the points in the Eps neighborhood of a generic point. In the worst case, the complexity is O(m2), although usage of certain data structures for efficient retrieval of points within a given distance of a specified point for low-dimensional spaces can lower the average case complexity to O(mlog(m)).

The space complexity is O(m), since it only needs to store a little amount of data per data point (cluster label and classification of the point as core, border, or noise).

OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a variant of DBSCAN that addresses its main disadvantages. It produces an ordering of the dataset with respect to its density based clustering structure. It requires Eps and MinPts as parameters. Two distances are defined for each point p:

The algorithm produces a list of points, each annotated with their smallest reachability distance. To visualize the result, a reachability plot can be used, which is a special kind of dendrogram. Points belonging to a cluster will have low reachability distance to their neighbor, and they will form a valley in the plot: the deeper the valley, the denser the cluster.

Clusters can be then extracted by selecting a threshold on the y-axis or a range on the x-axis after visual inspection. Alternatively, specific algorithms can be used to detect valleys depending on steepness, knee detection, or local minima.

Both core distance and reachability distance may be undefined in the case of no sufficiently dense clusters with respect to the chosen Eps value. If Eps is large enough this won’t happen, but in an extreme case Eps may be so high that any Eps neighborhood query returns the whole dataset. Eps itself is not necessary to OPTICS, but it’s still important to use it to cut off the density of clusters that are not interesting and potentially speed up the algorithm.

Algorithm 2: OPTICS algorithm


    1. Initialize the reachability distance of all points as undefined.
    2. for each unprocessed point p do:
    3.      Get the neighborhood N of p.
    4.      Mark p as processed and add to the output list.
    5.      if p is a core point then:
    6.          Initialize a priority queue Q to get the closest point to p in terms of reachability.
    7.          Call update(N,p,Q).
    8.          for each point q in Q do:
    9.              Get the neighborhood N′ of q.
    10.             Mark q as processed and add to the output list.
    11.             if q is a core point then:
    12.                 Call update(N’,q,Q).
    13.             end if
    14.         end for
    15.     end if
    16. end for
    17. Output the ordered list of points.
    

Algorithm 3: update().


    1. Calculate the core distance for p.
    2. for each q in N do:
    3.      if q is not processed then: 
                new_rd = reachability distance between p and q.
    4.      end if
    5.      if q is not in Q then:
    6.          Insert the pair < q, new_rd > in Q.
    7.      else
    8.          if new_rd < q rd then:
    9.              Move q up in the queue Q by updating q_rd.
    10.         end if
    11.     end if
    12.end for
    

← Back to Blog List