Talk:DBSCAN


Untitled


Is the abbreviation of epsilon to eps really necessary? Pflaquerre (talk) 07:52, 15 July 2008 (UTC)

No, it is not, but it is consistent with the original publication of Ester, Kriegel, Sander and Xu. mathias (talk) 17:52, 7 April 2009 (UTC)

It seems as though the algorithm could benefit from being clarified a little bit. What exactly are we recursing on? Is C global? What are the procedure/function's parameters? Why is noise never marked as visited? Pflaquerre (talk) 07:55, 15 July 2008 (UTC)

I second this Crenshaw (talk) 23:42, 26 November 2008 (UTC)

In Steps, it says: "starting point and its neighbors are added to this cluster". The pseudo code has only: "mark P as visited". Should the pseudocode say "Mark P and all points in N as visited"? —Preceding unsigned comment added by 98.207.54.162 (talk) 20:59, 7 February 2009 (UTC)

The problem of big O notation


The notation "O((n^2-n)/2)" in the "Nearest Neighborhood Detection" section is used incorrectly. It should be O(n^2), because the constant factor 2 and the lower-order term n are asymptotically insignificant. --Dexter (talk) 02:18, 28 April 2009 (UTC)
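For the record, the asymptotic argument can be spelled out: for n ≥ 2,

```latex
\frac{n^2 - n}{2} \;\le\; \tfrac{1}{2}\,n^2
\qquad\text{and}\qquad
\frac{n^2 - n}{2} \;\ge\; \tfrac{1}{4}\,n^2 ,
```

so (n^2-n)/2 is in Θ(n^2), and writing O(n^2) is the conventional simplification; the exact pair count (n^2-n)/2 is still useful as an operation count, just not inside an O(·).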

Pseudo code wrong?


Timmenzies (talk) 15:07, 14 June 2011 (UTC)

In the original paper, if the expanding cluster runs into another older cluster then the new cluster gets the same id as the older cluster.

I wasn't sure that this would happen in practice (since expandCluster performs a complete search through all relevant neighbors), but the original paper says that this can happen in the special case of border points (a note that I do not understand, BTW).

In any case, to be consistent with the paper and to handle that special case, shouldn't the pseudo code be as follows? (my changes are all marked with //)

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors (P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         C.id = a new id                                  // new
         expandCluster(P, N, C, eps, MinPts)
          
expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N 
      if P' is not visited
         mark P' as visited
         N' = getNeighbors(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
      else                                                         // new 
         C.id = the id of the cluster holding P'  // new
The result of DBSCAN is non-deterministic on border points; but you clearly do NOT want to merge clusters when they share a border point, because that would violate the model of density-connectedness. Your modification would cause exactly this, breaking the whole model. Also, I believe your first sentence is wrong. IIRC the new *point* keeps the cluster ID of the old cluster, which is exactly what the code does. An alternative would be to just overwrite the cluster assignment. The two variants are isomorphic; you can emulate the other by just processing the dataset in the opposite order.
Border points are density-reachable from a cluster but not dense themselves. They can be density-reachable from two clusters, but since they are not core points they will NOT merge the two clusters. By being density-reachable, they would belong to both clusters, but the DBSCAN algorithm assigns them to whichever cluster discovered them first. They are usually not significant, and this way you get a proper partitioning and not some fuzzy or overlapping clustering. --87.174.53.81 (talk) 22:13, 14 June 2011 (UTC)

Pictures


This article is in serious need of some pictures. Particularly, the section "Basic Idea". — Preceding unsigned comment added by 99.181.61.118 (talk) 22:19, 15 August 2011 (UTC)

In the caption of the figure in section "Preliminary", some of the core points only have 3 reachable neighbours. If I'm not mistaken, then given the definition of core points, either the label is wrong (minPts = 4 should be minPts = 3) OR the figure is wrong and only two points should be considered core points (I guess correcting the caption is easier than correcting the figure). I didn't want to edit before being sure. — Preceding unsigned comment added by 143.239.81.7 (talk) 14:13, 26 September 2018 (UTC)

The number of points is 4. It does not say "number of points different from the query point". The range query does not exclude the query point - it is part of the database. Chire (talk) 14:29, 12 October 2018 (UTC)

The topic is not that clear... what kind of ordering is created...


There are a few images on this page which seem more useful: http://de.wikipedia.org/wiki/OPTICS

Moreover, there is no explanation of the ordering. Maybe an example would be very useful. Please help - I need to implement this but have no idea how the ordering works.

I can't find any better source. Thank you. Varunkumar Jayapaul jv5k5-wiz@yahoo.co.in — Preceding unsigned comment added by 203.196.165.100 (talk) 12:36, 14 June 2012 (UTC)

DBSCAN does not generate an ordering. You are confusing it with the OPTICS algorithm. Have you considered looking at the original publications? They contain pseudocode. --Chire (talk) 16:17, 14 June 2012 (UTC)

Basic Idea Section Unreadable


The Basic Idea section is incomprehensible to someone who does not already understand the algorithm, due to having too many pronouns. Each "it" could refer to either p or q.

192.249.47.174 (talk) 15:51, 21 June 2012 (UTC)

It should be noted in the algorithm details that DBSCAN can create a cluster with fewer than minPts points


A cluster in DBSCAN is only guaranteed to consist of at least one core point.

Since border points that belong to more than one cluster will be "randomly" (usually: first-come) assigned to one of the clusters, a core point may not get to keep all of its neighbors. Thus, a cluster may end up smaller than minPts.

One-dimensional example: minPts = 4, epsilon = 1:

0.0 0.5 1.0 2.0 3.0 3.5 4.0

|-------X-------|                   Core point at 1.0, Cluster 1.
                |-------X-------|   Core point at 3.0, Cluster 2.

All other points are not core points. Since the point at 2.0 is not a core point, it does not connect the two clusters. One cluster will have 4 members, the other only 3. — Preceding unsigned comment added by 208.38.35.13 (talk) 16:21, 25 February 2014
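For illustration, the example above can be checked with a minimal Python sketch of the abstract DBSCAN pseudocode (helper names are made up for this sketch, this is not the original implementation):

```python
def region_query(points, i, eps):
    # closed eps-neighborhood: the query point itself counts toward minPts
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

def dbscan(points, eps, min_pts):
    NOISE = -1
    labels = [None] * len(points)   # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE       # may be relabeled as a border point later
            continue
        cluster += 1                # start a new cluster at core point i
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:  # former noise point becomes a border point
                labels[j] = cluster
                continue
            if labels[j] is not None:
                continue            # first-come: already claimed by a cluster
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_pts:
                seeds.extend(more)  # j is a core point, keep expanding
    return labels
```

Running `dbscan([0.0, 0.5, 1.0, 2.0, 3.0, 3.5, 4.0], 1.0, 4)` produces two clusters of sizes 4 and 3: the shared border point at 2.0 is claimed by whichever cluster reaches it first, so the other cluster is left with fewer than minPts members.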

When copying data from a web site (Stackoverflow): [1] you should A) obey copyrights, and B) give credit. --188.98.200.186 (talk) 11:31, 2 March 2014 (UTC)

RegionQuery: Including reference point P?


The current regionQuery explicitly contains the given reference point P as well.

regionQuery(P, eps): return all points within P's eps-neighborhood (including P)

In the original paper, the authors state:

A call of SetOfPoints.regionQuery(Point, Eps) returns the Eps-Neighborhood of Point in SetOfPoints as a list of points.

In Neighbourhood (graph_theory), it is stated:

... The neighbourhood described above does not include v itself, and is more specifically the open neighbourhood of v; it is also possible to define a neighbourhood in which v itself is included, called the closed neighbourhood and denoted by NG[v]. When stated without any qualification, a neighbourhood is assumed to be open. ...

The authors do not elaborate on whether it is a closed or an open neighborhood. The last sentence of the quoted passage about neighbourhoods suggests excluding P unless it is explicitly stated otherwise.

--Fretdf (talk) 06:27, 20 August 2014 (UTC)

In a database context, such as when using an R*-tree as recommended by the authors, the query would include the query point. It is part of the database, too. --Chire (talk) 11:24, 20 August 2014 (UTC)
I have also verified by looking at the original source code. The query point is included (for all I can tell). --Chire (talk) 13:01, 20 August 2014 (UTC)
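The distinction matters for the core point test, since minPts is compared against a count that includes the query point under the closed convention. A toy sketch (names and data are illustrative only):

```python
def region_query(db, p, eps):
    # closed neighborhood: p is part of the database, so it is returned too
    return [q for q in db if abs(p - q) <= eps]

db = [0.0, 0.4, 0.8, 5.0]
closed = region_query(db, 0.0, 1.0)
open_nb = [q for q in closed if q != 0.0]   # open neighborhood excludes p

# with minPts = 3, the point 0.0 is a core point under the closed
# convention (3 neighbors) but not under the open one (2 neighbors)
```

In other words, switching between the open and closed convention shifts the effective minPts threshold by one, which is why implementations must agree with the paper's convention.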

Yesterday, I added a link to the SPMF implementation of DBSCAN and it was deleted by IP address "94.216.196.253" as being "an advertisement". However, SPMF is a well-known data mining library used in more than 160 research papers (see the SPMF website's citations page), and has many users (more than 70,000 visitors to the website last year). It is thus relevant to mention that it offers a DBSCAN implementation. Besides, it is interesting to note that the IP address that deleted the link to SPMF is located in Munich, Germany. Thus, it might have been deleted by members of the ELKI data mining library, which would look like a conflict of interest... By the way, I'm the founder of SPMF. — Preceding unsigned comment added by 139.103.155.68 (talk) 19:31, 8 March 2015 (UTC)

You added this to three articles (WEKA, k-means and this one), so I reverted all three:
GPL-V3 Java implementations of more than 80 data mining algorithms. It offers an efficient Java implementation of the DBScan algorithm using a KD-Tree that is easy to integrate in other Java software. A GUI and command line interface is provided.
This is advertisement text. — Preceding unsigned comment added by 94.216.196.253 (talk) 23:43, 8 March 2015 (UTC)
The SPMF library is relevant. If you think there is a problem with the writing style, which is subjective, I suggest rewriting rather than deleting. You only need to remove the part that you think is inappropriate. I will put back the text on the WEKA and k-means pages, but I will rewrite it differently. Again, if you have a problem with the text, you can rewrite it. — Preceding unsigned comment added by 99.252.71.54 (talk) 01:17, 9 March 2015 (UTC)

Discretization of the space for the graph analogy


I am not sure how important this is for experienced readers, but the fact that a possibly continuous space is viewed as a graph should be made explicit. I would like to talk about edges and paths only once the following points are established: every point in the space is a node, and there exists an edge (x, y) iff d(x, y) <= eps. Thank you! Reim (talk) 12:17, 1 July 2015 (UTC)
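The proposed discretization is easy to state in code; a minimal sketch under the assumptions above (function and variable names are made up for illustration):

```python
def eps_graph(points, dist, eps):
    # one node per point; an undirected edge (i, j) whenever d(x_i, x_j) <= eps
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(points[i], points[j]) <= eps}

# three 1-D points: only 0.0 and 0.5 are within eps = 1.0 of each other
edges = eps_graph([0.0, 0.5, 2.0], lambda a, b: abs(a - b), 1.0)
```

Note that density-reachability in DBSCAN additionally requires the interior nodes of a path to be core points, so the plain eps-graph by itself only describes the minPts = 2 case.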

Negative description about other software


I just came back to this page after a few months and read that the description of SPMF has been changed to this: "* SPMF offers a minimalistic GPL-V3 Java implementation of the DBSCAN algorithm for Euclidean distance only with KD-Tree support, that often doesn't yield any result at all." According to the IP address, it seems that the text was rewritten by authors of the ELKI software. It is claimed that it often yields no result. But I ran the algorithm and a result is definitely produced. If a bug has been found such that it does not produce a result in some cases (which would need to be demonstrated), why not report it so that it can be fixed, like most users do when they find a bug, instead of writing a very negative description on Wikipedia? I mean, you can promote your software without writing negative things like that about other software. Besides, there is more than one distance function offered in SPMF (subclasses of DistanceFunction). This was just not mentioned very clearly in the SPMF documentation. The documentation and code have been updated to make this clearer. I have now fixed the description in the Wikipedia article. In the future, if you find a bug, report it to me and I will fix it. I'm the founder of SPMF. ~~

Relation to Friends-of-Friends algorithm.


The algorithm is widely used in cosmology to identify dark matter halos from simulations, under the name FOF (friends-of-friends):

The FOF algorithm was described and studied on page 382, section e) "multiplicity functions", of

http://adsabs.harvard.edu/abs/1985ApJ...292..371D

This use was much earlier than the 1996 paper.

There is a short discussion of FOF and DBSCAN under "related work" on page 4 of

http://homes.cs.washington.edu/~magda/papers/kwon-ssdbm10.pdf

FOF is the special case where minPts == 1, although with FOF, objects with fewer than (typically 32) points are removed from the catalogue.

Based on this I propose we add a sentence somewhere in this article:

  A special case of DBSCAN, known as the friends-of-friends (FOF)
  algorithm, was first proposed by Davis et al. (1985) and is still widely used in cosmology.

Objections?

Yu Feng Rainwoodman (talk) 20:03, 18 September 2015 (UTC)

The pointless case of minPts < 2 is also a special case of single-link hierarchical clustering, and this kind of clustering was discussed by Sneath as early as 1957. I doubt "FOF" is the first. It's nothing but the transitive closure of the neighbor relation, and not worth being called an "algorithm" at all - transitive closures are a well-understood and old mathematical concept. Did Davis discuss anything to make it fast? 93.204.126.210 (talk) 08:42, 19 September 2015 (UTC)
The article already discusses this:

The low value of minPts = 1 does not make sense, as then every point on its own will already be a cluster. With minPts = 2, the result will be the same as of hierarchical clustering with the single link metric, with the dendrogram cut at height ε.

In other words, what cosmologists are using under the name "FOF" is single-link clustering, not DBSCAN. 93.204.126.210 (talk) 08:45, 19 September 2015 (UTC)
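To make the equivalence concrete: with minPts = 2 (or the degenerate minPts = 1), every non-isolated point is a core point, so the clusters are exactly the connected components of the "within eps" relation, i.e. its transitive closure. A minimal FOF-style sketch using union-find (names are hypothetical, not from any cosmology code):

```python
def fof(points, dist, eps):
    # friends-of-friends / single-link cut at eps: connected components
    # of the "d(x, y) <= eps" relation, computed with union-find
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)   # link the two groups of friends
    return [find(i) for i in range(len(points))]

# two chains of "friends": 0.0-0.5-1.0 and 5.0-5.2 (eps = 0.6)
labels = fof([0.0, 0.5, 1.0, 5.0, 5.2], lambda a, b: abs(a - b), 0.6)
```

Note that 0.0 and 1.0 end up in the same group even though they are more than eps apart, which is exactly the transitive-closure behaviour being discussed.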

Noise points vs pseudo code


In the DBSCAN function in the pseudo code, a point can be marked as noise, yet that already-marked noise point can later become part of a cluster. This is also mentioned in the original article: "The ClId (clusterId) of points which have been marked to be NOISE may be changed later, if they are density-reachable from some other point of the database."

I think this should be indicated in the pseudo code, to keep implementers from collecting the noise points inside the DBSCAN function (as I mistakenly did). E.g. like this:

expandCluster(P, NeighborPts, C, eps, MinPts) {
   add P to cluster C
   for each point P' in NeighborPts { 
      if P' is not visited {
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      }
      if P' is not yet member of any cluster {
         add P' to cluster C
         if P' is marked as NOISE                                    // < here
            unmark P' as NOISE                                       // < here
      }
   }
}

DeVi (talk) 11:39, 28 November 2016 (UTC)

I'd rather stick to the published pseudocode than make a Wikipedia version of it and, worst case, introduce some bugs. I'd also prefer to keep the pseudocode short. Chire (talk) 16:23, 1 December 2016 (UTC)

Complexity


The algorithm's authors claim that it is possible to achieve DBSCAN with O(n log n) time complexity with an implementation based on index structures. However, there is an ongoing discussion whether this claim is valid. To the best of my knowledge, no formal proof of the linearithmic complexity exists.

The edit was reverted with the comment "already discussed on the article. see references." -- I see no references for a complexity analysis in the article. The linearithmic claim was questioned, for instance, in:

Gunawan, A. and de Berg, M., 2013. A faster algorithm for DBSCAN. Master’s thesis.

Gan, J. and Tao, Y., 2015, May. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD international conference on management of data (pp. 519-530).

or even:

Schubert, E., Sander, J., Ester, M., Kriegel, H.P. and Xu, X., 2017. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), pp.1-21. 91.193.208.197 (talk) 14:30, 9 May 2024 (UTC)

None of them doubts that "the worst case run time complexity remains O(n²)" - so the article seems up to date, and I'd argue the last article settled the "ongoing discussion" - was there anything new in the last 5 years? Check whether that article supports the "O(n²) worst case" claim of the Wikipedia article and then just reuse the citation.