Talk:K-d tree/Archive 1


Kd-tree should be named kd-trie and vice versa

If the term kd-trie is accepted, then the current kd-tree article should be about a kd-trie and vice versa. Here's the rationale (this is a copy from kd-trie's discussion page). The idea of a trie is to store strings (or finite sequences over any ordered set) such that symbols are shared between words with the same prefix. This delivers O(m) time complexity for insertion, deletion and searching, where m is the length of the word. Now consider the data structure in the kd-tree article. The first inserted point should be taken as an "empty word". Every point has an associated axis-aligned plane which contains it. The subsequent points will be classified by this plane as being on the "negative" or "positive" side (for any direction). Now if you insert another point, it implicitly has an associated string given by the traversal to its insertion place in the tree: if in the traversal you choose the negative sub-tree, append "n" to the string, otherwise append "p". This way you can see that the points are associated with strings of the form "nppnpppnnnpnpnp" etc. Thus the associated strings form a trie, but a special kind of trie: every point is part of the set, and so every associated substring is contained in the word-set. —Preceding unsigned comment added by Kaba3 (talkcontribs) 21:46, 16 January 2008 (UTC)
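To make the analogy concrete, here is a small Python sketch (my own illustration, assuming a node object with location, axis, left and right fields) of how the traversal path of a point yields its associated "n"/"p" string:

 def path_string(root, point):
     # Walk down the tree as if inserting 'point', recording "n" when we take
     # the negative (left) branch and "p" when we take the positive (right) one.
     path = ""
     node = root
     while node is not None:
         if point[node.axis] < node.location[node.axis]:
             path += "n"
             node = node.left
         else:
             path += "p"
             node = node.right
     return path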
I would estimate that many people know both implementations as KD-trees; the distinction between a trie and a tree is minor and can be pointed out in the text. The discussion in kd-trie is not extensive by any means, and the first comment there, "kd-trie is my own invention...", is somewhat concerning. User A1 (talk) 01:52, 17 January 2008 (UTC)
I agree. It would be better to merge the discussion on the kd-trie page into the kd-tree page. I hope someone does something, because currently the kd-trie term is misleading. I don't have the guts because I am a Wikipedia newbie and don't want to step on anyone's toes :) --Kaba3 (talk) 19:52, 18 January 2008 (UTC)
I don't particularly agree that a kd-tree is a trie in this sense. It's the same as claiming a binary-tree is a trie since you can form a string of l's and r's representing whether you went left or right to find the node. If those directions had some greater meaning then it might be useful. Anyhow bdodson (talk) 08:47, 7 December 2008 (UTC)

Contradiction in text?

Doesn't the statement "In addition, every node of a kd-tree, from the root to the leaves, stores a point" contradict the end of the first paragraph "kd-tries are a variant that store data only in leaf nodes" ? I'd change it, but I know almost nothing about the topic. Also, while you are at it, "point is infite" should be "point is infinite." Nice article. It helped me a lot.

There are at least two different definitions of kd-tree. The most popular one calls for storing points in the intermediate nodes; the Preparata/Shamos book gives that definition. However, the de Berg/van Kreveld/Overmars/Schwarzkopf book gives an alternative definition, where points are stored in the leaves only. In the latter variant the splitting planes are still chosen as medians, and a point (or points) lying on the plane is simply forced into one of the two partitions. —The preceding unsigned comment was added by 198.182.56.5 (talk) 01:31, 28 April 2007 (UTC).
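For illustration, a rough Python sketch of the leaf-storing variant (my own paraphrase of that formulation, not code from either book; the median point itself is pushed into the right half) could look like this:

 def build_leaf_kdtree(points, depth=0, k=2):
     # Leaf-storing variant: internal nodes hold only the splitting value,
     # all points end up in the leaves.
     if len(points) <= 1:
         return {"leaf": True, "points": points}
     axis = depth % k
     points = sorted(points, key=lambda p: p[axis])
     median = len(points) // 2
     return {"leaf": False, "axis": axis, "split": points[median][axis],
             "left": build_leaf_kdtree(points[:median], depth + 1, k),
             "right": build_leaf_kdtree(points[median:], depth + 1, k)}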
Yes, I think there is a contradiction too. I left a message at the kd-trie article's discussion page with a rationale, but haven't got a response yet. In short, my opinion is that the current kd-tree should be called a kd-trie, and the current kd-trie a kd-tree. --Kaba3 (talk) 09:43, 24 November 2007 (UTC)
The definitions are contradictory, but both are valid.

Picture Included

The picture included really isn't the best choice for demonstrating a kd-tree to someone who isn't familiar with it. has a much better picture, for instance

The problem I have with the picture (we're talking about the pic at the top, yes?) is that the vertices of the splitting planes appear, visually speaking, to be tree nodes, when in fact that is not correct. And the tree nodes themselves are not represented. wwheeler 18 January 2007

Go for it. I made the existing one but agree that it's not too explanatory. Btyner 05:17, 20 November 2005 (UTC)
I also made an SVG version of this new one, but for some reason, commons didn't recognize it as such. Btyner 22:21, 21 February 2006 (UTC)

Also, some pseudocode for construction, insertion, deletion, nearest neighbor, etc. would be very useful. --Numsgil 13:07, 13 October 2005 (UTC)

Python?

Does the Python version of the pseudocode really add anything? It seems like it is basically a restatement of the pseudocode with a slightly different (less clear) syntax. Do we really need both? Neilc 22:37, 15 April 2006 (UTC)

I personally found it useful to have an executable version to try things out on and improve my understanding. That's also why I wrote the code that produced that 2D results picture. Granted, this page may not be the best place for an implementation, though I don't know a better place for such a tiny snippet. KiwiSunset 04:51, 19 April 2006 (UTC)

Content

Ok, so I can understand that balancing a kd tree takes care, but how do I actually do it? This article really needs someone with that knowledge to add that information in. ACiD2

I've been working on tracking down the existing work done by Overmars on this topic, which is very similar to AVL trees. I might end up just re-deriving it, getting some review, and giving the citation as the source of the idea. If anyone has access to the pertinent journal article, it would be very helpful. There was also a technical report published. M. H. Overmars and J. van Leeuwen, Dynamic multi-dimensional data structures based on quad- and k-d trees, Acta Informatica 17, 3(1982), 267-285.

Search complexity?

The article contains no information about the complexity of searching a kd-tree for an exact nearest neighbor.

The algorithm analyzed in the following paper is fairly generic (it uses region queries), but does apply to kd-trees. I don't believe there's an existing article about the algorithm itself, but it would be nice to have one. Cleary, J. G. 1979. Analysis of an Algorithm for Finding Nearest Neighbors in Euclidean Space. ACM Trans. Math. Softw. 5, 2 (Jun. 1979), 183-192. DOI= http://doi.acm.org/10.1145/355826.355832

Adding elements?

Doesn't adding elements in the method described lead to an unbalanced tree? If this is so then I think it should be made explicitly clear, and a rebalancing algorithm (or a way to preserve balance) should be given. User A1 07:21, 1 December 2006 (UTC)

The kd-tree is a static data structure, so yes, adding an element will introduce imbalance. Rebalancing it is perhaps outside the scope of the basic description. IlyaTheMuromets 22:45, 8 Jan 2007 (EST)

Missing title in title bar

The title of this article does not seem to be showing up in my browser title bar. Is this just me, and if not, any suggested solutions?

MadScientistVX 05:09, 23 February 2007 (UTC)

Odd. I second that, firefox 1.5.0.7 linux x86, despite the presence of <title>Kd-tree - Wikipedia, the free encyclopedia</title> User A1 15:23, 23 February 2007 (UTC)

Removing elements

This piece is not really clear to me; if someone could clarify it a bit, that would be excellent! Laurens 11:07, 27 April 2007 (UTC)

Yes, "removing the invariant" is quite unclear. The point that they wish to make is that when deleting a point from the tree, one has to ensure that sub-points are merged back into the original tree in a sensible fashion, ie one that doesn't break the spatial nature of the tree itself.
There are several possible techniques for doing this. Firstly one may simply mark the element that one wishes to delete with some form of ignore flag, such that this node no longer contributes to the algorithms that are run over the tree. This has the nice effect that it is fast to perform on any given node, but the downside is that the tree still has that node in it, so it is still taking up computational space. Secondly one could unlink the node, which we can call node "A" then generate a tree using the subnodes of "A" then link its children back into the tree, this has a higher up front computational cost, but removes "A" from the tree. Also this method may result in unbalanced trees, such that search algorithms run past it may perform in a non-optimal fashion. Finally one could completely destroy the tree and rebuild it, but without using "A". This costs the most computationally, but will produce an optimal (in some sense) tree.
I haven't included this in the article as the bit about linking and unlinking is intentionally vague, since I never actually implemented this when I wrote a KD tree algorithm for NN searching. User A1 13:58, 27 April 2007 (UTC)
Thanks, that helps a bit :) Laurens 15:07, 27 April 2007 (UTC)
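To make the second option above concrete, here is a rough Python sketch (my own illustration; it assumes nodes with location, left and right fields and some standard construction routine build_kdtree, neither of which is code from the article):

 def collect(node):
     # Gather every point stored in a subtree.
     if node is None:
         return []
     return collect(node.left) + [node.location] + collect(node.right)

 def delete_node(node, depth=0, k=2):
     # Option two above: rebuild the subtree rooted at 'node' from the points of
     # its children, dropping node's own point. The caller should attach the
     # returned subtree in place of 'node' in the parent (or as the new root).
     remaining = collect(node.left) + collect(node.right)
     return build_kdtree(remaining, depth, k)  # build_kdtree: any standard construction routine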

Bug in query function?

It seems to me that the first d levels in the tree won't be checked against the bounding box in all dimensions. So querying for points in ([6,8], [3,4]) in the example tree would return (7, 2) even though 2 is not in [3,4]. Or have I missed something? 206.248.181.57 (talk) 23:07, 17 November 2007 (UTC)

No you haven't; that algorithm would only work if the tree definition were that internal nodes can't store values, only leaves. You have to check against the boundaries explicitly, or you'll get false positives. —Preceding unsigned comment added by 149.156.124.14 (talk) 23:28, 15 January 2008 (UTC)

Sample code in Ruby

 # class for generating trees

class KDTree

 attr_reader :root # root node
 attr_reader :points # search results for orthogonal search range
 # creates new tree from list of points - with 'dim' dimensions
 def initialize(points, dim)
   @dim = dim
   @root = KDNode.new(dim).parse(points)
 end
 # adds node to tree - creates unbalance
 def add_node(point)
   @root.add(point)
 end
 def find(*range)
   @points = []
   query(range, @root)
   @points
 end
 def query(range, node)
   axis = node.axis
   median = node.location[axis]
   if node.left && (median >= range[axis].begin)
     query(range, node.left);  #/* Skip left if max of left tree (median) is out of range */
   end
   if node.right && (median <= range[axis].end)
     query(range, node.right); #/* Skip right if min of right tree (median) is out of range */
   end
   if (0..@dim-1).all?{|ax| range[ax].include? node.location[ax]} # check if all constraints are fulfilled
     @points << node.location;
   end
 end
 #print tree
 def print
   @root.print
 end

end

class KDNode

 attr_reader :left, :right
 attr_reader :location
 attr_reader :axis
 def initialize(dim, location=nil, left=nil, right=nil)
   @dim = dim
   @location = location
   @left = left
   @right = right
 end
 # recursively build this subtree from a list of points
 def parse(points, depth = 0)
   @axis = depth % @dim
   points = points.sort_by{|point| point[@axis]} # re-sorting at every level is why this construction is not O(n log n)
   half = points.length / 2 # simplest median choice
   @location = points[half]
   @left = KDNode.new(@dim).parse(points[0..half-1], depth+1) unless half < 1
   @right = KDNode.new(@dim).parse(points[half+1..-1], depth+1) unless half+1 >= points.length
   self
 end
 # insert a single point below this node (simple descent; may unbalance the tree)
 def add(point)
   if point[@axis] < @location[@axis]
     @left ? @left.add(point) : (@left = KDNode.new(@dim).parse([point], @axis + 1))
   else
     @right ? @right.add(point) : (@right = KDNode.new(@dim).parse([point], @axis + 1))
   end
 end
 # delete this node's point by rebuilding this subtree from the children's points
 def remove
   points = (@left ? @left.to_a : []) + (@right ? @right.to_a : [])
   @left = @right = nil
   parse(points, @axis)
 end
 # collect every point stored in this subtree, in order
 def to_a
   (@left ? @left.to_a : []) + [@location] + (@right ? @right.to_a : [])
 end
 def print(l=0)
   @left.print(l+1) if @left
   puts("  "*l + @location.inspect)
   @right.print(l+1) if @right
 end

end

 require 'pp'

 a = []
 10.times do |x|
   10.times do |y|
     10.times do |z|
       a << [x, y, z]
     end
   end
 end

 tree = KDTree.new(a, 3)
 tree.print

 puts " -------------- "

 # find expects one range per dimension
 tree.find(0..4, 1..4, 5..7)
 pp tree.points.sort_by{|po| po[0]}

—Preceding unsigned comment added by 149.156.124.14 (talk) 23:37, 15 January 2008 (UTC)

A Reference to a Generic Java Implementation of the KD Tree Needed

Recently I posted a link to a Java implementation of the KD tree which uses the new Java 5 construct of generics (similar to templates in C++). The link was as follows:

Savarese Algorithms

However, the addition was removed with the comment that the contributor (Savarese, not myself) was "just an individual" and that, although the library is open source, it was apparently somehow not open enough to merit mentioning on the Wikipedia page.

Having done a fair bit of research on the web I can attest that generic Java implementations of KD trees are not well represented and that, all other concerns aside, the Savarese implementation is one of unusually high quality (modern implementation, full JUnit tests, consistent code quality). I, as a Java programmer, would find a reference to an implementation in my language of choice most useful, as I imagine others would.

So my question is this, how is it possible to get a reference on the Wikipedia KD Tree page to a high quality Java KD Tree implementation?

Would it be necessary for me to do a clean-room implementation of a KD tree using generics and then, I suppose, get some sanctioning body to accept it, so that it won't just be "some guy" implementing a KD tree? If so, it seems rather a high bar and not really in the spirit of individual contributions, as I thought Wikipedia to be.

As a side note, and hardly fully definitive, the crowd seems to give a nod to the Savarese implementation as it is the first google hit if you search "KD Tree Java". —Preceding unsigned comment added by Ltburch (talkcontribs) 17:35, 21 April 2008 (UTC)

bug in python code?

I think this is wrong:

   node.leftChild = kdtree(pointList[0:median-1], depth+1)
   node.rightChild = kdtree(pointList[median+1:], depth+1)

and should be

   node.leftChild = kdtree(pointList[0:median], depth+1)
   node.rightChild = kdtree(pointList[median+1:], depth+1)
 —Preceding unsigned comment added by LodeLeroy (talkcontribs) 14:41, 5 August 2008 (UTC) 

Bug correction in python code.

The slicing notation in the Python code would lead to losing a predecessor to the median. --Revpj (talk) 03:51, 10 August 2008 (UTC)
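For reference, a minimal self-contained rendition of the construction with the corrected slicing (a sketch in the spirit of the article's pseudocode, not its exact code) is:

 from collections import namedtuple

 Node = namedtuple("Node", ["location", "left_child", "right_child"])

 def kdtree(point_list, depth=0, k=2):
     if not point_list:
         return None
     axis = depth % k
     point_list = sorted(point_list, key=lambda p: p[axis])
     median = len(point_list) // 2
     return Node(
         location=point_list[median],
         # corrected slices: everything before the median goes left,
         # everything after it goes right, the median stays in this node
         left_child=kdtree(point_list[0:median], depth + 1, k),
         right_child=kdtree(point_list[median + 1:], depth + 1, k),
     )

With this slicing the node keeps the median point itself, the left child gets every point before it and the right child every point after it, so no point is lost or duplicated.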

Balancing

Hey. I'd be happy to write up about rebalancing the tree if I knew how it's done or I could find a paper on it. At the very least we should have a reference to something. I believe it's possible, but I can't find anything easily. If you can point me to a paper, that would be great. bdodson (talk) 08:44, 7 December 2008 (UTC)

Nearest neighbour

Hello,

I added the NN algorithm, but i have hidden it in comments as it was a little off the cuff. Feel free to fix it, i will go over it again later.

Thanks User A1 12:54, 2 December 2006 (UTC)

I don't suppose you can clarify this section? I find the way it is explained highly confusing (not that confusing me is particularly hard). For example, 'the tree is searched in a depth-first fashion, refining the nearest distance' should be put in a more understandable way: what do you mean by depth-first, refine the nearest distance how, and what is the nearest distance? We need to assume the readers are not programmers, as they most likely aren't (though I am). --Chase-san (talk) 10:34, 11 January 2008 (UTC)
A year later and someone replies! I made some very minor changes, (wikified depth-first and hypersphere) later I will add a reference and look into making an animation. User A1 (talk) 13:31, 11 January 2008 (UTC)
I'm not sure that the proposed algorithm is correct. The explanation gives us an idea of how the algorithm works but no reassurance that it is correct and runs in logarithmic time. It seems to me that nearest neighbour with a Euclidean (L2) metric given any kd-tree produced by a process which only considers the orders defined by the points' x and y coordinates respectively is  , where n is the number of points. Consider the following argument. Let n-1 >= 4 be square and place   times   points on coordinates (x,y) for integers  . Place n points on coordinates   for  , with e_i = 0 except for k, for which it is simply small enough but greater than zero. Place one final point on (0, -n^2). For any given kd-tree on this instance produced by the above methods, every other choice of k will produce the same order and therefore also the same kd-tree. Evaluating the above equation, k will be the point closest to (0, -n^2). For any given kd-tree of this kind, and for any deterministic algorithm inspecting at most sqrt(n)-2 of the points on the bottom row, the algorithm must fail for at least one instance. It follows that nearest neighbour given a kd-tree produced by order alone is  . Is it also  ? — C. lorenz (talk) 19:35, 25 September 2008 (UTC)
Have a quick look at "An introductory tutorial on kd-trees" by Andrew Moore. To be honest I have not taken the time to understand your comment at this time, though I intend to at some time. User A1 (talk) 23:45, 25 September 2008 (UTC)
The log(n) comes from the structure of the tree, operating under the assumption of a spatially random distribution of points (there do exist scenarios where the worst-case running time is considerably worse than log(n), for example when the NNs are almost all at the same radius from the search point). User A1 (talk) 23:52, 25 September 2008 (UTC)
I'd buy average-case log n under the uniform distribution but it's important to make that distinction. Would you clarify this and do the same for the section on Kd-trees under Nearest_neighbor_search? Perhaps a reference to e.g. this tutorial? Thanks for the quick reply! — C. lorenz (talk) 06:15, 26 September 2008 (UTC)
Hi all. I rewrote the algorithm description to hopefully be much clearer. I know it was a tricky algorithm for me to understand when I implemented it. bdodson (talk) 08:41, 7 December 2008 (UTC)
Hi, I was trying really hard to read through the algorithm description, but I'm finding it really confusing. Is there any way we could get some pseudocode on this? It would be nice to have the implementation of the kd-tree (already in pseudocode and really helpful) and then how to find neighbors, to see the power of kd-trees. 71.87.112.47 (talk) 09:25, 22 January 2009 (UTC)
I added the link to the Moore tutorial. If I recall, that has got what you are looking for. I may, if I get to it, create some pseudocode. User A1 (talk) 10:38, 22 January 2009 (UTC)
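Until then, here is a rough Python sketch of the recursive search described above (my own illustration, assuming nodes with location, left_child and right_child fields as in the construction example; not the article's code):

 import math

 def dist(a, b):
     return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

 def nearest_neighbour(node, point, depth=0, k=2, best=None):
     # Depth-first search that keeps the closest point seen so far and only
     # visits the far side of a splitting plane if the current best hypersphere
     # crosses that plane.
     if node is None:
         return best
     if best is None or dist(point, node.location) < dist(point, best):
         best = node.location
     axis = depth % k
     diff = point[axis] - node.location[axis]
     near, far = (node.left_child, node.right_child) if diff < 0 else (node.right_child, node.left_child)
     best = nearest_neighbour(near, point, depth + 1, k, best)
     if abs(diff) < dist(point, best):  # splitting plane closer than the current best?
         best = nearest_neighbour(far, point, depth + 1, k, best)
     return best

The key step is the last comparison: the far branch is only visited when the splitting plane is closer to the query point than the best match found so far.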

Adding

I believe the "Adding elements" section is misleading/incorrect. As currently phrased, new nodes are only added as children of leaf nodes. A leaf node by definition has no children, so the current text implies you cannot add a "right" child if a "left" child already exists. I suggest changing "Once you get to a leaf node" to "Once you get to a node under which no child exists for the desired direction"

Tomiii (talk) 01:08, 20 March 2009 (UTC)

You are right, & I modified the section to correct the error. Let me know what you think. User A1 (talk) 03:15, 20 March 2009 (UTC)

NP-Hard

Why is it stated that "The problem of finding NN in high-dimensional data is thought to be NP-hard[2]"? First of all, I don't see where that cite makes any mention of that. Second, unless I'm missing something, NN can be done in linear time: take the query point, compute the distance between the query point and every other point, maintaining a minimum as you go. That's it. If you want kNN, then maintain a structure holding the k minima, or if you really want to brute-force it, compute the distances and then do an n log n sort. NN for all points in the data set is just N^2 even. So what gives? Jmacglashan (talk) 06:04, 1 May 2009 (UTC)

  • In d dimensions, the best brute-force solution to the all-points problem costs d·N², and query structures with logarithmic query time require n^O(d) space. Using a KD tree gives query times exponential in d, and thus in high dimensions it can be beaten by brute force, as you point out. I think the NP claim is a misreading of the cited text, which states that the approach of using approximate NN solutions is similar to the use of approximate solutions to NP-hard problems, and the claim is therefore false. User A1 (talk) 07:39, 1 May 2009 (UTC)
    • Perhaps your interpretation of that sentence is right, but it is indeed very confusing. Thanks for removing it. Jmacglashan (talk) 15:42, 1 May 2009 (UTC)
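For concreteness, the brute-force linear scan described above really is just a few lines (Python sketch):

 def brute_force_nn(points, query):
     # One linear pass, keeping the running minimum: O(d*n) per query.
     dist2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
     return min(points, key=dist2)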

NN algo

I think this statement in the NN algorithm is at least misleading: "Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate and the search point is less than the distance from the search point to the current best". It is probably meant to be the "bounds overlap ball" test of the original paper by Friedman et al. As stated in the current article, it amounts to testing the bounding box of the current sphere for overlap with the region of the node, which is something different. I don't have the time right now to fix the article. —Preceding unsigned comment added by Snoopy67 (talkcontribs) 18:33, 12 May 2009 (UTC)
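For reference, one common way to implement a bounds-overlap-ball style test (a sketch only; not necessarily the exact formulation of Friedman et al.) is to compare the current best radius with the distance from the query point to the node's region:

 def ball_overlaps_bounds(query, radius, lo, hi):
     # 'lo' and 'hi' hold the per-axis bounds of the node's region.
     # Clamp the query point into the box; if even the clamped (closest) point
     # is farther away than 'radius', the region cannot contain a better neighbour.
     d2 = 0.0
     for q, l, h in zip(query, lo, hi):
         c = min(max(q, l), h)
         d2 += (q - c) ** 2
     return d2 <= radius ** 2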

single node can be a tree

Trees are nonlinear data structures; that is, more than one data item has to be present. A single node cannot serve as a tree, or even as a nonlinear data structure, so we can say it is not a tree.

Misleading complexities

The article cites the paper "It's okay to be skinny, if your friends are fat" when it says that the complexity of finding approximate nearest neighbours is O(epsilon^-d log n) (and something similar for range searching). This is very misleading. What is happening here is an unfortunate habit of people in this research area of making the "dimension is constant" assumption, which makes complexities look much better than they are. A "better" data structure for NN searching than the kd-tree is the BBD-tree (because of the extra theoretical performance guarantees, not necessarily in practice). If we look at its nearest-neighbour search complexity (from the paper "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions"), we see it is O(C log n), where C <= d ceil(1 + 6d/epsilon)^d. This certainly does not simplify to O(log n) if we substitute epsilon = 1! There _is_ a hidden factor exponential in the dimensionality, and this also holds for the kd-tree. The nearest-neighbor search article has similar problems, which I comment on there separately.

Kaba3 (talk) 22:27, 6 August 2010 (UTC)
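To make the hidden constant concrete: already for <math>d = 3</math> and <math>\varepsilon = 1</math> the quoted bound gives <math>C \le d \left\lceil 1 + 6d/\varepsilon \right\rceil^{d} = 3 \cdot 19^{3} = 20577</math>, so the factor multiplying <math>\log n</math> is in the tens of thousands and grows exponentially with the dimension.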

Hello Kaba3,

I have grabbed a copy of the article you cite; it is quite long and I am unfamiliar with the BBD tree algorithm, I shall read it and have a think. In the mean time, please feel free to be WP:BOLD, and make edits to the article. Generally speaking, technical topics do not generate a great deal of dissent, particularly when edits are substantial and well cited. User A1 (talk) 23:48, 6 August 2010 (UTC)

Original research?

That's the reference: GOLDSMITH, J., AND SALMON, J. 1987. Automatic creation of object hierarchies for ray tracing. IEEE CG&A 7, 5, 14–20.

I have moved the following text here, until we get a reference that explains what is going on. Also, a ref would be good to satisfy No original research. I have modified the text to make it a bit more readable.


===Optimisation Heuristics=== ====Surface Area Heuristic==== Heuristic approaches can be used to reduce the computational cost of querying the tree. This can be done by choosing the position and orientation of the data-splitting plane so that the expected cost of the resulting tree is minimized. The Surface Area Heuristic (SAH) computes, for every potential split, the value <math>C_\text{traverse} + C_\text{intersect} \times \frac{SA_\text{left} \cdot N_\text{left} + SA_\text{right} \cdot N_\text{right}}{SA_\text{node}}</math>, where <math>C_\text{traverse}</math> is the estimated cost of traversing one more node, <math>C_\text{intersect}</math> the estimated cost of one intersection test, <math>SA</math> the surface area of a volume and <math>N</math> the number of primitives it contains. The split position/orientation with the lowest SAH value is compared to the estimated cost of not splitting at all, which is <math>C_\text{intersect} \times N_\text{node}</math>. If the estimated cost of splitting is lower than the cost of not splitting, the current volume is split using a splitting plane at the chosen position/orientation.

User A1 (talk) 15:30, 6 August 2009 (UTC)
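For what it's worth, a rough Python sketch of how such a heuristic could be evaluated (my own illustration; surface_area and split_counts are hypothetical helpers, and the two cost constants are the estimated costs from the formula above):

 def sah_cost(box, split, primitives, cost_traverse, cost_intersect):
     # Estimated cost of splitting 'box' at 'split', following the SAH formula above.
     # split_counts is assumed to return the two half-boxes and how many primitives
     # fall into each half; surface_area returns the surface area of a box.
     left_box, right_box, n_left, n_right = split_counts(box, split, primitives)
     return cost_traverse + cost_intersect * (
         surface_area(left_box) * n_left +
         surface_area(right_box) * n_right) / surface_area(box)

 def choose_split(box, primitives, candidates, cost_traverse, cost_intersect):
     # Take the cheapest candidate split, or return None when leaving the node
     # as a leaf is estimated to be cheaper than splitting it.
     best = min(candidates,
                key=lambda s: sah_cost(box, s, primitives, cost_traverse, cost_intersect))
     if sah_cost(box, best, primitives, cost_traverse, cost_intersect) < cost_intersect * len(primitives):
         return best
     return None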

Modified Lua code for NN search

Hi

I modified the NN search algorithm so that it is more in line with the aforementioned textual description of the search.

Below the new search function.

Hyperplane (talk) 22:38, 1 January 2011 (UTC)

-- Global variable for best point --
best = nil

function kdsearchnn(here, point)    
   
   if here == nil then
      return
   end
		
   if best == nil then				
      best = here
   end
	
   -- Search the near branch --
   local child = child_near(here, point)
   kdsearchnn(child, point)

   -- If the current node is closer than the current best, then it becomes the current best --
   if distance(here, point) < distance(best, point) then		
      best = here
   end

   -- Possibly consider the far branch of the splitting plane --
   if distance_axis(here, point) < distance(best, point) then
      child = child_far(here, point)		
      kdsearchnn(child, point)		
   end
end


Storage requirement

I couldn't find anything on the storage requirement of the data structure in this article. Shouldn't this be mentioned besides the time complexity?

82.95.224.82 (talk) 11:52, 27 March 2011 (UTC)

It's O(n), as it is a tree: each of the n points is stored in exactly one node, and each node carries only a constant amount of extra data (splitting axis and child pointers). 129.67.86.189 (talk) 13:47, 7 April 2011 (UTC)