KD-Tree Nearest Neighbor Data Structure

A KD-Tree is a data structure that is useful for organizing data by several criteria at once. Consider an example where you have a set of points on a 2-dimensional plane. Now suppose you are asked to find the nearest neighbor of some target point. What data structure would you use to store these points so that you could solve that problem?

A naive approach would be to store the points in an array, sorted by one of the properties, such as the X coordinate. Then you could easily do a binary search to find the point with the closest X coordinate. But unfortunately you’d still not necessarily be any closer to finding a point whose Y coordinate is also nearby. To convince yourself of this, just think of the case where all of the points share the same X coordinate. Clearly, you’d have to examine every single point, one by one, to find the nearest neighbor of the target point.
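That brute-force baseline can be sketched in a few lines of Java (the class and method names here are illustrative, not from the video's source):

```java
// The naive baseline: compare the target against every point, O(n) per query.
class NaiveNearestDemo {
    // Squared Euclidean distance between two 2-D points.
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Linear scan: examine every point, keep the closest one seen so far.
    static double[] nearest(double[][] points, double[] target) {
        double[] best = points[0];
        for (double[] p : points)
            if (dist2(p, target) < dist2(best, target)) best = p;
        return best;
    }

    public static void main(String[] args) {
        // The degenerate case from the text: all points share the same X.
        double[][] pts = {{1, 9}, {1, 2}, {1, 5}};
        double[] best = nearest(pts, new double[]{1, 4});
        System.out.println(best[1]); // 5.0
    }
}
```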

The KD-Tree data structure is much like the regular binary search tree that you are familiar with, except that each node typically holds not just one value but a list of values. When traversing a KD-Tree, we cycle through the indices into that list of values based on the depth of the particular node.

The “K” in the name “KD-Tree” refers to the number of properties that each data point has. In our example, since it’s a 2-dimensional plane, we have exactly 2 properties: the X coordinate and the Y coordinate. So the root node starts off arbitrarily picking one of those. Let’s suppose it’s the X coordinate. Then its children, at the second level, alternate to using the Y coordinate. The grandchildren nodes cycle back to using the X coordinate, and so on.
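The cycling rule is just the depth taken modulo K. A minimal sketch (names are illustrative, not from the video's source):

```java
// Axis cycling in a KD-Tree: the comparison axis at a node is depth % K.
class AxisCycleDemo {
    static final int K = 2; // x and y in the 2-D example

    static int axis(int depth) { return depth % K; }

    public static void main(String[] args) {
        System.out.println(axis(0)); // 0: the root compares on X
        System.out.println(axis(1)); // 1: its children compare on Y
        System.out.println(axis(2)); // 0: grandchildren cycle back to X
    }
}
```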

Java source code:

When inserting a new node, just like in the typical binary search tree, we compare the node’s value and go left if it’s smaller, right if it’s bigger. It’s just that, as we go deeper and deeper down the tree, we keep alternating between using the X and the Y values in the comparison.
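The insertion described above can be sketched as a short recursive method. This is a minimal sketch assuming 2-D points stored as `double[2]` arrays; the class and method names are illustrative, not taken from the video's source code:

```java
// Minimal KD-Tree insertion: standard BST descent, but the comparison
// axis alternates between X (even depths) and Y (odd depths).
class KDTreeInsertDemo {
    static final int K = 2;

    static class KDNode {
        double[] point;
        KDNode left, right;
        KDNode(double[] p) { point = p; }
    }

    static KDNode insert(KDNode node, double[] point, int depth) {
        if (node == null) return new KDNode(point);
        int axis = depth % K; // which coordinate to compare at this level
        if (point[axis] < node.point[axis]) {
            node.left = insert(node.left, point, depth + 1);
        } else {
            node.right = insert(node.right, point, depth + 1);
        }
        return node;
    }

    public static void main(String[] args) {
        KDNode root = null;
        double[][] pts = {{5, 4}, {2, 6}, {13, 3}, {3, 1}, {10, 2}};
        for (double[] p : pts) root = insert(root, p, 0);
        // The root split on X: 2 < 5 went left, 13 >= 5 went right.
        System.out.println(root.left.point[0]);  // 2.0
        System.out.println(root.right.point[0]); // 13.0
    }
}
```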

Note that in our example, since we started with comparing the X values, the Y values were ignored and as a consequence the 2nd layer nodes break the binary search tree rule where all nodes less than the root are on the left side and vice versa. But amazingly, it turns out that this is OK! The magic here is that we can check for this inconsistency on the way back, when unwinding the recursion.

Visually you can think of the root node dividing the XY plane into left and right sides, since we decided to start off using the X coordinate. Then the next point will fall into either the left or the right side and in turn divide that side into top and bottom sections. So as more and more nodes are inserted into the tree, each point splits the parent’s section into 2 sub-sections, alternating between splitting the section into left and right, vs top and bottom.

Once we have gone as far down as we can, we then recurse back, as you would in a regular binary tree search. We do have a candidate closest neighbor, but there is a possibility of a closer point still.

Think of the shortest distance found so far as a line going from the target point to the candidate point. Let’s call this line “r”. This line will probably be at some arbitrary diagonal angle. But if the perpendicular distance from the target point to the boundary of the section that we chose not to visit while recursing down is shorter than r, then there is a chance that that section contains a closer point.

So this is where the magic happens - as we recurse back, we also calculate the distance to the section that we did not visit. If that distance, let’s call it r-prime, is smaller than the best distance found so far, then we traverse into that section as well.
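The whole search, including that r-prime pruning test, can be sketched as follows. This is a minimal sketch, not the video's actual source; all names are illustrative, and distances are kept squared to avoid square roots:

```java
// Nearest-neighbor search in a 2-D KD-Tree with the pruning test:
// only descend into the unvisited side if the perpendicular distance
// to its splitting line (r') is smaller than the best distance so far (r).
class KDTreeNearestDemo {
    static final int K = 2;

    static class KDNode {
        double[] point;
        KDNode left, right;
        KDNode(double[] p) { point = p; }
    }

    static KDNode insert(KDNode node, double[] p, int depth) {
        if (node == null) return new KDNode(p);
        int axis = depth % K;
        if (p[axis] < node.point[axis]) node.left = insert(node.left, p, depth + 1);
        else node.right = insert(node.right, p, depth + 1);
        return node;
    }

    // Squared Euclidean distance (comparing squares preserves the ordering).
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < K; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static KDNode nearest(KDNode node, double[] target, int depth, KDNode best) {
        if (node == null) return best;
        if (best == null || dist2(target, node.point) < dist2(target, best.point))
            best = node;
        int axis = depth % K;
        // Descend first into the side the target itself falls on.
        KDNode near = target[axis] < node.point[axis] ? node.left : node.right;
        KDNode far  = target[axis] < node.point[axis] ? node.right : node.left;
        best = nearest(near, target, depth + 1, best);
        // Pruning test on the way back: visit the far side only if the
        // splitting line is closer than the best candidate (r' < r).
        double rPrime = target[axis] - node.point[axis];
        if (rPrime * rPrime < dist2(target, best.point))
            best = nearest(far, target, depth + 1, best);
        return best;
    }

    public static void main(String[] args) {
        KDNode root = null;
        double[][] pts = {{5, 4}, {2, 6}, {13, 3}, {3, 1}, {10, 2}, {8, 7}};
        for (double[] p : pts) root = insert(root, p, 0);
        double[] best = nearest(root, new double[]{9, 4}, 0, null).point;
        System.out.println(best[0] + "," + best[1]); // 10.0,2.0
    }
}
```

Because the far side is pruned whenever r' is larger than r, the search is still exact: it only skips regions that provably cannot contain a closer point.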

But note that this does not mean that we visit every node in the tree. If the height of the tree is H, then in the typical case we visit on the order of 2H nodes. In other words, for a reasonably balanced tree the search remains roughly logarithmic with respect to the number of nodes in the tree.

So far we have been using an example of a 2 dimensional plane. But the methodology works equally well in higher dimensions. Each node would still just have a left and a right branch, but the number of values stored in a node could be arbitrarily large.

Suppose you are designing a database for a hospital that keeps track of each patient’s blood type, white blood cell count and blood pressure. Then a new patient comes in with an unknown illness. Using a KD-Tree you would be able to quickly select one or even a set of patients that share similar properties. These would be the “nearest neighbors”, with a higher likelihood of having a similar illness.
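The 2-D code above generalizes by simply setting K to the number of features. Below is a hedged sketch with K = 3 numeric features per patient, e.g. {white blood cell count, systolic pressure, diastolic pressure}; these feature choices are illustrative, and blood type is omitted since it is categorical and would need a numeric encoding (or a different distance) to fit this scheme:

```java
// The same KD-Tree with K = 3 numeric features per record.
// Feature layout is a hypothetical example: {WBC count, systolic, diastolic}.
class PatientKDTreeDemo {
    static final int K = 3;

    static class Node {
        double[] v;
        Node left, right;
        Node(double[] v) { this.v = v; }
    }

    static Node insert(Node n, double[] v, int depth) {
        if (n == null) return new Node(v);
        int axis = depth % K; // cycle through all 3 features by depth
        if (v[axis] < n.v[axis]) n.left = insert(n.left, v, depth + 1);
        else n.right = insert(n.right, v, depth + 1);
        return n;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < K; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static Node nearest(Node n, double[] t, int depth, Node best) {
        if (n == null) return best;
        if (best == null || dist2(t, n.v) < dist2(t, best.v)) best = n;
        int axis = depth % K;
        Node near = t[axis] < n.v[axis] ? n.left : n.right;
        Node far  = t[axis] < n.v[axis] ? n.right : n.left;
        best = nearest(near, t, depth + 1, best);
        double d = t[axis] - n.v[axis];
        if (d * d < dist2(t, best.v)) best = nearest(far, t, depth + 1, best);
        return best;
    }

    public static void main(String[] args) {
        double[][] patients = {
            {6.1, 120, 80}, {11.5, 140, 95}, {4.8, 110, 70}, {9.0, 135, 85}
        };
        Node root = null;
        for (double[] p : patients) root = insert(root, p, 0);
        // A new patient arrives; find the most similar existing record.
        double[] match = nearest(root, new double[]{9.2, 133, 86}, 0, null).v;
        System.out.println(match[0]); // 9.0
    }
}
```

In practice the features would also be normalized to comparable scales, since a raw Euclidean distance otherwise lets the feature with the largest numeric range dominate.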
Comments

50 min lecture material wrapped in a 7 min video. Honestly amazing!

benwu

In 6 and a half minutes, you managed to teach me the fundamentals of something I thought would take a lot of time. Amazing, thank you!

yashagarwal

This is the greatest visualization of nearest neighbors I've ever seen. You sir, are a blessing beyond words, I'm not religious but I might as well be after watching your video. Thank you.

wonton

Awesome tutorial on the KD-Tree data structure! The nearest neighbor search algorithm is easy to follow here. Thanks, keep em coming!

maridavies

This video hits all the right notes: short, useful and clear. Keep up the excellent work!


This tutorial was very precise, direct and comprehensive and it really helped me understand how to use kd trees. Thank you so much sir!

Aca

One of the most useful searching algorithms in data science

arnavanuj

This cleared up k-d Trees for me. Thanks for putting up the video.

papayasprout

I came to you from 2 independent great courses, not understanding a specific scenario where they ignore an entire section without clarifying the reason, but you explained it far better than both. Well done! Thank you :)

omarshawky

Concise and yet very informative. Enjoyed. Thanks!

familytamelo

Great visualization of a complex concept

maxinteltech

This is an interview question. How to design an algorithm for a flight computer to find closest airport during the flight in an emergency? Well explained along with lucid diagrams... Thanks for sharing your knowledge...

NoName-iptt

Fantastic explanation and visualisation, thank you so much! :)

ChrisOffner

Wow! New episode! I already cannot wait to learn it! Thank you so much.

卡机不

Thank you, I think it solved my problems with making kd-tree and searching it.

dainel

Thanks, dear, I can't believe it. What an amazing way of explaining it; it is kind of unbelievable. Again, thank you, sir, and a lot of love.

pawankumarpandit

Nice explanation, I have a question though. Since this is not a balanced binary search tree, the worst case is still O(n), right? Or is there a way to use balancing techniques (like in a treap) with kd trees?

zafirrasyiditaufik

Great video! Thank you for making math visible :)

KingGJT

Thank you very much for making this video !

wouttengrootenhuysen

I love this channel please upload more

yeetgod