KD-Tree Nearest Neighbor Data Structure

A KD-Tree is a data structure that is useful for organizing data by several criteria at once. Consider an example where you have a set of points on a 2-dimensional plane. Now suppose you are asked to find the nearest neighbor of some target point. What data structure would you use to store these points so that you could solve that problem?

A naive approach would be to store the points in an array, sorted by one of the properties, such as the X coordinate. Then you could easily do a binary search to find the point with the closest X coordinate. But unfortunately you’d still not necessarily be any closer to finding a point whose Y coordinate is also nearby. To convince yourself of this, just think of the case where all of the points share the same X coordinate. Clearly, you’d have to examine every single point, one by one, to find the nearest neighbor of the target point.
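That brute-force baseline can be sketched in a few lines of Java (the class and method names here are illustrative, not from the video's source):

```java
// The naive baseline: compare the target against every point, O(n) per query.
class NaiveNearestDemo {
    // Squared Euclidean distance between two 2-D points.
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Linear scan: examine every point, keep the closest one seen so far.
    static double[] nearest(double[][] points, double[] target) {
        double[] best = points[0];
        for (double[] p : points)
            if (dist2(p, target) < dist2(best, target)) best = p;
        return best;
    }

    public static void main(String[] args) {
        // The degenerate case from the text: all points share the same X.
        double[][] pts = {{1, 9}, {1, 2}, {1, 5}};
        double[] best = nearest(pts, new double[]{1, 4});
        System.out.println(best[1]); // 5.0
    }
}
```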

The KD-Tree data structure is much like the regular binary search tree that you are familiar with, except that each node typically holds not just one value but a list of values. When traversing a KD-Tree, we cycle through the indices into that list of values based on the depth of the particular node.

The “K” in the name “KD-Tree” refers to the number of properties that each data point has. In our example, since it’s a 2-dimensional plane, we have exactly 2 properties: the X coordinate and the Y coordinate. So the root node starts off arbitrarily picking one of those. Let’s suppose it’s the X coordinate. Then its children, at the second level, alternate to using the Y coordinate. The grandchildren nodes cycle back to using the X coordinate, and so on.
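The cycling rule is just the depth taken modulo K. A minimal sketch (names are illustrative, not from the video's source):

```java
// Axis cycling in a KD-Tree: the comparison axis at a node is depth % K.
class AxisCycleDemo {
    static final int K = 2; // x and y in the 2-D example

    static int axis(int depth) { return depth % K; }

    public static void main(String[] args) {
        System.out.println(axis(0)); // 0: the root compares on X
        System.out.println(axis(1)); // 1: its children compare on Y
        System.out.println(axis(2)); // 0: grandchildren cycle back to X
    }
}
```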

Java source code:

When inserting a new node, just like in the typical binary search tree, we compare the node’s value and go left if it’s smaller, right if it’s bigger. It’s just that, as we go deeper and deeper down the tree, we keep alternating between using the X and the Y values in the comparison.
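The insertion described above can be sketched as a short recursive method. This is a minimal sketch assuming 2-D points stored as `double[2]` arrays; the class and method names are illustrative, not taken from the video's source code:

```java
// Minimal KD-Tree insertion: standard BST descent, but the comparison
// axis alternates between X (even depths) and Y (odd depths).
class KDTreeInsertDemo {
    static final int K = 2;

    static class KDNode {
        double[] point;
        KDNode left, right;
        KDNode(double[] p) { point = p; }
    }

    static KDNode insert(KDNode node, double[] point, int depth) {
        if (node == null) return new KDNode(point);
        int axis = depth % K; // which coordinate to compare at this level
        if (point[axis] < node.point[axis]) {
            node.left = insert(node.left, point, depth + 1);
        } else {
            node.right = insert(node.right, point, depth + 1);
        }
        return node;
    }

    public static void main(String[] args) {
        KDNode root = null;
        double[][] pts = {{5, 4}, {2, 6}, {13, 3}, {3, 1}, {10, 2}};
        for (double[] p : pts) root = insert(root, p, 0);
        // The root split on X: 2 < 5 went left, 13 >= 5 went right.
        System.out.println(root.left.point[0]);  // 2.0
        System.out.println(root.right.point[0]); // 13.0
    }
}
```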

Note that in our example, since we started with comparing the X values, the Y values were ignored and as a consequence the 2nd layer nodes break the binary search tree rule where all nodes less than the root are on the left side and vice versa. But amazingly, it turns out that this is OK! The magic here is that we can check for this inconsistency on the way back, when unwinding the recursion.

Visually you can think of the root node dividing the XY plane into left and right sides, since we decided to start off using the X coordinate. Then the next point will fall into either the left or the right side and in turn divide that side into top and bottom sections. So as more and more nodes are inserted into the tree, each point splits the parent’s section into 2 sub-sections, alternating between splitting the section into left and right, vs top and bottom.

Once we have gone as far down as we can, we then recurse back, as you would in a regular binary tree search. We do have a candidate closest neighbor, but there is a possibility of a closer point still.

Think of the shortest distance found so far as a line going from the target point to the candidate point. Let’s call this line “r”. This line will probably be at some arbitrary diagonal angle. But if the perpendicular distance from the target point to the boundary of the section that we chose not to visit while recursing down is shorter than r, then there is a chance that that section contains a closer point.

So this is where the magic happens - as we recurse back, we also calculate the distance to the section that we did not visit. If that distance, let’s call it r-prime, is smaller than the best distance found so far, then we traverse into that section as well.
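The whole search, including that r-prime pruning test, can be sketched as follows. This is a minimal sketch, not the video's actual source; all names are illustrative, and distances are kept squared to avoid square roots:

```java
// Nearest-neighbor search in a 2-D KD-Tree with the pruning test:
// only descend into the unvisited side if the perpendicular distance
// to its splitting line (r') is smaller than the best distance so far (r).
class KDTreeNearestDemo {
    static final int K = 2;

    static class KDNode {
        double[] point;
        KDNode left, right;
        KDNode(double[] p) { point = p; }
    }

    static KDNode insert(KDNode node, double[] p, int depth) {
        if (node == null) return new KDNode(p);
        int axis = depth % K;
        if (p[axis] < node.point[axis]) node.left = insert(node.left, p, depth + 1);
        else node.right = insert(node.right, p, depth + 1);
        return node;
    }

    // Squared Euclidean distance (comparing squares preserves the ordering).
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < K; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static KDNode nearest(KDNode node, double[] target, int depth, KDNode best) {
        if (node == null) return best;
        if (best == null || dist2(target, node.point) < dist2(target, best.point))
            best = node;
        int axis = depth % K;
        // Descend first into the side the target itself falls on.
        KDNode near = target[axis] < node.point[axis] ? node.left : node.right;
        KDNode far  = target[axis] < node.point[axis] ? node.right : node.left;
        best = nearest(near, target, depth + 1, best);
        // Pruning test on the way back: visit the far side only if the
        // splitting line is closer than the best candidate (r' < r).
        double rPrime = target[axis] - node.point[axis];
        if (rPrime * rPrime < dist2(target, best.point))
            best = nearest(far, target, depth + 1, best);
        return best;
    }

    public static void main(String[] args) {
        KDNode root = null;
        double[][] pts = {{5, 4}, {2, 6}, {13, 3}, {3, 1}, {10, 2}, {8, 7}};
        for (double[] p : pts) root = insert(root, p, 0);
        double[] best = nearest(root, new double[]{9, 4}, 0, null).point;
        System.out.println(best[0] + "," + best[1]); // 10.0,2.0
    }
}
```

Because the far side is pruned whenever r' is larger than r, the search is still exact: it only skips regions that provably cannot contain a closer point.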

But note that this does not mean that we visit every node in the tree. If the height of the tree is H, then in the typical case we visit on the order of 2H nodes. In other words, for a reasonably balanced tree the search remains roughly logarithmic with respect to the number of nodes in the tree.

So far we have been using an example of a 2 dimensional plane. But the methodology works equally well in higher dimensions. Each node would still just have a left and a right branch, but the number of values stored in a node could be arbitrarily large.

Suppose you are designing a database for a hospital that keeps track of each patient’s blood type, white blood cell count and blood pressure. Then a new patient comes in with an unknown illness. Using a KD-Tree you would be able to quickly select one or even a set of patients that share similar properties. These would be the “nearest neighbors”, with a higher likelihood of having a similar illness.
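The 2-D code above generalizes by simply setting K to the number of features. Below is a hedged sketch with K = 3 numeric features per patient, e.g. {white blood cell count, systolic pressure, diastolic pressure}; these feature choices are illustrative, and blood type is omitted since it is categorical and would need a numeric encoding (or a different distance) to fit this scheme:

```java
// The same KD-Tree with K = 3 numeric features per record.
// Feature layout is a hypothetical example: {WBC count, systolic, diastolic}.
class PatientKDTreeDemo {
    static final int K = 3;

    static class Node {
        double[] v;
        Node left, right;
        Node(double[] v) { this.v = v; }
    }

    static Node insert(Node n, double[] v, int depth) {
        if (n == null) return new Node(v);
        int axis = depth % K; // cycle through all 3 features by depth
        if (v[axis] < n.v[axis]) n.left = insert(n.left, v, depth + 1);
        else n.right = insert(n.right, v, depth + 1);
        return n;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < K; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static Node nearest(Node n, double[] t, int depth, Node best) {
        if (n == null) return best;
        if (best == null || dist2(t, n.v) < dist2(t, best.v)) best = n;
        int axis = depth % K;
        Node near = t[axis] < n.v[axis] ? n.left : n.right;
        Node far  = t[axis] < n.v[axis] ? n.right : n.left;
        best = nearest(near, t, depth + 1, best);
        double d = t[axis] - n.v[axis];
        if (d * d < dist2(t, best.v)) best = nearest(far, t, depth + 1, best);
        return best;
    }

    public static void main(String[] args) {
        double[][] patients = {
            {6.1, 120, 80}, {11.5, 140, 95}, {4.8, 110, 70}, {9.0, 135, 85}
        };
        Node root = null;
        for (double[] p : patients) root = insert(root, p, 0);
        // A new patient arrives; find the most similar existing record.
        double[] match = nearest(root, new double[]{9.2, 133, 86}, 0, null).v;
        System.out.println(match[0]); // 9.0
    }
}
```

In practice the features would also be normalized to comparable scales, since a raw Euclidean distance otherwise lets the feature with the largest numeric range dominate.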
Comments

50 min lecture material wrapped in a 7 min video. Honestly amazing!

benwu

In 6 and a half minutes, you managed to teach me the fundamentals of something I thought would take a lot of time. Amazing, thank you!

yashagarwal

This is the greatest visualization of nearest neighbors I've ever seen. You sir, are a blessing beyond words, I'm not religious but I might as well be after watching your video. Thank you.

wonton

Awesome tutorial on the KD-Tree data structure! The nearest neighbor search algorithm is easy to follow here. Thanks, keep em coming!

maridavies

This video hits all the right notes: short, useful and clear. Keep up the excellent work!


This tutorial was very precise, direct and comprehensive and it really helped me understand how to use kd trees. Thank you so much sir!

Aca

One of the most useful searching algorithms in data science

arnavanuj

This cleared up k-d Trees for me. Thanks for putting up the video.

papayasprout

I came to you from 2 independent great courses, not understanding a specific scenario where they ignore an entire section without clarifying the reason, but you explained it far better than both. Well done! Thank you :)

omarshawky

Concise and yet very informative. Enjoyed. Thanks!

familytamelo

Great visualization of a complex concept

maxinteltech

This is an interview question. How to design an algorithm for a flight computer to find closest airport during the flight in an emergency? Well explained along with lucid diagrams... Thanks for sharing your knowledge...

NoName-iptt

Fantastic explanation and visualisation, thank you so much! :)

ChrisOffner

Wow! New episode! I already cannot wait to learn it! Thank you so much.

卡机不

Thank you, I think it solved my problems with making kd-tree and searching it.

dainel

Thanks, dear, I can't believe it. What an amazing way of explaining it; it is kind of unbelievable. Again, thank you, sir, and a lot of love.

pawankumarpandit

Nice explanation, I have a question though. Since this is not a balanced binary search tree, the worst case is still O(n), right? Or is there a way to use balancing techniques (like in a treap) with kd trees?

zafirrasyiditaufik

Great video! Thank you for making math visible :)

KingGJT

Thank you very much for making this video !

wouttengrootenhuysen

I love this channel please upload more

yeetgod