Mean Shift Dynamic Bandwidth - Practical Machine Learning Tutorial with Python p.42

In this machine learning tutorial, we cover the idea of a dynamically weighted bandwidth with our Mean Shift clustering algorithm.
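For reference, since most of the comments below refer to pieces of it, the dynamic-bandwidth fit() discussed in the video looks roughly like the sketch below. It is reconstructed from the tutorial's approach, so exact names and constants may differ slightly from what is typed on screen:

import numpy as np

class Mean_Shift:
    def __init__(self, radius=None, radius_norm_step=100):
        self.radius = radius
        self.radius_norm_step = radius_norm_step

    def fit(self, data):
        if self.radius is None:
            # derive a step radius from the magnitude of the overall data centroid
            all_data_centroid = np.average(data, axis=0)
            all_data_norm = np.linalg.norm(all_data_centroid)
            self.radius = all_data_norm / self.radius_norm_step

        centroids = {i: data[i] for i in range(len(data))}
        weights = [i for i in range(self.radius_norm_step)][::-1]

        while True:
            new_centroids = []
            for i in centroids:
                in_bandwidth = []
                centroid = centroids[i]
                for featureset in data:
                    distance = np.linalg.norm(featureset - centroid)
                    if distance == 0:
                        distance = 0.00000000001
                    weight_index = int(distance / self.radius)
                    if weight_index > self.radius_norm_step - 1:
                        weight_index = self.radius_norm_step - 1
                    # closer points are repeated more times, so they pull the centroid harder
                    to_add = (weights[weight_index] ** 2) * [featureset]
                    in_bandwidth += to_add
                new_centroid = np.average(in_bandwidth, axis=0)
                new_centroids.append(tuple(new_centroid))

            uniques = sorted(list(set(new_centroids)))

            # drop centroids that converged within one radius step of another
            # (several comments below discuss problems with this step)
            to_pop = []
            for i in uniques:
                for ii in uniques:
                    if i == ii:
                        pass
                    elif np.linalg.norm(np.array(i) - np.array(ii)) <= self.radius:
                        to_pop.append(ii)
                        break
            for i in to_pop:
                try:
                    uniques.remove(i)
                except ValueError:
                    pass

            prev_centroids = dict(centroids)
            centroids = {i: np.array(uniques[i]) for i in range(len(uniques))}

            optimized = True
            for i in centroids:
                if not np.array_equal(centroids[i], prev_centroids[i]):
                    optimized = False
                if not optimized:
                    break
            if optimized:
                break

        self.centroids = centroids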

Comments

If you made it this far through the tutorial series and haven't skipped any videos, congratulate yourself! If you need any evidence of why you should, just compare the number of views on this video to the number of views on the next video, part 43.

alexmorehead

Thanks for this series! I just want to note two things that will hopefully aid someone's understanding (and hopefully I haven't made an error in my own understanding either):

1. The current video implementation contains a logical error in the way centroids are popped. Suppose we have centroids A, B, and C, where B is in the radius of A and C is in the radius of B. If we're iterating A -> B -> C in the current implementation, B will be added to to_pop first, and C will also be added to to_pop. But if B is already in to_pop (i.e. it's set to be popped because B is close to A), then C technically should not be in to_pop, because it's not in the radius of A. In other words, A and C should be the centroids left after the remove operation, as opposed to just A, which is what the current implementation yields.

2. Try/except is required in the video implementation because duplicate centroids are still being added to to_pop. Going back to the example from 1: in the video implementation, B is added to to_pop while checking A's radius for close-by centroids, and B is then added to to_pop again while surveying C's radius.

The following modification should address both of these points:

to_pop = []

for i in uniques:
    if i in to_pop:
        continue  # don't inspect centroids in the radius of i, since i itself will be popped
    for ii in uniques:
        if i == ii:
            continue
        if np.linalg.norm(np.array(i) - np.array(ii)) <= self.radius and ii not in to_pop:
            to_pop.append(ii)  # skip centroids that are already marked for removal

for i in to_pop:
    uniques.remove(i)

amyxst

I think I found the reason the centroids and classification were so messy at the end. The radius was very low: with both positive and negative points, the magnitude of the average gets pulled toward 0. To solve this, use abs(data) when computing all_data_centroid, because that is like taking each point's distance from the origin and then averaging it for all_data_norm. The problem was that even though points were far from the origin (in either the negative or positive direction), the bandwidth came out low. I tested it, and it no longer does any of the messy clustering and is much more accurate.
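If I'm reading that right, the change is only in the radius estimation, something like this (sketched against the tutorial's variable names):

if self.radius is None:
    # average the absolute values so points on opposite sides of the origin
    # don't cancel each other out when estimating the step radius
    all_data_centroid = np.average(np.abs(data), axis=0)  # was: np.average(data, axis=0)
    all_data_norm = np.linalg.norm(all_data_centroid)
    self.radius = all_data_norm / self.radius_norm_step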

DerpRenz

If you watched the whole series, clap your hands!

berksedatk

I think there is an error in the code for to_pop. Let's say there are two centroids a and b whose distance is less than the radius:
when i = a and ii = b, b is added to to_pop;
later, when i = b and ii = a, a is added to to_pop.

So to_pop ends up containing every centroid that has another centroid in its neighbourhood.

sukumarh

Some diagrams illustrating the logic being implemented, with explanations, would be really helpful for visual thinkers such as myself.

DM-pypj

Aye aye, captain, move the ship towards deep learning!

binuskumar

I don't think we need to increase the size of the in_bandwidth list. Why not simply use the 'weights' parameter of np.average? (or am I missing something here?).

That part of the code could be like:

for i in centroids:
    centroid = centroids[i]
    in_bandwidth = []
    weight_list = []  # to hold the weights for np.average
    for featureset in data:
        distance = np.linalg.norm(featureset - centroid)
        if distance == 0:
            distance = 0.0001

        weight_index = int(distance / self.radius)
        if weight_index > self.radius_norm_step - 1:
            weight_index = self.radius_norm_step - 1

        in_bandwidth.append(featureset)
        weight_list.append(weights[weight_index] ** 2)
        # to_add = (weights[weight_index]**2) * [featureset]
        # in_bandwidth += to_add
    new_centroid = np.average(in_bandwidth, axis=0, weights=np.array(weight_list))

ramasubramanian

Love the series.

Not sure if this is an issue with Python 2.7 on a Mac only, but I needed to change the following line to get this to work:

from:
for ii in [i for i in uniques]:
to:
for ii in [u for u in uniques]:

Using i in the comprehension seems to interfere with the i in the enclosing for loop, and it ends up not merging all the clusters that it should, i.e. i always ends up as the last value in uniques and stays that way for the rest of the loop. The intended behaviour is to compare each i with every value in uniques (accepting that one will always match) and, if any values are closer than the radius, drop them from the list.
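A tiny illustration of the scoping difference (only Python 2 lets the comprehension variable clobber the outer i):

uniques = [1, 2, 3]
for i in uniques:
    inner = [i for i in uniques]  # Python 2: rebinds the outer i to 3; Python 3: no effect
    print(i)                      # Python 3 prints 1, 2, 3; Python 2 prints 3, 3, 3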

I guess that a more robust solution would be to recalculate the average of the merged clusters, but I'm not sure whether this would put it in an endless loop.

martinneighbours

A simple way to optimize the squaring of weights is to compute the squares before the 'while True' loop instead of just setting each weight to its index.
I.e. we do weights = [i**2 for i in ...] instead of weights = [i for i in ...], and then reference the values directly when weighting the featureset. So instead of squaring the weights on every one of the n iterations, we square them only once, up front.
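Something like this, sketched against the tutorial's names (radius_norm_step and the per-point weighting are assumed to match the video):

radius_norm_step = 100

# before: the list holds raw indices and gets squared on every use inside the while-True loop
weights = [i for i in range(radius_norm_step)][::-1]
# per point: to_add = (weights[weight_index] ** 2) * [featureset]

# after: square once, up front, and use the value directly
weights = [i ** 2 for i in range(radius_norm_step)][::-1]
# per point: to_add = weights[weight_index] * [featureset]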

ranadiveomkar

In this and the previous one, I am getting more cluster centers than in the video. Why! :(

YashChavanYC

Hopefully this helps:

I modified the "to_pop" generation for-loop and no longer have duplicates. Inspired by the "if not optimized"-"if optimized" double break statements, my "to_pop" generation for-loop now looks like this:

to_pop = []
for i in uniques:
    for ii in uniques:
        if i == ii:
            pass
        elif np.linalg.norm(np.array(i) - np.array(ii)) <= self.radius:
            to_pop.append(ii)
            break
    if i in to_pop:
        break

In this case, I added the "if i in to_pop: break" because it asks: "Is there a duplicate unique? If True, then move on!"

Grepoan

13:30 ish

Right before the to_pop step, you mention converging (i) and (ii), but I don't see that happening anywhere; rather, you're just removing (ii). That shouldn't make a significant difference, since (i) and (ii) are supposedly close (close enough to justify converging them), but it might be worth including to maximize accuracy, even if only by a very small amount.
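For what it's worth, actually converging the pair could look something like this rough sketch (not the code from the video; it replaces each group of nearby uniques with their average rather than keeping only the first member):

merged, absorbed = [], set()
for i in uniques:
    if i in absorbed:
        continue
    group = [np.array(i)]
    for ii in uniques:
        if ii == i or ii in absorbed:
            continue
        if np.linalg.norm(np.array(i) - np.array(ii)) <= self.radius:
            group.append(np.array(ii))
            absorbed.add(ii)
    # keep the average of the group instead of just its first member
    merged.append(tuple(np.average(group, axis=0)))
uniques = sorted(merged)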

mikeg

Hello from 2018. Yes, that to_add with weights * [featureset] is what is killing your performance. If weight = 99, then 99**2 = 9801, which is a very large list to be appending. Granted, 99 is not common, but weights of 40, 60, or 80 could be, and those are still big. I worked out a way to do the average without the list appending, and that made performance much more acceptable. I couldn't figure out how to deal with centroids that converged very close to each other but not quite within the minimum radius, though.
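One way to get the same average without the repeated-list trick (not necessarily what the commenter did) is to keep running weighted sums; the names below follow the tutorial's inner loop:

weighted_sum = np.zeros_like(centroid, dtype=float)
total_weight = 0.0
for featureset in data:
    distance = np.linalg.norm(featureset - centroid)
    if distance == 0:
        distance = 0.00000000001
    weight_index = min(int(distance / self.radius), self.radius_norm_step - 1)
    w = weights[weight_index] ** 2  # same weighting as the video, but no list building
    weighted_sum += w * featureset
    total_weight += w
if total_weight > 0:
    new_centroid = tuple(weighted_sum / total_weight)
else:
    new_centroid = tuple(centroid)  # every point had zero weight; keep the centroid as-is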

amdreallyfast

Again - thanks for your great work on this series.

Question: what if the average of all data points is the origin (i.e. all_data_norm == 0)? I know this is unlikely, but it could happen (e.g. 4 clusters, one in each quadrant, assuming a 2-D dataset). Does this mess up this implementation of mean shift? And I guess this leads to my more general question: why exactly is the norm of the overall average of all data points a good starting estimate of what the overall radius should be?

Appreciate your time.

hellopoop

Hey, I'm a little confused about the classification part:

for featureset in data:
    distances = [np.linalg.norm(featureset - self.centroids[centroid]) for centroid in self.centroids]
    classification = distances.index(min(distances))

Imagine a situation where point A belongs to class 1, but point A is closer to the centroid of class 2 than to the centroid of class 1. According to your code, A will be classified as class 2. I think the classification part should be inside the while True loop.

jimxiao

Hello sentdex,

I have a list of numpy arrays of float values, where each array in the list holds the intensity values of an image of dimension (336, 336, 80, 3), at around 103 MB each. I need to apply mean shift clustering to this list. Could you please suggest how to apply mean shift clustering so that each cluster represents one array in the list of numpy arrays?

Regards,
Raj

rajubhai-glqk

If you get a warning, use 'from sklearn.datasets import make_blobs' instead of 'from sklearn.datasets.samples_generator import make_blobs', which is deprecated.

kevinziroldi

Hey Harrison, where should I go from here: continue this series, or continue with your new deep learning flow, the Keras playlist?

rxeqmxy

I tried this as input: X, y = make_blobs(n_samples=30, centers=10, n_features=2)
The data I got clearly has 5 clusters, but the result shows more than 10 clusters, with the distance between a few of the centroids less than 1 unit.

sukumarh