Dropout Regularization (C2W1L06)

Comments

Dropout helps ensure that the model does not become overly reliant on any particular feature, i.e. that it still performs well even when that feature is absent.
keep_prob = 0.8 means each hidden unit is kept with probability 0.8 and dropped (ignored) with probability 0.2.

No dropout is applied at test time, because we do not want the output to be random.
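
A minimal numpy sketch of what that looks like in practice (my own illustration; the a3 / d3 / keep_prob names follow the video, but the train/test switch is just an assumption about how you might wire it up):

```python
import numpy as np

keep_prob = 0.8  # probability that a given hidden unit is kept

def dropout_layer3(a3, keep_prob, training=True):
    """Inverted dropout applied to the activations a3 of layer 3."""
    if not training:
        # No dropout at test time: predictions stay deterministic.
        return a3
    # Each entry of d3 is True with probability keep_prob (~20% become False).
    d3 = np.random.rand(*a3.shape) < keep_prob
    a3 = a3 * d3          # zero out the dropped units
    a3 = a3 / keep_prob   # rescale so the expected value of a3 is unchanged
    return a3
```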

exampreparationonline

As we can see, the d3 mask is roughly 20% zeros (False), so multiplying a3 by d3 sets about 20% of a3's values to 0; that is the dropout mechanism during training. But since a3 still feeds into the layer-4 computation (z4), we divide a3 by keep_prob (0.8 in this case), which scales the remaining values back up so that the expected value of a3 stays the same. Hope it helps.
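
A quick numerical check of that scaling step (my own illustrative snippet, reusing the a3 / d3 / keep_prob names from the video):

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
a3 = np.random.rand(5, 10000)               # toy activations for layer 3
d3 = np.random.rand(*a3.shape) < keep_prob  # ~20% of the entries are False

a3_dropped = (a3 * d3) / keep_prob          # inverted dropout

print(a3.mean())          # ~0.50
print(a3_dropped.mean())  # close to the same value, thanks to the / keep_prob
```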

SandeepKumar-ieni

So the dropout mask d3 is recalculated on every iteration!?
Doesn't that make the result jump around like a monkey, preventing the network from actually converging to one result!?

Why wouldn't calculating a fixed/final d3 before training be better?

Jaspinik

According to 0:40, `1 - keep_prob` is the probability that the node will be eliminated. Andrew (I'm paraphrasing a bit) says that it is equivalent to "removing all the ingoing links to that node". In other words, this is equivalent to setting the whole column of d3 (2:34) that corresponds to the node to be eliminated to zero. By doing this, the dot product of the weights of this column with the previous layer's activation units would be 0 (which is what we want).
If that's the case, shouldn't we instead have:
```
# one coin flip per column: keep the whole column with probability keep_prob
d3 = np.zeros((a3.shape[0], a3.shape[1]))
d3[:, np.random.rand(a3.shape[1]) < keep_prob] = 1
```
where a3.shape[1] is the size of the current layer (whose nodes are dropped out) and a3.shape[0] the size of the previous layer.

If, instead, we implement it as shown in the video, that is,
```
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
```
it is not guaranteed that the whole column corresponding to an eliminated node will be zero. Thoughts?
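
For what it's worth, here is a small snippet contrasting the two masks (entirely my own illustration; note that in the course's notation a3 has shape (number of units in layer 3, number of examples m), so the per-entry mask from the video drops each unit independently for each example rather than for the whole batch at once):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
a3 = np.random.rand(4, 6)   # 4 hidden units, 6 training examples

# Video's version: an independent coin flip per entry, so a unit can be
# dropped for one example and kept for another.
d3_per_entry = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

# Column-wise version from the comment above: one coin flip per column,
# which zeroes entire columns of the mask.
d3_per_column = np.zeros(a3.shape)
d3_per_column[:, np.random.rand(a3.shape[1]) < keep_prob] = 1

print(d3_per_entry.astype(int))
print(d3_per_column.astype(int))
```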

giofou

Thank you Andrew

Do we consider the bias to be dropped out as well, or do we always keep it?

IgorAherne

Why do you have to divide by 0.8 to conserve the mean?

diegomoya

1. Why do we have to eliminate different nodes in a layer for different training examples? Why does it have to be different?
2. What is the difference between test time and training time?
3. Instead of np.random.randn(a3.shape[0], a3.shape[1]), can we write np.random.randn(a3.shape)?
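
On question 3, a brief note (my own addition, not from the video): numpy's np.random.rand and np.random.randn take the dimensions as separate arguments, not as a tuple, so you would unpack the shape or use a tuple-based generator, e.g.:

```python
import numpy as np

a3 = np.zeros((4, 6))
keep_prob = 0.8

# The video's form: dimensions passed as separate arguments.
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

# Equivalent alternatives:
d3 = np.random.rand(*a3.shape) < keep_prob          # unpack the shape tuple
d3 = np.random.random_sample(a3.shape) < keep_prob  # this generator accepts a tuple
```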

shwethasubbu

I couldn't understand the reason for the division by keep_prob at the end. Can anyone help? Thank you.

abhramajumder

Great explanation. Need to watch it again.

sandipansarkar

What's the point of dividing a3 by 0.8? You said it gains back the 20% of values that were lost, but I'm not getting it: first we shut down 20% of the units, and then we switch them back on? Is that what you mean by regaining the lost 20%? If you're talking about the values other than 0, then yes, they get scaled up (by a factor of 1/0.8), but I don't know why we're doing this, i.e. how it will affect the results in z4.
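
A hypothetical toy example of how the scaling plays out in z4 (my own snippet, not from the video; W4 and b4 here are made-up weights):

```python
import numpy as np

np.random.seed(2)
keep_prob = 0.8
a3 = np.random.rand(3, 10000)               # toy activations of layer 3
W4 = np.random.randn(2, 3)                  # made-up layer-4 weights
b4 = np.zeros((2, 1))
d3 = np.random.rand(*a3.shape) < keep_prob  # ~20% of entries dropped

z4_full       = W4 @ a3 + b4                       # no dropout
z4_no_scaling = W4 @ (a3 * d3) + b4                # shrunk by ~keep_prob on average
z4_rescaled   = W4 @ ((a3 * d3) / keep_prob) + b4  # expectation matches z4_full

print(z4_full.mean(), z4_no_scaling.mean(), z4_rescaled.mean())
```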

smilebig

So when training a neural network we have dropout layer(s) to prevent over-fitting.

Once we've got all the weights figured out and are ready to use the network in a production environment, should we drop those layers?

They seem to only hurt our use of the network.

Geoters

Please help me to understand...
1. Is the difference between L2 and inverted dropout that L2 acts like an "average" over the w's, so it changes through the iterations, while inverted dropout always uses a fixed number? Because both just reduce the w's.
2. Does inverted dropout only simulate the shut-off across iterations by, say, averaging out the effects of the nodes?
3. If so, is it still possible for some x features to have a major effect on every node in a layer, meaning that every node in the layer "learns" the same thing?
4. If so, why don't we just randomly kill some w values for a node, meaning they will not be involved in that node's learning, while the other nodes in the same layer can still learn from them?

rekasil

What if every element in d3 is True (i.e. every random value is smaller than keep_prob)? I mean, it's random, so that case is possible.

VV-mzyz

Do we perform the inverted dropout calculation (a = a / keep_prob) on the input layer neurons as well?

doyugen

In training, the remaining nodes should share the effect of the missing nodes; that's the whole idea, isn't it? I don't find dividing by 0.8 a very appropriate way to do that.

uditarpit

Excellent explanation, though it wasn't clear what happens at test/dev time. Do we multiply all the weights by 0.2 at test/dev time for Dropout(0.2)?

rahuldey

5:40 What does it really mean that we divide a(3) by 0.8 so that the expected value of z(4) stays the same? Someone please help me.

nikqlcb

If we are going to remove nodes, then why did we add those nodes in the first place? Why don't we just start with a smaller network?

farooqkhan