Why Does Batch Norm Work? (C2W3L06)

Comments

Once the effect of the previous layer is covered, everything becomes clear. Brilliant explanation. Batch normalization works similarly to the way input standardization works.

maplex

I like this guy - he has a calm voice and patience.

holgip

Wow, this is the best explanation I've seen so far! I really like Andrew Ng; he has an amazing talent for explaining even the most complicated things in a simple way. When he has to use mathematics to explain a concept, he does it so brilliantly that the concept becomes even simpler to understand, not more complicated as with some tutors.

digitalghosts

Great work, you have a natural talent for making difficult topics easy to learn.

aamira

Beautifully explained, classic Andrew Ng

AnuragHalderEcon

This guy makes it look so easy... one has to love him

randomforrest

The "covariate shift" explanation has been falsified as an explanation for why BatchNorm works. If you are interested, check out the paper "How does batch normalization help optimization?"

siarez

When X changes (even though f(X) = y stays the same), you can't expect the same model to perform well. For example, if X1 is pictures of black cats only (y = 1 for cats, y = 0 for non-cats) and X2 is pictures of cats of all colors, the model won't do well on X2. This is covariate shift.
This covariate shift is tackled during training through input standardization and batch normalization.
Batch normalization keeps the mean and variance of the distribution of the hidden unit values in the previous layer fixed, so it doesn't let those values shift around much.
Because the values can't change too much, the coupling between the parameters of different layers is reduced, the layers become more independent, and hence learning speeds up.
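
As a minimal NumPy sketch of the step described above (the function name, shapes, and epsilon value are illustrative assumptions, not from the video): batch norm normalizes each hidden unit over the mini-batch and then rescales with the learned gamma and beta.

import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    # z: pre-activations for one layer, shape (batch_size, n_units)
    mu = z.mean(axis=0)                      # one mean per hidden unit, over the mini-batch
    var = z.var(axis=0)                      # one variance per hidden unit
    z_norm = (z - mu) / np.sqrt(var + eps)   # each unit now has mean 0, variance 1
    return gamma * z_norm + beta             # learned gamma/beta set the final mean and spread

# Example: 64 examples, 5 hidden units, with an arbitrary shift/scale from the "previous layer"
z = np.random.randn(64, 5) * 3.0 + 2.0
gamma, beta = np.ones(5), np.zeros(5)
z_tilde = batch_norm_forward(z, gamma, beta)
print(z_tilde.mean(axis=0).round(3), z_tilde.std(axis=0).round(3))  # ~0s and ~1s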

epistemophilicmetalhead

"Don't use it for regularization" - so should I just use it all the time as general good practice, or are there times when I shouldn't use it?

bgenchel

The original paper that introduced the batch normalization technique (by Sergey Ioffe and Christian Szegedy) says that removing dropout speeds up training without increasing overfitting, and there are also recommendations not to use dropout together with batch normalization, since dropout adds noise to the statistics calculations (mean and variance)... so should we really use dropout with batch norm?

ping_gunter

Since gamma and beta are parameters that will be updated, how can the mean and variance remain unchanged?

yuchenzhao

Thanks for sharing the great video, explained in a simple and clear manner.

MuhammadIrshadAli

Should a neural network always have batch normalization?

YuCai-vk

Good for understanding, but a few more numerical calculations would show the effect better.

NeerajGarg

6:00 - I have a question: don't the values of beta[2] and gamma[2] also change during training? Then the distribution of the hidden unit values z[2] also keeps changing, so isn't the covariate shift problem still there?

haoming

7:55 - why don't we use the mean and variance of the entire training set instead of just those of a mini-batch? Wouldn't this reduce the noise further (similar to using a larger mini-batch size)? Unless we want that noise for its regularizing effect?

s

Keras people need to watch this video!

XX-vujo

Doesn't an activation function such as the sigmoid in each node already normalize the outputs of the neurons for the most part?

pemfiri

I'm confused. Is this normalizing all the neurons within each layer, or normalizing, for each neuron, all of its activations computed over a mini-batch?
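
For what it's worth, a small NumPy illustration of the two readings in this question (shapes invented for the example); batch norm takes its statistics per hidden unit, over the examples of the mini-batch:

import numpy as np

# Pre-activations for one layer: 32 examples in the mini-batch, 4 hidden units.
z = np.random.randn(32, 4)

# What batch norm does: one mean/variance per hidden unit,
# computed over the 32 examples of the mini-batch.
mu_per_unit = z.mean(axis=0)     # shape (4,)

# What it does not do: one mean per example over the 4 units of the layer
# (that would be closer to layer normalization).
mu_per_example = z.mean(axis=1)  # shape (32,)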

anynamecanbeuse

You said that batch norm limits the change in the values seen by the 3rd layer (or, more generally, any deeper layer) caused by the parameters of earlier layers. However, when you perform gradient descent, the new parameters introduced by batch norm (gamma and beta) are also being learned and keep changing with each update. So the mean and variance of the earlier layers' values also change and are not fixed at 0 and 1 (or, more generally, whatever you set them to). I therefore can't build an intuition for how fixing the mean and variance of the earlier layers' values prevents covariate shift. Can anyone help me out with this?
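
A rough numerical sketch of the point being asked about (the weight matrices and the gamma/beta values below are invented for illustration): however much the earlier-layer weights drift, the normalized values keep mean 0 and variance 1 within the mini-batch, and the learned gamma and beta alone set the mean and spread that the next layer actually sees.

import numpy as np

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(128, 16))           # activations coming from an earlier layer

for scale in [1.0, 5.0, 50.0]:                # pretend the earlier weights keep changing
    W = rng.normal(size=(16, 8)) * scale
    z = a_prev @ W                            # raw pre-activations: a wildly different spread each time
    mu, var = z.mean(axis=0), z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + 1e-8)   # always ~mean 0, variance 1 per unit
    gamma, beta = 2.0, 3.0                    # whatever values have been learned so far
    z_tilde = gamma * z_norm + beta           # mean ~3, std ~2, determined by gamma/beta alone
    print(scale, round(z.std(), 1), round(z_tilde.mean(), 2), round(z_tilde.std(), 2))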

banipreetsinghraheja