An Old Problem - Ep. 5 (Deep Learning SIMPLIFIED)

If deep neural networks are so powerful, why aren’t they used more often? The reason is that they are very difficult to train due to an issue known as the vanishing gradient.

Deep Learning TV on

To train a neural network over a large set of labelled data, you must continuously compute the difference between the network’s predicted output and the actual output. This difference is called the cost, and the process for training a net is known as backpropagation, or backprop. During backprop, weights and biases are tweaked slightly until the lowest possible cost is achieved. An important aspect of this process is the gradient, which is a measure of how much the cost changes with respect to a change in a weight or bias value.
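
To make the idea concrete, here is a minimal Python sketch (not code from the video) of a single weight being tweaked by its gradient; the squared-error cost, the 0.1 learning rate, and the toy input and target values are all assumptions for illustration.

def cost(w, x, y):
    """Squared difference between the prediction w * x and the target y."""
    return (w * x - y) ** 2

def gradient(w, x, y):
    """How much the cost changes with respect to the weight w."""
    return 2 * (w * x - y) * x   # derivative of the squared-error cost

w = 0.5                  # arbitrary starting weight
learning_rate = 0.1      # assumed step size
for step in range(20):
    w -= learning_rate * gradient(w, x=1.0, y=2.0)   # tweak toward lower cost

print(round(w, 3))       # approaches 2.0, where the cost is at its lowest

Each pass of the loop plays the role of one weight tweak during training: compute the gradient, then step the weight slightly in the direction that lowers the cost.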

Backprop suffers from a fundamental problem known as the vanishing gradient. During training, the gradient decreases in value back through the net. Because higher gradient values lead to faster training, the layers closest to the input layer take the longest to train. Unfortunately, these initial layers are responsible for detecting the simple patterns in the data, while the later layers combine the simple patterns into complex patterns. Without properly detected simple patterns, a deep net lacks the building blocks it needs to handle the complexity. This problem is the equivalent of trying to build a house without the proper foundation.
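
As a rough illustration (again, not the video's code), the NumPy sketch below builds a small, randomly initialized net with sigmoid activations and prints the size of the error signal at each layer during the backward pass; the 10-layer, 8-unit architecture and the weight scale are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 10, 8                               # assumed toy architecture
weights = [rng.normal(0, 0.5, (width, width)) for _ in range(n_layers)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, keeping each layer's activation for the backward pass.
a = rng.normal(0, 1, width)
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start from a dummy error signal at the output and
# carry it back layer by layer with the chain rule.
delta = np.ones(width)
for i in reversed(range(n_layers)):
    local_grad = activations[i + 1] * (1 - activations[i + 1])   # sigmoid derivative
    delta = weights[i].T @ (delta * local_grad)
    print(f"layer {i:2d}  gradient size ~ {np.linalg.norm(delta):.2e}")

# The printed values typically shrink as i approaches the input layer,
# which is the vanishing gradient in miniature.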

Have you ever had this difficulty while using backpropagation? Please comment and let me know your thoughts.

So what causes the gradient to decay back through the net? Backprop, as the name suggests, requires the gradient to be calculated first at the output layer, then backwards across the net to the first hidden layer. Each time the gradient is calculated, the net must compute the product of all the previous gradients up to that point. Since all the gradients are fractions between 0 and 1 – and the product of fractions in this range results in a smaller fraction – the gradient continues to shrink.

For example, if the first two gradients are one fourth and one third, then the next gradient would be one fourth of one third, which is one twelfth. The following gradient would be one twelfth of one fourth, which is one forty-eighth, and so on. Since the layers near the input layer receive the smallest gradients, the net takes a very long time to train, and as a result the overall accuracy suffers.
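
The same running product can be checked in a few lines of Python, using the exact fractions from the example above.

from fractions import Fraction

factors = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 4)]   # per-layer gradients from the example
running = Fraction(1)
for f in factors:
    running *= f
    print(running)   # prints 1/4, then 1/12, then 1/48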

Credits
Nickey Pickorita (YouTube art) -
Isabel Descutner (Voice) -
Dan Partynski (Copy Editing) -
Jagannath Rajagopal (Creator, Producer and Director) -

Comments

This clip explains why deep neural nets are so hard to train. If you've used backprop before, you'll relate to this. Enjoy :-)!

DeepLearningTV

Thanks for this channel!! I really appreciate your simplified approach to grasping the core concepts.
Looking forward to the next videos.
Keep up the good work!!!

albertoferreiro

Wow I really sounded like I might cry before a year of voice and speech 😂

IsabelsChannel

Yo, breathe. You sound like you did a long run before every single sentence :D

Tarnov

Hi. I just wanted you to know that I love you. That is all. Goodbye :)

Jabrils

at 3:05 should it say "it starts with the right ?"

joesgarage

why are the gradient values between 0 and 1?

gitanjalinair

How are the author names of those three papers spelled please? I promise it's for research (and not for masterbaiting)

fosheimdet

Wow, this video is very useful. I hope to watch more of your videos!

hoangtrunghieu

Thank you so much. I'm currently taking an intro to deep learning, covering the basics of supervised and unsupervised networks. The instructor explaining this kept rambling on and confusing me. This is very helpful!

Foogly

Hey, yea, let me know what you find. Backprop with ReLU is one solution for beating the vanishing gradient. Check it out and let us know what you find :-)

DeepLearningTV

The learning rate is also worth mentioning. The deeper the net, the more it's prone to jumping over the cost minima. I remember how in the 80's people invented all sorts of tricks, such as adding noise to the weights, to remedy this problem. You can find a sweet spot where the convergence improves, but... that was the 80's, so we all know how it turned out back then.

cykkm

This channel is AMAZING! I love it when something's so neatly explained that even my grandma can understand. Great job fellas! :D

ajayshaan

Hi. Is forward propagation characterized as a training method for the neural network, or is it just the way a neural network classifies the input data?

ChingMavis

Nice lecture and nice voice. What tool did you use for this lecture?

nguyenxuanhung

I'm a bit confused. Gradient values don't necessarily have to be between 0 and 1, right? One of the arguments made was that the gradients at earlier layers become smaller and smaller because of the compounding multiplication of values between 0 and 1. Can anyone help me understand?

kevintan

Yes, that is the problem I was stuck on for some time. Since I learned the magic of backpropagation, I was under the impression that it was the solution and that we only needed more computing power. Later I heard it does not work well, and I finally learned what the problem is by going through an online course on Coursera. Unfortunately, that course does not yet offer a solution.

I watched lectures from Geoff Hinton and Oxford and was not able to grasp a solution.

Finally, this video matches my current state, and the next video gives me an idea of how it can be solved. I still haven't tried it myself, but at least I got an idea of the solution, and it feels right.

+1

Thanks again

kkochubey

When you're talking numbers and doing calculations orally, it's better if you can write them out graphically as you go along.

EranM

Oh man! I'm facing this problem right now. It has taken more than 12 hours and the cost is still around 0.48.

farisalasmary

From another point of view, I also think that training deep ANNs with backprop leads to something we already know as the curse of dimensionality. In a way, you have explained it very well.

sillfsxa