33. Neural Nets and the Learning Function

MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018
Instructor: Gilbert Strang

This lecture focuses on the construction of the learning function F, which is optimized by stochastic gradient descent and applied to the training data to minimize the loss. Professor Strang also begins his review of distance matrices.

License: Creative Commons BY-NC-SA
Comments

He is certainly right that even the best would fare well to add anything. To have the pleasure of his lectures is more than gold.

brendawilliams

Just once to see one of his lectures and not be amazed. He is simply awesome!

mihalisgolias

Again Prof. Gilbert Strang! Thank you very much!

tchappyha

I can’t say I will have to review and review from beginning to end many times. You are most clear in explanations.

brendawilliams

I like your intelligent, advanced lectures. They are very challenging. Strang is the smartest. Thank you.

bettymontoya

Many thanks to MIT for sharing such knowledge.

gtsmeg

Professor Strang, thank you for an awesome lecture on Distance Matrices, the structure of Neural Nets, and the Learning Function. All these mathematical concepts improve my understanding of Machine Learning.

georgesadler

Hats off to Professor Gil Strang and MIT for these amazing classes.
One question: I may be missing something, but in the last part, where he says that you can obtain the matrix G from D with this formula, I think that is not correct. You don't have the squared norms of the vectors, and Professor Strang assumes that you have them on the diagonal of D, but the diagonal of D is all zeros. Am I right, or am I misunderstanding something?
Again thank you very much!

gonzalopolo

For BPP, deep learning and for the general structure of neural networks the following comments may be useful.

To begin with, note that instead of partial derivatives one can work with derivatives as the linear transformations they really are.

It is also possible to look at the networks in a more structured manner. The basic ideas of BPP can then be applied in much more general cases. Several steps are involved.

1.- More general processing units.
Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong, beyond Euclidean spaces, to any Hilbert space. Derivatives are linear transformations and the derivative of a neural processing unit is the direct sum of its partial derivatives with respect to the inputs and with respect to the weights; this is a linear transformation expressed as the sum of its restrictions to a pair of complementary subspaces.

2.- More general layers (any number of units).
Single unit layers can create a bottleneck that renders the whole network useless. Putting together several units in a single layer is equivalent to taking their product (as functions, in the sense of set theory). The layers are functions of the inputs and of the weights of the totality of the units. The derivative of a layer is then the product of the derivatives of the units; this is a product of linear transformations.

3.- Networks with any number of layers.
A network is the composition (as functions, and in the set theoretical sense) of its layers. By the chain rule the derivative of the network is the composition of the derivatives of the layers; this is a composition of linear transformations.

4.- Quadratic error of a function.
...
——-
Since this comment is becoming too long I will stop here. The point is that a very general viewpoint clarifies many aspects of BPP.

If you are interested in the full story and have some familiarity with Hilbert spaces please google for papers dealing with backpropagation in Hilbert spaces. A related article with matrix formulas for backpropagation on semilinear networks is also available.

For a glimpse into a completely new deep learning algorithm which is orders of magnitude more efficient, controllable and faster than BPP, search on this platform for a video about deep learning without backpropagation; in its description there are links to demo software.

The new algorithm is based on the following very general and powerful result (google it): Polyhedrons and perceptrons are functionally equivalent.

For the elementary conceptual basis of NNs see the article Neural Network Formalism.

Daniel Crespin

dcrespin
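A small numerical illustration of point 3 in the comment above (the derivative of a network is the composition of the derivatives of its layers). This is only a sketch under simplifying assumptions that are not in the comment: the layers are tanh(Wx) maps on small Euclidean spaces, the weights are held fixed, and only the derivative with respect to the input is formed. All function names are hypothetical, and plain Python is used to stay dependency-free.

```python
import math
import random

def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def layer(W, x):
    """One layer (assumed form): componentwise tanh of W x."""
    return [math.tanh(z) for z in matvec(W, x)]

def layer_jacobian(W, x):
    """Derivative of the layer with respect to its input: diag(1 - tanh^2(Wx)) W."""
    z = matvec(W, x)
    return [[(1.0 - math.tanh(zi) ** 2) * w for w in row] for zi, row in zip(z, W)]

def network(weights, x):
    """The network is the composition of its layers."""
    for W in weights:
        x = layer(W, x)
    return x

def network_jacobian(weights, x):
    """Chain rule: multiply the layer Jacobians, each evaluated at that layer's input."""
    n = len(x)
    J = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for W in weights:
        J = matmul(layer_jacobian(W, x), J)
        x = layer(W, x)
    return J

def finite_difference_jacobian(weights, x, h=1e-6):
    """Central-difference approximation, used only as an independent check."""
    m = len(network(weights, x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = network(weights, xp), network(weights, xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

random.seed(0)
sizes = [3, 4, 2]  # input dimension 3, hidden layer 4, output 2
weights = [
    [[random.uniform(-1, 1) for _ in range(sizes[k])] for _ in range(sizes[k + 1])]
    for k in range(len(sizes) - 1)
]
x0 = [0.3, -0.5, 0.8]

J_chain = network_jacobian(weights, x0)
J_fd = finite_difference_jacobian(weights, x0)
max_err = max(abs(a - b) for ra, rb in zip(J_chain, J_fd) for a, b in zip(ra, rb))
print("largest gap between chain rule and finite differences:", max_err)
```

The derivative with respect to the weights, which backpropagation also needs, follows the same pattern but is left out here to keep the sketch short.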

@47:29 But where do we get the d vector in the formula for the G = X^T * X matrix?

allyourcode
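One common way to answer the question above, offered as a sketch rather than as a confirmed account of what the lecture does at that timestamp: since distances do not change under translation, one may assume the first point sits at the origin. Then the squared norms are already in D, because ||xᵢ||² = |xᵢ - x₁|² is the first row of D, and the entries of G = X^T X follow from xᵢ·xⱼ = (dᵢ + dⱼ - Dᵢⱼ)/2. The function names below are hypothetical.

```python
def squared_distance_matrix(points):
    """D[i][j] = squared Euclidean distance between points i and j."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def gram_from_distances(D):
    """Recover G[i][j] = x_i . x_j from squared distances.

    Assumption: the first point is the origin, so d[i] = ||x_i||^2 = D[0][i].
    If it is not, the result is the Gram matrix of the points translated
    so that the first point becomes the origin.
    """
    d = D[0]
    n = len(D)
    return [[0.5 * (d[i] + d[j] - D[i][j]) for j in range(n)] for i in range(n)]

# Example points in the plane; the first one is placed at the origin.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
D = squared_distance_matrix(points)
G = gram_from_distances(D)

# Direct dot products, for comparison with the matrix recovered from D alone.
dots = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]
gap = max(abs(G[i][j] - dots[i][j]) for i in range(len(points)) for j in range(len(points)))
print("largest gap between recovered and direct Gram entries:", gap)
```

The diagonal of D is indeed all zeros; under this assumption the squared norms come from its first row (or column), not from its diagonal.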

I don't think I understood what the D matrix is in the distance problem. I tried deriving a similar term, but I'm not sure it's the same or correct:

Consider dᵢⱼ = |xᵢ - xⱼ|² = xᵢ² + xⱼ² - 2xᵢxⱼ. We want an expression for X^T X, whose (i, j) entry is xᵢxⱼ.
We can do rigid translation, so we can limit ourselves to xs which are centered, i.e. the average of the positions is 0: <xⱼ> = 0.
Now, if we average dᵢⱼ over one of the indices, let's pick j, we get
<dᵢⱼ>ⱼ = xᵢ² + <xⱼ²> - 2 xᵢ <xⱼ> = xᵢ² + <xⱼ²> - 0
We can denote the second moment <xⱼ²>=σ², so <dᵢⱼ>ⱼ = xᵢ² + σ².
We can average dᵢⱼ again, this time over the i index, and we get
< <dᵢⱼ>ⱼ >ᵢ = < xᵢ² + σ²>ᵢ = 2σ²
We can use this to rewrite the xᵢ² terms using averages over dᵢⱼ
xᵢ² + xⱼ² = <dᵢⱼ>ᵢ + <dᵢⱼ>ⱼ - <dᵢⱼ>ᵢⱼ = xᵢ² + σ² + xⱼ² + σ² - 2σ²
And get
2xᵢxⱼ = <dᵢⱼ>ᵢ + <dᵢⱼ>ⱼ - <dᵢⱼ>ᵢⱼ - dᵢⱼ
I think these averages over i and j correspond to some of the weird column tricks but I'm not sure.

eliavrad
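The averaging identity in the comment above can be checked numerically. The sketch below assumes D holds squared distances and compares the result against points that have been centered by subtracting their centroid; the function names and the sample points are hypothetical.

```python
def squared_distance_matrix(points):
    """D[i][j] = squared Euclidean distance between points i and j."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def mean(values):
    return sum(values) / len(values)

def gram_by_averaging(D):
    """G[i][j] = (<d>_j at row i + <d>_i at column j - overall mean - D[i][j]) / 2."""
    n = len(D)
    row_mean = [mean(D[i]) for i in range(n)]                          # average over j
    col_mean = [mean([D[i][j] for i in range(n)]) for j in range(n)]   # average over i
    total_mean = mean(row_mean)                                        # average over i and j
    return [
        [0.5 * (row_mean[i] + col_mean[j] - total_mean - D[i][j]) for j in range(n)]
        for i in range(n)
    ]

# Arbitrary sample points; they are not centered to begin with.
points = [(1.0, 2.0), (4.0, 0.0), (-2.0, 3.0), (5.0, 5.0)]
D = squared_distance_matrix(points)
G = gram_by_averaging(D)

# Center the points explicitly and form their dot products for comparison.
dim = len(points[0])
centroid = [mean([p[k] for p in points]) for k in range(dim)]
centered = [[p[k] - centroid[k] for k in range(dim)] for p in points]
dots = [[sum(a * b for a, b in zip(p, q)) for q in centered] for p in centered]

gap = max(abs(G[i][j] - dots[i][j]) for i in range(len(points)) for j in range(len(points)))
print("largest gap between averaged formula and centered dot products:", gap)
```

Because D depends only on differences between points, the formula returns the Gram matrix of the centered configuration even though the sample points themselves are not centered.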