Lesson 5: Deep Learning 2019 - Back propagation; Accelerated SGD; Neural net from scratch

In lesson 5 we put all the pieces of training together to understand exactly what is going on when we talk about *back propagation*. We'll use this knowledge to create and train a simple neural network from scratch.

We'll also see how we can look inside the weights of an embedding layer, to find out what our model has learned about our categorical variables. This will let us get some insights into which movies we should probably avoid at all costs...

Although embeddings are most widely known in the context of word embeddings for NLP, they are at least as important for categorical variables in general, such as for tabular data or collaborative filtering. They can even be used with non-neural models with great success.
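At its core, an embedding layer for a categorical variable is just a lookup into a trainable weight matrix. A minimal sketch in plain NumPy (the sizes are hypothetical, e.g. a day-of-week variable mapped to 3-dimensional vectors; in a real model the matrix would be learned by gradient descent):

```python
import numpy as np

n_categories, emb_dim = 7, 3  # e.g. day-of-week -> 3-dim vector
rng = np.random.default_rng(0)
emb_weights = rng.normal(size=(n_categories, emb_dim))  # trainable in practice

def embed(ids):
    """Look up the embedding vector for each categorical id."""
    return emb_weights[ids]

batch = np.array([0, 3, 3, 6])
vectors = embed(batch)
print(vectors.shape)  # (4, 3): one 3-dim vector per input id
```

After training, inspecting the rows of `emb_weights` is exactly the "looking inside the weights of an embedding layer" described above.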
Comments

Recap (ResNet Network Architecture): 3:35
Fine-Tuning:
Overview: 8:30
Per-Layer Feature Visualization: 11:40
Freezing Early Layers: 12:50
Discriminative Learning Rates: 14:25
Fine-Tuning for Collaborative Filtering:
Model Structure:
Affine Functions: 19:44
Overview: 21:40
One-Hot Encoding of IDs, Embedding Vectors: 23:36
Latent Features: 32:40
Use of Bias Term: 33:08
Questions:
"When we load a pre-trained model, should we explore the activation grids to see what it's good at?": 35:57
"Can we have an explanation of what the first argument in fit_one_cycle actually represents?": 36:32
"What is an affine function?" (And, why you need nonlinearities): 37:20
Loading the MovieLens 100k Dataset: 38:29
Tricks for Training (Scaled Sigmoid, LR Finder): 43:20
Interpreting Trained Model:
Biases: 48:00
Weights (With PCA): 54:25
How collab_learner Works: 1:00:00
Interpreting Embeddings (Neokami Paper): 1:07:00
Optimization Improvements:
Weight Decay: 1:12:10
PyTorch Code for Weight Decay On MNIST: 1:24:00
Adam: 1:43:00
Understanding the Tabular Model:
Overview: 2:03:00
Cross-Entropy Loss: 2:04:05
SoftMax Activation: 2:07:20
PyTorch Code for Tabular Model: 2:11:00

ollinboerbohan

25:21, after spending far too much time being a beginner at matrix multiplication, I'd like to clarify for anyone else who's confused about why this works:

It will only produce the output shown if the one-hot-encoded matrix is multiplied by the weight matrix in that order. See it as One-Hot-Matrix (dot) Weight-Matrix.

It only works if the one-hot matrix is to the left of the weight matrix (not as laid out in the Excel document, where the one-hot matrix is to the right). A 15x5 (dot) 209x15 matrix multiplication doesn't work (which makes me feel sort of stupid for even trying to figure it out, in hindsight). Only a 209x15 (dot) 15x5 matrix multiplication will give this result, because matrix multiplication is not commutative.
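A quick NumPy check of the point above, using the same sizes from the comment (209 samples, 15 categories, 5 latent factors; the random values are just for illustration). Multiplying the one-hot matrix on the left by the weight matrix is identical to a plain row lookup, while the reversed order doesn't even have compatible shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
one_hot_ids = rng.integers(0, 15, size=209)  # 209 samples, 15 categories
one_hot = np.eye(15)[one_hot_ids]            # shape (209, 15)
W = rng.normal(size=(15, 5))                 # weight (embedding) matrix

out = one_hot @ W                            # (209, 15) @ (15, 5) -> (209, 5)
assert np.allclose(out, W[one_hot_ids])      # same result as a direct row lookup
# W @ one_hot would be (15, 5) @ (209, 15): inner dimensions don't match.
```

This is why frameworks implement embedding layers as an indexed lookup rather than an actual matrix multiply: the result is the same, the lookup is far cheaper.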

Lorkin

Discriminative learning rates in fast.ai (16:26) -- writing "slice(1e-5, 1e-3)" means final layers get LR 1e-3, first layers get 1e-5, and middle layers are logarithmically interpolated.
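A rough sketch of that logarithmic interpolation (plain NumPy; the split into exactly three layer groups is an assumption for illustration — fastai chooses the grouping per architecture):

```python
import numpy as np

def discriminative_lrs(lo, hi, n_groups):
    """Per-group learning rates, log-spaced from lo to hi,
    mimicking how slice(lo, hi) is spread across layer groups."""
    return np.geomspace(lo, hi, n_groups)

# Three layer groups: earliest layers train slowest, the head fastest.
lrs = discriminative_lrs(1e-5, 1e-3, 3)
# lowest -> highest: 1e-5, 1e-4, 1e-3
```

The middle group lands at 1e-4, the geometric mean of the endpoints, which is what "logarithmically interpolated" means here.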

kevalan

Could you possibly add timestamps to your videos in case people want to re-watch a select topic and not have to skip around looking for it?

rodeezy

Best deep learning MOOC I have ever found. Love it. The best lesson I have learned.

kaafkgehag

1:14:01 This “burden” of the statisticians may be responsible for the many smart people who kept saying neural networks don't work during the AI winter. I think in M. Nielsen's famous internet book on neural networks, he quotes a physicist from the 60s saying “give me 5 parameters and I can fit an elephant”, or something to that effect. I've also read quite a few books from the computational finance community saying NNs are ridiculous: millions of parameters and an overfitting nightmare. I think credit goes to the researchers who finally showed us that this actually works.

kawingchan

I think the momentum explanation at 1:49:57 is incorrect. In my understanding, momentum is about "remembering" your direction in multiple dimensions, not about increasing the step size in a single dimension.
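The "remembering direction across dimensions" view can be seen in a tiny simulation (plain Python/NumPy; the lr and beta values are illustrative, not from the lesson). With momentum, a dimension whose gradient keeps flipping sign barely moves, while a dimension with a consistent gradient builds up speed:

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.1, beta=0.9):
    """One SGD-with-momentum update: v is an exponentially weighted
    accumulation of past gradients, so consistent directions build up
    while oscillating directions largely cancel out."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

w = np.zeros(2)
v = np.zeros(2)
for step in range(10):
    # Oscillating gradient in dim 0, consistent gradient in dim 1:
    grad = np.array([(-1.0) ** step, 1.0])
    w, v = sgd_momentum_step(w, grad, v)
# dim 0 has barely moved; dim 1 has travelled much further
```

So momentum is not simply "a bigger step": the step size per dimension depends on how consistent that dimension's gradient history has been.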

kevalan

Isn't the derivative at 1:38:35 wrong? Shouldn't it be 3dw^2?

EDIT: Never mind, I was confused because he was using wd to mean weight decay, not two separate variables w and d. Jeremy's answer is correct.
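A quick numerical sanity check of that resolution (plain Python; the values of wd and w are arbitrary): treating wd as one constant, the derivative of wd·w² with respect to w is 2·wd·w, not 3·d·w².

```python
# Central-difference check that d/dw (wd * w**2) == 2 * wd * w
wd, w, eps = 0.01, 3.0, 1e-6   # wd is a single constant (weight decay)
f = lambda w: wd * w ** 2
numeric = (f(w + eps) - f(w - eps)) / (2 * eps)
analytic = 2 * wd * w
# numeric and analytic agree to within floating-point noise
```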

occasionalvideos

SGD uses a learning rate of 0.0001, RMSProp 0.002, and Adam 1. So the optimizer comparison you show us isn't really valid (or try the same experiment with the same learning rate).

DavidSmith-zgdy

1:11:07 The scatter plot looks interesting: there seems to be a linear boundary below which no instances occur. I wonder if there's any significance or explanation. One good question may be whether the real-world distance is the straight-line distance from A to B, or the distance as measured by road.

kawingchan

1:10:04 About entity-embedding visualization: the other popular method is t-SNE. I'm guessing the authors may have used that.

kawingchan

Because Andrew is at Stanford he has to use Greek letters, ok?

ramahujan

Ok, I have got only one question: why was there smoke outside?

whateverhonestly

Entity Embeddings of Categorical Variables and possible interpretation: 1:07:09

EtienneCharlier-Biz

Does anyone know what drawing tool the host is using?

dapingzheng

Why use an exponentially weighted average for the loss?
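For context: the per-batch loss is very noisy, so an exponentially weighted moving average makes the reported curve readable, with recent batches dominating and old ones decaying. A minimal sketch in pure Python (beta=0.98 is an illustrative choice; the bias correction is the standard one used with EWMAs):

```python
def ewma(values, beta=0.98):
    """Exponentially weighted moving average with bias correction:
    smooths a noisy sequence while tracking its recent level."""
    avg, out = 0.0, []
    for i, v in enumerate(values):
        avg = beta * avg + (1 - beta) * v
        out.append(avg / (1 - beta ** (i + 1)))  # correct early-step bias
    return out

noisy = [1.0, 0.2, 1.1, 0.3, 0.9]
smoothed = ewma(noisy)
# smoothed varies far less than the raw values while staying in their range
```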

DavidSmith-zgdy

1:12:50, Andrew is a Stanforder, he has to use Greek letters 😂

AIPlayerrrr

@Jeremy Howard, I am not sure your embeddings for days and months make much sense. If you knew nothing about days or months, how would one get to the clear path you mention? It just doesn't seem like tracing it out the way you did makes much sense.

dbzkidkev

Don't we call the embedding the learned parameters themselves, not the one-hot encodings?

vladimirgetselevich

Is there a way to learn and then interpret embeddings from image models using distances? E.g., how far is a corn-plant image from a potato-plant image vs. a rice-plant image?
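One common approach is to take each image's feature vector (e.g. the activations of a CNN's penultimate layer) and compare them with cosine distance. A sketch with made-up 3-dimensional vectors standing in for real features:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical feature vectors; real ones would come from a trained model.
corn   = np.array([0.9, 0.1, 0.0])
potato = np.array([0.8, 0.3, 0.1])
rice   = np.array([0.1, 0.0, 0.9])

# With these made-up vectors, corn sits closer to potato than to rice.
d_potato = cosine_distance(corn, potato)
d_rice = cosine_distance(corn, rice)
```

Whether such distances are semantically meaningful depends on what the model was trained on; embeddings from a classifier tend to cluster by the labels it learned.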

sohaibarif