Introduction to Deep Learning - 10. Convolutional Neural Networks Part 2 (Summer 2020)


Introduction to Deep Learning (I2DL) - Lecture 10
TUM Summer Semester 2020
Comments

0:00:00 Recap
0:12:17 CNNs part 2
0:12:32 LeNet
0:20:46 AlexNet
0:27:56 VGGNet
0:32:18 Skip Connections
0:33:20 Residual Block
0:37:35 ResNet Block
0:41:40 Why do ResNets work?
0:45:37 1x1 Convolution
0:50:33 Inception Layer
1:01:53 GoogLeNet
1:03:28 Depthwise Separable Convolutions
1:10:18 Fully Convolutional Network
1:14:23 Upsampling
1:18:24 U-Net

idllecture

Is the max-pooling in the Inception network 2x2 with padding?

eshafeeqee

18:00 I guess after the last pooling, there was no convolution

muradtalibov

A small question about the Inception layer. @1:01:30 The output size of the 3x3 convolutions with 128 filters is given as 28x28x128. Maybe I am missing something, but how can the first and second dimensions stay the same when we apply a 3x3 filter? Is there some padding involved so that those dimensions are preserved? According to our formula we would get floor((N + 2P - F)/S + 1) = floor((28 + 0 - 3)/1 + 1) = 26, i.e. 26x26x128.
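For reference, a minimal sketch (assuming PyTorch; the 256 input channels are just a placeholder, not the lecture's exact numbers) showing that "same" padding, i.e. P = 1 for a 3x3 kernel at stride 1, is what keeps the 28x28 spatial size:

```python
import torch
import torch.nn as nn

# Placeholder input: batch of 1, 256 channels (assumed), 28x28 spatial size.
x = torch.randn(1, 256, 28, 28)

# 3x3 conv, 128 filters. padding=1 gives floor((28 + 2*1 - 3)/1 + 1) = 28,
# while padding=0 gives floor((28 + 0 - 3)/1 + 1) = 26.
conv_same = nn.Conv2d(256, 128, kernel_size=3, stride=1, padding=1)
conv_valid = nn.Conv2d(256, 128, kernel_size=3, stride=1, padding=0)

print(conv_same(x).shape)   # torch.Size([1, 128, 28, 28])
print(conv_valid(x).shape)  # torch.Size([1, 128, 26, 26])
```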

fabinhan

1:11:15 "1x1 convs are exactly the same as fully connected layers"

This is only true if the input spatial dimensions are 1x1.

A fully connected layer means that each neuron in the current layer is connected to every neuron in the previous layer. A 1x1 conv does not guarantee this; it is only guaranteed by a conv layer whose kernel has the same spatial dimensions as the input (i.e. an MxN kernel for an MxN input). So a fully connected convolution layer is one whose kernel width and height equal the input width and height.
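A small sketch of this point (assuming PyTorch; all shapes are made up): with tied weights, a 1x1 conv matches an nn.Linear exactly on a 1x1 spatial input, while on a larger input it is the same linear map applied independently at every pixel.

```python
import torch
import torch.nn as nn

C_in, C_out = 64, 10
conv1x1 = nn.Conv2d(C_in, C_out, kernel_size=1)
fc = nn.Linear(C_in, C_out)

# Tie the weights so both layers compute the same linear map.
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(C_out, C_in))
    fc.bias.copy_(conv1x1.bias)

# 1x1 spatial input: the 1x1 conv really is a fully connected layer.
x = torch.randn(1, C_in, 1, 1)
print(torch.allclose(conv1x1(x).flatten(), fc(x.flatten(1)).flatten(), atol=1e-6))

# Larger spatial input: the same linear map is applied per pixel, so it is not
# one fully connected layer over the whole 7x7xC_in input.
x = torch.randn(1, C_in, 7, 7)
per_pixel = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(conv1x1(x), per_pixel, atol=1e-5))
```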

I think the confusion comes from the example of U-Net-like fully convolutional networks for image segmentation. These networks usually produce outputs with the same spatial dimensions as the input image, and the depth (the number of channels, or feature maps) of these outputs equals the number of classes: one feature map per class. Here the output layer is usually a 1x1 convolution with C output channels (C being the number of classes). So each unit in each output feature map (i.e. each class probability at each pixel location) is obtained as a weighted sum of the corresponding units across all input feature maps. Similarly, in a fully connected output layer, each class probability is computed as a weighted sum of all the outputs of the previous layer.
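As an illustration (PyTorch sketch; the channel count and number of classes are made-up values), such an output head is just a 1x1 conv from the decoder feature maps to one score map per class:

```python
import torch
import torch.nn as nn

num_classes = 21                        # assumed number of segmentation classes
feat = torch.randn(1, 64, 128, 128)     # hypothetical decoder feature maps

# Output head: 1x1 conv mixes the 64 feature maps into one score map per class.
head = nn.Conv2d(64, num_classes, kernel_size=1)
logits = head(feat)                     # (1, 21, 128, 128): per-pixel class scores
probs = logits.softmax(dim=1)           # per-pixel class probabilities
print(probs.shape)                      # torch.Size([1, 21, 128, 128])
```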

It is also worth mentioning that a fully convolutional network with a classifier head can be obtained by replacing the final fully connected layer of size D with a conv layer with D filters, followed by a pooling operation that reduces the spatial dimensions to 1x1, giving a (1, 1, D) output (width and height 1, D channels).
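A sketch of that idea (again PyTorch, with made-up shapes): the fully connected classifier is replaced by a conv layer with D filters plus a pooling step down to 1x1, so the head also accepts inputs larger than the training resolution.

```python
import torch
import torch.nn as nn

D = 1000                                # assumed number of classes
backbone_out = torch.randn(1, 512, 7, 7)

# "Convolutionalized" classifier head: a conv with D filters instead of an FC
# layer of size D, then pool the spatial dimensions down to 1x1.
head = nn.Sequential(
    nn.Conv2d(512, D, kernel_size=1),   # per-location class scores
    nn.AdaptiveAvgPool2d(1),            # -> (batch, D, 1, 1)
    nn.Flatten(),                       # -> (batch, D), like the FC layer's output
)

print(head(backbone_out).shape)                  # torch.Size([1, 1000])
print(head(torch.randn(1, 512, 14, 14)).shape)   # larger input still works: torch.Size([1, 1000])
```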

khasmamad