CS231n Winter 2016: Lecture 8: Localization and Detection

Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 8.

Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.
Comments

Thank you so much for creating this class and posting these videos, Andrej.
Your work has been very inspiring to me, and has helped me tremendously in shifting my own career.
Keep up the good work.

AlanMelling

*My takeaways:*
1. Classification and localization (3:19): OverFeat
2. Object detection (24:10): R-CNN, Fast R-CNN, Faster R-CNN, and YOLO
- Mean average precision (mAP) (38:45)

leixun

This was more of a paper presentation than a lecture, which is also evident from how few questions the students asked during it. Details like filter sizes, depth per block in the pipeline, and RoI pooling for Fast R-CNN and Faster R-CNN were not clear to me. I hope a better version from the 2017 class will be uploaded.

rohitsaxena

Thanks for posting this, Andrej! Really helpful for learning about or reviewing these topics.
One small tip for Justin Johnson: it would be nice to repeat the audience questions (they are hard to understand in the recording otherwise).

AhmedKachkach

The explanation of the OverFeat sliding-window efficiency at 16:10 is pretty poor; the paper is much clearer. The point isn't really "reimagining" the FC layer as a convolution step. Instead, it lets you take advantage of efficiencies built into convolution implementations that aren't present in FC implementations.

Imagine a convolution operation in one dimension, and say your kernel is 5 numbers. In step 0, I add A+B+C+D+E = A + (B + C + D + E), which costs me 4 add ops. In step 1, I want B+C+D+E+F. I can reuse the cached value and compute (cached_value) + F, which costs only 1 add op. Efficiencies like this can be built into implementations of the convolution operator. FC layers, however, operate over the whole input and have no logical place for such caching.

In Overfeat, we're running these operations on "windows" of the input image. Each window is a lot like a patch of input to a convolutional layer. By transforming the last FC layers into convolution operations, we can treat the whole network as a series of convolution operations and then take advantage of the inherent efficiencies (described above) of convolution operations.
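To make this concrete, here is a minimal PyTorch sketch (the sizes and names are made up for illustration, not taken from the lecture): an FC layer trained on fixed-size feature patches is reshaped into an equivalent convolution, and a single pass over a larger feature map then scores every sliding window at once, sharing the overlapping computation.

```python
import torch
import torch.nn as nn

# Toy sizes (hypothetical): C channels, KxK patches, NCLS classes.
C, K, NCLS = 8, 5, 3
fc = nn.Linear(C * K * K, NCLS)           # "classifier" FC layer for one KxK patch

# Reinterpret the same weights as a KxK convolution.
conv = nn.Conv2d(C, NCLS, kernel_size=K)
conv.weight.data = fc.weight.data.view(NCLS, C, K, K)
conv.bias.data = fc.bias.data

fmap = torch.randn(1, C, 12, 12)          # a larger feature map at test time

# One convolution pass scores all 8x8 sliding windows in one shot.
all_scores = conv(fmap)                   # shape (1, NCLS, 8, 8)

# Sanity check: one window matches the explicit per-patch FC computation.
patch = fmap[:, :, 2:2 + K, 3:3 + K].reshape(1, -1)
assert torch.allclose(fc(patch), all_scores[:, :, 2, 3], atol=1e-5)
```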

robcrane

51:26
The answer I have been looking for. There are also some other questions I'd like to ask (a rough sketch of the RPN head follows below):
1. I still can't really imagine what the 3x3 kernel in the RPN is trying to represent. In a vanilla CNN I can say a filter is responsible for detecting a particular feature in the image (color, pattern, line, edge, and so on). But for the 3x3 sliding window/kernel in the RPN, I can't see what it's trying to capture.
2. Why does the depth of the conv layer in the RPN have(?) to be the same as the depth of the feature map? In the paper they use 256-d, which means 256 channels produced by 256 different 3x3 sliding windows/kernels. Is it because the feature map itself has 256 channels (depth), assuming the base CNN is ZFNet, and they're just trying to preserve the w, h, d of the feature map after the convolution (as in question 1)?
3. Following on from question 1, what exactly do the 1x1 kernels for the cls layer and the reg layer do?
4. How is each anchor box represented in the RPN? How is the 3x3 convolution related to "generating" anchor boxes?

*The questions above are copied from the same lecture video uploaded by another channel.
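For anyone stuck on the same questions, here is a minimal sketch of the RPN head (a rough illustration assuming a ZF-style 256-channel feature map and k = 9 anchors, not the paper's actual code). The 3x3 conv is an ordinary learned convolution producing a 256-d descriptor of each sliding-window position; the 1x1 cls/reg layers act as tiny per-location fully connected layers on that descriptor; and the anchors are a fixed geometric grid that the outputs are interpreted relative to, rather than something the convolution generates.

```python
import torch
import torch.nn as nn

k = 9                                          # anchors per location (3 scales x 3 ratios)
backbone_feat = torch.randn(1, 256, 40, 60)    # hypothetical ZF-style feature map

rpn_conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # the 3x3 "sliding window"
cls_head = nn.Conv2d(256, 2 * k, kernel_size=1)  # object/not-object score per anchor
reg_head = nn.Conv2d(256, 4 * k, kernel_size=1)  # 4 box deltas per anchor

h = torch.relu(rpn_conv(backbone_feat))
scores = cls_head(h)   # (1, 18, 40, 60): 2 scores for each of the 9 anchors per location
deltas = reg_head(h)   # (1, 36, 40, 60): offsets relative to the fixed anchor grid
```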

tiasm

Thank you Andrej, this is helping me so much, with my final-year project too!!!

irtazaa

At 17:28, I'm still not quite seeing why the step from the feature map to the first FC layer becomes a 5x5 convolution (and why the following layers become 1x1 convolutions, for both the regression and classification heads). Does anyone have any pointers, or links to additional resources? Thanks!
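In case a shape walk-through helps, here is a small PyTorch sketch (the 5x5x1024 / 4096 / 1000 sizes follow the slide; the rest is illustrative). The first FC layer consumes the entire 5x5 spatial extent of the feature map, so expressed as a convolution its kernel must be 5x5; every later FC layer consumes a vector of spatial size 1x1, so expressed as a convolution it is 1x1.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 1024, 5, 5)            # pool5-style feature map from the slide

fc6 = nn.Conv2d(1024, 4096, kernel_size=5)   # FC over the whole 5x5 patch -> 5x5 kernel
fc7 = nn.Conv2d(4096, 4096, kernel_size=1)   # FC on a 4096-vector -> 1x1 kernel
cls = nn.Conv2d(4096, 1000, kernel_size=1)   # classification head, also 1x1

out = cls(torch.relu(fc7(torch.relu(fc6(feat)))))
print(out.shape)  # torch.Size([1, 1000, 1, 1]); becomes a score grid on larger inputs
```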

havenwang

I don't understand how to train my model for localization (in what form do I send the images for training?) or how I actually get the bounding boxes. Can you explain these things clearly?

kamalisrinivasan

So basically we know the answers but not the questions the students asked.
Let's reconstruct the questions!!!

ShahidulIslam-xfoz

Very good lecture. Would it be possible to repeat the questions asked by students (or add subtitles for them)?

rajeev

When you regress box deltas and positions, do you do it as a direct numerical regression of real values, or do you regress a one-hot vector?
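If it helps, this is a sketch of the usual parameterization (following the R-CNN-family convention; the numbers below are made up): the targets are real-valued deltas relative to the proposal/anchor box, trained with an L2 or smooth-L1 regression loss, not a one-hot vector.

```python
import math

# R-CNN-style box regression targets: real-valued deltas relative to a
# proposal box, given as (center x, center y, width, height).
def box_deltas(proposal, gt):
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw      # horizontal shift, in units of proposal width
    ty = (gy - py) / ph      # vertical shift, in units of proposal height
    tw = math.log(gw / pw)   # width scaling, in log space
    th = math.log(gh / ph)   # height scaling, in log space
    return tx, ty, tw, th

# Made-up example: a proposal slightly off from the ground-truth box.
print(box_deltas((50, 50, 100, 100), (55, 48, 120, 90)))
# -> (0.05, -0.02, 0.1823..., -0.1053...)
```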

Chrnalis

This lecture got me confused a hell of a lot. Not clear at all. Nowhere close to Andrej's level of teaching.

rishabhrao

While converting a fully connected layer to a convolution layer, what we are doing (in this case, 17:16) is using 4096 filters of size 5*5*1024. But if you count the number of parameters in this layer, it is 5*5*1024*4096, which is much greater than the number of input features (5*5*1024). If you used one parameter per feature, it would only take 5*5*1024 parameters, so how is using 4096 filters of size 5*5*1024 justified?
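For concreteness, a quick count using the comment's own numbers: the conversion does not add parameters. A fully connected layer is not one-parameter-per-input-feature; it has one weight per (input, output) pair, so the FC layer mapping a flattened 5*5*1024 input to 4096 outputs already holds 5*5*1024*4096 weights, exactly matching 4096 conv filters of size 5x5x1024.

```python
# FC layer: every one of the 5*5*1024 inputs connects to each of 4096 outputs.
fc_weights = (5 * 5 * 1024) * 4096

# Equivalent conv layer: 4096 filters, each spanning the full 5x5x1024 volume.
conv_weights = 4096 * (5 * 5 * 1024)

assert fc_weights == conv_weights   # 104,857,600 weights either way (plus 4096 biases)
```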

arjunkrishna

This lecture feels like the lecturer is just reading the slides aloud. The explanations are unclear, and he hastily skipped through many parts without providing any clear explanation.

ashrafibrahim