Simple Deep Neural Networks for Text Classification

Hi. In this video, we will apply neural networks to text. Let's first remember: what is text? You can think of it as a sequence of characters, words, or anything else, and in this video we will continue to think of text as a sequence of words or tokens.

Let's remember how bag of words works. For every distinct word that you have in your dataset, you have a feature column. You are effectively vectorizing each word with a one-hot-encoded vector: a huge vector of zeros that has only one non-zero value, in the column corresponding to that particular word. So in this example, we have "very", "good", and "movie", and all of them are vectorized independently. In this setting, for real-world problems, you end up with hundreds of thousands of columns. And how do we get to the bag of words representation? We can sum up all those one-hot vectors, and we come up with a bag of words vectorization that now corresponds to "very good movie". So it is useful to think of the bag of words representation as a sum of sparse one-hot-encoded vectors, one for each particular word.

Okay, let's move to the neural network way. In contrast to the sparse representation that we've seen in bag of words, in neural networks we usually prefer dense representations. That means we can replace each word with a dense vector that is much shorter. It can have, say, 300 values, and those values can be any real numbers. An example of such vectors is word2vec embeddings, which are pretrained embeddings learned in an unsupervised manner. We will dive into the details of word2vec in the next two weeks. All we need to know right now is that word2vec vectors have a nice property: words that appear in similar contexts, in terms of neighboring words, tend to have vectors that are collinear, that is, they point in roughly the same direction. That is a very nice property that we will use later.

Okay, so now we can replace each word with a dense vector of 300 real values. What do we do next? How can we come up with a feature descriptor for the whole text? We can do it in the same manner as we did for bag of words: we just take the sum of those vectors, and we have a representation based on word2vec embeddings for the whole text, like "very good movie". And that sum of word2vec vectors actually works in practice. It can give you great baseline features for your classifier, and that can work pretty well. Another approach is running a neural network over these embeddings.
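For readers who want to follow along in code, below is a minimal sketch of the two descriptors described above: the bag-of-words vector as a sum of sparse one-hot vectors, and the dense descriptor as a sum of per-word embeddings. The three-word vocabulary and the random 300-dimensional vectors are illustrative assumptions; in practice the embeddings would come from a pretrained word2vec model.

```python
import numpy as np

# Toy vocabulary and word-to-column index (assumption for illustration only).
vocab = ["very", "good", "movie"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse one-hot vector: all zeros except a 1 in the word's column."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

def bag_of_words(tokens):
    """Bag of words = sum of the one-hot vectors of the tokens."""
    return sum(one_hot(t) for t in tokens)

# Dense alternative: each word maps to a short real-valued vector.
# Random numbers stand in here for pretrained word2vec embeddings.
embedding_dim = 300
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=embedding_dim) for w in vocab}

def text_descriptor(tokens):
    """Text descriptor = sum of the word2vec-style embeddings of the tokens."""
    return np.sum([embeddings[t] for t in tokens], axis=0)

tokens = ["very", "good", "movie"]
print(bag_of_words(tokens))           # [1. 1. 1.] -- one column per vocabulary word
print(text_descriptor(tokens).shape)  # (300,) -- dense feature vector for the whole text
```

Either vector can be fed to a classifier; the summed-embedding version is the baseline described above, while the neural-network approach mentioned at the end operates on the per-word embeddings directly instead of collapsing them into a single sum first.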
Comments

Hey there! I normally don’t leave comments or likes, but I had to stop here!
You’ve explained a convoluted topic in a clear, digestible and concise way. Thank you!

Posejdonkon

Thank you for the good explanation!
You forgot to link the paper you mentioned (at 12:43). For anyone who is interested, I think it was this paper:
"Convolutional Neural Networks for Sentence Classification" by Yoon Kim

maxlegnar

This is the most comprehensive video I've ever seen on neural networks! Thank you so much! I study and develop AI, but I was using something more like the bag of words representation. The other thing, aside from accuracy, that I noticed to be an issue with the bag of words representation was the amount of resources it required from the machine it was operating on. To give some insight into just how bad it was: while the machine I was using wasn't exactly top of the line, the machine I'm using now is pretty high performance (i5-8400, 16GB RAM, 1TB Samsung Evo 860 SSD), and yet the facial recognition usually dropped the camera feed down to about 3-5 fps when it would detect a face. Even generating a response (using Speech-to-Text, then a custom-tailored version of the Levenshtein Distance algorithm to correct any misinterpretation of speech) was using at least 7GB of RAM, even with a relatively small data set in the vicinity of maybe 50GB, and using 40-60% of my CPU power. Anyhow, my intent in watching this video was to learn about better algorithms, with the goal of actually implementing a neural network on an FPGA. Now I feel well-equipped with enough information to finally conquer that, as I feel I finally understand CNNs well enough. Thanks so much!

danm

This is the best explanation I've seen of a CNN applied to text input.

louisd

One of the best lectures I have ever heard. Seriously, I was so absorbed in your video for 15 minutes that I forgot the external world. Awaiting the next set of topics.

manjuappu

2:05 Freudian slip? Made me crack up, haha.
Excellent video, thanks for sharing!

boooo

Really nice explanations, even if the convolutional network internals are not explained in enough detail.

sylvainbzh

One of the best videos to understand string inputs for Neural Nets.

NandishA

Fantastic! You have explained it very, very well. Please upload more videos on related Machine Learning topics. Thank you so much.

ijeffking

Great work, thanks. Can't wait for the next one. Very well explained.

DanielWeikert

It is the best explanation of word embeddings I have ever seen.

kushshri

Excellent video. It made me watch the whole playlist.

luislptigres

At 11:19 I am confused about why we learn 100 filters for each gram. What is the filter in this case? I thought that by applying the 3-gram kernel with same padding, we would get a (1, n) vector, where n is the number of words, in this case n = 5. Then with 3-, 4-, and 5-grams, shouldn't we just have three (1, n) vectors? And if we take the max value for each gram, shouldn't we just have 3 outputs, one from each x-gram vector (of size (1, n))? Can you explain why you said 300 outputs? Thanks.

hellochii

How does this compare with the attention mechanism in transformers?

tantzer

Excuse my stupidity: at 4:19, how do you get 0.9 from the word embeddings and the convolutional filter? Is it a dot product or something else?

rialtosan

Where did the 0.9 and 0.84 come from? Sorry, I'm new to this...

johntsirigotis

At 1:54, what are the inputs? Are they [very, good, movie] or are they [x1, x2, x3]?

barax

At 4:25, the result of the convolution is not 0.9, it is 0.88. How does a CNN create these filters? For instance, if we define 16 filters to apply, how does the CNN library determine the contents (numbers) of those filters?

arnetmitarnetmit

Please explain the meaning of the final vector obtained after the 1D convolution, which, I guess, is trained in some way.

argentineinformationservic

What about the context of the text? Why would you use this rather than something like a GRU or LSTM?

bismeetsingh