Neural networks [10.7] : Natural language processing - hierarchical output layer

Comments

I have two questions:
1. How is the word clustering done? What features were used to cluster the words? Was it based on the original tree [randomly assigned words], with the words then clustered on some word properties and the tree updated recursively?
2. With the hierarchical representation of the output layer we gain performance if we know which words we need probabilities for. However, for a task like next-word prediction we would need probabilities for all words to identify the best candidates, in which case we would still be computing over the whole vocabulary. In what cases are the words whose probabilities are needed known beforehand?

ShubhanshuMishra
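A minimal sketch of the computation behind question 2, assuming a binary tree where each word is a leaf reached by a known sequence of left/right decisions; the names sigmoid, path_nodes, path_directions and node_vectors are illustrative, not from the lecture:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(hidden, path_nodes, path_directions, node_vectors):
    """Probability of a single word = product of the binary decisions on its path.

    hidden          : context representation h, shape (d,)
    path_nodes      : indices of the internal nodes from the root to the word's leaf
    path_directions : +1 for "go left", -1 for "go right" at each of those nodes
    node_vectors    : one weight vector per internal node, shape (V - 1, d)
    """
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        prob *= sigmoid(direction * node_vectors[node].dot(hidden))
    return prob

Only the roughly log2(V) nodes on the requested word's path are touched, so the gain shows up whenever the word to score is known in advance, e.g. scoring the observed next word during training or reranking a short list of candidate words, rather than ranking the whole vocabulary.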

Hi, great video! Seriously, hierarchical softmax is explained in various places, but this finally helped me grasp the intuition behind it. So thanks for that.

I still had a few questions:

1. As you mention, this has no value if we need the probabilities for all words in our vocabulary. However, while training with backpropagation, isn't that exactly the case (if we don't use negative sampling)? In my case I'm doing word2vec (so I'm only interested in the hidden-layer representations, not so much in actually using the language model), so I figured speeding things up is only relevant for me during the training phase. So I'd say this technique wouldn't help me unless I start using negative sampling?

2. You mention that using WordNet gives a ~258x speed-up but decreases performance. What's the difference between speed and performance here? Is it about training versus actually using the model?

3. I don't quite understand how different trees (random versus WordNet-based versus learned) can lead to different performance. In each case the length of the path through the tree will be about log2(vocabulary size), right? So why does the performance vary?

jasperdriessens
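On question 1, a worked comparison of the per-example training loss may help (a sketch in standard hierarchical-softmax notation, where h is the context representation, u the output/node vectors, and d_n in {-1, +1} the branching sign at node n on word w's path):

\[
-\log p(w \mid c) \;=\; -\,\mathbf{u}_w^\top \mathbf{h} \;+\; \log \sum_{w'=1}^{|V|} \exp\!\big(\mathbf{u}_{w'}^\top \mathbf{h}\big)
\qquad \text{(full softmax, } O(|V|) \text{ per example)}
\]

\[
-\log p(w \mid c) \;=\; -\sum_{n \,\in\, \mathrm{path}(w)} \log \sigma\!\big(d_n\, \mathbf{u}_n^\top \mathbf{h}\big)
\qquad \text{(hierarchical, } O(\log_2 |V|) \text{ per example)}
\]

Because the training loss only needs the probability of the observed word, the speed-up applies to backpropagation as well, independently of negative sampling: the gradient is non-zero only for the node vectors on that word's path.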

How do we know the correct path from the root node to the leaf node, so that we can calculate the probabilities?

danishrathore

Isn't it a bit more accurate to define the estimated probability as p(context | "cat") instead of p("cat" | context) on slide 3, since the skip-gram task is defined as p(context | target_word)?
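For reference, the skip-gram objective being referred to factorizes over context positions (standard word2vec notation, not the slide's):

\[
p(\text{context} \mid w_t) \;=\; \prod_{\substack{-c \,\le\, j \,\le\, c \\ j \neq 0}} p\big(w_{t+j} \mid w_t\big),
\]

so each factor is still a single "one word given one word" probability, and it is these per-position factors that the (hierarchical) output layer estimates.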


Hugo, I have one question: how is this tree constructed, and how is it linked to the neural network model?

maheshkannan

Questions:
1. Does this slow down prediction runtime from n to n log n?
2. Why would a randomly generated tree be suboptimal?
3. In the third approach, wouldn't the tree still be suboptimal if the original word vectors are not fully trained?

ThomasChen-urgt
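On question 1: if every leaf is scored independently, the full distribution costs roughly V*log2(V) operations, but a single top-down pass that reuses each internal node's sigmoid brings it back to O(V), the same order as a flat softmax. A minimal sketch, with an assumed array encoding of the tree (left, right and node_vectors are illustrative, and node 0 is assumed to be the root):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def full_distribution(hidden, left, right, node_vectors, num_words):
    """All leaf probabilities in one top-down pass over the binary tree.

    left[n] / right[n] : child indices of internal node n; negative values
                         encode leaves, with word id = -child - 1.
    Each internal node's sigmoid is computed exactly once, so the whole
    distribution costs O(V) rather than O(V log V).
    """
    probs = np.zeros(num_words)
    stack = [(0, 1.0)]                      # (node index, probability mass reaching it)
    while stack:
        node, mass = stack.pop()
        p_left = sigmoid(node_vectors[node].dot(hidden))
        for child, p in ((left[node], mass * p_left), (right[node], mass * (1.0 - p_left))):
            if child < 0:
                probs[-child - 1] = p       # leaf: this word's probability
            else:
                stack.append((child, p))    # internal node: keep descending
    return probs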

Thank you for the video. I wonder what the internal node vectors look like. Are they just initialized with some random weights? For example, in skip-gram, if I input the center word and I know the 2 context words, what do the internal node vectors look like?

iidtxbc
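A sketch of what word2vec-style implementations typically do before any training: the word embeddings get small random values and the internal node vectors start at zero (a small random initialization also works). At that point the node vectors carry no meaning at all; they only become informative through the gradient updates along each context word's path.

import numpy as np

rng = np.random.default_rng(0)
embedding_dim, vocab_size = 100, 250_000

# input (word) embeddings: small random values, as in the reference word2vec code
word_vectors = ((rng.random((vocab_size, embedding_dim)) - 0.5) / embedding_dim).astype(np.float32)
# one vector per internal node of the binary tree, initialized to zero
node_vectors = np.zeros((vocab_size - 1, embedding_dim), dtype=np.float32)

For a skip-gram step with one center word and 2 context words, the hidden vector is just word_vectors[center], and only the node vectors on each context word's path (plus that one embedding row) receive updates.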

I have a question:
What is a shortlist, and why do we use it?

chen-ning

Hello dear Hugo, thanks for your course.
I have some questions to help me understand the hierarchical softmax better.
- In a typical softmax, at each step we update the parameters for the correct class and also for the incorrect classes. You said that calculating the softmax is really expensive (e.g. for 250,000 classes), so we are supposed to approximate the probability. If we use a binary tree to do the estimation, then we only update the parameters for the correct class and not for the incorrect classes, is that right?

- And another tiny question: if we have, for example, window_size=5, then the input shape will be (5*250,000, 1) and the output will have a shape of (250,000, 1), is that right?

mahdiamrollahi

Are there any tutorials from people implementing this?

tamimazmain

Hugo,
1. How is the loss function calculated with a hierarchical output layer? In the softmax case, the number of output units equals the vocabulary size, so the target value of each unit would be the probability of that word occurring in that context (which we can estimate from counts in the training data). What would the target probability of each output unit (i.e. node in the tree) be here?

2. For each context, will the number of output units that light up be equal to the depth d of the tree? Does this mean that the total number of output units (the ones that light up plus the ones we don't calculate for that context) equals 2^(d+1) − 1?

gilsinialopez
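A quick count for question 2, assuming a full binary tree of depth d with one word per leaf (so the vocabulary size is 2^d):

\[
\underbrace{2^{d}}_{\text{leaves (words)}} \;+\; \underbrace{2^{d}-1}_{\text{internal nodes}} \;=\; 2^{d+1}-1 \ \text{nodes in total,}
\]

but for a given context only the d internal nodes on the observed word's path are evaluated. Each of them has a binary target (branch left or right), so the loss is a sum of d logistic cross-entropy terms rather than a 2^d-way cross-entropy over the whole vocabulary.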

Great video!! Do you have any pointers on the multi-way branching hierarchical structure you mentioned at the end of the video?

allancici

Great lecture!! I had a simple question.
Could you explain why different tree architectures (WordNet, random, ...) lead to different performance? Thanks in advance!!

user