Why Transformer over Recurrent Neural Networks

#transformers #machinelearning #chatgpt #gpt #deeplearning
Comments

That's not the main reason. RNNs keep adding to the embeddings and hence overwrite information that came before, whereas in a Transformer the embeddings are there the whole time and attention can pick out the ones that are important.

IshtiaqueAman
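
A minimal NumPy sketch of the contrast described above (all shapes and names are illustrative, not from the video): an RNN squeezes the whole sequence into one fixed-size hidden state that gets rewritten at every step, while attention keeps every embedding available and merely re-weights them.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                                    # embedding size (illustrative)
    x = rng.normal(size=(5, d))              # 5 token embeddings

    # RNN: one hidden vector, overwritten at every step, so early
    # tokens gradually fade from h.
    W_h = rng.normal(size=(d, d)) * 0.5
    W_x = rng.normal(size=(d, d)) * 0.5
    h = np.zeros(d)
    for t in range(5):
        h = np.tanh(W_h @ h + W_x @ x[t])

    # Attention: all embeddings stay available; a query re-weights
    # them, so nothing is overwritten.
    q = rng.normal(size=d)                   # query vector (illustrative)
    scores = x @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax
    context = weights @ x                    # mix of ALL five tokens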

That was a great video!
I find learning about such things generally easier and more interesting when they are compared to other models/ideas that are similar but not identical.

NoahElRhandour

Note that the decoder in a Transformer outputs one vector at a time as well.

untitledc
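
That matches a toy greedy-decoding loop (the model interface here is hypothetical): at inference time a Transformer decoder is autoregressive and emits one token per forward pass, even though each pass re-attends over the whole prefix.

    import torch

    def greedy_decode(model, prompt_ids, max_new=20, eos_id=2):
        # model: any module mapping (1, T) token ids -> (1, T, vocab) logits
        ids = prompt_ids
        for _ in range(max_new):
            logits = model(ids)                      # attends over all of ids
            next_id = logits[:, -1].argmax(dim=-1)   # logits at the last position
            ids = torch.cat([ids, next_id[:, None]], dim=1)
            if next_id.item() == eos_id:
                break
        return ids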

This answered a question I didn't have. Thanks!

schillaci

I think LSTMs are more tuned toward keeping the order, because although Transformers can assemble embeddings from various tokens, they don't know what follows what in a sentence.

But perhaps with relative positional encoding they might be equipped just about well enough to understand the order of sequential input.

IgorAherne
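
On the ordering point: plain attention is permutation-invariant, which is exactly why positional information has to be injected. A sketch of the sinusoidal (absolute) encoding from "Attention Is All You Need"; relative schemes exist as well but are more involved. Assumes an even d_model.

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
        pos = np.arange(seq_len)[:, None]           # (T, 1)
        i = np.arange(0, d_model, 2)[None, :]       # (1, d/2)
        angles = pos / np.power(10000.0, i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # Added to the token embeddings before attention:
    # x = token_embeddings + sinusoidal_positions(T, d)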

YouTube recommend me more videos like this plz

brianprzezdziecki

An important caveat is that decoder Transformers like the GPT models are trained autoregressively, with no context from the words that come after.

sandraviknander
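
A minimal sketch of the causal mask that enforces this during training: every position may attend only to itself and earlier positions, so no information from later words leaks in.

    import torch
    import torch.nn.functional as F

    T, d = 5, 8                    # toy sequence length and head dimension
    q, k, v = (torch.randn(T, d) for _ in range(3))

    scores = q @ k.T / d ** 0.5                          # (T, T) pairwise scores
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))  # hide future positions
    out = F.softmax(scores, dim=-1) @ v                  # each row mixes past only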

This was cool, but I'm not sure whether it was explained correctly or I just didn't understand fully. I study transformers, and in the global attention mechanism word prediction compares a word to every other past word in the input. How does that predict future words?

lavishly

This is the best explanation of RNNs vs Transformers I've ever seen. Is there a similar video for self-attention, by any chance? Thank you

kenichisegawa

You should have put LSTMs as a middle step

aron

Does a decoder model share these same advantages? Without the attention mapping, wouldn't it be operating with the same context as an RNN?

jackrayner

The main reason is that RNNs have what we call the exploding and vanishing gradient problem.

free_thinker
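
A quick numeric illustration of that problem: backpropagation through time multiplies the gradient by (roughly) the same recurrent Jacobian at every step, so its norm shrinks or blows up exponentially with sequence length. The diagonal Jacobian below is a deliberate simplification.

    import numpy as np

    rng = np.random.default_rng(0)
    d, steps = 16, 100
    grad = rng.normal(size=d)

    for scale in (0.9, 1.1):           # spectral radius below / above 1
        W = np.eye(d) * scale          # stand-in for the recurrent Jacobian
        g = grad.copy()
        for _ in range(steps):
            g = W.T @ g                # one backprop-through-time step
        print(scale, np.linalg.norm(g))   # ~0.9**100 (vanishes) vs ~1.1**100 (explodes)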

Can you do a Fourier transform replacing the attention head?

jugsma
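
This has been done: FNet (Lee-Thorp et al., 2021) replaces the self-attention sublayer with an unparameterized 2D Fourier transform for token mixing. A sketch of the core idea:

    import torch

    def fourier_mixing(x):
        # FNet-style mixing: 2D FFT over the sequence and hidden
        # dimensions, keeping only the real part. No learned parameters.
        # x: (batch, seq_len, d_model)
        return torch.fft.fft2(x).real

    x = torch.randn(2, 10, 16)
    y = fourier_mixing(x)          # same shape as x, tokens now mixed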

Aren't most of the Transformers in use based on causal self-attention? That doesn't seem to have the bidirectional aspect to it.

drdca

Don't Transformer models generate one token at a time? It's just that they're faster because the calculations can be done in parallel.

alfredwindslow
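
Right, and the parallelism is specifically at training time: one forward pass scores every position at once under teacher forcing, whereas an RNN would need T sequential steps. A sketch with a stand-in model (not a real Transformer):

    import torch
    import torch.nn.functional as F

    vocab, T = 100, 12
    ids = torch.randint(0, vocab, (1, T))
    model = torch.nn.Sequential(          # stand-in for a Transformer stack
        torch.nn.Embedding(vocab, 32),
        torch.nn.Linear(32, vocab),
    )

    logits = model(ids)                   # (1, T, vocab): all T positions at once
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                           ids[:, 1:].reshape(-1))   # next-token prediction

    # Generation remains sequential: one new token per forward pass.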

What if you wanted to train a network to take a sequence of images (like in a video) and generate what comes next? Wouldn't that be a case where RNNs and their variants like LSTMs and GRUs are better, since each image is most closely related to the images coming directly before and after it?

vastabyss

What I'm wondering is: why do all APIs charge credits for input tokens with transformers? To me, it shouldn't make a difference whether a transformer takes 20 tokens or 1,000 as input (as long as it's within its maximum context length). Isn't it the case that a transformer always pads the input to its maximum context length anyway?

Laszer
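
On the padding assumption: a small sketch of why input length still matters. Self-attention forms a T x T score matrix from the actual tokens, so the work grows roughly quadratically with T; implementations do not need to pad every input to the maximum context length.

    import torch

    d = 64
    for T in (20, 1000):
        x = torch.randn(T, d)
        scores = x @ x.T               # (T, T): 400 vs 1,000,000 entries
        print(T, scores.numel())       # attention work scales ~ T**2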

How can we relate this to the masked multi-head attention concept in Transformers? This video seems to conflict with that. Any expert ideas here, please?

sreedharsn-xwyi

But there is also a version of RNNs with attention.

manikantabandla
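
True: attention predates the Transformer, e.g. Bahdanau et al. (2014) added it on top of an RNN encoder-decoder for translation. A minimal sketch of a decoder state attending over RNN encoder states (dot-product scoring for brevity; the original formulation is additive):

    import torch
    import torch.nn.functional as F

    T, d = 7, 16
    encoder = torch.nn.GRU(d, d, batch_first=True)
    enc_out, _ = encoder(torch.randn(1, T, d))      # (1, T, d) RNN states

    dec_state = torch.randn(d)                      # current decoder state
    weights = F.softmax(enc_out[0] @ dec_state / d ** 0.5, dim=-1)  # (T,)
    context = weights @ enc_out[0]                  # (d,) mix of encoder states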