LSTM is dead. Long Live Transformers!

Comments

That's one of the best deep-learning presentations I've seen in a while! It not only introduced transformers but also gave an overview of other NLP strategies, activation functions, and best practices when using optimizers. Thank you!!

FernandoWittmann

Good to see Adam Driver working on transformers 😁

vamseesriharsha

For anyone feeling overwhelmed: that's completely reasonable, as this video is just a 28-minute recap for experienced machine learning practitioners, and a lot of them are just spamming the top comments with "This is by far the best video", "Everything is clear with this single video", and so on.

sanjivgautam

Thank you for this concise and well-rounded talk! The pseudocode example was awesome!

richardosuala

It's hard to overstate just how much this topic has transformed (and is still transforming) the industry. As others have said, understanding it is not easy, because there are a bunch of components that don't seem to align with one another, and overall the architecture is such a departure from the more traditional things you are taught. I myself have wrangled with it for a while, and it's still difficult to fully grasp. Like any hard problem, you have to bang your head against it for a while before it clicks.

ajitkirpekar

Great talk. It's always thrilling to see someone who actually knows the subject they're presenting.

monikathornton

This is like 90% of what I remember from my NLP course with all the uncertainty cleared up, thanks!

lmao

I love this presentation.
It doesn't assume that the audience knows far more than is necessary, it goes through explanations of the relevant parts of Transformers, it notes shortcomings, etc.
Best slideshow I've seen this year, and it's from over 3 years ago.

_RMSG_

Leo is an excellent professor. He explains difficult concepts in an easy-to-understand way.

JagdeepSandhuSJC

Wonderfully clear and precise presentation. One thing that tripped me up, though, is this formula at 4 minutes in:

H_{i+1} = A(H_i, x_i)

Seems this should rather be:

H_{i+1} = A(H_i, x_{i+1})

which might be more intuitively written as:

H_i = A(H_{i-1}, x_i)
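In code, that corrected recurrence is just a loop that folds each new input into the previous hidden state. A minimal sketch (plain NumPy, with made-up weight names rather than the talk's notation):

import numpy as np

def rnn_states(inputs, W_h, W_x, b, h0):
    # Vanilla RNN: H_i = A(H_{i-1}, x_i), here with A = tanh(W_h h + W_x x + b).
    h = h0
    states = []
    for x in inputs:                        # x_1, x_2, ..., x_n in order
        h = np.tanh(W_h @ h + W_x @ x + b)  # new state from previous state and current input
        states.append(h)
    return states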

cliffrosen

12:56 — the review of the pseudocode of the attention mechanism was what finally helped me understand it (specifically the meaning of the Q, K, V vectors), which is what other videos were lacking. In the second outer for loop, I still don't fully understand why it loops over the length of the input sequence. The output can be a different length, no? Maybe this is an error. Also, I think he didn't mention the masking of the remaining output at each step, so the model doesn't "cheat".
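For anyone else stuck there, here is a rough NumPy sketch of scaled dot-product attention with the causal mask mentioned above; it is a generic illustration, not the speaker's exact pseudocode:

import numpy as np

def causal_attention(Q, K, V):
    # Q, K, V: arrays of shape (seq_len, d). Position i may only attend to
    # positions <= i, so the model cannot "cheat" by looking at the future.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # query-key similarities, (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1)   # 1s above the diagonal mark future positions
    scores = np.where(mask == 1, -1e9, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over the keys
    return weights @ V                          # each output row is a weighted mix of values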

Scranny

The world deserves more lectures like this one. I don't need examples of how to tune a U-Net; I need exactly this kind of overview of a huge research space and of the ideas underneath each group of methods.

BartoszBielecki

All I want is his level of humility and knowledge.

BcomingHIM

I was trying to use a similar super-low-frequency sine trick for audio sample classification (to give the network more clues about attack/sustain/release positioning). Never did I know that one can use several of those in different phases. Such a simple and beautiful trick.
The presentation is awesome.
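In case it helps anyone, the several-sines-in-different-phases idea is essentially the transformer's sinusoidal positional encoding. A rough sketch (plain NumPy, assuming an even model dimension):

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sines and cosines at geometrically spaced frequencies; each position gets a
    # unique pattern, and nearby positions get similar ones.
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    freqs = 1.0 / (10000 ** (dims / d_model))      # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)        # sine channels
    pe[:, 1::2] = np.cos(positions * freqs)        # cosine channels (90-degree phase shift)
    return pe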

evennot

RIP LSTM (2019); she/he/it/they will be remembered by....

ProfessionalTycoons

You folks need to look into asymptotics and Padé approximant methods; for functions of many variables, as ANNs are, you'd use the generalized Canterbury approximants. There is not yet a rigorous development in information-theoretic terms, but Padé summations (essentially continued-fraction representations) are known to yield rapid convergence to the correct limits for divergent Taylor series in non-converging regions of the complex plane. What this boils down to is that you only need a fairly small number of iterations to get very accurate results if you only require approximations. To my knowledge this sort of method is not being used in deep learning, but it has been used by physicists in perturbation theory. I think you will find it extremely powerful in deep learning. Padé (or Canterbury) summation methods, when generalized, are a way of extracting information from incomplete data. So if you use a neural net to get the first few approximants, and assume they are modelling an analytically continued function, then you have a series (the node activation summation) you can Padé-sum to extract more information than you'd be able to otherwise.
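As a tiny, self-contained illustration of the idea (a made-up numeric example, nothing to do with the talk): for log(1 + x), the two-term Taylor series and the [1/1] Padé approximant are built from the same coefficients, yet the Padé form is much more accurate at the edge of the series' convergence region:

import numpy as np

def taylor_log1p(x):
    return x - x**2 / 2      # truncated Taylor series of log(1 + x)

def pade_log1p(x):
    return x / (1 + x / 2)   # [1/1] Padé approximant built from the same coefficients

x = 1.0
print(np.log1p(x), taylor_log1p(x), pade_log1p(x))
# true value ~0.693, Taylor gives 0.5, Padé gives ~0.667 from the same information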

Achrononmaster

This is hands down the best presentation on LSTMs and Transformers I have ever seen. The speaker is really good. He knows his stuff.

timharris

Best transformer presentation I’ve seen hands down. Nice job!

Johnathanaa

This finally made it clear to me why RNNs were introduced! Thanks for sharing.

ismaila

Thanks for this! It gets to the heart of the matter quickly and in an easy-to-grasp way. Excellent.

briancase