Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman

GUEST BIO:
Andrej Karpathy is a legendary AI researcher, engineer, and educator. He's the former director of AI at Tesla, a founding member of OpenAI, and an educator at Stanford.

Comments

Andrej speaks at 1.5x speed and Lex, as always, at 3/4x. Yet, somehow they understand each other.

mauricemeijers

Damn. That last sentence. Transformers are so resilient that they haven't been touched in the past *FIVE YEARS* of AI! I don't think that idea can ever be overstated given how fast this thing is accelerating...

totheknee

My professor Dr Sageeve Oore gave a very good intuition about residual connections. He told me that residual connections allow a network to learn the simplest possible function first. No matter how many complex layers there are, the network starts by learning a linear function, and the complex layers add in non-linearity as needed to learn the true function. A fascinating advantage of this connection is that it provides great generalisation. (Don't know why, I just felt the need to share this.)
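
A minimal sketch of that intuition (hypothetical PyTorch-style code, not from the video): the block computes x + F(x), so if F's weights start near zero the block begins as roughly the identity and only adds non-linearity where training demands it.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = x + F(x): the skip path carries the simple (near-linear) solution,
    # and F only has to learn the residual non-linearity on top of it.
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)  # identity path + learned correction

x = torch.randn(4, 16)
print(ResidualBlock(16)(x).shape)  # torch.Size([4, 16])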

baqirhusain

Why doesn't Lex invite the actual inventors of Transformers, e.g. Ashish Vaswani? All these people like Sam Altman and Andrej Karpathy are reaping the harvest of the invention from the paper "Attention Is All You Need", yet its authors have not been invited even once to Lex's talks.

oleglevchenko

It's amazing to have a podcast where the host can hold their own with Kanye West in a manic state and also have serious conversations about state-of-the-art deep learning architectures. Lex is one of one.

SMH

The name "attention" was already around on other, different architectures. It was common to see bidirectional recurrent neural networks with "attention" on the encoder side. That's where the name "Attention Is All You Need" comes from: the Transformer basically removes the need for a recurrent or sequential architecture.
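
As a rough illustration of why attention removes the recurrence (a minimal PyTorch-style sketch, not anything from the clip): in scaled dot-product self-attention every position attends to every other position in a single matrix operation, so there is no step-by-step loop over the sequence.

import torch

def self_attention(x, wq, wk, wv):
    # x: (T, d) sequence. All positions are processed in parallel,
    # so no recurrence over time steps is required.
    q, k, v = x @ wq, x @ wk, x @ wv          # queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5     # (T, T) pairwise similarities
    weights = torch.softmax(scores, dim=-1)   # each position's attention distribution
    return weights @ v                        # weighted sum of values

T, d = 5, 8
x = torch.randn(T, d)
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)  # torch.Size([5, 8])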

wasp

Karpathy has some great insights. Transformers seem to solve the NN architecture problem without hyperparameter tuning. The "next" for transformers is going to be neurosymbolic computing, i.e. integrating logic with neural processing. Right now transformers have trouble with deep reasoning.

It's remarkable that reasoning automatically arises in transformers based on the pretext task structure. I believe there is a deeper concept of AI waiting to be discovered. If the mechanism for the auto-generated logic pathways in transformers could be discovered, then that could be scaled up to produce general AI.

bmatichuk

- Understanding the Transformer architecture (0:28)

- Recognizing the convergence of different neural network architectures towards Transformers for multiple sensory modalities (0:38)

- Appreciating the Transformer's efficiency on modern hardware (0:57)

- Reflecting on the paper's title and its meme-like quality (1:58)

- Considering the expressive, optimizable, and efficient nature of Transformers (2:42)

- Discussing the learning process of short algorithms in Transformers and the stability of the architecture (4:56)

- Contemplating future discoveries and improvements in Transformers (7:38)

ReflectionOcean

I read the paper and was wondering whether the Transformer is just another kind of LLM for generative tasks, since they describe it as a model and compare it with other models at the end of the paper. But after watching this explanation by Andrej, I finally understood that it is a kind of architecture that learns the relationships between the elements of a sequence.

rajatavaghosh

I double-checked whether I was listening at 1.25x speed when Andrej was speaking.

tlz

1:56 “I don’t think anyone used that kind of title before, right?”
Well, maybe not as a title, but I can’t imagine that the authors of the paper were unaware of the lyric “Love is all you need” from The Beatles’ 1967 song “All You Need is Love.”

jeff__w

Andrej's influence on the development of the field is so underrated. He's not only actively contributing academically (i.e. through research and co-founding OpenAI), but he also communicates ideas so well to the public (for free, by the way) that he not only helps others contribute academically to the field, but also encourages many people to get into it, simply because he manages to take an overwhelmingly complex topic (at least it used to be for me), such as the Transformer, and strip it down to something that can be (more easily) digested. Or maybe that's just me, as my professor in my undergrad came nowhere near an explanation of Transformers as good and intuitive as Andrej's videos (don't get me wrong, [most] professors know their stuff very well, but Andrej is just on a whole other level).

aangeli

Dr. Ashish Vaswani is a pioneer and nobody is talking about him. He is a scientist from Google Brain and the first author of the paper that introduced Transformers, which are the backbone of all other recent models.

amarnamarpan

Self-attention. Transforming. It's all about giving the AI more parameters to optimize which internal representations of the interconnections within the data itself are important. We've supplied first-order interconnections. What about second order? Third? Or is that expected to be covered by the sliding-window technique itself? It would seem that the more early representations we can add, the more of the data's complexity/nuance we can couple to. At the other end, the more we couple to the output, the closer to alignment we can get. But input/output are fuzzy concepts in a sliding-window technique. There is no temporal component to the information; the information is represented by large "thinking spaces" of word connections. It's somewhere between a CNN-like technique that parses certain subsections of the whole thing at once and a fully connected space between all the inputs. That said, sliding is convenient, as it removes the hard limit on what can be generated and gives an easy-to-understand parameter we can increase at fairly small cost to improve our ability to generate long-form output with deeper nuance/accuracy. The ability to just change the size of the window and have the network adjust seems a fairly nice way to flexibly scale the models, though there is a "cost" to moving around, i.e. network stability, meaning you can only scale up or down so much at a time while keeping most of the knowledge gained from previous training.

Anyway, the key ingredient is that we purposefully encode the spatial information (into the words themselves) to the depth we desire. Or at least that's a possible extension. The next question, of course, is in which areas of representation we can supply more data that encodes, within the mathematics, information we think is important but isn't covered by the processes of the system itself (having the same thing represented in multiple ways, i.e. the data plus the system, is a path to overly complicated systems in terms of growth/addendums). The easiest path is to just represent it in the data itself, and patch it. But you can do stages of processing/filtering along multiple fronts and incorporate them into a larger model more easily, as long as the encodings are compatible (which I imagine will most greatly affect the growth and swappability of these systems through standardization).
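
On the "purposefully encode the spatial information" point, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper, written as hypothetical PyTorch-style code: it is simply added to the token embeddings, so position lives in the data itself rather than in any recurrence.

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # (d/2,)
    angles = pos / (10000.0 ** (i / d_model))                       # (T, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

tokens = torch.randn(10, 64)                          # hypothetical token embeddings (T, d)
x = tokens + sinusoidal_positional_encoding(10, 64)   # position is now "in the data"
print(x.shape)  # torch.Size([10, 64])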

Ideally this is information that is further self-represented within the data itself. FFTs are a great approximation we can use to bridge continuous vs discrete knowledge. Though calculating one on word encodings feels like a poor fit, we could break the "data signal" into a chosen subset of wavelengths. Note this doesn't help the next-word-prediction "component" of the data representation, but it is a past-knowledge-based encoding that can be used in unison with the spatial/self-attention and parser encodings to represent the info (I'm actually not sure of the balance between spatial and self-attention, except that the importance of each token to the generation of each word, along with possibly a higher order of interconnections between the tokens, is contained within the input stream). If it is higher order, then FFTs may already be represented and I've talked myself in a circle.

I wonder what results dropout tied to categorization would yield for the swappability of different components between systems? Or the ability to turn various bits and bobs on/off in a way tied to the data? I think that's how one can understand the reverse flow of partial derivatives through the loss function as well: by turning off all but one path at a time to split the parts considered, though that depends on the loss function being used. I imagine categorizing subsections of data and then splitting them off into distinct areas would allow finer control over the representations of subsystems, to increase scores on specific tests without affecting other testing areas as much. That could be antithetical to AGI-style understanding, but it would allow for field-specific interpretation of information, in a sense.

Heck. What if we encoded each word as its dictionary definition?

Halopend

Great interview. Engaging and dynamic. Thank you.

MsStone-ueek

“Attention is all you need” is great. It’s like a book title that you can’t forget.

danparish

6:30 😂 Imagine how fast this sounds to Lex

diedforurwins

Best explanation of the essence of the Transformer architecture. I think the title is a red herring because it makes it more difficult to understand: you need much more than attention, you need all the smart tweaks. And it keeps making my mind think of Megatron from the movies, and I'm not sure what the relationship is, if any. I like "generalized differentiable program" as the best description of a Transformer model today, but that could change. The description is from Yann LeCun in the 2017-19 time period. Jennifer

datasciyinfo

Amazing how one paper can change the course of humanity.
I like that kind of return on investment; let's get more weird and ambitious.

alexforget