How Decoder-Only Transformers (like GPT) Work

Learn about encoders, cross attention, and masking for LLMs as SuperDataScience Founder Kirill Eremenko returns to the SuperDataScience podcast to speak with @JonKrohnLearns about transformer architectures and why they are a new frontier for generative AI. If you're interested in applying LLMs to your business portfolio, you'll want to pay close attention to this episode!

Comments

Value vectors are scaled by the attention weights. Say the weight is 6: then the value vector [1, 2, 3] * 6 is [6, 12, 18].
Attention weights come from the dot product of Q and K; the inner product is always a scalar - 6 in our case.

Ash-bcvw
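
To make the comment above concrete, here is a minimal NumPy sketch (with made-up toy vectors) of a single query attending over two positions: the query-key dot product yields one scalar score per position, and after scaling by sqrt(d_k) and a softmax, each value vector is multiplied by its attention weight and summed. Note that in a real transformer the weights that actually scale the values sum to 1, so a raw score like 6 never multiplies a value directly.

```python
import numpy as np

# Toy sketch: one query attends over two key/value positions.
q = np.array([1.0, 1.0, 2.0])             # query for the current token
K = np.array([[1.0, 1.0, 2.0],            # keys, one row per position
              [0.5, 0.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0],            # values, one row per position
              [4.0, 5.0, 6.0]])

raw_scores = K @ q                        # dot product q·k per position: [6.0, 2.5]
scores = raw_scores / np.sqrt(q.shape[0]) # scale by sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> weights sum to 1
output = weights @ V                      # each value vector scaled by its weight, then summed

print(raw_scores)   # the unnormalized scalar scores (the "6" in the comment)
print(weights)      # normalized attention weights
print(output)       # attention output for this query
```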

...I listened to Podcast #747 at least 10 times! I wish every policy maker and general manager listened to Podcast #747 - it will dispel any confusion about the term "AI" and other anthropomorphic terms like "pre-training" and "training" that may seem humanistic in nature. It's really just assigning numeric values to words. Then, using statistics, you are able to attribute a "vector" point in space. And since that space is effectively limitless, you can assign a huge number of parameters that describe the location of that vector point and of all the points near it and far from it.

energyexecs
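
As a toy illustration of the "assigning number values to words" idea in the comment above, here is a minimal sketch with a made-up three-word vocabulary and a random embedding table; in a trained model these coordinates are learned so that related words end up near each other in the space.

```python
import numpy as np

# Each word gets an integer id; an embedding table maps that id to a point
# (vector) in a continuous space. The table here is random just to show the
# mechanics -- in a trained model the coordinates are learned parameters.
vocab = {"cat": 0, "dog": 1, "car": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))   # 4-dimensional toy space

cat_vec = embedding_table[vocab["cat"]]
dog_vec = embedding_table[vocab["dog"]]

# Cosine similarity measures how close two word-points are in that space.
cosine = cat_vec @ dog_vec / (np.linalg.norm(cat_vec) * np.linalg.norm(dog_vec))
print(cat_vec)
print(dog_vec)
print(round(float(cosine), 3))
```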

Masked self-attention should be discussed here

Ash-bcvw
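
Following up on the request above, here is a minimal sketch of the causal mask that gives decoder-only models their masked self-attention: future positions are set to -inf in the score matrix before the softmax, so each token can only attend to itself and earlier tokens. The scores here are random stand-ins for Q @ K.T / sqrt(d_k).

```python
import numpy as np

# Causal (masked) self-attention in a decoder-only model: position i may only
# attend to positions <= i. This is enforced by setting the "future" entries
# of the score matrix to -inf before the softmax, so their weights become 0.
T = 4                                         # toy sequence length
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))              # stand-in for Q @ K.T / sqrt(d_k)

future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
scores = np.where(future, -np.inf, scores)           # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax

print(np.round(weights, 3))   # upper triangle is exactly 0; each row sums to 1
```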