Multi-Head vs Grouped Query Attention. Are Claude, Llama-3, and Gemma choosing speed over quality?

Frontier model providers such as Anthropic (Claude 3.5 Sonnet), Google (Gemini / Gemma 2B), and Meta (Llama-3) are trending towards grouped query attention over traditional multi-head attention as the attention mechanism in their transformer models. Interestingly, OpenAI with GPT-4o doesn't seem to be making this trade-off.

Although this choice speeds up AI inference, it does impact output quality for tasks such as summarization. In this video Chris shows that you get more coherent output from models such as Llama-2 or Claude 3 Opus than from newer models such as Llama-3, Gemini, or Gemma. In the end, in certain scenarios such as summarization or generative content, GPT-4o still beats Sonnet.
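GQA is multi-head attention with the K/V heads shared across groups of query heads, which shrinks the KV cache at inference time. A minimal NumPy sketch of the idea (head counts, shapes, and function names here are illustrative, not any model's actual configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), n_kv_heads <= n_q_heads.
    # Each contiguous group of query heads shares one K/V head; when
    # n_kv_heads == n_q_heads this reduces to standard multi-head attention.
    n_q_heads, _, d = q.shape
    group = n_q_heads // k.shape[0]
    k = np.repeat(k, group, axis=0)  # replicate shared K/V heads per query head
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads -> KV cache is 4x smaller
v = rng.standard_normal((2, 4, 16))
print(grouped_query_attention(q, k, v).shape)  # (8, 4, 16)
```

The speed/quality trade-off discussed in the video comes from that `np.repeat`: four query heads now see identical keys and values, so the model stores (and moves through memory) a quarter of the KV cache, at the cost of less expressive attention per head.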

repo
Comments
Author

I just attended the detailed anatomy of LLM session… and it's just wow! Nobody's telling these details. Thanks very much Chris ❤

makepeace
Author

Interesting!

Claude 3.5 Sonnet is definitely great for code, much better than GPT-4o, and has really helped me solve things that are well beyond my brain capacity in the last few days.

everyhandletaken
Author

Great video! I don't understand it fully, had to watch it again, but I'm getting an idea of what is happening! Thank you!

trsd
Author

Llama-2 70B uses GQA (its 7B and 13B versions used MHA)

LombardyKozack
Author

Look at me when you talk to Me Boy Look AT ME
You shy too much Love it


Thanks, it really helped in my presentation

awaisamin
Author

Intel agencies are having their fill first. It's obviously being slowed down so three-letter agencies can get ahead of this.

seanknowles