7 Popular LLM Benchmarks Explained [OpenLLM Leaderboard & Chatbot Arena]


In this video, I go through and explain the benchmarks behind Chatbot Arena and the Open LLM Leaderboard. These are more general benchmarks for text-based LLMs, so HumanEval is not covered here. Let me know about any other benchmarks you want me to explain in the future!

This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi

[Video Editor] Silas

0:00 Intro
0:57 MMLU
1:41 ARC
2:10 HellaSwag
2:57 Winogrande
3:27 TruthfulQA
3:52 GSM8K
4:26 MT-Bench
5:05 Outro
Comments

What do you think about the discovered unreliability of the MMLU?

Yottenburgen

Another benchmark, unfortunately also named ARC, is the "Abstraction and Reasoning Corpus". The best models solve 30% of the tasks.

simonstrandgaard

Sometimes 1 point makes a big difference! Like when Fallout: New Vegas got an 84 instead of an 85 on Metacritic, and Bethesda denied them a cash bonus...

arkurianstormblade

I'd suggest mentioning HumanEval, HumanEval+ and TACO to cover the current programming benchmarks.

theresalwaysanotherway

Sorry, I've been enjoying your videos for a few months and hadn't even subscribed. Now I'm subscribed with notifications on. Keep making awesome videos, my dude.

Synthetiks

3:28 Damn, this makes GPT-3 seem like a gigachad.

zyxwvutsrqponmlkh

So what benchmarks are available for Stable Diffusion? And how long are we going to have Bad Hands?

luislozano

Nice. You mostly bracketed the "safety" benchmark datasets that seem to be more important in other contexts (end users mostly don't care, but governments, and consequently big corporations, do). Maybe you could go into some of those in a future video?

YUTPIA

What about the performance of the foundation model itself, that is, its ability to predict the next token in a large text corpus? It seems to me this metric is never mentioned.

luciengrondin
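
The metric the comment above is asking about does get reported for base models, usually as held-out cross-entropy loss or its exponential, perplexity. Here is a minimal Python sketch of computing it with the Hugging Face transformers library; the model name "gpt2" and the sample text are just placeholders, not anything from the video.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels == input_ids makes the model return the mean
    # cross-entropy of predicting each next token.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss).item()
print(f"cross-entropy: {loss.item():.3f}  perplexity: {perplexity:.1f}")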

Which website has all the 100+ benchmarks listed?

rajatajayakumar

Can you explain Midjourney?
Can you explain video generators?
Can you explain, plzz?

gametophacker

The leaderboard looks like a merging circus to me, because they all fine-tune to almost the same numbers and barely differentiate. It's all just random jitter on top of their initial scores.

Veptis

3:28 A 9/11 question in an LLM benchmark is crazy.

wuspoppin

If the LLM solves the tasks, then afterwards its creators know these tasks and can prepare it for the same questions.

That is, next time the creators of this LLM will tune it to be prepared for such questions.

It's like going to an exam, writing down the questions, and then solving them 100% the next time.

inout

Should mention the danger of training on test data, which is called 'benchmark poisoning'. A lot of recent models were removed from the HF leaderboard for that.

alexxx
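
As a rough illustration of how this kind of contamination is usually detected, here is a simple word n-gram overlap check between training documents and benchmark items; it is only a sketch under the assumption that a long shared n-gram means the item leaked into training. The function names, threshold, and toy data are all made up for illustration.

def ngrams(text, n=13):
    """Return the set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(training_docs, benchmark_items, n=13):
    """Flag benchmark items whose n-grams also appear in the training corpus."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & train_grams]

# Hypothetical usage with toy data and a short n-gram length.
train = ["blog post that quotes a benchmark item: the cat sat on the mat while the dog slept"]
bench = ["the cat sat on the mat while the dog slept",
         "an unrelated question about photosynthesis"]
print(contaminated(train, bench, n=5))  # flags only the first item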

5:23 "It's available in both English and Japanese and Chinese".

I.____.....__...__

TLDW:

Here is a summary of the key points from the video:

The video covers the 7 most popular benchmarks used to evaluate text-based large language models.

Massive Multitask Language Understanding (MMLU) - Multiple-choice questions testing knowledge across many domains. Models are scored by averaging accuracy per category.
AI2 Reasoning Challenge (ARC) - Multiple-choice questions testing scientific reasoning at a 3rd-9th grade level.
HellaSwag - Choose the most plausible continuation out of 4 sentences, 3 of which are adversarially generated wrong answers. Tests common sense.
Winogrande - Fill-in-the-blank problems with a binary choice, in the style of the Winograd Schema Challenge. Tests common-sense reasoning.
TruthfulQA - Answer questions correctly without generating conspiracy theories or false facts. Checks against spreading misinformation.
Grade School Math 8K (GSM8K) - Multi-step math word problems testing logic and arithmetic.
MT-Bench - A benchmark for fine-tuned chat models: 80 multi-turn conversational questions (160 turns in total) testing instruction following and conversational ability. Used for the chatbot leaderboard.

The benchmarks test different capabilities of language models and are used to rank them on public leaderboards like the Hugging Face Open LLM Leaderboard and Chatbot Arena.

felipevaldes
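
To make the MMLU-style scoring in the summary above concrete, here is a minimal Python sketch of macro-averaged multiple-choice accuracy: compute accuracy per subject category, then average the category scores. The "predict" callable and the example data are hypothetical placeholders, not anything from the video.

from collections import defaultdict

def macro_average_accuracy(examples, predict):
    """Score a multiple-choice benchmark MMLU-style.

    examples: dicts with 'category', 'question', 'choices', and 'answer' (index).
    predict:  callable(question, choices) -> predicted answer index.
    Returns per-category accuracy and the unweighted (macro) average.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = predict(ex["question"], ex["choices"])
        total[ex["category"]] += 1
        if pred == ex["answer"]:
            correct[ex["category"]] += 1

    per_category = {cat: correct[cat] / total[cat] for cat in total}
    macro = sum(per_category.values()) / len(per_category)
    return per_category, macro

# Hypothetical usage with a dummy "model" that always picks choice 0.
examples = [
    {"category": "high_school_physics", "question": "…", "choices": ["A", "B", "C", "D"], "answer": 0},
    {"category": "philosophy", "question": "…", "choices": ["A", "B", "C", "D"], "answer": 2},
]
per_cat, score = macro_average_accuracy(examples, predict=lambda q, c: 0)
print(per_cat, score)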

Imagine if the AI started outputting conspiracy theories like Jews digging tunnels in New York City :^)

Guedez

And a significant portion of these benchmarks is just wrong.
I think real human evaluation with an Elo system is the only real way to assess models. There is no ground truth for generative AI to be compared against.

shApYT
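
For context on the Elo idea mentioned above, here is a minimal Python sketch of the classic Elo update that arena-style human voting can use: after each head-to-head vote, the winner gains rating in proportion to how unexpected the win was. This is an illustration only; Chatbot Arena itself now reports ratings from a closely related Bradley-Terry fit over all votes.

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - e_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - e_a))
    return rating_a, rating_b

# Hypothetical vote: "model_a" beats the higher-rated "model_b" in a head-to-head chat.
ratings = {"model_a": 1000.0, "model_b": 1100.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], score_a=1.0)
print(ratings)  # the lower-rated winner gains more than it would against an equal opponent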

Benchmarks are dumb. LLMs are supposed to learn over time, so they should pass any test over time. It's like teaching a kid a lifetime of data on day one.

flareonspotify