7 Popular LLM Benchmarks Explained [OpenLLM Leaderboard & Chatbot Arena]


In this video, I go through and explain the benchmarks used by Chatbot Arena and the Open LLM Leaderboard. These are general benchmarks for text-based LLMs, so HumanEval is not covered here. Let me know which other benchmarks you want me to explain in the future!

This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi

[Video Editor] Silas

0:00 Intro
0:57 MMLU
1:41 ARC
2:10 HELLASWAG
2:57 WinoGrande
3:27 TruthfulQA
3:52 GSM8K
4:26 MT-Bench
5:05 Outro


Comments

What do you think about the discovered unreliability of the MMLU?

Yottenburgen

Another benchmark, unfortunately also named ARC, is the "Abstraction and Reasoning Corpus". The best models solve 30% of the tasks.

simonstrandgaard

Sometimes 1 point makes a big difference! Like when Fallout New Vegas got an 84 instead of an 85 on Metacritic and thus Bethesda denied them a cash bonus...

arkurianstormblade

I'd suggest mentioning HumanEval, HumanEval+ and TACO to cover the current programming benchmarks.

theresalwaysanotherway

Sorry, I've been enjoying your videos for a few months and hadn't even subscribed. Now subscribed, with notifications on. Keep doing awesome videos, my dude.

Synthetiks

3:28 dam, this makes gpt 3 seem like a gigachad.

zyxwvutsrqponmlkh

So what benchmarks are available for stable diffusion? And how long are we going to have Bad Hands?

luislozano

Nice. You mostly bracketed the "safety" benchmark datasets that seem to be more important in other contexts (end users mostly don't care, but governments and consequently big corporations do). Maybe you could go into some of those in a future video?

YUTPIA

What about performance at the foundational model? That is, the ability to predict the next token in a large text corpus? It seems to me this metric is never mentioned.

luciengrondin

Which website has all the 100+ benchmarks listed?

rajatajayakumar

Can u explain midjourney?
Can u explain video generator?
Can u explain plzz

gametophacker

The leaderboard looks like a merging circus to me, because they finetune on almost the same number and barely differentiate. It's all just random jiggle based on their initial numbers.

Veptis

3:28 a 9/11 question in an LLM benchmark is crazy

wuspoppin

If an LLM solves the tasks, its creators then know those tasks and can prepare for the same questions.

That is, next time the creators will tune the LLM to be ready for such questions.

It's like going to an exam, writing down the questions, and solving them 100% the next time.

inout

Should mention the danger of training on test data, which is called 'benchmark poisoning'. A lot of recent models were removed from the HF leaderboard for that.

alexxx

5:23 "It's available in both English and Japanese and Chinese".

I.____.....__...__

TLDW:

Here is a summary of the key points from the document:

The document discusses the 7 most popular benchmarks used to evaluate text-based large language models. These benchmarks are used to rank models on leaderboards and test different capabilities.

Massive Multitask Language Understanding (MMLU) - Multiple-choice questions testing knowledge across many domains. Scores models by averaging performance per category.
AI2 Reasoning Challenge (ARC) - Multiple-choice questions testing reasoning abilities at a 3rd-9th grade level. Focuses on scientific reasoning.
HellaSwag (HSWAG) - Choose the most plausible continuation out of 4 sentences, with 3 being adversarially generated wrong answers. Tests common sense.
WinoGrande - Fill-in-the-blank problems with a binary choice, modeled on the Winograd Schema Challenge. Tests common sense reasoning.
TruthfulQA - Answer questions correctly without generating conspiracy theories or false facts. Checks against spreading misinformation.
Grade School Math (GSM8K) - Multi-step math word problems testing logic and math capabilities.
MT-Bench - Benchmark for fine-tuned chat models with 80 two-turn conversational questions to test instruction following and conversational abilities. Used for chatbot leaderboards.

The benchmarks test different capabilities of language models and are used to rank them on public leaderboards like the Open LLM Leaderboard and Chatbot Arena.

felipevaldes
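
The summary above says MMLU scores a model by averaging its performance per category. A minimal Python sketch of that kind of per-category (macro) average is shown here; the function name, the tuple layout, and the sample data are all hypothetical and chosen for illustration, not taken from the video or from any official evaluation harness.

```python
from collections import defaultdict


def macro_average_accuracy(results):
    """Average per-category accuracy over multiple-choice results.

    `results` is an iterable of (category, predicted, gold) tuples.
    This layout is an assumption made for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in results:
        total[category] += 1
        if predicted == gold:
            correct[category] += 1
    if not total:
        raise ValueError("no results to score")
    # Accuracy inside each category, then an unweighted mean across categories.
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(per_category.values()) / len(per_category)
    return overall, per_category


# Made-up answers: two physics questions (one right), one law question (right).
sample = [
    ("physics", "A", "A"),
    ("physics", "B", "C"),
    ("law", "D", "D"),
]
overall, by_category = macro_average_accuracy(sample)
print(overall, by_category)
```

Averaging per category first means a small category counts as much as a large one, which differs from plain accuracy over all questions pooled together.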

Imagine if the AI started outputting conspiracy theories like jews digging tunnels in NY city :^)

Guedez

And a significant portion of these benchmarks are just wrong.
I think real human evaluation with an Elo system is the only real way to assess models. There is no ground truth for generative AI to be compared to.

shApYT
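
The comment above mentions rating models with an Elo system based on human votes. As a rough sketch, the standard Elo update after one head-to-head comparison can be written as follows; the K-factor of 32, the starting rating of 1000, and the function names are assumptions for illustration, and this is not presented as the exact procedure any particular leaderboard uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is the update step size (an assumed value here).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical models start level; a human voter prefers model A.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

Because the update depends only on pairwise preferences, it needs no ground-truth answer, which is the property the comment points to.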

Benchmarks are dumb. LLMs are supposed to learn over time, so they should pass any test over time. It's like teaching a kid a lifetime of data on day one.

flareonspotify
