have benchmarks been the problem the whole time?!?

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!

Discuss this stuff with other Tunadorks on Discord

All my other links
Comments

INCITE is based on the Pythia architecture (2.8B and 6.9B variants, specifically, hence "INCITE 3B" and "INCITE 7B"). It was released by Together as the first model series trained on the RedPajama v1 dataset (an open-source attempt at replicating LLaMA v1's data mixture and size, i.e. 1T tokens). The full model name is "RedPajama INCITE [3B/7B]" as seen on HF.

As a basis for comparison, Pythia models were trained on The Pile (300B tokens).
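In case anyone wants to check the shared-architecture claim directly, here is a minimal sketch (Python, Hugging Face transformers) that pulls the model configs and compares them. The RedPajama-INCITE repo IDs below are my best guess and should be verified on the Hub before use.

```python
# Sketch: compare Pythia and RedPajama-INCITE configs to see whether they share
# the GPT-NeoX (Pythia) architecture. The INCITE repo IDs are assumptions.
from transformers import AutoConfig

repos = {
    "Pythia 2.8B": "EleutherAI/pythia-2.8b",
    "INCITE 3B (assumed ID)": "togethercomputer/RedPajama-INCITE-Base-3B-v1",
    "Pythia 6.9B": "EleutherAI/pythia-6.9b",
    "INCITE 7B (assumed ID)": "togethercomputer/RedPajama-INCITE-7B-Base",
}

for name, repo in repos.items():
    cfg = AutoConfig.from_pretrained(repo)
    # model_type should read "gpt_neox" for all four if the claim holds
    print(f"{name}: type={cfg.model_type}, "
          f"layers={cfg.num_hidden_layers}, hidden={cfg.hidden_size}")
```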

stereoplegic

If the model's outputs are not truly reflective of well-behaved statistical distributions, then evaluating models based on output probabilities alone may not provide a complete picture of their knowledge and reasoning processes.
The sensitivity of benchmark performance to arbitrary fluctuations in output probabilities may be even more problematic if those probabilities do not have a clear and consistent interpretation.
Designing better benchmarks may require going beyond just comparing output distributions, and instead probing the model's internal representations and decision-making processes more directly.
The discrepancy between smooth scaling laws and emergent abilities could potentially be explained by the intractable nature of the underlying functions, rather than just issues with benchmark design.

That said, even if output probabilities are not perfect proxies for the true underlying functions, they still provide a useful signal that can be informative about model behavior and performance. The key points about the limitations of narrow benchmark comparisons and the potential benefits of evaluating models over their full vocabulary remain valid, even if output probabilities are not a complete representation of the model's knowledge. Nevertheless, the intractable nature of the underlying functions is an additional factor that should be considered. Future work on model evaluation and benchmarking should ideally seek to combine insights from output distributions with other approaches that can probe the model's internal workings more directly. Integrating these perspectives may lead to a more complete understanding of how and why model behavior varies with scale.
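To make the "full vocabulary vs. narrow comparison" point concrete, here is a minimal sketch (Python, PyTorch + Hugging Face transformers) that contrasts renormalizing probability over only the listed answer options with reading those options off the model's full next-token distribution. The prompt, answer options, and model ID are hypothetical examples, not anything from the paper.

```python
# Sketch: two ways of scoring one multiple-choice item.
# (a) renormalize probability over only the listed options (the narrow view)
# (b) look at where those options sit in the full-vocabulary distribution
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-2.8b"  # example only; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Q: What is 2 + 2?\nA:"
options = [" 4", " 5", " 22", " banana"]  # hypothetical answer choices

with torch.no_grad():
    input_ids = tok(prompt, return_tensors="pt").input_ids
    logits = model(input_ids).logits[0, -1]      # next-token logits
    full_probs = torch.softmax(logits, dim=-1)   # full-vocabulary distribution

# score only the first token of each option to keep the sketch simple
option_ids = [tok(o, add_special_tokens=False).input_ids[0] for o in options]
option_probs = full_probs[option_ids]

narrow = option_probs / option_probs.sum()  # (a) renormalized over options
covered = option_probs.sum().item()         # (b) mass the options even capture

for o, p_narrow, p_full in zip(options, narrow.tolist(), option_probs.tolist()):
    print(f"{o!r}: renormalized={p_narrow:.3f}  full-vocab={p_full:.5f}")
print(f"Total mass covered by the listed options: {covered:.5f}")
```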

dr.mikeybee

I need to read the paper on my own, but does this mean one can manipulate the benchmark results? Or, let's say, not manipulate, but that the benchmark results may differ under the existing method if you alter the other 3 false choices? Like, if it is a math question and as wrong answers I provide apple, banana, and orange, will I increase the 'performance' of the LLM?
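That is essentially the sensitivity the paper is pointing at: when the score only compares the listed options, swapping in implausible distractors can flip an item from wrong to right. A toy illustration with made-up probabilities (no real model, purely hypothetical numbers):

```python
# Toy illustration: the same hypothetical model probabilities look "better" or
# "worse" depending on which wrong options are offered, because the scoring
# rule only compares the listed choices.
hypothetical_probs = {                      # pretend next-token probabilities
    "4": 0.10, "5": 0.15, "3": 0.12,        # plausible numeric answers
    "apple": 1e-6, "banana": 1e-6, "orange": 1e-6,  # implausible distractors
}

def multiple_choice_correct(correct, distractors, probs):
    """Item counts as correct if the right option outscores every distractor."""
    return all(probs[correct] > probs[d] for d in distractors)

# Hard distractors: other numbers the model also finds plausible.
print(multiple_choice_correct("4", ["5", "3"], hypothetical_probs))                  # False
# Easy distractors: fruit names the model assigns ~zero probability.
print(multiple_choice_correct("4", ["apple", "banana", "orange"], hypothetical_probs))  # True
```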

cancelebi

Wow, emergent behavior. Emergence is when a complex entity has properties or behaviors that its parts do not have on their own, like a living cell being able to do more than the molecules it is made of can do by themselves.

Anders

Maybe multiple-choice questions are just not a good way to test knowledge, be it human or machine...

junhochai

This is an interesting paper. There are a lot of "scale is all you need" people. Scale has been important but not decisive: we have 70B models way better than 500B models trained two years ago.

sadface

You couldn't script your own speech to follow the text? Also, your breakdown of this makes your understanding of the material apparent.

magnadox

They suddenly show up, like water turning to ice. It's a phase transition.

StephenRayner

Bro, pro tip: your mouth gets dry when you talk too long, and it ruins the recording with 'meat gulping' sounds. Drink water, retain viewers.

ickorling