have benchmarks been the problem the whole time?!?

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!

Discuss this stuff with other Tunadorks on Discord

All my other links
Comments

INCITE is based on the Pythia architecture (2.8B and 6.9B variants, specifically, hence "INCITE 3B" and "INCITE 7B"). It was released by Together as the first model series trained on the RedPajama v1 dataset (an open-source attempt at replicating LLaMA v1's data mixture and size, i.e. 1T tokens). The full model name is "RedPajama INCITE [3B/7B]" as seen on HF.

As a basis for comparison, Pythia models were trained on The Pile (300B tokens).
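In case anyone wants to check the shared-architecture claim directly, here is a minimal sketch (Python, Hugging Face transformers) that pulls the model configs and compares them. The RedPajama-INCITE repo IDs below are my best guess and should be verified on the Hub before use.

```python
# Sketch: compare Pythia and RedPajama-INCITE configs to see whether they share
# the GPT-NeoX (Pythia) architecture. The INCITE repo IDs are assumptions.
from transformers import AutoConfig

repos = {
    "Pythia 2.8B": "EleutherAI/pythia-2.8b",
    "INCITE 3B (assumed ID)": "togethercomputer/RedPajama-INCITE-Base-3B-v1",
    "Pythia 6.9B": "EleutherAI/pythia-6.9b",
    "INCITE 7B (assumed ID)": "togethercomputer/RedPajama-INCITE-7B-Base",
}

for name, repo in repos.items():
    cfg = AutoConfig.from_pretrained(repo)
    # model_type should read "gpt_neox" for all four if the claim holds
    print(f"{name}: type={cfg.model_type}, "
          f"layers={cfg.num_hidden_layers}, hidden={cfg.hidden_size}")
```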

stereoplegic

If the model's outputs are not truly reflective of well-behaved statistical distributions, then evaluating models based on output probabilities alone may not provide a complete picture of their knowledge and reasoning processes.
The sensitivity of benchmark performance to arbitrary fluctuations in output probabilities may be even more problematic if those probabilities do not have a clear and consistent interpretation.
Designing better benchmarks may require going beyond just comparing output distributions, and instead probing the model's internal representations and decision-making processes more directly.
The discrepancy between smooth scaling laws and emergent abilities could potentially be explained by the intractable nature of the underlying functions, rather than just issues with benchmark design.

That said, even if output probabilities are not perfect proxies for the true underlying functions, they still provide a useful signal that can be informative about model behavior and performance. The key points about the limitations of narrow benchmark comparisons and the potential benefits of evaluating models over their full vocabulary remain valid, even if output probabilities are not a complete representation of the model's knowledge. Nevertheless, the intractable nature of the underlying functions is an additional factor that should be considered. Future work on model evaluation and benchmarking should ideally seek to combine insights from output distributions with other approaches that can probe the model's internal workings more directly. Integrating these perspectives may lead to a more complete understanding of how and why model behavior varies with scale.
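To make the "full vocabulary vs. narrow comparison" point concrete, here is a minimal sketch (Python, PyTorch + Hugging Face transformers) that contrasts renormalizing probability over only the listed answer options with reading those options off the model's full next-token distribution. The prompt, answer options, and model ID are hypothetical examples, not anything from the paper.

```python
# Sketch: two ways of scoring one multiple-choice item.
# (a) renormalize probability over only the listed options (the narrow view)
# (b) look at where those options sit in the full-vocabulary distribution
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-2.8b"  # example only; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Q: What is 2 + 2?\nA:"
options = [" 4", " 5", " 22", " banana"]  # hypothetical answer choices

with torch.no_grad():
    input_ids = tok(prompt, return_tensors="pt").input_ids
    logits = model(input_ids).logits[0, -1]      # next-token logits
    full_probs = torch.softmax(logits, dim=-1)   # full-vocabulary distribution

# score only the first token of each option to keep the sketch simple
option_ids = [tok(o, add_special_tokens=False).input_ids[0] for o in options]
option_probs = full_probs[option_ids]

narrow = option_probs / option_probs.sum()  # (a) renormalized over options
covered = option_probs.sum().item()         # (b) mass the options even capture

for o, p_narrow, p_full in zip(options, narrow.tolist(), option_probs.tolist()):
    print(f"{o!r}: renormalized={p_narrow:.3f}  full-vocab={p_full:.5f}")
print(f"Total mass covered by the listed options: {covered:.5f}")
```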

dr.mikeybee

I need to read the paper on my own, but does this mean one can manipulate the benchmark results? Or, let's say, not manipulate, but that the benchmark results may differ under the existing method if you alter the other 3 false choices? Like, if it is a math question and as wrong answers I provide apple, banana, and orange, will I increase the 'performance' of the LLM?
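That is essentially the sensitivity the paper is pointing at: when the score only compares the listed options, swapping in implausible distractors can flip an item from wrong to right. A toy illustration with made-up probabilities (no real model, purely hypothetical numbers):

```python
# Toy illustration: the same hypothetical model probabilities look "better" or
# "worse" depending on which wrong options are offered, because the scoring
# rule only compares the listed choices.
hypothetical_probs = {                      # pretend next-token probabilities
    "4": 0.10, "5": 0.15, "3": 0.12,        # plausible numeric answers
    "apple": 1e-6, "banana": 1e-6, "orange": 1e-6,  # implausible distractors
}

def multiple_choice_correct(correct, distractors, probs):
    """Item counts as correct if the right option outscores every distractor."""
    return all(probs[correct] > probs[d] for d in distractors)

# Hard distractors: other numbers the model also finds plausible.
print(multiple_choice_correct("4", ["5", "3"], hypothetical_probs))                  # False
# Easy distractors: fruit names the model assigns ~zero probability.
print(multiple_choice_correct("4", ["apple", "banana", "orange"], hypothetical_probs))  # True
```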

cancelebi

Wow, emergent behavior. Emergence is when a complex entity has properties or behaviors that its parts do not have on their own, like a living cell being able to do more than the molecules it is made of can do by themselves.

Anders

Maybe multiple-choice questions are just not a good way to test knowledge, be it human or machine...

junhochai

This is an interesting paper. There are a lot of "scale is all you need" people. Scale has been important but not decisive: we have 70B models way better than 500B models trained two years ago.

sadface

You couldn't script your own speech to follow the text? Also, your breakdown of this makes your understanding of the material apparent.

magnadox

They suddenly show up, like water turning to ice. It's a phase transition.

StephenRayner

Bro, pro tip: your mouth gets dry when you talk too long, and it ruins the recording with 'meat gulping' sounds. Drink water, retain viewers.

ickorling