Generative Models Can Outperform The Experts That Train Them

Transcendence: Generative Models Can Outperform The Experts That Train Them

Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!

Discuss this stuff with other Tunadorks on Discord

All my other links
Comments

I find this paper incredibly misleading. The researchers even admit that the reason training on 1500-Elo data doesn't result in 2000-Elo play is that 1000-Elo players are just "noisy" 1500-Elo players. In other words, the 1000-Elo chess players have that rating in part because they make major blunders at a high frequency. The low-temperature model effectively smooths out those major errors, so it simply blunders less. This is why it isn't repeatable at higher Elo and isn't a sign of any generalization beyond chess.

iansotir
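The smoothing mechanism this comment describes, where independent blunders average away under low-temperature sampling, can be illustrated with a toy simulation (the move space, blunder rate, and numbers here are my own illustrative assumptions, not from the paper):

```python
import random
from collections import Counter

random.seed(0)

BEST_MOVE = 0
MOVES = list(range(5))   # toy action space: move 0 is objectively best
P_CORRECT = 0.6          # a "noisy expert" plays the best move 60% of the time

def noisy_expert_move():
    """One expert's move: the best move with prob P_CORRECT, else a uniform blunder."""
    if random.random() < P_CORRECT:
        return BEST_MOVE
    return random.choice([m for m in MOVES if m != BEST_MOVE])

# Pool many independent expert moves and take the modal (argmax) move --
# the analogue of sampling the trained model at low temperature.
counts = Counter(noisy_expert_move() for _ in range(10_000))
consensus = counts.most_common(1)[0][0]

print(consensus == BEST_MOVE)       # True: the ensemble recovers the best move
print(counts[BEST_MOVE] / 10_000)   # each expert is only right ~60% of the time
```

Even though every individual "expert" blunders 40% of the time, the aggregated distribution puts most of its mass on the best move, so the argmax filters the blunders out; crucially, this only works when the blunders are independent and roughly uniform, which is exactly the assumption the comment is attacking at higher Elo.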

They rediscovered Parrondo's Paradox. Interesting to see it applied in this context.

BradleyKieser

Great find, great paper, and great walkthrough! I have a feeling there's more to be found in this direction of research. Well worth following this thread in the coming months. Thanks!

BrianMosleyUK

In business there is the second-mover advantage: you can copy the moves of the first movers while avoiding their mistakes to outcompete them. But as a market matures, the mistakes become less common and more similar, so this advantage is lost.

It would be interesting to see future work train a model to exploit this phenomenon by identifying weak moves and avoiding them, to show that you can always match or beat the best trajectories trained on. I would bet, though, that exploration is required to discover better strategies when performing at the highest level.

_paixi

I kinda have a guess for why 1500-trained models have difficulty transcending to 2000, but the low-rated ones can.

I am a noob chess player myself, so I'm kinda saying this from experience. Human chess players make a bad move when they do not know what they are doing; they just try an educated guess for their next move. Since these educated guesses can be treated as a stochastic process that cancels out when averaged over an ensemble, you can recover the moves of players who do know what they are doing.

This does not apply at 1500 and above, because those players do need to know specific things, and it relies more on qualities within the player themselves. To find the best move at 1500 and above (I am 1300-1400), you need to be able to notice tactical patterns. This ability to notice things is more of an experience problem than a knowledge problem. 1500 players have a general knowledge of chess, but I don't think most of them have enough years of play to notice tactical patterns easily.

To summarize, a knowledge problem can be solved through transcendence, since solid knowledge remains after guessed knowledge is averaged out. Experience is not the same thing. You can imagine each player as carrying a total playing time in their head, and most 1500s have clearly less time played than those in the 2000s (prodigies are exceptions, but they are a small part of the population). Averaging does not help when most of the 1500 experts in your data lack the experience to notice and calculate the best moves. The best you can get is a very small part of the population who do find the best move, plus clusters of majority moves played by less experienced players.

jakeaustria

So what this would be good at is pointing out mistakes, not really knowing chess. But I think pointing out mistakes is an important tool for something, eventually.

minecraftermad

Mic quality makes a world of difference; idk, it sounds way less chaotic.

spencerfunk

Really good stuff… but they seriously need to try this with a more sophisticated approach to chess. Only having vectors of moves with no board knowledge… that's a great starting point. I was hoping they'd do more after that, though.

goodtothinkwith

This seems like good evidence that Chollet is wrong

goodtothinkwith

I can actively imagine this technique being “the thing” that puts us on the path toward achieving super intelligence.

spencerfunk

Hello, Tunadorables. I kinda stochastically stumbled onto your channel through my chaotic meandering haha. I did not expect to find a channel that reviews recent research papers and condenses them for laypeople like me.

I will be going to college soon and I chose Statistics as my course. I first considered Computer Science, but after placing second to last, and later in the top half, at the NOI, I kinda realized I might not be that good at it. So I just chose the field where I'm more skilled. Watching this channel kinda rekindled my desire to learn more CS. Anyways, the two fields are pretty close. I am also learning statistics because I want to make discrete neural networks that run on binary values rather than floating point. I tried many methods for this, including passing floating-point input data through the inverse Gaussian transform to convert it to uniform, and then binarizing that uniform variable. I then used Bayesian statistics to predict the output bit from the past input bits. I encountered many problems, including overfitting and the generally poor quality of predictions from discrete models. I haven't given up yet; I tried a Voronoi-inspired approach using the Hamming distance and some weights. Anyways, your channel is a godsend for me.

On the topic of transcendence, I read this as a wisdom-of-the-crowds kind of thing. What I am surprised about is that increasing randomness actually decreases transcendence. I kinda expected it to be unimodal, with an optimal temperature where transcendence is maximized, just like the Kelly criterion (optimal fraction of capital). Kinda surprising and a bit disappointing, but amazing nonetheless.

jakeaustria

I always wondered if models like these learn to play as well as the best players (since that provides the most logical/consistent signal) and then nerf themselves to emulate lower-level players. Seems like that might very well be the case! It makes you wonder how good some models might actually be if we somehow removed the 'self-sabotage' mechanisms from within them.

marinepower

oi, audio's good.. damn that's a cool paper 👍

GNARGNARHEAD

The impressive-sounding name of this paper doesn't seem, to me, necessarily justified by the importance of the results?
Like, the idea makes sense: if different experts make different random errors (or make errors on different random subsets of inputs),
then by averaging over them and taking what is most likely, you can partially filter out that random error.
This doesn't seem surprising?
Like, is there a real reason to use "transcend" over "exceed" other than it sounding cool? Not that that means they shouldn't have called it what they did. I don't know. I don't know what the norms should be.

Actually, the part that sounds more surprising to me is that they proved(?!) that "if you just train on a randomly selected expert, without altering the temperature, then you can't possibly do better than the best expert in the training set."

Surely that should depend on the (implicit) prior over functions that is used when learning the behavior from the experts?

Maybe it relates to “assuming infinitely much training data from each expert”?

If the only thing being changed is the distribution over points at test time, but the set of points present is the same, then I guess the "it has to just be the average of what the experts say" makes sense. Can one give a version of that with temperature, then? I think so...

If you have a probability distribution over a discrete set, say over the natural numbers:
the Boltzmann distribution is p_i = e^{z_i / T} / (\sum_j e^{z_j / T}).
If we set T = 1 and take z_i = \ln(p_i), that reproduces the distribution (p_i)_i,
and then changing the temperature to T gives
q_i = e^{\ln(p_i) / T} / (\sum_j e^{\ln(p_j) / T})
= (p_i)^{1/T} / (\sum_j (p_j)^{1/T}).
(This assumes none of the p_i are zero, but after cancelling the exp and the log that's no longer a problem for T > 0, so whatever.)
Ok.
Yes, you can alter the temperature of a distribution regardless of where you got it.

drdca
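The temperature-rescaling identity sketched in the comment above, q_i ∝ p_i^{1/T}, is easy to check numerically. A minimal sketch (the toy probabilities are my own illustration):

```python
def retemper(probs, T):
    """Rescale a categorical distribution to temperature T via q_i ∝ p_i^(1/T).

    T < 1 sharpens the distribution (low-temperature sampling), T > 1
    flattens it toward uniform, and T = 1 leaves it unchanged.
    Assumes all probabilities are strictly positive.
    """
    weights = [p ** (1.0 / T) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# A toy "move distribution": the best move is most likely, but weaker
# moves still carry probability mass.
p = [0.5, 0.3, 0.2]

sharp = retemper(p, T=0.5)   # low temperature: mass concentrates on the argmax
flat = retemper(p, T=10.0)   # high temperature: approaches uniform

print(sharp[0] > p[0])                           # True: the top move gained mass
print(max(flat) - min(flat) < max(p) - min(p))   # True: the distribution is flatter
```

As the comment says, nothing here depends on where the distribution came from; any strictly positive categorical distribution can be retempered this way after the fact.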

Ensembles have always been known to outperform individual models. Nothing new here. LLMs are essentially ensembles.

pensiveintrovert

Well, what did you expect, making a gigantic compute node dedicated to next-word prediction? It starts to predict words better than you do! Lol

JaredFarrer

Unless you're an idiot like me and you combine them all, make a bunch of your own, and mix and match them at random!

jonnylukejs

Tesla engineers:
Oops, let's clean all the data of women parking a vehicle from the dataset ... 😛

Woke activists begin protesting at the company's door ...

Tesla PR dude dressed as a girl: "The model is agnostic ... the dataset required diversity as a necessary condition to transcend ..." "... No gender identity bias in the system ..."

firstnamesurname

This is an "if bad players wouldn't blunder, they would be less bad players" result. Duh. Despite all the theoretical setup, there is one fundamental assumption here that seems flawed: that mistakes are inherently random and can be smoothed out. Any halfway decent player can tell you this is not the case; as a better player, you can strategically exploit the typical mistakes of a 1500-Elo chess player. So it makes sense that the model would include these mistakes in its function fitting.
The core model of how mistakes happen is flawed. An even more obvious way to put it: they effectively assume that playing a 1500-Elo player is like playing Stockfish with enough random moves injected to bring its evaluation down to 1500 Elo. Anyone can immediately tell you this is obviously not true.

lexer_

Good video, explained very well, but the underlying concept is pretty simple:

That's how markets work: the average of all knowledge is better than any individual's (wisdom of the crowds).

ckq