Information over-squashing in language tasks


Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!

Discuss this stuff with other Tunadorks on Discord

All my other links
Comments

I have been thinking about tool use vs. internally represented math capabilities a lot lately. I think there is something to be said for being able to do the math internally, which may enable a greater amount of "thinking" to be done per token generated.

I'd be very interested in seeing research on embedding/freezing multiple small grokked models for different tasks (arithmetic up to N digits, sorting algorithms, etc.) into the layers of a much larger LLM with the same architecture, and then training all of the LLM's weights except those embedded grokked models.

Seems like that would allow connections to form that could utilize those embedded weights for their intended tasks within the middle layers.

So much more research to do in this area.
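
A rough PyTorch sketch of that setup (my own illustration; the module names and sizes are made up): freeze the embedded "grokked" module and hand only the remaining parameters to the optimizer.

```python
import torch
import torch.nn as nn

dim = 64

# Stand-in for a small model already trained to convergence on, say, arithmetic.
grokked_adder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
for p in grokked_adder.parameters():
    p.requires_grad = False          # freeze the embedded grokked module

big_model = nn.Sequential(
    nn.Linear(dim, dim),             # trainable "outer" layers
    grokked_adder,                   # frozen, task-specific sub-network
    nn.Linear(dim, dim),
)

# Only the non-frozen weights reach the optimizer.
trainable = [p for p in big_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```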

AdamBrusselback

One problem is that many seem hellbent on believing in "general intelligence" rather than seeing neural networks (including humans!) for what they are: a compressed function of the data they have seen. Some abilities will only emerge at much higher scale. We just aren't there yet. I think Yann LeCun has it right: current models are barely at cat level of representational ability.

BooleanDisorder

Thank you for continuing to go over this information, very appreciated. =]

ColinTimmins

I think an inability to count is representative of a much larger issue: being poor at learning/developing internal recursive algorithms (e.g. for symbol manipulation).

We use and develop such recursive algorithms for so much in our day-to-day lives, and even more so when trying to do science, engineering, planning or maths. Counting is a pretty trivial algorithm that only really requires 20 or so update rules (1+1=2, 1+2=3, a+b=b+a, 9+1=0 carry 1, etc.).

The fact that these systems with poor counting clearly don't discover or follow such an algorithm when asked indicates that something is off. Maybe LLMs need more hidden tokens to 'think' with during training, where the thinking tokens aren't counted towards the training loss?
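
A concrete version of the "20 or so update rules" point (my sketch, not from the video or paper): grade-school addition needs only the single-digit sum facts plus one carry rule, applied digit by digit from the least significant end.

```python
def add_digit_strings(a: str, b: str) -> str:
    result, carry = [], 0
    for da, db in zip(reversed(a.zfill(len(b))), reversed(b.zfill(len(a)))):
        s = int(da) + int(db) + carry          # single-digit facts (1+1=2, ..., 9+9=18)
        result.append(str(s % 10))             # e.g. 9+1 -> write 0 ...
        carry = s // 10                        # ... and carry 1
    if carry:
        result.append(str(carry))
    return "".join(reversed(result))

print(add_digit_strings("999", "27"))  # 1026
```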

hamishpain

Thank you for the video. This paper did a good job explaining the actual issue (over-squashing), which is caused simply by the fact that later tokens are not included in earlier nodes, so their information gets progressively squashed... and, in combination, floating-point representations cannot distinguish between very small differences in value... very interesting.
I am not sure I can agree that their explanation (recency bias) is adequate for the U-shape, however, since they are clearly saying the mechanism makes the probability of a correct solution decrease the further a token sits from the first token. Recency bias doesn't seem strong enough to overpower the over-squashing, so I would like to see more evidence of that.
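
A toy numerical illustration of the floating-point half of that (my own sketch, not the paper's experiment): mean-pooling a long run of ones, with or without a single trailing zero, gives values that low-precision floats can no longer tell apart.

```python
import torch

# Compare mean(1^n) with mean(1^n followed by a single 0), rounded to bfloat16.
for n in [10, 100, 1_000, 10_000]:
    ones = torch.ones(n)
    with_zero = torch.cat([ones, torch.zeros(1)])
    m1 = ones.mean().to(torch.bfloat16)        # exactly 1.0
    m2 = with_zero.mean().to(torch.bfloat16)   # n / (n + 1), then rounded
    print(f"n={n:>6}  {m1.item():.6f}  {m2.item():.6f}  indistinguishable={bool(m1 == m2)}")
```

For large enough n, the appended zero leaves no trace in the rounded value, even though the exact means differ.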

marcfruchtman

5:35: on this point, you can think of the token embeddings as the node feature vectors and the attention mask as the adjacency matrix.
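
A minimal sketch of that graph view (my illustration, not from the video): under causal attention, the 0/1 mask is literally the adjacency matrix of a directed graph over token positions, and one attention layer is message passing over it.

```python
import torch

seq_len = 5
# Lower-triangular causal mask: position i may attend to positions 0..i.
adjacency = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int64))
print(adjacency)
# Row i lists the "neighbours" (earlier tokens) position i can pull information
# from; stacking layers corresponds to longer walks on this graph.
```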

qwrxdqn

I'd think tokenization could be a solve here: mechanisms for representing infinitesimals, and more generally for describing the various characteristics of numbers 🤔

GNARGNARHEAD

I suspect many issues might arise from having built most current neural-network architectures around high parallelism and "using 100% of the brain all the time". Once they figure out a way to give more "juice" to specialized regions of the "brain" depending on the task, and stop wasting processing power on calculations that eventually get multiplied by zero just to emulate flow control, we might start seeing AIs that don't sacrifice specialized niche abilities and get confused less often by polysemantic neurons.

PS: Yeah, I know "humans only use 10% of the brain at a time" is kind of a myth, but we still have regions of the brain that become more or less active depending on the task, as can be verified with fMRI, PET, etc.; levels of sugar, oxygen consumption, CO₂ production, etc. measure at different values in different areas depending on the task.
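
One existing direction along these lines is conditional computation, e.g. mixture-of-experts routing, where only a chosen "region" runs for each token. A rough sketch under that assumption (module names and sizes are illustrative, not from the video):

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):  # x: (tokens, dim)
        expert_idx = self.gate(x).argmax(dim=-1)      # pick one "region" per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i
            if sel.any():                             # only the chosen expert runs
                out[sel] = expert(x[sel])
        return out

x = torch.randn(8, 64)
print(Top1MoE()(x).shape)  # torch.Size([8, 64])
```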

tiagotiagot

Humans do inference while they are learning, and are finely tuned to elements that disagree with what the inference would predict (i.e. "surprise!"). I don't think we will get better algorithms until we do that kind of hybrid.

splunge

While this was interesting, I think the paper didn't go far enough to make a big impact. All of these problems are known, and the authors neither presented scaling estimates (for when they become dominant) nor provided a potential mitigation. One obvious area they could have looked into was empirically verifying (and extrapolating) Proposition 5.2, which would suggest that there is an UPPER limit to how big an LLM can scale, which would have a huge impact... The limit as L --> infty isn't useful, but if this is related to d or d_k and hits a Pareto limit at, say, L = 240, then that's important.
Meanwhile, the issue is likely coming from how attention is computed in a flat (fine-grained) manner (which they alluded to as the softmax issue). That does not necessarily mean that the information random walk is the contributing factor, but that the sequence length itself is. Using a smarter encoding strategy would mitigate the observed effect as well. For example, [1]*20 + [0] could be encoded as '1', 20, '0', 1 (i.e. 4 tokens) rather than 21 tokens, with a mechanism similar to CoPE (at the attention level). If you can reduce the number of relevant tokens to consider during attention, then you can also minimize the numerical underflow issue, although coming up with a solution that is also differentiable is non-trivial.
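
A minimal sketch of the run-length idea in the example above (my own illustration; CoPE itself works at the position-encoding level rather than by literally merging tokens):

```python
from itertools import groupby

# Collapse repeated tokens into (token, count) pairs before attention.
def run_length_encode(tokens):
    return [(tok, sum(1 for _ in group)) for tok, group in groupby(tokens)]

seq = ["1"] * 20 + ["0"]
pairs = run_length_encode(seq)
print(pairs)                                         # [('1', 20), ('0', 1)]
print(len(seq), "tokens reduced to", 2 * len(pairs), "symbols")
```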

hjups

All the computers in the world use around two thousand gigawatts while calculating a zettaflop of operations; 81 quadrillion with only 200 gigawatts.

lagrangianomodeloestandar

8 billion brains would use only around 162 gigawatts (162 gigawatt-hours per hour) out of the 200 million terawatt-hours that humanity spends worldwide...
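
For reference, the 162 GW figure is consistent with the common estimate of roughly 20 W of power per human brain (the per-brain wattage here is an assumption, not from the comment):

```python
# Quick sanity check: 8 billion brains at ~20 W each.
brains = 8e9
watts_per_brain = 20.25   # assumed; 8e9 * 20.25 W = 1.62e11 W
print(brains * watts_per_brain / 1e9, "GW")   # 162.0 GW
```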

lagrangianomodeloestandar

He added this comment below as well: But the most relevant thing would be, if you know about the emergence of properties, or evolution, or exponential growth... The problem with that infinite precision is that it really comes from the laws of digital silicon hardware. Other materials, that is, not only valves and simple circuits that merely manage electrons, could perhaps surpass simulation built from simple electrical components (transistors, capacitors, etc.). If all the programs or things that we ask of this system are not natural properties of its hardware, then practically everything could be said to be an emergent property rather than a real one: simple properties are put together to copy or approximate complex ones. They exist because they emerge faster if a property is directly close to the underlying ones (for example, polynomials are obtained from monomials, or hydrocarbons from carbon and hydrogen), or they are only approximations (a Taylor series may be required to approximate the sine or cosine with polynomials). If transistors act as polynomials, simple finite binary (or perhaps ternary) Boolean functions, while properties such as photonics and entropy exist in reality, why spend money on such sophisticated simulators when the properties they simulate can be executed by processes that reality runs natively? Then you are not spending the resources of one property to simulate another, because there is already an efficient way to execute it in reality. My idea, then, would be to distill from hardware: buy all the GPUs or hardware you need, and treat real-world artificial intelligence not as a static circuit but as something that reincarnates again and again, with the aim of moving to more efficient materials. If 1 billion parameters require 1 exaflop, use 1000 exaflops to distill it into a piece of metal, plastic, silicon, boron or any material, with a 3D printer or some method to compress it. That would be my idea: invest much more in compressing a digital model into a different process, or in translating the calculations to processes that can be carried out with greater precision than digital circuits and that are better suited to the real world. Organic life arguably uses such a premise: DNA and RNA are pure intelligences, or higher functions with capabilities similar to these current systems, but in the real world. Language models should become a kind of DNA, hardware instead of a piece of software, to be efficient and self-scalable. With all those GPUs that simulate 3D models, for example Nvidia's, it does not occur to them to train a model to create a self-integrated neural network on a surface or volume and test it in reality with very few resources...

lagrangianomodeloestandar

An idea I have had, at this moment and continuously, comes from various schemes of hierarchies of intelligence, complexity, etc., which may be as accurate as they are probably also wrong: "Data < Information < Knowledge < Wisdom". The first order would be the data: facts, particles, names. With millions of them together, emergent processes arise. Millions of books do not have to form a process, nor do millions of grains of sand, but paradoxically, when moving shelves of books, transcribing them, or molding the sand, a lot of data transformations are carried out. A process, then, would be a quantitative leap that produces a qualitative leap in complexity: a million atoms can sit motionless together, or they can be a process generating new data or states at every moment. If processes are what information originates from, then the next level is that, from many types of information, the qualitative leap is pure, implicit, weak intelligence, and so on. These would not necessarily be direct steps either: as I said, a lot of data can jump to processes, but those may only be dependent subprocesses. Temporary calculator data would be just sets of data, but could be considered processes within the whole process; I would call them subprocesses, dependent processes, or simple data. Below a complete pure or weak intelligence as a higher function there are sets of functions or simple processes which, since there are several, would be concepts. The concepts would then be the lower-order processes, or neuronal synapses, and the engrams would be "weak sub-intelligences", complex processes, or intelligences coupled to the complete one. So, from my observation, on this ramp with perhaps infinitely many intermediate shades of gray, current hardware is approaching the third step. The intelligent hardware of the future will achieve the fourth step intrinsically, because the intelligence will be the hardware and not the software, and general intelligence would then be a set of many simple intelligences, an efficient distillation of all the levels below that lets it access and execute all of them, supported by weak intelligences, many processes, and a ridiculous amount of data. (Consider that simulating the brain might require the entire world's computing, some 4 zettaflops; its internal processes no more than 1,000 billion synapses; its intelligences a few tens or hundreds of capabilities; and consciousness and our greatest abilities are practically a single general ability that encodes very little information...)

lagrangianomodeloestandar