Mission: Impossible language models – Paper Explained [ACL 2024 recording]

We recorded the poster presentation of the paper challenging Noam Chomsky's claim about LLMs!🫢 This paper, entitled “Mission: Impossible language models”, also won the ACL 2024 best paper award. We attended the ACL conference and can tell you what this is all about.

Outline:
00:00 Impossible Languages
01:14 Paper TLDR
03:07 Poster Presentation
07:38 Own Opinion

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, Michael, Sunny Dhiana, Andy Ma

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Join this channel to get access to perks:
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : Eastern euphoria - Patrick Patrikios
Comments

The authors did not test tokens occurring at random, but random word **shuffles** (sorry if this was unclear in the video; I thought I made it clear with the example in the bottom right corner at 02:09):
They took a sentence (a semantically grouped sequence of words) and shuffled the word order (as shown in Figure 1, e.g., at minute 02:32). The fact that these are just random shuffles, and not random words from the entire vocabulary, explains why the perplexity decreases during training at all (see the figure at 09:17). It's not just any word that could appear in the sentence, but rather the words of the sentence appearing in a random order. For example, if you shuffled "the cat sat on the mat," the LLM might have to complete something like "on mat cat". Here, "crypto" would be a much stranger continuation than "sat". This is why the LLM "learns" something, lowering perplexity during training, even though it plateaus fairly quickly.

The random word shuffles are the most extreme case. The authors also have simpler structures on the spectrum, such as the count-based grammars. Those are harder for humans to learn (some would say impossible, I'm not sure), and like humans, LLMs too show higher perplexity on those.
Then, they have reversed word order, which is still quite structured compared to random word shuffles, and again, LLMs have a harder time learning those and the perplexity plateaus are higher than for normal English.

In conclusion, LLMs (like humans) distinguish between natural, harder, and impossible languages: one can tell a naturally trained LLM apart from an impossible language model with just a perplexity classifier.

AICoffeeBreak
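To make the construction in the comment above concrete, here is a minimal Python sketch (not the authors' code; the function names are made up) of the two perturbations it mentions: shuffling a sentence's words versus reversing their order. Both keep the sentence's own vocabulary, which is why perplexity still decreases somewhat during training.

```python
import random

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Randomly reorder the words of one sentence (the most extreme perturbation)."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def reverse_words(sentence: str) -> str:
    """Reverse the word order (still highly structured, just not natural)."""
    return " ".join(reversed(sentence.split()))

sentence = "the cat sat on the mat"
print(shuffle_words(sentence))  # one random reordering of the same six words
print(reverse_words(sentence))  # "mat the on sat cat the"
```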

First of all, what an incredibly good poster presentation. Second, this finding is a testimony to how much optimization must have gone into human languages! And this happened naturally, through interactions between humans who had the goal of making communication efficient and effective!

Micetticat

I see a clear issue with the construction of an "impossible" language. The paper simply suggests that GPT-2 struggles more with learning something with reduced or randomized structure (randomization does not make something a language) compared to structured natural languages. However, this does not directly challenge Chomsky's claim.

In Chomsky's framework, an impossible language would be one that, despite having a clear and consistent structure, does not align with the deep abstract structure that underpins natural languages (and forms the basis of a universal grammar). The paper’s constructed languages, instead, merely reduce structure, which does not equate to being "impossible" in a meaningful linguistic sense.

An appropriate test of Chomsky’s claim would involve creating languages that are as structured as natural languages but whose rules deliberately contradict the deep structures inherent in natural language. If LLMs struggle more with these kinds of structured but fundamentally "unnatural" languages, it would provide stronger evidence regarding their limitations in learning impossible languages as defined by Chomsky. But given that neural networks are universal function approximators, there is no reason to believe that, given sufficient structure comparable to that of a natural language, LLMs would struggle to learn such an "impossible" language.

The paper does show that it is the inherent structure in a natural language that helps an LLM learn it better than something random, but we don't need to test that. That's quite straightforward, and that point is made sufficiently clear in the video :)

junaidali

Great summary and discussion. Thank you Letitia.
Also - some really great points raised here in the comments.

gettingdatasciencedone

Really cool paper idea! Thank you for covering it. Wonderful explanations as always!👏

cosmic_reef_

Just skimmed the paper, and it seems like they do the exact opposite of rejecting Chomsky's hypothesis. Languages like their "partial reverse" are completely implausible for a human language, but the LLM still learns them very well. The perplexity is higher, but that's to be expected of an inherently more complex pattern. The model still learns it with no issue, though. I don't know whether humans can't learn this language or whether it would just be harder for them. The token and word hop languages are also similar. The difference is shockingly small. Yes, it exists, but the model is clearly learning these languages just fine. The authors also fail to control for the Shannon entropy of their languages in any way, so we can't really say how much of this difference is due to priors in the models vs. inherently harder-to-predict sequences.

Honestly shocked this got an award. Statements like "transformer-based language models learn possible languages about as well as impossible ones" are just trivially true. The whole reason transformers are so popular is that we can throw any data (language, sound, images, video, RL environment observations) at them and the model will learn it quite well. Deep learning models are *universal* function approximators after all, not "natural language autoregression" function approximators or anything like that.

TheRyulord

This is just cheating, though, as it's not exactly testing what it ought to be testing. Word embeddings already carry a lot of implicit information about language structure, so of course LLMs are going to be better at learning a language whose embeddings were learned from that. What would be more interesting is a whole corpus of a language on which both the embeddings and the training of the LLM are done.

lolroflmaoization

I find it surprising; there are embedding-free models that learn byte-to-byte encodings that would look like an impossible language.

sadface

Not to sound argumentative... but what was even the point of it all? It's so obvious that the harder the pattern, the higher the loss, for a model of the same size trained for the same number of epochs.

ShivamGupta-qhgo

It almost seems like this could be argued backwards. If randomness is considered grammatical for a language L, then random output from a generative LLM is grammatical for L. So, impossible languages are learnable by these models.

hebozhe

The real comparison should indeed be with increasing the entropy equally using the proposed possible and impossible ways, which is interesting. Also, really in-depth research would include human learners.

kobikaicalev

I am impressed with this paper, although I probably would never have attempted this one in particular, because I don't think it is possible to define "impossible" the way Chomsky was using it in that phrase, i.e., he can always argue later that he didn't mean it the way they claim he meant it. Nevertheless, it does make total sense that LLMs would have more difficulty with language structures that are practically impossible, simply because they work on probability in order to train. Obviously, if the probability of meeting some nearly impossible word structure is very low, then there will be less chance for the LLM to recognize patterns that it can discern.
I would counter that with near-infinite training an LLM can identify anything with a pattern, and cannot identify anything that doesn't have a pattern... and let them stew on that for a while. heheh
My hat goes off to the authors... well done.

marcfruchtman

First time hearing the term surprisal. Julie defines it as the negative log probability of a token, which sounds like the standard cross-entropy loss to me. Is there a difference between loss and surprisal?

tijm
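On the surprisal question above: per-token surprisal and the per-token cross-entropy loss are indeed the same quantity; the loss reported during training is just the surprisal averaged over tokens, and perplexity is the exponential of that average. A tiny illustration with made-up probabilities:

```python
import math

# Model probabilities assigned to the observed tokens (made-up numbers).
token_probs = [0.5, 0.1, 0.25]

# Surprisal of each token: its negative log-probability (in nats).
surprisals = [-math.log(p) for p in token_probs]

# The usual language-modeling loss is the mean surprisal over tokens...
cross_entropy = sum(surprisals) / len(surprisals)   # ≈ 1.46

# ...and perplexity is its exponential.
perplexity = math.exp(cross_entropy)                # ≈ 4.31
```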

Chomsky: make offhanded asspull comment
Linguists: we must rigorously test this claim to figure out what Chomsky-sama meant by this and how much he was in tune with the universe's secrets

Copyright_Infringement

Do we have an evaluation of the difficulty of learning each human language, in terms of FLOPs needed to achieve a given perplexity?

yannickpezeu

They proved machines cannot do what humans cannot do. How does that prove that language is not unique to humans? It just proves that randomness is harder to learn for both humans and machines. In fact, with GPT-4, if you define the rules in the instructions, it can pretty much follow them like a native speaker.

msokokokokokok

Isn't the size of the vocabulary also one parameter of language complexity? Thank you for the great video!

yamenajjour

IMHO, this kind of discussion is only interesting to die-hard Chomsky followers. Chomsky is clearly trying to defend his own turf from being eroded by LLMs. For the great majority of us, it doesn't really matter whether LLMs truly understand language in the way that Chomsky defines it; we only care whether they are useful, and the answer is clearly yes. Trying to prove or disprove Chomsky's argument is not very useful.

kleemc

I haven’t read the paper yet, but it seems like the control language should have been, not natural English, but another scrambled version of English that added a similar amount of entropy as the test cases without violating Chomsky’s “merge” rule.

aboubenadhem

I honestly don't know how someone as intelligent as Chomsky made such unfounded claims. I suspect that he is quite disconnected from today's state-of-the-art neural networks.

brainxyz