Are ChatBots their own death? | Training on Generated Data Makes Models Forget – Paper explained

If LLMs flood the Internet with content over the coming years, they will likely sign their own death certificate.
How likely is this to happen, and why is training on AI-generated content so bad?

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Edvard Grødem, Vignesh Valliappan, Mutual Information, Kshitij

📺 8 things to know about LLMs:

Outline:
00:00 Flooding the Internet with AI generated content
00:50 We’re running out of training data
01:42 What we are doing about it
02:51 Recursive training is bad
07:24 The Internet will be 90% AI generated
10:10 What to do about it

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research​

Music 🎵 : Illusions – Anno Domini Beats
Video editing: Nils Trost
Comments

My guess is that OpenAI themselves fear that recursion will lead to bad models and that this is the prime motivator for them to work on and implement watermarking. They try to disguise it as a regulation to protect the users and readers, but I bet it is to filter their future training data.

Jamaleum

Well, we're going to be very careful about selecting training data moving forward. Looks like a big problem!

rockapedra

I think the paper is not addressing the issue at all; it's only demonstrating some sort of overfitting.
I believe training a model recursively is bad when there is no new data, but if the model can generate new, better data, for example by using a tool, a chain of thought, or something else, then it can be a positive recursive loop instead of a negative one.
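A minimal sketch of that idea, assuming a hypothetical generate_candidate() standing in for a model call and a simple arithmetic checker as the external "tool" that decides what enters the next training round:

```python
# Sketch of a "positive" recursive loop: only keep generated samples that an
# external tool can verify. generate_candidate() is a hypothetical stand-in
# for an LLM call; the "tool" here is just Python re-computing the answer.
import random

def generate_candidate():
    # Hypothetical model output: an arithmetic question and a possibly wrong answer.
    a, b = random.randint(1, 99), random.randint(1, 99)
    noisy_answer = a + b + random.choice([0, 0, 0, 1, -1])  # the "model" sometimes errs
    return f"{a} + {b}", noisy_answer

def verified_by_tool(question, answer):
    # The external tool recomputes the result, filtering out wrong samples.
    return eval(question) == answer

new_training_data = []
for _ in range(1000):
    q, ans = generate_candidate()
    if verified_by_tool(q, ans):
        new_training_data.append((q, ans))

print(f"kept {len(new_training_data)} verified samples out of 1000")
```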

alexamand

Nice video, it's a good overview of the paper and the discussion.

a) Yup, I'm worried, not so much for the models as for the humans "trained" (learning) on this garbage content...

b) The paper is not at all exaggerated. It doesn't matter if 90% of the content is AI generated; what matters is whether 90% of the content *available for training* is. And with companies suing each other and most content behind walls (social networks are ever more closed), we are getting there fast. I guess we'll have to keep models frozen on old data...

irisdominguez

I think this is great research, but it shows not a flaw of AI-generated content as such, but a flaw in current models, and the toy example shows this best. We clearly see that data distributions shift and degrade after multiple rounds of retraining on synthetic data. What we should do is use this as a benchmark: when a new architecture is proposed, beyond all other standard benchmarks, show how robust it is to such retraining and how many iterations it takes to destroy the original distribution. If this number is big enough, it might indicate that the proposed architecture is worth using even if it is not the best on other tasks.
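A rough sketch of what such a robustness benchmark could look like, using the paper's toy 1-D Gaussian setting; the sample size and KL threshold below are arbitrary illustrative choices:

```python
# Repeatedly refit a Gaussian on its own samples and count how many iterations
# it takes before the estimate drifts far from the original distribution.
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    # KL( N(mu0, s0^2) || N(mu1, s1^2) )
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                 # original ("human") distribution
mu_t, sigma_t = mu, sigma
n_samples, threshold = 200, 0.5      # assumed benchmark parameters

for iteration in range(1, 101):
    data = rng.normal(mu_t, sigma_t, n_samples)   # synthetic data from the current model
    mu_t, sigma_t = data.mean(), data.std()       # refit the "model" on its own output
    if kl_gauss(mu, sigma, mu_t, sigma_t) > threshold:
        print(f"original distribution lost after {iteration} iterations")
        break
else:
    print("distribution survived 100 iterations")
```

The iteration count at which the divergence crosses the threshold would then be the benchmark score: higher means more robust to recursive retraining.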

volotat

Thank you so much for an important and understandable overview. I appreciate you helping us go into the future with our eyes open.

theosalmon

My intuition is that you don't need to assume 90% of the internet will be AI generated. You only need to assume that the models will need more data than can be scraped from humans. AI-generated data could be intentionally added by the devs even if it is not on the internet.
My feeling is that recursive training won't end up being a big problem, and that some trick, like adding in human data, will be enough to avoid forgetting the original distribution.
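A toy sketch of that trick in the same 1-D Gaussian setting as the paper's toy example: each retraining round mixes a fixed fraction of the original human data back in. The 10% ratio is an arbitrary assumption:

```python
# Mix a fixed share of original ("human") data into every retraining round
# so the estimate stays anchored to the original distribution.
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(0.0, 1.0, 10_000)        # stands in for the original human corpus
mu_t, sigma_t = human_data.mean(), human_data.std()
n_per_round, human_fraction = 10_000, 0.1        # assumed mixing ratio
n_human = int(human_fraction * n_per_round)

for _ in range(50):
    synthetic = rng.normal(mu_t, sigma_t, n_per_round - n_human)   # model-generated data
    anchor = rng.choice(human_data, n_human, replace=False)        # human data mixed back in
    mixed = np.concatenate([synthetic, anchor])
    mu_t, sigma_t = mixed.mean(), mixed.std()                      # refit on the mixture

print(f"after 50 rounds: mean={mu_t:.3f}, std={sigma_t:.3f} (original: 0.0, 1.0)")
```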

fejfo

Chatbots get dumber by learning from each other, whereas humans get smarter that way. Therefore, the LLM learning process is still fundamentally wrong.

tildarusso

I suspect this is a specific issue with the autoencoders they used. What we need is a good validator and a diversity metric; this would prevent mode collapse. My hypothesis is that the higher-dimensional the generated content is, the easier it will be to validate. Ilya Sutskever made a recent comment claiming that knowing the distribution is enough to give you exact samples as the sample dimensionality increases. (Dimensionality in this case, I guess, could be sentence/document length or picture resolution.) Think of it as a discount model for test grading: the more criteria you have to take points off for, the lower the average score will be.
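A very simple sketch of what a validator plus diversity metric could look like for text, using distinct-n (the share of unique n-grams) as the diversity metric; the acceptance rule is just an illustrative choice:

```python
# Accept a generated text into the training pool only if it is not a duplicate
# (validator) and does not reduce the pool's n-gram diversity (diversity metric).
def distinct_n(texts, n=2):
    ngrams, total = set(), 0
    for t in texts:
        tokens = t.split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0

pool = ["the cat sat on the mat"]
candidates = [
    "the cat sat on the mat",        # exact duplicate -> rejected by the validator
    "a dog slept under the table",   # adds diversity -> accepted
    "the cat sat on the rug",        # near-duplicate -> lowers diversity, rejected
]

for c in candidates:
    if c in pool:
        continue                                   # validator: drop exact duplicates
    if distinct_n(pool + [c]) >= distinct_n(pool):
        pool.append(c)                             # keep it only if diversity does not drop

print(pool)
```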

brad

The internet has been flooded with AI-generated content since 2016, by the way. It's not recursive training for LLMs that I'm worried about; it's recursive training for the new generations of humans that worries me more. Those kids are literally being trained on AI-generated content too, and the older generation is using GPTs so much these days that they are forgetting how to be creative on their own, so the original high-quality content you talk about that will be produced by humans (and there is barely any) is also influenced by AIs.

uprobo

I don't think having ML models learn from each other inherently leads to degradation. But that hinges on the core understanding of the model. If you have a teacher and a student, you can successfully transfer knowledge. But if you have students with incomplete understanding teaching other students with incomplete information, you get the classic game of telephone and the data will degrade over time. So yes, if you blindly teach ML models from every source of data on the Internet and you have students filling that data up with garbage, you will train worse models.
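The "game of telephone" intuition is easy to simulate: copy a text over and over with a small per-character error rate and watch it degrade. A toy sketch, where the error rate is an arbitrary choice:

```python
# Toy telephone game: each generation copies the previous text with a small
# per-character error rate, so the original information degrades over time.
import random
import string

random.seed(0)

def noisy_copy(text, error_rate=0.02):
    alphabet = string.ascii_lowercase + " "
    return "".join(random.choice(alphabet) if random.random() < error_rate else c
                   for c in text)

message = "the quick brown fox jumps over the lazy dog"
for generation in range(30):
    message = noisy_copy(message)

print(message)  # after 30 noisy generations the sentence is noticeably corrupted
```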

There is PLENTY of data out there (and anyone who argues with that lacks imagination). The Internet was just low-hanging fruit to jump-start intelligence in these models. The future will be more about refining high-quality datasets and finding out how to train with smaller and smaller sets. It will also be about multi-modality. There is an infinite supply of data from the real world (audio/video/touch/etc.) that can be trained on.

ML models are inherently lossy; the reason we write things down is that over thousands of years stories drift, and human brains are lossy as well and will lose information without rigorous systems. Finally, I also think ML can spiral in the other direction. Purely speculation, but I believe that once ML models reach a certain level of intelligence (which may also include less lossy memory storage to augment their NNs), they will be able to teach themselves or each other and spiral towards the singularity. I just think we're not quite past the tipping point yet, although GPT-4 feels very close.

jolieriskin

By my estimate, the number of human authors will not substantially decrease:
1) authors wish to publish;
2) humanity will feed back on itself by weeding out stupid, obvious, or wrong answers.
We are at the beginning of a cycle.

The key part will not only be to tag AI-generated content, which seems more a wish than something feasible for money reasons,
but to tag human-made content.
A way to do this is by rewarding human authors, like the Brave browser model.
Maybe a dream, but when LLM vendors lack good-quality data they may adopt this approach. The first one to do it at scale will become the leader, because the outcome will be considered a better source of conversation and hence reach a wider audience.

gordoneldest

The link to the model dementia paper is broken :)

flamboyanta

Very interesting! Thanks Letitia! By the way, I've always been highly sceptical of synthetic data, and this "Chinese whispers" way of framing the problem really hits home, I think.

TimScarfe

Learning from human feedback is even more important now!

Skinishh

I think one does not need a PhD to logically deduce that mistakes made by something or someone will be propagated until noticed and corrected.

_bustion_

Do you think current datasets that have less AI generated data will become gold?

Skinishh

GPT-3 unrestricted API access: Nov 2021
ChatGPT training data up to: Sept 2021
🤔🤔🤔
I wonder why OpenAI hasn't updated ChatGPT to include any data newer than when GPT-3 was released…
😂 what a mystery


The year 2021 will be known for the data singularity, where new written work can no longer be distinguished from AI-generated text.

arbybc

The results are really not surprising, but at the same time, I don't really see the issue. It's just a matter of proper data engineering. Also: who trains a model without any quality control and benchmarks? If the model gets worse by your metrics, then just go back to the data engineering step. In practice, models will never get worse in the long term. But of course, it might get harder and harder to train better models. Maybe. Maybe not, if we're building self-improving models which are able to run the entire MLOps training pipeline on their own.
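A minimal sketch of that gating step; train() and evaluate() are hypothetical placeholders for a real training run and a fixed, human-curated benchmark:

```python
# Only promote a newly trained model if it does not regress on the benchmark;
# otherwise keep the old model and revisit the data engineering step.
def train(dataset):
    return {"trained_on": len(dataset)}      # placeholder for the actual training run

def evaluate(model, benchmark):
    return 0.0                               # placeholder for a human-curated benchmark score

def maybe_promote(current_model, current_score, dataset, benchmark):
    candidate = train(dataset)
    score = evaluate(candidate, benchmark)
    if score >= current_score:
        return candidate, score              # no regression: ship the new model
    return current_model, current_score      # regression: go back to data engineering
```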

Neomadra

Ohnononono...
Singularitybros, we got too cocky...

YUTPIA