BREAKING: What's Inside GPT-5? New Report



⬇️Resources mentioned in this video⬇️
⭐️ Dr Alan D. Thompson's YouTube channel: @DrAlanDThompson

In this video, I share the latest report on GPT-5 by @DrAlanDThompson, investigating OpenAI's groundbreaking AI model, built using over 27 datasets and distilled from two petabytes of data into 70 trillion tokens. With more than 70% synthetic data, GPT-5 marks a pivotal shift in AI development, raising ethical concerns about bias and propaganda. Drawing from detailed investigations and reports like What's in my AI?, the video covers GPT-5's features, its massive data foundation, and its implications for the future of AI.

Timestamps:
00:00 Introduction to the Report on GPT-5
00:22 The Importance of Data in AI
00:40 Alan Thompson's Insights
00:57 Synthetic Data in GPT-5
01:48 Comparison with GPT-4
02:57 The Role of Synthetic Data
02:59 Challenges with Synthetic Data
05:29 Future of AI and Synthetic Data
06:29 Companies Generating Synthetic Data
07:57 Ethical Concerns of Synthetic Data
10:24 Future of GPT-5 and Synthetic Data
11:05 Conclusion and Final Thoughts
Comments

I think "accuracy" of the Data, whether it is natural or synthetic, is far more important than how much Data you have. One accurate Formula beats a million inaccurate Formulas in the same Domain.

picksalot

The reason why 70% of the data is synthetic is obvious: it's data they didn't have the rights to use verbatim, so they used an LLM to summarise or rewrite it in a way that makes it impossible to get sued for using it.

JimmyTurboMods

Synthetic data is sure to kill the model.

AnuragShrivastav-

So the prophecy will come true. They have created a cannibal going: "Give me data, I need data."

junaidmuhammed

It's not fake, it's synthetic. That's like saying that when you write things from your brain in your notebook, they're fake.

adolphgracius

It doesn't matter whether the data is synthetic or not; what matters is its quality. Only a human can truly check the quality with a degree of certainty, so the big question is not whether the data is synthetic but whether it has been verified and validated by a human.

pary

Thank you for parsing this information and presenting it so clearly. Glad you are part of the few still making an active effort to add quality information to the Internet.

leifpryor

If biases can be removed from the data and, with that, LLMs can excel, then synthetic data will be much better than the data we currently have. According to ChatGPT: based on these estimates, approximately 60-80% of the information available on the internet may be influenced by some type of bias, whether conscious, unconscious, intentional, or derived from structural factors such as algorithms or economic pressures.

Lifeisnotfaironearth

People don't want a tool that isn't doing its job or that is making decisions for them. Ideological limitations and using synthetic data are two of the biggest reasons why people avoid tools like GPT and why models like sus-column-r become so popular so fast despite their obvious downsides.

SS

Through user interactions they do fine-tune and tweak the data over time, and the internet always has more data, so it can be updated.

gamingguruoz

We would've been much better off if all this hadn't been delayed for half a century.

dadsonworldwide

This makes sense to me. Most LLMs are trained in one epoch: they only train on each chunk of data one time. Going through most of the data multiple times tends to result in over-fitting. However, if you get an LLM to translate or summarize a bunch of human data in ten different ways, you can likely get the equivalent of 10 epochs' worth of training without running into the over-fitting problem. Presumably LLMs are already being used to help curate the original human data too.

However, this is still technically synthetic data. One can further enhance it by doing stuff like whatever Strawberry is doing: likely critiquing the synthetic data multiple times and improving it. Maybe you can use a slow, accurate LLM to train a faster one. Using large slow LLMs to fine-tune small fast LLMs for specific use cases works well; problems happen when you get stupid LLMs to train other stupid LLMs.
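The "10 epochs via rewriting" idea in the comment above can be sketched in a few lines. This is a toy illustration only: the `SYNONYMS` table and `paraphrase_variants` function are invented stand-ins, and in a real pipeline the rewriting step would be done by a strong LLM rather than word substitution.

```python
# Toy sketch: produce several rewrite-style variants of each human-written
# document, so a model effectively sees the same content multiple times
# without training on identical text (which risks over-fitting).
# A trivial synonym substitution stands in for the LLM rewriter here.

SYNONYMS = {
    "big": ["large", "huge", "sizable"],
    "fast": ["quick", "rapid", "speedy"],
}

def paraphrase_variants(doc: str, k: int = 3) -> list[str]:
    """Return up to k distinct rewrites of doc (toy stand-in for an LLM rewriter)."""
    variants = []
    for i in range(k):
        words = []
        for w in doc.split():
            subs = SYNONYMS.get(w.lower())
            words.append(subs[i % len(subs)] if subs else w)
        variants.append(" ".join(words))
    # Keep only rewrites that actually differ from the original (dedup, ordered).
    seen, out = {doc}, []
    for v in variants:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

corpus = ["the big model trains fast", "data quality matters"]
# Original document plus its variants: roughly (1 + k) "epochs" of that content.
augmented = [v for doc in corpus for v in [doc] + paraphrase_variants(doc)]
```

Note that the second document yields no variants at all, which mirrors the real constraint: rewriting only multiplies training signal where the rewriter can genuinely vary the surface form.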

nathanbanks

But not all synthetic data is the same; some is specifically tailored for a training set, which can be better than natural diagnostic data because the scientists know the scenario at work.

stevenkellysillick

Even human-generated data is terrible if we're talking about ethics, politics, and areas of science where checking the results and the progress of research is very hard. It is much easier with engineering. Studies are flawed, there's no unbiased way to "curate", say, Reddit and forum discussions and decide what is logical/correct, and now we'll get "synthetic data". IMO it is unlikely they'll use these models for code generation, engineering, and similar tasks; it would become apparent very quickly what a bunch of nonsense that is. What they may end up doing is use, say, a mixture of experts or similar to switch on the fly to GPT-4 or specialized models for engineering/coding tasks, then use the synthetic nonsense for fiction and propaganda. Maybe it could work for things like medicine too, because a good chunk of it is "theories" anyway, which are basically speculation and educated guesses based on chemistry and our very limited understanding of human physiology.

denissorn

I think that with the recent large release trends, we have not yet reached diminishing returns with large data. We must go as far as we can.

alby

Hello Goda, I enjoy your channel very much. At almost 65 years old, I've seen many technology transformations: from rotary telephones to FaceTime on an iPhone, from "don't talk to strangers" to hopping into a stranger's car with Uber. LOL. I'm fascinated with AI, perhaps more than other people in my age group, and if I'm understanding your argument correctly: with all the electrical power requirements, expensive computer chips for faster processing, etc., very few companies (NVIDIA, META, GOOGLE, etc.) will have "control" of the original raw human data, which, as you say, they have already used to produce regenerated "synthetic" data in GPT-5. This equates to the saying "garbage in, garbage out". I believe this synthetic data is useless. Even with ChatGPT-4, we should not blindly accept whatever it spits out as absolute truth, especially critical data/information that many people will make life-altering decisions with. I'm not a programmer or engineer, but my gut tells me that with all these different models churning/mining this data, it will become more diluted as time goes by. Looking forward to your thoughts and those of your community on this topic. Thank you for sharing.

Estebanserrano

The more things change, the more they stay the same. In this case.... garbage in/garbage out.

tonysilva

If OpenAI is training its model on its own output, that would mean it is being trained on hallucinations. What good is that?

gregsLyrics

How does Alan know this? Is he working at OpenAI, or is this speculation? If the latter, then relying on it as a data analyst is a no-go.

PriNovaFX

Synthetic data sounds a lot like it's just making things up.

NativeEnglishAdvantage