o3 - wow

o3 isn't one of the biggest developments in AI in 2+ years because it beats a particular benchmark; it is because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I'll cover all the highlights, the benchmarks broken, and what comes next. Plus the costs OpenAI didn't want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.

MLC Paper:

00:00 - Introduction
01:19 - What is o3?
03:18 - FrontierMath
05:15 - o4, o5
06:03 - GPQA
06:24 - Coding, Codeforces + SWE-bench Verified, AlphaCode 2
08:13 - 1st Caveat
09:03 - Compositionality?
10:16 - SimpleBench?
13:11 - ARC-AGI, Chollet
20:25 - Safety Implications

Comments

An upside of a channel not using clickbaity titles is that when the title is as dramatic as just "wow", you can trust that the content really is unusually impressive and maybe unexpected!

silpheedTandy

We had "hold on to your papers"; now we have "adjust your timelines!"

MePeterNicholls

To think this is the worst the models will ever be. ChatGPT was only released 2 years ago, and now we are talking about it scoring among top experts in nearly every field. Insane.

Kenton.

No clickbait + proper factual coverage without overhype => subscribed

_Escaflowne_

While o3's benchmark results are impressive at first glance, we need to address some concerning aspects of this announcement. The $350K compute cost per test - which OpenAI initially tried to conceal - makes these achievements more of a laboratory curiosity than a practical breakthrough. Without independent verification, open-source code, or transparent methodology, we're essentially taking OpenAI's word for everything. The shift from their original 'open' mission to increasingly opaque corporate practices is troubling. Remember when AI announcements came with peer-reviewed papers and reproducible results? Instead, we're seeing a pattern of selective disclosure, hidden costs, and marketing hype. Let's demand more transparency and practical evidence before celebrating this as a revolutionary advancement in AI.

wawaldekidsfun

You know it's good when the SimpleBench guy is impressed with OpenAI

JamesJohnson-iqwb

imho we're overfitting our expectations to the utility of benchmarks. AGI doesn't need to be a true genius, or a genius in any number of fields, to be considered AGI. As a general intelligence, it "just" needs to be proficient at operating reliably and thoughtfully in the real world.

kyneticist

Genuinely, what is even happening anymore? We are heading towards the craziest and most important year in human history. What a time to be alive, wow

Anthemius-pn

You already know I was waiting for this video with my eye twitching

JonSmith-vb

if you want to summarize human existence, I guess it is the sentence "we really need to start focusing on safety..."

spanke

I don't know what's more impressive, o3 being announced so soon or the turnaround time of your coverage on it. Fantastic work.

Shaunmcdonogh-shaunsurfing

I'm a little concerned about the power of the majority being lost. The reason we have to get along and live with each other, in part, is that 1 strong ape can still be beaten by 2 weaker apes working together. But what if that one ape has 1000 autonomous drones to defend itself? What if it has robots to create food and entertainment for itself?
I'm not just concerned about what the AI will do if it is unsafe; I'm concerned with what some humans would do if they no longer need the rest of us.

dcgamer

The real benchmark is real-world examples. Remember: current benchmarks are like a laboratory testing ground. Yes, the questions asked might be real-world examples, but they will be written in a way that's clear and states actual objectives/goals. A sterile set of hard but clear questions.
The real world is different. If I ever get a well-written objective from a stakeholder to implement, it'll be a first. The actual world is nitty-gritty: full of nuances, filled with human error, and everything needs refinement first.
Therefore, my benchmark: read a typical agile development user story written by some key user or stakeholder and try to implement it in such a fashion that it'll pass testing and be production-ready within a certain time limit. If it can do that, I'll sit back down.

DrBreadstick

Damn... OpenAI crushing ARC was not on my bingo card for 2024 (or even 2025). o1 was an impressive jump in performance, but o3 proves that the performance jump was not even the real point; it's the completely paradigm-breaking ability to solve anything with an objectively correct answer. That feels like a profound change in potential, and I don't really know how I feel about it.

noone-ldpt

Thank you for sharing your time and work, Phillip. It's been a crazy year, man. Merry Christmas to you and your family and any elves who might be assisting you, cheers

williamjmccartan

You know, I find it absurd that "Open"AI and other proprietary AI labs get to benefit from open-source research and publications while offering, relatively speaking, little to nothing in return. It's an unfair advantage. It reminds me of the game "Werewolf" by Davidoff, where an informed minority almost always wins against the uninformed majority.

panzerofthelake

I've been waiting so hard for this video. What a turnaround time

booshong

As soon as that OpenAI video dropped, I came here and started hitting refresh.

owly

tbh i'm still pretty unconvinced by a lot of these benchmarks. they showed o1 being pretty smart too, but it really doesn't seem to be able to have a conversation about its own answer, or recognize a repeated mistake it's making over and over again, or adapt as the user's needs/requests change. makes it feel like the model is not really getting more intelligent, just better at specific processes. or maybe a better way to put it: compared to the ideal of AGI, it's still somewhat narrow intelligence, just with a lot more narrow intelligence across different domains. idk how you would do a benchmark that would quantify its ability to converse with a human in these subjects, correct its own mistakes, etc.; maybe that's just not measurable. but if it were, i suspect that's where it would become much more obvious that these models are not AGI and are not even particularly close to it. you have talked a little about this idea in the past i think, but not a ton from what i remember. would love to hear your thoughts.

full disclosure: i am only halfway through the vid, and i have not tried o3 myself. will update this comment if finishing the video changes what i think here majorly

matthewuzhere

Best video you've made so far, and turned around incredibly quickly. Thank you for being the signal in the noise!

surfingdiamond