Grok-2 Actually Out, But What If It Were 10,000x the Size?

Grok-2 is finally live online, but what does it change? No paper (so much for 'open') but what if it were scaled 10,000x, as a research paper from yesterday showed could happen by 2030? The key question seems to be whether LLMs are developing models of the world, and my new SIMPLE bench website could show a tentative test of this incipient ability. Plus, Ideogram 2, Flux, Hassabis Interviews and Real-time deepfakes.

Chapters:
00:00 Intro
00:40 Grok-2, Flux, Ideogram Workflow (02:30 Simple Bench)
04:36 Gemini ‘Reimagine’ and the Fake Internet
05:32 Personhood Credentials
06:09 Madhouse Creativity
08:00 Overhyped or Underhyped
09:27 Epoch research
10:30 Emergent World Mini-Models?

Hassabis Interview: Unreasonably Effective AI with Demis Hassabis

Comments

My man casually includes potentially demonetizing images that other AI channels were afraid of including, like it's just another Thursday AI video. You are unmatched in AI content on YouTube. Been a fan since the beginning, and we all appreciate your passion for it. Kudos.

noobicorn_gamer

What I like about Simple Bench is that it's ball-busting. Too many of the recent benchmarks start off at 75-80% on current models. A benchmark that got 80% last year and gets 90% now is not as interesting for these kinds of bleeding-edge discussions of progress. I like seeing benchmarks come out at 20% and climb to 40%, etc. That's where the leading edge is.

Ehrenoak

I think Demis Hassabis is completely right, though. Short term it is overhyped, but long term I don't think people care enough about it. I feel like a broken record on every one of your videos, but we really need to start preparing for an AGI world, and no one really seems to care. The disconnect is likely that current AI models are hyped as being close to AGI, and then, when they fall way short of that, everyone gets disappointed and stops caring. Yes, people need to have reasonable expectations of what models can do right now, but this tech is in its infancy. It's impossible to imagine where we'll be in 5 years.

danagosh

My god… your stuff is continually *SO* damn good! Amidst an ocean of BS vids on “AI news”, you offer real, actual, useful, intelligent content - again, and again, and again. Sometimes frustrated that weeks go by w/out a vid from your channel, but always refreshed by the quality of what you bring (especially vs the AI videos *made* by AI bots! 🤬) Thanks for the time you take and your commitment to quality 🙏 …it’s noticed and appreciated. (Now if only we could get the other 10,000 YouTube content providers to notice…!)

jdtransformation

7:27
It didn't imitate her voice, nor did it "scream 'NO!'", at least not in the way humans imagine and fear.
It just got confused: instead of being an AI assistant in dialogue with the user, it began predicting the next tokens, losing track of the fact that it IS in a dialogue and must wait for the user's further input after it stops talking.
And since this model tokenizes sounds as well, it is literally in its nature to "copy" any voice as it keeps predicting the next sound tokens.
We can play back other people's speech in our minds too, predicting future events, but we have limiters (I guess one could call it common sense) that keep us from actually voicing these "future predictions", and we can't physically talk in other people's voices or emit arbitrary sounds anyway.

alexeykulikov

I recommend viewing the original audio (e.g. in Audacity) as a spectrogram. It’s stereo audio, with the user on the left channel and the model on the right channel, so seeing both tracks on the spectrogram is helpful. It shows just how much background noise was on the user’s side. It’s also interesting because you can visualize the timbre of the woman’s voice (i.e. which frequencies are strongest), how it differs from the timbre of the synthesized male voice, and how the model’s timbre, once it changes, does look more like the woman’s.
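The per-channel inspection described here can be sketched in Python with NumPy/SciPy instead of Audacity. The original recording isn't available, so the audio below is a made-up stand-in: a noisy low tone for the user's (left) channel and a clean higher tone for the model's (right) channel; sample rate and tone frequencies are arbitrary choices.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16_000                       # sample rate in Hz (assumed)
t = np.arange(2 * fs) / fs        # two seconds of samples
rng = np.random.default_rng(0)

# Stand-in stereo: noisy 220 Hz "user" on the left, clean 440 Hz "model" on the right.
left = np.sin(2 * np.pi * 220 * t) + 0.3 * rng.standard_normal(t.size)
right = np.sin(2 * np.pi * 440 * t)

peaks = {}
for name, channel in [("left/user", left), ("right/model", right)]:
    f, _, Sxx = spectrogram(channel, fs=fs)
    # The bin with the most average energy approximates the strongest
    # frequency component ("timbre") on that channel.
    peaks[name] = f[Sxx.mean(axis=1).argmax()]
    print(f"{name}: dominant frequency ~{peaks[name]:.0f} Hz")
```

On real audio you would load the stereo file (e.g. with `scipy.io.wavfile.read`) and plot `Sxx` per channel rather than just taking the peak bin.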

Versions of Whisper that I’ve tried would often hallucinate tokens during silence (meaning an audio threshold filter would need to be applied first, to clip out non-speech). I could see how the background noise in the weird chat audio might likewise lead to spurious tokens being generated.

What would be great to see is: a user is having a chat with a bot, but their dog keeps yapping in the background and the user periodically needs to shush the pup, and it happens enough times that the bot fabricates its own dog yapping that it also must quiet down.

mshonle

Good luck with the SimpleBench thing, Philip; you are really one of the most qualified and best-positioned people to take the lead on an initiative like this! The general public (myself included) desperately needs a soothsayer such as yourself to help us interpret all these rapid changes, both now and in the future.

Dylan-zgjl

I think Ilya has made this point, but I agree with it. Intelligence is simply compression. Better compression is literally better prediction. In order to better predict, you must develop an abstract model because that is simply better compression. What is a law of physics, but a really good compression of information that allows you to predict better?
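The compression-equals-prediction point can be made concrete with a toy sketch (assuming Python; the text and the two models below are illustrative stand-ins, not anything from the video). Under a Shannon code, encoding a text costs -Σ log2 p(next symbol) bits, so a model that predicts each symbol better compresses the same text into fewer bits.

```python
import math
from collections import Counter

text = "abababababababab"

# Model 1: unigram -- ignores context, uses overall character frequencies.
counts = Counter(text)
unigram_bits = -sum(math.log2(counts[c] / len(text)) for c in text)

# Model 2: bigram -- predicts each character from the one before it,
# capturing the alternating structure of the text.
pairs = Counter(zip(text, text[1:]))

def bigram_p(prev, c):
    total = sum(v for (a, _), v in pairs.items() if a == prev)
    return pairs[(prev, c)] / total

bigram_bits = -math.log2(counts[text[0]] / len(text))  # first char: unigram cost
bigram_bits += -sum(math.log2(bigram_p(a, b)) for a, b in zip(text, text[1:]))

# The bigram model "understands" the alternation, so the same text
# costs far fewer bits: better prediction is literally better compression.
print(f"unigram: {unigram_bits:.1f} bits, bigram: {bigram_bits:.1f} bits")
```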

Steve-xhby

It seems to me that a benchmark guaranteed to be guarded so closely that it never appears in public datasets would be a very valuable asset in the not-so-distant future. Excellent move.

chrispenney

"I was casually reading this 63-page paper" is the perfect flex for this channel. 5:35

Jumpyfoot

Here is my take on it all ....LLMs can autonomously recognize patterns, relationships, and structures in data, allowing them to make accurate predictions and decisions. This suggests two significant insights. First, LLMs seem to be constructing some form of internal models of the world, a concept further supported by mechanistic interpretability research from Anthropic. Second, because of these models, LLMs exhibit a certain level of understanding.

Some argue that LLMs rely primarily on memory because they cannot generalize out of distribution. However, this likely isn't the case. When you introduce a novel topic into the context window, it functions as "working memory." Since the neural network itself isn’t altered, the LLM doesn’t truly comprehend the new information, making accurate pattern matching challenging.

This process parallels how the human brain works. Once the brain receives information about a topic or object, it continuously learns and updates its internal models of the world. With this updated understanding, it can apply prior knowledge to solve novel problems, leading to true generalization.

The four key takeaways are:

1. LLMs exhibit some form of understanding.
2. Reasoning cannot occur if the data is not part of the neural pattern.
3. The context window does not alter the model itself.
4. Continuous learning is essential for further advancement.

shawnvandever

Hey, I'm in this one too! Very excited by Simple Bench; as you know, logical reasoning is one of the two big things I care about. Speaking of which, I would absolutely love to see a Simple-Bench-Vision benchmark that tests visual reasoning and multi-image understanding.

Also, your prediction of GPT-5 arriving after November now seems certain!

trentondambrowitz

I've been using Grok 2.0 for a couple of days now and have been absolutely LOVING it. I really need to figure out just how much it is capable of. I've only really been playing with the image generator, and I think I've only scratched the very tippity-top of the surface of what it can do with images!

imjody

I was waiting for your new video to drop. You were the first to point out that the benchmarks were bad, and since I had some hours to kill, I did some research. For everyone: MMLU and other benchmarks work like this: Question. What is the answer? A, B, C, D. Next. I always thought this was somewhat wrong. So I picked out some questions that are obvious to me and modified them so that the questions stayed basically the same, but I did not provide A, B, C, D.

What I saw is that the results of these benchmarks are probably correct. But as soon as you modify the question so that any 5-year-old could tell me what I'm asking, the models started to fail miserably. Example: "Susan parked her car on the side of the building. Garbled text about Susan, like which pocket she put her mobile phone in." Basically the same HellaSwag question, but modified. Gemini, Claude, ChatGPT: all failed so badly it left me scratching my head. Why would LLMs score so high on these benchmarks?

And you can try this yourself: The farmer with a sheep had a boat. Where there was once a river, there is lava now. How can he cross? They all fall into "classic puzzle" mode. So what am I trying to say? I have a very mixed opinion. I don't know if scale will solve this; I really think we need something more. Right now it feels like it's *just* pattern matching all the way down. But I want to be persuaded, and the paper you showed will be on my Kobo (e-book reader) soon. (Even the Othello example does not convince me, though.)

(ugh, sorry for a wall of text)
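The probe described in that comment can be sketched as a prompt transformation (assuming Python; the question, options, and prompt wording below are made-up stand-ins, not actual MMLU or HellaSwag items): render the same item once with its A/B/C/D options and once open-ended, so a model cannot lean on answer-letter patterns.

```python
# Hypothetical benchmark item; the content is invented for illustration.
item = {
    "question": "Susan parked her car on the side of the building. "
                "Into which pocket did she put her phone?",
    "options": {"A": "left coat pocket", "B": "right coat pocket",
                "C": "trouser pocket", "D": "she left it in the car"},
}

def as_multiple_choice(item):
    # Standard benchmark framing: question plus lettered options.
    opts = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return f"{item['question']}\n{opts}\nAnswer with A, B, C, or D."

def as_open_ended(item):
    # Same question with the options stripped: no letters to pattern-match on.
    return f"{item['question']} Explain your answer in one sentence."

print(as_multiple_choice(item))
print("---")
print(as_open_ended(item))
```

Feeding both variants of each item to a model and comparing accuracy is the comparison the comment describes.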

alphaorg

Holy shit, I made it into one of your videos! I've been watching your channel since you started; thanks for featuring my vid!!

Billary

Congrats on building Simple Bench and popularizing it. Benchmarks is all you need, and that is one hell of a cool benchmark. Can't wait to learn more, especially about how you built the dataset, because we do need more and better benchmarks like this and ARC-AGI.

drhxa

Proud that your performance is recognized by those “up there”. :) Another calm spirit in attendance can't hurt.

fabp.

I just had a thought: voice AI that can copy your own voice so easily will be absolutely amazing for anyone who loses the ability to speak.
If you have one or two old 20-second clips of yourself speaking, or a single voice message, you can "regain" your voice.
Combine it with a neural chip, and in 30-40 years we will have the first people able to speak again just by thinking of saying something.

Slayerth

Awesome to hear your benchmark is getting recognised 👍 I would stress that before accepting help from those higher up, it might be worth considering their intentions. Having the questions known to these companies might quickly lead to contamination of the results, as the questions may become part of the training process.

ByteBound

I have been _yelling_ about zero knowledge proofs for years. They are absolutely required for the next phase of humanity, without exception.

gubzs