Grok-1 FULLY TESTED - Fascinating Results!


Let's test Grok using our LLM rubric! How does it compare to other models?

Chapters:
0:00 - Intro
0:51 - Testing

Comments

I'm just a truck driver, but I find this stuff amazing. I love everything open source, even if it's not working perfectly. Yet!

MuckoMan

What the heck? Wasn't the point of the shirts drying example that the reasoning should figure out the possibility of drying 20 shirts in parallel and that it would still take 5 hours?

labradore

The problem I see is that as you keep repeating the same test, the model is increasingly likely to have been trained on it, since the test will have appeared somewhere online. You need to change the tests and run them against all the LLMs available at the time in order to compare them; otherwise later models have a distinct advantage.

cbnewham

I don’t think this is equivalent to testing the open source Grok. The open version is apparently not fine-tuned at all, but the version available through Twitter is clearly fine-tuned for chat.

lachland

It would be really cool to have a score count on the screen as you test the models; that would help with tracking the scores.

MxAi

Regarding the Snake game, I gave Google's Gemini a similar simple task and its code wouldn't even build.
At least the snake moved.

dsuess

*In case anyone was wondering how Grok compares to Claude 3 Sonnet in that spatial/physical reasoning test, Claude nails it, although my prompt was somewhat more precisely worded and less ambiguous. Even the mid-tier Claude 3 is pretty freaking great:*

To reason through this step-by-step:

1) A person places a coffee cup upright on a table.

2) They drop a marble into the cup, so the marble is now inside the cup.

3) They quickly turn the cup upside down on the table. This means the open end of the cup is now facing down towards the table surface.

4) When the cup was turned over, the marble would have fallen out of the cup and onto the table surface, since there is nothing to contain it inside the upside-down cup.

5) The person then picks up the upside-down cup, without changing its orientation. So the cup remains upside-down.

6) They place the upside-down cup into the microwave.

Therefore, based on the sequence of steps, the marble would have fallen out of the cup when it was turned over, and would be left behind on the table surface. The marble is not in the microwave with the upside-down cup.

cacogenicist

Great job, really enjoyed this fast Grok overview!

SODKGB

Isn't the answer to the killers question 4? The one that got killed is still in the room, so there are 4.

Xardasflynn

In the part where you ask about the word count, I think the 2 invisible ones are the <START> and <END> tags.

alxy-dev

The digging question is a bit silly. What are we digging in? Sand: a wider hole can make better use of more people. Clay: you might dig a very narrow hole, but then you won't be able to throw the clay out easily, so extra people might be especially valuable, since you can hand them the clay instead of having to crawl in and out of the hole. Are we digging as fast as possible? Then more people can rotate, which makes it easier to use more people. What tools do we use? Shovels? Excavators? Just our bare hands? With excavators, just one is probably as fast as five. Depending on all the potential factors here, one person can work at a higher, lower, or about the same level of efficiency as five people.

frankjohannessen

Matt, I consume a lot, and I mean a lot, of "AI" content, podcasts, etc. You are by far one of my favorites. What you do here isn't easy, but you do it very well. Thank you!

CYBONIX

Matthew, no one on this planet dries T-shirts serially, one after the other, over 16 hours.
I don't think it should count as a pass if the model divides the drying time by the number of T-shirts,
because the main reason for this test is to check whether the LLM has an understanding of the real world.

BTW, a third answer (somewhat more realistic, but still uncommon) would be to dry in batches. If your drying space can only handle 5 shirts at once, it would take four batches.

Great test though! Thank you.

MrSuntask
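
The three answers discussed in the comments above (fully parallel, batches, and one shirt at a time) can be compared with a small sketch. The 4-hour drying time per batch is an assumption inferred from the "16h" and "four batches" figures in the comment, and `drying_time` is a hypothetical helper, not something shown in the video.

```python
import math


def drying_time(shirts: int, capacity: int, hours_per_batch: float) -> float:
    """Total hours to dry `shirts` when at most `capacity` dry at the same time.

    Every batch is assumed to take the same `hours_per_batch`, no matter
    how many shirts it contains.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if shirts <= 0:
        return 0.0
    batches = math.ceil(shirts / capacity)
    return batches * float(hours_per_batch)


HOURS_PER_BATCH = 4  # assumed value, not stated directly in the comments

# Enough space for all 20 shirts at once: a single batch.
print("parallel:", drying_time(20, 20, HOURS_PER_BATCH))
# Space for only 5 shirts at a time: four batches.
print("batches of 5:", drying_time(20, 5, HOURS_PER_BATCH))
# One shirt at a time: twenty batches.
print("one at a time:", drying_time(20, 1, HOURS_PER_BATCH))
```

Under these assumptions the parallel case stays at 4 hours, while batches of 5 give the 16 hours mentioned above. The only thing that changes the answer is how much drying space is available.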

Nice!
Maybe at the end you can show the "leaderboard" so we can see how it stacks up against the others on perhaps a spreadsheet?

JacoduPlooy

The issue with your logic puzzle is that these LLMs have already seen every puzzle you give them. Why give them credit for solving a puzzle when they already know the solution? You didn't come up with the puzzle yourself; you just found it online, where they've already seen the answer.

Metarig

- Write a sentence that ends with apple.
- The court of wizards sentences wizard X to death by turning him into an apple and leaving him in this state for the rest of eternity.

u.v.s.

How is the question about the shirts a pass?

Sofian

Maybe soon we will see an amazing collaboration of two technologies: fast Groq and huge Grok, and it will work really cool and fast!🎉

ff_ani

This was very cool. I subscribed. For a couple of the Grok fails, I could argue that the wording played a role in the failure. But thank you!

DiscOutpost

Are you using Grok 1.0? I thought the one on X was now 1.5?

DavesNotHereRightNow
