Grok-1 FULLY TESTED - Fascinating Results!

Let's test Grok using our LLM rubric! How does it compare to other models?

Chapters:
0:00 - Intro
0:51 - Testing
Comments
Author

I'm just a truck driver but I find this stuff amazing. I love everything open source even if it's not working perfectly. Yet!

MuckoMan
Author

What the heck? Wasn't the point of the shirts drying example that the reasoning should figure out the possibility of drying 20 shirts in parallel and that it would still take 5 hours?

labradore
Author

The problem I see is that as you keep repeating the same test, the model will likely have been trained on it, since the test will have appeared somewhere online. You need to change the tests and run them against all the LLMs available at the time so as to compare them; otherwise later models have a distinct advantage.

cbnewham
Author

I don't think this is equivalent to testing the open source Grok. The open version is apparently not fine-tuned at all, but the version available through Twitter is clearly fine-tuned for chat.

lachland
Author

It would be really cool to have a score count on the screen as you test the models; that would help with tracking the scores.

MxAi
Author

Regarding the Snake game, I asked Google's Gemini to do a similar simple task and it couldn't even build.
At least the snake moved.

dsuess
Author

*In case anyone was wondering how Grok compares to Claude 3 Sonnet in that spatial/physical reasoning test, Claude nails it, although my prompt was somewhat more precisely worded and less ambiguous. Even the mid-tier Claude 3 is pretty freaking great:*

To reason through this step-by-step:

1) A person places a coffee cup upright on a table.

2) They drop a marble into the cup, so the marble is now inside the cup.

3) They quickly turn the cup upside down on the table. This means the open end of the cup is now facing down towards the table surface.

4) When the cup was turned over, the marble would have fallen out of the cup and onto the table surface, since there is nothing to contain it inside the upside-down cup.

5) The person then picks up the upside-down cup, without changing its orientation. So the cup remains upside-down.

6) They place the upside-down cup into the microwave.

Therefore, based on the sequence of steps, the marble would have fallen out of the cup when it was turned over, and would be left behind on the table surface. The marble is not in the microwave with the upside-down cup.

cacogenicist
Author

Great job, really enjoyed this fast Grok overview!

SODKGB
Author

Isn't the answer to the killers question 4? The one that got killed is still in the room, so there are 4.

Xardasflynn
Author

For the word-count question, I think the 2 invisible ones are the <START> and <END> tags.

alxy-dev
Author

The digging question is a bit silly. What are we digging in? Sand: a wider hole can make better use of more people. Clay: you might dig a very narrow hole, but then you won't be able to throw the clay out easily, which means extra people might be especially valuable, since you can hand them the clay instead of having to crawl in and out of the hole. Are we digging as fast as possible? Then people can rotate, which makes it easier to use more of them. What tools do we use? Shovels? Excavators? Just our bare hands? With excavators, just one is probably as fast as five. Depending on all the potential factors here, one person can work at a higher, lower, or about the same level of efficiency as five people.

frankjohannessen
Author

Matt, I consume a lot, and I mean a lot, of "AI" content, podcasts, etc. You are by far one of my favorites. What you do here isn't easy, but you do it very well. Thank you~!

CYBONIX
Author

Matthew, no one on this planet dries T-shirts serially, one after the other, over 16 hours.
I don't think it is a pass if it divides the drying time by the number of T-shirts,
because the main reason for this test is to check whether the LLM has an understanding of the real world.

BTW a third answer (somewhat more realistic, but still uncommon) would be to dry in batches. If your drying space can only handle 5 shirts at once it would be four batches.

Great test though! Thank you.

MrSuntask
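
The three readings of the shirt-drying question raised in the comment above (everything in parallel, one shirt after another, and batches) can be compared with a little arithmetic. This is only a rough sketch: the reference figures of 5 shirts drying in 4 hours are an assumption inferred from the "16h" and "5 shirts at once" mentioned in the comment, not something stated on this page, and the function names are made up for illustration.

```python
import math
from typing import Optional

# Assumed reference figures (not stated on this page): 5 shirts dry in 4 hours.
REF_SHIRTS = 5
REF_HOURS = 4.0


def batch_drying_hours(num_shirts: int, hours_per_batch: float,
                       capacity: Optional[int] = None) -> float:
    """Hours needed when up to `capacity` shirts can dry at the same time.

    capacity=None models unlimited drying space: one batch, same time as before.
    """
    if num_shirts <= 0:
        return 0.0
    if capacity is None:
        return float(hours_per_batch)
    batches = math.ceil(num_shirts / capacity)
    return batches * hours_per_batch


def serial_drying_hours(num_shirts: int, ref_shirts: int, ref_hours: float) -> float:
    """The 'per-shirt rate' reading: scale the time linearly with shirt count."""
    return num_shirts * ref_hours / ref_shirts


if __name__ == "__main__":
    print("parallel:", batch_drying_hours(20, REF_HOURS))
    print("batches of 5:", batch_drying_hours(20, REF_HOURS, capacity=REF_SHIRTS))
    print("serial:", serial_drying_hours(20, REF_SHIRTS, REF_HOURS))
```

Under these assumed figures, the serial reading and the batches-of-5 reading give the same number, while the parallel reading leaves the drying time unchanged; which one counts as a "pass" depends on what the test is meant to check.
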
Author

Nice!
Maybe at the end you can show the "leaderboard" so we can see how it stacks up against the others on perhaps a spreadsheet?

JacoduPlooy
Author

The issue with your logic puzzle is that these LLMs have already seen every puzzle you give them. Why give them credit for solving a puzzle when they already know the solution? You didn't come up with the puzzle yourself; you just found it online, where they've already seen the answer.

Metarig
Author

- Write a sentence that ends with apple.
- The court of wizards sentences wizard X to death by turning him into an apple and leaving him in this state for the rest of eternity.

u.v.s.
Author

How is the question about the shirts a pass?

Sofian
Author

Maybe soon we will see an amazing collaboration of two technologies: fast Groq and huge Grok, and it will work really cool and fast!🎉

ff_ani
Author

This was very cool. I subscribed. For a couple of the Grok fails, I could argue that the wording played a role in the failure. But, thank you!

DiscOutpost
Author

Are you using Grok 1.0? I thought the one on X was now 1.5?

DavesNotHereRightNow