Yi Large: Surprisingly Great at Logic and Reasoning! (Fully Tested)

Testing Yi Large, a frontier model flying under the radar!

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? 📈

My Links 🔗

Media/Sponsorship Inquiries ✅
Comments

Just a suggestion, but maybe have one or two "secret" questions in your LLM rubric that you reveal and switch out after a few months? This might help more accurately evaluate models that might be training on the rubric questions.

veros

Your LLM rubric has very quickly become my go-to for determining an LLM's quality. Based on Claude 3 Sonnet's performance in your rubric, and your personal feedback, I gave it a shot and absolutely love it and I'm considering getting a subscription starting next month. I'm looking forward to the additional questions that will be added.

middleman-theory

The best part about this model is that, on at least some of the questions, it actually says "I do not know" or "maybe ask me a different way"; that is real progress. You might also want to try a different game than Snake, like tic-tac-toe, checkers, or Tetris.

southcoastinventors

Another interesting share, Matthew! 🏆

Please consider increasing the font size in your videos for your viewers.

brianWreaves

Got it! Counting kinds of words! 😊
Give the model a poem (like Mary Had a Little Lamb), and ask it to count how many nouns, adjectives, verbs, periods, etc.
Even the largest models fail consistently, because transformers can't reliably do this kind of step-by-step counting.
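If you want a ground-truth count to score that against, here is a minimal sketch using spaCy (the en_core_web_sm model and the stand-in poem line are assumptions for illustration, not anything from the video):

    # Count parts of speech and periods in a short text with spaCy,
    # to have a reference answer for the "count the nouns/verbs" test.
    from collections import Counter
    import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")
    poem = "Mary had a little lamb, its fleece was white as snow."  # stand-in text

    doc = nlp(poem)
    pos_counts = Counter(token.pos_ for token in doc)   # keys like NOUN, VERB, ADJ, PUNCT
    periods = sum(1 for token in doc if token.text == ".")

    print(pos_counts["NOUN"], pos_counts["VERB"], pos_counts["ADJ"], periods)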

jonmichaelgalindo

I would be curious, for any model that flawlessly passes the 'Give me 10 sentences that end in the word "Apple"' test, what would happen if you asked again but swapped "Apple" for some random word like "pumpkin". That would help determine whether the model was simply trained on the "Apple" question (since it's used a lot), or whether it can actually plan ahead and give a correct answer on a different variation of the problem.
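A small sketch of how such a swap could be scored automatically (the one-sentence-per-line assumption and the helper name are mine, purely for illustration):

    # Count how many lines of a model's reply end in the target word,
    # so the "Apple" and "pumpkin" variants can be scored the same way.
    import re

    def sentences_ending_in(reply: str, word: str) -> int:
        count = 0
        for line in reply.strip().splitlines():
            # Drop any list numbering and trailing punctuation, then compare the last word.
            text = re.sub(r"^\s*\d+[.)]\s*", "", line).rstrip(" .!?\"'")
            if text and text.split()[-1].lower() == word.lower():
                count += 1
        return count

    # A passing model should score 10 on both variants, e.g.:
    # sentences_ending_in(reply_text, "pumpkin") == 10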

unshadowlabs

Hello, I have a personal benchmark: I ask the model to generate fairy tales, and I check whether the story stays coherent, whether there are mistakes, and how complex the story is. Another key difference is that I ask it to generate 5 ideas, then tell the story, then propose 5 more at the end. This can weed out models that are bad at instruction following.


Yes, and a question it failed was this:
Quantum particles A and B are spin-entangled with opposite spins (one spin up and the other spin down) and are then, in the next step, spatially separated without any measurements.
Once separated, you flip the spin of A without measuring it. What can you say about the spin of B in relation to A?
Important: up to this point no particle has been measured, and they are still entangled!


Correct answer: the spins of A and B are the same (both spin up or both spin down).
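One way to make that expected answer concrete, as a sketch assuming the pair starts in the opposite-spin Bell state and "flipping" A means applying a Pauli-X operation to A alone (still no measurement):

    |\psi\rangle = \tfrac{1}{\sqrt{2}}\bigl(|\uparrow\rangle_A |\downarrow\rangle_B - |\downarrow\rangle_A |\uparrow\rangle_B\bigr)

    (X_A \otimes I_B)\,|\psi\rangle = \tfrac{1}{\sqrt{2}}\bigl(|\downarrow\rangle_A |\downarrow\rangle_B - |\uparrow\rangle_A |\uparrow\rangle_B\bigr)

After the flip the two spins are perfectly correlated, so any later measurement finds A and B pointing the same way, which matches the stated answer.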

lombizrak

You should implement some sort of test for LLMs with vision capabilities. Most of them are still really bad at counting items in PDFs or images. E.g., in my field, labels on architectural drawings denoting window frames need to be counted (W1, W2, etc.), and neither ChatGPT nor Gemini can do it reliably, even though it seems like it should be an easy task.

Tesfan-moou

Hey Matthew, what do you think about expanding the scoring? Instead of only 2 states (pass or fail), you could have at least 5 levels and award marks depending on how well the model performs.

shawalmbalire

I'm sorry Matt. Here is the original question I asked ChatGPT 4o complete with the answer:

Question:
If I had 10 fingers on 2 hands. Each hand has 5 fingers. In between those fingers are gaps. Can I pick up 8 beer cans and 1 bottle of beer with those 2 hands? Explain your answer and the reasoning behind the answer.


Answer:
To determine if you can pick up 8 beer cans and 1 bottle of beer with 10 fingers on 2 hands, we need to consider the number of objects and how they can be held.

Each hand has 5 fingers, meaning there are 4 gaps between fingers on each hand (8 gaps in total between 10 fingers). Here's a detailed reasoning:

1. Gaps Between Fingers: There are 4 gaps on each hand (a total of 8 gaps across both hands).

2. Holding Beer Cans: You can hold 1 beer can in each gap between your fingers, so 8 beer cans can be held in the 8 gaps.

3. Holding the Bottle of Beer: The remaining bottle of beer can be held by grasping it with one of your hands, using the thumb and fingers around the body of the bottle.

Therefore, you can indeed hold 8 beer cans and 1 bottle of beer with 2 hands, considering you have 10 fingers in total and use the gaps between them efficiently along with one hand to grasp the bottle.


Love the content, keep up the great work. Also, this really is how I held all those cans and that one bottle; I didn't want to go down the hill twice. 🤣

Edit: I just thought of this as I re-read it. Try taking the sentence about the gaps between the fingers out of the original question and see whether the AI can figure out on its own that hands have gaps.
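For what it's worth, the gap arithmetic in that answer is easy to sanity-check in a couple of lines (a throwaway sketch, not tied to any particular model):

    # Check only the counting: 2 hands x 4 gaps = 8 gaps for the 8 cans,
    # with a hand still available to grasp the bottle around its body.
    hands, fingers_per_hand = 2, 5
    gaps = hands * (fingers_per_hand - 1)
    cans, bottles = 8, 1
    print(gaps >= cans and hands >= bottles)  # True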

JustinsOffGridAdventures

Weekend Warriors.
Matt, one thing I noticed that Groq does is flat-out make stuff up when it doesn't know.
It's as if it needs to fill in the blank, so it takes its best guess.
If it doesn't know, it should tell us that.
I was just using the default model.
I'm not sure whether this "make stuff up" behavior happens at the open-source model level or in a layer above it.

jeffg

I still don't understand why it is a problem if a model is censored! Most of the models out there are censored.
You know well that LLMs are not able to tell you how many words are in their answer, due to the way they work. This will only change with a new architecture that lets a model first think of (or generate) an answer, then count the words, and then submit the response. For now, LLMs just generate tokens and move forward.

jeanchindeko

The marble question:

The AI is wrong on point 2, and I don't consider the question to have been passed.
If the marble does not fall out of the jar as soon as it is turned upside down (for instance, if it is flipped and set on the table quickly enough that it stays inside the jar), the marble will remain on the table when the jar is picked up again.

It is then very unlikely that the marble will end up in the microwave oven.
The AI has therefore overlooked the effect of inertia.

But it's already pretty good compared to the performance of LLMs just a year ago.

HerbertHeyduck

Actually, I managed to get LLMs to respond to the question "how many words are in your response to this prompt". Over time, the LLM starts picking some number and then generates the necessary amount of words to match it. Tested on Claude 3.5 Sonnet, but it doesn't work every time.

hskdjs

Suggestion:
"In a race between three people, what place is second to last?"

Most models I've tested respond that it's third place, when the correct answer is second.

frankjohannessen

Hi Matthew, great videos, keep up the good work. My daughter does the same sort of thing you do, but with golf, tennis, cricket, and F1 racing, so I understand the effort that goes into making these videos.
Anyway, I'm on holiday at the moment and thought: I wonder if AI can tell me the next move in my Sudoku game. So I took a picture of my partly filled-in game and loaded it into Bing Copilot (i.e. ChatGPT-4), and it struggled to read the very clear numbers and couldn't predict the next move correctly. I did the same with Claude, and it was better but still failed to read the image and failed to work out the next number. Could be a good test to add to your existing ones.

mognog

I have my own question: "I have an empty chess board. I only have one chess piece, which is a king. I can place the king anywhere on the board. The king can then make one legal move. How many different combinations of placing the king and making a move are there?" The trick to this question is that if the king is in a corner there are only 3 legal moves, if it's on an edge there are 5 legal moves, and if it's on one of the 36 centre squares there are 8 moves. The correct answer is 420. Qwen 72 gets this correct, and so do ChatGPT and Claude 3.5.
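The 420 figure is easy to verify by brute force; here is a minimal sketch (a straightforward enumeration, not quoted from the comment):

    # For every square on an empty 8x8 board, count the legal one-step
    # king moves that stay on the board, then sum over all 64 squares.
    def king_moves(file: int, rank: int) -> int:
        return sum(
            1
            for df in (-1, 0, 1)
            for dr in (-1, 0, 1)
            if (df, dr) != (0, 0) and 0 <= file + df < 8 and 0 <= rank + dr < 8
        )

    total = sum(king_moves(f, r) for f in range(8) for r in range(8))
    print(total)  # 420 = 4 corners*3 + 24 edge squares*5 + 36 inner squares*8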

robertfairburn

How about a counting test? E.g.:
"Mark and Mary enter a room. Mark exits and enters again. Santa enters through the chimney, and leaves with Mark and Mary. Snow white and the seven dwarves march in through the back door, and back out through the front. Paul and Sophie climb in through the window. Neko the cat climbs in behind them. How many times was the room entered in total?"
Answer: 15 times.
All the big models can solve this, and the little ones can't. I'm looking for a more abstract countable.
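As a quick check of that 15, here is a small tally of the entries event by event (the per-event counts are my reading of the riddle, not something stated in the comment):

    # Each tuple is (event, number of room entries that event contributes).
    events = [
        ("Mark and Mary enter", 2),
        ("Mark exits and enters again", 1),
        ("Santa enters through the chimney", 1),
        ("Santa leaves with Mark and Mary", 0),
        ("Snow White and the seven dwarves march in through the back door", 8),
        ("...and back out through the front", 0),
        ("Paul and Sophie climb in through the window", 2),
        ("Neko the cat climbs in behind them", 1),
    ]
    print(sum(n for _, n in events))  # 15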

jonmichaelgalindo

mistral-small on my local machine:
>>> A marble is put in a glass. The glass is then turned upside down and put on a table. Then the glass is picked up and put in a microwave. Where's the marble? Explain your reasoning step by step.
To determine where the marble is after following these steps, let's reason through each action:

1. **Marble is placed in a glass**: Initially, the marble is inside the glass.

2. **The glass is turned upside down and put on a table**:
- When you turn the glass upside down while it contains the marble, the marble will fall out of the glass due to gravity.
- Therefore, after this step, the marble is no longer inside the glass; it is now on the table beneath where the glass was placed upside down.

3. **The glass is picked up and put in a microwave**:
- This action does not affect the position of the marble since the marble is already outside the glass, sitting on the table.
- The marble remains on the table even after the glass is moved to the microwave.

### Conclusion:
The marble is on the table where the glass was placed upside down. It did not move with the glass when the glass was picked up and put into the microwave.

Do you want to test mistral-small?

StefanOstrowski