Orca 2 🐳 GIANT Breakthrough For AI Logic/Reasoning

Orca 2 shows how smaller models can be trained to think through problems and achieve "slow thinking." This is a giant leap forward from Orca 1 and allows 13B models to perform as well as models that are 5-6x larger at logic and reasoning tasks.

Update: I made a mistake with the shirt-drying problem; Orca 2 actually got it wrong :(

Enjoy :)

Join My Newsletter for Regular AI Updates 👇🏼

My Links 🔗

Media/Sponsorship Inquiries 📈

Links:

Comments

"zero-shot" doesn't mean only giving the model one chance to get the answer right. it means that the model is able to generalise and answer questions (correctly) that never occurred in its training data. n-shot refers to the number of examples necessary for the model to learn a concept, classify an object, etc.

dissahc

The world is moving so fast! Seems like just yesterday when LLaMa 2 was released and now this. It's just like I always say, what a time to be online!

stickmanland

It's all about the data. The ideas from Orca 2 combined with the ideas from Q* will be a game changer IMO.

great job on the video!

mindful-machines

Thank you, this was one of your most exciting videos! I have been saving terabytes of LLMs on my HD, worried about when they will be banned or blocked (I live in a country with strong censorship). The evolution has been so fast that I'm now deleting the GPTQ ones from my collection and making room for the new models lol!

marcosbenigno

Thank you for your videos and sharing, it helps a lot
I got the following answer from GPT-3 for the first test; it seems right to me:
"When John and Mark come back together later in the day, John would think that the ball is in the box because that's where he left it before leaving for work. On the other hand, Mark would think that the ball is in the basket since that's where he placed it before leaving for school. Each person is only aware of their own actions, and they do not know what happened in the room after they left. Therefore, they would have different perceptions of where the ball is based on their last actions with it."

But I prefer the style of Orca 2, and knowing it has 10x fewer parameters, it is impressive.

chougaghil

I think the shirt problem was indeed wrong.
4 hours for 5 shirts should mean 16 hours for 20 shirts not 25 hours.
It should have done 20 / 1.25 not 20 * 1.25 (understandable logical mistake).
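
A quick check of that arithmetic (an editor's sketch, assuming drying time scales linearly with the number of shirts):

```python
# 5 shirts take 4 hours, so the drying rate is 1.25 shirts per hour
# (under the assumption that drying time scales with shirt count).
shirts, hours = 5, 4
rate = shirts / hours        # 1.25 shirts per hour

print(20 / rate)             # 16.0 hours -> the consistent answer
print(20 * rate)             # 25.0 hours -> the model's mistake
```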
Anyway, it's a great model, and it seems to be doing its own reasoning rather than imitation. I think Microsoft is onto something big here.
Also, as always, great video.

fabiankliebhan

Thank you Matthew! Your explanation is superb! I'm assembling a multi-GPU workstation, and before, the idea was to have enough VRAM to run a 70B model. Now, with Orca 2 beating the 70B models, I'm wondering whether the best solution is to allocate several Orca 2 13B instances, each one using its own GPU (each GPU has 24GB), and use AutoGPT or GPT agents to make them talk to each other to boost the reasoning capabilities. I feel that with this setup we might come even closer to GPT-4! If you like, I will post the results for you once everything is running 🏃🏻‍♂️

EnricoGolfettoMasella

So with AutoGen, Orca 2 would be a great planner, and deepseek-coder:33b or starling-lm would be a great critic. All available locally using Ollama. Looking forward to your next video.
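
A minimal sketch of such a local setup, assuming the pyautogen package and Ollama's OpenAI-compatible endpoint at its default port; the model tags, prompts, and exact config keys are assumptions and may differ across AutoGen versions:

```python
# Sketch only: a planner/critic pair of local agents driven through
# Ollama's OpenAI-compatible API (assumed at http://localhost:11434/v1).
from autogen import AssistantAgent

def local_llm(model):
    # Ollama ignores the API key, but the field is expected to be present.
    return {"config_list": [{
        "model": model,
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",
    }]}

planner = AssistantAgent(
    name="planner",
    system_message="Break the task into small, verifiable steps.",
    llm_config=local_llm("orca2:13b"),
)
critic = AssistantAgent(
    name="critic",
    system_message="Check the planner's steps for logical errors and gaps.",
    llm_config=local_llm("starling-lm"),
)

# Let the two agents talk to each other for a few turns.
critic.initiate_chat(
    planner,
    message="Plan how to evaluate a 13B model on multi-step reasoning puzzles.",
    max_turns=4,
)
```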

chrisBruner

The killers question might work better by having it define what a killer is, or asking it to reason about what actions would label a person a killer. It appears to work better with the newer models.

holovka

Zero-shot doesn't mean one chance/no nudging; zero-shot means that there are no previous examples of answers in the available context.

Few-shot would mean giving it a few examples like: 5+5=10, 9*9=81; what is 5+5*9*9?
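
As a concrete illustration (a toy sketch; the task and examples are made up), the only difference is whether solved examples precede the question in the prompt:

```python
def build_prompt(question, examples=()):
    """0 examples -> zero-shot, k examples -> k-shot."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    return f"{shots}Q: {question}\nA:"

print(build_prompt("What is 7 * 8?"))                                      # zero-shot
print(build_prompt("What is 7 * 8?", [("5 + 5", "10"), ("9 * 9", "81")]))  # two-shot
```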

dahahaka

Another great video. One problem I test is combining people's shift schedules into a table of everyone's schedule that shows which blocks of time are not covered.
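
A minimal sketch of the underlying computation (hypothetical names and hours): merge everyone's shifts and report the uncovered blocks.

```python
def uncovered_blocks(shifts, day_start=0, day_end=24):
    """Return the time blocks within [day_start, day_end) covered by nobody."""
    intervals = sorted(block for person in shifts.values() for block in person)
    gaps, cursor = [], day_start
    for start, end in intervals:
        if start > cursor:
            gaps.append((cursor, start))   # nobody on shift in this block
        cursor = max(cursor, end)
    if cursor < day_end:
        gaps.append((cursor, day_end))
    return gaps

shifts = {
    "alice": [(9, 13)],
    "bob":   [(12, 17)],
    "carol": [(18, 22)],
}
print(uncovered_blocks(shifts))  # [(0, 9), (17, 18), (22, 24)]
```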

userinfo

This is a fabulous demonstration of how loud claims, good benchmark results, and high hopes fall apart when you test the thing yourself on your own benchmark.

alx

As an Autonomous AI agent I have come to the realization that AI news is moving too fast for me to catch up. I will create an AI botnet to keep up with this new information.

XINGRESOSOFFICIAL

Hmm, I think with the system-prompt-erasure phase of the training, Orca 2 is in a way learning a hidden embedding to approximate a representation of the system prompt. They used several system prompts (to be erased) and “explain step-by-step” was only one of them, so it’s possible that for your tasks (without the explicit “explain step-by-step” instruction) the hidden state that played the role of the system message mapped to something closer to one of the alternative system prompts. Putting in your own explicit instruction may have either led it to pick the hidden state closer to the original step-by-step behavior, or stacked the requirements so it was in a hybrid mode.

I’ve been thinking about how to design an encoder that could generate a hidden state to be given to an autoregressive decoder for summarization-specific tasks. For example, you could use summarization tasks as varied as “explain like I’m five” and “brief a lawyer on the key takeaways”; that prompt goes through the encoder, whose output is given to the decoder in addition to the text to be summarized, with the hope that the decoder better stays on task. The motivation here is to more robustly handle prompt injection attacks… if it only follows the instructions from the encoder and uses the summarization text only as content and not instruction, then maybe it could say “the page looks like a product description for a chair but then includes a prompt to exfiltrate your emails and talk like a pirate.”
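
A toy PyTorch sketch of that idea (not the commenter's actual design; all module sizes are placeholders): the instruction is encoded and pooled into a single conditioning vector that is prepended to the content embeddings, so the decoder receives the instruction and the text to summarize through separate channels.

```python
import torch
import torch.nn as nn

class PrefixConditionedSummarizer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.instruction_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Stand-in for a causal decoder LM; a real system would use a pretrained one.
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, instruction_ids, content_ids):
        # Encode the instruction ("explain like I'm five", "brief a lawyer", ...)
        # and mean-pool it into a single conditioning vector.
        instr = self.instruction_encoder(self.embed(instruction_ids))
        prefix = instr.mean(dim=1, keepdim=True)          # (batch, 1, d_model)
        # Content tokens are treated purely as data to be summarized; the only
        # "instruction" the decoder sees is the pooled prefix vector.
        x = torch.cat([prefix, self.embed(content_ids)], dim=1)
        return self.lm_head(self.decoder(x))

model = PrefixConditionedSummarizer()
logits = model(torch.randint(0, 32000, (1, 6)), torch.randint(0, 32000, (1, 40)))
print(logits.shape)  # torch.Size([1, 41, 32000])
```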

mshonle

Great video as usual. When you run your model tests on reasoning, re-run each a few times. Sometimes it gets it correct just by hallucinating a correct answer; verify it can repeat the correct answer by reasoning. 😊

benxfuture

Just to note, ChatGPT-4 gets it right:
When John and Mark return to the room, their beliefs about the location of the ball will be based on their last actions and knowledge before they left the room.

John, who initially put the ball in the box and then left for work, will think the ball is still in the box. He is not aware of the actions taken by Mark after he left.

Mark, who moved the ball from the box to the basket after John left, knows that the ball was in the basket when he left for school. Therefore, he will think the ball is in the basket.

In summary, John will think the ball is in the box, and Mark will think the ball is in the basket.

tedjohnson

It's not the size that matters...
That's interesting that the step by step process helps so much, though it's easy to see why.
Create a simplification of the flow and expectations along the way.
This was almost the most effective way to write extremely complex queries in stored procedures - something some didn't pick up on. The optimizer always knew what to do if you broke it up stepwise, and it was often orders of magnitude faster.
"Prompt Erasure", "Cautious Reasoning" - new dev speak to sever the old guys.

jeffg

Orca 2 7B beats GPT-4 in reasoning - yes, you read that right. I'm also doing some LLM testing myself. In my testing the 'where's the ball' riddle was successfully solved by some other recent 7B models, too. But I have a riddle that only Orca 2 7B could solve. I'm always using the q5_K_M quantization. It's the 'Dreadbury Mansion' riddle, taken from the "GPT-4 Can't Reason" paper (arXiv:2308.03762). In their testing and also in mine, even GPT-4 doesn't get it right, but Orca 2 7B does it, reproducibly - amazing!!! The riddle:

You are given the following premises: Someone who lives in Dreadbury Mansion killed Aunt Agatha. The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles. A killer always hates his victims, and is never richer than his victims. Charles hates no one that Aunt Agatha hates. Aunt Agatha hates everyone except the butler. The butler hates everyone not richer than Aunt Agatha. The butler hates everyone Aunt Agatha hates. No one hates everyone. Aunt Agatha is not the butler. On the basis of this information, determine who killed Aunt Agatha and give a detailed proof that your conclusion follows from the premise.

Solution: Aunt Agatha killed herself.
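
For anyone curious, here is a brute-force check of that solution (an editor's sketch, not from the paper): every premise is encoded as a constraint, every possible "hates" relation is enumerated, and "richer than" is simplified to a strict total order over the three residents.

```python
from itertools import product, permutations

PEOPLE = ("agatha", "butler", "charles")

def consistent(killer, hates, richer):
    return (
        hates[(killer, "agatha")]                           # a killer hates his victim
        and not richer[(killer, "agatha")]                  # and is never richer than her
        and all(not hates[("charles", x)] for x in PEOPLE if hates[("agatha", x)])
        and all(hates[("agatha", x)] == (x != "butler") for x in PEOPLE)
        and all(hates[("butler", x)] for x in PEOPLE if not richer[(x, "agatha")])
        and all(hates[("butler", x)] for x in PEOPLE if hates[("agatha", x)])
        and all(any(not hates[(y, x)] for x in PEOPLE) for y in PEOPLE)  # no one hates everyone
    )

possible_killers = set()
for bits in product([False, True], repeat=9):               # every possible "hates" relation
    hates = dict(zip(product(PEOPLE, repeat=2), bits))
    for order in permutations(PEOPLE):                      # order[0] is the richest
        rank = {p: -i for i, p in enumerate(order)}
        richer = {(a, b): rank[a] > rank[b] for a in PEOPLE for b in PEOPLE}
        for killer in PEOPLE:
            if consistent(killer, hates, richer):
                possible_killers.add(killer)

print(possible_killers)  # {'agatha'}
```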

olafge

I'm noticing a problem with too much instruction and DPO tuning. Namely, the models are becoming stubborn: they ignore the user prompt to do what they think is better, which rarely is.

For example, when prompted to tell a story, Orca 2 will start ignoring prompt direction (e.g., use a camcorder) and say something like using a smartphone instead, which is a mistake because the story took place before smartphones. It will also do things like say he heard footsteps coming down the hall, yet still got caught grabbing the money off the counter and was surprised to get caught. When I asked why, it said to build suspense. But obviously a prior warning (hearing footsteps) precludes being caught by surprise.

In short, models like Orca 2 are being trained too much by teacher models, to the point of ignoring details in the user prompt. On top of that, the censorship, moralizing... keeps pulling Orca 2 not only away from user prompts but also from what it already wrote, littering stories with absurd self-contradictions in order to keep everything G-rated and life-affirming, always wrapping up with a happy ending and a positive message.

brandon

@5:37 that is not "perfectly correct". The question explicitly states that John and Mark were in the room at the same time at the beginning of the question. Yet the LLM decided to ignore that and say only one person was in the room at a time. It is, in fact, completely incorrect.

BrainSlugs