New Claude 3 “Beats GPT-4 On EVERY Benchmark” (Full Breakdown + Testing)

Anthropic just dropped Claude 3, a cutting-edge model that, according to benchmarks, performs better than GPT-4 across the board. I'll tell you all about it, and then we'll test it ourselves!

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? ✅

My Links 🔗

Rent a GPU (MassedCompute) 🚀
USE CODE "MatthewBerman" for 50% discount

Media/Sponsorship Inquiries 📈

Links:

Chapters:
0:00 - About Claude 3
8:35 - Pricing & Use Cases
10:47 - Testing
Comments:

So is Claude 3 better than GPT-4? What do you think?

matthew_berman

Not sure if you've seen it on Twitter, but someone at Anthropic mentioned one pretty crazy instance during internal testing, specifically the 'needle-in-the-haystack' test. Here's what Opus said:

"The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association."
However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.

Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.

OculusGame

Strong disagree on the definition of AGI, for multiple reasons.
1) The term AGI was introduced to mark the difference from AI: where AI mimics human intelligence, AGI is actually self-aware and has a real understanding of its inputs.
So we are moving the goalposts, and we now need a new term for that kind of artificial intelligence.
2) AI was already better than most humans at most tasks; Quake bots were better than any player 20 years ago.
3) When we built a car that was finally faster than a horse, we kept on calling it a car.
4) The discussion itself is not even about semantics, but serves as a technicality in the deal between OpenAI and Microsoft.
5) It's all hype and funding. AI is getting better, but it is the same thing; it works in the same way. It mimics, simple as that.

ChristianIce

I appreciate the quick turnaround in this video

algorusty

I think it would be hilarious if Claude 3 is just passing the questions to GPT-4 in the background, to get its answers. Actually, that's a pretty good idea. LOL. Claude 3, powered by GPT-4.

icegiant

The claims every team makes are always way above their heads.

Hunter_Bidens_Crackpipe_

In psychology, it's well recognised that people remember the beginning and end of a lecture best. So it was interesting when you mentioned the 'lost in the middle' aspect of LLMs.

dsteele

We'll know an LLM has beaten GPT-4 when all the other LLMs stop comparing themselves to GPT-4.

notme

So much for there being no pressure on OpenAI to move forward.

This is going to get interesting.

inplainview

Jean Claude? "You are nexx!!" ~Bolo Yeung

lunarcdr

I love your content, Matt! Thank you for this. My one criticism is that you're not that great at snake. :) haha

estebanleon

I was only waiting for this video. Thank you so much! 🎉🎉❤❤❤

DihelsonMendonca

I am interested in evaluating Claude 3's coding proficiency. I would appreciate a comprehensive benchmark focusing solely on coding capabilities.

mrpro

I love your videos! I suggest a simple puzzle that, for now, has not yet been solved by AIs:

A boat is moored at the marina.
A ladder is attached to the hull of the boat.
At low tide, 8 rungs of the ladder are visible.
The space between each rung is 20 cm.
The water level rises by 80 cm between low tide and high tide.
At high tide, how many rungs of the ladder remain visible?

Of course, as the water level rises, the boat follows, and therefore the number of visible rungs always remains at 8...
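The riddle's trick can be made explicit with a tiny simulation (a minimal sketch; the function and parameter names are hypothetical):

```python
def visible_rungs(rungs_at_low_tide, rung_spacing_cm, tide_rise_cm):
    # The boat floats, so it rises exactly as much as the water does.
    boat_rise_cm = tide_rise_cm
    # Net water gain relative to the hull is therefore zero.
    water_gain_cm = tide_rise_cm - boat_rise_cm
    return rungs_at_low_tide - water_gain_cm // rung_spacing_cm

print(visible_rungs(8, 20, 80))  # -> 8
```

Models that answer "4" are treating the ladder as fixed to the seabed rather than to the floating hull.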

radiator_mother

Benchmark test idea: the BBC Radio 4 game "Just a Minute", in which contestants must speak for a full minute on a given topic without repetition, deviation, or hesitation. From an LLM perspective: generate 200 words on a topic without repeating a word or diverging from the topic. Repeats of words like "a", "and", "to", etc. are allowed, but verbs, nouns, and adjectives are not.
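A benchmark like this could be scored automatically. Here is a minimal sketch, assuming a hand-picked stopword list for the allowed function words (the list and function name are made up for illustration):

```python
import re

# Hypothetical allow-list of function words whose repeats don't count.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "it"}

def repeats(text):
    """Return the set of content words that appear more than once."""
    words = re.findall(r"[a-z']+", text.lower())
    seen, repeated = set(), set()
    for word in words:
        if word in STOPWORDS:
            continue  # function words are exempt from the no-repeat rule
        if word in seen:
            repeated.add(word)
        seen.add(word)
    return repeated

print(repeats("The model writes and the model repeats"))  # -> {'model'}
```

A passing response is one where `repeats()` comes back empty; a full scorer would also need a deviation check, which is much harder to automate.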

GarethDavidson

For the last question, I would also answer "1 hour". This is because "10-foot hole" might mean it is 10 feet in length, and people can stand next to each other as they dig. Not a great question. If you want the answer you want, state explicitly that the hole is very narrow and only one person can reach the digging spot.

ekstrajohn

I asked Claude to extract data from a screenshot. It did, and in short order (faster than ChatGPT). But when I asked it to "Create a Pandas DataFrame with the variables extracted from the image, where the first row contains the variable names and the second row contains the corresponding values without units. The DataFrame is saved as a CSV file, which you can download," it came back with: "Claude does not have the ability to run the code it generates." ChatGPT can. It's got a ways to go.
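For reference, the step Claude declined to run is short in pandas. This sketch stands in for the commenter's workflow; the variable names and values are made up, since the actual screenshot data isn't given:

```python
import pandas as pd

# Hypothetical values standing in for the data extracted from the screenshot.
extracted = {"voltage": 3.3, "current": 0.5, "power": 1.65}

# One row: column headers are the variable names, the row holds the values.
df = pd.DataFrame([extracted])
df.to_csv("extracted_values.csv", index=False)  # the downloadable CSV
print(df.shape)  # -> (1, 3)
```

Claude can generate code like this from the image; the difference the commenter hit is that ChatGPT's code interpreter can also execute it and hand back the file.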

michaellavelle

How would these do with traditional puzzles like: "Jane and Tim were on the way to the market; they met Nancy. Nancy had 2 brothers, each brother had 2 cousins, each cousin had 5 sisters. How many people, in total, were on the way to the market?" and basically other basic things which tend to mislead humans.
Example 2: "If you are running a race and you run past the person in second place, what place are you in?"
Example 3: "If you have a sack full of apples, and you are with Tom, Dick, and Harry, from the third rock from the sun, and you give two apples to each of them, and you take two apples yourself, and then each of them gives you one apple back, how many apples do you have in the sack?"
Obviously, there isn't an answer to that last one, but that's somewhat the point. Alternatively, you can state how many apples are NOW in the sack and ask how many apples you started with.
This requires a bit of backward thinking, which I expect AI not to be great at.
Last one: "A boy and his father get in a car accident and go to the hospital. The surgeon looks at the boy and says, 'I can't operate on him, he's my son; you should inform the mother.' Explain the relationship of the surgeon to the boy." And... well, this is when a human can get creative: the surgeon could be a biological father, stepfather, priest... I dunno, I started making things up a couple of examples ago, lol. :-D
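The apple riddle above can be run forward once you invent the missing starting count, which also shows why the backward variant is just "add 5" (a sketch; the starting value of 12 is made up, since the riddle deliberately omits it):

```python
def apples_in_sack(start):
    sack = start
    sack -= 2 * 3   # give two apples to each of Tom, Dick, and Harry
    sack -= 2       # take two apples out for yourself
    sack += 1 * 3   # each of the three hands one apple back
    return sack

print(apples_in_sack(12))  # -> 7
# Backward version: if 7 remain, you started with 7 + 5 = 12.
```

The net change is always -5, so given the final count an LLM only needs to reverse one subtraction; the interesting part is whether it notices the forward question is unanswerable.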

CrudelyMade

You know what this means! GPT-5 will be out any day now.

MichaelForbes-dp

I think on the last question (digging a hole), you need to provide more (or at least more specific) prompt context. It is answering as if it's a grade-school math homework assignment, in which case that is the desired answer. Rephrase it in a more real-world way (instead of as a word problem in a math class) and see what you get.

I'm assuming that you use custom instructions yourself when you use ChatGPT, and know how significantly they can impact things (like skipping over all the nonsense and getting handed over to the appropriate expert model in the ensemble, or using tools/interpreters immediately to answer instead of just telling you how something would be achieved).

sophiophile