New LLaMA 3 Fine-Tuned - Smaug 70b Dominates Benchmarks

Smaug 70B, a fine-tuned version of LLaMA 3, is out with impressive benchmark scores. How does it hold up against our tests, though?

Referral Code: BERMAN (First month free)

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? 📈

My Links 🔗

Media/Sponsorship Inquiries ✅

Links:

Disclosure:
* I'm an investor in LMStudio
Comments

"its censored, so thats a fail" such music to my ears every time :) I love this test.

rodvik

Always the same. "New model XY outperforms GPT-4 on every benchmark", but then it fails abysmally on simple logic tests. And that's not even taking into account the horrible performance in languages other than English, which GPT-4 handles very well.

Alex-nkbw

I have noticed that in the last several videos I watched, all the versions of the snake game look nearly identical; they have at least the same background color and the same loss message. It feels like those models were specifically trained on that prompt, which makes it less useful for comparing models.

petrkolomytsev

How do these dominate the benchmarks when it seems they have failed most of your tests in this video?

chadwilson

How do we know there isn't test dataset leakage inflating the benchmark results?

shApYT

If you ask for a multiple-choice selection and the model doesn't comprehend that and just provides the answer, that should be considered a fail.

spleck

Something is wrong with that website. I downloaded and ran the Q4_K_M GGUF version of the 70B, and it got the snake and killers problems perfectly. It even counted the number of words in its output.

karsplus
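A minimal sketch of the kind of local run described in the comment above, assuming a downloaded Q4_K_M GGUF of the 70B and the llama-cpp-python bindings; the file name, context size, and sampling settings are placeholders, not the exact setup from the video.

```python
# Sketch: load a local Q4_K_M GGUF quant and prompt it with one of the tests.
# The model path below is hypothetical; point it at whatever GGUF you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./smaug-llama-3-70b.Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write the game snake in Python."}],
    max_tokens=1024,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```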

Again, I think the cup test should specify that the cup has no lid, just to give the AI that little push.
It does not make the answer trivial for LLMs, but it definitely helps with consistency.
The answer does make sense if you imagine the cup had a lid placed on it after the marble was put in.

Yipper

I strongly suggest you replace the "build a snake game" test with a new test, because for the past year or so everyone has been testing models on snake, which means it is now very probably a much larger part of training datasets than it was a year or two ago.

Try testing them with building Pong, Sokoban, Invaders, Arkanoid, or similar simple logic games. I would advise against Tetris, though, as there is nothing really simple about Tetris.

Hazarth

Long story short: retire the snake game question and use something else, like a calculator, etc.

TheSolsboer

I would love a video describing all the different LLM tools and API servers people can run, like Tune Studio, LM Studio, Ollama, oobabooga, etc. Like a big overview. Sometimes it's kind of confusing which tools to use for which models, especially if you want to run locally vs. make API calls to a cloud provider. What is your favorite? I know the answer always is: it depends which model you want to run... You do great work keeping people interested and working hands-on with AI.

mikenorfleet
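On the local-vs-cloud question above: most of these tools (LM Studio, Ollama, oobabooga's text-generation-webui) expose an OpenAI-compatible server, so the same client code can target a local model or a cloud provider just by swapping the base URL. A minimal sketch, assuming Ollama's default port and a hypothetical model tag:

```python
# Sketch: one client, two backends. Only base_url, api_key, and the model name
# change between a local server and a hosted provider.
from openai import OpenAI

# Local: Ollama's OpenAI-compatible endpoint (LM Studio uses port 1234 by default).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama3:70b",  # whatever model tag the local server has pulled/loaded
    messages=[{"role": "user", "content": "How many words are in your reply to this prompt?"}],
)
print(resp.choices[0].message.content)

# Cloud: point the same code at a hosted provider instead.
# client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")
```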

I really enjoy your benchmark videos; please add 30 more questions!

jeremyp

Up next, the newest model smashing all the benchmarks: "Totally Not Contaminated 70B".

generichuman_

I was surprised you didn't indicate which quantized model you used. The model page you linked only has the full version.

vheypreexa

They trained on datasets that specifically contained benchmark questions, btw (such as MT-Bench). I compared it to the default Llama-3 model, and while it does perform slightly better on benchmark questions, it loses a lot of charisma and writing style overall; the model is overfitted.

dubesor
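For anyone wondering how this kind of contamination gets spotted: a crude check is to look for long verbatim n-gram overlaps between benchmark questions and training documents. A minimal, illustrative sketch; real audits are more involved, and the variable names in the usage comment are hypothetical.

```python
# Sketch: flag training documents that share a long n-gram with a benchmark question.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlaps(question: str, document: str, n: int = 8) -> bool:
    # An 8-token verbatim overlap is a strong hint the question leaked into training data.
    return bool(ngrams(question, n) & ngrams(document, n))

# Hypothetical usage:
# flagged = [q for q in benchmark_questions
#            if any(overlaps(q, doc) for doc in training_docs)]
```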

Probably the larger online model and the quantized local model are using different temperatures.

PS: even Llama-3-8B passes the "apple" test.

HanzDavid

The shirt-drying answer that you said was a fail is actually a great answer. In a real, non-idealized situation, the shirts will take longer to dry when there are more of them, for example if they overlap, or if the local humidity near the other wet shirts is higher than ambient.

richchase

I just tried the updated marble question on the "beaten" GPT-4o, and this was its response: "The marble is on the table. When the glass was turned upside down and placed on the table, the marble would have fallen out and stayed on the table. Therefore, when the glass is picked up and put in the microwave, the marble remains on the table."

Alex-nkbw

It's almost like these are trained more to get high benchmark scores and less to be a useful AI.

OctoCultist

This model has been confirmed to have had benchmark data in its training datasets; the creator acknowledged this on Reddit. It wasn't intentional, it came from the datasets they used.

okj