How good is llama 3.2 REALLY? Ollama SLM & LLM Prompt Ranking (Qwen, Phi, Gemini Flash)

🚨 Llama 3.2 Is Here... but how good is it REALLY? How good is any small language model? 🚨

🔗 Resources:

🔥 Small Language Models (SLMs) are heating up
In this video, we dive deep into Meta's Llama 3.2 3B and 1B parameter models and evaluate whether small language models are ready to rival the big players in the LLM arena. Using Ollama and Marimo, we compare the performance of Llama 3.2 against models like GPT-4o-mini, Sonnet, Qwen, Phi, and Gemini Flash. Are SLMs like Llama 3.2 finally good enough for your projects? Let's find out!

🔍 Hands-On Comparisons Beat Benchmarks Any Day!
We run multiple prompts across multiple models, showcasing real-world tests that go beyond synthetic benchmarks. From code generation to natural language processing, see how Llama 3.2 stacks up. Discover the surprising capabilities of small language models and how they might just be the game-changer you've been waiting for.
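
For a sense of what that fan-out looks like in practice, here is a minimal sketch using the ollama Python client (pip install ollama). The model tags, the prompt, and the dict-style response access are assumptions, not the exact harness from the video:

# Minimal sketch: run one prompt across several locally pulled Ollama models
# and collect the raw outputs for side-by-side comparison.
# Assumes the listed tags exist locally, e.g. via `ollama pull llama3.2:3b`.
import ollama

MODELS = ["llama3.2:1b", "llama3.2:3b", "qwen2.5:7b", "phi3.5:latest"]  # assumed tags
PROMPT = "Write a SQL query that returns the top 5 customers by total order value."

results = {}
for model in MODELS:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    results[model] = response["message"]["content"]

for model, output in results.items():
    print(f"\n=== {model} ===\n{output}")

The same loop generalizes to a list of prompts: score or rank each model's output per prompt and tally the results, which is the kind of workflow the notebook in the video walks through.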

🛠 Tools to Empower Your AI Journey
We'll also explore how tools like Ollama and Marimo make it easier than ever to experiment with small language models on your local device. Whether you're into prompt testing, benchmarks, or prompt ranking, these tools are essential for maximizing your AI projects and understanding what small language models can do for you.
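
As a rough illustration of the notebook side (not the notebook used in the video), a marimo app can render per-model outputs as an interactive table. The placeholder strings below stand in for real Ollama responses, and mo.ui.table is assumed for display:

# Rough illustration: a marimo notebook that shows per-model outputs
# in a sortable table. Run with: marimo edit compare.py (filename is arbitrary).
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo

    # Placeholder outputs standing in for real model responses
    # (in practice these would come from the Ollama loop sketched above).
    results = {
        "llama3.2:1b": "-- output from the 1B model --",
        "llama3.2:3b": "-- output from the 3B model --",
        "qwen2.5:7b": "-- output from Qwen --",
    }
    rows = [{"model": m, "output": out} for m, out in results.items()]
    mo.ui.table(rows)  # the last expression in a cell is what marimo displays
    return


if __name__ == "__main__":
    app.run()

Swapping the placeholders for live Ollama calls turns this into a simple local prompt-comparison grid.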

Join us as we uncover whether SLMs like Llama 3.2 are truly ready to take on the giants of the LLM world. If you've been curious about the latest in prompt testing, benchmarks, and prompt ranking, this is the video for you!

📖 Chapters
00:00 Small Language Models are getting better
00:40 How good is llama 3.2 REALLY?
01:17 Multiple Prompts on Multiple Models
08:32 Phi, Llama, Qwen, Sonnet, Gemini Flash model voting
13:53 Hands-on comparisons beat benchmarks any day
18:38 SLMs are good, not great, but they're getting there

#promptengineering #softwareengineer #aiengineering
Comments

Thanks for including generation of SQL queries among the tested tasks. The ability of models to interface with databases is crucial.

johnkintree

Thank you for continuing to post great content

techfren

Thanks for continuing this series - it's been super helpful

shockwavemasta

@IndyDevDan - you da man, dan. experienced engineers can appreciate your methodology and the value of your content and the tools you create. inexperienced engineers can learn the value of a methodical, structured approach to software development, which includes analyzing, comparing, and building tools to maximize your productivity. great videos. keep 'em coming.

billydoughty

THANK YOU! I really appreciate your honest testing and taking us along with you on this journey!

zkiyyeller

Would be cool to test image understanding. Basic OCR to start with, then counting objects and doing reasoning over the images. LLM providers often tell us what their models can't do, or can't do well; using that info as a signal of improvement would be very useful IMHO. Better still, you can use code to check exactly how correct each model is, which is harder when dealing with text, where you need a human judge or an LLM as a judge (which then needs to be aligned with a human anyway). Also thanks for the video, I check in every Monday. Keep on keeping on. 👍

ariramkilowan

Wow, nice. What I'm missing are technical metrics for comparison, like response time and the memory used to run the model...

peciHilux

I wish that you put the model parameter sizes in the video description. Makes it easier to really give weight to your comparisons when you're comparing a 1B model to a 7B model

Jason-judf

A 4-way gold medal among 7 contestants means you need harder questions at the top end to separate them out.

pubfixture

Great video, thank you! Creative, using a custom notebook for benchmarking/comparisons. 💯✨️

enthusiast

Interesting project. Since I am a lazy person, I will use another LLM to score the output each time rather than doing it manually.

zakkyang

Lots of subs to be had in the SLM area, so many edge cases. Try 70b_q4 compared to 8b models.

aerotheory

Great comparison, thanks for making this! I'm off to compare qwen2.5:latest with qwen2.5-coder:latest.

amitkot

What quantization sizes were you using for the models?
Love your channel! Keep it coming!!!

billybob

Hands down the best local model I have seen for function/tool calling.

DanielBowne

I found it hard to understand how you benched the models. Was this mostly down to personal opinion? Maybe you could explain your tests before discussing the results.

Your test tooling looks really nice!

davidpower

I know you don't do much model training on this channel, but have you considered training some of the local models on your good test results and then seeing how the refined models perform?

CheekoVids

Thanks for the video! Could you make a tutorial in which a local installation of Llama can learn from the chats you have with the AI? I mean, you just talk and somehow it stores this information internally and doesn't lose it when you close the computer.

samsaraAI

I'm curious, you're using a 5k context with the default Ollama model, right?

NLPprompter

What about testing with legal context? I found different models give different responses and sometimes they absolutely hallucinate.

ibrahims